Posted by Rob Knies

Organizers of the Microsoft Research Data Science Summer School

Today’s burgeoning interest in big data offers tremendous potential for driving services that promise to transform our future. That promise, though, doesn’t come without significant effort.

Harnessing the power of big data requires an unprecedented understanding of complex systems. Scalable computational tools are a necessity, as is the ability to comprehend and devise the sorts of scientific questions that extract meaning from masses of data.

But that’s not all, says Sharad Goel, senior researcher at Microsoft Research New York City.

“Computer scientists are a relatively homogenous group,” Goel observes, “and a longstanding goal of the computer-science community is to increase diversity, broadly defined, within our ranks.”

Hence, the Microsoft Research Data Science Summer School, an eight-week effort at the New York City lab to provide an introduction to large-scale data analysis for undergraduate students in the New York City area who are interested in attending graduate school in computer science and related fields.

In particular, the organizers of the summer school—Goel, Fernando Diaz, Jake Hofman, Justin Rao, and Hanna Wallach, all of Microsoft Research New York City—are committed to pursuing a more diversified computer-science ecosystem.

The summer school, therefore, is encouraging applications from women, minorities, individuals with disabilities, and students from smaller colleges. The school will run from June 16 through Aug. 8, and the application deadline is April 18.

“The five of us felt that we could concretely contribute to this goal,” Goel explains, “so we decided to just go for it. I don’t think that there is anything unique about this particular moment, other than we felt ‘now’ is better than ‘later.’”

The school will choose eight upper-level undergraduate students from under-represented race, gender, and socioeconomic groups for whom program participation would significantly bolster their professional trajectories. Included in that target group are undergrads at smaller institutions, which can’t always offer access to the resources such students need to reach their full potential.

Applicants need to have:

  • Taken core undergraduate computer-science classes.
  • Some experience with programming.
  • A desire to attend graduate school.

Selected applicants will receive a laptop and a $5,000 stipend.

The summer school’s courses will introduce key tools and techniques for working with large data sets. The instruction will focus on how tools can help solve real-world problems, unlike traditional coursework, for which students often receive prepackaged data sets obtained by a third party and prepared for a specific exercise.

Real-world research often requires tasks such as data cleaning, which prepares messy data for use, and data acquisition, perhaps through an application-programming interface.
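
To give a flavor of what data cleaning can involve, here is a minimal Python sketch. It is an illustration only, not material from the summer school: the sample data, the column names, and the `clean` helper are all hypothetical. It trims stray whitespace, normalizes capitalization, skips rows with a missing count, and drops duplicates. Data acquisition through an API is left out so the example runs without a network connection.

```python
import csv
import io

# Hypothetical messy export (not from the article): stray whitespace,
# inconsistent capitalization, a duplicate row, and a missing value.
RAW = """user, city ,visits
 alice,New York,3
BOB,new york,5
alice ,New York,3
carol,Boston,
"""

def clean(raw_text):
    """Return deduplicated rows with normalized text and integer visit counts."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen = set()
    rows = []
    for record in reader:
        # Strip whitespace from both the column names and the values.
        record = {key.strip(): (value or "").strip() for key, value in record.items()}
        user = record["user"].lower()
        city = record["city"].title()
        try:
            visits = int(record["visits"])
        except ValueError:
            # Skip rows whose count is missing or not a number.
            continue
        row = (user, city, visits)
        if row in seen:
            # Drop exact duplicates after normalization.
            continue
        seen.add(row)
        rows.append(row)
    return rows

CLEANED = clean(RAW)
print(CLEANED)
```

Only the rows for alice and bob survive: the second alice row is a duplicate once whitespace is trimmed, and the carol row has no visit count.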

In addition, the school will offer an introduction to problems in the areas of applied statistics and machine learning. Students will learn the theory underpinning simple, effective methods of supervised and unsupervised learning, with an emphasis on formulating real-world modeling and prediction tasks as optimization problems. Those chosen for the program will be taught how to compare methods for practical efficacy and scalability and how to evaluate models for applications including spam filtering and recommendation systems.
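
As one illustration of what formulating a prediction task as an optimization problem can look like, the Python sketch below treats spam filtering as the minimization of a loss function. It is an assumption-laden example rather than course material: the toy word counts, the logistic model, and the function names `predict`, `log_loss`, and `fit` are all hypothetical choices made for this sketch. The model's parameters are found by plain gradient descent on the average log loss.

```python
import math

# Hypothetical toy data: each message is (count of "free", count of "meeting"),
# labeled 1 for spam and 0 for not spam.
MESSAGES = [((3, 0), 1), ((2, 0), 1), ((4, 1), 1),
            ((0, 2), 0), ((0, 3), 0), ((1, 4), 0)]

def predict(weights, bias, features):
    """Probability that a message is spam under a logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def log_loss(weights, bias, data):
    """Average negative log-likelihood: the objective being minimized."""
    total = 0.0
    for features, label in data:
        p = predict(weights, bias, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(data)

def fit(data, steps=500, rate=0.1):
    """Minimize the log loss with batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in data:
            # Gradient of the log loss for one example is (p - label) * x.
            error = predict(weights, bias, features) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - rate * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(data)
    return weights, bias

WEIGHTS, BIAS = fit(MESSAGES)
print(predict(WEIGHTS, BIAS, (3, 0)))  # a message that repeats "free"
print(predict(WEIGHTS, BIAS, (0, 3)))  # a message that repeats "meeting"
```

Comparing `log_loss` before and after fitting is one simple way to check that the optimization made progress; evaluating on messages held out from training would be the next step in judging practical efficacy.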

The structure for the school:

  • Students will self-select into one of two topic-oriented groups, each working on a collaborative research project and each directed by two Microsoft Research domain experts. Goel and Wallach will lead the track on computational social science, while Diaz and Rao will provide guidance in information retrieval and systems. Both groups will learn to apply tools to answer substantive scientific questions.
  • Each group will be required to produce a technical report and/or a demonstration to share its findings. These projects should serve as a key differentiator for grad-school applications and for those seeking research jobs, and a particularly successful project could lead to a scientific publication.
  • The first four weeks of the school will include an exploratory analysis to gain a preliminary understanding of the data set. Included will be an introduction to scripting, both on the command line and with Python and R, as well as direct experience in acquiring and modeling data from online sources. The course structure will include a morning lecture, discussion, or lab, with breakout meetings in the afternoon, followed by work as a group or independently.
  • The final four weeks will consist of work on the group research projects, with ad hoc mentor check-ins.

Goel and Rao will provide project coordination, and Wallach, Rao, and Hofman will serve as instructors.

In addition to increasing diversity in computer science, the Microsoft Research Data Science Summer School will provide the ancillary benefit of building long-term interactions between Microsoft Research and the most talented young students from diverse backgrounds that the New York City area has to offer.

“Our primary goal for the summer,” Goel concludes, “is to get the students excited about computer science in general and to show them the creative, research side of the discipline, which they may not have encountered in their classes.

“In the process, we hope also to help prepare them for their future careers in computer science.”