Data Science is a Team Sport



In 2013, Ron Howard directed and released the movie Rush, a film that captured the rivalry between James Hunt and Niki Lauda during the 1976 Formula One racing season. It’s a vivid portrait of the drivers and their personalities: a fairly typical, if captivating, focus on the drivers as heroes of the race. But it does something deeper and more interesting as well. The film looks into the essence of Formula One as a true team sport.

“Formula” in Formula One refers to the set of rules to which all participants' cars must conform. Formula One rules were agreed upon in 1946, on the heels of World War II. Modern Formula One cars are open-cockpit, single-seat vehicles. The cornering speed of a car comes largely from the “wings” mounted at the front and rear of the vehicle, and the tires also play a major role. Carbon disc brakes are used to increase performance. Engines have evolved to turbocharged V6s. All these components are integrated to provide precision and performance, and to win the race. However, the precision and design of the vehicle are useless without the right team.

In Formula One, an “entrant” is the person who registers a car and driver for the race, and maintains the vehicle. The “constructor” is the person who builds the engine or chassis and owns the intellectual rights to the design. The “pit crew” is the team that prepares and maintains the vehicle before, during, and after the race. The cameras focus on the driver, with a couple of obligatory shots of the pit crew scrambling to change tires. But the real story is the collaboration of the complete team: experts working together to make the difference between success and failure.

***

Since the turn of the century, enterprises around the world have been on a journey to master data science and analytics. We have fewer camera crews, and no cool uniforms, but the goal is no less difficult to achieve. Put simply, we want the right information, at the right moment, to make better decisions. Despite years of effort, organizations have achieved inconsistent results. Some are building competitive moats with machine learning on a large corpus of data, while others are only reducing their costs by 3% with some new tools. This is best viewed on an enterprise maturity curve:


Why are some organizations able to achieve differentiated results, while others struggle to set up a Hadoop cluster?

***

Spark is the Analytics Operating System for the modern enterprise. Anyone using data, starting right now, will be leveraging Spark. Spark enables universal access to data in an organization.

Today, we are announcing the Data Science Experience, the first enterprise app available for the Analytics Operating System. This is the first integrated development environment for real-time, high-performance analytics, designed to blend emerging data technologies and machine learning into existing architectures.

An IDE for data science is a collaborative environment; it brings data scientists together to make data science and machine learning available to everyone. Today, data science is an individual sport. If you are a data scientist at a retailer, for example, you have to choose your own tool or flavor, work on your own, and, with any luck, produce a meaningful insight. Anything you learn stays with you; it’s self-contained, because it is built in your own lingua franca.

Now, with the Data Science Experience, you can use any language you want—R, Python, Scala, etc.—and share your models with other data scientists in your organization.

We have made data science a team sport.

In Formula One parlance, Spark is the chassis, holding everything together. The Data Science Experience (the IDE) is the integrated components, acting as one, to drive precision and performance. And the data science discipline now has a driver, a pit crew, a constructor, and a coach: a team, that incredible vehicle whose whole is greater than the sum of its parts.

The Data Science Experience is born on the cloud. It adapts to open source innovation. And the Data Science Experience grows stronger as more and more data scientists around the globe create solutions based on Spark. Further, the ecosystem for the Data Science Experience is open and available. We are proud to have partners like H2O, RStudio, Lightbend, and Galvanize, to name a few.

With the Data Science Experience, the discipline of data science can now accomplish exponentially greater outcomes. It’s the difference between a shiny car sitting in a garage and one crossing the finish line at 230 miles per hour.

***

IBM is building the next generation analytics platform in the cloud.

1. It started last year, with our investment in Apache Spark as the Analytics O/S.
2. It continues today, as we launch the first IDE for this new way of thinking about data & analytics.
3. Over time, this will evolve as the platform for an enterprise in the data era.

All of this is enabled by Spark.

***

In June 2015, we announced IBM’s commitment to Apache Spark. In closing, I want to provide some context on our progress in the last year. If you missed it last year, here is why I believe Spark will be a critical force in technology, on the same scale as Linux.

So, what have we accomplished and where are we going?

1) We continue to expand the Spark Technology Center (STC). We opened an STC in India. We continue to hire aggressively. And, later this year, we will move into our new home on Howard St. in San Francisco.

2) Client traction and response have been phenomenal. We have 40+ client references already and more on the way.

3) We have open sourced SystemML, as promised, and we are working on it with the community, in the open. This contribution is over 100,000 lines of code. SystemML was accepted into Apache as an official Incubator project in November 2015. Since it was open sourced, 859 contributions have been made to the project (e.g., a build-out of the Spark backend, API improvements, usability with Scala Spark and PySpark notebooks for data science, and experimental work on deep learning).

4) For Spark 1.6.x, a total of 29 team members contributed to the release (26 of them from the STC), and each contributing engineer is credited in the release notes. For Spark 2.0, 31 STC developers have contributed thus far, and that work is still in progress.

5) Our Spark-specific JIRAs account for almost 25,000 lines of code. You can watch them in action here. Much of our focus has been on SQL, MLlib, and PySpark.

6) We launched the Open Source Analytics Ecosystem and are working closely with partners like Databricks, Lightbend, RStudio, H2O, and many others. We welcome all.

7) We have trained ~400,000 data scientists through a number of forums, including BigDataUniversity.com.

8) Adoption of the Spark Service on IBM Cloud continues to grow exponentially, as users seek access to the Analytics Operating System.

9) We have over 30 IBM products that are leveraging Spark and many more in the pipeline.

10) We launched a Spark newsletter. Anyone can subscribe here.

11) Lastly, we have launched a Spark Advisory Council. Over 25 leading enterprises and partners — Spark experts building new companies and established industry leaders building new platforms — participate in this regular dialogue about their experiences with Spark and the direction of the Spark project. We use this thinking to focus our efforts in the Spark Technology Center. All are welcome. Contact us here if you are interested.

***

Data Science is a team sport. Spark is the enabler. This is why I stated last year that anyone using data will be leveraging Spark in the future. That future is quickly arriving.

Winning in Formula One is about speed, performance, precision, and collaboration. Those who find the winner’s circle have found a way to integrate the components (human and material) to act as ONE. The same opportunity exists in analytics and data science. Let’s make data science a team sport. Welcome to the first enterprise app for the Analytics Operating System: the Data Science Experience.