IBM | Spark - An Update

IBM made some significant announcements around our investment in Spark back in June (see here and here). 90 days later, I saw fit to provide an update on where we stand in our community efforts.


Before I get to the details, I want to first restate why I believe Spark will be a critical force in technology, on the same scale as Linux.

1) Spark is the Analytics operating system: Any company interested in analytics will be using Spark.
2) Spark unifies data sources. Hadoop is one of many repositories that Spark may tap into.
3) The unified programming model of Spark makes it the best choice for developers building data-rich analytic applications.
4) The real value of Spark is realized through Machine Learning. Machine Learning automates analytics and is the next great scale effect for organizations.

I was recently interviewed by Datanami on the topic of Spark. You can read it here. I like this article because it presents an industry perspective on Hadoop and Spark, but it's also clear on our unique point of view.

Also, this slideshare illustrates the point of view:


So, what have we accomplished and where are we going?

1) We have been hiring like crazy, as fast as we can. The STC in San Francisco is filling up and we have started to bring on community leaders like @cfregly.

2) Client traction and response has been phenomenal. We have references already and more on the way.

3) We have open sourced SystemML as promised (see on GitHub) and we are working on it in the open. This contribution is over 100,000 lines of code.

4) Spark 1.5 was just released. We went from 3 contributors to 11, in one release cycle. Read more here.

5) Our Spark-specific JIRAs have amounted to ~5,000 lines of code. You can watch them in action here.

6) We are working closely with partners like Databricks and Typesafe.

7) We have trained ~300,000 data scientists through a number of forums. You can also find some deep technical content here.

8) We have seen huge adoption of the Spark Service on Bluemix.

9) We have ~10 IBM products that are leveraging Spark and many more in the pipeline.

10) We launched a Spark newsletter. Anyone can subscribe here.


Between now and the end of the year, we have 5 significant events, where we will have much more to share regarding Spark:

a) Strata NY- Sept 29-Oct 1 in New York, NY.
b) Apache Big Data Conference- Sept 28-30 in Budapest, Hungary.
c) Spark Summit Europe- Oct 27-29 in Amsterdam.
d) Insight- Oct 26-29 in Las Vegas.
e) Datapalooza- November 10-12 in San Francisco.

In closing, here is a peek inside:

Even Doctors Will Be Data Scientists

We all know how it works. You walk into a doctor’s office complaining about some pain in your leg or otherwise. They take your temperature, get you on the scale, check your blood pressure, and perhaps even get out the rubber hammer. These measurements are simply snapshots at one particular instant in time and may be subject to error. This limited dataset fails to capture temporal variations or the many other important factors that are required to assess the patient’s health status. After reviewing the few measurements collected, the consultation between the patient and doctor begins. Based on the rudimentary physical analysis, along with the discussion with the patient, the physician will assert the condition that they believe is present, followed by a recommended treatment.

This approach, which is common throughout the world, is much more based on instinct and gut feeling than a scientific approach to analyzing data. Accordingly, it seems that most decisions are made based on the opinion of the physician instead of a data-proven truth. This type of opinion-based medicine is a problem in both doctor-patient care and in medical research. This is a symptom of a lack of data, as well as years of training physicians to perform without complete data.

The data collected in a typical office visit is only a fraction of the data that could be collected if health were viewed as a data problem. And, if health were redefined as a data problem, physicians would likely need different skills to process and analyze the data.


Vinod Khosla is one of the most successful venture capitalists in the history of Silicon Valley. He was an original founder of Sun Microsystems, and has since gone on to finance a variety of start-up companies as a venture capitalist. While he is not a medical expert, he is a data expert. In his speech at Stanford Medicine X, Khosla highlights three major issues in medicine today:

1) Doctors are human: Doctors, like everyone else, have cognitive limitations. Some are naturally smarter than others or have deeper knowledge about a particular topic. The latter leads to biases in how they think, act, and prescribe. Most shockingly, Khosla cites that doctors often decide on a patient diagnosis in the first 30 seconds of the observation. Said another way, they base their diagnosis on a gut reaction to the symptoms that they can see or are described to them.

2) Opinions dominate medicine: Khosla asserts that medicine is much more based on opinion than data. He cites the Cleveland Clinic Doctors’ Review of Initial Diagnosis study, asserting that Cleveland Clinic doctors disagree with initial diagnoses 11 percent of the time. In 22 percent of cases, minor changes to treatment are recommended. And in a startling 18 percent of cases, major changes to treatment are recommended. As Khosla states, “This means it’s not medical science.”

3) Disagreement is common among physicians: Doctors disagree a lot. It’s so dramatic that, as Khosla states, “whether or not you have surgery is a function of whom you ask.”

Medicine is currently a process of trial and error, coupled with professional opinion.


The data era in medicine will be defined by a shift from intuition and opinion to data. We can collect more data in a day now than we could in a year not too long ago. Collecting data and applying it to solve healthcare problems will transform the cost and effectiveness of medicine. The question is how quickly we can get there.

Medical schools must evolve as technology advances. Most technology-driven advancements in medical schools have focused on utilizing advanced tools and equipment, as opposed to addressing the core knowledge needed by a physician in the data era.

The curriculum for the first two years of medical school varies by school, but it is heavy on the sciences, the human body, and the human condition. This has been typical since the first medical schools in the 1200s. All this time, investment and history, yet the newly minted physician is unprepared for practicing in the data era.

The data era requires an augmentation in curriculum to include key skills required for data-based analysis:


*Data Analysis and Tools

The skills of physicians will necessarily evolve in the data era, and that has to begin in medical schools. This focus will expedite the move away from opinion-based medicine to a future that the ill prefer: prescriptions based on hardened data analysis.


This week, IBM is announcing a set of tools, technology, and processes to bring data science to the masses. Said another way, armed with IBM technology, everyone is a data scientist. We are democratizing the access to data in your organization.

Every organization sees Hadoop as providing an open-source, rapidly evolving platform that is capable of collecting and economically storing a large corpus of data, waiting to be tapped. Yet, most organizations are not yet fully realizing the value of Hadoop due to the lack of skilled data scientists and developers to extract valuable insight. IBM will make everyone a data scientist. We take the first steps this week by:

1) Introducing new modules for In-Hadoop analytics including SQL, Machine Learning, and R.

2) Confirming our commitment to open source with IBM BigInsights Open Platform with Apache Hadoop, to include new innovations like Apache Spark. We are excited to be a founding member of the Open Data Platform.

3) Rolling out expanded data science training for Machine Learning and Apache Spark via BigDataUniversity. Today, over 230,000 professionals and students are being trained at BigDataUniversity and we are on our way to 1 million trained.


We all look forward to how things will be in 15 years. You walk into a doctor’s office, and the physician immediately knows why you are there. In fact, she had discussed some data irregularities that she had spotted at your annual physical exam, six months prior. She doesn’t need to take your temperature, as she receives that data direct from your home every day. You also take your own blood pressure monthly and that is transmitted directly to your physician. Instead, the discussion immediately turns to the possible treatments, along with the probability of success with each one. Recent data from other patients with a similar history and physiology indicate that regular medication will solve the issue 95 percent of the time. With this quick diagnosis, involving no opinions, you are on your way after ten minutes, confident that the problem has been solved. This is medicine in the data era, administered by a physician steeped in mathematics and statistics. In the data era, even doctors become data scientists.

This post is adapted from my book, Big Data Revolution: What farmers, doctors, and insurance agents teach us about discovering big data patterns, Wiley, 2015. Find more on the web at


I joined the hack/reduce launch last night in Boston. The venue itself, Kendall Boiler and Tank building, was worth the trip alone. It is a tremendous atmosphere, in a great location, but that's just a small part of the story.

I met Chris Lynch a few years back. He personifies the old adage, "Often wrong, never in doubt," and I mean that as a huge compliment. He is the type of person who believes obstacles exist for a reason: to prove how much you want something. Like everything else he has done, hack/reduce seems to be a product of his will.

When the concept was introduced to me, the mission was simple: Ensure that Massachusetts is at the heart of the next wave of IT innovation, namely Big Data. Given IBM's extensive lab presence in Massachusetts, that alone was enough to make sponsoring an easy decision. That being said, I think there are even more important reasons for our participation.

1) I believe this is real and substantial. The leadership ensures that.

2) Big Data is not a buzzword, nor an idea, technology, or product. It is the next generation of IT. See here, if you want to know why Warren Buffett agrees with me.

3) Big Data is about skills and talent for the next 5 years. Those with access to talent and the ability to cultivate talent will win. hack/reduce is ultimately about bringing the data and resources to the talent. From there, the talent can grow.

4) Massachusetts is critical to IBM. We have acquired many companies in Massachusetts and it has been a wonderful experience. All of this activity culminated in our opening a significant development lab in Westford. For this reason, it is essential to us that the Mass tech scene remain vibrant. The tech scene must include an entire ecosystem: talent, companies, venture firms, business partners, universities, etc. hack/reduce has the potential to be a key driver of the tech scene. It's great for the folks involved, great for the community, and ultimately great for our business.

We will be very active at hack/reduce: training, learning, providing technology and tools, etc. It's great to be a part of this and we can't wait to get started.