The Calm Before the Storm

Have you ever spent an afternoon in the backyard, maybe grilling or enjoying a game of croquet, when suddenly you notice that everything goes quiet? The air seems still and calm -- even the birds stop singing and quickly return to their nests.

After a few minutes, you feel a change in the air, and suddenly a line of clouds ominously appears on the horizon -- clouds with a look that tells you they aren't fooling around. You quickly dash in the house and narrowly miss the first fat raindrops that fall right before the downpour. At this moment, you might stop and ask yourself, "Why was it so calm and peaceful right before the storm hit?"
-- How Stuff Works

The last five years, with the onset of Hadoop, cloud, and mobile, were merely the calm before the storm. A new modern technology stack, a data revolution, and the onslaught of machine learning will shape the storm to come over the next decade.


The mobile supply chain has wreaked havoc on the traditional technology stack. With the advent of high-volume chips, screens, storage, and the like, it has become cost-effective to move away from a vertically integrated architecture to one that is much more flexible and dynamic. We have evolved to a six-layer Next Generation Technology Stack:

Layer 1: There are two aspects to Layer 1: a) the repositories and b) the data itself. The repositories include the new breed of flexible and fluid data layers, ranging from Hadoop to Cassandra to other NoSQL data stores: flexible, adaptable, and tuned to modern internet and mobile applications. This layer also includes databases, data warehouses, and mainframes; said another way, anything that stores data of strategic and operational relevance. Within the repositories, the data itself creates a competitive moat and offers strategic advantage when used appropriately.

Layer 2: A highly performant processing layer, which provides unified access to all data, easily incorporates machine learning, and produces real-time insights. This is why I have called Spark the Analytics Operating System.

Layer 3: Machine learning, on a corpus of strategically relevant data, is the new competitive moat for an enterprise. This layer automates the application of analytics and delivers real-time insights for business impact. It's the holy grail that has never quite been found in most organizations.

Layer 4: A unified application layer, which provides seamless access to analytical models, data, and insights. This is the glue that enables most business users to leverage and understand data-rich applications.

Layer 5: The easiest way to democratize access to data in an organization is to give users something elegant and insightful. Vertical and horizontal applications, built for a specific purpose, serve this role in an organization.

Layer 6: The number of people connected to the Internet has surged from 400 million in 1999 to 3 billion last year. The number of connected devices is estimated at 50 billion by 2020. These are all access points for the Next Generation Technology Stack.


In Big Data Revolution, I dissected three business models for the Data Era. In summary, these are the three dominant business models that I see emerging:

Data as a competitive advantage: While this is somewhat incremental in its approach, it is evident that data can be utilized and applied to create a competitive advantage in a business. For example, an investment bank, with all the traditional processes of a bank, can gain significant advantage by applying advanced analytics and data science to problems like risk management. While it may not change the core functions or processes of the bank, it enables the bank to perform them better, thereby creating a market advantage.

Data as improvement to existing products or services: This class of business model plugs data into existing offerings, effectively differentiating them in the market. It may not provide competitive advantage (but it could), although it certainly differentiates the capabilities of the company. A simple example could be a real estate firm that utilizes local data to better target potential customers and match them to vacancies. This is a step beyond the data that would come from the Multiple Listing Service (MLS). Hence, it improves the services that the real estate firm can provide.

Data as the product: This class of business is a step beyond utilizing data for competitive advantage or plugging data into existing products. In this case, data is the asset or product to be monetized. An example of this would be Dun & Bradstreet, which has been known as the definitive source of business-related data for years.

Since my work on business models was published, my thinking has evolved a bit. While I think each of those business models is still valid, I am less certain that any of them on their own will create a distinctive competitive advantage. Instead, I believe that the value is where the software meets the data, and access is democratized. Said another way, it's hard to create value by only looking at one layer of the Next Generation Technology Stack.

Enter The Weather Company...


Last week, we announced our intention to acquire The Weather Company. The media reaction has ranged from, "IBM is buying a TV station?" (we are not), to "IBM is buying the clouds", to "IBM is entering the data business." Some of the reactions are wrong, others are humorous, and some are overly simplistic. The reality is that IBM has just made a significant move in defining and leading in the Next Generation Technology Stack. This interview captures it well. Let's look at this in terms of each layer:

Layer 1: IBM has long been a leader in Layer 1, across all types of repositories. From Netezza to DB2 to the mainframe to Informix to Cloudant to BigInsights to enterprise content, most of the world's valuable enterprise data is stored in IBM technology. With The Weather Company, we now have a rich set of data assets. The Weather Company can decompose what is happening on earth into over 3 billion elements. And it's not just weather data. In an increasingly mobile world, location matters.

Layer 2: IBM is the enterprise leader in Spark. Through a variety of partnerships like Databricks and Typesafe, we are a key part of this blossoming community.

Layer 3: Through our open source contributions to machine learning, our rich portfolio of analytical models, and the world's greatest cognitive system (Watson), IBM can provide applications (Layer 5) and insights that are unmatched. Just think how powerful Watson becomes when it understands location and environment, as well as everything else it already knows.

Layer 4: The Weather Company has an internet-scale high volume platform for IoT. It can seamlessly be extended for other sources of data and “can ingest data at a very high volume in fractions of a second that will be an engine that feeds Watson”.

Layer 5: IBM has a rich set of industry applications and solutions across Commerce, Watson, and countless other areas. The Weather Company's applications and websites handle 65 billion unique accesses to weather and related data per day. That scale is unmatched.

Layer 6: The Weather Company mobile application has 66 million unique visitors a month and connective tissue to tap into the 50 billion connected devices that are emerging.

In summary, this is much more than weather data. Overnight, IBM has become the leader in the Next Generation Technology Stack. It is the basis for extension into financial services, automotive, telematics, healthcare, and every other industry being transformed by data.


It's always calm before the storm hits. Sometimes, in the moment, you don't even recognize the calm for what it is. My guess is that most people have not considered the last five years the calm before the storm. But it was.

IBM | Spark - An Update

IBM made some significant announcements about our investment in Spark back in June (see here and here). Ninety days later, it seems fitting to provide an update on where we stand in our community efforts.


Before I get to the details, I want to first restate why I believe Spark will be a critical force in technology, on the same scale as Linux.

1) Spark is the Analytics operating system: Any company interested in analytics will be using Spark.
2) Spark unifies data sources. Hadoop is one of many repositories that Spark may tap into.
3) The unified programming model of Spark makes it the best choice for developers building data-rich analytic applications.
4) The real value of Spark is realized through Machine Learning. Machine Learning automates analytics and is the next great scale effect for organizations.
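To make point 3 concrete, here is a minimal sketch of the kind of chained pipeline Spark's programming model expresses. This is plain Python over a toy in-memory list (the event data is invented for illustration); Spark would express the same filter/map/aggregate chain as distributed transformations running across a cluster rather than a single process.

```python
# Toy event log standing in for a distributed dataset. The data and
# field names here are invented purely for illustration.
events = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 0},
    {"user": "a", "clicks": 5},
    {"user": "c", "clicks": 2},
]

# The same declarative chain Spark expresses over RDDs or DataFrames:
# filter -> map to (key, value) pairs -> reduce-by-key aggregation.
active = filter(lambda e: e["clicks"] > 0, events)
pairs = map(lambda e: (e["user"], e["clicks"]), active)

totals = {}
for user, clicks in pairs:
    totals[user] = totals.get(user, 0) + clicks

print(totals)  # {'a': 8, 'c': 2}
```

The point of the unified model is that this one chain of transformations, written once, runs the same way whether the input is a four-element list or billions of records spread over a cluster.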

I was recently interviewed by Datanami on the topic of Spark. You can read it here. I like this article because it presents an industry perspective on Hadoop and Spark, but it's also clear on our unique point of view.

Also, this slideshare illustrates the point of view:


So, what have we accomplished and where are we going?

1) We have been hiring like crazy, as fast as we can. The Spark Technology Center in San Francisco is filling up, and we have started to bring on community leaders like @cfregly.

2) Client traction and response has been phenomenal. We have references already and more on the way.

3) We have open sourced SystemML as promised (see it on GitHub), and we are developing it in the open. This contribution is over 100,000 lines of code.

4) Spark 1.5 was just released. We went from 3 contributors to 11 in one release cycle. Read more here.

5) Our Spark-specific JIRAs have amounted to ~5,000 lines of code. You can watch them in action here.

6) We are working closely with partners like Databricks and Typesafe.

7) We have trained ~300,000 data scientists through a number of forums. You can also find some deep technical content here.

8) We have seen huge adoption of the Spark Service on Bluemix.

9) We have ~10 IBM products that are leveraging Spark and many more in the pipeline.

10) We launched a Spark newsletter. Anyone can subscribe here.


Between now and the end of the year, we have 5 significant events, where we will have much more to share regarding Spark:

a) Strata NY: Sept 29-Oct 1 in New York, NY.
b) Apache Big Data Conference: Sept 28-30 in Budapest, Hungary.
c) Spark Summit Europe: Oct 27-29 in Amsterdam.
d) Insight: Oct 26-29 in Las Vegas.
e) Datapalooza: November 10-12 in San Francisco.

In closing, here is a peek inside:

Scale Effects, Machine Learning, and Spark

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integrated product offering for web services. His idea was to put together a coordinated package, which became WebSphere. The problem was that a key piece of the package, IBM’s web server software, was technically weak. It held less than 1 percent of a market.”

“Barry approached Brian Behlendorf [President of the Apache Software Foundation] and the two quickly discovered common ground on technology issues. Building a practical relationship that worked for both sides was a more complex problem. Behlendorf’s understandable concern was that IBM would somehow dominate Apache. IBM came back with concrete reassurances: It would become a regular player in the Apache process, release its contributions to the Apache code base as open source, and earn a seat on the Apache Committee just the way any programmer would by submitting code and building a reputation on the basis of that code. At the same time, IBM would offer enterprise-level support for Apache and its related WebSphere product line, which would certainly help build the market for Apache.”

-- Reference: The Success of Open Source, Steven Weber, 2004


In the 20th century, scale effects in business were largely driven by breadth and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades.

The Internet changed all of that.

In the modern era, there are three predominant scale effects:

-Network: lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, etc.)
-Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
-Data: superior machine learning and insight, driven from a dynamic corpus of data

I profiled a few of the companies that are exploiting data effects in Big Data Revolution: CoStar, IMS Health, Monsanto, and others. But by and large, big data is an unexploited scale effect in institutions around the world.

Spark will change all of that.


Thirty days ago, we launched Hack Spark in IBM, and we saw a groundswell of innovation. We made Spark available across IBM’s development community. Teams formed based on interest areas, moonshots were imagined, and many became real. We gave the team ‘free time’ to work on Spark, but the interest was so great that it began to monopolize their nights and weekends. After ten days, we had over 100 submissions in our Hack Spark contest.

We saw things accomplished that we had not previously imagined. That is the power of Spark.

To give you a sampling of what we saw:

Genomics: A team built a powerful development environment of SQL/R/Scala for data scientists to analyze genomic data from the web or other sources. They provided a machine learning wizard for scientists to quickly dig into chromosome data (k-means classification of genomes by population). This auto-scalable cloud system increased the speed of processing and analyzing massive genome data and put the power in the hands of the person who knows the data best. Exciting.
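The k-means step at the heart of that genomics hack can be sketched in miniature. The one-dimensional values, the two-cluster setup, and the starting centers below are all invented for illustration; Spark's MLlib runs the same assign-and-update loop, just distributed over massive data.

```python
# Minimal 1-D k-means, illustrating the assign/update loop that a
# library like MLlib runs at scale. All sample values are invented.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (a center with no members stays where it is).
        centers = [
            sum(members) / len(members) if members else c
            for c, members in clusters.items()
        ]
    return sorted(centers)

# Two obvious groups, around 1.0 and around 10.0.
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # [1.0, 10.0]
```

The loop converges quickly here because the groups are well separated; on real chromosome data the same iteration structure applies, only with many dimensions and many more points.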

Traffic Planning: A team built an Internet of Things (IoT) application for urban traffic planning, providing real-time analytics with spatial and cellular data. Messaging queues could not handle the massive and continuous data inputs. Data lakes could not handle the large volume of cellular signaling data in real time. Spark could. The team used Spark as the engine of the computing pool, Oozie to build the controller module, and Kafka as the messaging module. The result is an application that processes massive cellular signal data and visualizes those analytics in real time. Smarter Planet indeed.

Political Analysis: A team built a platform to measure public response to speeches and debates as they happen. The team built a Spark cluster on top of Mesos, used Kafka for data ingestion and Cloudant for data storage, and deployed Spark Streaming for processing. Political strategists, commentators, and advisors can isolate the specific portion of a speech that produces a shift in audience opinion. The voice of the public, in real time.
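All three hacks lean on the same streaming pattern: a continuous stream of events aggregated over a time window. Here is a pure-Python sketch of that windowed aggregation; the event labels, timestamps, and 60-second window are invented for illustration, standing in for what Spark Streaming's window operations over a Kafka source would do continuously and at cluster scale.

```python
from collections import Counter, deque

# Sliding-window aggregation over a stream of (timestamp, label)
# events, mimicking a windowed count in a streaming engine.
# The window size and all events below are invented.
WINDOW_SECONDS = 60

window = deque()  # (timestamp, label) pairs still inside the window

def observe(timestamp, label):
    """Add one event, evict expired ones, return current counts."""
    window.append((timestamp, label))
    # Evict events older than the window.
    while window and window[0][0] < timestamp - WINDOW_SECONDS:
        window.popleft()
    return Counter(label for _, label in window)

observe(0, "positive")
observe(30, "negative")
counts = observe(90, "positive")   # the t=0 event has aged out
print(counts["positive"], counts["negative"])  # 1 1
```

The design choice worth noting is that state lives only in the window: every new event both updates and prunes it, which is what lets a streaming system answer "what is the audience feeling right now" without rescanning history.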

Spark is changing the face of innovation in IBM. We want to bring the rest of the world along with us.


Apache Spark lowers the barrier to entry for building analytics applications by reducing the time and complexity of developing analytic workflows. Simply put, it is an application framework for doing highly iterative analysis that scales to large volumes of data. Spark provides a platform to bring application developers, data scientists, and data engineers together in a unified environment that is not resource-intensive and is easy to use. This is what enterprises have been clamoring for.

An open-source, in-memory compute engine, Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Today, business professionals have analytics in their hands in the form of visual dashboards that inform them what is happening. Think of this as descriptive analytics. Now, with Apache Spark, these can be complemented with analytics smarts built into applications that learn from their surroundings and specify actions in the moment. Think of it as prescriptive analytics. This means that, with Spark, enterprises can deploy insights into applications at the front lines of their business exponentially faster than ever before.
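A toy illustration of that descriptive-to-prescriptive shift, in plain Python: first summarize what happened (the dashboard view), then let a learned threshold drive an action inside the application itself. The scores, the threshold, and the action names are all invented for illustration; in a Spark application, the aggregation could run through Spark SQL and the model through MLlib, combined in one program.

```python
# Invented customer churn-risk scores, standing in for the output of
# an analytical model over operational data.
daily_churn_scores = {"cust_1": 0.12, "cust_2": 0.81, "cust_3": 0.47}

# Descriptive: report what is happening, the way a dashboard would.
average = sum(daily_churn_scores.values()) / len(daily_churn_scores)
print(f"average churn risk: {average:.2f}")  # average churn risk: 0.47

# Prescriptive: a (pretend) learned threshold specifies an action
# in the moment, inside the application itself.
LEARNED_THRESHOLD = 0.75

def recommend(scores, threshold):
    return {cust: "offer retention discount" if s >= threshold else "no action"
            for cust, s in scores.items()}

print(recommend(daily_churn_scores, LEARNED_THRESHOLD))
```

The difference between the two print statements is the whole point: the first tells a person what happened; the second puts a decision directly into the application's hands.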

Spark is highly complementary to Hadoop. Hadoop makes managing large volumes of data possible for many organizations due to its distributed file system. It has grown to a broad ecosystem of capabilities that span data integration and data discovery. It changed the speed at which data could be collected, and fundamentally changed how we make data available to people. Spark complements Hadoop by providing an in-memory compute engine to perform non-linear analysis. Hadoop delivered mass quantities of data, fast. But the real value of data cannot always be exposed because there isn’t an engine to push it through. With Spark, there’s a way to understand which data is valuable and which is not. A client can leverage Spark to augment what they are doing with Hadoop or use Spark on a stand-alone basis. The approach is in the eye of the beholder.


While there are many dimensions to the Spark ecosystem, I am most excited by machine learning. Machine learning is better equipped to deal with the modern business environment than traditional statistical approaches, because it can adapt. IBM’s machine learning technology makes expressing algorithms at scale much faster and easier. Our data scientists, mathematicians, and engineers will work with the open source community to help push the boundaries of Spark technology with the goal of creating a new era of smart applications to fuel modern and evolving enterprises.

With machine learning at the core of applications, they can drive insight in the moment. Applications with machine learning at their core get smarter and more customized through interactions with data, devices and people—and as they learn, they provide previously untapped opportunity. We can take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed.

It is my view that over the next five years, machine learning applications will lead to new breakthroughs that will assist us in making good choices, look out for us, and help us navigate our world in ways never before dreamed possible.


I see Apache Spark as the analytics operating system of the future, and we are investing to grow Spark into a mature platform. We believe it is the best technology today for attacking the toughest problems of organizations of all sizes and delivering the benefits of intelligence-based, in-time action. Our goal is to be a leading committer and technology contributor in the community. But actions speak louder than words, which brings us to today’s announcements:

1) IBM is opening a Spark Technology Center in San Francisco. This center will be focused on working in the open source community and providing a scalable, secure, and usable platform for innovation. The Spark Technology Center is a significant investment, designed to grow to hundreds of people and to make substantial and ongoing contributions to the community.

2) IBM is contributing its industry-leading SystemML technology, a robust algorithm engine for large-scale analytics in any environment, to the Apache Spark movement. This contribution will serve to promote open source innovation and accelerate intelligence into every application. We are proud to be partnering with Databricks to put this innovation to work in the community.

3) IBM will host Spark on our developer cloud, IBM Bluemix, offering a hosted service and system architectures, as well as the tools that surround the core technology, to make it easier to consume. Our approach is to accelerate Spark adoption.

4) IBM will deliver software offerings and solutions built on Spark, provide infrastructure to host Spark applications, such as IBM Power and Z Systems, and offer consulting services to help clients build and deploy Spark applications.

IBM is already adopting Spark throughout our business: IBM BigInsights for Apache Hadoop, a Spark service, InfoSphere Streams, DataWorks, and a number of places in IBM Commerce. Too many to list. And IBM Research currently has over 30 active Spark projects that address technology underneath, inside, and on top of Apache Spark.

Our own analytics platform is designed with just this sort of environment in mind: it easily blends these new technologies and solutions into existing architectures for innovation and outcomes. The IBM Analytics platform is ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.

Our strategy is about building on top of and around a successful open platform, and adding something of our own that’s substantial and differentiated. Spark is that platform. We are just at the start of building many solutions that leverage Spark to the advantage of our clients, users, and the developer community.


IBM is now, and has historically been, a significant force supporting open source innovation and collaboration, including a more than $1 billion investment in Linux development. We collaborate in more than 120 projects contributed to the open source community, including Eclipse, Hadoop, Apache Spark, Apache Derby, and Apache Geronimo. IBM is also contributing to Apache Tuscany and Apache Harmony. In terms of code contributions, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux: 6.3 percent of total Linux contributions are from IBM. We've also contributed code to Geronimo and a wide variety of other open-source projects.

We see in Spark the opportunity to benefit data engineers, data scientists, and application developers by driving significant innovation into the community. As these data practitioners benefit from Spark, the innovation will make its way into business applications, as evidenced in the Genomics, Traffic Planning, and Political Analysis solutions mentioned above. Spark is about delivering the analytics operating system of the future: an analytics operating system on which new solutions will thrive, unlocking the big data scale effect. And Spark is about a community of Spark-savvy data scientists and data analysts who can quickly transform today's problems into tomorrow's solutions. Spark is one of the fastest-growing open source projects in history. We are pleased to be part of the movement.