Monday, September 14, 2015

IBM | Spark - An Update

IBM made some significant announcements around our investment in Spark back in June (See here and here). 90 days later, I saw fit to provide an update on where we stand in our community efforts.


Before I get to the details, I want to first restate why I believe Spark will be a critical force in technology, on the same scale as Linux.

1) Spark is the Analytics operating system: Any company interested in analytics will be using Spark.
2) Spark unifies data sources. Hadoop is one of many repositories that Spark may tap into.
3) The unified programming model of Spark makes it the best choice for developers building data-rich analytic applications.
4) The real value of Spark is realized through Machine Learning. Machine Learning automates analytics and is the next great scale effect for organizations.

I was recently interviewed by Datanami on the topic of Spark. You can read it here. I like this article because it presents an industry perspective on Hadoop and Spark, but it's also clear on our unique point of view.

Also, this slideshare illustrates the point of view:


So, what have we accomplished and where are we going?

1) We have been hiring like crazy. As fast as we can. The STC in San Francisco is filling up and we have started to bring on community leaders like @cfregly .

2) Client traction and response has been phenomenal. We have references already and more on the way.

3) We have open sourced SystemML as promised (see on github) and we are working on it in the open. This contribution is over 100,000 lines of code.

4) Spark 1.5 was just released. We went from 3 contributors to 11, in one release cycle. Read more here.

5) Our Spark-specific JIRAs have amounted to ~5,000 lines of code. You can watch them in action here.

6) We are working closely with partners like Databricks and Typesafe.

7) We have trained ~300,000 data scientists through a number of forums. You can also find some deep technical content here.

8) We have seen huge adoption of the Spark Service on Bluemix.

9) We have ~10 IBM products that are leveraging Spark and many more in the pipeline.

10) We launched a Spark newsletter. Anyone can subscribe here.


Between now and the end of the year, we have 5 significant events, where we will have much more to share regarding Spark:

a) Strata NY- Sept 29-Oct 1 in New York, NY.
b) Apache Big Data Conference- Sept 28-30 in Budapest, Hungary.
c) Spark Summit Europe- Oct 27-29 in Amsterdam.
d) Insight- Oct 26-29 in Las Vegas.
e) Datapalooza- November 10-12 in San Francisco.

In closing, here is a peek inside:

Friday, August 21, 2015

3 Business Models for the Data Era


A walk through the meatpacking district in New York City is a lively affair. In the early 1900s, this part of town was known for precisely what its name implies: slaughterhouses and packing plants. At its peak, there were nearly 300 such establishments, located somewhat central to the city and not far from shipping ports. Through the years, this area of town has declined at times, but in the early 1990s, a resurgence began that shows no signs of ending.

Located in some proximity to the fashion district of Manhattan, the Meatpacking district stands for what is modern, hip, and trendy; a bastion of culture in the work-like city. Numerous fashionable retailers have popped up, along with trendy restaurants. And, on the fringes, there is evidence of the New York startup culture flowing from the Flatiron District, sometimes known as Silicon Alley. A visit to one of the companies in this area opened my eyes to some of the innovation that is occurring where data is the business, instead of being an enabler or addition to the business.


As I entered the office of this relatively newborn company, I was confident that I understood their business. It was pretty simple: They were a social sharing application on the web that enabled users to easily share links, share content, and direct their networks of friends to any items of interest. The term social bookmarking was one description I had heard. The business seemed straightforward: Attract some power users, enable them to attract their friends and associates, and based on the information shared, the company could be an effective ad-placement platform, since it would know the interests of the networks and users.

But what if social bookmarking and ad placement were not the business model at all? What if all that functionality was simply a means to another end?


Data has the power to create new businesses and even new industries. The challenge is that there are many biases about the use of data in a business. There is a view that data is just about analytics or reporting. In this scenario, it’s relegated to providing insight about the business. There is another view that data is simply an input into existing products. In this case, data would be used to enrich a current business process, but not necessarily change the process. While these cases are both valid, the power of the Data era enables much greater innovation than simply these incremental approaches.

There are three classes of business models for leveraging data:

Data as a competitive advantage: While this is somewhat incremental in its approach, it is evident that data can be utilized and applied to create a competitive advantage in a business. For example, an investment bank, with all the traditional processes of a bank, can gain significant advantage by applying advanced analytics and data science to problems like risk management. While it may not change the core functions or processes of the bank, it enables the bank to perform them better, thereby creating a market advantage.

Data as improvement to existing products or services: This class of business model plugs data into existing offerings, effectively differentiating them in the market. It may not provide competitive advantage (but it could), although it certainly differentiates the capabilities of the company. A simple example could be a real estate firm that utilizes local data to better target potential customers and match them to vacancies. This is a step beyond the data that would come from the Multiple Listing Service (MLS). Hence, it improves the services that the real estate firm can provide.

Data as the product: This class of business is a step beyond utilizing data for competitive advantage or plugging data into existing products. In this case, data is the asset or product to be monetized. An example of this would be Dun & Bradstreet, which has been known as the definitive source of business-related data for years.

In these business models, there are best practices that can be applied. These best practices are the patterns that teach us how to innovate in the Data era.


Procter & Gamble is legendary for the discipline of brand management. To be the leading consumer packaged-goods company in the world, brand is everything. Ultimately, consumers will buy brands that they know and understand, and those that fulfill an expectation. This was one of the first lessons that Scott Cook learned when he joined Procter & Gamble in the 1970s. He also observed that the core business processes for managing a brand (accounting, inventory, etc.) would be transformed and subsumed by the emerging personal computer revolution. This insight led him to co-found Intuit in 1983, really at the dawn of the expansion of personal computers. The premise was simple: Everyone, small companies and individuals alike, should have access to the financial tools that previously had been reserved for large enterprise companies.

Now broadly known for tax preparation software (TurboTax), along with software solutions for small businesses (QuickBooks) and individuals (Quicken), Intuit has transformed many lives. At over $4 billion in revenue and nearly 8,000 employees, the company is a success by any barometer. Intuit is one of the few software companies to challenge Microsoft head-on and not only live to tell about it, but to prosper in its wake. Microsoft Money versus Quicken was a battle for many years; Microsoft was trying to win with software, while Intuit was focused on winning with data, utilizing software as the delivery vehicle.

Robbie Cape, who ran Microsoft’s Money business from 1999 to 2001, believes that Intuit’s advantage had very little to do with technology. Instead, he attributes Intuit’s success to its marketing prowess. While there may be some truth to the statement, it’s very hard to believe that Intuit had deep enough pockets to out-market Microsoft. Instead, the differentiation seems to come from data.

The NPD Group analyst Stephen Baker said that Intuit won by building out critical mass in financial software and the surrounding ecosystem. Intuit had the insight that adjacency products and services, leveraging data, made the core software much more attractive. This insight led to their early and sustained domination of the retail channel.

Intuit’s ability to collect and analyze a large amount of sensitive and confidential data is nearly unsurpassed. Nearly 20 million taxpayers use TurboTax online, sharing their most personal data. Over 10 million customers use QuickBooks, with employee information for over 12 million people flowing through its software. Brad Smith has been cited as declaring that 20 percent of the United States Gross Domestic Product flows through QuickBooks. No other collection of data has this type and extent of financial information on individuals and small businesses.

With these data assets, Intuit began to publish the Intuit Small Business Index. The Index provides summarized insights about sales, profit, and employment data from the small businesses that use QuickBooks. This information can provide headlights to business and salary trends, which ultimately becomes insight that can be fed back into the product. This was the point that Microsoft Money either missed or simply could not achieve: The value was never in the software itself. The value was in the collection, analysis, and repurposing of data to improve the outcomes for the users.

In 2009, Intuit purchased Mint, a free web-based personal-finance application. Mint took the Intuit business model a step further: They provide their software for free, knowing that it’s a means to an end. The social aspects of Mint enable users to do much more than simply track their spending. Instead, it became a vehicle to compare the spending habits of an individual to others of a similar geography or demographic. The user can do these comparisons, or the comparisons can show up as recommendations from Mint. Further, Mint brings an entirely different demographic of data to Intuit. While the Intuit customer base was largely the 40-and-over demographic (people who had grown up with Quicken, QuickBooks, etc.), Mint attracted the Millennial crowd. The opportunity to combine those two entirely different sets of data was too attractive for Intuit to pass up.

To date, Intuit has not had a strategy for monetizing the data itself. Perhaps that may change in the future. However, with data at the core of its strategy, Intuit has used that data to drive competitive advantage, while software was merely the delivery vehicle. The companies that tried the opposite have not fared so well.


Chapters 1 through 9 offer a multitude of examples in which data is being utilized to improve existing products or services. In the pursuit of a business model leveraging data, this category is often the low-hanging fruit; more obvious, although not necessarily easy to do. The examples covered previously are:

Farming and agriculture: Monsanto is using data to augment applications like FieldScripts, which provides seeding prescriptions to farmers based on their local environments. While Monsanto could provide prescriptions through their normal course of operation, data has served to personalize and thereby improve that offering.

Insurance: Dynamic risk management in insurance, such as pay-as-you-drive insurance, leverages data to redefine the core offering of insurance. It changes how insurance is assessed, underwritten, and applied.

Retail and fashion: Stitch Fix is redefining the supply chain in retail and fashion, through the application of data. Data is augmenting the buying process to redefine traditional retail metrics of inventory, days sales outstanding, etc.

Customer service: Zendesk integrates all sources of customer engagement in a single place, leveraging that data to improve how an organization services customers and fosters loyalty over time.

Intelligent machines: Vestas has taken wind turbines — previously regarded as dumb windmills — and turned them into intelligent machines through the application of data. The use of data changes how their customers utilize the turbines and ultimately optimizes the return on their investment.

Most companies that have an impetus to lead in the Data era will start here: leveraging data to augment their current products or services. It’s a natural place to start, and it is relatively easy to explore patterns in this area and apply them to a business. However, it is unlikely that this approach alone is sufficient to compete in the Data era. It’s a great place to start, but not necessarily an endpoint in and of itself.


Previously in this chapter, the examples demonstrated how data is used to augment existing businesses. However, in some cases, data becomes the product; the sole means for the company to deliver value to shareholders. There are a number of examples historically, but this business model is on the cusp of becoming more mainstream.


In 1841, Lewis Tappan first saw the value to be derived from a network of information. At the time, he cultivated a group of individuals, known as the Mercantile Agency, to act “as a source of reliable, consistent, and objective” credit information. This vision, coupled with the progress that the idea made under Tappan and later under Benjamin Douglass, led to the creation of a new profession: the credit reporter. In 1859, Douglass passed the Agency to his brother-in-law, Robert Graham Dun, who continued expansion under the new name of R.G. Dun & Company.

With a growing realization of the value of the information networks being created, the John M. Bradstreet company was founded in 1849, creating an intense rivalry for information and insight. Later, under the strain caused by the Great Depression, the two firms (known at this time as R.G. Dun & Company and Bradstreet & Company) merged, becoming what is now known as Dun & Bradstreet.

Dun & Bradstreet (D&B) continued its expansion and saw more rapid growth in the 1960s, as the company learned how to apply technology to evolve its offerings. With the application of technology, the company introduced the Data Universal Numbering System (known as D&B D-U-N-S), which provided a numerical identification for businesses of the time. This was a key enabler of data-processing capabilities for what had previously been difficult-to-manage data.

By 2011, the company had gained insight on over 200 million businesses. Sara Mathew, the Chairman and CEO of D&B, commented, “Providing insight on more than 200 million businesses matters because in today’s world of exploding information, businesses need information they can trust.”

Perhaps the most remarkable thing about D&B is the number of companies that have been born out of that original entity. As the company has restructured over the years, it has spun off entities such as Nielsen Corporation, Cognizant, Moody’s, IMS Health, and many others. These have all become substantial businesses in their own right. They are each unique in the markets served and all generate value directly from offering data as a product:

Nielsen Corporation: Formerly known as AC Nielsen, the Nielsen Corporation is a global marketing research firm. The company was founded in 1923 in Chicago, by Arthur C. Nielsen, Sr., in order to give marketers reliable and objective information on the impact of marketing and sales programs. One of Nielsen’s best known creations is the Nielsen ratings, an audience measurement system that measures television, radio, and newspaper audiences in their respective media markets. Nielsen now studies consumers in more than 100 countries to provide a view of trends and habits worldwide and offers insights to help drive profitable growth.

Cognizant: Starting as a joint venture between Dun & Bradstreet and Satyam Computers, the entity was originally designed to be the in-house IT operation for D&B. As the entity matured, it began to provide similar services outside of D&B. The entity was renamed Cognizant Technology Solutions to focus on IT services, while the former parent company of Cognizant Corporation was split into two companies: IMS Health and Nielsen Media Research. Cognizant Technology Solutions became a public subsidiary of IMS Health and was later spun off as a separate company. The fascinating aspect of this story is the amount of intellectual property, data, and capability that existed in this one relatively small part of Dun & Bradstreet. The interplay of data, along with technology services, formed the value proposition for the company.

IMS Health: IMS became an independent company in 1998. IMS’s competitive advantage comes from the network of drug manufacturers, wholesalers, retailers, pharmacies, hospitals, managed care providers, long-term care facilities and other facilities that it has developed over time. With more than 29,000 sources of data across that network, IMS has amassed tremendous data assets that are valuable to a number of constituents (pharmaceutical companies, researchers, and regulatory agencies, to name a few). Like Lewis Tappan’s original company back in 1841, IMS recognized the value of a network of information that could be collected and then provided to others. In 2000, with over 10,000 data reports available, IMS introduced an online store, offering market intelligence for small pharmaceutical companies and businesses, enabling anyone with a credit card to access and download data for their productive use. This online store went a long way towards democratizing access to the data that had previously been primarily available to large enterprise buyers.

Moody’s: Moody’s began in 1900, with the publishing of Moody’s Manual of Industrial and Miscellaneous Securities. This manual provided in-depth statistics and information on stocks and bonds, and it quickly sold out. Through the ups and downs of a tumultuous period, Moody’s ultimately decided to provide analysis, as opposed to just data. John Moody, the founder, believed that analysis of security values is what investors really wanted, as opposed to just raw data. This analysis of securities eventually evolved into a variety of services Moody’s provides, including bond ratings, credit risk, research tools, related analysis, and ultimately data.

Dun & Bradstreet is perhaps the original innovator of the data-is-the-product business model. For many years, their reach and access to data was unsurpassed, creating an effective moat for competitive differentiation. However, as is often the case, the focus on narrow industries (like healthcare) and new methods for acquiring data have slowly brought a new class of competitors to the forefront.


Despite its lack of broad awareness, CoStar is a NASDAQ publicly traded company with revenues of $440 million, 2,500 employees, a stock price that has appreciated 204 percent over the last three years, a customer base that is unrivaled, and a treasure trove of data. CoStar’s network and tools have enabled it to amass data on 4.2 million commercial real estate properties around the world. Simon Law, the Director of Research at CoStar, says, “We’re number one for one very simple reason, and it’s our research. No one else can do what we do.” Here are some key metrics:

* 5.1 million data changes per day
* 10,000 calls per day to brokers and developers
* 500,000 properties canvased nationwide annually
* 1 million property photographs taken annually

CoStar has an abundance of riches when it comes to real estate data. Founded in 1987 by Andrew Florance, CoStar invested years becoming the leading provider of data about space available for lease, comparable sales information, tenant information, and many other factors. The data covers all commercial property types, ranging from office to multi-family to industrial to retail properties.

The company offers a set of subscription-based services, including

CoStar Property Professional: The company’s flagship product, which offers data on inventory of office, industrial, retail, and other commercial properties. It is used by commercial real estate professionals and others to analyze properties, market trends, and key factors that could impact food service or even construction.

CoStar Comps Professional: Provides comparable sales information for nearly one million sales transactions primarily for the United States and the United Kingdom. This service includes deeds of trust for properties, along with space surveys and demographic information.

CoStar Tenant: A prospecting and analytical tool utilized by professionals. The data profiles tenants, lease expirations, occupancy levels, and related information. It can be an effective business development tool for professionals looking to attract new tenants.

CoStarGo: A mobile (iPad) application, merging the capabilities of Property Professional, Comps Professional, Tenant, and other data sources.

The value of these services is determined by the quantity and quality of data. Accordingly, ensuring that the data remains relevant is a critical part of CoStar’s business development strategy.

Since its inception, CoStar has grown organically, but also has accelerated growth through a series of strategic acquisitions. In 2012, CoStar acquired LoopNet, which is an online marketplace for the rental and sale of properties. CoStar’s interest in the acquisition was less about the business (the marketplace) and much more about the data. Said another way, their acquisition strategy is about acquiring data assets, not people or technology assets (although those are often present). As a result of the acquisition, it is projected that CoStar will double their paid subscriber base to 160,000 professionals, which represents about 15 percent of the approximately 1 million real estate professionals. Even more recently, in 2014, CoStar acquired, a digital alternative to classified ads. The war chest of data assets continues to grow.

The year 2008 was one of the most significant financial crises the world has seen. Financial institutions collapsed, and the real estate market entered a depression based on the hangover from subprime debt. Certainly, you would expect a company such as CoStar to see a similar collapse, given their dependence on the real estate market. But that’s not exactly what happened.

From 2008 to 2009, CoStar saw an insignificant revenue drop of about 1 percent. This drop was followed by an exponential rebound to growth in 2010 and beyond. Is it possible that data assets create a recession-proof business model?

CoStar Financial Results

While there are other data alternatives in the market (Reis Reports, Xceligent, CompStak, ProspectNow, and others), the largest collection of data is a differentiator. In fact, it is a defensible moat that makes it very hard for any other competitors to enter. For CoStar, data is the product, the business, and a control point in the industry.


In 1959, Richard O’Brien founded IHS, a provider of product catalog databases on microfilm for aerospace engineers. O’Brien, an engineer himself, saw how difficult it was to design products and knew that utilizing common components could dramatically increase the productivity of engineers. However, he took it one step further by applying technology to the problem — using electronic databases and CD-ROMs to deliver the knowledge furthered the productivity gains. And engineers love productivity gains.

This attitude toward data was set in the company’s DNA from the start as the company could see how to improve the lives and jobs of their clients, just through a better application of data. In the 1990s, IHS started distributing their data over the Internet, and with an even more cost-effective way to share data assets, they decided to expand into other industries. Soon enough, IHS had a presence in the automotive industry, construction, and electronics.

As seen with CoStar, IHS quickly realized that business development for a data business could be accelerated through acquisitions. Between 2010 and 2013, they acquired 31 companies. This acquisition tear continued, with the recent high-profile $1.4-billion acquisition of R.L. Polk & Company. As the parent company of CARFAX, R.L. Polk cemented IHS’s relevance in the automotive industry.

IHS’s stated strategy is to leverage their data assets across interconnected supply chains of a variety of industries. For their focus industries, there is $32 trillion of annual spending in those companies. IHS data assets can be utilized to enhance, eliminate, or streamline that spending, which makes them an indispensable part of the supply chain. IHS’s data expertise lies in:

*Renewable energy
*Automotive parts
*Aerospace and defense
*Maritime logistics

IHS also has a broad mix of data from different disciplines.

While some view data as a competitive differentiator or something to augment current offerings, CoStar, IHS, and D&B are examples of companies that have a much broader view of data: a highly profitable and defensible business model.


The role of data in enterprises has evolved over time. Initially, data was used for competitive advantage to support a business model. This evolved to data being used to augment or improve existing products and services. Both of these applications of data are relevant historically and in the future. However, companies leading in the Data era are quickly shifting to a business model of data as the product. While there have been examples of this in history as discussed in this chapter, we are at the dawn of a new era, the Data era, where this approach will become mainstream.

The company that started as a social bookmarking service quickly realized the value of the data that they were able to collect via the service. This allowed them to build a product strategy around the data they collected, instead of around the service that they offered. This opportunity is available to many businesses, if they choose to act on it.

This post is adapted from the book, Big Data Revolution: What farmers, doctors, and insurance agents teach us about discovering big data patterns, Wiley, 2015. Find more on the web at

Monday, July 27, 2015

Reinventing Retail: Customer Intimacy in the Data Era

Retail has continually reinvented itself over the past 100-plus years. Every 20 to 30 years, the form of retail has changed to meet the changing tastes of the public. McKinsey & Company, the global strategy consultancy, has explored the history of retail in depth, citing five distinct timeframes:

*1900s: The local corner store was prominent in many towns. These small variety stores offered a range of items, including food, clothes, tools, and other necessities. The primary goal was to offer anything a person would need for day-to-day life.

*1920–1940: The corner store was still prominent but had grown to a much larger scale. In this era, department stores first began to emerge, and some specialization of stores began to occur.

*1940–1970: In order to effectively deal with some of the specialization seen in the previous era, this timeframe was marked by the emergence of malls and shopping centers. This allowed for concentration of merchants, many of whom served a unique purpose.

*1970–1990: Perhaps best described as the Walmart era — a time when large players emerged, putting pressure on local store owners. These massive stores offered one-stop shopping and previously unseen value in terms of pricing and promotions. The size of these stores gave them economies of scale, which enabled aggressive pricing, with the savings passed on to the consumer.

*1990–2008: This era was marked by increased focus on discounting and large selection, coupled with the emergence of e-commerce.

Each era represented a significant innovation in the business model, but more important was the impact it had on each part of the retail value chain: merchandise and pricing, store experience, and the approach to marketing. Each new era has struggled to balance its innovations and expansion with a key hallmark of the past: customer intimacy.


Retail, by definition, is mass market. It has been through every era. While subtle changes in approach have occurred, very few have captured the intimacy of the original corner store. The corner store’s owner knew the customers personally; he understood what was happening in their lives, and the store became an extension of the community. In the Data era, mass marketing can reclaim the corner-store experience.

Stitch Fix

Stitch Fix is a Data era retailer, focused on personalizing a shopping experience for women. While many women love clothes shopping, Stitch Fix realized that it is an inefficient experience today. It requires visiting many stores, selecting items to try on, and repeating. In fact, a successful shopping trip requires a relatively perfect set of variables to align:

*Location: A store must be near the shopper.
*Store: The store itself must interest the shopper and draw them in.
*Clothing: The clothing in the store must be of interest to the shopper.
*Circumstance: The clothing must match the circumstance for which the shopper needs clothes (dinner party, wedding, outing, etc.).
*Size: Even if all the preceding elements are present, the store must have the right size clothing in stock.
*Price: Even if all the preceding elements exist, the shopper must be able to afford the clothing.

To some extent, it’s amazing that all of these variables ever align. And perhaps they do not, which leads to compromise. But if all the variables could align and occurred repeatedly, would the shopper be more inclined to buy? Yes, and hence the premise of Stitch Fix.

Stitch Fix is disrupting fashion and retail, targeting professional women shoppers who want all the variables to align. These women have neither the time nor, perhaps, the inclination to search for that alignment. As Katrina Lake, the CEO and cofounder, states, “We’ve created a way to provide scalable curation. We combine data analytics and retail in the same system.”

When a person signs up for the service, she provides a profile of her preferences: style, size, profession, budget, etc. The data from that profile become attributes in Stitch Fix’s systems, which promptly schedule the dates to receive the clothes, assign a stylist based on best fit, and enable the stylist to see the person’s profile (meaning her likes and dislikes). The customer also specifies when and how often she wants to receive a fix, which is a customized selection of five clothing items. Then the data-and-algorithms team will present suggestions to the stylist. This recommendation system helps the stylist make great decisions. Once the customer receives the fix, she can keep what she wants and send back the rest. Stitch Fix maintains the data on preferences so that, over time, it becomes a giant analytics platform, where recommendations can be tailored to a unique shopper. Not since the corner store has such intimacy been available, and it’s all because of the data. Clients are happier, the job of the stylist is easier, and these data then feed into the backend processes.
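To make the loop described above concrete, here is a minimal sketch in Python. It is not Stitch Fix's actual system: the `Profile` and `Item` fields, the `suggest` and `record_feedback` functions, and the crude rule of dropping a whole style after one return are all assumptions made up for illustration. It only shows the general shape: filter inventory on the hard constraints from the profile, rank what remains for the stylist, and fold feedback back into the profile.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical customer profile; the field names are illustrative only."""
    size: str
    budget: float
    styles: set[str]
    disliked: set[str] = field(default_factory=set)


@dataclass
class Item:
    """Hypothetical inventory item."""
    name: str
    size: str
    price: float
    style: str
    in_stock: bool = True


def suggest(profile: Profile, inventory: list[Item], limit: int = 5) -> list[Item]:
    """Return up to `limit` candidate items for a stylist to review."""
    # Hard constraints: stock, size, price, and known dislikes must all line up.
    candidates = [
        item for item in inventory
        if item.in_stock
        and item.size == profile.size
        and item.price <= profile.budget
        and item.style not in profile.disliked
    ]
    # Soft preference: liked styles sort first, then cheaper items.
    candidates.sort(key=lambda item: (item.style not in profile.styles, item.price))
    return candidates[:limit]


def record_feedback(profile: Profile, item: Item, kept: bool) -> None:
    """Fold explicit feedback on one item back into the profile (assumed rule)."""
    if kept:
        profile.styles.add(item.style)
        profile.disliked.discard(item.style)
    else:
        profile.disliked.add(item.style)
        profile.styles.discard(item.style)


# Example: one customer, a tiny made-up inventory, and one round of feedback.
profile = Profile(size="M", budget=80.0, styles={"classic"})
inventory = [
    Item("blazer", "M", 75.0, "classic"),
    Item("tunic", "M", 40.0, "boho"),
    Item("dress", "S", 60.0, "classic"),
    Item("coat", "M", 150.0, "classic"),
]
first_fix = suggest(profile, inventory)
record_feedback(profile, first_fix[-1], kept=False)
second_fix = suggest(profile, inventory)
```

In this sketch the stylist stays in the loop: `suggest` only narrows the field, and a person would make the final selection. A real system would learn from far richer signals than a single keep-or-return flag.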

Retail is a difficult business. Fashion retail is even harder. It’s not as simple as managing the supply chain (although that’s not simple) because changing styles, seasons, and tastes are overlaid against the more traditional issues of sizes and stock. Any one poor decision can destroy the profit of a fashion retailer for a particular period, and therefore making the right decisions is at a premium. Stitch Fix attacks this challenge with human capital. Said a different way, this is not your typical management team for a fashion retailer. The leader of Operations at Stitch Fix comes from, while the analytics leader was previously an executive at Netflix. In a sense, Stitch Fix is building a supply chain and data analytics company that happens to focus on fashion. Not the other way around.

The company is making the bet that better customer insight will resolve many of the common fashion retailer issues: returns (ensuring fewer returns), inventory (predicting what people will want), and higher inventory turns (stocking things that customers will buy in the near-term). While Stitch Fix may not succeed as a retailer (although we think it will), it is laying the groundwork for the architecture of a retailer in the Data era.

Ms. Lake makes it clear that the company is first and foremost a retailer, but a retailer with a unique business model incorporating data and technology. Lake says, “We are a retailer. We just use data to be better at the core functions of retail. It’s hard to buy inventory accurately without knowing your customer, so we use data in the sourcing process as well.” She cites the example of looking at not just basic sizes (S, M, L or 2, 4, 6) as most buyers would, but looking at the detail of inseam size too. They can use this level of granularity in the buying process because of data. This attention to detail leads to a better fit for their clients and a higher likelihood those clients will buy.

Most data leveraged by Stitch Fix is generated by the company. Their advantage comes from the large amount of what Lake calls explicit data, which is direct feedback from clients on every fix. That’s specific, unique, and real-time feedback that can be incorporated into future fixes and purchases. The buyers at Stitch Fix, responsible for stocking inventory according to new trends and feedback, love this data, as it tells them what to buy and focus on. As Lake says, “What customers buy and why, and what they don’t buy and why not, is very powerful.”

Stitch Fix has analyzed over 500 million individual data points. While the company has shipped over 100,000 fixes, no two have ever been the same. That’s personalization. The company sells 90 percent of the inventory that it buys each month at full price, again because of personalization. Data and personalization have the impact of delighting clients while revolutionizing the metrics of retail.


Zara’s business model is based on scarcity. In a store, if a shopper sees a pair of pants he likes, in his size, he knows it’s the only one that will ever be available, which drives him to purchase impulsively and with conviction. Scarcity is a powerful motivator. In 2012, Inditex (the parent company of Zara) reported total sales of $20.7 billion, with Zara representing 66 percent of total sales (or $13.6 billion), with 120 stores worldwide. Scarcity can also be a revolutionary business model and profit producer.

Amancio Ortega was born in Spain in 1936. In 1972, he founded Confecciones Goa to sell quilted bathrobes. He quickly learned the complexity of fashion, extending to retail, as he operated this supply chain of his own creating. Using sewing cooperatives, Ortega relied on thousands of local women to produce the bathrobes. This was the most cost-effective way for him to produce robes, but it came with the complexity of managing literally thousands of suppliers. This experience taught Ortega the importance of vertical integration or, said another way, the value of owning every step of the value chain. He founded Zara in 1975, with this understanding.

Zara uses data to expedite the entire process of the value chain. While it takes a typical retailer 9 to 12 months to go from concept to store shelf, Zara can do the same in approximately two weeks. This reduced timetable is accomplished through the use of data: The stores directly feed the design team with real-time behavioral data. Zara’s designers create approximately 40,000 new designs annually, from which 10,000 are selected for production. Given the range of sizes and colors, this variety of choice leads to approximately 300,000 new stock keeping units (SKUs) every year.
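To make the design funnel concrete, here is a quick back-of-the-envelope check using only the three figures quoted above. The variable names are mine, and the "variants per design" number is simply the implied average, not a figure Zara reports.

```python
# Figures quoted in the paragraph above (all approximate, per year).
designs_created = 40_000
designs_produced = 10_000
new_skus = 300_000

# Share of designs that make it to production.
selection_rate = designs_produced / designs_created

# Implied average number of size/color combinations per produced design.
skus_per_design = new_skus / designs_produced

print(f"{selection_rate:.0%} of designs reach production")
print(f"about {skus_per_design:.0f} size/color variants per produced design")
```

In other words, roughly one design in four is produced, and each produced design fans out into about 30 SKUs once sizes and colors are counted.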

Zara’s approach to the business has become known as fast fashion, as they will quickly adapt their designs to what is happening on the store floor, usher new products quickly to market, and just as swiftly move on to the next thing. This fast pace drives incredible efficiency in the implementation of the business model, yet at the same time, it creates enormous customer loyalty and intimacy, given the role of scarcity. Since the business can react so quickly, there is always sufficient capacity to produce the right design at the right time.

Zara’s system depends on the frequent sharing and exchange of data throughout the supply chain. Customers, store managers, designers, production staff, buyers, and warehouse managers are all connected by data and react accordingly. Data drives the business model, but it’s the reaction to the data that produces competitive advantage. Many businesses have a lot of data, but very few utilize it to rapidly effect decision making.

Unsold items account for less than 10 percent of Zara’s stock, compared with the industry average of 17 to 20 percent. This is the data in action. According to Forbes, “Zara’s success proves the theory that if a retailer can forecast demand accurately, far enough in advance, it can enable mass production under push control and lead to well managed inventories, lower markdowns, higher profitability (gross margins), and value creation for shareholders in the short- and long-term.”


Stitch Fix and Zara each provide a glimpse into the future of retail. It's not simply about ecommerce and automation. Instead, with the power of data, a retailer can redefine core business processes and, in many cases, invent new ways of interacting with customers. This new level of intimacy changes the role that a retailer plays in a consumer's life: from a sales outlet to a trusted advisor. However, knowing what needs to be done is easier than actually doing it, and therein lies the challenge for all fashion designers and retailers.

This post is adapted from the book, Big Data Revolution: What farmers, doctors, and insurance agents teach us about discovering big data patterns, Wiley, 2015. Find more on the web at

Monday, July 6, 2015

100% Effectiveness

In a recent profile, Reid Hoffman declared that he is only operating at 60% of capacity/effectiveness. Given that this is coming from the founder/Chairman of LinkedIn, and someone who is also a Partner at Greylock, it makes you think twice. It made me wonder if I'm setting the bar too low.


The Stanford Graduate School of Business has done a nice job with its 'Insights' program. All/most of them are available to view online. I recently watched the one with Steve Schwarzman and his views on talent and hiring resonated with me.

He talks about assessing the talent in your organization on a scale of 1 to 10 (10 being best). He says,

"If you're a 10, God bless you. You'll be wildly successful. If you attract 10's, they always make it rain if you need rain. A 10 knows how to sense problems, design solutions, and do new things.

A nine is great at executing. They come up with good strategies, but not great strategies. A firm full of nines, that's a winning firm. Eights, they just do stuff that you tell them. And sevens and below, I don't know what they are since we don't tolerate them."

Let me paraphrase and augment the descriptions a bit:

10
-designs great strategies
-leads from the front
-senses problems/issues and resolves them
-constantly drives new initiatives and creates new value
-executes and delivers...over and over again

9
-designs good strategies
-demonstrates attributes of a great leader
-executes flawlessly
-resolves issues quickly, as they are understood or highlighted

8
-executes flawlessly

7 and below

I realized when I heard Schwarzman talking and then paraphrased per above, that my post on "Principles of Great Performance" was a bit off. In that post, I really defined the principles of an 8 or 9 performer. This confirmed that I am mentally setting the bar too low.

Perhaps I am operating at a mere 50% of capacity.


Other great Stanford Insights interviews:

Ajay Banga, Mastercard
Marc Andreessen, a16z
Vinod Khosla, Khosla Ventures

Tuesday, June 16, 2015

Monday, June 15, 2015

Scale Effects, Machine Learning, and Spark

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integrated product offering for web services. His idea was to put together a coordinated package, which became WebSphere. The problem was that a key piece of the package, IBM’s web server software, was technically weak. It held less than 1 percent of a market..”

“Barry approached Brian Behlendorf [President of the Apache Software Foundation] and the two quickly discovered common ground on technology issues. Building a practical relationship that worked for both sides was a more complex problem. Behlendorf’s understandable concern was that IBM would somehow dominate Apache. IBM came back with concrete reassurances: It would become a regular player in the Apache process, release its contributions to the Apache code base as open source, and earn a seat on the Apache Committee just the way any programmer would by submitting code and building a reputation on the basis of that code. At the same time, IBM would offer enterprise-level support for Apache and its related WebSphere product line, which would certainly help build the market for Apache.”

-Reference: The Success of Open Source, Steven Weber 2004


In the 20th century, scale effects in business were largely driven by breadth and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades.

The Internet changed all of that.

In the modern era, there are three predominant scale effects:

-Network: lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, etc.)
-Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
-Data: superior machine learning and insight, driven from a dynamic corpus of data

I profiled a few of the companies that are exploiting data effects in Big Data Revolution —CoStar, IMS Health, Monsanto, etc. But by and large, big data is an unexploited scale effect in institutions around the world.

Spark will change all of that.


Thirty days ago, we launched Hack Spark in IBM, and we saw a groundswell of innovation. We made Spark available across IBM’s development community. Teams formed based on interest areas, moonshots were imagined, and many became real. We gave the team ‘free time’ to work on Spark, but the interest was so great that it began to monopolize their nights and weekends. After ten days, we had over 100 submissions in our Hack Spark contest.

We saw things accomplished that we had not previously imagined. That is the power of Spark.

To give you a sampling of what we saw:

Genomics: A team built a powerful development environment of SQL/R/Scala for data scientists to analyze genomic data from the web or other sources. They provided a machine learning wizard for scientists to quickly dig into chromosome data (k-means clustering of genomes by population). This auto-scalable cloud system increased the speed of processing and analyzing massive genome data and put the power in the hands of the person who knows the data best. Exciting.

Traffic Planning: A team built an Internet of Things (IoT) application for urban traffic planning, providing real-time analytics with spatial and cellular data. Messaging queues could not handle the massive and continuous data inputs. Data lakes could not handle the large volume of cellular signaling data in real time. Spark could. The team used Spark as the engine of the computing pool, Oozie to build the controller module, and Kafka as the messaging module. The result is an application that processes massive cellular signal data and visualizes the resulting analytics in real time. Smarter Planet indeed.

Political Analysis: A team built a real-time analytics platform to measure public response to speeches and debates as they happen. The team built a Spark cluster on top of Mesos, used Kafka for data ingestion and Cloudant for data storage. Spark Streaming was deployed for processing. Political strategists, commentators, and advisors can isolate the specific portion of a speech that produces a shift in audience opinion. The voice of the public, in real time.
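For readers who have not met k-means, the technique mentioned in the Genomics project above, here is a minimal sketch of the idea in plain Python. This is not the team's code: the sample points are hypothetical stand-ins for per-genome feature vectors, the `farthest_first` helper is my own simple way to pick starting centroids, and at real scale one would reach for a distributed implementation such as the one in Spark's MLlib rather than a single-machine loop like this.

```python
from math import dist


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from every centroid chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Basic k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical 2-D feature vectors, two obvious groups of three.
samples = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
           (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(samples, k=2)
print(labels)
```

The algorithm never sees a population label; it only groups points that sit close together, which is why it suits exploratory work where the person who knows the data best interprets the resulting groups.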

Spark is changing the face of innovation in IBM. We want to bring the rest of the world along with us.


Apache Spark lowers the barrier to entry to build analytics applications, by reducing the time and complexity to develop analytic workflows. Simply put, it is an application framework for doing highly iterative analysis that scales to large volumes of data. Spark provides a platform to bring application developers, data scientists, and data engineers together in a unified environment that is not resource-intensive and is easy to use. This is what enterprises have been clamoring for.

An open-source, in-memory compute engine, Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Today, business professionals have analytics in their hands in the form of visual dashboards that inform them what is happening. Think of this as descriptive analytics. Now, with Apache Spark, these can be complemented with analytics smarts built into applications that learn from their surroundings and specify actions in the moment. Think of it as prescriptive analytics. This means that, with Spark, enterprises can deploy insights into applications at the front lines of their business exponentially faster than ever before.

Spark is highly complementary to Hadoop. Hadoop makes managing large volumes of data possible for many organizations due to its distributed file system. It has grown to a broad ecosystem of capabilities that span data integration and data discovery. It changed the speed at which data could be collected, and fundamentally changed how we make data available to people. Spark complements Hadoop by providing an in-memory compute engine to perform non-linear analysis. Hadoop delivered mass quantities of data, fast. But the real value of data cannot always be exposed because there isn’t an engine to push it through. With Spark, there’s a way to understand which data is valuable and which is not. A client can leverage Spark to augment what they are doing with Hadoop or use Spark on a stand-alone basis. The approach is in the eye of the beholder.


While there are many dimensions to the Spark ecosystem, I am most excited by machine learning. Machine learning is better equipped to deal with the modern business environment than traditional statistical approaches, because it can adapt. IBM’s machine learning technology makes expressing algorithms at scale much faster and easier. Our data scientists, mathematicians, and engineers will work with the open source community to help push the boundaries of Spark technology with the goal of creating a new era of smart applications to fuel modern and evolving enterprises.

With machine learning at the core of applications, they can drive insight in the moment. Applications with machine learning at their core get smarter and more customized through interactions with data, devices and people—and as they learn, they provide previously untapped opportunity. We can take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed.

It is my view that over the next five years, machine learning applications will lead to new breakthroughs that will assist us in making good choices, look out for us, and help us navigate our world in ways never before dreamed possible.


I see Apache Spark as the analytics operating system of the future, and we are investing to grow Spark into a mature platform. We believe it is the best technology today for attacking the toughest problems of organizations of all sizes and delivering the benefits of intelligence-based, in-time action. Our goal is to be a leading committer and technology contributor in the community. But actions speak louder than words, which brings us to today’s announcements:

1) IBM is opening a Spark Technology Center in San Francisco. This center will be focused on working in the open source community and providing a scalable, secure, and usable platform for innovation. The Spark Technology Center is a significant investment, designed to grow to hundreds of people and to make substantial and ongoing contributions to the community.

2) IBM is contributing its industry-leading SystemML technology, a robust algorithm engine for large-scale analytics for any environment, to the Apache Spark movement. This contribution will serve to promote open source innovation and accelerate intelligence into every application. We are proud to be partnering with Databricks to put this innovation to work in the community.

3) IBM will host Spark on our developer cloud, IBM BlueMix, offering a hosted service and system architectures, as well as the tools that surround the core technology to make it easier to consume. Our approach is to accelerate Spark adoption.

4) IBM will deliver software offerings and solutions built on Spark, provide infrastructure to host Spark applications such as IBM Power and Z Systems, and offer consulting services to help clients build and deploy Spark applications.

IBM is already adopting Spark throughout our business: IBM BigInsights for Apache Hadoop, a Spark service, InfoSphere Streams, DataWorks, and a number of places in IBM Commerce. Too many to list. And IBM Research currently has over 30 active Spark projects that address technology underneath, inside, and on top of Apache Spark.

Our own analytics platform is designed with just this sort of environment in mind: it easily blends these new technologies and solutions into existing architectures for innovation and outcomes. The IBM Analytics platform is ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.

Our strategy is about building on top of and around a successful open platform, and adding something of our own that’s substantial and differentiated. Spark is that platform. We are just at the start of building many solutions that leverage Spark to the advantage of our clients, users, and the developer community.


IBM is now, and has historically been, a significant force supporting open source innovation and collaboration, including a more than $1 billion investment in Linux development. We collaborate in more than 120 projects contributed to the open source community, including Eclipse, Hadoop, Apache Spark, Apache Derby, and Apache Geronimo. IBM is also contributing to Apache Tuscany and Apache Harmony. In terms of code contributions, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux: 6.3 percent of total Linux contributions are from IBM. We’ve also contributed code to Geronimo and a wide variety of other open-source projects.

We see in Spark the opportunity to benefit data engineers, data scientists, and application developers by driving significant innovation into the community. As these data practitioners benefit from Spark, the innovation will make its way into business applications, as evidenced in the Genomic, Urban Traffic, and Political Analysis solutions mentioned above. Spark is about delivering the analytics operating system of the future—an analytics operating system on which new solutions will thrive, unlocking the big data scale effect. And Spark is about a community of Spark-savvy data scientists and data analysts who can quickly transform today's problems into tomorrow's solutions. Spark is one of the fastest-growing open source projects in history. We are pleased to be part of the movement.

Wednesday, June 3, 2015

Technical Leadership

As companies grow and mature, it is difficult to maintain the pace of innovation that existed in the early days. This is why many companies (e.g., those in the Fortune 500) sometimes lose their innovation edge as they mature. The edge is lost when technical leadership in the company either takes a backseat or evolves to a different role than the one it had in the early days. I see a number of companies where, over time, the technical managers give way to "personnel" or "process" managers, which tends to be a death knell for innovation.

Great technical leaders provide a) team support and motivation, b) technical excellence, and c) innovation. Said another way, they lead through their actions and thought leadership.


As I look at large organizations today, I believe that technical leaders fall into 3 types (this is just my framework for characterizing what I see).

The Ambassador
A technical leader of this type brings broad insight and knowledge and typically spends a lot of time with the clients of the company. They drive clients in broad directional discussions and will often be a part of laying out logical architectures and approaches. They are typically not as involved where the rubber hits the road (i.e., implementing architectures or driving specific influence on product roadmaps). Most of the artifacts from The Ambassador are in email, PowerPoint, and discussion (internally and with clients).

The Developer
A technical leader that is very deep, typically in a particular area. They know their user base intimately and use that knowledge to drive changes to the product roadmap. They are heavily involved in critical client situations, as they have the depth of knowledge to solve the toughest problems and they make the client comfortable due to their immense knowledge. Most of the artifacts from The Developer are code in a product and a long resume of client problems solved and new innovations delivered in a particular area.

The Ninja
A technical leader that is deep, but broad as appropriate. They integrate across capabilities and products, to drive towards a market need. They have a 'build first' mentality, or what I call a 'hacker mentality'. They would rather hack up a functional prototype in 45 days than do a single slide of PowerPoint. Their success is defined by their ability to introduce a new order to things. They thrive on user feedback and iterate quickly, as they hear from users. Said another way, they build products like a start-up would. Brian, profiled here, is a great example of a Ninja. Think about the key attributes of Brian's approach:

1) Broad and varied network of relationships
2) Identifying 'strategy gaps'
3) Link work to existing priorities
4) Work with an eye towards scale
5) Orchestrating milestones to build credibility

That's what Ninjas do.


Most large companies need Ambassadors, Developers, and Ninjas. They are all critical and they all have a role. But, the biggest gap tends to be in the Ninja category. A company cannot have too many, and typically does not have enough.