I figured I'd focus my first post on a broad topic and what better way than to discuss what this blog will revolve around: data science! Actually, when I talk to people about data science, I usually get blank stares, maybe a head nod or two, and if I'm lucky, someone will ask "is that related to big data?". This is understandable because data science is an emerging field—practically everyone has their own definition, so I'd like to begin by sharing mine.
After reading myriad blogs and white papers and chatting with data scientists, I've gathered that data science originated as a buzzword, popularized in the past five years by Drew Conway's Venn diagram. Since then, a multitude of Venn diagrams have cropped up, some more serious than others, but they all agree on one thing: data science combines mathematics, computer programming, data visualization, machine learning, business strategy, and data mining to solve problems and answer questions within a particular industry, such as finance, pharma, energy, healthcare, or manufacturing. Perhaps that's why data scientists are referred to as unicorns: finding someone with such broad experience is exceedingly challenging. Over the past few years, however, data science has steadily matured with the emergence of competitive fellowships and online courses. There are also numerous bootcamps, and some universities have even begun offering graduate degrees.
Here's my interpretation of data science: the use of data to derive insights or develop a product that bolsters business value. I know that sounds comically ambiguous, but to put it into context, let's compare data science to data analytics.
Data analytics has existed for decades, popularized by the advent of Excel. Data analysts come in many flavors: marketing analyst, financial analyst, operations analyst, and so on. Whatever the title, they all use data to summarize what is happening in the company or organization. Data analytics, which is synonymous with business intelligence when focusing on historical data, is crucial for providing a descriptive picture of correlations or answering questions such as:
- How do our costs compare to our competitors' costs?
- Which products or clients are our biggest winners/losers?
- How are our customers purchasing our products (online vs. brick and mortar)?
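A question like the last one usually boils down to a descriptive summary of historical records. Here is a minimal sketch of that idea in Python; the `SALES` records and the function names are hypothetical, invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical sales records invented for illustration: (channel, revenue).
SALES = [
    ("online", 120.0), ("store", 80.0), ("online", 200.0),
    ("store", 50.0), ("online", 30.0),
]

def revenue_by_channel(sales):
    """Sum revenue per purchase channel."""
    totals = defaultdict(float)
    for channel, revenue in sales:
        totals[channel] += revenue
    return dict(totals)

def revenue_share(sales):
    """Express each channel's revenue as a fraction of the grand total."""
    totals = revenue_by_channel(sales)
    grand_total = sum(totals.values())
    return {channel: total / grand_total for channel, total in totals.items()}

totals = revenue_by_channel(SALES)
shares = revenue_share(SALES)
print(totals)
print(shares)
```

The output describes what already happened (how much was sold through each channel) without saying why or what will happen next.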
On the other hand, data science takes the insight generated from data analytics one step further to a) figure out why something is happening in the company or organization, b) determine how to utilize this information to improve future decision-making (also known as predictive analytics), and/or c) develop a product.
As data science contains the word science, its workflow typically involves applying the scientific method to reach the above goals. Like true scientists (here's where they really begin to differ from analysts), data scientists come up with their own questions to investigate, generate hypotheses, and gather the data necessary to test those hypotheses. Since real-world datasets are fragmented and notoriously messy, data scientists often spend the majority of their time cleaning and organizing the data (referred to as data wrangling) before performing an analysis. The final component of the workflow is telling a story and providing an actionable or "productionizable" recommendation; data scientists therefore need to master building articulate presentations and visualizations to communicate their findings.
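To give a feel for what data wrangling looks like in practice, here is a small sketch in plain Python. The `RAW_ROWS` records, the field names, and the cleaning rules are all assumptions made up for this example; real projects typically need many more rules than this.

```python
# Hypothetical raw records invented for illustration: inconsistent casing,
# stray whitespace, and missing or unparseable amounts.
RAW_ROWS = [
    {"customer": " Alice ", "amount": "19.99", "channel": "Online"},
    {"customer": "bob", "amount": "", "channel": "store"},
    {"customer": "ALICE", "amount": "5", "channel": "ONLINE "},
    {"customer": "Carol", "amount": "n/a", "channel": "Store"},
]

def clean_row(row):
    """Return a normalized copy of row, or None if the amount is unusable."""
    try:
        amount = float(row["amount"])
    except ValueError:
        # Empty strings and placeholders like "n/a" cannot be parsed.
        return None
    return {
        "customer": row["customer"].strip().title(),
        "amount": amount,
        "channel": row["channel"].strip().lower(),
    }

def clean_rows(rows):
    """Clean every row and drop the ones that could not be repaired."""
    cleaned = (clean_row(row) for row in rows)
    return [row for row in cleaned if row is not None]

CLEAN_ROWS = clean_rows(RAW_ROWS)
for row in CLEAN_ROWS:
    print(row)
```

Dropping unusable rows is only one possible choice here; depending on the question being asked, it may make more sense to fill in missing values or flag them for review.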
One example of the analyses a data scientist might perform is A/B testing, which helps unearth the driving force behind trends or correlations. The bulk of the analyses, however, use an algorithmic approach called machine learning, which allows computers to analyze and make predictions from data with little human intervention. Larger datasets generally enable more accurate predictions (hence the commotion over big data). Machine learning is a means to achieve AI and, according to Arthur Samuel, "gives computers the ability to learn without being explicitly programmed". In a nutshell, rather than giving a machine a list of instructions to follow (aka a computer program), machine learning trains it on examples, much like we teach our children or pets. If you're thinking about Skynet or the Matrix, you've been watching too many sci-fi movies! Data scientists apply machine learning to build AI systems that detect fraud, understand how a customer segment "feels" about a product, diagnose diseases, predict equipment failure, forecast sales, optimize the value chain, determine a pricing strategy, accelerate drug discovery, and more.
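The difference between "explicitly programmed" and "learned" can be sketched with a toy fraud detector. Everything below is an assumption made for illustration: the `TRANSACTIONS` data is invented, the hand-picked cutoff of 1000 is arbitrary, and the learner is a deliberately simple one-feature threshold search rather than anything a production system would use.

```python
# Hypothetical labeled examples invented for illustration:
# (transaction amount, is_fraud).
TRANSACTIONS = [
    (12.0, False), (25.0, False), (40.0, False), (55.0, False),
    (300.0, True), (450.0, True), (80.0, False), (520.0, True),
]

def rule_based_flag(amount):
    """Explicitly programmed: a human chose the 1000 cutoff by hand."""
    return amount > 1000.0

def learn_threshold(examples):
    """Learn a cutoff from labeled examples.

    Tries the midpoint between each pair of neighboring amounts and keeps
    the one that misclassifies the fewest examples.
    """
    amounts = sorted(amount for amount, _ in examples)
    candidates = [(lo + hi) / 2 for lo, hi in zip(amounts, amounts[1:])]

    def errors(threshold):
        return sum((amount > threshold) != label for amount, label in examples)

    return min(candidates, key=errors)

def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(amount) == label for amount, label in examples) / len(examples)

threshold = learn_threshold(TRANSACTIONS)

def learned_flag(amount):
    """Flag a transaction using the cutoff learned from the data."""
    return amount > threshold

print("learned threshold:", threshold)
print("rule-based accuracy:", accuracy(rule_based_flag, TRANSACTIONS))
print("learned accuracy:", accuracy(learned_flag, TRANSACTIONS))
```

On this toy data the learned cutoff separates the examples perfectly, while the hand-written rule misses every fraudulent transaction. Note that the accuracy is measured on the same examples used for learning, so it says nothing about how well the cutoff would hold up on new data.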
Certain companies also use data science to develop data-driven, AI-based products. For example, customer service systems are now being supplemented with chatbots. Marketing teams use AI to understand and adapt to dynamic customer behaviors and preferences. Your email spam filter and your phone's navigation app are built using machine learning. Netflix, Yelp, and Pandora incorporate AI to provide you with personalized recommendations. And while machines have been able to "see" and "hear" for decades, data science has empowered them to "think", "understand", and "adapt" as well, thus enabling lifelike robots, digital assistants such as Siri, and the quest for driverless cars.
While I've given the impression that data science can be easily distinguished from data analytics, many companies use the terms interchangeably; startups may even have the same person wearing both hats. Moreover, some data scientists will work on all three aspects I've mentioned here, while others will have more focused roles. The takeaway is that opinions still vary on what exactly data science is. In fact, I don't even think it's a new field of science per se; rather, I agree with Jim Gray: data science is the latest evolution of scientific thinking, following the empirical, theoretical, and computational approaches. Since this is the first blog post, let's stop here for now. I could go on about why data science couldn't have existed at any point in history until now or dive deeper into machine learning, but I'll leave that for another day.
Don't hesitate to leave your comments below!