What exactly is data science?

I figured I'd focus my first post on a broad topic, and what better way than to discuss what this blog will revolve around: data science! When I talk to most people about data science, I usually get blank stares. This is understandable because data science is an emerging field and practically everyone has their own definition, so I'd like to begin by sharing mine.

After reading myriad blogs and white papers and chatting with data scientists, I've gathered that data science originated as a buzzword, popularized in the past five years by Drew Conway's Venn diagram. Since then, a multitude of Venn diagrams have cropped up, some more serious than others, but they all agree on one thing: data science combines mathematics, computer programming, data visualization, machine learning and business strategy to solve problems and answer questions within a particular industry such as finance, pharma, energy, healthcare or manufacturing. Perhaps that's why data scientists are referred to as unicorns: finding someone with such broad experience is exceedingly challenging. Over the past few years, however, data science has steadily matured with the emergence of competitive fellowships and online courses. There are also numerous bootcamps, and some universities have even begun offering graduate degrees.

My interpretation of data science is the use of data to derive insights or develop a product that bolsters business value. I know that sounds comically ambiguous, so to put it into context, let's compare data science to traditional data analytics.

Data analytics has existed for decades, popularized by the advent of Excel. Data analysts come in many flavors: marketing analyst, financial analyst, operations analyst, etc. However, most utilize data to summarize what is happening in the company or organization. Data analytics is crucial for providing a descriptive picture of correlations or answering questions such as:

  • How do our costs compare to our competitors'?
  • Which products or clients are our biggest winners/losers?
  • How are our customers purchasing our products (online vs. brick and mortar)?
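To make the descriptive flavor of these questions concrete, here's a minimal sketch of how an analyst might total up revenue by product and by sales channel. The sales records, the product names and the `total_by` helper are all made up for illustration; in practice this would more likely be a pivot table in Excel or a query against a database.

```python
from collections import defaultdict

# Hypothetical sales records: (product, channel, revenue)
sales = [
    ("widget", "online", 120.0),
    ("widget", "store", 80.0),
    ("gadget", "online", 40.0),
    ("gadget", "store", 200.0),
    ("gizmo", "online", 15.0),
]


def total_by(records, index):
    """Sum revenue (the third field) grouped by the field at position `index`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[index]] += record[2]
    return dict(totals)


revenue_by_product = total_by(sales, 0)
revenue_by_channel = total_by(sales, 1)

# Biggest winner and loser by total revenue
best_product = max(revenue_by_product, key=revenue_by_product.get)
worst_product = min(revenue_by_product, key=revenue_by_product.get)

print(revenue_by_channel)
print("winner:", best_product, "| loser:", worst_product)
```

Summaries like these describe what happened; they don't explain why it happened.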

On the other hand, data science takes the insight generated from data analytics one step further to a) figure out why something is happening in the company or organization, b) determine how to utilize this information to improve future decision-making, and/or c) develop a product.

Because data science contains the word science, its workflow typically involves applying the scientific method to reach the above goals. Data scientists are a type of researcher (here's where they really begin to differ from traditional analysts): they come up with their own questions to investigate, generate hypotheses, and gather the data necessary to test those hypotheses. Since real-world datasets are fragmented and notoriously messy, data scientists spend the majority of their time cleaning and organizing the data (referred to as data wrangling) prior to performing an analysis. The final component of the workflow is telling a story and providing a "productionizable" recommendation; data scientists therefore need to master building articulate presentations or visualizations to communicate their findings.
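For a feel of what data wrangling looks like in practice, here's a small sketch using only the Python standard library. The records, field names and the `clean_row` and `wrangle` helpers are hypothetical; real projects usually lean on a library such as pandas and deal with far messier inputs.

```python
# Hypothetical raw records, as they might arrive from two different systems.
raw_rows = [
    {"customer": " Alice ", "channel": "Online", "amount": "120.50"},
    {"customer": "alice", "channel": "online", "amount": "120.50"},  # duplicate
    {"customer": "Bob", "channel": "STORE", "amount": "n/a"},        # missing amount
    {"customer": "Carol", "channel": "store ", "amount": "$75"},
]


def clean_row(row):
    """Normalize one record; return None if the amount cannot be parsed."""
    try:
        amount = float(row["amount"].replace("$", "").strip())
    except ValueError:
        return None
    return {
        "customer": row["customer"].strip().title(),
        "channel": row["channel"].strip().lower(),
        "amount": amount,
    }


def wrangle(rows):
    """Clean every row, then drop unparseable rows and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        fixed = clean_row(row)
        if fixed is None:
            continue
        key = (fixed["customer"], fixed["channel"], fixed["amount"])
        if key not in seen:
            seen.add(key)
            cleaned.append(fixed)
    return cleaned


tidy = wrangle(raw_rows)
for row in tidy:
    print(row)
```

Dropping the row with the missing amount is just one choice made for this sketch; depending on the question, filling in or flagging missing values may be the better call.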

One example of the type of analysis a data scientist might perform is A/B testing to unearth the driving force behind trends or correlations. The bulk of the analyses, however, use an algorithmic approach called machine learning, which allows computers to autonomously analyze and make predictions from data. Larger datasets generally enable more accurate predictions (hence the commotion over big data). Machine learning is a means to achieve AI and, according to Arthur Samuel, "gives computers the ability to learn without being explicitly programmed". In a nutshell, rather than giving a machine a list of instructions to follow (aka a computer program), machine learning trains it on examples, much like we teach our kids or pets. Data scientists apply machine learning to build AI that detects fraud, understands how a customer segment "feels" about a product, diagnoses diseases, predicts equipment failure, forecasts stock prices, determines a pricing strategy, accelerates drug discovery, etc.
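The difference between programming a rule and learning one can be sketched in a few lines of Python. This is a toy illustration, not how a real fraud detector is built: the transaction amounts and labels are invented, and the "learning" is about the simplest kind there is (trying each candidate cutoff and keeping the one that makes the fewest mistakes on the labeled examples).

```python
# Toy illustration: a hand-written rule versus a rule learned from examples.
# All data and names here are made up for illustration.


def handwritten_rule(amount):
    """Explicit programming: a person picks the cutoff."""
    return amount > 1000  # flag as suspicious


def learn_threshold(amounts, labels):
    """Learning: pick the cutoff that makes the fewest mistakes
    on labeled examples (a one-feature decision stump)."""
    best_cutoff, best_errors = None, len(amounts) + 1
    for cutoff in sorted(set(amounts)):
        # Count examples where "amount >= cutoff" disagrees with the label.
        errors = sum((a >= cutoff) != y for a, y in zip(amounts, labels))
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


# Hypothetical labeled history: transaction amount, and whether it was fraud.
amounts = [20, 35, 50, 80, 120, 400, 450, 500, 900]
labels = [False, False, False, False, False, True, True, True, True]

cutoff = learn_threshold(amounts, labels)


def learned_rule(amount):
    """The rule the data produced, rather than one a person wrote down."""
    return amount >= cutoff


print("learned cutoff:", cutoff)
print("hand-written rule flags 450?", handwritten_rule(450))
print("learned rule flags 450?", learned_rule(450))
```

Nobody told the program where to draw the line; it found a cutoff from the examples. Give it different examples and it will draw a different line, which is the sense in which the machine "learns".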

Certain companies also use data science to develop data-derived products. For example, nowadays customer service systems are being supplemented with chatbots. Marketing teams use AI to understand and adapt to dynamic customer behaviors and preferences. Your email spam filter and phone's navigation app are built with machine learning. Netflix, Yelp and Pandora incorporate AI to provide you with personalized recommendations. And while machines have been able to "see" and "hear" for decades, data science is empowering them to "adapt", "decide" and "understand" as well, thus enabling lifelike robots, digital assistants such as Siri, and the quest for driverless cars.

While I've given the impression data science can be easily defined, many companies use the terms data science, data analytics, predictive analytics and business intelligence interchangeably. Moreover, some folks will work on all three aspects of data science I've mentioned above, while others will have more focused roles. The takeaway is that, currently, data science is a sort of catch-all term for an exciting but ever-evolving field. In fact, I don't even think it's a new field of science per se; rather, I agree with Jim Gray: data science is the latest evolution of scientific thinking, following the empirical, theoretical and computational approaches. Since this is the first blog post, let's stop here for now. I could go on about why data science couldn't have existed at any point in history until now or dive deeper into machine learning, but I'll leave that for another day.

Don't hesitate to leave your comments below!
