The Three Building Blocks of Data Science

During the opening lecture of the first data science class I ever took as an undergraduate, the professor presented the Venn diagram above. I couldn’t find the exact presentation, so I did my best to recreate the diagram while preserving its key points.

Looking back on it, I suspect she hit us with this image early on because she wanted us to remember the foundations of data science in a world increasingly distracted by the shiny and glamorous products of Silicon Valley.

These days, everyone seems fixated on some hot new topic or technology for data science, be it PyTorch, TensorFlow, the newest Tableau update, or Google’s state-of-the-art natural language processing models. What do all of these topics have in common? They are undeniably technical.

Don’t get me wrong. Technical expertise is certainly an essential component of effective data science work, but it is not the only component. By its very nature, data science is an interdisciplinary field. To excel in it, it is important to draw from all of its elemental disciplines.

In this article, I will present the three building blocks of data science – statistics, computer science, and domain expertise – and discuss how each one is important to the field, as well as explore what can go wrong if one or more is neglected.

Statistics

Most of us probably toss the word statistics around to bolster any random argument we make, but can we actually define it? According to Oxford Languages, statistics is "the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample" [1].

In simpler terms, statistics looks at a bunch of numbers and tries to find meaningful patterns in them. It’s generally split into two branches: 1) descriptive statistics, which seeks to summarize patterns in existing data, and 2) inferential statistics, which uses a sample to draw conclusions about the larger population it came from (including predictions about data not yet observed).
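To make the two branches concrete, here is a minimal sketch using only Python's standard library. The commute-time numbers are made up for illustration, and the confidence interval relies on a normal approximation, which is an assumption on my part rather than the only way to do it.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample: commute times (minutes) for 10 surveyed employees.
sample = [22, 35, 28, 41, 30, 26, 33, 38, 29, 31]

# Descriptive statistics: summarize the data we actually have.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential statistics: use the sample to say something about the
# wider population. Here, an approximate 95% confidence interval for the
# population mean, using a normal approximation (an assumption; with so
# few observations a t-distribution would be the more careful choice).
z = NormalDist().inv_cdf(0.975)  # roughly 1.96
standard_error = sample_sd / len(sample) ** 0.5
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"mean = {sample_mean:.1f}, sd = {sample_sd:.1f}")
print(f"approx. 95% CI for population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The first half only describes the ten people we surveyed. The second half makes a hedged claim about everyone we did not survey, which is where the representativeness of the sample starts to matter.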

There are two important facets to note in the original definition:

1. Traditionally, statistics – as an offshoot of mathematics – focuses on strictly numerical data. As we will see, this is not necessarily the case with data science.
2. Accurate statistics is reliant on a representative sample. This relates to point 1, because a blind focus on numbers can actively work against this goal.
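As a rough illustration of the second point, the sketch below compares a biased sample with a random one. The population, the usage hours, and the in-app poll scenario are all invented for the example.

```python
import random
import statistics

# Hypothetical population: 1,000 customers, 800 light users (about 2 hours
# of app use per week) and 200 heavy users (about 20 hours per week).
population = [2] * 800 + [20] * 200
true_mean = statistics.mean(population)  # 5.6 hours

# Biased sample: survey only the heavy users, for instance because they
# are the ones who respond to in-app polls.
biased_sample = population[800:]
biased_mean = statistics.mean(biased_sample)  # 20 hours

# Random sample: every customer is equally likely to be selected.
rng = random.Random(0)  # fixed seed so the run is repeatable
random_sample = rng.sample(population, 200)
random_mean = statistics.mean(random_sample)

print(f"true mean:          {true_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
```

Both samples contain 200 perfectly valid numbers, and the arithmetic is equally correct for each. Only the random sample lands near the true average, because only it resembles the population it is meant to represent.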

Statistics was data science before data science existed. People have been analyzing data for insights for centuries – but formal data science is a relatively new field. Why? Well, in the olden days, data was collected by hand in relatively small amounts, which in turn meant it could also be analyzed by hand. However, with the advent of computers, the amount of data available to us increased exponentially, and statistics alone no longer sufficed to process and study it.

This leads us to the next component of modern data science.

Computer Science

Returning to our good friend Oxford Languages, we can define computer science as "the study of the principles and use of computers."

Hmm. Not super informative. Arguably even a bit misleading.

Many people default to thinking computer science is equivalent to programming or software engineering. In reality, computer science involves a range of different subjects – including but not limited to graphics, theoretical computer science, operating systems, computer architecture, algorithm design, and programming languages.

What connects all of these fields – and thus defines computer science as a whole – is the use of computer programs to perform step-by-step logical operations to solve some problem. In fact, that’s all a "computer" is – a machine which performs a series of logical steps. Computer science involves active manipulation of those steps in pursuit of a specific goal.

The largest advantage of computers is that they can perform computations much, much faster than humans can. This is the primary reason computer science is a principal component of data science.

In theory, data science could exist without computers. Computers don’t provide the mathematical basis for analyzing the data – that’s the role statistics plays. In a world where humans could think and write at the speed of light, statistics might be enough.

But in practice, there is so much data that it would be impossible to collect, study, process, and analyze it all by hand. Enter computer science, the modern-day tool which facilitates big data insights.

At this point, we can see that statistics provides the mathematical foundation for data science, and computer science provides the real-world processing capacity for data science.

However, both fall flat without the third, equally important element.

Domain Expertise

This time, there is no Oxford Languages definition to draw from, because domain expertise isn’t really a single field; it’s more of an overarching term that’s best understood in the following way.

Let’s ask ourselves a question: in talking about data science, where does the data actually come from? Is it a mathematical remainder term of some statistical equation? Or an elusive construct hidden somewhere in the hardware of a computer?

Those might seem like stupid questions, but they become a little less foolish when we consider how so many self-proclaimed data scientists are so obsessed with the numbers and the code that they forget the success of both of those is fundamentally reliant on the data itself.

And where does the data originate? The domain.

Data is worthless without the context for understanding it properly – context which can only be obtained by a domain expert: someone who understands the field where the data stems from and can thus provide the perspectives needed to interpret it correctly.

Let’s consider a toy example to illustrate this. Imagine we collect data from a bunch of different golf games from recent years of the PGA Tour. We obtain all the data, we process and organize it, we analyze it, and we confidently publish our findings, having triple-checked all our formulas and computations.

And then, we become laughingstocks of the media. Why? Well, since none of us has ever actually played golf, we didn’t realize that lower scores correspond to a better performance. As a result, all our analyses were based on the reverse, and therefore incorrect.
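The mistake is easy to reproduce in code. In this small sketch the player names and stroke totals are hypothetical; the point is that both rankings run without error, and only domain knowledge tells us which one is right.

```python
# Hypothetical final scores (total strokes) for four made-up players.
scores = {"Player A": 272, "Player B": 268, "Player C": 280, "Player D": 275}

# Naive assumption: a higher number is better, as in most sports.
naive_ranking = sorted(scores, key=scores.get, reverse=True)

# Domain-aware version: in stroke-play golf, the fewest strokes wins.
correct_ranking = sorted(scores, key=scores.get)

print("naive winner:  ", naive_ranking[0])    # Player C
print("correct winner:", correct_ranking[0])  # Player B
```

Nothing in the data itself flags the naive ranking as wrong. The sort direction is a one-word decision that has to come from someone who knows the sport.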

This is obviously an exaggeration, but it gets the point across. Data only makes sense in context, and so it is essential to consult with a domain expert before attempting to draw any conclusions.

In practice, a failure to account for the domain and incessant reliance on purely quantitative methods can result in (and in many cases, already has resulted in) an unethical, restrictive practice of data science.

Some Final Thoughts

When I started my PhD program in human-computer interaction and human-centered data science six months ago, I shared a curious exchange with one of the professors in my department. For context, he comes from a sociology background.

A few fellow students were meeting him for coffee to discuss intersecting research interests; I randomly ran into them, and he asked me what mine were. I responded that I studied human-centered data science, computer science education, and visualization.

He focused on my first point, and went on to briefly discuss how his research projects revolve around the history and sociology of technology. He described how he examines the basic infrastructures of information technology, and is interested in organizations that support scientific activities by providing data and computational tools (sound familiar?).

He ended with a half-joke, chuckling and saying, "So, I’m somewhat of a human-centered data scientist myself."

At the time, I didn’t think much of it – but over the past six months, I’ve begun to see his point. His role is that of facilitating and advocating for the domain expert so that people can properly understand and interpret their data in context – a job whose importance cannot be overstated.

And on that note, let’s review:

Statistics provides the tools needed to mathematically draw insights from data. Without it, we risk drawing conclusions that aren’t scientifically rigorous.
Computer science provides the tools needed to collect, process, and analyze data at scale. Without it, there is no way we’d be able to comprehend the vast amount of data available to us in the modern world.
Domain expertise provides the tools needed to contextualize and understand data. Without it, we risk drawing inaccurate conclusions by relying on mathematical and computational techniques that overlook intricacies of the data only visible to a field expert.

Combine these three together, and we get data science.

My name is Murtaza, and I am a PhD Student at the University of Washington studying human-computer interaction. I enjoy writing about education, programming, life, and the occasional random musing.