A Systematic Approach to Bad Data

Notes from Industry

Anyone who works with data knows how time-consuming (and grey hair-inducing) the initial preprocessing steps can be. We all have chomped at the analytic bit, staring eagerly down the project path to our distant modeling destinations. So what if the data isn’t perfect? How much could a few suspect values or variables really impact findings?

Resist that urge to gloss over raw data.

Trust me. Your employers, shareholders, and future self will thank you. Why?

Simply put, diving headfirst into exploratory analyses or running models without first verifying the quality of the provided data is a major oversight. According to IBM estimates, human error and the poor quality that comes along with it cost the U.S. upwards of $3 trillion a year. Yes, you read that number right. Dirty data is no small matter.

Data cleaning is thankfully often an early staple of data science and analytics-related coursework. While facing the expansive, complex datasets common in today’s "big data" era, novices can find relief in mastering classic data cleaning techniques. Cleaning is an indisputably vital step in ensuring quality data. If you’re unsure as to what these techniques entail or need a refresher on best practices, check out my previous article detailing a basic data cleaning framework here.

But, data novices beware… Sometimes cleaning doesn’t cut it.

I know. Not exactly what you wanted to hear. But, the hard truth of the matter is, some data can’t be cleaned into adequacy.

Data can not only be "dirty", but it can also be architecturally unsound.

Okay, but what does "architecturally unsound" mean?

Data architecture is a fairly broad concept. It describes the systems built to connect data to overarching business strategy, from what data is collected to how it is stored, managed, transformed, and ultimately used.

Unfortunately, there is room for error throughout any given data pipeline. These errors can manifest in various ways, including lapses in dataset engineering and data streaming bugs.

As such, architecturally unsound data will have characteristics such as entire rows or columns of missing data, incorrect forced-choice answer responses, data collected during wrong timeframes or with the wrong sample, or (horror of horrors 😱) data that doesn’t fit original business or project aims.

Okay, but what do I actually do about it?

You can clean the long hours away but, if errors such as these pop up during your wrangling process, you may want to put things on a temporary hold. No matter how many "NA’s" you catch or variable names you relabel, it’s not going to make up for wholly missing or invalid data.

No one wants to collect hard-earned data only to realize it’s unusable. But how can you tell if the data is actually unusable or not? Do you instantly have to scrap every instance of unsound data and rerun collection?

The short answer is no!

With a clear, four-step system, you can mitigate architectural errors, salvage usable data, potentially save your company the financial burden of rerunning data collection measures, AND also ensure the highest quality data. Whew.

This system’s overarching goal is to identify and correct errors at the pipeline source. To do so, simply follow the proposed questions one by one in order until you find the solution that best fits your particular architectural problem.

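The flowchart asks four questions in order: (1) Can you clean or request new data? (2) Do you have enough valid data to leave out the errors? (3) Is there available proxy data? (4) Can another data collection be executed? A "yes" at any step ends the search. If every answer is "no", the final resolution is to modify your project objective.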
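As a rough sketch, the flow of four yes/no questions (each covered as a step later in this article) can be written as a small Python function. This only illustrates the order of the decisions. The function name `resolve_bad_data` and its boolean parameters are hypothetical and are not part of any library.

```python
def resolve_bad_data(can_clean_or_request, enough_valid_data,
                     proxy_available, can_recollect):
    """Walk the four questions in order and return the first resolution that applies."""
    if can_clean_or_request:
        return "Clean (or request) the data, then analyze"
    if enough_valid_data:
        return "Drop the erroneous observations, then analyze"
    if proxy_available:
        return "Substitute proxy data, then analyze"
    if can_recollect:
        return "Fix the pipeline, then rerun data collection"
    return "Modify the project objective"


# The worked example in this article answers "no" to every question.
print(resolve_bad_data(False, False, False, False))
```

Because the checks return early, a "yes" at an earlier step always wins over a later one, which matches the advice not to collect more data unless you have to.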

Pro-tip: Before diving headfirst into the above flowchart, examine your raw dataset. Look for patterns indicative of faulty data. Are there large chunks of data mysteriously missing? Can the measured values answer your project question(s)? Are there significant errors your trusty cleaning packages (like R’s tidyverse) can’t readily address? Don’t collect more data unless you have to.

A Simple, Synthetic Example

In the realm of tutorials, examples speak louder than words.

Say you’re working for a holistic mental health startup that is hoping to pilot a new survey on their app. This survey aims to boost client-app interaction by gaining additional insight on how previous app experience influences clients’ current app activity and, ultimately, allow the product team to better target clients with low interaction time. After a few team meetings, a survey is engineered and made available to a subgroup of 50 clients prior to launching it app-wide.

As part of the analytics team on the project, you are asked to take a look at the resulting survey data. You quickly realize something isn’t right. Note that the dataset used in this example is artificially produced and not necessarily representative of real data trends.

There are a few problems here.

First, of the four client behaviors tracked by the app (meditation habits (_meditationfeature), eating habits (_dietfeature), exercise habits (_exercisefeature), and sleeping habits (_sleepfeature)), all of the meditation values are missing.

Second, every value for the _what_apptype variable is "Educational", which, although possible, is statistically unlikely.

Third, each feature is tracked by the number of times it is visited, not total time spent on the feature. Although usable, this data doesn’t capture the project objective as well as originally hoped.

Let’s now move through the flowchart for dealing with bad data!

Step 1: Can you clean or request new data?

YES: As suggested by the earlier pro tip, don’t request new data unless you have to. Data errors are common and many are fixable. Again, check out my post here on data cleaning for more insight on identifying and correcting fixable types of errors.

If the errors are indeed cleanable, go ahead and do so prior to analysis.

Unfortunately, the current dataset’s errors are too significant to simply clean away. Among the plethora of exploratory tools out there, we can run the count() function from R’s "plyr" package to tabulate frequency distributions for the dataset (df):

As the counts show, the entirety of the _meditationfeature data is missing ("NA") and the _what_apptype data is suspiciously homogeneous. Yes, imputation is a viable cleaning technique in many circumstances, but imputing entire columns of data is a surefire way to exacerbate validity concerns.

The empty _meditationfeature column also suggests that a data streaming glitch may be the root cause of the error, as it is unlikely that all of the data for an established app feature is missing. The homogeneous _what_apptype column also suggests an architectural error, as it is likewise unlikely that all 50 clients have only previously interacted with "Educational" apps.
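For readers working outside R, a similar frequency check can be sketched in Python with the standard library’s `collections.Counter`. The rows below are hypothetical stand-ins that only mimic the two problems described (they are not the actual survey data), and the helper name `audit_column` is an illustrative assumption.

```python
from collections import Counter

# Hypothetical stand-in rows (NOT the real survey data): 50 clients with
# every meditation value missing and every prior app type "Educational".
rows = [
    {"_meditationfeature": None, "_what_apptype": "Educational"}
    for _ in range(50)
]


def audit_column(rows, column):
    """Return value frequencies plus flags for all-missing or constant columns."""
    values = [row[column] for row in rows]
    counts = Counter(values)
    observed = [v for v in values if v is not None]
    return {
        "counts": dict(counts),
        "all_missing": len(observed) == 0,
        "constant": len(set(observed)) == 1,
    }


for column in ("_meditationfeature", "_what_apptype"):
    print(column, audit_column(rows, column))
```

A column flagged as entirely missing or constant is not automatically an architectural error, but it is a cue to look at the pipeline before reaching for imputation.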

NO: Simply requesting new data would not fix the error if the root of the problem is the architecture itself, as is likely. And, as cleaning is also out due to the extent of the errors, we are forced to move on to Step 2.

Step 2: Do you have enough valid data to leave out the errors?

YES: Before jumping to this conclusion, be sure that your sample size is large enough to reliably run your later models after removing the observations with errors. Here’s an external overview on how to do so if you’re unsure.

Also, be sure that removing the invalid data doesn’t cause systematic errors. Here’s a Wikipedia overview on the subject if you’re unsure as to what it entails.

If your dataset is large enough to ignore the architectural errors without significant statistical cost, perform your analyses once the dataset is pared down appropriately.

NO: In our example case, entire columns of the dataset are affected. It therefore isn’t plausible to remove each observation with errors, as we would be left without any data.

Additionally, keep in mind that removing observations has a stronger impact on accuracy in small datasets than large ones. Let’s check the size of our dataset again:

Our particular example dataset is relatively small with only 50 observations, which unfortunately isn’t under the "big data" umbrella by a long shot.

With these caveats in mind, we don’t have enough valid data to leave out the errors. We will move on to Step 3.
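To make the cost of dropping rows concrete, here is a minimal Python sketch of complete-case (listwise) deletion. The rows are hypothetical, with invented visit counts, and only mirror the structure described in the example: 50 clients whose meditation value is always missing. The helper name `complete_cases` is an assumption for illustration.

```python
# Hypothetical rows (NOT the real survey data): visit counts are invented.
rows = [
    {"_meditationfeature": None, "_dietfeature": 3 + i % 4, "_sleepfeature": 1 + i % 3}
    for i in range(50)
]


def complete_cases(rows):
    """Keep only the rows in which no field is missing (listwise deletion)."""
    return [row for row in rows if all(value is not None for value in row.values())]


kept = complete_cases(rows)
print(f"{len(kept)} of {len(rows)} observations survive")  # 0 of 50
```

When an entire column is missing, every row fails the completeness check, so nothing is left to analyze.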

Step 3: Is there available proxy data?

YES: Proxy data is data that may not be directly related to project aims but that can nonetheless be used as a proxy for unmeasured variables. Again, for a closer look at proxy data definitions, check out this Wikipedia page.

In our case here, proxy data could come in the form of historical data (i.e. data gathered by the startup at an earlier timepoint that has proximally related variables). Historical proxy data can also come in the form of publicly available datasets or datasets willingly shared by other companies.

More concretely, our fictitious startup could have data from a previous project also measuring the number of times each feature had been visited at earlier timestamps.

NO: For the sake of the example, let’s suppose, instead, that the "meditation" feature was only recently launched and there isn’t enough client-feature interaction data yet to project accurate use trends for the current project.

As we have no substitute for the meditation feature data, we’ll move along to Step 4 instead.

Step 4: Can another data collection be executed?

YES: Seems a simple enough question, but there are often several factors at play when evaluating the answer.

One such factor is the cost of re-running data collection. Whether it be the expense of time, rearranging company priority items, or the actual monetary investment it will take to pay participants, involved tech services, and/or employees themselves, data collection can take a budgetary toll. Be sure to consult with your team members before moving forward.

Another factor at play is the data pipelining itself. If the errors are indeed architectural in nature, simply re-running the collection will return the same inconsistencies that got you to this step in the first place. To circumvent this, again consult with your team members to identify the root pipeline glitches. Is there a mistake in the front-end design of the survey? Is there a different streaming service that better integrates with your app platform of choice? Was data lost in the merging of multiple datasets? Only once questions like these have been answered and the glitches properly amended should data collection commence again.

Once the architectural errors and the burden of running additional collection measures have been addressed and approved, you may continue with analyses once the new dataset is available.

NO: Again, for the sake of the example, let’s suppose that your startup is new with minimal funds and strict deadlines. After talking with your team members, it is decided that presenting the available data on time is more important than having initially perfect data.

This leads us to the final resolution in our flowchart: Modify your project objective. The team originally hoped to measure, at a micro level, whether previous app interaction positively influences clients’ interaction with current app features (i.e. is behavior influenced by app type, evaluation of the full battery of features, and total time spent on features rather than total times visited). Because our available data can’t support that level of detail, project objectives must be altered to highlight macro evaluations instead.

With this in mind, you mock up a bar chart visualizing these new, macro analyses to put in your presentation.

Not as fine-grained an analysis as hoped, but general trends are still present. You do, however, note that, contrary to natural presumption, clients who have experience with apps are slightly less likely to open a given feature. In your presentation, you are consequently sure to stress that more data concerning total time spent on each feature is likely needed to better parse the trend. You also include recommendations to fully address potential architectural bugs so as to not exacerbate data validity concerns in the future.
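A macro-level summary of this kind can be sketched in Python as a simple group average. Everything here is an assumption for illustration: the records are invented (they are not the example dataset and do not reproduce its trend), and the field names `prior_experience` and `visits` as well as the helper `mean_visits_by_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (NOT the article's data): a yes/no flag for prior app
# experience and the number of visits to one feature. Values are invented.
records = [
    {"prior_experience": True, "visits": 4},
    {"prior_experience": True, "visits": 5},
    {"prior_experience": False, "visits": 6},
    {"prior_experience": False, "visits": 5},
]


def mean_visits_by_group(records, group_key, value_key):
    """Average a numeric field within each level of a grouping field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record[value_key])
    return {group: mean(values) for group, values in grouped.items()}


summary = mean_visits_by_group(records, "prior_experience", "visits")
for group, avg in summary.items():
    # Crude text bar: one "#" per (rounded) average visit.
    print(f"prior experience={group}: {'#' * round(avg)} ({avg})")
```

Averaging visit counts per group answers a coarser question than the one originally posed, which is exactly the trade-off the modified objective accepts.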

In Conclusion

Remember to be honest and realistic about the quality of your data. No matter how quickly we would like to move through data checks and cleaning, bad data will remain bad data without proper scrutiny.

Although the included example is a simple and synthetic one, the flowchart steps remain the same no matter how complex the project at hand is. Incorporating this straightforward approach into your day-to-day data preprocessing can easily transform unease, wasted time and money, and use of poor data into confidence, financial efficiency, and high-quality analyses ✨