
Data Scientists and Data Janitors

by Jason Williscroft

Data Science is sexy.

As opening statements go, this one has an impeccable pedigree. In October of 2012, the Harvard Business Review published an article entitled “Data Scientist: The Sexiest Job of the 21st Century”.

Depending on whom you ask, the term Data Science has been around since 1960, or 1974, or 2008 (according to the authors of the aforementioned HBR article, who claim to have coined it). But no matter how you slice it, everybody seems to agree that Data Science is how Big Questions get answered in The Future.

One of the many promises of Data Science is that it can answer those Big Questions without making anybody else work very hard. In The Past—say, last week—the only way to answer Big Questions was to spend lots of time and money marshaling and cleansing data so that regular, un-sexy, lower-case, non-Data scientists could apply really obscure mathematics in order to arrive at useful conclusions. But in The Future—like, next week—freshly-minted upper-case Data Scientists with Ivy-League credentials and fascinating facial hair will spend lots of time and money tending glistening machines running advanced algorithms applying really obscure mathematics in order to arrive at useful conclusions.

These algorithms, it is asserted, will be able to produce high-quality conclusions without any of the old-fashioned mess and bother of marshaling and cleansing data. Like a Stirling engine that can run on gasoline or diesel fuel or camel dung, Data Science can consume whatever dirty old data you have lying around the yard and produce conclusions of boundless precision and utility.

This assertion, as it happens, is camel dung.

In an August 2014 article that has recently found new life on LinkedIn, the New York Times technology editorial staff stumbled over something of a bombshell: apparently, Garbage In-Garbage Out is still a thing. It turns out that—just like regular scientists—Data Scientists really are capable of producing stunningly correct conclusions from high-quality data. But—again, just like regular scientists—the quality of their conclusions suffers when the quality of their source data declines.

To be fair, Data Scientists and their algorithms are capable of producing low-quality conclusions from garbage data with far greater efficiency than their lower-case forebears. But as advantages go, that one is unlikely to win any prizes.

The NYT article quotes a number of Data Scientists, who all voice versions of the same complaint: in the real world, data has to be cleansed, aligned, and prepared before it can be mined for useful nuggets of intelligence. And that isn’t easy.

One of many money quotes: “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

Or this: “Data wrangling is a huge—and surprisingly so—part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

Or this: “Practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that brought you to the field in the first place.”

Enter Data Management.

Simply put, Data Management is the art and practice of being a data janitor. Raw data moves through multiple stages where it is typed, cleansed, aligned, matched, and mastered, all with the express purpose of feeding the high-quality result to downstream systems (and Data Scientists) that do better work with high-quality data.
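For the janitorially inclined, here is a minimal Python sketch of those five stages. Everything in it is an assumption made for illustration: the sample records, the field names, and the rules themselves (drop records with unparseable revenue, match on a normalized name, let the most recent record win). A real Data Management pipeline would be far more elaborate, but the shape is the same.

```python
from datetime import datetime

# Hypothetical raw records from two source systems; all field names are illustrative.
RAW = [
    {"source": "crm", "name": " Acme Corp ", "revenue": "1,200.50", "as_of": "2014-08-01"},
    {"source": "erp", "name": "ACME CORP", "revenue": "1200.5", "as_of": "2014-08-15"},
    {"source": "crm", "name": "Globex", "revenue": "n/a", "as_of": "2014-08-03"},
]


def type_record(rec):
    """Typing: convert raw strings into real numbers and dates."""
    try:
        revenue = float(rec["revenue"].replace(",", ""))
    except ValueError:
        revenue = None  # unparseable values are flagged, not guessed at
    as_of = datetime.strptime(rec["as_of"], "%Y-%m-%d").date()
    return {**rec, "revenue": revenue, "as_of": as_of}


def cleanse(records):
    """Cleansing: drop records whose revenue could not be parsed."""
    return [r for r in records if r["revenue"] is not None]


def align(rec):
    """Aligning: normalize names so every source shares one matching key."""
    return {**rec, "key": " ".join(rec["name"].split()).lower()}


def match(records):
    """Matching: group records that refer to the same entity."""
    groups = {}
    for r in records:
        groups.setdefault(r["key"], []).append(r)
    return groups


def master(groups):
    """Mastering: keep the most recent record per entity as the golden record."""
    return {key: max(recs, key=lambda r: r["as_of"]) for key, recs in groups.items()}


def run_pipeline(raw):
    typed = [type_record(r) for r in raw]
    aligned = [align(r) for r in cleanse(typed)]
    return master(match(aligned))


golden = run_pipeline(RAW)
```

In this sketch the two Acme records collapse into a single golden record taken from the more recent source, and the Globex record never reaches the downstream consumer because its revenue is garbage. Whether to drop, repair, or quarantine such a record is exactly the kind of small question Data Management exists to answer.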

While Data Science provides broad analytical vision around one big question, Data Management provides narrowly-focused precision around a hundred thousand small ones. You can do Data Management without doing Data Science—consider your monthly bank statement—but doing Data Science without doing Data Management is a pointless waste of time, talent, and treasure.

Many institutions are responding to the seduction of Big Data by diverting resources from Data Management projects to data lakes, Hadoop clusters, and other sexy Big Data toys. Can these implementations deliver value? Absolutely! But our concern is that the value Big Data delivers is significantly constrained by the quality of the data Big Data consumes… and that data quality depends critically on robust Data Management.

Within this context, diverting funds from Data Management to buy more Big Data is as self-defeating as cutting a race car’s curb weight by removing the engine. It just doesn’t work.

The moral of the story: high-quality data is worth paying for. With robust Data Management in place, a competent Data Scientist armed with a free copy of R can generate useful insights all day long. Without it, a whole platoon of Data Scientists will accomplish little more than drowning in their own (very expensive) data lake.

P.S. HotQuant does Data Management. Click here to learn more.
