by Jason Williscroft
Here’s an exercise: my train pulls into Milwaukee in 15 minutes. Let’s see if I can describe what I do in that time.
If you’re a financial institution (or any other massive consumer of data), you probably get your data from more than one source. And the odds are high that the incoming data is…
- Dirty: important elements are often either missing or just flat-out wrong.
- Twisted: structurally, the incoming data just doesn’t fit your internal data model.
- Cryptic: of the dozen different methods your sources use to represent a country, a currency, or an asset class, none of them matches the way you do it internally.
- Inconsistent: even when two of your sources both provide a given data element, they often disagree on its actual value.
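To make the "cryptic" and "inconsistent" problems concrete, here's a small hypothetical sketch. Every field name and value below is invented for illustration; real vendor feeds are messier still:

```python
# Two invented vendor records describing the same instrument,
# plus an invented internal standard. Note that no field agrees
# across all three representations.
source_a = {"country": "US",  "currency": "USD", "asset_class": "EQ"}
source_b = {"country": "840", "currency": "usd", "asset_class": "Common Stock"}
internal = {"country": "USA", "currency": "USD", "asset_class": "EQUITY"}

# Before values can even be compared, each field needs its own
# mapping table from each source's convention to the internal one.
for field in ("country", "currency", "asset_class"):
    print(field, "->", source_a[field], "|", source_b[field], "|", internal[field])
```

The point of the sketch: none of these disagreements is an error in the usual sense. Each source is internally consistent; they just don't speak your language.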
Clearly data in this state is not up to the task of, say, supporting trading operations. Or a medical diagnosis. So what to do?
The Enterprise Data Management systems we build consume that data and run it through the following stages:
- Loading. Data arrives in a text file, or a spreadsheet, or a message, and winds up in a database table. If the incoming data is irretrievably broken (an empty file, say), we throw exceptions that human beings can parse, prioritize, and resolve.
- Data Typing. The source promised you a date but the value they delivered is actually somebody’s dog’s name. We inspect the incoming data, catch such errors before they can derail the process, and throw more exceptions.
- Alignment. We reconcile all the different naming standards for country, currency, asset class, etc., with the organization's internal standards, so apples can be compared to apples. And throw more exceptions.
- Matching. We identify which rows from a given source (say, the one that describes Apple common stock) line up with corresponding rows from a different source. Then we can inspect data elements across sources and throw exceptions when their disagreements are significant.
- Mastering. The heart of the process. We apply the client’s business logic to cherry-pick data values from the various sources in order to populate their master data model while maximizing data quality… and then throw exceptions when the mastered data still misses the quality mark (or just doesn’t fit).
- Output. At the end of the day, the whole point is to feed better data to the target systems that need to consume it: order management systems, accounting systems, etc. So we transform our mastered data into a form each target system can handle, throw exceptions when that transformation fails, and send the result out the door.
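The stages above can be sketched as a minimal pipeline in which each stage either transforms a record or raises an exception for a human to resolve. Everything here (the stage functions, the `DataException` class, the mapping table, the sample record) is an illustrative assumption, not a real EDM implementation:

```python
# A minimal, invented sketch of the pipeline stages described above.
# A real EDM system is vastly richer; this only shows the shape.
from datetime import date

class DataException(Exception):
    """An exception a human operator can parse, prioritize, and resolve."""

# Alignment table (invented): source country codes -> internal standard.
COUNTRY_MAP = {"US": "USA", "840": "USA"}

def load(raw_rows):
    # Loading: irretrievably broken input (e.g. an empty file) is an exception.
    if not raw_rows:
        raise DataException("Loading: source delivered no rows")
    return raw_rows

def type_check(row):
    # Data Typing: the source promised a date; verify it actually is one.
    try:
        row["settle_date"] = date.fromisoformat(row["settle_date"])
    except (KeyError, ValueError):
        raise DataException(f"Typing: bad or missing date in {row!r}")
    return row

def align(row):
    # Alignment: translate the source's country code to the internal one.
    try:
        row["country"] = COUNTRY_MAP[row["country"]]
    except KeyError:
        raise DataException(f"Alignment: unknown country {row['country']!r}")
    return row

def master(rows_by_source):
    # Mastering (trivialized): cherry-pick values, preferring vendor_a.
    merged = {}
    for source in ("vendor_b", "vendor_a"):  # later sources win
        merged.update(rows_by_source.get(source, {}))
    return merged

# A good row survives every stage; an empty load raises an exception.
rows = load([{"country": "840", "settle_date": "2024-06-28"}])
clean = [align(type_check(row)) for row in rows]
print(clean)
```

The design choice worth noticing is that exceptions are not failures of the pipeline; they are one of its products, queued for the humans responsible for resolving them.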
That–and the user interfaces that expose the process both to the humans responsible for it, and to their internal clients–is the data pipeline. There’s also a reference infrastructure: all the machinery and user interfaces used to manage the alignment of data; the configuration of business logic; and the definition, management, and resolution of exceptions generated by the data pipeline.
And at the end of the pipeline? Clean, consistent, available data, coupled with data quality metrics that unambiguously indicate whether you can trade or diagnose with confidence and, if not, why not, and what it will take to get there.
What I’ve just described is an Enterprise Data Management system, and it is best summarized as a wholesale replacement of the living center of an organization’s system diagram. If all that sounds like the kind of transformative, long-term, high-risk, fishbowl-visibility project that stands to make or break virtually every career it touches… well, you’re right.
End of the line!