Solid EDM: Why the Raw Table Matters

by Adam Roderick

Data in a data management system originates from many different sources. These sources live in a heterogenous environment, so the data can arrive in many different ways.
The raw table sits at the forefront of the data ingestion process. It is the starting point that pulls the data into the EDM system, moving it from external to internal. The data type of columns in a raw table is the most permissive allowed by the system. For example, VARCHAR(MAX) in SQL Server.
The practice of a raw table is simple to understand and simple to implement, but it is easy to overlook. Having a raw table as part of the load pattern in an EDM system is important.
Bad Data
First, if your data platform is one that cares about data types—and most do since they optimize for performance and storage—you always have the risk of bad data coming in. Integer fields containing text, date values below the minimum recognized date, text values that are too long—you get the idea. No matter what the source system owners might promise, bad data is a reality and will happen. So you need to plan for it. A raw table allows the data to transition from external to your system to internal without the risk of data typing errors. Once the data is in the raw table, you can execute a controlled process to type the data.
A Consistent Onramp Pattern
Data feeds can come from all types of different systems, in different formats, and across different transport mechanisms. With all this variety, the implementations of how to ingest data can also vary widely, even on teams that say they have coding standards and patterns in place. Ensuring every data load starts with a raw table means every data load has a consistent starting point. After that, implementing common patterns and reusable components becomes much more easy.
Define the Interface

Some of the most time consuming work when integrating with an external system is analyzing and defining the interface. If the raw table acts as the interface between the source and EDM systems, then the two are very decoupled from each other. Both teams can continue to modify and improve the internals of their systems—the only rule is to maintain compatibility with the interface. That gives both sides of the raw table all the flexibility they need to stay agile.

Previous Post The Data Pipeline Requirements Model
Next Post Data Governance and the Enemy of the Good