by Jason Williscroft
Data management projects are complex. In a previous article we made the case for explicitly capturing and decomposing High-Level Requirements (HLRs) into Low-Level Requirements (LLRs) in a manner that facilitates the tracking of individual LLRs through the analysis and development process and into production code. A more thorough treatment of this idea, along with the nuts and bolts of how to go about actually doing it, is the topic of this article: the Data Pipeline Requirements Model (DPRM).
This article will present the DPRM from two directions. First we shall discuss what the DPRM is, by describing its components and how they interact to describe a data pipeline. This will be a discussion in principle: we shall describe the model without making reference to any particular implementation of it. In other words, we will be discussing the DPRM at a requirements level. Next we shall discuss how one might actually implement the DPRM, in order to manage the actual requirements of an actual data pipeline in a real-life data management project.
Some constraints are in order. In this article we will not discuss requirements related to the user interface, nor will we discuss non-functional requirements around performance, etc. In this piece we are narrowly focused on the requirements that define the Data Pipeline: Data Elements and the series of transformations that carry them across various Stages from a set of source artifacts, generally through one or more matching and mastering processes, to the target systems that ultimately consume the data. This is where most of the complexity lives in a modern data management system.
We will assume a project managed according to Agile methodologies. The particular flavor of Agile (Scrum, Kanban, etc.) is not relevant. Moreover, we will assume that the Agile process is driven by Business Requirements (i.e. User Stories), and that with respect to the Data Pipeline the target of the Agile process—the set of artifacts under management—includes not only the code base but also the Data Pipeline Requirements (DPRs) that derive from those Business Requirements and are ultimately realized in code.
This is a critical point: DPRs and code together form a connected structure, where each lower-level object is declaratively traceable to its higher-level parents, and whose various levels evolve in tandem as time progresses. It is pointless to write data transformation code that does not realize an LLR, just as it is pointless to write down an LLR that does not aggregate through some HLR into some business requirement. At any point, then, one should expect the HLR portion of the structure to cover more ground than the related LLRs, and the LLRs to be more fully baked than the actual code.
Wait, that’s just Waterfall dressed up to look like Agile! Not so. In a Waterfall project, most requirements at most levels would be specified before writing the first line of code. Not the case here: DPRs and code are all under continuous, tandem development. From a procedural point of view, there is really only one tweak to the standard picture of Agile:
Any new or changed requirement should be written down and rationalized against the existing requirements & code base before being decomposed to lower-level requirements or realized in code.
In carpentry terms: measure twice, cut once, and write all your notes down in one place. If this sounds controversial, it shouldn’t.
Understanding the DPRM
Managing complexity requires a consistent approach. Within the context of Data Pipeline Requirements, we will accomplish this by creating a modeling framework that is expressive enough to account for any DPR we might care to articulate. For the sake of clarity, we will present a stateless model, which expresses necessary objects and their relationships without supporting any return to or review of previous states. In other words, this simplified DPRM has no history and no undo button.
There are two fundamentally different kinds of DPR: those that describe data elements in their various stages of existence—Data Definition Requirements, or DDRs—and those that describe the transformations data elements undergo in transit from one stage to the next: Data Transformation Requirements (DTRs).
Data Definition Requirements
A Data Element is an atomic unit of data that has a precise meaning. A given Data Element may exist in several distinct Stages. For example, a security identifier might exist at various successive Stages in a data pipeline as a column of text in a CSV file, as an NVARCHAR(MAX)-type column in a database table, and an INT-type column in a different database table. So while a Data Element definition will tell you what its data means, its various Stages will tell you where its data is and what format its data has. Data Elements may appear anywhere in the pipeline, persist for as long as they are needed, and then disappear.
The following diagram illustrates the concepts of Data Elements, Stages, and the Data Element Stages where they intersect:
Data Elements & Stages
Some points worth noting about DDR objects:
- A Data Element Stage is the lowest level of data definition, and corresponds to a column in a file, a database table, or some other tabular array of data. A Data Element Stage object should carry sufficient attributes to support analysis and automation as described below.
- A Data Element may be thought of as a business-level data definition, and is a collection of Data Element Stages. It has a consistent meaning at every Stage, but as it propagates through the data pipeline its related Data Element Stage attributes change: column name, data type, etc. Data Element objects should carry attributes that allow them to be partitioned and analyzed in useful ways (i.e. by system, entity, or pipeline phase).
- A Stage is also a collection of Data Element Stages, but along a different axis. Depending on the context, this might correspond to a file, a database table, or any other tabular array of data. In combination with related Data Element Stages, useful Stage attributes would allow a developer (or an automated tool) to generate file format descriptions or SQL table definitions.
All three of these objects can be realized directly in code, generally in combination. A SQL table definition, for example, will require attributes from a Stage and related Data Element Stages. As requirements these objects are atomic and cannot be decomposed, so there is no need to distinguish between HLRs and LLRs where Data Definition Requirements are concerned.
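To make the three DDR objects concrete, here is a minimal sketch in Python, including the kind of SQL table generation just described. All class shapes, attribute names, and stage names are illustrative, not part of any DPRM specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElement:
    """Business-level definition: what the data means."""
    name: str
    entity: str  # a partitioning attribute, e.g. the business entity

@dataclass(frozen=True)
class Stage:
    """A tabular array of data: a file, a database table, etc."""
    name: str
    kind: str  # e.g. "csv" or "table"

@dataclass(frozen=True)
class DataElementStage:
    """Where a Data Element's data is, and what format it has, at one Stage."""
    element: DataElement
    stage: Stage
    column_name: str
    data_type: str

def table_ddl(stage: Stage, element_stages: list) -> str:
    """Generate a SQL table definition from a Stage plus its related
    Data Element Stages."""
    cols = ",\n  ".join(f"{d.column_name} {d.data_type}"
                        for d in element_stages if d.stage == stage)
    return f"CREATE TABLE {stage.name} (\n  {cols}\n);"

# One Data Element (a security identifier) at two successive Stages.
sec_id = DataElement("Security Identifier", "Security")
raw = Stage("RawSecurity", "table")
typed = Stage("TypedSecurity", "table")
element_stages = [
    DataElementStage(sec_id, raw, "security_id", "NVARCHAR(MAX)"),
    DataElementStage(sec_id, typed, "security_id", "INT"),
]
print(table_ddl(raw, element_stages))
```

Note that the same Data Element carries its meaning unchanged across both Stages; only the Data Element Stage attributes (here, the data type) vary.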
In the DPRM, Data Definition Requirements do not decompose, whereas Data Transformation Requirements do.
Also, each DDR object must carry a Status attribute that reflects that object’s current state within the Development Workflow (see below).
Data Transformation Requirements
The critical activity that distinguishes data pipelines from other software machines is the Data Transformation. This is the mechanism whereby data propagates down the data pipeline. In most cases, Data Transformations into a given Stage occur in parallel, as every Data Element at that Stage will generally be processed by the same SQL query or other programming construct (let’s call this a Component).
When business requirements are first articulated, system architecture is typically up in the air and it is not yet possible to think in terms of Components. Consequently—to borrow an example from a previous article—an early vision of a data pipeline will look something like this (the geometric figures with their various fills, borders, and orientations are meant to abstract away data attributes like type and value):
Data Pipeline: High-Level View
We can make two observations about the high-level view illustrated above:
- The data pipeline’s Data Definition Requirements—the Source and Target Stages—are fully articulated. Intermediate Stages may be required in order to realize the pipeline’s business logic, but the Data Element Stages that aggregate into the Source and Target Stages are atomic. This is consistent with our assertion above that DDRs do not decompose.
- There is a great deal of complexity hidden inside the bubble labeled Business Logic. Given that this logic writes to three distinct targets—which might not even be all of the same kind—it is a virtual certainty that the code that realizes this business logic will have to exist in multiple Components that may execute in different contexts and with independent timing. In order for this kind of representation to be in any way useful, it must be more granular.
Within the DPRM, a Data Transformation is an object that…
- Takes one or more Data Element Stages as arguments. If an argument must be populated with a constant, that constant should still be represented as a Data Element Stage. This provides the basis for driving pipeline operations with configuration values stored as reference data rather than with literal values stored (perhaps redundantly) in code.
- Writes its result to one and only one Data Element Stage.
- Articulates its logic purely in terms of its arguments, without regard to any intermediate data Stages. Accomplishing this will require higher-level DTR logic to be written in pseudocode or even plain language, while lower-level DTR logic can be written in something approaching formal code.
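The three properties above might be sketched as follows; the class shape and the Data Element Stage keys are hypothetical, chosen only to illustrate the contrast between HLR-level and LLR-level logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataTransformation:
    """A DTR: one or more Data Element Stages in, exactly one out."""
    arguments: tuple  # input Data Element Stage keys
    output: str       # the single output Data Element Stage key
    logic: str        # pseudocode (HLR) or near-formal code (LLR)

# An HLR expresses its logic in plain language or pseudocode...
hlr = DataTransformation(
    arguments=("Source1.SecurityId", "Source2.SecurityId"),
    output="TargetA.SecurityId",
    logic="Match identifiers across sources and emit the mastered value.",
)

# ...while an LLR approaches formal code.
llr = DataTransformation(
    arguments=("Aligned1.SecurityId",),
    output="Typed1.SecurityId",
    logic="CAST(security_id AS INT)",
)
```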
HLRs and LLRs are both examples of DTRs. The distinction is in the level of abstraction at which they are expressed: as the name implies, HLRs are expressed at a higher level than are LLRs.
Writing DTRs for the data pipeline above at the highest possible level of abstraction would require the articulation of one DTR for each Data Element appearing in each Target Stage. If we confined our attention to Target A, we might compose the following four HLRs:
Target A HLRs
Data Pipeline: Low-Level View
Recall that, by definition, a Data Transformation has one and only one Data Element Stage at its output. Each of the logic bubbles in the picture above has several Data Element Stages at its output, all of them aggregating to the same Stage. It should follow, then, that each of these logic bubbles comprises as many Data Transformations as there are target Data Element Stages, all of which are realized within the same Component.
This picture should give us some guidance in decomposing the Target A HLRs illustrated above into LLRs. Some points to consider:
- A given HLR will break down into a series—or, more likely, a tree—of coupled LLRs.
- The Data Element Stages on each end of this structure should match those on either end of their common parent HLR. In other words, the decomposition of an HLR should have the same arguments and target as its parent.
- Between each pair of coupled LLRs will lie Stages that hold intermediate values produced by each lower-level Data Transformation.
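The second point above can be checked mechanically: the external arguments and final target of a tree of coupled LLRs should reduce to those of the parent HLR. A sketch, using hypothetical Data Element Stage keys:

```python
def external_endpoints(llrs):
    """Given coupled LLRs as (arguments, output) pairs, return the tree's
    external arguments and final targets. Intermediate Data Element Stages,
    produced by one LLR and consumed by another, cancel out."""
    outputs = {out for _, out in llrs}
    consumed = {a for args, _ in llrs for a in args}
    external_args = consumed - outputs
    targets = outputs - consumed
    return external_args, targets

# A two-step decomposition: source -> intermediate -> target.
llrs = [
    ({"Source1.Date"}, "Intermediate.Date"),
    ({"Intermediate.Date"}, "TargetA.Date"),
]
args, targets = external_endpoints(llrs)
assert args == {"Source1.Date"}      # should equal the parent HLR's arguments
assert targets == {"TargetA.Date"}   # should equal the parent HLR's target
```

When the check fails, as in the discovery described below, the correct response is usually to revise the parent HLR rather than the children.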
If we re-examine the low-level view above, focusing purely on the parts relevant to HLR DTR-0001 illustrated above, we have:
Data Pipeline: Low-Level View focused on DTR-0001
Note an unexpected result: in our analysis, we have discovered that an additional data element (the red square) participates in the Source 1 Alignment Logic to produce the red triangle at the Aligned 1 Stage. Consequently, the branched path that produces this red square must be included in the decomposition of DTR-0001, even though it will also be included in the decomposition of DTR-0002. In a more realistic scenario, we might have discovered a similar effect later in the pipeline at the Matching or Mastering stage, producing significantly more branching, but for the purpose of illustration we shall keep this scenario simple. More on the effects of branching in a few paragraphs.
Based on the low-level view above, here is an LLR decomposition of DTR-0001, the HLR that produces the red triangle at Target A:
Note that the LLRs above are presented in a generally target-first order. This is not strictly necessary, but does align with the most efficient process of analysis, which propagates backwards from target to source.
Because of the discovery described above—that the red square participates in the logic that produces the red triangle—the first DPRs in processing order (DTR-0011, DTR-0012, and DTR-0013), which sit on the “left edge” of their common HLR parent DTR-0001, do not comport with the guideline above that they should collectively have the same arguments as their parent HLR. This is a common and extremely useful consequence of the decomposition process: we have discovered that there are actually more arguments to the HLR than we first realized!
At this point the correct path is to revise the original HLR argument list and internal transformation logic to account for the newly discovered argument, thus bringing the parent HLR into sync with its child LLRs and resolving a previously unknown (and possibly expensive) misalignment between business and technical requirements. Here is the revised DTR-0001:
Each LLR is coupled to its parent HLR via a Trace object. We have illustrated above that, with respect to Target A, DTR-0011 actually has two parents: DTR-0001, which we have broken down above, and DTR-0002, which drives a square into Target A. Here, then, is another important DPRM concept:
An LLR may have a Trace to more than one parent HLR.
To a developer this will be instantly recognizable as an example of code reuse, where each of several parents is leveraging the same child transformation. While highly desirable as a mechanism for reducing code complexity and enhancing maintainability, code reuse can have the effect of significantly increasing the blast area of changes to the reused code. As we will see below in our discussion of Development Workflow, the explicit tracing of every LLR to all of its parent HLRs makes it very easy to identify this blast area and propagate the effects of changes accordingly.
There is a second example of code reuse within the DPRM. Until now we have discussed Data Transformations as if each one were unique. Consider a data-typing transformation which converts a text representation of a date into a DATE-typed value. A mature data pipeline may feature dozens of such transformations, each identical but for the particular Data Element Stages connected to their inputs and outputs.
Within the DPRM, the arguments (including types and default values), output data type, and transformation logic of a given Data Transformation are defined in a Function object. The Data Transformation is then applied via a Transform object, which references the Function and connects its arguments and output to specific Data Element Stages. In addition to providing for efficient code re-use, this scheme also makes it possible to map the blast area of data transformation logic changes and know which areas of the pipeline and requirements structure must be analyzed and possibly refactored to account for those changes.
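In the same illustrative style as before (all names hypothetical), a Function and the Transforms that reuse it might look like this, together with the blast-area query just described:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Function:
    """Reusable transformation logic: argument types, output type, logic."""
    name: str
    arg_types: tuple
    output_type: str
    logic: str

@dataclass(frozen=True)
class Transform:
    """Applies a Function by binding its arguments and output to
    specific Data Element Stages."""
    function: Function
    arg_bindings: tuple   # input Data Element Stage keys
    output_binding: str   # the single output Data Element Stage key

# One Function, reused by two Transforms.
text_to_date = Function("TextToDate", ("NVARCHAR(MAX)",), "DATE",
                        "TRY_CONVERT(DATE, @arg1)")
t1 = Transform(text_to_date, ("Raw.TradeDate",), "Typed.TradeDate")
t2 = Transform(text_to_date, ("Raw.SettleDate",), "Typed.SettleDate")

def blast_area(function, transforms):
    """Every Data Element Stage touched by a Transform that references the
    Function: the places to re-analyze if the Function's logic changes."""
    return {binding
            for t in transforms if t.function == function
            for binding in (*t.arg_bindings, t.output_binding)}

print(sorted(blast_area(text_to_date, [t1, t2])))
```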
Development Workflow

Agile imposes a simple workflow on a development team: a user story—which is effectively a business requirement—is either in a Backlog, in Development, or Done. Once an item is Done it can be moved to Production, but not before. The collection of workflow states and the constraints around moving between them can be expressed as a directed graph, like this:
Simple Agile Development Workflow
Different Agile projects might feature slightly different workflows, but for our purposes the key element of the example above is that the only significant in-sprint activity is Development: the conversion of user stories into code.
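A workflow graph like this one is trivially captured as an adjacency map. The transitions below are our reading of the simple example above; your project's graph may differ:

```python
# Workflow states and allowed transitions as a directed graph.
WORKFLOW = {
    "Backlog": {"Development"},
    "Development": {"Done", "Backlog"},
    "Done": {"Production"},
    "Production": set(),
}

def can_move(state: str, target: str) -> bool:
    """An item may move only along an edge of the workflow graph."""
    return target in WORKFLOW.get(state, set())

assert can_move("Done", "Production")
assert not can_move("Backlog", "Production")  # Done first, not before
```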
Within this context, articulating a sprint goal is simply a matter of identifying a user story and specifying a target state. User stories in the backlog (i.e. business requirements) are treated as found objects, which are only subject to change outside the development process (i.e. in backlog grooming sessions and related activities). Consequently there are only two possible end states for a given work item during a sprint: Done, or back into the Backlog for further analysis and future development.
The DPRM imposes an additional activity within a sprint: Analysis. This is the act of decomposing higher-level requirements into lower-level ones and rationalizing the impact of change with the existing requirements base. By contrast with the end-to-end Development activity described above, within the DPRM Coding is a singularly focused activity: it is the conversion of fully-baked DDRs and LLRs into code. Since both are in-sprint activities, a sprint goal might involve the analysis and decomposition of a business requirement or HLR into DDRs and LLRs, the coding of completed DDRs and LLRs, or both.
A DPRM development workflow looks something like this:
DPRM Development Workflow
Again, different projects may define different workflows. For our purposes, the key element added by the DPRM is the decomposition of Development into Analysis and Coding as distinct in-sprint activities.
Rationalizing the Impact of Change

Several times now we have mentioned rationalizing the impact of change. We have identified it above as part of the Analysis activity, but we have not yet described what we actually mean by it.
Every software developer is familiar with refactoring. This is the act of changing existing code, generally in order to accommodate other changes made elsewhere. It is inevitable that parts of a software solution will be coupled to some degree with other parts, so that changes in one place create an impact elsewhere that must be accounted for. It is within this blast area that refactoring must be done, if at all. Prior to refactoring, then, must come the identification of the blast area. This is a critical activity: identify too large a blast area and developers engage in unnecessary, time-wasting analysis; too small and developers will miss critical refactors, resulting in defects and code instability. In the absence of appropriate tooling, a developer must identify his blast area intuitively by relying on his general familiarity with the code base and his understanding of what connects to what.
Clearly, at some level of system complexity no developer will be able to analyze effectively the blast area of a given change. In our experience this is the single most common source of data pipeline defects, particularly in mature implementations where component coupling is pervasively broad and deep. Furthermore, the inability to conduct reliable blast area analysis results in endless defect cycles, where the failure to propagate the effects of a defect fix across its entire blast area creates new defects, and so on ad infinitum.
Recall that, within the DPRM, the artifacts under Agile management include not only project code but also its associated requirements. It should follow, then, that requirements as well as code are subject to refactoring.
Defining the DPRM refactoring process is beyond the scope of this article. We can, however, present the shape of the refactoring process. It looks like this:
1. Mark all DPRs that are new or have changed.
2. Follow lines of relationship from marked DPRs—via Trace objects, DTR-DDR relationships, and shared Functions—to identify and mark additional DPRs that require analysis in order to determine whether these changes will propagate to them. This process can be automated, based on the relationships already defined.
3. Evaluate every such marked DPR and either propagate the required change and mark the DPR as having changed, mark it for future deletion, or conclude that no change is required and remove the mark.
4. Return to step 2 and recurse.
5. Once all changes have been propagated, delete all DPRs marked for deletion.
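The relationship-following part of this process lends itself to automation. Here is a sketch of the mark-and-propagate shape, with the human Analysis decision stubbed out as a predicate; all DPR ids and relationships are invented for illustration:

```python
from collections import deque

def propagate_marks(changed, neighbors, needs_change):
    """Mark-and-propagate sketch of the refactoring shape above.

    changed:      initial set of new/changed DPR ids
    neighbors:    adjacency map following Trace objects, DTR-DDR
                  relationships, and shared Functions
    needs_change: stands in for the human Analysis decision; returns
                  True if the change propagates to the given DPR
    Returns every DPR that ends up marked as changed.
    """
    marked = set(changed)
    queue = deque(changed)
    while queue:                       # follow relationships and recurse
        dpr = queue.popleft()
        for nbr in neighbors.get(dpr, ()):
            if nbr not in marked and needs_change(nbr):
                marked.add(nbr)        # change propagates; keep following
                queue.append(nbr)
    return marked

# Hypothetical relationships: DTR-0001 traces to two children, one of
# which is reused and so carries the change further.
rels = {
    "DTR-0001": ["DTR-0011", "DTR-0012"],
    "DTR-0011": ["DTR-0002"],
}
affected = propagate_marks({"DTR-0001"}, rels, lambda d: d != "DTR-0012")
assert affected == {"DTR-0001", "DTR-0011", "DTR-0002"}
```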
In practice, given a set of new or changed requirements, this process will result in a number of other DPRs—often located some distance from the original changes—moving backward in the progression of states, from Done to Coding or from Coding to Analysis. Once the work has been done to move these back to a Done state, the team will by definition have propagated the effects of the original change to all affected parts of the system. The whole is brought back into a stable and internally consistent state, with a minimum of effort expended on unnecessary analysis and none at all on correcting defects caused by unpropagated change.
Implementing the DPRM
As of this writing, there exists no commercially-available software tool that implements the DPRM in its entirety. (HotQuant is working on a tool, called hqSpec, that will do precisely that, but at this point we are at least a year from commercial release.) Meanwhile, HotQuant continues to build data pipelines for our clients by applying some form of rational requirements management. How do we do it?
A Simplified Data Model
For most modern software development teams, Agile methodologies are the order of the day. Scrum teams must track user stories, epics, sprints, goals, and a number of other objects related to the process. Many of these teams use software tools like Jira to organize their efforts, while others make do perfectly well with rolls of butcher paper, boxes of Sharpies, and stacks of Post-It notes.
Either way the process is the same, as are the objects comprising it and the relationships between them. This is the case with any data-driven process: the particulars of implementation are far less important than the nature of the data model and business logic driving it.
Based on the foregoing discussion, here is a data model that captures the essentials of the DPRM:
A Simplified DPRM Data Model
It is worth reiterating that this is a simplified DPRM data model. It does not feature support for retaining a history of changes to the model, so it does not support historical review or undo. There is also work to be done to support the refactoring process described (but not specified) above. In the main, though, this data model satisfies the major requirements laid out above. Any tool that implements it will constitute significantly more than a good start, and will give a data pipeline development team visibility into their state of analysis and development, and the effects of change, that would otherwise be completely impossible.
Making It Work
A key nonfunctional requirement of the DPRM, however it is implemented, is that it be broadly accessible by the development team. Individual analysts and developers need access, both to introduce their own changes into the system, and to traverse the requirements model in order to propagate and realize the effects of their changes on other requirements and on code. This intrinsically constrains the set of choices available to a team that wishes to leverage the DPRM.
Clearly the ideal option would be to implement the DPRM on a relational database, coupled with object code to encapsulate appropriate business logic and a well-executed user interface accessible to all users from a web browser. We mentioned above that HotQuant is building precisely that. For a project team, whose true mandate is to build a data pipeline and not a requirements management system, building such a tool in-house is out of the question. Until HotQuant or somebody else comes up with a commercially available implementation of the DPRM, organizations on the client side are going to have to “fake it” well enough to capture the DPRM’s more important benefits while avoiding most of the costs.
Recent versions of Microsoft Excel support the construction of data models that include foreign-key relationships. While in principle this means an Excel spreadsheet could support an instance of the DPRM, and while it is possible to share a spreadsheet across a team, in practice it is not really useful for team members to have concurrent write access to a single spreadsheet. Without concurrent access, the team must take turns using the DPRM, a source of inertia that will surely result in such a tool being used little if at all.
Microsoft Access supports more robust data model expressions as well as good concurrency when deployed into the right network environment. Additionally, VBA offers options for automation around the refactoring process described above. This is a potentially powerful combination, with the downside that it may also represent a non-trivial development effort. When a large team needs to get off the ground immediately, it may not have the luxury to wait around for a week or two while a few of its members implement the DPRM in Access and VBA.
HotQuant has had some success using connected SharePoint lists. SharePoint is a useful platform because it is intrinsically a shared environment, and a list—which is analogous to a database table—offers a ready-made (if cumbersome) user interface. The fact that SharePoint lists feature only rudimentary keys can be somewhat overcome by imposing a coded syntax on a key, so that for example a Data Element Stage key can be written [Stage Name].[Data Element Name]. This makes it possible to create functional relationships between lists, at the cost of requiring some exactness in data entry: a somewhat brittle solution, but useful. Finally, SharePoint is already broadly used and well-secured at most client organizations, so there are rarely significant objections to making a sandbox available for a DPRM implementation.
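The coded-key scheme is simple to implement, and brittle in exactly the way described. A sketch, using the [Stage Name].[Data Element Name] syntax from the text:

```python
def des_key(stage: str, element: str) -> str:
    """Compose a Data Element Stage key in the coded syntax
    [Stage Name].[Data Element Name]."""
    return f"[{stage}].[{element}]"

def parse_des_key(key: str) -> tuple:
    """Split a coded key back into (stage, element). Brittle by design:
    raises ValueError on anything that does not match the expected shape."""
    left, right = key.split("].[")
    if not (left.startswith("[") and right.endswith("]")):
        raise ValueError(f"malformed key: {key!r}")
    return left[1:], right[:-1]

key = des_key("Aligned 1", "Security Identifier")
assert key == "[Aligned 1].[Security Identifier]"
assert parse_des_key(key) == ("Aligned 1", "Security Identifier")
```

In a SharePoint list, of course, the composition and parsing happen in the analyst's head, which is precisely the source of the exactness requirement noted above.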
Often an organization can save time and complexity by eliminating the Trace object and flattening the DPRM. While on the one hand this eliminates the ability to decompose HLRs—effectively, every DTR becomes an LLR—it preserves the critical function of providing for the exact expression of an LLR that can be translated directly into code. This can be a useful trade-off.
Finally, there’s paper. Some organizations find that operating a paper-based Scrum process—Sharpies, Post-It notes, and such—preserves the visceral, hands-on feel that is an essential component of Agile, and prevents a team from entering rabbit holes that would be prohibitively difficult to follow with a paper process. The same principle could easily apply to the DPRM. Our objection would be that even simple data pipelines possess an underlying complexity of requirements that would render a paper-based process too cumbersome to pursue. On the other hand, we have never tried, and certainly some simplification of the DPRM would be easier to implement on paper than the full model. We would certainly be interested to hear about any attempts in that direction.
Making The Leap
In this article we have presented a rational method for capturing, analyzing, and maintaining requirements associated with a modern data pipeline.
We have used a lot of words like “should” and “could” to describe data structures and procedures that, we freely admit, have not been fully implemented in any commercially available product. A reader—particularly one engaged in the struggle to build an actual data pipeline, ahead of schedule and under budget—might be excused if tempted to dismiss this presentation as academic, theoretical, and only useful to a team with infinite time, money, and human resources at their disposal.
That would be a mistake.
Modern software engineers understand that a problem discovered in production can cost an order of magnitude more time and money than one discovered in development. Just so with the step between requirements definition and development. A process that prevents most defects from being coded in the first place would have to be cumbersome indeed not to be worth its cost.
The DPRM is precisely that. By ensuring that the code written is actually an expression of the business requirements articulated, and by ensuring those business requirements are continually updated to reflect constraints discovered during the coding process, the DPRM serves as a living channel of communication—a direct feedback loop—between the business and the developer. It offers the unparalleled advantage of speaking the correct native language on each end.
Absent a shrink-wrapped tool, the DPRM can be somewhat cumbersome to operate. The alternative, though, is worse: the cost of fixing a defect is almost always higher than the cost of preventing one.
We at HotQuant have found that the degree of a project’s success by any useful metric—velocity, cost, responsiveness—is almost always in direct proportion to the amount of attention paid to a formal understanding of requirements. While we have not yet produced a shrink-wrapped implementation of the DPRM, we have “faked it” a number of different ways as described above, and we find that projects where we have been more successful at implementing the DPRM have without exception been more successful projects.
We hope it will be as useful for you as it has been for us.