by Jason Williscroft
HotQuant designs and builds Enterprise Data Management (EDM) systems—data pipelines—for large financial institutions.
I’ve written elsewhere about sources of inertia that can send an EDM project sideways. In that piece I identified a major risk around requirements, which manifests as an inability to articulate solid High-Level Requirements and decompose them into Low-Level Requirements in a manner that facilitates the process of analysis, development, deployment, and support in the production environment.
This is the Requirements Trap. It is eminently avoidable, and the consequences of falling into it are severe: missed deadlines, bloated budgets, a tangled and useless product, and ultimate project failure.
Here’s what you need to know.
Driving the Problem
Large institutions have learned the hard way not to put IT in charge of big data management projects.
The argument is straightforward: such projects are very expensive! There is an infinite number of solutions that could be built, so given the high cost of building any one alternative, it just makes sense to build the one that most closely aligns with actual business requirements. Since the technical wizards in IT have no direct experience with the business problems to be solved, the business needs to drive the problem, generally through the vehicle of a Data Governance Organization (DGO).
If one could draw an axis between purely-business and purely-technical operations, most DGOs begin operational life at the business end of the scale, articulating mission statements, value propositions, services and priorities. In other words: the DGO begins by writing down its business requirements. And business requirements necessarily have an end-to-end scope: they speak in business terms about what goes into a process and what comes out the other end, without concern for any messy internal detail.
But as the project proceeds, a DGO begins to consider the thing that must be built in order to realize this vision, and so begins to slide down toward the technical end of the scale by acquiring a staff of technical business analysts and developers and asking them to deliver a relevant set of technical requirements.
This collection of artifacts is a project’s High-Level Requirements.
The point of writing High-Level Requirements (HLRs) in EDM is to express the DGO’s business requirements in terms of the actual legacy systems and data sources known to the enterprise. So while the language of an HLR is technical—move data element X from source Y to target Z when these conditions apply—the scope of an HLR is end-to-end, just like its parent business requirement.
For example, the need for an EDM system generally begins with the desire to feed some data in a particular format to one or more target systems. Such data can be found on a number of different source systems. The nature of the beast, however, is that no single source typically contains all the data you need, and none of it is in the right format, so your first cut at an HLR diagram for your system would look something like this:
In addition to this very attractive diagram, a well-executed set of HLRs will contain three kinds of thing:
- A technical description of every artifact and data element required by each Target.
- A technical description of every relevant artifact and data element provided by each Source.
- A technical description of the business logic that transforms relevant Source data elements into each required Target data element.
Let’s leave the specifics of how to describe these things as a topic for another article. For now, there are two key points to understand about HLRs:
- They are a technical expression of the business requirements articulated by the DGO. Different language, higher granularity, but the same end-to-end scope.
- They are not even remotely buildable!
Let’s talk about that.
In the real world—meaning, in this case, at EDM pipeline implementation time—before I can write required data to a target system I have to apply target-specific export logic to my master data model, meaning I must have constructed a synthesis of my various data sources on the basis of some mastering logic.
And before I can apply mastering logic to my data sources I have to match them, meaning I have to apply matching logic to determine which rows from different sources refer to the same underlying entity (e.g. a security, a transaction, or a price).
And before I can match those sources I have to align them, meaning that (a) each entity to be matched is described by a single key unique to its source and (b) all enumerated source values are transformed to a common enumeration. Can’t match prices on the basis of currency, for example, if one side is expressed as an ISO3 string and the other side is expressed as an integer.
And before I can align the sources I have to type each one, meaning that I have to transform each data element (which probably arrives as plain text) into a value of the appropriate SQL type, which is (a) also a simple kind of alignment and (b) necessary if I want to exploit type-specific operations in T-SQL.
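The chain of dependencies above, read in execution order (type, align, match, master, export), can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real EDM implementation: the two feeds, their field names, the "Source A wins" mastering rule, and the pipe-delimited target format are all hypothetical. The numeric currency codes are ISO 4217 numeric codes.

```python
from datetime import date
from decimal import Decimal

# Hypothetical raw feeds: every value arrives as plain text.
RAW_A = [{"security_id": "SEC-001", "ccy": "USD", "price": "101.25", "as_of": "2016-06-30"}]
RAW_B = [{"security_id": "SEC-001", "ccy": "840", "price": "101.30", "as_of": "2016-06-30"}]

# ISO 4217 numeric code -> ISO3 alphabetic code (only the two used here).
NUMERIC_TO_ISO3 = {"840": "USD", "978": "EUR"}


def type_row(row):
    """Type: turn text into values of the appropriate type."""
    return {**row, "price": Decimal(row["price"]), "as_of": date.fromisoformat(row["as_of"])}


def align_row(row, source):
    """Align: one key per entity, plus a common currency enumeration."""
    ccy = NUMERIC_TO_ISO3.get(row["ccy"], row["ccy"])
    key = (row["security_id"], ccy, row["as_of"])
    return {**row, "ccy": ccy, "key": key, "source": source}


def match(rows_a, rows_b):
    """Match: pair rows from different sources that share an aligned key."""
    by_key = {r["key"]: r for r in rows_b}
    return [(a, by_key[a["key"]]) for a in rows_a if a["key"] in by_key]


def master(pair):
    """Master: synthesize one record. Source A wins here (an arbitrary rule)."""
    a, _b = pair
    return {"security_id": a["security_id"], "ccy": a["ccy"],
            "as_of": a["as_of"], "price": a["price"]}


def export(record):
    """Export: apply target-specific formatting to the master record."""
    return (f'{record["security_id"]}|{record["ccy"]}|'
            f'{record["as_of"]:%Y%m%d}|{record["price"]:.2f}')


def run_pipeline(raw_a, raw_b):
    a = [align_row(type_row(r), "A") for r in raw_a]
    b = [align_row(type_row(r), "B") for r in raw_b]
    return [export(master(p)) for p in match(a, b)]


print(run_pipeline(RAW_A, RAW_B))  # ['SEC-001|USD|20160630|101.25']
```

Without the alignment step the two rows would never match, because one key would carry "USD" and the other "840".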
So an EDM data pipeline as actually constructed looks more like this:
When the three categories of requirements artifact listed above—source data elements, target data elements, and transformation logic—are expressed at this level of detail, we have created Low-Level Requirements (LLRs).
I stated above that HLRs are not buildable. The LLR diagram above should illustrate why. HLRs express end-to-end logic. At implementation time, every HLR is “smeared out” across multiple logical and physical components, each of which has its own inputs, outputs, logic, and execution context. An HLR exists in a purely conceptual space, where data is transformed in a single, relatively complex, end-to-end process. Actual developers write actual code in implementation space, which is partitioned and constrained by system boundaries, execution contexts, and decisions around architecture, data modeling, developer coordination, and so on. LLR transformation logic is much smaller in scope and far simpler to articulate than the end-to-end logic of its parent HLR.
So in order for a developer to implement any HLR, it must first be decomposed into a set of related LLRs, where each data transformation is self-contained and can be translated from the requirement directly into code.
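One way to picture this decomposition is as a small data structure in which every LLR names both its parent HLR and the component where it will be built. The sketch below is only an assumption about how such records might look; the identifiers, component names, and descriptions are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLR:
    hlr_id: str
    description: str  # end-to-end: source elements -> target element


@dataclass(frozen=True)
class LLR:
    llr_id: str
    parent_hlr: str   # traceability back to the HLR this fragment realizes
    component: str    # pipeline component where the fragment is built
    description: str  # small, self-contained transformation


# Hypothetical requirements, invented for illustration.
HLRS = [
    HLR("HLR-1", "Deliver price from Sources A and B to Target Z"),
    HLR("HLR-2", "Deliver currency from Sources A and B to Target Z"),
]

LLRS = [
    LLR("LLR-1", "HLR-1", "type", "Parse price text into a decimal value"),
    LLR("LLR-2", "HLR-1", "master", "Choose the Source A price when both sources supply one"),
    LLR("LLR-3", "HLR-1", "export", "Write price to Target Z with two decimal places"),
    LLR("LLR-4", "HLR-2", "align", "Map numeric currency codes to ISO3 strings"),
    LLR("LLR-5", "HLR-2", "master", "Carry the aligned currency onto the master record"),
    LLR("LLR-6", "HLR-2", "export", "Write currency to Target Z as an ISO3 string"),
]


def components_for_hlr(llrs, hlr_id):
    """The components one HLR is spread across, in listing order."""
    return [l.component for l in llrs if l.parent_hlr == hlr_id]


def hlrs_in_component(llrs, component):
    """The HLRs that have a fragment living in one component."""
    return sorted({l.parent_hlr for l in llrs if l.component == component})


print(components_for_hlr(LLRS, "HLR-1"))  # ['type', 'master', 'export']
print(hlrs_in_component(LLRS, "master"))  # ['HLR-1', 'HLR-2']
```

The two queries show both directions of the problem: a single HLR touches several components, and a single component carries fragments of several HLRs.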
The Requirements Trap
LLRs are not optional. They have to exist, because only they—as opposed to their parent HLRs—reflect the topology of the system actually being built. The question is not whether LLRs exist, but where.
There are fundamentally two ways to generate LLRs:
- Explicitly. An analyst—ideally with the assistance of a developer familiar with the implementation—writes down each LLR in a manner that is traceable to its HLR parent. This provides a ready-made checklist of what has to be built, whose elements can be validated and coordinated against corresponding HLRs and tracked through the various stages of analysis, development, and deployment. Also, because each LLR is articulated independently of its code, testing an LLR becomes a simple matter of exercising code and comparing the result against expectations articulated in the relevant LLR.
- Implicitly. A developer starts with an HLR and formulates a set of corresponding LLRs in his head. In the worst case this will happen on the fly, and the only expression of an LLR anywhere will be its actual realization in code. Consequently, there exists no objective standard, scoped to the executable component, against which to test. Also, since each HLR is spread out across multiple executable components, and since each component expresses fragments of multiple HLRs, writing or maintaining code associated with one LLR (meaning the piece of an HLR that lives in a given component) requires the developer to maintain a mental picture encompassing all components associated with the HLR being built and all HLRs with operations passing through the component under development.
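To make the testing advantage of explicit LLRs concrete, here is a minimal sketch in which an LLR carries its expectations as data, written independently of the code, and a generic check exercises any candidate implementation against them. The LLR record, its identifiers, and the `align_currency` function are hypothetical; only the ISO 4217 numeric codes are real.

```python
from dataclasses import dataclass, field


@dataclass
class LLR:
    llr_id: str
    parent_hlr: str
    description: str
    # (input, expected output) pairs, written down before any code exists
    expectations: list = field(default_factory=list)


# Hypothetical LLR, invented for illustration.
LLR_4 = LLR(
    llr_id="LLR-4",
    parent_hlr="HLR-2",
    description="Map numeric currency codes to ISO3 strings; pass ISO3 through",
    expectations=[("840", "USD"), ("978", "EUR"), ("USD", "USD")],
)


def align_currency(code):
    """Candidate implementation of LLR-4."""
    return {"840": "USD", "978": "EUR"}.get(code, code)


def verify(llr, implementation):
    """Exercise the code and return every (input, expected, actual) mismatch."""
    failures = []
    for given, expected in llr.expectations:
        actual = implementation(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


print(verify(LLR_4, align_currency))  # []
print(verify(LLR_4, lambda code: code))  # [('840', 'USD', '840'), ('978', 'EUR', '978')]
```

The second call stands in for the implicit case gone wrong: the code runs, but only a written expectation reveals that it does not do what was intended.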
I don’t think I have to explain where the advantages lie here. But I will observe that the longer the data pipeline gets (more transformations in sequence), and the more complex the HLRs (more data elements participating in each transformation), the harder it becomes to generate LLRs implicitly without breaking something along the way. There is a level of complexity beyond which no developer is that good, and even if he were, the development team would have no reliable method of estimating, tracking, or evaluating his work.
Hence, the Requirements Trap:
- When a project enters the requirements-definition stage, its parent DGO is still mostly a business organization. It probably has the resources and the inclination to produce a solid set of HLRs, but lacks the collective software-engineering depth to understand why LLRs matter or to write LLRs that reflect reality. Business analysts who already wrote HLRs may balk at the work of decomposing them into LLRs as (a) too technical and (b) redundant. We already wrote requirements, didn’t we?
- Once coding starts—in the absence of explicit LLRs—developers begin to wrestle with the implicit generation of LLRs. But project estimation is already complete and budgets allocated, and none of that accounted for the initial effort necessary to write LLRs, nor the ongoing effort required to keep them in sync with HLRs as project requirements inevitably change. So developers are encouraged to do the best they can with the requirements they have, leading to rising frustration and declining code quality.
- Over time, a gap opens up between explicitly written HLRs and implicit LLRs whose only expression is in their own physical code. Since no objective standard exists, this gap cannot be measured directly, and thus it is impossible to know just how far off track the project is or to estimate the level of effort required to bring the implementation back into sync with the HLRs. Testing becomes an increasingly seat-of-the-pants operation, and extension and maintenance suffer since changing any code requires a working understanding of most of the code.
A key attribute of the Requirements Trap is that nobody experiences it directly. As the code base grows, developers experience it at the low level as a rising sense of frustration with requirements ambiguity and a trend toward increasing code complexity. Project management experiences it at the high level as a rising gap between estimates and delivery times and a trend toward decreasing code quality. Both sides are correct in what they see, but the underlying cause is simply a lack of strict discipline (the kind imposed by working machinery) around (a) deciding what to build and (b) evaluating whether the team has actually built it.
For that, you need tools.
Avoiding The Trap
We have expended a lot of prose above on the idea of writing things down.
It is well known in project management circles that the management overhead associated with a project scales non-linearly with project complexity. In other words: twice the complexity requires more than twice the management.
Data management projects like EDM data pipelines are among the most complex things that have ever been built. They potentially involve the movement of thousands of data elements describing each of millions of entities, from dozens of sources, to dozens of targets, across significantly more than a handful of distinct transformation steps. When broken down into LLRs as described above, a mature data pipeline is a close choreography of many hundreds or thousands of moving parts, most of which perform dozens or hundreds of operations in parallel.
This kind of extreme complexity demands a significant management effort. Whether your project management model is Agile, Waterfall, or something in between, as the number of moving parts expands it becomes increasingly critical to do the following:
- Keep a complete list of your requirements at every level: business, HLR, and LLR. We love Agile, by the way: when your requirements change, change your lists!
- Key every requirement on every list to a complete description. The only way an analyst can know whether a set of LLRs completely realizes an HLR is by comparing their descriptions. If you don’t write these down, you can’t know if you are building what you intended to build.
- Maintain traceability between requirements at each level, all the way down to the code. In other words, there’s no point in having an HLR unless it realizes some business requirement, and no point to an LLR that doesn’t realize an HLR. So make those connections explicit and maintain them.
- Track your requirements through your process. Whatever the stages of your delivery process are (e.g. Analysis, Development, Ready-to-Deploy, Production), you should know at a glance where each LLR stands, and by aggregation, every HLR and business requirement.
- Start now! Whether your DGO has been operating for a week or a year or five, the Requirements Trap is real and has no bottom. If you haven’t fallen in yet, avoid it. If you’re in it, climb out!
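The list-keeping, traceability, and tracking practices above can be sketched as a single keyed registry. Everything here is an assumption made for illustration: the requirement records, the identifiers, and in particular the roll-up rule, which treats a parent as only as far along as its least-advanced LLR. The stage names are the example stages mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ["Analysis", "Development", "Ready-to-Deploy", "Production"]


@dataclass(frozen=True)
class Requirement:
    req_id: str
    level: str                    # "business", "HLR", or "LLR"
    description: str
    parent: Optional[str] = None  # traceability link to the next level up
    stage: Optional[str] = None   # tracked on LLRs only, rolled up for the rest


# Hypothetical requirements; LLR-9 deliberately points at an unknown parent.
REGISTRY = {r.req_id: r for r in [
    Requirement("BR-1", "business", "Feed end-of-day prices to the risk system"),
    Requirement("HLR-1", "HLR", "Deliver price to Target Z", parent="BR-1"),
    Requirement("HLR-2", "HLR", "Deliver currency to Target Z", parent="BR-1"),
    Requirement("LLR-1", "LLR", "Parse price text", parent="HLR-1", stage="Production"),
    Requirement("LLR-2", "LLR", "Master the price", parent="HLR-1", stage="Development"),
    Requirement("LLR-3", "LLR", "Align currency codes", parent="HLR-2", stage="Ready-to-Deploy"),
    Requirement("LLR-9", "LLR", "Round yields", parent="HLR-7", stage="Analysis"),
]}


def children(registry, req_id):
    return [r for r in registry.values() if r.parent == req_id]


def rolled_up_stage(registry, req_id):
    """Assumed rule: a parent is only as far along as its least-advanced LLR."""
    req = registry[req_id]
    if req.level == "LLR":
        return req.stage
    stages = [rolled_up_stage(registry, c.req_id) for c in children(registry, req_id)]
    if not stages or None in stages:
        return None  # something underneath is not yet realized by any LLR
    return min(stages, key=STAGES.index)


def orphans(registry):
    """HLRs and LLRs that trace to no known parent."""
    return [r.req_id for r in registry.values()
            if r.level != "business" and r.parent not in registry]


print(rolled_up_stage(REGISTRY, "HLR-1"))  # Development
print(rolled_up_stage(REGISTRY, "BR-1"))   # Development
print(orphans(REGISTRY))                   # ['LLR-9']
```

Whether these records live in a spreadsheet, a set of linked lists, or a database matters less than the fact that every requirement has a key, a description, a parent, and a visible status.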
Precisely how to do this is an open question: there are a hundred methods and dozens of university faculties devoted to the management of complex requirements. It can be done with spreadsheets, if your team is rigorous about version control and passionate about sharing. It’s a little cumbersome but in our experience much more effective to employ a set of connected SharePoint lists. There are other options available commercially, none of them perfect, and of course you can always write your own. HotQuant is doing that; we intend to release hqSpec sometime in 2017.
In the end, though, the price of building complex systems is managing the complexity.
Is there a cost associated with this? Absolutely, both in dollar terms and in project time! But the cost of avoiding the Requirements Trap is knowable—you can estimate it, budget for it—and is utterly dwarfed by the cost of falling into it or (worse) staying in it.
So don’t do that.