by Jason Williscroft
The Data Matcher component sits at the heart of most MEDM implementations. Its purpose is to examine the properties of data rows arriving from multiple sources in order to determine which rows, across sources or within a single source, describe the same entity. The Data Matcher expresses this identification at its output by assigning corresponding rows a common integer identifier, called the Cadis Id.
The Data Matcher is a combinatorial machine that operates by systematically applying a set of abstract matching rules against pairs of sources. Rules and sources are both ordered so that the highest-priority rules are tested first, producing an optimal balance of processing performance and match accuracy while reliably shunting questionable matches into a queue for human review.
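To make the idea concrete, here is a minimal Python sketch of priority-ordered matching. It is not the Data Matcher's actual algorithm: the function names, the row layout (dictionaries keyed by match term), and the sample values are all assumptions made for illustration, and the sketch omits confidence levels and the human review queue. It simply compares each incoming row against rows already identified, trying rules in priority order, and assigns a shared integer identifier on the first match.

```python
def rule_matches(rule, left, right):
    """True when every term in the rule is present on both rows and the values agree."""
    return all(
        left.get(term) is not None and left.get(term) == right.get(term)
        for term in rule
    )


def assign_ids(sources, rules):
    """Assign a common integer id to rows judged to describe the same entity.

    sources: dict of source name -> list of row dicts, in source priority order
             (hypothetical layout, assumed for this sketch).
    rules:   list of tuples of term names, highest priority first.
    """
    ids = {}    # (source name, row index) -> shared integer id
    seen = []   # rows already identified, in the order they were processed
    next_id = 1
    for name, rows in sources.items():
        for index, row in enumerate(rows):
            # Try rules in priority order; take the id of the first row that matches.
            matched = next(
                (ids[key] for rule in rules
                 for key, other in seen if rule_matches(rule, row, other)),
                None,
            )
            if matched is None:
                matched = next_id   # no match under any rule: a new entity
                next_id += 1
            ids[(name, index)] = matched
            seen.append(((name, index), row))
    return ids


# Illustrative data only.
sources = {
    "vendor_a": [{"isin": "US0378331005", "ticker": "AAPL"},
                 {"isin": "US5949181045", "ticker": "MSFT"}],
    "vendor_b": [{"isin": None, "ticker": "MSFT"},
                 {"isin": "US0378331005", "ticker": "AAPL.O"},
                 {"ticker": "IBM"}],
}
rules = [("isin",), ("ticker",)]
ids = assign_ids(sources, rules)
```

Because rows within a source are also compared against earlier rows from that same source, the sketch covers matches within a single source as well as across sources.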
Data Matcher configuration is an exercise in business analysis and design. A Data Matcher must abstract out the essential match-related properties of its inputs and express a set of rules that are sufficient to capture all anticipated matching conditions, while preserving performance by exposing a minimal section of the rule space to the combinatorial engine.
Historically within the MEDM community, the common approach to Data Matcher design has been very much ad hoc: the implementation consultant selected a set of rules that appeared to cover the most common match cases, then iterated the design against test data until the number of Data Matcher exceptions reached some acceptable minimum. This approach is labor-intensive, but effective so long as matching requirements remain stable. The approach does not scale well with Data Matcher complexity, nor does it flex well when Data Matcher requirements change (for example, with the addition of a new source or new match terms to an existing source).
In early 2011, HotQuant managed an initial MEDM implementation at a financial institution in Denver, involving the construction of a security master. The implementation featured an unusually high number of security sources, with a complex overlap of matching terms among sources. It became immediately apparent that the traditional approach to Data Matcher design was going to fall flat: there were too many inputs, and early requirements were still too much in flux, to enable even an experienced operator to evolve an effective design and maintain it in a sufficiently stable condition to support the rest of the development effort.
Consequently, HotQuant determined to build a tool that would accomplish the following goals:
- Systematically capture sources, match terms, associated priorities, and relevant mappings from source columns to match terms.
- Efficiently express match rule exclusions intrinsic to the incoming data. For example, no match rule should encompass both Maturity Date and Expiration Date since these terms cannot coexist on the same asset.
- Rationally generate match rules, rule code, and rule confidence levels directly from captured requirements.
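The goals above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Matcher Rule Generator itself (which is an Excel VBA application): the function name `generate_rules`, the integer term weights, the `max_terms` cap, and the use of a summed weight as a stand-in for rule confidence are all hypothetical. The sketch enumerates combinations of match terms, discards any combination containing a mutually exclusive pair, and orders the survivors by score.

```python
from itertools import combinations


def generate_rules(term_weights, exclusions, max_terms=3):
    """Enumerate candidate match rules, skipping mutually exclusive terms.

    term_weights: dict of term name -> weight (hypothetical priority measure).
    exclusions:   iterable of term pairs that cannot coexist on one asset.
    max_terms:    largest number of terms allowed in a single rule (assumed cap).
    Returns a list of (terms, score) tuples, highest score first.
    """
    excluded = [frozenset(pair) for pair in exclusions]
    rules = []
    for size in range(1, max_terms + 1):
        for combo in combinations(sorted(term_weights), size):
            # Drop any rule that contains both members of an excluded pair.
            if any(pair <= set(combo) for pair in excluded):
                continue
            # Assumption: score a rule by summing its term weights.
            score = sum(term_weights[term] for term in combo)
            rules.append((combo, score))
    # Highest score first; ties broken alphabetically for a stable order.
    rules.sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


# Illustrative weights only.
term_weights = {"ISIN": 5, "Maturity Date": 2, "Expiration Date": 2}
exclusions = [("Maturity Date", "Expiration Date")]
rules = generate_rules(term_weights, exclusions)
```

With these inputs, no generated rule contains both Maturity Date and Expiration Date, which mirrors the exclusion example in the list above.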
The resulting Matcher Rule Generator, implemented as a Microsoft Excel VBA application, has proven very effective across multiple complex implementations. It enables development teams to keep Data Matcher requirements continuously in sync with the broader design effort and to translate Data Matcher requirement changes into consistent, high-quality code, while radically reducing the level of effort associated with Data Matcher design.
This tool expresses a core HotQuant principle: the process of rational design is itself subject to rational design.