Quality counterparty data is more important than ever

The increased volume of regulation, most notably concerning trade reporting and Know Your Customer (KYC), means that firms’ counterparty data is now scrutinised by regulators more than ever. Furthermore, the current trend of falling profitability across the industry has led firms to reduce costs by consolidating both business locations and systems. In order to consolidate systems effectively, a gold standard of counterparty data must be agreed. Consequently, the frequency of counterparty data remediation projects has increased dramatically in recent years.

Despite this increase, firms are still being fined for breaches in reference data quality several years on from the introduction of trade reporting and KYC regulation[1]. Evidently, something is wrong with the common remediation method.

Common issues that remediation projects face

  • Vast data population: Iterations of counterparty data regulation continually increase the number of entities in scope. Remediation populations commonly include thousands of counterparties with multiple reportable fields.
  • Prioritisation: The large size of the data population combined with a lack of analytical tools to assess the quality of data makes it difficult to effectively tranche those entities most at risk of breaching regulation. As a result, prioritisation of the remediation effort is difficult and teams find it tough to know where to begin.
  • Unsophisticated toolset: Remediation teams often struggle to probe large datasets with the software available to them.
  • Stakeholder reporting: Stakeholders throughout the firm find it difficult to keep abreast of the project progress due to the difficulty in producing meaningful management information.

Counterparty data is not black and white

When comparing two populations of data the most common approach is to use a combination of fields to create a key between the two datasets and then subsequently compare data fields as if they were binary. In practice this is not an effective approach due to the complexity of data being compared and a lack of data validation. For example, this approach would see a data discrepancy between the two entity names ‘JDX Consulting Limited’ and ‘JDX Consulting Ltd’ when in actual fact they are matching entities. It is preferable to take a probabilistic approach using fuzzy matching algorithms. This also allows the remediation team to effectively tranche the dataset by risk of a mismatch and provides the foundation for the prioritisation of the project on a risk basis. This is not possible with a black and white binary approach.
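To illustrate the limitation of the binary approach, here is a minimal Python sketch of a keyed, exact comparison. The record keys (`KEY-001`, `KEY-002`), the `name` field and the second entity are hypothetical and chosen purely for illustration.

```python
# Hypothetical records from two systems, keyed on an illustrative identifier.
firm_data = {
    "KEY-001": {"name": "JDX Consulting Limited"},
    "KEY-002": {"name": "Example Holdings PLC"},
}
reference_data = {
    "KEY-001": {"name": "JDX Consulting Ltd"},
    "KEY-002": {"name": "Example Holdings PLC"},
}


def binary_compare(left, right, field):
    """Return {key: bool}, where True means the field matches exactly."""
    shared_keys = left.keys() & right.keys()
    return {key: left[key][field] == right[key][field] for key in sorted(shared_keys)}


results = binary_compare(firm_data, reference_data, "name")
for key, is_match in results.items():
    print(key, "match" if is_match else "DISCREPANCY")
```

The first record is flagged as a discrepancy even though both names refer to the same entity, and the result carries no indication of how close the two values are.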

Levenshtein distance

An effective algorithm for measuring the difference between two strings was formulated by the Soviet mathematician Vladimir Levenshtein in 1965. The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one into the other[2]. By weighting the Levenshtein distance by the length of the strings being compared, it is possible to estimate the probability of a mismatch between them. For example:

| String 1 | String 2 | Transformation of 1 into 2 | Levenshtein distance | Probability of mismatch |
|---|---|---|---|---|
| JDX Consulting Limited | JDX Consulting Ltd | Deletion of ‘i’, ‘m’, ‘i’ and ‘e’ | 4 | 18% |
| JDX Consulting Limited | JDX INC | Deletion of ‘C’, ‘o’, ‘n’, ‘s’, ‘u’, ‘l’, ‘t’, ‘g’, ‘ ’, ‘L’, ‘i’, ‘m’, ‘i’, ‘t’, ‘e’ and ‘d’; insertion of ‘C’ | | |
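The calculation behind the first row of the table can be sketched in Python as follows. This is a minimal illustration rather than a production implementation. Dividing the distance by the length of the longer string is an assumption, chosen because it reproduces the 18% figure (4 / 22), and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def mismatch_score(a: str, b: str) -> float:
    """Distance divided by the longer string's length (assumed weighting)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


distance = levenshtein("JDX Consulting Limited", "JDX Consulting Ltd")
score = mismatch_score("JDX Consulting Limited", "JDX Consulting Ltd")
print(distance, f"{score:.0%}")  # prints: 4 18%
```

Note that this comparison is case-sensitive. Whether to normalise case, punctuation or legal suffixes before scoring is a design decision for the remediation team.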


Example project: MiFID II LEI remediation

MiFID II increases the trade reporting requirements that were put in place under MiFID I. More counterparties will come into scope of the reporting regime and all will be required to have a Legal Entity Identifier (LEI). To prepare for this change, firms will need to confirm that their current counterparty data is correct and that LEIs are associated with the entire in-scope population. After defining the in-scope population, the following three-stage approach is suggested:

Stage 1 – Confirmation of existing LEI population

  1. Use the LEI as the key field between the firm’s data and the counterparty data stored in freely available LEI databases, such as the DTCC’s GMEI utility.
  2. Apply the Levenshtein algorithm to fields common to both datasets, such as entity name and registered address.
  3. Build up a risk profile of the entity population to assess on a probabilistic basis whether any existing LEIs are incorrectly assigned and remediate based on this prioritisation.
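The three steps of stage 1 might be sketched as below. All data is hypothetical: the LEI values are placeholders rather than real identifiers, the entity names other than the JDX example are invented, and the 25% tranche threshold is an arbitrary assumption that a real project would need to calibrate. Only the entity name is compared here; other fields such as registered address could be scored in the same way.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def mismatch_score(a, b):
    """Distance weighted by the longer string's length (assumed weighting)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical extracts: placeholder LEI -> entity name.
firm_records = {
    "LEI-PLACEHOLDER-0001": "JDX Consulting Limited",
    "LEI-PLACEHOLDER-0002": "Example Holdings PLC",
    "LEI-PLACEHOLDER-0003": "Sample Trading LLP",
}
lei_database = {
    "LEI-PLACEHOLDER-0001": "JDX Consulting Ltd",
    "LEI-PLACEHOLDER-0002": "Example Holdings PLC",
    "LEI-PLACEHOLDER-0003": "Unrelated Capital SA",
}


def tranche(score, threshold=0.25):
    """Bucket a mismatch score; the threshold is an illustrative assumption."""
    if score == 0:
        return "exact match"
    return "low risk" if score <= threshold else "high risk"


def risk_profile(firm, reference):
    """Step 1: join on LEI. Step 2: score names. Step 3: rank by risk."""
    rows = []
    for lei, firm_name in firm.items():
        reference_name = reference.get(lei)
        if reference_name is None:
            rows.append((lei, 1.0, "LEI not found"))
            continue
        # Case is ignored here; this normalisation is an assumption.
        score = mismatch_score(firm_name.casefold(), reference_name.casefold())
        rows.append((lei, score, tranche(score)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


profile = risk_profile(firm_records, lei_database)
for lei, score, label in profile:
    print(f"{lei}  {score:.0%}  {label}")
```

Sorting by descending score puts the entities most likely to carry an incorrectly assigned LEI at the top of the remediation queue.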

Stage 2 – Allocation of LEIs to newly in scope entities

  1. Reverse the stage 1 approach by comparing the fields of each entity for which the firm does not hold an LEI against all records in the publicly available LEI database.
  2. Build up a profile for each entity on the basis of the most likely matching LEI and remediate on this basis.
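A sketch of stage 2, under the same assumptions as before: every identifier and every name other than the JDX examples is a hypothetical placeholder, and only the entity name is compared. Each entity without an LEI is scored against every record in the reference data and the closest candidates are returned for review.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def mismatch_score(a, b):
    """Distance weighted by the longer string's length (assumed weighting)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical public LEI records: placeholder LEI -> registered name.
lei_database = {
    "LEI-PLACEHOLDER-0001": "JDX Consulting Ltd",
    "LEI-PLACEHOLDER-0002": "Example Holdings PLC",
    "LEI-PLACEHOLDER-0004": "JDX INC",
}

# Hypothetical in-scope entities with no LEI on file: internal ID -> name.
entities_without_lei = {"CPTY-17": "JDX Consulting Limited"}


def best_lei_candidates(entity_name, reference, top_n=2):
    """Return the top_n (score, lei, name) tuples, lowest mismatch first."""
    scored = [
        (mismatch_score(entity_name.casefold(), name.casefold()), lei, name)
        for lei, name in reference.items()
    ]
    return sorted(scored)[:top_n]


for internal_id, name in entities_without_lei.items():
    for score, lei, candidate in best_lei_candidates(name, lei_database):
        print(f"{internal_id}: {candidate} ({lei}) mismatch {score:.0%}")
```

A candidate with a low mismatch score is a suggestion for review rather than a confirmed match, so the ranked list is best used to prioritise manual checks.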

Stage 3 – Outreach to clients that could not be matched in stages 1 and 2, to inform them of the regulation, confirm existing reference data and store their LEI.

Software considerations

When performing a remediation, the go-to tool is normally Excel. Although this works for small datasets with minimal data processing, a more specialised tool is preferable. Software such as QlikView or Tableau allows users to easily build up a rule set and apply it consistently across a population. Data processing with fuzzy matching logic takes disproportionately longer as the population size increases, because each record may need to be compared against every record in the other dataset. These programs are built around in-memory processing of large datasets, allowing the analysis to be done much faster than in Excel. They also provide easily distributed dashboards and reporting of project progress to stakeholders.
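To give a sense of scale, the short sketch below counts the comparisons needed when every firm entity is scored against every reference record. The population sizes are hypothetical round numbers, not figures from any real LEI database.

```python
def pairwise_comparisons(firm_entities: int, reference_records: int) -> int:
    """Comparisons needed if every entity is scored against every record."""
    return firm_entities * reference_records


REFERENCE_RECORDS = 1_000_000  # hypothetical size of the reference dataset

for entities in (1_000, 10_000, 100_000):
    total = pairwise_comparisons(entities, REFERENCE_RECORDS)
    print(f"{entities:>7,} entities -> {total:,} comparisons")
```

Each comparison is itself a string-distance calculation, which is why tooling that can process large volumes efficiently matters for this kind of project.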

Act Smart – Act Now

It is clear that as regulation continues to encompass wider populations of entities, the task of remediating data becomes more difficult, and the complexity of remediation projects should not be underestimated. However, by applying some clever logic to the problem and using the correct toolset, the common issues that all remediation projects face can be mitigated. Taking this approach will, in the long term, save firms time, effort and potentially costly sanctions, and acting sooner rather than later will prevent problematic data molehills from becoming insurmountable data mountains.

[1] List of FCA fines, 2016, http://www.fca.org.uk/news/list?ttypes=&yyear=&ssearch=fine

[2] Black, Paul E., ed. (14 August 2008), “Levenshtein distance”, Dictionary of Algorithms and Data Structures [online], U.S. National Institute of Standards and Technology