Oct 21, 2014

As institutions fulfil their regulatory requirements, access to data that demonstrates and evidences ongoing compliance will be critical. Just how critical is illustrated by market estimates that the investment required to support document management and document lifecycle processing across buy- and sell-side firms in the financial services industry runs into billions of dollars. Digitising data efficiently using a proven industry-wide data model, alongside a comprehensive, tested support system (technology and resources) packaged as a managed shared service, can provide a unique and effective solution. Ravi Sonecha, consultant, and Stuart McClymont, head of Market Infrastructure Services at JDX Consulting, explain the mechanics.

There is increasing demand across the financial services industry to access data attributes that currently exist only within paper documents. These may be relationship agreements, for example Master Confirmation Agreements (MCAs), Credit Support Annexes (CSAs) and Swap Execution Facility (SEF) agreements. Or they might be transactional documents, such as trade confirmations or term sheets. Business-as-usual processing, risk management, regulatory response, execution speeds and the need for increased control have all intensified the need to access information accurately and in a timely manner.

It is estimated that the investment required to support document management and document lifecycle processing across the financial services industry is approximately £2bn across buy- and sell-side firms. However, given the increasing cost pressures that firms are facing, this level of unilateral investment is not sustainable, and a shared industry solution would appear to be the most effective way of achieving these common goals.

Processes to retrieve information accumulated as physical text continue to improve in both functionality and accuracy. Optical Character Recognition (OCR) is the technology that transforms an ordinary PDF scan into a searchable, computer-readable file from which key data can be accessed.

OCR is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Modern intelligent systems with a high degree of recognition accuracy can accommodate a large number of fonts. OCR technology can still be quite rigid, however, so there is often a need to ‘prepare’ a set of documents before the OCR can be conducted. We term this preparation ‘pre-processing’.

Pre-processing technology used before the OCR phase improves character recognition accuracy and capture levels. Even with a dedicated product, however, dedicated human involvement is often required, as the document set may be delivered to the managed service in a variety of formats. PDF scans can be illegible, document types incorrect, or agreements delivered in a foreign language. Regular correspondence with the client to escalate and resolve these exception documents is a critical pre-processing component, requiring dedicated human resource.

For those documents that are eligible for pre-processing, techniques include ‘de-skewing’ to make the lines of text perfectly horizontal and ‘de-speckling’ to remove spots; ‘normalisation’ to scale the document to the required ratio; ‘binarisation’ to convert the image from colour or greyscale to black and white; ‘layout analysis’ to identify columns and paragraphs; and ‘line removal’ if non-text boxes and lines are present. The prepared file is then fed through the OCR system, where sophisticated recognition technology digitises and extracts the text and data.
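Two of these steps can be illustrated with a minimal sketch in plain Python. It is not how any particular OCR product works: binarisation is taken here in its usual image-processing sense of thresholding each pixel to one of two values, de-speckling is reduced to clearing isolated dark pixels, and the names `binarise`, `despeckle`, the threshold of 128 and the sample `patch` are all assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # rows of greyscale values, 0 (black) to 255 (white)


def binarise(grey: Image, threshold: int = 128) -> Image:
    """Map each greyscale pixel to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in grey]


def despeckle(binary: Image) -> Image:
    """Clear ink pixels that have no ink among their eight neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]  # copy so the input is left untouched
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            has_ink_neighbour = any(
                binary[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 4x5 greyscale patch: a short horizontal stroke of text
# plus one stray dark pixel in the bottom-right corner.
patch = [
    [255, 255, 255, 255, 255],
    [250,  30,  20,  40, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255,  10],
]
binary_patch = binarise(patch)
clean_patch = despeckle(binary_patch)
```

In this toy example the three-pixel stroke survives because each of its pixels touches another ink pixel, while the lone corner pixel is treated as a speck and removed. Production tools use more careful methods, such as adaptive thresholds and size-based speck filters.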

Depending on the maturity of the technology, the raw extraction rate from the OCR tool can vary between 80% and 90%. The effort and time required for extraction are also highly dependent on the structure of the document. Structured documents are those with a strict, systematic format where data input is largely inflexible, such as a tax form. Unstructured documents, such as MCAs and CSAs, contain fields that can vary from agreement to agreement, making the extraction of specific data difficult.
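How an extraction rate is measured is not specified here. One simple reading, sketched below, is the share of expected fields that the OCR tool captured correctly. The function name `extraction_rate` and the field names and values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def extraction_rate(expected: Dict[str, str],
                    extracted: Dict[str, Optional[str]]) -> float:
    """Share of expected fields whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1 for name, value in expected.items() if extracted.get(name) == value
    )
    return correct / len(expected)


# Hypothetical fields from a single agreement (values are made up).
expected_fields = {
    "party_a": "Bank A",
    "party_b": "Fund B",
    "governing_law": "English law",
    "base_currency": "USD",
    "effective_date": "2014-01-15",
}
ocr_output = {
    "party_a": "Bank A",
    "party_b": "Fund B",
    "governing_law": "English law",
    "base_currency": "U5D",  # misread character
    "effective_date": "2014-01-15",
}
rate = extraction_rate(expected_fields, ocr_output)  # 4 of 5 fields correct
```

On this made-up agreement one field in five is misread, which puts the rate at the lower end of the range quoted above. A missing field counts the same as a misread one, since downstream users cannot rely on either.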

Downstream users

Downstream users of legal documentation and financial agreements, such as management, trading and risk departments, and regulators, will require data extraction accuracy to be 100% because they rely on the data to take commercial or risk decisions. A subsequent workflow in which human reviewers retrieve key missed data can therefore be created to complement the mechanised extraction post-OCR; this manual function is known as the 2eye/4eye review process. It requires a document to be manually reviewed by two sets of “eyes” as an additional quality check to ensure no key data has been missed.
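The review step can be sketched as a simple reconciliation. This is one possible reading of a 4eye check, not a description of any specific service: two reviewers each record a value for every key field, a value is accepted only when both agree, and anything else is escalated. The names `four_eye_review` and `ReviewResult`, and the sample fields, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReviewResult:
    accepted: Dict[str, str]  # fields on which both reviewers agree
    escalated: List[str]      # fields needing further resolution


def four_eye_review(field_names: List[str],
                    first_review: Dict[str, str],
                    second_review: Dict[str, str]) -> ReviewResult:
    """Accept a field only when both reviewers recorded the same value."""
    accepted: Dict[str, str] = {}
    escalated: List[str] = []
    for name in field_names:
        first = first_review.get(name)
        second = second_review.get(name)
        if first is not None and first == second:
            accepted[name] = first
        else:
            escalated.append(name)
    return ReviewResult(accepted, escalated)


# Hypothetical key fields and reviewer entries (values are made up).
key_fields = ["governing_law", "threshold_amount", "base_currency"]
reviewer_one = {
    "governing_law": "English law",
    "threshold_amount": "10,000,000",
    "base_currency": "USD",
}
reviewer_two = {
    "governing_law": "English law",
    "threshold_amount": "1,000,000",
    "base_currency": "USD",
}
result = four_eye_review(key_fields, reviewer_one, reviewer_two)
```

Here the two reviewers disagree on `threshold_amount`, so that field is escalated rather than passed downstream, while the other two fields are accepted. A field that either reviewer left blank is escalated in the same way.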

The end goal is to create a disciplined and efficient end-to-end managed service with shared technology and resources for clients wishing to transform their paper-stored data into machine-readable text. At present, a number of digitisation technologies are available, but they have been developed in silos within institutions, without the operational support and skill set that a managed service would provide. A dedicated, shared service across multiple departments providing the pre-processing and OCR services would therefore drive down the cost of R&D and technology support whilst simultaneously providing a higher extraction rate.

Once the document format has been identified (MCA, CSA, Long Form Confirmation etc.), a Business Analyst (BA) / documentation Subject Matter Expert (SME) can work with each client to ascertain and define the key clauses required by downstream users, further working with software developers to extract the key data into a custom-made User Interface (UI) to increase user-friendliness.

After the technology has been adapted, the pre-processing and OCR phases can begin, followed by the final 2eye/4eye review to ensure a 100% extraction rate. The managed service can support the technology and resources more effectively and at the most efficient cost per document.

As with the rapidly evolving technology space in general, OCR software will continue to improve over time. The long-term goal will be to remove the human element from document digitisation completely, establishing a service without pre-processing and with 100% extraction.

However, this level of technology and process automation will take time. Having appropriately skilled resources surrounding the technology in a managed service as the technology evolves is therefore critical. There is an opportunity for the industry to make use of such a managed shared service (technology and resources) to ensure maximum efficiency, at the optimal level of cost, and at the highest levels of quality. Most documents are between two parties; to minimise the costs to both, documents should be digitised once and the data consumed by both parties.

An industry managed shared service allows industry participants to benefit from technology proven by other parties, whilst also avoiding duplicative investment by both parties to a document, halving the cost. Using a managed shared service, rather than buying technology and resources individually, also reduces the portfolio of applications and the number of resources that each industry participant has to maintain.