The Data Science revolution

Data Science, Predictive Analytics, Big Data, Supervised/Unsupervised Machine Learning. Passing buzzwords or an opportunity for competitive advantage? As the last couple of decades have shown, they can be much more than buzzwords: innovative Data Science applications have not only disrupted technology across industries, but have also established today’s market leaders. Take online retailers and media streaming companies such as Amazon, YouTube, and Netflix, or social media networks and peer-to-peer services such as Facebook, Uber, and Airbnb. Notably, many of the companies that have transformed traditional business models hold few assets other than data and algorithms. Yet such intangible resources are still not widely recognised as profitable assets, and many companies remain far behind in investing in the human, software, and hardware resources needed to collect and mine data.

Data Science innovations are ubiquitous in the technology sector, but what about the finance industry? Enter Signet Bank, a small regional bank in Virginia. By applying predictive analytics to the issuance of credit cards in the 1990s, the business grew so strongly that it led to a well-known spin-off: Capital One. The credit card industry developed a new norm, moving from uniform pricing across customers to products tailored to the specific client profile; a true example of a data-driven revolution. Other notable Data Science applications with a major impact on the assessment of credit risk include the calculation of credit scores and the estimation of probabilities of default.

Mining data: Machine Learning

But what is Machine Learning and why is it so powerful? Despite its somewhat obscure name, the core principles are simple to understand. Loosely speaking, machine learning consists of a collection of methodologies and algorithms that, given an input dataset, can be applied to perform a wide range of tasks such as:

  • predict the value of a variable (e.g. the price of an asset) – Regression
  • predict the category of an observation (e.g. whether an individual will default) – Classification
  • group similar items, individuals, or entities (e.g. categorise clients in groups) – Clustering
  • identify similar items, individuals, or entities (e.g. find contracts with similar clauses) – Similarity Matching

Regression and Classification are typical cases of supervised learning, where a set of input variables (also called features or predictors) is used to predict the value or class of a target variable (also called output or response). The word “supervised” implies that the predictive model is built by discovering common patterns and correlations between the input and output variables of a training dataset, in which the value of the target is already known. Regression examples include the calculation of credit scores (based on a set of features such as income, age, and credit history) and the pricing of a property (based on real estate data such as the floor area, location, and year of construction). Classification examples include the detection of fraud from transaction data, and predicting whether a company will default over the next year given a set of financial statement, macroeconomic, and historic default rate data. In general, the more accurate the model’s predictions, the more efficient and profitable the business decisions based on them can be. Methods that perform Regression and Classification include multivariate regression, logistic regression, support vector machines, and decision trees, among others.
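To make the idea of supervised learning concrete, here is a minimal sketch of a Classification model: a logistic regression fitted by gradient descent, using only the Python standard library. The data, the single feature (a debt-to-income ratio), and all names are hypothetical and chosen purely for illustration; a real credit model would use many more features and a tested library.

```python
import math

# Hypothetical toy training data: one feature (debt-to-income ratio)
# and a known outcome (1 = the client defaulted, 0 = the client did not).
debt_ratio = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=1.0, iterations=2000):
    """Fit p(default) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Difference between the predicted probability and the known label.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# "Training": learn the relationship between the input and the known output.
w, b = fit_logistic(debt_ratio, defaulted)


def predict_default_probability(x):
    """Estimated probability of default for a client with debt ratio x."""
    return sigmoid(w * x + b)


# Classify each training client with a 0.5 probability threshold.
predicted = [int(predict_default_probability(x) >= 0.5) for x in debt_ratio]
```

The fitted model can then score a client it has never seen, for example with predict_default_probability(0.75). The "supervision" comes from the defaulted list: without known outcomes there would be nothing to fit against.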

Clustering, on the other hand, is often referred to as an “unsupervised” machine learning task because there is no target variable whose dependence on the inputs has to be established. Instead, the task is to discover structure and patterns in the data by grouping together and characterising entities such as clients, companies, or products. For instance, by categorising companies based on their size, revenue, probability of default, etc., a bank can optimise resource allocation by incentivising business with specific client groups. Examples of Clustering methods include K-Means, density-based clustering algorithms such as DBSCAN, and self-organizing maps. Similarity Matching, as the name suggests, finds a close match to a desired item. For example, it is used to recommend financial products to a customer based on other customers with a similar profile, or to identify and target potential clients that are likely to respond positively to an offer.
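As an illustration of Clustering, the following is a minimal sketch of K-Means written with the Python standard library. The company data (revenue scaled to the range 0 to 1, and a probability of default) is hypothetical, and the choice of the first k points as starting centroids is a simplification made here to keep the example deterministic; library implementations typically use randomised starting points.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means: alternate between assigning points and moving centroids."""
    # Simplification for this sketch: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical companies: (scaled revenue, probability of default).
# In practice, features should be put on comparable scales before clustering.
companies = [
    (0.90, 0.02),
    (0.20, 0.12),
    (0.80, 0.03),
    (0.85, 0.01),
    (0.15, 0.15),
    (0.25, 0.10),
]

labels, centroids = kmeans(companies, k=2)
```

No outcome is supplied here: the algorithm only sees the features and groups companies that lie close together. Each resulting centroid summarises a group (for example, high revenue with low probability of default), which is the kind of characterisation a bank could act on.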

The above methods are not restricted to datasets that consist only of numeric values. Natural Language Processing (NLP) underpins many powerful Machine Learning applications as well. Classification, Clustering, and Similarity Matching algorithms can organise, group, and compare articles, documents, or any kind of text based on their content. Typical examples are the grouping of news articles by topic, and the recommendation of similar articles to read. In finance, documents such as contracts, deals, or agreements can be grouped together and categorised based on their terms and clauses. This can not only facilitate the management of a large number of complex contracts, but also help ensure compliance within the ever-expanding regulatory framework. Yet another application is the processing of emails and transcribed phone calls in order to detect or even prevent fraudulent activities.
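One simple way to compare texts by content, sketched below with the Python standard library, is to turn each document into word counts (a "bag of words") and measure the cosine similarity between the count vectors. The contract snippets and names are hypothetical, and this is only one of many possible text representations; production systems usually add steps such as stop-word removal or term weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return the name of the closest document and all similarity scores."""
    query_counts = bag_of_words(query)
    scores = {
        name: cosine_similarity(query_counts, bag_of_words(text))
        for name, text in documents.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical contract clauses, for illustration only.
contracts = {
    "loan_a": "The borrower shall repay the loan with interest at a fixed rate",
    "loan_b": "The borrower shall repay the principal with interest at a floating rate",
    "lease": "The tenant shall pay rent for the office premises each month",
}

best, scores = most_similar("repayment of a loan at a fixed interest rate", contracts)
```

The same similarity scores can feed a Clustering step to group contracts with similar clauses, or a Similarity Matching step to retrieve the closest existing contract to a new one.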


Data Science, Predictive Analytics, Big Data, and Machine Learning are far from ephemeral hype. Much like the Internet revolution two decades ago, Data Science is gradually becoming part of every organisation, transforming business models and defining a new norm for how we conduct business. Implementing data-driven solutions is not a straightforward process, though: projects can be highly technical, and making them feasible requires investment in talent, software, and hardware. However, the value added by collecting and mining data can offer exceptional opportunities for growth, and getting on board early can increase the return on investment in the long run.