Mahalanobis Distance

Updated on April 4, 2024
Article by Kumar Rahul
Edited by Kumar Rahul
Reviewed by Dheeraj Vaidya, CFA, FRM

What Is Mahalanobis Distance?

Mahalanobis distance is a statistical measure used to determine the similarity between two data points in a multidimensional space. It is instrumental in data analysis, pattern recognition, and classification tasks. This distance metric takes into account the covariance structure of the data, which makes it suitable for situations where the variables are correlated.


It can help investors construct portfolios that are well-diversified across different assets. By considering the covariance between asset returns, it can identify which assets are most similar or dissimilar to each other. This information helps in selecting assets that provide the best risk-return trade-off for a given level of portfolio risk.

Key Takeaways

• Mahalanobis distance is a multivariate measure that quantifies the dissimilarity between two data points in a multidimensional space, considering the covariance structure of the data.
• It takes into account the correlations between variables, making it suitable for datasets where variables are interrelated.
• It is scale-invariant, meaning it is not affected by the scaling of variables, making it versatile for different units of measurement.
• Thresholds for identifying outliers or anomalies can be customized based on the specific application, allowing flexibility in analysis.
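
The scale-invariance point can be checked numerically. The following is a minimal sketch, not a definitive implementation: the height and weight sample, the helper names `mahalanobis` and `both_distances`, and the chosen point are all hypothetical, and the mean and covariance are simply estimated from the simulated sample. It re-expresses one variable in different units and compares how the Mahalanobis and Euclidean distances respond.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from mean, given covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ y = diff instead of inverting cov explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def both_distances(data, point):
    """Return (Mahalanobis, Euclidean) distance of point from the data mean."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)  # sample covariance, variables in columns
    return mahalanobis(point, mean, cov), float(np.linalg.norm(point - mean))

# Hypothetical sample: height in metres and weight in kilograms.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([1.7, 70.0], [[0.01, 0.5], [0.5, 100.0]], size=500)
point = np.array([1.9, 60.0])

# Re-express height in centimetres; weight is unchanged.
scale = np.array([100.0, 1.0])

d_m, e_m = both_distances(data, point)
d_cm, e_cm = both_distances(data * scale, point * scale)

print(f"Mahalanobis: {d_m:.4f} (metres) vs {d_cm:.4f} (centimetres)")
print(f"Euclidean:   {e_m:.4f} (metres) vs {e_cm:.4f} (centimetres)")
```

The two Mahalanobis values agree up to floating-point error, while the Euclidean distance changes with the choice of units.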

Mahalanobis Distance Explained

Mahalanobis distance, named after the Indian statistician Prasanta Chandra Mahalanobis, is a mathematical measure that quantifies the dissimilarity between two data points in a multivariate dataset. It’s a versatile tool for data analysis and pattern recognition that originated in statistics in the first half of the 20th century.

Prasanta Chandra Mahalanobis, an influential Indian scientist, introduced this concept in the 1930s. He played a pivotal role in establishing the Indian Statistical Institute (ISI) and contributed significantly to the development of statistical methods in India. Mahalanobis recognized the limitations of using Euclidean distance for multivariate data analysis, especially when dealing with correlated variables. To address this, he proposed a distance metric that incorporates the covariance structure of the data. His work aimed to develop statistical tools to aid in diverse fields, including agriculture, economics, and social sciences. The Mahalanobis distance became one of his most enduring contributions to statistics.

The Mahalanobis distance formula considers the mean vector and the covariance matrix of the dataset to calculate the distance between data points. It standardizes the data, transforming it into a space where variables are uncorrelated and have unit variances.
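
The standardizing transformation described above can be illustrated with a short sketch. It assumes a known mean and covariance matrix (the numbers are hypothetical) and uses a Cholesky factor as one possible way to decorrelate the variables; other factorizations would work as well. After the transformation, the ordinary Euclidean length of the transformed point equals its Mahalanobis distance.

```python
import numpy as np

# Hypothetical mean vector and covariance matrix of two correlated variables.
mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])
x = np.array([2.0, -1.0])

# Factor the covariance as cov = L @ L.T (L is lower triangular).
L = np.linalg.cholesky(cov)

# Transform the centered point: z = L^-1 (x - mean).
z = np.linalg.solve(L, x - mean)

# In the transformed space the covariance becomes the identity matrix:
# L^-1 @ cov @ L^-T = I (uncorrelated variables with unit variances).
whitened_cov = np.linalg.solve(L, np.linalg.solve(L, cov).T)

d_whitened = float(np.linalg.norm(z))  # Euclidean length after the transform
d_direct = float(np.sqrt((x - mean) @ np.linalg.inv(cov) @ (x - mean)))

print(d_whitened, d_direct)  # both equal sqrt(5), about 2.236
```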


Formula

The Mahalanobis distance formula measures how many standard deviations a data point lies from the mean of the dataset in a multidimensional space. The formula is as follows:

Mahalanobis Distance (D) = √((X - μ)' Σ^(-1) (X - μ))

Where:

• D is the Mahalanobis distance between the data point X and the mean of the dataset.
• X is the vector of values for the data point whose distance is being measured.
• μ (mu) is the mean vector of the multivariate dataset, containing the mean values of each variable.
• Σ (Sigma) is the covariance matrix of the dataset, which captures the variances of the variables and the relationships between them.
• Σ^(-1) is the inverse of the covariance matrix.

Here’s a step-by-step breakdown of the formula:

1. Subtract the Mean: (X - μ) calculates the difference between the values of the data point one is interested in (X) and the mean vector (μ). This step centers the data point on the mean.
2. Covariance Matrix Inverse: Σ^(-1) is the inverse of the covariance matrix. It accounts for the correlations between variables and their variances. Multiplying by the inverse down-weights deviations along directions in which the data vary widely and up-weights deviations along directions in which they vary little, so each deviation is judged relative to the spread of the data.
3. Matrix Multiplication: (X - μ)' Σ^(-1) (X - μ) performs matrix multiplication between the transposed (X - μ) vector and Σ^(-1), and then the result is again multiplied by (X - μ). This step computes the weighted squared differences between the data point and the mean, with the weights determined by the inverse covariance matrix.
4. Square Root: Finally, taking the square root of the result gives the Mahalanobis distance, which represents how far the data point X is from the mean, considering the correlations and variances of the variables in the dataset.
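
The four steps above can be sketched in code as follows. This is an illustrative example only: the five-row dataset and the point `x` are made up, and the mean vector and covariance matrix are estimated from that sample with NumPy's default sample covariance (dividing by n - 1).

```python
import numpy as np

# Hypothetical dataset: five observations (rows) of two variables (columns).
data = np.array([[1.0, 2.0],
                 [2.0, 4.0],
                 [3.0, 5.0],
                 [4.0, 4.0],
                 [5.0, 5.0]])
x = np.array([5.0, 3.0])  # the data point to measure

mu = data.mean(axis=0)              # mean vector
sigma = np.cov(data, rowvar=False)  # sample covariance matrix

# Step 1: subtract the mean.
diff = x - mu

# Step 2: invert the covariance matrix.
sigma_inv = np.linalg.inv(sigma)

# Step 3: compute the quadratic form (X - mu)' Sigma^(-1) (X - mu).
d_squared = diff @ sigma_inv @ diff

# Step 4: take the square root.
d = float(np.sqrt(d_squared))

# Plain Euclidean distance from the mean, for comparison.
euclid = float(np.linalg.norm(diff))

print(f"Mahalanobis: {d:.3f}, Euclidean: {euclid:.3f}")
```

For this sample the squared distance works out to 29/3, so D is about 3.11, compared with a Euclidean distance of about 2.24. The point lies against the positive correlation in the data, which is why the Mahalanobis distance is the larger of the two.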

Examples

Let us understand it through the following examples.

Example #1

Let’s consider an imaginary scenario where a bank is using Mahalanobis distance for fraud detection. The bank has a dataset of customer transactions, including information such as transaction amount, location, time of day, and customer history.

The bank calculates the Mahalanobis distance of each transaction from the mean transaction profile of legitimate customer behavior. If a transaction’s distance is well above what is typical for legitimate transactions, the bank flags it as potentially fraudulent. This approach helps the bank identify unusual patterns of behavior that might indicate fraud, even if the transaction amount is not extraordinarily high.
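
A minimal sketch of this kind of screening is shown below. Everything in it is hypothetical: the simulated history, the two features (amount and hour of day), the helper `mahalanobis_many`, and the cut-off of 3.0 are assumptions chosen for illustration, not the bank's actual method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical history of legitimate transactions.
# Columns: amount in dollars, hour of day.
legit = rng.multivariate_normal(
    mean=[50.0, 14.0],
    cov=[[400.0, 10.0], [10.0, 9.0]],
    size=1000,
)

# Profile of legitimate behavior, estimated from the history.
mean = legit.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(legit, rowvar=False))

def mahalanobis_many(points):
    """Mahalanobis distance of each row of points from the legitimate profile."""
    diff = np.atleast_2d(points) - mean
    # Row-wise quadratic form diff_i' @ cov_inv @ diff_i.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

new_transactions = np.array([
    [55.0, 15.0],  # ordinary amount at an ordinary hour
    [60.0, 3.0],   # ordinary amount at an unusual hour
])

THRESHOLD = 3.0  # assumed cut-off; a real system would tune this
scores = mahalanobis_many(new_transactions)
flags = scores > THRESHOLD

for row, score, flag in zip(new_transactions, scores, flags):
    print(row, round(float(score), 2), "FLAG" if flag else "ok")
```

The second transaction is flagged even though its amount is unremarkable, because its timing is far from the legitimate profile.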

Example #2

In a report from CNBC dated February 5, 2020, a study conducted by researchers from MIT and State Street suggests a concerning economic outlook, highlighting the application of statistical tools like Mahalanobis distance in financial analysis. The study indicates a 70% chance of a recession occurring within the next six months. This finding raises alarm bells as the global economy faces uncertainties and potential headwinds.

The research takes into account various economic indicators and financial market data, including sophisticated analytical methods like Mahalanobis distance, to arrive at this prediction. Mahalanobis distance, a statistical measure, factors in the covariance structure of economic variables, offering insights into data similarity and dissimilarity in multidimensional space.

Factors such as trade tensions, geopolitical instability, and slowing global economic growth have contributed to the heightened recession risk, as highlighted by the Mahalanobis distance-based analysis.

How To Interpret?

Here’s how to interpret Mahalanobis distance:

1. Magnitude of Distance: The Mahalanobis distance is a non-negative value that quantifies the dissimilarity between a data point and the mean of the dataset. A smaller distance indicates that the data point is closer to the mean and is more similar to the dataset as a whole. In contrast, a larger distance signifies greater dissimilarity.
2. Standard Deviations: One can think of the Mahalanobis distance in terms of standard deviations. For a single variable, it reduces to the absolute z-score, |x - μ| / σ, so a distance of 1 means the point lies one standard deviation from the mean. With several variables, it generalizes this idea by measuring the deviation relative to the spread of the data in the direction of the point, correlations included. Larger distances represent deviations of multiple standard deviations.
3. Thresholds: To interpret the Mahalanobis distance effectively, one needs to establish a threshold. The choice of threshold depends on the specific application and the desired level of sensitivity to outliers.
4. Multivariate Analysis: Mahalanobis distance is instrumental in multivariate analysis because it accounts for correlations between variables. If a data point has a large Mahalanobis distance from the mean, it suggests that it deviates significantly from the expected behavior, considering the relationships between variables.
5. Context Matters: Interpretation should always consider the context of the analysis. For example, in fraud detection, a high Mahalanobis distance may indicate a suspicious transaction, while in medical diagnosis, it could signal a patient’s health anomaly.
6. Decision Making: In practical applications, decisions are based on the Mahalanobis distance. For example, if the distance of a financial portfolio from the average risk profile is too high, it might warrant a review or adjustment of the portfolio composition.
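
The role of correlations in this interpretation can be illustrated with a small sketch. It assumes two standardized variables with a hypothetical correlation of 0.9, and compares two points that are each 1.5 standard deviations from the mean on both variables: one follows the correlation, the other goes against it.

```python
import numpy as np

# Hypothetical profile: zero means, unit variances, correlation 0.9.
mean = np.zeros(2)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Mahalanobis distance of x from the profile above."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

with_trend = np.array([1.5, 1.5])      # both variables high together
against_trend = np.array([1.5, -1.5])  # one high, the other low

d_with = mahalanobis(with_trend)
d_against = mahalanobis(against_trend)

print(f"with the correlation:    {d_with:.2f}")
print(f"against the correlation: {d_against:.2f}")
```

Both points look equally ordinary when each variable is examined on its own, yet the second is far more unusual once the relationship between the variables is taken into account (a squared distance of 45, compared with about 2.37 for the first).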

Applications

Some of its known applications are:

1. Outlier Detection: It is helpful in anomaly detection. In finance, for instance, Mahalanobis distance can identify unusual market behaviors or fraudulent transactions by flagging data points whose distances are significantly larger than the norm.
2. Portfolio Optimization: In finance, it aids in constructing well-diversified portfolios by quantifying the distance of individual assets or investments from the portfolio’s mean risk-return profile. Investors use it to allocate assets effectively.
3. Credit Scoring: Lenders use Mahalanobis distance to assess the creditworthiness of applicants. It helps in comparing an applicant’s financial attributes to historical data, identifying deviations that may signify credit risk.
4. Quality Control: In manufacturing, Mahalanobis distance monitors product quality. It can flag products with measurements that deviate significantly from the production process mean, indicating potential defects.
5. Image Recognition: In computer vision, it classifies and recognizes objects based on their features. Mahalanobis distance helps measure the similarity between feature vectors extracted from images.
6. Healthcare: Medical professionals employ it for patient diagnosis. For example, it can help identify patients whose health characteristics deviate significantly from the norm, aiding in early disease detection.
7. Market Research: Researchers use Mahalanobis distance in clustering and classification tasks to group similar market segments or customer profiles based on various attributes.
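
As a rough illustration of the classification and clustering uses listed above, the sketch below assigns a new observation to whichever group profile is nearest in Mahalanobis distance. The two segments, their means and covariance matrices, and the helper names are hypothetical; this is one simple assignment rule among many, not a full classifier.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a group with the given mean and covariance."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def euclidean(x, mean, cov):
    """Euclidean distance of x from the group mean (the covariance is ignored)."""
    return float(np.linalg.norm(x - mean))

# Hypothetical segment profiles: (mean vector, covariance matrix).
profiles = {
    "segment_a": (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    "segment_b": (np.array([4.0, 0.0]), np.array([[9.0, 0.0], [0.0, 1.0]])),
}

def nearest(x, metric):
    """Name of the profile with the smallest distance to x under metric."""
    return min(profiles, key=lambda name: metric(x, *profiles[name]))

x = np.array([1.6, 0.0])
by_mahalanobis = nearest(x, mahalanobis)
by_euclidean = nearest(x, euclidean)

print("Mahalanobis assigns to:", by_mahalanobis)
print("Euclidean assigns to:  ", by_euclidean)
```

The point is closer to the mean of segment_a in raw units, but segment_b is far more spread out along the first variable, so the point is more typical of segment_b once that spread is considered.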

Following is a comparison of the advantages and disadvantages of using Mahalanobis distance:

Mahalanobis Distance vs Euclidean Distance

Below is a comparison between Mahalanobis distance and Euclidean distance:

Frequently Asked Questions (FAQs)

1. What does a high Mahalanobis distance indicate?

A high Mahalanobis distance suggests that a data point is significantly dissimilar from the mean of the dataset, considering variable correlations. This could indicate an outlier or an unusual data point.

2. Can Mahalanobis distance be used with non-normal data?

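The distance itself can be computed for any dataset with an invertible covariance matrix, so it is also applicable to non-normal data. However, its probabilistic interpretation, such as translating a distance into a likelihood of being an outlier, relies on multivariate normality. With non-normal data the results may therefore be less reliable, and alternative distance metrics may be taken into consideration.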

3. How do I set a threshold for Mahalanobis distance for outlier detection?

Thresholds for Mahalanobis distance depend on domain knowledge, simulation, or statistical methods. One can choose a threshold that balances sensitivity and specificity, depending on the specific application and tolerance for false positives/negatives.
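
One common statistical approach can be sketched as follows, under the assumption that the data are approximately multivariate normal and that the mean and covariance are known or well estimated. In that case the squared Mahalanobis distance follows a chi-square distribution with as many degrees of freedom as there are variables. For two variables the quantile has a simple closed form, which the hypothetical helper `threshold_2d` uses; for other dimensions a chi-square quantile function from a statistics library would be needed. The simulation at the end is only a sanity check.

```python
import math
import numpy as np

def threshold_2d(coverage):
    """Distance expected to contain the given share of bivariate normal data.

    With two variables, the squared distance is chi-square with 2 degrees of
    freedom, whose quantile for probability p is -2 * ln(1 - p).
    """
    return math.sqrt(-2.0 * math.log(1.0 - coverage))

threshold = threshold_2d(0.99)  # flag roughly the most extreme 1%

# Sanity check on simulated data with zero mean and identity covariance,
# where Mahalanobis distance coincides with Euclidean distance from the origin.
rng = np.random.default_rng(1)
points = rng.standard_normal((100_000, 2))
distances = np.linalg.norm(points, axis=1)
flagged_share = float(np.mean(distances > threshold))

print(f"threshold: {threshold:.3f}, share flagged: {flagged_share:.4f}")
```

The threshold comes out at roughly 3.03, and about 1% of the simulated points exceed it. A stricter or looser coverage value shifts the balance between false positives and false negatives.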

This article has been a guide to what is Mahalanobis Distance. We explain its formula, examples, comparison with Euclidean distance, and how to interpret it.