Intelligent K-Nearest Neighbours: Why NOT to use Euclidean distance in KNN for recommender systems

Aparna Pande

Abstract

We present a Python recommendation engine that combines Agglomerative Hierarchical Clustering and K-Nearest Neighbors, an approach we call “Intelligent KNN.” This talk makes the case against using Euclidean distance as the default similarity measure and shows how replacing it can yield better recommendations.

Description

Rule-based recommendation systems, commonly found in practice, have a few downsides. While they may work well for data points that behave predictably, they may not work well for outliers. In addition, users of the recommendation system in interactive applications may be unaware of the rules driving the recommendations, and may therefore be unable to adjust those rules selectively for different datasets.

A common algorithm used by data scientists in recommendation systems is K-Nearest Neighbors, which typically defaults to Euclidean distance as its measure of (dis)similarity. However, Euclidean distance may not reproduce the behavior of a rule-based system unless that system can be represented as a combination of weights applied to the Euclidean distance.
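
As a small illustration of how the two measures can disagree, the sketch below compares Euclidean and Jaccard distance on boolean "rule matched" vectors. The vectors and the helper names (`euclidean`, `jaccard_distance`) are hypothetical and chosen only for illustration; they are not the rules or code from the talk.

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance between two numeric vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the positions that are set."""
    u, v = u.astype(bool), v.astype(bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # convention chosen here: two empty sets are identical
    return float(1.0 - np.logical_and(u, v).sum() / union)

# Hypothetical items described by which of 8 rules they match.
a = np.array([1, 0, 0, 0, 0, 0, 0, 0])
b = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # shares a rule with a
c = np.array([0, 1, 0, 0, 0, 0, 0, 0])  # shares no rule with a

# Euclidean ranks c as closer to a, although they have no rule in common.
print(euclidean(a, b), euclidean(a, c))
# Jaccard ranks b as closer to a, because it rewards the shared rule.
print(jaccard_distance(a, b), jaccard_distance(a, c))
```

In this toy case Euclidean distance penalizes `b` for the extra rules it matches, while Jaccard distance focuses on the overlap between the two rule sets.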

To make the rule-based system more flexible, transparent and adaptive, we built a recommendation system that periodically pre-clusters data points using Agglomerative Hierarchical Clustering with Jaccard distance computed on the proprietary rules. We then use the resulting cluster assignment as a feature in the KNN calculation. Python's scikit-learn library allowed us to make this recommendation system interactive, providing frequent updates and near-real-time results.
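
The two-stage idea can be sketched roughly as follows. This is a minimal sketch under several assumptions, not the system described in the talk: the rule flags and numeric attribute are made-up data, the clustering step uses SciPy's hierarchical clustering (scikit-learn's `AgglomerativeClustering` could fill the same role), and one-hot encoding of the cluster label with unit weight is just one possible way to feed it into KNN.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: rows are items, columns are boolean "rule matched" flags.
rules = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=bool)
# Hypothetical numeric attribute that plain KNN would use on its own.
numeric = np.array([[0.0], [1.0], [0.1], [0.9]])

# Stage 1: agglomerative (average-linkage) clustering on Jaccard distances.
jaccard = pdist(rules, metric="jaccard")
tree = linkage(jaccard, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")  # ask for at most 2 clusters

# Stage 2: one-hot encode the cluster label and append it as extra features.
_, cluster_idx = np.unique(labels, return_inverse=True)
one_hot = np.eye(cluster_idx.max() + 1)[cluster_idx]
augmented = np.hstack([numeric, one_hot])

def nearest_other(features, i):
    """Index of the nearest neighbour of row i, excluding row i itself."""
    nn = NearestNeighbors(n_neighbors=2).fit(features)
    _, idx = nn.kneighbors(features[[i]])
    return int(idx[0, 1])  # idx[0, 0] is the query row itself (distance 0)

print("plain KNN neighbour of item 0:", nearest_other(numeric, 0))
print("cluster-aware neighbour of item 0:", nearest_other(augmented, 0))
```

With this toy data, plain KNN pairs item 0 with an item that matches none of the same rules, while the cluster-augmented features pull it toward the item in its own rule cluster. In a periodically refreshed system, stage 1 would be rerun on a schedule and stage 2 would serve the interactive queries.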

This talk will provide a high-level overview of the machine learning and mathematical concepts involved, detail our approach and the reasoning behind replacing a commonly used default similarity measure, and walk through its application using scikit-learn.

Bio

Aparna is a senior engineer for the Fixed Income Data and Analytics Platform at Bloomberg. She has an undergraduate degree in Computer Science (with a concentration in AI) from Cornell University. Since joining Bloomberg, Aparna has driven a number of initiatives in New York and London while increasing the breadth of the Fixed Income Data Platform, providing low-latency cross-asset search and analytics capabilities to teams and functions at Bloomberg, using a suite of mainly open-source technologies.