Anomaly detection is a useful technique for identifying instances that deviate from the norm and is widely used in manufacturing and fault detection. In cybersecurity, anomaly detection is used in intrusion detection and anti-fraud solutions. Anomaly detection rests on the assumption that valid data instances tend to share a certain structure. Once the characteristics of this structure are determined, other data instances can be analyzed to see whether they exhibit the same structure as the valid data points. If they do, they are labeled as valid; if their structure differs, they are labeled as anomalies.
In this blog, we will discuss K-nearest neighbors (KNN), a common technique in anomaly detection. We will then provide an overview of where it intersects with emerging privacy-preserving technologies and how this affects advanced analysis on multiple encrypted datasets.
Widely used anomaly detection techniques include KNN, local outlier factor, and isolation forest, many of which are based on distance or density. In general, each data instance is treated as a point in a multi-dimensional space, where the number of dimensions equals the number of features used in the analysis. This enables us to evaluate the distance between points. We assume that the distance between two points characterizes how similar they are: the closer the points, the more alike the data instances. In KNN, the prediction of whether a point is anomalous is made with respect to its k nearest neighbors, where k is an integer, typically between 5 and 10.
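To make the idea concrete, here is a minimal sketch of one common way to turn KNN into an anomaly score: each point is scored by its mean distance to its k nearest neighbors, and points with unusually large scores are treated as candidate anomalies. The function name knn_anomaly_scores, the synthetic data, and the choice of mean distance as the score are assumptions made for illustration; they are not taken from any particular product or library.

```python
import numpy as np

def knn_anomaly_scores(points, k=5):
    """Score each point by its mean distance to its k nearest neighbors.

    Illustrative helper (hypothetical name); larger scores suggest outliers.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Ignore each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # Sort every row and keep the k smallest distances.
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Synthetic example: 200 points around the origin plus one far-away point.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
data = np.vstack([normal, outlier])

scores = knn_anomaly_scores(data, k=5)
most_anomalous = int(np.argmax(scores))  # index of the highest-scoring point
```

In this sketch the far-away point is appended last, so it sits at index 200. How to choose the score threshold above which a point counts as anomalous depends on the application and is left out here.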
Various data science and statistical tools require the use of data from multiple sources. However, many entities do not want their data to be shared with third parties. Though organizations may not want to expose their data, significant value can be gained if they are able to collaborate on their data without actually revealing it. One piece of cutting-edge technology that addresses this issue is Homomorphic Encryption (HE). HE is an encryption scheme that makes it possible to perform simple operations, such as addition and multiplication, on encrypted data. This allows us to perform calculations on multiple sets of encrypted data and obtain statistics and predictions without revealing the data itself. While some statistics and Machine Learning (ML) tools require computations that can be performed using only these simple operations, others, such as KNN, cannot. The reason is that in order to find the k nearest neighbors of a point, one must be able to sort the distances from that point to all of the other data points. Sorting requires comparison, which can be challenging to perform using HE.
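The contrast between simple operations and comparison can be illustrated on ordinary, unencrypted numbers. The sketch below involves no encryption at all. It only separates the two kinds of work in KNN: computing a squared distance needs nothing beyond subtraction, multiplication, and addition, while picking the k smallest distances depends on a comparison at every step. The function names squared_distance and k_smallest are hypothetical, and the comparison counter is there purely to make the point visible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance using only subtraction, multiplication, addition."""
    total = 0.0
    for a, b in zip(p, q):
        d = a - b
        total = total + d * d
    return total

def k_smallest(values, k):
    """Pick the k smallest values by repeated selection, counting comparisons.

    Every pass relies on '<', the operation that is hard to express
    with additions and multiplications alone.
    """
    remaining = list(values)
    chosen = []
    comparisons = 0
    for _ in range(min(k, len(remaining))):
        best = 0
        for i in range(1, len(remaining)):
            comparisons += 1
            if remaining[i] < remaining[best]:
                best = i
        chosen.append(remaining.pop(best))
    return chosen, comparisons

query = (0.0, 0.0)
others = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (1.0, 1.0)]

# Step 1: arithmetic only.
distances = [squared_distance(query, p) for p in others]
# Step 2: comparison-driven selection of the 2 nearest neighbors.
nearest, n_comparisons = k_smallest(distances, k=2)
```

Using the squared distance rather than the distance itself is a deliberate choice in this sketch: it preserves the neighbor ordering while avoiding a square root, which is another operation that is not a simple addition or multiplication.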
K-nearest neighbors is a powerful tool, and when the data can be analyzed in the clear, this and other standard anomaly detection techniques provide satisfactory results. However, what if the data we want to analyze is sensitive or confidential? Is there a way to do deviation testing on multiple encrypted datasets while keeping the sensitive data private? Duality has patented a proprietary anomaly detection technique that can detect outliers on multiple encrypted databases, without sharing the data externally or decrypting it. In a world where privacy concerns are at the forefront of everyone’s minds, security is essential, and our data analysis techniques must keep up.
To learn more, watch our on-demand webinar: Cryptography Enabled Data Science – Meeting the Demands of the Data Driven Enterprise.