Nearest-Neighbor-Lookup

The nearest-neighbor-lookup (NNL) problem is defined as follows: Given a (fixed) set of N points in n-dimensional space, and a query position q in n-dimensional space, find the point p in the given set that has the minimum Euclidean distance from q. The k-NNL problem is an extension in which the k closest points to a query position q have to be found. The NNL problem occurs in many applications, ranging from implicite representation of Voronoi diagrams over scattered data interpolation to computation of local approximation hyperplanes to calculate normal vectors of surfaces defined by point clouds.

Project Goals

The goal of this project was to implement an efficient algorithm to solve the NNL problem for very large point sets (N on the order of several hundred million) in arbitrary-dimensional spaces. When dealing with very large point sets, the theoretical computational complexity of an algorithm is often dominated by the size of the algorithm's memory footprint. For example, given a set of 200 million 3D points defined by three 32-bit floating-point coordinates, the data set alone occupies 2.3 GB of memory; any memory overhead induced by additional acceleration data structures would quickly exhaust the computer's main memory, and require paging. This could lead to severe performance degradation. We therefore decided to implement a kd-tree based NNL algorithm that requires zero memory overhead, and has O(N*log(N)) theoretical computation complexity to create the kd-tree, and expected O(log(N)) complexity to perform a single point query if the query point is "close" to the point set, and a worst-case complexity of O(N) when the query point is far away from the point set. Other NNL algorithms offer better worst-case complexity, but many applications can guarantee a-priori that query points are close, and for those the kd-tree implementation offers superior performance.

Project Details

The trick to create a zero-overhead kd-tree of a set of N points is to represent the tree as a compact array of nodes, with the convention that all subtrees are stored at consecutive indices, and if a subtree covers the index range [left, right] (both inclusive), the subtree's root node is at index (left+right)/2. This convention leads to a very simple and efficient expected O(n*log(n)) algorithm to create a balanced kd-tree using a sequence of modified Quicksort sweeps over the point array, and to a memory layout that holds leaf nodes close to each other in memory.

This compact memory layout for the kd-trees not only benefits by minimizing memory footprint, but also minimizing memory accesses when traversing the tree during nearest-neighbor queries. In a normal kd-tree implementation, traversing a node requires loading its associated point from memory, and then loading the two pointers to its two children. Using the array-based layout, there are no pointers to be loaded as the child node's indices can be computed directly from the parent node's index. The node ordering implied by the above convention has the additional benefit of storing nodes in the lower regions of the tree close to each other in memory, which improves memory locality especially for nearest-neighbor queries, since most of the time finding a closest point is spent traversing the lower regions of the tree.

Nearest-neighbor queries are implemented in the usual fashion by traversing the kd-tree in the direction of the query point, and culling subtrees from the traversal as soon as a nearest-neighbor candidate point is found, and a node's split plane has a larger distance from the query point than the current candidate point. This results in a very efficient expected O(log(n)) query time.

Project Status

This is a one-shot project that grew out of experimentation with different kd-tree implementations in the context of the Templatized Geometry Library. The final version (so far) of the project source code can be downloaded from the download page.

Pages in this Section

Test data sets
A list of test data sets, and timings for kd-tree creation and nearest neighbor queries.
Download page
Contains the complete source code for the nearest-neighbor-lookup test program.