MetricKnn API
Fast Similarity Search using the Metric Space Approach
|
MetricKnn is an open source library to address the problem of similarity search, i.e., to efficiently locate objects that are close to each other according to some distance function. MetricKnn is comprised of:
For more information and downloads please visit the project's homepage: http://metricknn.org/
The following list contains the most relevant types defined by the library:
All the methods in the library start with the prefix mknn_[type]_method(...)
, all the object types are named with the prefix MknnType
and all the constants are named MKNN_CONSTANT
. The object types are represented by pointers to opaque structs defined by a typedef. The only exception of MknnResultQuery which is a non-opaque struct for fast reading the results.
mknn_Module_newFoo
) must be released by a destructor (methods mknn_Module_release
). In order to simplify the release of objects, some methods accept a parameter free_on_release to bind the lifetime of one object to the other, thus the release of one object triggers the release of the other.In order to use the library to perform a similarity search, the source code usually follows six steps:
void*
pointers) as long as a custom distance compares them.Steps 1-3 are part of the offline phase, i.e., they must be completed prior to attend any search request, and Steps 4-6 are part of the online phase, i.e., they can be run by different users in parallel in order to perform a search.
Parse vectors in a text file and create a MknnDataset:
The format of the text file is one vector per line, each dimension separated by one or more tabs or spaces, e.g., a file with three vectors 4-d:
The Euclidean distance can be instantiated by:
The Linear Scan index can be instantiated by the following method. Note that the reference_set
and distance
has been bound to the lifetime of the index object, hence when index
is released, reference_set
and distance
will also be released:
The resolver is created by choosing the search according to the index, in this case an exact search for the linear scan. The search will return the two nearest neighbors at a distance less or equal than 50. The search method can use at most 4 parallel threads.
The query set will be a set of 20 random vectors 4-d, where each dimension is a float number between 0 and 100 (excluded):
In order to start a similarity search, the query_set
and resolver
must be given to the search method:
Note that the second and fourth parameters of mknn_resolver_search
method means that the lifetime of resolver
and query_set
objects are bound to the created result
object.
The located nearest neighbors are stored in the result
object:
Finally, the result
object must be released (thus the query_set and the resolver will also be released), and the index must be released (thus the reference_set and the distance are also released):
The full example is in example_linearscan.c
The source code can be compiled with the command:
Or if you use the pkg-config util, you can compile the example with the following command:
The linear scan computes the exact nearest neighbors. The same result can be achieved by LAESA but usually faster.
In order to replace the linear scan by LAESA index, the construction of the index and the resolver must use different parameters:
The index uses five pivot objects. The exact search requires the same parameters than the linear scan:
Ths similarity search using this resolver
return the same result as the linear scan. The actual performance of LAESA'search is related to the number of pivots, the distance, and the distribution of the distances of the objects.
LAESA is able to perform an approximate search, which can reduce drastically the search time at the cost of missing some nearest neighbors. In order to resolve the search but computing just a 5% of distances, the parameters of the resolver
are:
MetricKnn can perform similarity searches using custom distances. Moreover, any metric index can use the custom distance and improve the search efficiency. If the custom distance satisfies the metric properties, then search will return the exact k-NN otherwise it will return an approximated result.
A custom distance is created by implementing a method with the signature defined by the typedef mknn_function_distanceEval_eval. Additionally, if the distance requires some pre-computing, two methods with the signatures mknn_function_distanceEval_createState and mknn_function_distanceEval_releaseState can be defined.
The following code shows a custom distance for float vectors that applies a exponential weight to each dimension difference:
The distance signature requires a parameter current_threshold
. This parameter stores the current distance to the k-th candidate (or the constant DBL_MAX
for the first k objects). Therefore, if during the distance computation the partial result of the distance is higher than current_threshold
, it is possible to perform an early termination by returning any value >= current_threshold.
Once defined the distance method, the distance object can be instantiated with the following statement:
The MknnDomain stores the type of the objects in a dataset. The method mknn_dataset_getDomain can be used to retrieve the type of objects stored in a MknnDataset. Prior to invoke a custom distance, the search resolver declares the MknnDomain of the objects that will be compared. Therefore, a custom distance can be used to implement a distance function independent of object types.
During the similarity search, the resolver can start several threads to compute distances but each thread will invoke create_state
to create a state_distEval
that will not be shared with other threads. However, if user_distance
modifies some global variable it will be necessary to consider a proper synchronization (e.g. pthread_mutex_t
).
In the previous example, the user_distance
method still statically casts the void*
object to a float*
. If you want to implement a generic distance for any domain, the mknn_distance_newCustomFactory enables to dynamically change the distance method depending on the domains. In that case, it is possible to choose an efficient implementation of the distance depending on the domain of the objects.
The full example is in example_custom_distance.c
MetricKnn enables to define custom implementations of datasets. A custom dataset can be defined by the method mknn_datasetLoader_Custom which requires a pointer to a data object and pointers to different methods that receives that data object and performs the required actions: get the dataset size, get an object from the dataset, add an object to the dataset (optional), release the data object (optional). Optionally, the mknn_datasetLoader_Custom receives the MknnDomain with the type of the objects in the custom dataset.
For example, if we define the object People
containing of a list of Person
, the operations for getting the size and retrieving each person are:
The definition of the custom dataset also needs the definition of the MknnDomain (which is optional):
Now we can define a custom distance which compares two Person
(i.e., the object returned by method people_getObject
):
The instantiation of the custom MknnDistance and the query dataset is:
Now we can retrieve for each Person in custom_query
the K most similar Person in custom_dataset
according to the distance compare_people
. Note that compare_people
computes the difference of the field age
thus it satisfies the metric properties. In this case we can use a Metric Index to obtain an identical result as the Linear Scan but much faster:
The full example is in example_custom_dataset_and_distance.c
More examples can be reviewed in the examples page.
The full API can be browsed in the files section.
For more info please visit the project's homepage: http://metricknn.org/