MetricKnn API
Fast Similarity Search using the Metric Space Approach

Table of Contents

MetricKnn

MetricKnn is an open source library to address the problem of similarity search, i.e., to efficiently locate objects that are close to each other according to some distance function. MetricKnn is comprised of:

For more information and downloads please visit the project's homepage: http://metricknn.org/

C API Documentation

The following list contains the most relevant types defined by the library:

All the methods in the library start with the prefix mknn_[type]_method(...), all the object types are named with the prefix MknnType and all the constants are named MKNN_CONSTANT. The object types are represented by pointers to opaque structs defined by a typedef. The only exception of MknnResultQuery which is a non-opaque struct for fast reading the results.

Object Lifetime
In order to avoid memory leaks, every object created by a constructor (methods mknn_Module_newFoo) must be released by a destructor (methods mknn_Module_release). In order to simplify the release of objects, some methods accept a parameter free_on_release to bind the lifetime of one object to the other, thus the release of one object triggers the release of the other.

Common Steps

In order to use the library to perform a similarity search, the source code usually follows six steps:

  1. Create the MknnDataset with all the objects to search in. There are several methods for loading datasets, see mknn_dataset_loader.h. Additionally, MetricKnn is able to use custom objects (defined as void* pointers) as long as a custom distance compares them.
  2. Instantiate a MknnDistance with the desired distance function by invoking mknn_distance_newPredefined or mknn_distance_newCustom. The parameters for the pre-defined distances can be reviewed in mknn_predefined_distances.h. The custom distances can be used to compare any type of objects.
  3. Instantiate a MknnIndex by invoking mknn_index_newPredefined, which requires a MknnDataset, a MknnDistance, and index-specific construction parameters. The parameters for the available indexes can be reviewed in mknn_predefined_indexes.h.
  4. Instantiate a MknnResolver by invoking mknn_index_newResolver, which requires index-specific search parameters. The parameters for the available indexes can be reviewed in mknn_predefined_indexes.h. A single index can be used to instantiate many resolvers.
  5. Create a MknnDataset with the query objects and start the search by invoking mknn_resolver_search.
  6. Read the result of the search.

Steps 1-3 are part of the offline phase, i.e., they must be completed prior to attend any search request, and Steps 4-6 are part of the online phase, i.e., they can be run by different users in parallel in order to perform a search.

Example

Parse vectors in a text file and create a MknnDataset:

The format of the text file is one vector per line, each dimension separated by one or more tabs or spaces, e.g., a file with three vectors 4-d:

85.3892059 9.9864845 17.1167011 30.6644592
91.7683105 19.110569 25.5137539 6.0781879
21.1834602 93.5810928 38.5745049 22.1366367

The Euclidean distance can be instantiated by:

The Linear Scan index can be instantiated by the following method. Note that the reference_set and distance has been bound to the lifetime of the index object, hence when index is released, reference_set and distance will also be released:

MknnIndex *index = mknn_index_newPredefined(mknn_predefIndex_LinearScan_indexParams(), true, reference_set, true, distance, true);

The resolver is created by choosing the search according to the index, in this case an exact search for the linear scan. The search will return the two nearest neighbors at a distance less or equal than 50. The search method can use at most 4 parallel threads.

The query set will be a set of 20 random vectors 4-d, where each dimension is a float number between 0 and 100 (excluded):

In order to start a similarity search, the query_set and resolver must be given to the search method:

MknnResult *result = mknn_resolver_search(resolver, true, query_set, true);

Note that the second and fourth parameters of mknn_resolver_search method means that the lifetime of resolver and query_set objects are bound to the created result object.

The located nearest neighbors are stored in the result object:

for (i = 0; i < mknn_result_getNumQueries(result); ++i) {
void *query_object = mknn_dataset_getObject(query_set, i);
for (j = 0; j < resq->num_nns; ++j) {
void *database_object = mknn_dataset_getObject(reference_set, resq->nn_position[j]);
double distance = resq->nn_distance[j];
}
}

Finally, the result object must be released (thus the query_set and the resolver will also be released), and the index must be released (thus the reference_set and the distance are also released):

The full example is in example_linearscan.c

Code Compilation

The source code can be compiled with the command:

gcc -I/path-to-metricknn/include example.c -L/path-to-metricknn/lib -lmetricknn

Or if you use the pkg-config util, you can compile the example with the following command:

gcc `pkg-config --cflags metricknn` example.c `pkg-config --libs metricknn`

Metric Indexes

The linear scan computes the exact nearest neighbors. The same result can be achieved by LAESA but usually faster.

In order to replace the linear scan by LAESA index, the construction of the index and the resolver must use different parameters:

MknnIndex *index = mknn_index_newPredefined(mknn_predefIndex_Laesa_indexParams(5), true, reference_set, true, distance, true);

The index uses five pivot objects. The exact search requires the same parameters than the linear scan:

Ths similarity search using this resolver return the same result as the linear scan. The actual performance of LAESA'search is related to the number of pivots, the distance, and the distribution of the distances of the objects.

LAESA is able to perform an approximate search, which can reduce drastically the search time at the cost of missing some nearest neighbors. In order to resolve the search but computing just a 5% of distances, the parameters of the resolver are:

Domains and Custom Distances

MetricKnn can perform similarity searches using custom distances. Moreover, any metric index can use the custom distance and improve the search efficiency. If the custom distance satisfies the metric properties, then search will return the exact k-NN otherwise it will return an approximated result.

A custom distance is created by implementing a method with the signature defined by the typedef mknn_function_distanceEval_eval. Additionally, if the distance requires some pre-computing, two methods with the signatures mknn_function_distanceEval_createState and mknn_function_distanceEval_releaseState can be defined.

The following code shows a custom distance for float vectors that applies a exponential weight to each dimension difference:

struct StateDist {
int dimensions;
double *weights;
};
void *create_state(MknnDomain *domain_left, MknnDomain *domain_right) {
struct StateDist *state = (struct StateDist *) malloc(sizeof(struct StateDist));
state->dimensions = mknn_domain_getVectorNumDimensions(domain_left);
state->weights = malloc(state->dimensions * sizeof(double));
for (int i = 0; i < state->dimensions; ++i)
state->weights[i] = pow(0.99, i);
return state;
}
double user_distance(void *state_distEval, void *object_left, void *object_right, double current_threshold) {
struct StateDist *state = (struct StateDist *) state_distEval;
float *array1 = (float*) object_left;
float *array2 = (float*) object_right;
double dist = 0;
for (int i = 0; i < state->dimensions; ++i) {
dist += state->weights[i] * fabs(array1[i] - array2[i]);
if (dist > current_threshold)
return dist;
}
return dist;
}
void release_state(void *state_distEval) {
struct StateDist *state = (struct StateDist *) state_distEval;
free(state->weights);
free(state);
}

The distance signature requires a parameter current_threshold. This parameter stores the current distance to the k-th candidate (or the constant DBL_MAX for the first k objects). Therefore, if during the distance computation the partial result of the distance is higher than current_threshold, it is possible to perform an early termination by returning any value >= current_threshold.

Once defined the distance method, the distance object can be instantiated with the following statement:

MknnDistance *distance = mknn_distance_newCustom(create_state, user_distance, release_state);

The MknnDomain stores the type of the objects in a dataset. The method mknn_dataset_getDomain can be used to retrieve the type of objects stored in a MknnDataset. Prior to invoke a custom distance, the search resolver declares the MknnDomain of the objects that will be compared. Therefore, a custom distance can be used to implement a distance function independent of object types.

During the similarity search, the resolver can start several threads to compute distances but each thread will invoke create_state to create a state_distEval that will not be shared with other threads. However, if user_distance modifies some global variable it will be necessary to consider a proper synchronization (e.g. pthread_mutex_t).

In the previous example, the user_distance method still statically casts the void* object to a float*. If you want to implement a generic distance for any domain, the mknn_distance_newCustomFactory enables to dynamically change the distance method depending on the domains. In that case, it is possible to choose an efficient implementation of the distance depending on the domain of the objects.

The full example is in example_custom_distance.c

Custom Datasets and Custom Distances

MetricKnn enables to define custom implementations of datasets. A custom dataset can be defined by the method mknn_datasetLoader_Custom which requires a pointer to a data object and pointers to different methods that receives that data object and performs the required actions: get the dataset size, get an object from the dataset, add an object to the dataset (optional), release the data object (optional). Optionally, the mknn_datasetLoader_Custom receives the MknnDomain with the type of the objects in the custom dataset.

For example, if we define the object People containing of a list of Person, the operations for getting the size and retrieving each person are:

struct Person {
int age;
};
struct People {
struct Person *array;
int size;
};
struct People *newPeople(int size) {
struct People *people = malloc(sizeof(struct People));
people->size = size;
people->array = malloc(size * sizeof(struct Person));
for (int i = 0; i < size; ++i) {
struct Person *person = people->array + i;
person->age = 20 + rand() % 80;
}
return people;
}
static int64_t people_getNumObjects(void *data_pointer) {
struct People *people = (struct People *) data_pointer;
return people->size;
}
static void *people_getObject(void *data_pointer, int64_t pos) {
struct People *people = (struct People *) data_pointer;
struct Person *person = people->array + pos;
return person;
}
static void people_release(void *data_pointer) {
struct People *people = (struct People *) data_pointer;
free(people->array);
free(people);
}

The definition of the custom dataset also needs the definition of the MknnDomain (which is optional):

struct People *people = newPeople(1000);
MknnDataset *custom_dataset = mknn_datasetLoader_Custom(people, people_getNumObjects, people_getObject, NULL, people_release, NULL, false);

Now we can define a custom distance which compares two Person (i.e., the object returned by method people_getObject):

double compare_people(void *state_distEval, void *object_left, void *object_right, double current_threshold) {
struct Person *person_1 = (struct Person*) object_left;
struct Person *person_2 = (struct Person*) object_right;
return abs(person_1->age - person_2->age);
}

The instantiation of the custom MknnDistance and the query dataset is:

MknnDistance *custom_distance = mknn_distance_newCustom(NULL, compare_people, NULL);
struct People *few_people = newPeople(10);
MknnDataset *custom_query = mknn_datasetLoader_Custom(few_people, people_getNumObjects, people_getObject, NULL, people_release, NULL, false);

Now we can retrieve for each Person in custom_query the K most similar Person in custom_dataset according to the distance compare_people. Note that compare_people computes the difference of the field age thus it satisfies the metric properties. In this case we can use a Metric Index to obtain an identical result as the Linear Scan but much faster:

MknnIndex *index = mknn_index_newPredefined(mknn_predefIndex_Laesa_indexParams(1), true, custom_dataset, false, custom_distance, false);
MknnResult *result = mknn_resolver_search(resolver, false, custom_query, false);

The full example is in example_custom_dataset_and_distance.c

More Information

More examples can be reviewed in the examples page.

The full API can be browsed in the files section.

For more info please visit the project's homepage: http://metricknn.org/

Powered by Download MetricKnn