MetricKnn
Fast Similarity Search using the Metric Space Approach
|
MetricKnn is an open source library to address the problem of similarity search, i.e., to efficiently locate objects that are close to each other according to some distance function. MetricKnn is comprised of:
For more information and downloads please visit the project's homepage: http://metricknn.org/
The Command Line Interface (CLI) is a powerful command that can be used to perform different types of indexing and searches without requiring to compile any source code. It provides options for defining datasets, distance function and index structure.
In order to use it, you have to download it, uncompress the zip file to a folder, open a terminal (in windows Run->cmd.exe), change to the created folder (in windows could be cd C:\Users\My_User\Desktop
) and run the command metricknn
.
In order to show a brief help, the command can be invoked without parameters. A detailed help can be shown with parameter -help
:
The command options can be divided in six types:
It is possible to save options in configuration files in order to simplify the run of consecutive experiments.
The generation of random vectors needs five parameters: number of vectors, dimensionality, coordinates minimum and maximum, and datatype. For example, the following line generates three 4-d vectors, each dimension in the range [0, 10), then generates 1000 4-d vectors with integer values, and then computes the three nearest neighbors using the euclidean distance:
If you want to perform different experiments with the same random vectors, the datasets must be saved. The following example generates random vectors and saves them to files query and ref:
The similarity search using the previously saved datasets is performed by restoring datasets as follows:
Vectors can also be read from text files. The following example parses float vectors in query.txt and reference.txt and then prints the three closest vectors:
The text file format is one vector per line, each dimension separated by one or more tabs or spaces, e.g. for 3 vectors 4-d:
-query_save_dataset
and -reference_save_dataset
).The comparisons of objects in the query dataset and objects in the reference datasets are performed by a distance function. MetricKnn provides several distance functions.
The distances are identified by a single code. Some distances may need parameters, which can be introduced by appending them to the distance id, following the format: ID,parameter1=value1,parameter2=value2
.
For example, some valid distances are:
The list of available distances and their possible parameters can be listed:
Which produces the following output:
The following command shows a more detailed help for a given distance:
The search index corresponds to the algorithm used comparing objects and any data structures that may be built in order to improve the search performance. MetricKnn provides some index methods, each one requiring different parameters.
The indexes are identified by a single code. The code for the default index is LINEARSCAN
. An index may need parameters either for the construction (included in the -index
argument) and for the search (included in the -search
argument).
The exact search of LAESA
index with two pivots on the previous dataset:
The approximate search of LAESA
index with two pivots, computing only 5% of distances:
The approximate search with the k-d tree on the same dataset:
The list of available indexes and their possible parameters can be listed:
Which produces the following output:
The following command shows a more detailed help for a given index:
Instead of building the index on every search, it can be saved and then loaded with parameters -index_save_file
filename and -index_load_file
filename.
The search parameters (-knn
-range
-search
) can also be saved and loaded with -search_save_file
filename and -search_load_file
filename
The concatenation consists in joining \( m \) datasets of sizes \( \{ n_1,...,n_m \} \) in order to produce a single larger dataset of size \( \sum_{i=1}^{m} n_i \). All the datasets must be the same domain, i.e., the objects in all the datasets must be identical type.
For example, the following command produces a larger reference dataset by concatenating two datasets loaded from files:
The concatenation can be performed in both query and reference datasets, e.g.:
The combination consists in joining \( m \) datasets of size \( n \) in order to produce a single dataset of size \( n \) whose objects are arrays of objects of length \( m \). In order to compare those arrays of objects, the distance MULTIDISTANCE
computes a linear combination of distances.
For example, the following command creates combined datasets, where each object is an array of size 2 whose first value is a 3-d vector and the second is a 2-d vector. The distance to compare those objects is \( 0.5 * L2 \) distance between first objects plus \( 2 * L1 \) distance between second objects.
The MULTIDISTANCE
is a linear combination of metrics, therefore it is also a metric. This is relevant because this combined distance can also be used by metric access methods, hence the following command produces the same result:
The list of options currently supported by the command line are:
Usage: metricknn [options] ... MetricKnn is a library for similarity search. GENERAL OPTIONS -help Shows this detailed help. -version Prints version and exits. QUERY DATASET OPTIONS The dataset with query objects. -query_strings_file [filename] Loads strings from file 'filename'. File format: one string per line. The file parsing may take a lot of time, consider using option -query_save_file in order to generate a binary file to be loaded in future experiments. This option can be repeated in order to load different files. In that case, one of the options -query_concatenate or -query_combine must be set. -query_vectors_file [filename] [dim_datatype] Loads vectors from file 'filename'. File format: one vector per line, each coordinate separated by one or more space or tabs. Each coordinate is casted to a value in 'dim_datatype', which impacts the needed memory to store data. Valid datatypes are: Floating point types: FLOAT, DOUBLE. Integer signed types: INT8, INT16, INT32, INT64. Integer unsigned types: UINT8, UINT16, UINT32, UINT64. If text parsing takes a lot of time, consider using option -query_save_file in order to generate a binary file to be loaded in future experiments. This option can be repeated in order to load different files. In that case, one of the options -query_concatenate or -query_combine must be set. -query_vectors_random [num_vectors] [dimensions] [dim_min_value] [dim_max_value] [dim_datatype] A dataset is created by generating 'num_vectors' vectors of length 'dimensions'. Each coordinate is a number between 'dim_min_value' (included) and 'dim_max_value' (excluded). Each coordinate is casted to a value in 'dim_datatype', which impacts the needed memory to store data. Valid datatypes are the same than in option -query_vectors_file. The random generation may take a lot of time, consider using option -query_save_file in order to generate a binary file to be loaded in future experiments. This option can be repeated in order to generate different sets. In that case, one of the options -query_concatenate or -query_combine must be set. -query_restore_dataset [filename] Loads objects from file 'filename', which must have been created by -query_save_dataset. The objects in 'filename' are stored in binary format for efficient loading. This option can be repeated in order to load different files. In that case, one of the options -query_concatenate or -query_combine must be set. -query_concatenate If the input considers more than one source of objects, i.e., one or more occurrences of -query_strings_file, -query_vectors_file, -query_vectors_random or -query_restore_dataset, all the loaded objects will be CONCATENATED one after the other to produce the final dataset. For example, if dataset A contains N vectors of D coordinates, and dataset B contains M vectors of D coordinates, the concatenation produces a new dataset containing N+M vectors. of D coordinates. In order to produce a valid dataset, all the domains of the object sources must coincide. -query_combine If the input considers more than one source of objects, i.e., one or more occurrences of -query_strings_file, -query_vectors_file, -query_vectors_random or -query_restore_dataset, all the loaded objects will be MIXED to produce a MULTIOBJECT dataset. For example, if dataset A contains N vectors of D coordinates, and dataset B contains N vectors of E coordinates, the combination produces a dataset with N objects, where each object is an array of length 2 whose first value is a D-dimensional vector and second value is an E-dimensional vector. In order to produce a MULTIOBJECT dataset, all the sources must contain the same number of objects, but they can contain different domains. The combined objects must be compared with a multi-metric distance. -query_save_dataset [filename] Saves the loaded objects to file 'filename'. If many datasets have been selected (see options -query_concatenate and -query_combine) all the datasets are saved to 'filename'. -query_print_txt [filename] Print all the objects in the dataset in text format to file 'filename'. REFERENCE DATASET OPTIONS The dataset with objects to search in. -reference_strings_file [filename] -reference_vectors_file [filename] [dim_datatype] -reference_vectors_random [num_vectors] [dimensions] [dim_min_value] [dim_max_value] [dim_datatype] -reference_restore_dataset [filename] -reference_concatenate -reference_combine -reference_save_dataset [filename] -reference_print_txt [filename] Analogous to Query Dataset Options. DISTANCE OPTIONS -distance [distance_string] Distance id and parameters in the format: ID,parameter1=value1,parameter2=value2... -distance_save_file [filename] Filename to save distance. -distance_load_file [filename] Filename to load distance. -list_distances Lists all pre-defined distances and exits. -help_distance [id_distance] Prints help for a given distance and exits. INDEX OPTIONS -index [index_string] Index id and build parameters in the format: ID,parameter1=value1,parameter2=value2.... Default: LINEARSCAN. -index_save_file [filename] Filename to save index. -index_load_file [filename] Filename to load index. -list_indexes Lists all pre-defined indexes and exits. -help_index [id_index] Prints help for a given index and exits. SEARCH OPTIONS -knn [num] Number of nearest neighbors to retrieve for each query object. Default: 2. -range [value] Search radius. A value less or equal than 0 means infinity range. Default: 0. -search [search_string] Index search parameters. -threads [num] Maximum number of parallel threads to resolve searches. Default: number of cores. -search_save_file [filename] Filename to save search parameters. -search_load_file [filename] Filename to load search parameters. OUTPUT OPTIONS -file_output [filename] Print results to a file instead of standard output. -description_query [filename] -description_reference [filename] Loads a file containing a description for each object in the dataset, and in addition to the position of the object prints the line at the same position in 'filename'. There must be as many lines in 'filename' as objects in the dataset. -print_objects_query -print_objects_reference In addition to the position, print the corresponding object. -no_output Do not print any result.
More examples can be reviewed in the examples page.
For more info please visit the project's homepage: http://metricknn.org/