MetricKnn

MetricKnn is an open source library to address the problem of similarity search, i.e., to efficiently locate objects that are close to each other according to some distance function. MetricKnn is comprised of:

Command Line (this document), which is able to perform similarity searches through console commands. Available for Windows and Linux.
C API, which provides the implementation of different indexes, distances and search methods.
C++ API, which is a thin C++ wrapper of the C API.

For more information and downloads please visit the project's homepage: http://metricknn.org/

Command Line Interface Documentation

The Command Line Interface (CLI) is a powerful command that can be used to perform different types of indexing and searches without requiring to compile any source code. It provides options for defining datasets, distance function and index structure.

In order to use it, you have to download it, uncompress the zip file to a folder, open a terminal (in windows Run->cmd.exe), change to the created folder (in windows could be cd C:\Users\My_User\Desktop) and run the command metricknn.

In order to show a brief help, the command can be invoked without parameters. A detailed help can be shown with parameter -help:

metricknn

metricknn -help

The command options can be divided in six types:

The query objects.
The search objects.
The distance function.
(optional) The index structure, which can improve the efficiency of the search.
(optional) The search parameters, including number of nearest neighbors to compute, distance range, and other index-dependent parameters (e.g. level of approximation).
(optional) The output format.

It is possible to save options in configuration files in order to simplify the run of consecutive experiments.

Note: MetricKnn can perform searches using either pre-defined distances and custom distances. However, the command line can only use pre-defined distances. If you need to use a custom distance, you must build a program and link to the C/C++ API.

Generating random vectors

The generation of random vectors needs five parameters: number of vectors, dimensionality, coordinates minimum and maximum, and datatype. For example, the following line generates three 4-d vectors, each dimension in the range [0, 10), then generates 1000 4-d vectors with integer values, and then computes the three nearest neighbors using the euclidean distance:

metricknn
    -query_vectors_random         3 4 0 10 float
    -reference_vectors_random  1000 4 0 10 uint8
    -distance L2
    -knn 3

Note: The previous command as well as all the following examples must be written in a single line.

If you want to perform different experiments with the same random vectors, the datasets must be saved. The following example generates random vectors and saves them to files query and ref:

metricknn
    -query_vectors_random         3 4 0 10 float
    -query_save_dataset query
    -reference_vectors_random  1000 4 0 10 uint8
    -reference_save_dataset ref

The similarity search using the previously saved datasets is performed by restoring datasets as follows:

metricknn
    -query_restore_dataset query
    -reference_restore_dataset ref
    -distance L2
    -knn 3

Loading text files

Vectors can also be read from text files. The following example parses float vectors in query.txt and reference.txt and then prints the three closest vectors:

metricknn
    -query_vectors_file      query.txt     float
    -reference_vectors_file  reference.txt float
    -distance L2
    -knn 3

The text file format is one vector per line, each dimension separated by one or more tabs or spaces, e.g. for 3 vectors 4-d:

3892059  9.9864845   17.1167011  30.6644592
7683105  19.110569   25.5137539  6.0781879
1834602  93.5810928  38.5745049  22.1366367

Note: The parsing of text files can be time-comsuming. In order to accelerate the search it is recommendable to save datasets to binary files (-query_save_dataset and -reference_save_dataset).

Distances

The comparisons of objects in the query dataset and objects in the reference datasets are performed by a distance function. MetricKnn provides several distance functions.

The distances are identified by a single code. Some distances may need parameters, which can be introduced by appending them to the distance id, following the format: ID,parameter1=value1,parameter2=value2.

For example, some valid distances are:

-distance "LP,order=3"

-distance "DPF,order=1,pct_discard=0.1"

-distance "EMD,matrix_dims=2x2,matrix=0;0.5;0.5;0"

The list of available distances and their possible parameters can be listed:

metricknn -list_distances

Which produces the following output:

Available distances for domain VECTOR:
    L1
    L2
    LMAX
    LP,order=[float]
    L2SQUARED
    HAMMING
    CHI2
    HELLINGER
    DPF,order=[float],dims_discard=[int],pct_discard=[float],threshold_discard=[float]
    EMD,normalize_vectors=[true|false],costs_file=[filename],rows=[int],cols=[int],matrix=[val1;...;valn]
    COSINE_SIMILARITY,normalize_vectors=[true|false]
    COSINE_DISTANCE,normalize_vectors=[true|false]
Available distances for domain MULTIOBJECT:
    MULTIDISTANCE,distances=dist1;...,normalization=val1;...,weights=weight1;...,normalization_alpha=[float]

The following command shows a more detailed help for a given distance:

metricknn -help_distance DIST

Search Indexes

The search index corresponds to the algorithm used comparing objects and any data structures that may be built in order to improve the search performance. MetricKnn provides some index methods, each one requiring different parameters.

The indexes are identified by a single code. The code for the default index is LINEARSCAN. An index may need parameters either for the construction (included in the -index argument) and for the search (included in the -search argument).

The exact search of LAESA index with two pivots on the previous dataset:

metricknn
    -query_restore_dataset query
    -reference_restore_dataset ref
    -distance L2
    -knn 3
    -index LAESA,num_pivots=2

The approximate search of LAESA index with two pivots, computing only 5% of distances:

metricknn
    -query_restore_dataset query
    -reference_restore_dataset ref
    -distance L2
    -knn 3
    -index LAESA,num_pivots=2
    -search method=APPROX,approximation=0.05

The approximate search with the k-d tree on the same dataset:

metricknn
    -query_restore_dataset query
    -reference_restore_dataset ref
    -distance L2
    -knn 3
    -index FLANN-KDTREE,num_trees=1
    -search num_checks=1

The list of available indexes and their possible parameters can be listed:

metricknn -list_indexes

Which produces the following output:

Available indexes:
    LINEARSCAN
    LAESA
    SNAKETABLE
    FLANN-LINEARSCAN
    FLANN-KDTREE
    FLANN-LSH
    FLANN-KMEANS
    FLANN-AUTO

The following command shows a more detailed help for a given index:

metricknn -help_index DIST

Instead of building the index on every search, it can be saved and then loaded with parameters -index_save_file filename and -index_load_file filename.

The search parameters (-knn -range -search) can also be saved and loaded with -search_save_file filename and -search_load_file filename

Dataset Concatenation

The concatenation consists in joining \( m \) datasets of sizes \( \{ n_1,...,n_m \} \) in order to produce a single larger dataset of size \( \sum_{i=1}^{m} n_i \). All the datasets must be the same domain, i.e., the objects in all the datasets must be identical type.

For example, the following command produces a larger reference dataset by concatenating two datasets loaded from files:

metricknn
    -query_vectors_file          query1.txt     float
    -reference_concatenate
        -reference_vectors_file  reference1.txt float
        -reference_vectors_file  reference2.txt float
    -distance L2
    -knn 3

The concatenation can be performed in both query and reference datasets, e.g.:

metricknn
    -query_concatenate
        -query_vectors_file      query1.txt     float
        -query_vectors_file      query2.txt     float
    -reference_concatenate
        -reference_vectors_file  reference1.txt float
        -reference_vectors_file  reference2.txt float
    -distance L2
    -knn 3

Dataset Combination

The combination consists in joining \( m \) datasets of size \( n \) in order to produce a single dataset of size \( n \) whose objects are arrays of objects of length \( m \). In order to compare those arrays of objects, the distance MULTIDISTANCE computes a linear combination of distances.

For example, the following command creates combined datasets, where each object is an array of size 2 whose first value is a 3-d vector and the second is a 2-d vector. The distance to compare those objects is \( 0.5 * L2 \) distance between first objects plus \( 2 * L1 \) distance between second objects.

metricknn
    -query_combine
        -query_vectors_random 4 3 0 100 uint8
        -query_vectors_random 4 2 0  10 uint8
    -reference_combine
        -reference_vectors_random 1000000 3 0 100 uint8
        -reference_vectors_random 1000000 2 0  10 uint8
    -distance "MULTIDISTANCE,distances=L2;L1,weights=0.5;2"
    -knn 3

The MULTIDISTANCE is a linear combination of metrics, therefore it is also a metric. This is relevant because this combined distance can also be used by metric access methods, hence the following command produces the same result:

metricknn
    -query_combine
        -query_vectors_random 4 3 0 100 uint8
        -query_vectors_random 4 2 0  10 uint8
    -reference_combine
        -reference_vectors_random 1000000 3 0 100 uint8
        -reference_vectors_random 1000000 2 0  10 uint8
    -distance "MULTIDISTANCE,distances=L2;L1,weights=0.5;2"
    -knn 3
    -index LAESA,num_pivots=2

General Usage

The list of options currently supported by the command line are:

metricknn -help

Usage:
metricknn [options] ...

MetricKnn is a library for similarity search.

GENERAL OPTIONS
-help
Shows this detailed help.

-version
Prints version and exits.

QUERY DATASET OPTIONS
The dataset with query objects.

-query_strings_file [filename]
Loads strings from file 'filename'.
File format: one string per line.
The file parsing may take a lot of time, consider using option -query_save_file in order
to generate a binary file to be loaded in future experiments.
This option can be repeated in order to load different files. In that case, one of the
options -query_concatenate or -query_combine must be set.

-query_vectors_file [filename] [dim_datatype]
Loads vectors from file 'filename'.
File format: one vector per line, each coordinate separated by one or more space or tabs.
Each coordinate is casted to a value in 'dim_datatype', which impacts the needed memory to store data.
Valid datatypes are:
Floating point types: FLOAT, DOUBLE.
Integer signed types: INT8, INT16, INT32, INT64.
Integer unsigned types: UINT8, UINT16, UINT32, UINT64.
If text parsing takes a lot of time, consider using option -query_save_file in order
to generate a binary file to be loaded in future experiments.
This option can be repeated in order to load different files. In that case, one of the
options -query_concatenate or -query_combine must be set.

-query_vectors_random [num_vectors] [dimensions] [dim_min_value] [dim_max_value] [dim_datatype]
A dataset is created by generating 'num_vectors' vectors of length 'dimensions'.
Each coordinate is a number between 'dim_min_value' (included) and 'dim_max_value' (excluded).
Each coordinate is casted to a value in 'dim_datatype', which impacts the needed memory to store data.
Valid datatypes are the same than in option -query_vectors_file.
The random generation may take a lot of time, consider using option -query_save_file in order
to generate a binary file to be loaded in future experiments.
This option can be repeated in order to generate different sets. In that case, one of the
options -query_concatenate or -query_combine must be set.

-query_restore_dataset [filename]
Loads objects from file 'filename', which must have been created by -query_save_dataset.
The objects in 'filename' are stored in binary format for efficient loading.
This option can be repeated in order to load different files. In that case, one of the
options -query_concatenate or -query_combine must be set.

-query_concatenate
If the input considers more than one source of objects, i.e., one or more occurrences of
-query_strings_file, -query_vectors_file, -query_vectors_random or -query_restore_dataset,
all the loaded objects will be CONCATENATED one after the other to produce the final dataset.
For example, if dataset A contains N vectors of D coordinates, and dataset B contains M
vectors of D coordinates, the concatenation produces a new dataset containing N+M vectors.
of D coordinates.
In order to produce a valid dataset, all the domains of the object sources must coincide.

-query_combine
If the input considers more than one source of objects, i.e., one or more occurrences of
-query_strings_file, -query_vectors_file, -query_vectors_random or -query_restore_dataset,
all the loaded objects will be MIXED to produce a MULTIOBJECT dataset.
For example, if dataset A contains N vectors of D coordinates, and dataset B contains N vectors
of E coordinates, the combination produces a dataset with N objects, where each object is an array
of length 2 whose first value is a D-dimensional vector and second value is an E-dimensional vector.
In order to produce a MULTIOBJECT dataset, all the sources must contain the same number of objects,
but they can contain different domains. The combined objects must be compared with a multi-metric distance.

-query_save_dataset [filename]
Saves the loaded objects to file 'filename'. If many datasets have been selected (see
options -query_concatenate and -query_combine) all the datasets are saved to 'filename'.

-query_print_txt [filename]
Print all the objects in the dataset in text format to file 'filename'.

REFERENCE DATASET OPTIONS
The dataset with objects to search in.

-reference_strings_file [filename]
-reference_vectors_file [filename] [dim_datatype]
-reference_vectors_random [num_vectors] [dimensions] [dim_min_value] [dim_max_value] [dim_datatype]
-reference_restore_dataset [filename]
-reference_concatenate
-reference_combine
-reference_save_dataset [filename]
-reference_print_txt [filename]
Analogous to Query Dataset Options.

DISTANCE OPTIONS
-distance [distance_string]
Distance id and parameters in the format: ID,parameter1=value1,parameter2=value2...

-distance_save_file [filename]
Filename to save distance.

-distance_load_file [filename]
Filename to load distance.

-list_distances
Lists all pre-defined distances and exits.

-help_distance [id_distance]
Prints help for a given distance and exits.

INDEX OPTIONS
-index [index_string]
Index id and build parameters in the format: ID,parameter1=value1,parameter2=value2....
Default: LINEARSCAN.

-index_save_file [filename]
Filename to save index.

-index_load_file [filename]
Filename to load index.

-list_indexes
Lists all pre-defined indexes and exits.

-help_index [id_index]
Prints help for a given index and exits.

SEARCH OPTIONS
-knn [num]
Number of nearest neighbors to retrieve for each query object.
Default: 2.

-range [value]
Search radius. A value less or equal than 0 means infinity range.
Default: 0.

-search [search_string]
Index search parameters.

-threads [num]
Maximum number of parallel threads to resolve searches.
Default: number of cores.

-search_save_file [filename]
Filename to save search parameters.

-search_load_file [filename]
Filename to load search parameters.

OUTPUT OPTIONS
-file_output [filename]
Print results to a file instead of standard output.

-description_query [filename]
-description_reference [filename]
Loads a file containing a description for each object in the dataset, and in addition
to the position of the object prints the line at the same position in 'filename'.
There must be as many lines in 'filename' as objects in the dataset.

-print_objects_query
-print_objects_reference
In addition to the position, print the corresponding object.

-no_output
Do not print any result.

More Information

More examples can be reviewed in the examples page.

For more info please visit the project's homepage: http://metricknn.org/

Table of Contents