Academia.edu

Metric Access Methods

31 papers
2 followers
About this topic
Metric access methods are techniques used in computer science and information retrieval to efficiently store, retrieve, and manage data based on distance or similarity measures in metric spaces. These methods facilitate operations such as nearest neighbor search and clustering by organizing data structures that optimize query performance in high-dimensional spaces.

Key research themes

1. How can indexing structures be designed and optimized to efficiently support similarity search in generic metric spaces?

This research area focuses on developing, refining, and empirically benchmarking data structures (metric access methods) to support efficient similarity search in metric spaces, which accommodate diverse data types and non-Euclidean distance functions. The goal is to accelerate queries such as nearest neighbor or range queries by leveraging metric space properties (notably triangle inequality) combined with effective partitioning, clustering, pivot selection, and disk-memory-aware structures. Optimizing construction, update, storage size, and search efficiency in secondary memory and high-dimensional settings is central to this theme, affecting domains like multimedia retrieval, image databases, and more.
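The pruning role of the triangle inequality can be illustrated with a minimal pivot-table sketch. It is not taken from any of the papers listed here: the class name `PivotTable`, its methods, and the sample points are all hypothetical. For a query q, a stored object o, and a pivot p, the triangle inequality gives |d(q, p) - d(o, p)| <= d(q, o), so any object whose lower bound exceeds the query radius can be discarded without computing d(q, o).

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class PivotTable:
    """Hypothetical in-memory index storing object-to-pivot distances."""

    def __init__(self, objects: Sequence[T], pivots: Sequence[T],
                 dist: Callable[[T, T], float]) -> None:
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precompute d(o, p) for every object o and pivot p at build time.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, query: T, radius: float) -> Tuple[List[T], int]:
        """Return (objects within radius, number of distance computations)."""
        q_to_pivots = [self.dist(query, p) for p in self.pivots]
        computed = len(self.pivots)
        results: List[T] = []
        for obj, row in zip(self.objects, self.table):
            # Lower bound on d(query, obj) from the triangle inequality.
            lower = max((abs(dq - do) for dq, do in zip(q_to_pivots, row)),
                        default=0.0)
            if lower > radius:
                continue  # pruned without computing the real distance
            computed += 1
            if self.dist(query, obj) <= radius:
                results.append(obj)
        return results, computed


# Assumed toy data: 2D points under the Euclidean metric.
points = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0)]
index = PivotTable(points, pivots=[(0, 0), (10, 10)], dist=math.dist)
hits, computed = index.range_query((1, 0), radius=1.5)
print(hits, computed)
```

The sketch only uses the distance function as a black box, so any metric (edit distance, for instance) could be substituted for `math.dist`. The tree-based and disk-aware structures discussed in this theme organize the same kind of bounds hierarchically.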

Key finding: This comprehensive survey summarizes a wide range of exact similarity search indexes in metric spaces, providing an extensive categorization of partitioning, pruning, and validation techniques fundamental to accelerating...
Key finding: This work introduces three novel dynamic metric indexes designed for secondary memory that support insertions and deletions in medium-to-high dimensional spaces. It extends in-memory structures like DSAT and LC into...
Key finding: This study identifies key shortcomings in the original Hierarchical Cellular Tree (HCT) design, revising the definition of covering radius to reflect maximum subtree distances and redesigning the retrieval scheme to more...

2. What algorithmic strategies enable efficient approximate self-similarity joins and similarity joins in metric spaces?

This theme investigates computational algorithms to efficiently find similar object pairs within metric spaces, focusing on approximations to enhance scalability and applicability, especially when exact joins and self-joins are computationally prohibitive. Emphasis is placed on balancing query expressivity (like kNN joins), computational complexity, pruning techniques utilizing metric properties, and trade-offs between index utilization and direct computation. These studies have implications for multimedia retrieval, pattern recognition, and data mining, addressing challenges in handling complex or high-dimensional data.
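As a point of comparison for the approximate methods in this theme, the following is a minimal sketch of an exact range self-join that uses a single pivot for pruning. It is an illustrative baseline, not the algorithm of any paper listed here; the function name `pivot_range_self_join` and the sample data are hypothetical. Objects are sorted by their distance to the pivot, and a pair is compared only if the two pivot distances differ by at most eps, since the triangle inequality makes that difference a lower bound on the pair's distance.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pivot_range_self_join(
    objects: Sequence[T], pivot: T,
    dist: Callable[[T, T], float], eps: float,
) -> Tuple[List[Tuple[int, int]], int]:
    """Return (index pairs with distance <= eps, distance computations)."""
    # One distance computation per object to obtain its pivot distance.
    keyed = sorted((dist(o, pivot), i) for i, o in enumerate(objects))
    computed = len(objects)
    pairs: List[Tuple[int, int]] = []
    for a in range(len(keyed)):
        da, ia = keyed[a]
        for b in range(a + 1, len(keyed)):
            db, ib = keyed[b]
            # |db - da| <= d(objects[ia], objects[ib]); the list is sorted,
            # so every later candidate is at least this far away as well.
            if db - da > eps:
                break
            computed += 1
            if dist(objects[ia], objects[ib]) <= eps:
                pairs.append((min(ia, ib), max(ia, ib)))
    return sorted(pairs), computed


# Assumed toy data: 2D points under the Euclidean metric.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
pairs, computed = pivot_range_self_join(data, pivot=(0, 0),
                                        dist=math.dist, eps=1.5)
print(pairs, computed)
```

The worst case of this sketch is still quadratic (for example, when all objects are equidistant from the pivot), which is the kind of cost the approximate joins studied in this theme aim to avoid.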

Key finding: This paper proposes a novel heuristic algorithm approximating k-nearest neighbor self-similarity joins in metric spaces, achieving worst-case O(n^{3/2}) distance computations, significantly improving over the naïve quadratic...

3. How do non-metric similarity models and domain-specific indexing impact specialized retrieval tasks such as tandem mass spectrometry identification?

This theme revolves around adapting similarity search techniques—especially in metric and non-metric spaces—to specialized domains with unique characteristics, focusing on tandem mass spectrometry for protein/peptide identification. Research addresses the design of non-metric similarity measures, indexing adaptations, clustering preprocessing, and approximate search methods to tackle challenges such as PTMs and noisy data. Theoretical foundations and applied frameworks that accelerate biochemical sequence identification are central, demonstrating metric space querying principles applied in bioinformatics.

by Jiri Novak and 1 more
Key finding: The paper demonstrates that applying clustering as a preprocessing step to tandem mass spectra substantially accelerates non-metric similarity searches based on M-tree and TriGen algorithms by over 100x compared to sequential...
Key finding: SimTandem implements a non-metric similarity search framework utilizing parameterized Hausdorff distance and non-metric access methods to accelerate identification of protein and peptide sequences from tandem mass spectra. By...
Key finding: This work introduces the parameterized Hausdorff distance as an effective non-metric similarity measure tailored for tandem mass spectra comparison. It models spectral similarity with robustness to noise and modifications,...

All papers in Metric Access Methods

A study on properties of data sets representing public domain audio and visual content and their relation to their indexability is presented. Data analysis considers the pair-wise distance distributions and various techniques to estimate...
Medical exams, such as CT scans and mammograms, are obtained and stored every day in hospitals all over the world, including images, patient data, and medical reports. It is paramount to have tools and systems to improve computer-aided...
Machine learning, data mining and statistics are used to analyze the data and to build models from them. Data privacy for big data needs to find a compromise between data analysis and disclosure risk. Privacy by design machine learning...
Metric Access Methods (MAM) are employed to accelerate the processing of similarity queries, such as the range and the k-nearest neighbor queries. Current methods improve the query performance by minimizing the number of disk accesses,...
The mass of digitized images on the internet has attracted considerable research effort toward managing visual data and developing tools for its fast and effective retrieval. Each and every internet user tries to...
De novo sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the high-throughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current...
Metric Access Methods (MAMs) are indexing techniques which allow working in generic metric spaces. Therefore, MAMs are especially useful for Content-Based Image Retrieval systems based on features which use non-Lp norms as similarity...
Many applications could benefit from accurately predicting an entity's behavior. For example, researchers have developed methods to predict a terrorist organization's probable actions (such as bombings or kidnappings) [1,2]. Likewise, we...
With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when...
In this paper we present the Slim-tree, a dynamic tree for organizing metric datasets in pages of fixed size. The Slim-tree uses the "fat-factor" which provides a simple way to quantify the degree of overlap between the nodes in a metric...
Motivation: We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric space indexing algorithm to produce an initial candidate set, which can be followed by any...
Searching in a dataset for objects that are similar to a given query object is a fundamental problem for several applications that use complex data. The general problem of many similarity measures for complex objects is their...
Searching in a dataset for objects that are similar, with respect to a distance, to a given query object is a fundamental problem for several applications that use complex data, e.g., strings, graphs. The main difficulties are to focus...
Storing multidimensional data in databases is an important topic both in the industrial and scientific database communities. Arrays are offered as a multidimensional data structure by most programming languages. Conventional database...
We introduce a method of searching the k nearest neighbours (k-NN) using PM-tree. The PM-tree is a metric access method for similarity search in large multimedia databases. As an extension of M-tree, the structure of PM-tree exploits...
The attached version is that of a research report (CEDRIC Research Report no. 1892). To be put online in the first week of May. A study on properties of data sets representing public domain audio and visual content and their relation to...
Retrieval of images based on their contents is a process that requires comparisons of a given query (image) with virtually all the images stored in a database with respect to a given distance function. But this is inapplicable on large...
In the area of Text Retrieval, processing a query in the vector model has been verified to be qualitatively more effective than searching in the boolean model. However, in case of the classic vector model the current methods of processing...
As the volume of multimedia data available on the internet is increasing tremendously, content-based similarity search is becoming a popular approach to multimedia retrieval. The most popular retrieval concept is the k nearest neighbor (kNN)...
Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management is still a resisting research...
The quadratic form distance (QFD) has been utilized as an effective similarity function in multimedia retrieval, in particular, when a histogram representation of objects is used. Unlike the widely used Euclidean distance, the QFD allows...
Hisashi KURASAWA, Daiji FUKAGAWA, Atsuhiro TAKASU, and Jun ADACHI. SUMMARY: When developing an index for a similarity search in metric spaces, how to divide the space for effective search pruning is a...
Recently, permutation based indexes have attracted interest in the area of similarity search. The basic idea of permutation based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or...
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index...
High-throughput proteomics experiments typically generate large amounts of peptide fragmentation mass spectra during a single experiment. There is often a substantial amount of redundant fragmentation of the same precursors among these...
Similarity search is a widely employed technique in Pattern Recognition. In order to speed up the search many indexing techniques have been proposed. However, the majority of the proposed techniques are static, that is, a fixed training...
Many fast similarity search techniques rely on the use of pivots (specially selected points in the data set). Using these points, specific structures (indexes) are built, speeding up the search when querying. Usually, pivot selection...
Protein identification is an important objective for proteomic and medical sciences, as well as for the pharmaceutical industry. With recent large-scale automation of genome sequencing and the explosion of protein databases, it is important...
The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence...
Tandem mass spectrometry is a well-known technique for identification of protein sequences from an "in vitro" sample. To identify the sequences from spectra captured by a spectrometer, the similarity search in a database of hypothetical...
The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra generated by shotgun proteomics. Since query spectra contain...
SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass...
The M-tree is a dynamic data structure designed to index metric datasets. In this paper we introduce two dynamic techniques of building the M-tree. The first one incorporates a multi-way object insertion while the second one exploits the...
In this paper we introduce the Pivoting M-tree (PM-tree), a metric access method combining M-tree with the pivot-based approach. While in M-tree a metric region is represented by a hyper-sphere, in PM-tree the shape of a metric...
Tandem mass spectrometry (MS/MS) involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyze complex mixtures of proteins. In the last decades, many specific techniques for identifying...
Identifying peptides, which are short polymeric chains of amino acid residues in a protein sequence, is of fundamental importance in systems biology research. The most popular approach to identify peptides is through database search. In...
De novo sequencing of peptides poses one of the most challenging tasks in data analysis for proteome research. In this paper, a generative hidden Markov model (HMM) of mass spectra for de novo peptide sequencing which constitutes a novel...
Shotgun proteomics is a widely known technique for identification of protein and peptide sequences from an "in vitro" sample. A tandem mass spectrometer generates tens of thousands of mass spectra which must be annotated with peptide...