Metric Access Methods

31 papers

2 followers

About this topic

Metric access methods are techniques used in computer science and information retrieval to efficiently store, retrieve, and manage data based on distance or similarity measures in metric spaces. These methods support operations such as nearest neighbor search and clustering by organizing data into index structures that optimize query performance, including in high-dimensional spaces.

Key research themes

1. How can indexing structures be designed and optimized to efficiently support similarity search in generic metric spaces?

This research area focuses on developing, refining, and empirically benchmarking data structures (metric access methods) to support efficient similarity search in metric spaces, which accommodate diverse data types and non-Euclidean distance functions. The goal is to accelerate queries such as nearest neighbor or range queries by leveraging metric space properties (notably triangle inequality) combined with effective partitioning, clustering, pivot selection, and disk-memory-aware structures. Optimizing construction, update, storage size, and search efficiency in secondary memory and high-dimensional settings is central to this theme, affecting domains like multimedia retrieval, image databases, and more.
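
The role of the triangle inequality mentioned above can be illustrated with a minimal pivot-table sketch. It is not taken from any of the listed papers; the class name `PivotTable`, the helper `euclidean`, and the choice of pivots are assumptions made for illustration only. For a pivot p, the triangle inequality gives |d(q, p) - d(o, p)| <= d(q, o), so an object can be discarded without computing d(q, o) whenever that lower bound already exceeds the query radius.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any function satisfying the metric axioms would do."""
    return math.dist(a, b)


class PivotTable:
    """Hypothetical pivot-based index: stores d(object, pivot) for every pair."""

    def __init__(self, objects: Sequence[Point], pivots: Sequence[Point],
                 dist: Callable[[Point, Point], float]) -> None:
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precomputed once at build time: one row of pivot distances per object.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q: Point, radius: float) -> Tuple[List[Point], int]:
        """Return (objects within radius of q, number of distance computations)."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        computed = len(self.pivots)
        hits = []
        for obj, row in zip(self.objects, self.table):
            # Lower bound on d(q, obj) derived from the triangle inequality.
            lower_bound = max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
            if lower_bound > radius:
                continue  # pruned without computing d(q, obj)
            computed += 1
            if self.dist(q, obj) <= radius:
                hits.append(obj)
        return hits, computed
```

The pruning is lossless: every object within the radius survives the filter, so the result equals that of a sequential scan while typically needing fewer distance computations. How many are saved depends on the data and on pivot selection, which is one of the optimization questions this theme covers.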

Indexing Metric Spaces for Exact Similarity Search

by Yunjun Gao

2022, ACM Computing Surveys

Key finding: This comprehensive survey summarizes a wide range of exact similarity search indexes in metric spaces, providing an extensive categorization of partitioning, pruning, and validation techniques fundamental to accelerating...

New dynamic metric indices for secondary memory

by Nora Reyes

2022, Information Systems

Key finding: This work introduces three novel dynamic metric indexes designed for secondary memory that support insertions and deletions in medium-to-high dimensional spaces. It extends in-memory structures like DSAT and LC into...

Improving retrieval accuracy of Hierarchical Cellular Trees for generic metric spaces

by Carles Gimeno Ventura

2022, Multimedia Tools and Applications

Key finding: This study identifies key shortcomings in the original Hierarchical Cellular Tree (HCT) design, revising the definition of covering radius to reflect maximum subtree distances and redesigning the retrieval scheme to more...

2. What algorithmic strategies enable efficient approximate self-similarity joins and similarity joins in metric spaces?

This theme investigates computational algorithms to efficiently find similar object pairs within metric spaces, focusing on approximations to enhance scalability and applicability, especially when exact joins and self-joins are computationally prohibitive. Emphasis is placed on balancing query expressivity (like kNN joins), computational complexity, pruning techniques utilizing metric properties, and trade-offs between index utilization and direct computation. These studies have implications for multimedia retrieval, pattern recognition, and data mining, addressing challenges in handling complex or high-dimensional data.

An efficient algorithm for approximated self-similarity joins in metric spaces

by Sebastián Ferrada

2023, Information Systems

Key finding: This paper proposes a novel heuristic algorithm approximating k-nearest neighbor self-similarity joins in metric spaces, achieving worst-case O(n^{3/2}) distance computations, significantly improving over the naïve quadratic...

3. How do non-metric similarity models and domain-specific indexing impact specialized retrieval tasks such as tandem mass spectrometry identification?

This theme revolves around adapting similarity search techniques—especially in metric and non-metric spaces—to specialized domains with unique characteristics, focusing on tandem mass spectrometry for protein/peptide identification. Research addresses the design of non-metric similarity measures, indexing adaptations, clustering preprocessing, and approximate search methods to tackle challenges such as PTMs and noisy data. Theoretical foundations and applied frameworks that accelerate biochemical sequence identification are central, demonstrating metric space querying principles applied in bioinformatics.

J. Novak, D. Hoksza, J. Lokoc, T. Skopal. On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

by Jiri Novak and

2012, 8th International Symposium on Bioinformatics Research and Applications (ISBRA)

Key finding: The paper demonstrates that applying clustering as a preprocessing step to tandem mass spectra substantially accelerates non-metric similarity searches based on M-tree and TriGen algorithms by over 100x compared to sequential...

J. Novak, J. Galgonek, D. Hoksza, T. Skopal. SimTandem: Similarity Search in Tandem Mass Spectra

by Jiri Novak

2011, 5th International Conference on Similarity Search and Applications (SISAP)

Key finding: SimTandem implements a non-metric similarity search framework utilizing parameterized Hausdorff distance and non-metric access methods to accelerate identification of protein and peptide sequences from tandem mass spectra. By...

Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry

by David Hoksza

2014

Key finding: This work introduces the parameterized Hausdorff distance as an effective non-metric similarity measure tailored for tandem mass spectra comparison. It models spectral similarity with robustness to noise and modifications,...

All papers in Metric Access Methods

Estimating the indexability of multimedia descriptors for similarity searching

by Stanislav Barton

2023

A study on properties of data sets representing public domain audio and visual content and their relation to their indexability is presented. Data analysis considers the pair-wise distance distributions and various techniques to estimate...
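
As a rough illustration of how a pairwise distance distribution can be summarized into a single indexability indicator, the sketch below computes the intrinsic dimensionality estimate rho = mu^2 / (2 * sigma^2), where mu and sigma^2 are the mean and variance of the pairwise distances. Whether the paper above uses this particular estimator is an assumption, since the abstract is truncated; the function name `intrinsic_dimensionality` is hypothetical.

```python
import itertools
import statistics
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def intrinsic_dimensionality(objects: Sequence[T],
                             dist: Callable[[T, T], float]) -> float:
    """Estimate rho = mu^2 / (2 * sigma^2) from all pairwise distances.

    A concentrated distance histogram (small variance relative to the mean)
    yields a high rho, which suggests the data set is harder to index.
    """
    distances = [dist(a, b) for a, b in itertools.combinations(objects, 2)]
    mu = statistics.fmean(distances)
    variance = statistics.pvariance(distances, mu)
    return (mu * mu) / (2.0 * variance)
```

For large collections one would normally evaluate this on a random sample of pairs rather than on all of them, since the full computation is quadratic in the number of objects.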

RAFIKI: Retrieval-Based Application for Imaging and Knowledge Investigation

by Jose F Rodrigues-Jr

2023, IEEE International Symposium on Computer-Based Medical Systems (CBMS)

Medical exams, such as CT scans and mammograms, are obtained and stored every day in hospitals all over the world, including images, patient data, and medical reports. It is paramount to have tools and systems to improve computer-aided...

Probabilistic Metric Spaces for Privacy by Design Machine Learning Algorithms: Modeling Database Changes

by Julian Salas

2023, Lecture Notes in Computer Science

Machine learning, data mining and statistics are used to analyze the data and to build models from them. Data privacy for big data needs to find a compromise between data analysis and disclosure risk. Privacy by design machine learning...

DBM-tree: a dynamic metric access method sensitive to local density data

by Marcos Aurelio Vieira

2023

Metric Access Methods (MAM) are employed to accelerate the processing of similarity queries, such as the range and the k-nearest neighbor queries. Current methods improve the query performance minimizing the number of disk accesses,...

Indexing Algorithm Design for Content Based Image Retrieval

by Sandhya Tarar

2023

A lot of research efforts have been attracted over the internet by mass of digitized images to supervise the visual data for the development of tools for their fast and effective recovery. Each and every one of the internet users tries to...

NovoHMM: A Hidden Markov Model for de Novo Peptide Sequencing

by Volker Roth

2022, Analytical Chemistry

De novo sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the high-throughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current...

Improving retrieval accuracy of Hierarchical Cellular Trees for generic metric spaces

by Carles Gimeno Ventura

2022, Multimedia Tools and Applications

Metric Access Methods (MAMs) are indexing techniques which allow working in generic metric spaces. Therefore, MAMs are especially useful for Content-Based Image Retrieval systems based on features which use non-L_p norms as similarity...

by Maria Viviana Martinez

2022, IEEE Intelligent Systems

Many applications could benefit from accurately predicting an entity's behavior. For example, researchers have developed methods to predict a terrorist organization's probable actions (such as bombings or kidnappings) [1, 2]. Likewise, we...

Indexing Metric Spaces for Exact Similarity Search

by Christian S. Jensen and

2022, ACM Computing Surveys

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when...

Slim-Trees: High Performance Metric Trees Minimizing Overlap between Nodes

by A. Traina

2022, Lecture Notes in Computer Science

In this paper we present the Slim-tree, a dynamic tree for organizing metric datasets in pages of fixed size. The Slim-tree uses the "fat-factor" which provides a simple way to quantify the degree of overlap between the nodes in a metric...

by Maria Vanina Martinez

2022, IEEE Intelligent Systems

A fast coarse filtering method for peptide identification by mass spectrometry

by John Prince

2022

Motivation: We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric space indexing algorithm to produce an initial candidate set, which can be followed by any...

by Pavel Zezula

2022

Exploring Intersection Trees for Indexing Metric Spaces

by Zineddine KOUAHLA

2022

Searching in a dataset for objects that are similar to a given query object is a fundamental problem for several applications that use complex data. The general problem of many similarity measures for complex objects is their...

Indexing Metric Spaces with Nested Forests of Topological Balls and Hyperplanes

by Zineddine KOUAHLA

2022

Searching in a dataset for objects that are similar, with respect to a distance, to a given query object is a fundamental problem for several applications that use complex data, e.g., strings, graphs. The main difficulties are to focus...

Performance evaluation of multidimensional access methods

by Ricardo Rodrigues Ciferri

2021, Proceedings of the eighth ACM international symposium on Advances in geographic information systems - GIS '00

Storing multidimensional data in databases is an important topic both in the industrial and scientific database communities. Arrays are offered as a multidimensional data structure by most programming languages. Conventional database...

Nearest Neighbours Search using the PM-tree

by Jaroslav Pokorny

2021, Database Systems for Advanced …

We introduce a method of searching the k nearest neighbours (k-NN) using PM-tree. The PM-tree is a metric access method for similarity search in large multimedia databases. As an extension of M-tree, the structure of PM-tree exploits...

On Estimating the Indexability of Multimedia Descriptors for Similarity Searching

by Marta Rukoz

2021

The attached version is that of a research report (CEDRIC Research Report no. 1892). To be put online in the first week of May. A study on properties of data sets representing public domain audio and visual content and their relation to...

Estimating the indexability of multimedia descriptors for similarity searching

by Marta Rukoz

2021

A study on properties of data sets representing public domain audio and visual content and their relation to their indexability is presented. Data analysis considers the pairwise distance distributions and various techniques to estimate...

Indexing in metric spaces

by Zineddine KOUAHLA

2021

Fig. 1. A simplified taxonomy of tree-based indexing techniques in metric spaces

Exploring intersection trees for indexing metric spaces

by Zineddine KOUAHLA

2021

A New Intersection Tree for Content-based Image Retrieval

by Zineddine KOUAHLA

2021

Retrieval of images based on their contents is a process that requires comparisons of a given query (image) with virtually all the images stored in a database with respect to a given distance function. But this is inapplicable on large...

Metric Indexing for the Vector Model in Text Retrieval

by Jaroslav Pokorny

2021, Lecture Notes in Computer Science

In the area of Text Retrieval, processing a query in the vector model has been verified to be qualitatively more effective than searching in the boolean model. However, in case of the classic vector model the current methods of processing...

by Michal Batko

2021, Advances in Database Systems

Distinct nearest neighbors queries for similarity search in very large multimedia databases

by Michal Batko

2021, Proceeding of the eleventh international workshop on Web information and data management - WIDM '09

As the volume of multimedia data available on internet is tremendously increasing, the content-based similarity search becomes a popular approach to multimedia retrieval. The most popular retrieval concept is the k nearest neighbor (kNN)...

Metric Index: An efficient and scalable solution for precise and approximate similarity search

by Michal Batko

2021, Information Systems

Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management is still a resisting research...

On (not) indexing quadratic form distance by metric access methods

by Tomas Bartos

2021, … of the 14th International Conference on …

The quadratic form distance (QFD) has been utilized as an effective similarity function in multimedia retrieval, in particular, when a histogram representation of objects is used. Unlike the widely used Euclidean distance, the QFD allows...
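
For readers unfamiliar with the QFD, the following minimal sketch evaluates the standard definition QFD_A(u, v) = sqrt((u - v) A (u - v)^T) for two histograms and a bin-similarity matrix A. The function name `qfd` and the example matrix in the note below are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def qfd(u: Sequence[float], v: Sequence[float],
        A: Sequence[Sequence[float]]) -> float:
    """Quadratic form distance between histograms u and v.

    A[i][j] expresses how similar bin i is to bin j. With the identity
    matrix the QFD reduces to the ordinary Euclidean distance.
    """
    diff = [a - b for a, b in zip(u, v)]
    total = 0.0
    for i, d_i in enumerate(diff):
        for j, d_j in enumerate(diff):
            total += d_i * A[i][j] * d_j
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(total, 0.0))
```

With a matrix such as `[[1, 0.5, 0], [0.5, 1, 0.5], [0, 0.5, 1]]`, shifting histogram mass to a neighbouring bin costs less than shifting it to a distant bin, which is the cross-bin behaviour that distinguishes the QFD from the Euclidean distance.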

Margin-based pivot selection for similarity search indexes

by Daiji Fukagawa

2019, IEICE Transactions on Information and Systems

Hisashi KURASAWA, Daiji FUKAGAWA, Atsuhiro TAKASU, and Jun ADACHI. Summary: When developing an index for a similarity search in metric spaces, how to divide the space for effective search pruning is a...

Pivot Selection Strategies for Permutation-Based Similarity Search

by Fabrizio Falchi

2019, Similarity Search and Applications, 8th International Conference (SISAP 2013)

Recently, permutation based indexes have attracted interest in the area of similarity search. The basic idea of permutation based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or...

Fig. 4. Recall@10 versus the number of candidates accessed (z') by the PP-Index when using the multiple-query search method with zero (lower left corner) to eight (upper right) additional queries.

Fig. 5. Recall@r obtained by the MI-File, varying the number of retrieved objects r from 1 to 100.

Fig. 6. Recall@10 obtained by the MI-File.
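
The basic idea of representing objects as pivot permutations can be sketched as follows. This is a generic illustration, not the PP-Index or MI-File from the paper; the function names and the use of the Spearman footrule to compare permutations are assumptions chosen because they are a common way to realize the idea.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def pivot_permutation(obj: T, pivots: Sequence[T],
                      dist: Callable[[T, T], float]) -> List[int]:
    """Pivot indices ordered from closest to farthest (ties broken by index)."""
    return sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))


def spearman_footrule(perm_a: Sequence[int], perm_b: Sequence[int]) -> int:
    """Sum over pivots of the difference between their ranks in both permutations."""
    rank_in_b = {pivot: rank for rank, pivot in enumerate(perm_b)}
    return sum(abs(rank - rank_in_b[pivot]) for rank, pivot in enumerate(perm_a))


def candidate_set(query: T, objects: Sequence[T], perms: Sequence[Sequence[int]],
                  pivots: Sequence[T], dist: Callable[[T, T], float],
                  n_candidates: int) -> List[T]:
    """Rank objects by permutation similarity to the query; keep the best few."""
    q_perm = pivot_permutation(query, pivots, dist)
    ranked = sorted(range(len(objects)),
                    key=lambda i: spearman_footrule(q_perm, perms[i]))
    return [objects[i] for i in ranked[:n_candidates]]
```

Only the candidates would then be compared with the query using the original distance, so the result is approximate: recall depends on the number of candidates accessed, which is what the figure captions above measure.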

DAHC-tree: An effective index for approximate search in high-dimensional metric spaces

by Ricardo Torres

2017

Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index...

Estimating the indexability of multimedia descriptors for similarity searching

by Marta Rukoz

2017, Adaptivity Personalization and Fusion of Heterogeneous Information

On Estimating the Indexability of Multimedia Descriptors for Similarity Searching

by Marta Rukoz

2017

Estimating the indexability of multimedia descriptors for similarity searching

by Marta Rukoz

2017

by Giuseppe Amato and

2016

Implementation and application of a versatile clustering tool for tandem mass spectrometry data

by Kristian Flikka

2016, PROTEOMICS

High-throughput proteomics experiments typically generate large amounts of peptide fragmentation mass spectra during a single experiment. There is often a substantial amount of redundant fragmentation of the same precursors among these...

(draft) Experimental analysis of insertion costs in a naïve dynamic MDF-tree

by Jose Oncina

2016

Similarity search is a widely employed technique in Pattern Recognition. In order to speed up the search many indexing techniques have been proposed. However, the majority of the proposed techniques are static, that is, a fixed training...

Impact of the Initialization in Tree-Based Fast Similarity Search Techniques

by Jose Oncina

2016, Lecture Notes in Computer Science

Many fast similarity search techniques rely on the use of pivots (specially selected points in the data set). Using these points, specific structures (indexes) are built, speeding up the search when querying. Usually, pivot selection...

BMF: Bitmapped Mass Fingerprinting for Fast Protein Identification

by Kesheng Wu

2016, 2011 IEEE International Conference on Cluster Computing

Protein identification is an important objective for proteomic and medical sciences, as well as for pharmaceutical industry. With recent large-scale automation of genome sequencing and the explosion of protein databases, it is important...

Implementation and application of a versatile clustering tool for tandem mass spectrometry data

by Lennart Martens

2016, PROTEOMICS

On comparison of SimTandem with state-of-the-art peptide identification tools, efficiency of precursor mass filter and dealing with variable modifications

by David Hoksza

2016, Journal of integrative bioinformatics

The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence...

On optimizing the non-metric similarity search in tandem mass spectra by clustering

by David Hoksza

2016

Tandem mass spectrometry is a well-known technique for identification of protein sequences from an "in vitro" sample. To identify the sequences from spectra captured by a spectrometer, the similarity search in a database of hypothetical...

A Statistical Comparison of SimTandem with State-of-the-Art Peptide Identification Tools

by David Hoksza

2016

The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra generated by shotgun proteomics. Since query spectra contain...

by David Hoksza

2016, Lecture Notes in Computer Science

SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass...

Nearest Neighbours Search Using the PM-Tree

by Jaroslav Pokorny

2015, Lecture Notes in Computer Science

Revisiting M-Tree Building Principles

by Jaroslav Pokorny and

2015, Lecture Notes in Computer Science

The M-tree is a dynamic data structure designed to index metric datasets. In this paper we introduce two dynamic techniques of building the M-tree. The first one incorporates a multi-way object insertion while the second one exploits the...

PM-tree: Pivoting metric tree for similarity search in multimedia databases

by Jaroslav Pokorny

2015, ADBIS (Local Proceedings)

In this paper we introduce the Pivoting M-tree (PM-tree), a metric access method combining M-tree with the pivot-based approach. While in M-tree a metric region is represented by a hyper-sphere, in PM-tree the shape of a metric...

Fig. 4. Distance distribution histogram, 90% of distances in the interval (d_min, d_max)

... interval HR[t] by two 4-byte reals and a pivot distance PD[t] by one 4-byte real. However, when (a part of) the dataset is known in advance we can approximate the 4-byte representation by a 1-byte code. For this reason a distance distribution histogram for each pivot is created by random sampling of objects from the dataset and comparing them against the pivot. Then a distance interval (d_min, d_max) is computed so that most of the histogram distances fall into the interval, see an example in Figure 4 (the d* value is an (estimated) maximum distance of a bounded metric space M).

Fig. 2. Hierarchy of metric regions and the appropriate M-tree

Fig. 7. Construction costs (30D indices): (a) Disk access costs (b) Computation costs

Fig. 3. (a) Region of M-tree (b) Reduced region of PM-tree (using three pivots)

Since each hyper-ring region (P_t, HR[t]) defines a metric region containing all the objects stored in T(O_i), an intersection of all the hyper-rings and the hyper-sphere forms a metric region bounding all the objects in T(O_i) as well. Due to the intersection with the hyper-sphere, the PM-tree metric region is always smaller than the original M-tree region defined just by a hyper-sphere. For a comparison of an M-tree region and an equivalent PM-tree region see Figure 3. The numbers p_hr and p_pd (both fixed for a PM-tree index lifetime) allow us to specify the "amount of pivoting". Obviously, using a suitable p_hr > 0 and p_pd > 0 the PM-tree can be tuned to achieve an optimal performance (see Section 5).

Fig. 6. Query selectivity: (a) Disk access costs (b) Computation costs

Abbreviations in figures. Each label of the form "PM-tree(x,y)" stands for a PM-tree index where p_hr = x and p_pd = y. A label "<index> + SlimDown" denotes an index subsequently post-processed using the slim-down algorithm (for details about the slim-down algorithm we refer to [10]).

Fig. 1. A routing entry and its metric region in the M-tree structure

... where O_i ∈ S is a data object, ptr(T(O_i)) is a pointer to the covering subtree, and r_{O_i} is the covering radius. The routing entry determines a hyper-spherical metric region in M where the object O_i is the center of that region and r_{O_i} is a radius bounding the region. The precomputed value d(O_i, Par(O_i)) is used for optimizing most of the M-tree algorithms. In Figure 1 a metric region and ...

Fig. 9. Dimensionality (query selectivity 50 objects): (a) Disk access costs (b) Computation costs

In Figure 8 the range query costs (for 30-dimensional indices and query selectivity 50 objects) according to the number of pivots are presented. The DAC rapidly decrease with the increasing number of pivots. The PM-tree(128,0) and PM-tree(128,28) indices need only 27% of DAC spent by the M-tree index. Moreover, the PM-tree is superior even after the slim-down algorithm post-processing, e.g. the "slimmed" PM-tree(128,0) index needs only 23% of DAC spent by the "slimmed" M-tree index (and only 6.7% of DAC spent by the ordinary M-tree). The decreasing trend of computation costs is even more steep than for DAC; the PM-tree(128,28) index needs only 5.5% of the M-tree CC.

Fig. 10. Number of pivots (query selectivity 50 objects): (a) DAC (b) CC

In Figure 10a the DAC for increasing number of pivots are presented. We can see that e.g. the "slimmed" PM-tree(1024,50) index consumes only 42% of DAC spent by the "slimmed" M-tree index. The computation costs (see Figure 10b) for p < 64 decrease (down to 36% of M-tree CC). However, for p > 64 the overall computation costs grow since the number of necessarily computed query-to-pivot distances (i.e. p distance computations for each query) is proportionally too large. Nevertheless, this fact is dependent on the database size; obviously, for 100,000 objects (images) the proportion of p query-to-pivot distance computations would be smaller when compared with the overall computation costs. Finally, the costs according to the increasing range query selectivity are presented in Figure 11. The disk access costs stay below 73% of M-tree DAC (below 58% in case of "slimmed" indices) while the computation costs stay below 43% (49% respectively).

Fig. 11. Query selectivity: (a) Disk access costs (b) Computation costs

Fig. 8. Number of pivots (30-dim. indices, query selectivity 50 objs.): (a) DAC (b) CC

Index construction costs (for 30-dimensional indices) according to the increasing number of pivots are presented in Figure 7. The disk access costs for PM-tree indices with up to 8 pivots are similar to those of the M-tree index (see Figure 7a). For PM-tree(128,0) and PM-tree(128,28) indices the DAC are about 1.4 times higher than for the M-tree index. The increasing trend of computation costs (see Figure 7b) depends mainly on the p object-to-pivot distance computations made during each object insertion; additional computations are needed after leaf splitting in order to create HR arrays of the new routing entries.
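
The region test implied by the hyper-ring description can be sketched as below. It is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, distances from the query to the routing object and to each pivot are assumed to be already computed, and each hyper-ring HR[t] is modelled as a (min, max) pair of distances to pivot t.

```python
from typing import Sequence, Tuple


def mtree_region_may_intersect(d_query_center: float, covering_radius: float,
                               query_radius: float) -> bool:
    """M-tree test: the query ball and the hyper-sphere overlap."""
    return d_query_center <= covering_radius + query_radius


def pmtree_region_may_intersect(d_query_center: float, covering_radius: float,
                                d_query_pivots: Sequence[float],
                                hyper_rings: Sequence[Tuple[float, float]],
                                query_radius: float) -> bool:
    """PM-tree test: the query ball must overlap the sphere AND every hyper-ring."""
    if not mtree_region_may_intersect(d_query_center, covering_radius, query_radius):
        return False
    for d_qp, (hr_min, hr_max) in zip(d_query_pivots, hyper_rings):
        # The query ball spans distances [d_qp - r, d_qp + r] from pivot t;
        # if that span misses [hr_min, hr_max], the subtree cannot hold a result.
        if d_qp - query_radius > hr_max or d_qp + query_radius < hr_min:
            return False
    return True
```

Because the PM-tree region is the intersection of the sphere with the rings, any subtree rejected by the M-tree test is also rejected here, and some subtrees that pass the sphere test are additionally pruned by a ring, which is consistent with the lower disk access and computation costs reported above.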

Summative Report on Bioinformatics Case Studies

by David Cruz

2015

Tandem mass spectrometry (MS/MS) involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyze complex mixtures of proteins. In the last decades, many specific techniques for identifying...

A Scalable Parallel Approach for Peptide Identification from Large-Scale Mass Spectrometry Data

by Douglas Baxter

2015, 2009 International Conference on Parallel Processing Workshops

Identifying peptides, which are short polymeric chains of amino acid residues in a protein sequence, is of fundamental importance in systems biology research. The most popular approach to identify peptides is through database search. In this approach, an experimental spectrum ("query") generated from fragments of a target peptide using mass spectrometry is computationally compared with a database of already known protein sequences. The goal is to detect database peptides that are most likely to have generated the target peptide. The exponential growth rates and overwhelming sizes of biomolecular databases make this an ideal application to benefit from parallel computing. However, the present generation of software tools is not expected to scale to the magnitudes and complexities of data that will be generated in the next few years. This is because they are all either serial algorithms or parallel strategies that have been designed over inherently serial methods, thereby requiring high space and time requirements. In this paper, we present an efficient parallel approach for peptide identification through database search. Three key factors distinguish our approach from that of existing solutions: i) (space) Given p processors and a database with N residues, we provide the first space-optimal algorithm (O(N/p)) under the distributed memory machine model; ii) (time) Our algorithm uses a combination of parallel techniques such as one-sided communication and masking of communication with computation to ensure that the overhead introduced due to parallelism is minimal; and iii) (quality) The run-time savings achieved using parallel processing have allowed us to incorporate highly accurate statistical models that have previously been demonstrated to ensure high quality prediction albeit on smaller scale data. We present the design and evaluation of two different algorithms to implement our approach. Experimental results using 2.65 million microbial proteins show linear scaling up to 128 processors of a Linux commodity cluster, with parallel efficiency at ∼50%. We expect that this new approach will be critical to meet the data-intensive and qualitative demands stemming from this important application domain.

NovoHMM: A Hidden Markov Model for de Novo Peptide Sequencing

by Jonas Grossmann

2015, Analytical Chemistry

De novo sequencing of peptides poses one of the most challenging tasks in data analysis for proteome research. In this paper, a generative hidden Markov model (HMM) of mass spectra for de novo peptide sequencing which constitutes a novel...

by Jiri Novak

2015

Shotgun proteomics is a widely known technique for identification of protein and peptide sequences from an "in vitro" sample. A tandem mass spectrometer generates tens of thousands of mass spectra which must be annotated with peptide...
