Carl Barton

Proceedings of the 2nd International Conference on Algorithms for Big Data

On-Line Pattern Matching on Uncertain Sequences and Applications

Lecture Notes in Computer Science, 2016

We study the fundamental problem of pattern matching in the case where the string data is weighted: for every position of the string and every letter of the alphabet, a probability of occurrence for this letter at this position is given. Sequences of this type are commonly used to represent uncertain data. They are of particular interest in computational molecular biology as they can represent different kinds of ambiguity in DNA sequences: distributions of SNPs in genome populations; position frequency matrices of DNA binding profiles; or even sequencing-related uncertainties. A weighted string may thus represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. In this article, we present new average-case results on pattern matching on weighted strings and show how they are applied effectively in several biological contexts. A free open-source implementation of our algorithms is made available.

Linear-time Computation of Minimal Absent Words Using Suffix Array

arXiv (Cornell University), Jun 24, 2014

An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. In this article, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays. Experimental results using real and synthetic data show that the respective implementation outperforms the one by Pinho et al.
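The definition above can be illustrated with a small brute-force sketch. This is not the linear-time suffix-array algorithm of the paper; it simply enumerates all factors of y, so it is only suitable for tiny inputs. The function name `minimal_absent_words` and the explicit `alphabet` parameter are assumptions made for illustration.

```python
def minimal_absent_words(y, alphabet):
    """Brute-force minimal absent words of y (illustrative only, not linear time).

    A word w = a + u + b (a, b single letters) is a minimal absent word when
    w does not occur in y but both a + u and u + b do. A single letter that
    never occurs in y is also minimal absent, since its only proper factor
    is the empty word.
    """
    n = len(y)
    # All factors (substrings) of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    # Letters of the alphabet that never occur in y.
    maws.update(c for c in alphabet if c not in factors)
    return sorted(maws)


print(minimal_absent_words("abaab", "ab"))
```

For y = abaab over the alphabet {a, b}, the sketch reports aaa, aaba, bab and bb.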

An upper bound of O(σn) on the number of minimal absent words suggests potential for linear-time comparative sequence analysis.

Identification of All Exact and Approximate Inverted Repeats in Regular and Weighted Sequences

Springer eBooks, 2013

The detection of various types of repeats is a fundamental and well-studied problem in stringology. In this paper we present extensions to this problem with applications to bioinformatics. We consider the detection of all exact and approximate inverted repeats, as well as all exact and approximate weighted inverted repeats, and give efficient algorithms for their computation.

On the repetitive collection indexing problem

In large data sets such as genomes from a single species, large sets of reads, and version control data, it is often noted that each entry differs from another by only a very small number of variations. This leads to a large set of data with a great deal of redundancy and repetitiveness. Rapid development in DNA sequencing technologies has caused a drastic growth in the size of publicly available sequence databases with such data. DNA sequencing has become so fast and cost-effective that sequencing individual genomes will soon become a common task [9], making querying and storing such sets of data an important task. In this paper, we propose an indexing structure for highly repetitive collections of sequence data based on a multilevel q-gram model. In particular, the proposed algorithm accommodates variations that may occur in the target sequence with respect to the reference sequence. The paper is organized as follows. Sections [1] and [2] introduce the basic concepts and go through the related literature. In Section [3] we present notions and facts. Details of the proposed data structure and algorithm are given in Sections [4] and [5], Section [6] discusses complexity analysis, and Section [7] gives conclusions and future work.

Accurate and Efficient Methods to Improve Multiple Circular Sequence Alignment

Lecture Notes in Computer Science, 2015

Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here, we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies.

On the average-case complexity of pattern matching with wildcards

Theoretical Computer Science

Pattern matching with wildcards is the problem of finding all factors of a text t of length n that match a pattern x of length m, where wildcards (characters that match everything) may be present. In this paper we present a number of fast average-case algorithms for pattern matching where wildcards are restricted to either the pattern or the text; however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case complexity of these algorithms and show the first non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which, as a by-product, provide the first provable separation in time complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.
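To make the problem statement concrete, here is a naive O(nm) sketch of pattern matching with wildcards. It is not one of the fast average-case algorithms from the paper; it only spells out what counts as a match. The wildcard symbol `?` and the function name `wildcard_occurrences` are assumptions chosen for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; the paper does not fix a character


def wildcard_occurrences(text, pattern, wildcard=WILDCARD):
    """Return all starting positions where pattern matches a factor of text.

    A wildcard in either the text or the pattern matches any character.
    Naive check of every alignment: O(n * m) time in the worst case.
    """
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if all(
            p == wildcard or c == wildcard or p == c
            for c, p in zip(text[i:i + m], pattern)
        )
    ]


print(wildcard_occurrences("abcabd", "ab?"))  # wildcard in the pattern
print(wildcard_occurrences("a?c", "abc"))     # wildcard in the text
```

The same routine covers wildcards in the pattern, in the text, or in both, which mirrors the remark that the results adapt to the combined case.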

Fast Weighted String Matching

A weighted string over an alphabet of size σ is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices or uncertain sequences, naturally arise in many contexts. In this article, we study the problem of weighted string matching with a special focus on average-case analysis. Given a weighted pattern string x of length m, a text string y of length n > m, and a cumulative weight threshold 1/z, defined as the minimal probability of occurrence of factors in a weighted string, we present an algorithm requiring average-case search time o(n) for pattern matching for weight ratio $\frac{z}{m} < \min\left\{\frac{1}{\log z}, \frac{\log \sigma}{\log z\,(\log m + \log\log \sigma)}\right\}$. For a pattern string x of length m, a weighted text string y of length n > m, and a cumulative weight threshold 1/z, we present an algorithm requiring average-case search time o(σn) for the same weight ratio. The importance of these results lies in the fact that these algorithms work in average-case sublinear search time in the size of the text, and in linear preprocessing time and space in the size of the pattern, for these ratios.

Results indicate that the algorithms run in average-case sublinear search time with linear preprocessing.

Nanoscale 3D DNA tracing in single human cells visualizes loop extrusion directly in situ

Summary: The spatial organization of the genome is essential for its functions, including gene expression, DNA replication and repair, as well as chromosome segregation. Biomolecular condensates and loop extrusion have been proposed as the principal driving forces that underlie the formation of chromatin compartments and topologically associating domains, respectively. However, whether the actual 3D-fold of DNA in single cells is consistent with these mechanisms has been difficult to address in situ. Here, we present LoopTrace, a workflow for nanoscale 3D imaging of the genome sequence in structurally well-preserved nuclei in single human cells. Tracing the in situ structure of DNA in thousands of individual cells reveals that genomic DNA folds as a flexible random coil in the absence of loop extruding enzymes such as Cohesin. In the presence of Cohesin and its boundary factor CTCF, reproducibly positioned loop structures dominate the folds, while Cohesin alone leads to randomly posit...

Global Sequence Alignment with a Bounded Number of Gaps

Techniques and Approaches, 2015

GapsMis

Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, 2007

Motivation: Recent developments in next-generation sequencing technologies have renewed interest in pairwise sequence alignment techniques, particularly so for the application of re-sequencing, the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and the high-quality fragment of a short read, an important problem is to find the best possible alignment between a succeeding factor of the reference sequence and the remaining low-quality part of the read, allowing a number of mismatches and the insertion of gaps in the alignment. Results: We present GapsMis, a tool for pairwise global and semi-global sequence alignment with a variable, but bounded, number of gaps. It is based on a new algorithm, which computes a different version of the traditional dynamic programming matrix. Millions of pairwise sequence alignments, performed under realistic conditions based on the properties of real full-length genomes, show that GapsMis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches. Availability: http://www.exelixis-lab.org/gapmis

Global and local sequence alignment with a bounded number of gaps

Theoretical Computer Science, 2015

Pairwise sequence alignment techniques have gained renewed interest in recent years, primarily due to their applications in re-sequencing, the assembly of a genome directed by a reference sequence. In this article, we show that adding the flexibility of bounding the number of gaps inserted in an alignment strengthens the classical sequence alignment scheme of scoring matrices and affine gap penalty scores. We present GapsMis, an algorithm for pairwise global sequence alignment with a variable, but bounded, number of gaps. It is based on computing a variant of the traditional dynamic programming matrix for global sequence alignment. We also present GapsMis-L, the analogous algorithm for pairwise local sequence alignment with a variable, but bounded, number of gaps. To test the accuracy of GapsMis and GapsMis-L we performed millions of pairwise sequence alignments under realistic conditions, based on the properties of real full-length genomes. The results show that GapsMis and GapsMis-L can increase the accuracy of extending short-read alignments compared to the traditional approaches. The importance of our contribution is underlined by the fact that the provided algorithms may be seamlessly integrated into any biological pipeline. The open-source code of our implementation is freely available at http://www.inf.kcl.ac.uk/research/projects/gapmis/.

GapsMis operates with time and space complexity of O(mk), optimizing alignment with allowed gaps in sequences.

Spike protein mutations and structural insights of pangolin lineage B.1.1.25 with implications for viral pathogenicity and ACE2 binding affinity

Scientific Reports

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, is constantly evolving, requiring continuous genomic surveillance. In this study, we used whole-genome sequencing to investigate the genetic epidemiology of SARS-CoV-2 in Bangladesh, with particular emphasis on identifying dominant variants and associated mutations. We used high-throughput next-generation sequencing (NGS) to obtain DNA sequences from COVID-19 patient samples and compared these sequences to the Wuhan SARS-CoV-2 reference genome using the Global Initiative for Sharing All Influenza Data (GISAID). Our phylogenetic and mutational analyses revealed that the majority (88%) of the samples belonged to the pangolin lineage B.1.1.25, whereas the remaining 11% were assigned to the parental lineage B.1.1. Two main mutations, D614G and P681R, were identified in the spike protein sequences of the samples. The D614G mutation, which is the most common, decreases S1 domain flexibility, wh...

The Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel

Genome Biology

Background: Unraveling the relationship between genetic variation and phenotypic traits remains a fundamental challenge in biology. Mapping variants underlying complex traits while controlling for confounding environmental factors is often problematic. To address this, we establish a vertebrate genetic resource specifically to allow for robust genotype-to-phenotype investigations. The teleost medaka (Oryzias latipes) is an established genetic model system with a long history of genetic research and a high tolerance to inbreeding from the wild. Results: Here we present the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel: the first near-isogenic panel of 80 inbred lines in a vertebrate model derived from a wild founder population. Inbred lines provide fixed genomes that are a prerequisite for the replication of studies, studies which vary both the genetics and environment in a controlled manner, and functional testing. The MIKK panel will therefore enable phenotype-to-genotype association s...

Established the MIKK panel of 80 near-isogenic lines through 9 generations of inbreeding from a wild medaka population.

Genomic variations and epigenomic landscape of the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel

Genome Biology

Background: The teleost medaka (Oryzias latipes) is a well-established vertebrate model system, with a long history of genetic research, and multiple high-quality reference genomes available for several inbred strains. Medaka has a high tolerance to inbreeding from the wild, thus allowing one to establish inbred lines from wild founder individuals. Results: We exploit this feature to create an inbred panel resource: the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel. This panel of 80 near-isogenic inbred lines contains a large amount of genetic variation inherited from the original wild population. We use Oxford Nanopore Technologies (ONT) long read data to further investigate the genomic and epigenomic landscapes of a subset of the MIKK panel. Nanopore sequencing allows us to identify a large variety of high-quality structural variants, and we present results and methods using a pan-genome graph representation of 12 individual medaka lines. This graph-based reference MIKK panel genome r...

Longest Common Prefixes with k-Errors and Applications

String Processing and Information Retrieval, 2018

Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $O(n \log^k n \log\log n)$ time on average using O(n) space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
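The problem under the Hamming distance model can be stated as a brute-force sketch. This is a cubic-time illustration of the definition, not the average-case algorithm of the paper. The function name `lcp_with_k_mismatches` is hypothetical, and the sketch assumes that a comparison simply stops when either position runs off the end of the string.

```python
def lcp_with_k_mismatches(s, k):
    """For each suffix s[i:], length of its longest prefix that occurs at some
    other position j != i with at most k mismatches (Hamming distance).

    Brute force over all pairs (i, j); illustrative only.
    """
    n = len(s)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mismatches = 0
            length = 0
            # Extend the comparison while both positions stay inside s.
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    mismatches += 1
                    if mismatches > k:
                        break
                length += 1
            result[i] = max(result[i], length)
    return result


print(lcp_with_k_mismatches("abcabd", 0))
print(lcp_with_k_mismatches("abcabd", 1))
```

For the string abcabd, the first suffix shares the prefix ab with position 3 exactly, and the prefix abc once one mismatch is allowed.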

Efficient Index for Weighted Sequences

ArXiv, 2016

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. I...
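The matching condition with threshold $1/z$ can be written down directly. The sketch below checks every position naively and is not the index described in the paper. Representing the weighted text as a list of dictionaries mapping letters to probabilities is an assumption made for illustration, as are the function names.

```python
from math import prod


def occurrence_probability(weighted_text, pattern, i):
    """Product of the probabilities of pattern's letters at positions i, ..., i+|P|-1.

    weighted_text is assumed to be a list of dicts {letter: probability};
    a letter missing from a position's dict has probability 0.
    """
    return prod(weighted_text[i + j].get(c, 0.0) for j, c in enumerate(pattern))


def weighted_matches(weighted_text, pattern, z):
    """All positions i where pattern matches with probability at least 1/z."""
    n, m = len(weighted_text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if occurrence_probability(weighted_text, pattern, i) >= 1.0 / z
    ]


# Hypothetical weighted text of length 4 over the letters A and C.
text = [
    {"A": 0.5, "C": 0.5},
    {"A": 1.0},
    {"A": 0.25, "C": 0.75},
    {"C": 1.0},
]
print(weighted_matches(text, "AA", 4))
print(weighted_matches(text, "AA", 2))
```

Raising z lowers the threshold $1/z$, so more positions qualify: with z = 4 the pattern AA matches at positions 0 and 1, while with z = 2 only position 0 remains.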

Introduces an O(nz) index with improved pattern matching query times by a factor of z log z over existing methods.

Whole genome mapping and identification of single nucleotide polymorphisms of four Bangladeshi individuals and their functional significance

BMC Research Notes, 2021

Objective: The major objective of the study was to sequence the whole genome of four Bangladeshi individuals and identify variants that are known to be associated with functional changes or disease states. We also carried out an ontology analysis to identify the functions and pathways most likely to be affected by these variants. Results: We identified around 900,000 common variants and close to 5 million unique ones in all four of the individuals. This included over 11,500 variants that caused nonsynonymous changes in proteins. Heart function associated pathways were heavily implicated by the ontology analysis, corroborating previous studies that claimed the Bangladeshi population as highly susceptible to heart disorders. Two variants were found that have been previously identified as pathogenic factors in familial hypercholesteremia and structural disorders of the heart. Other pathogenic variants we found were associated with pseudoxanthoma elasticum, cancer progression, polyaggluti...

Papers by Carl Barton
