btllib: A C++ library with Python interface for efficient genomic sequence processing
Journal of Open Source Software
https://doi.org/10.21105/JOSS.04720…
4 pages
Sign up for access to the world's latest research
Abstract
Bioinformaticians often do not have software engineering training or background, and software quality is not the top priority of research groups due to limited time and funding (Georgeson et al., 2019). Additionally, one-off scripts or code is frequently written to perform a specific task instead of reusing existing code. This could be because the pre-existing computer programming code is either not well written, not widely available, insufficiently documented, inefficient, or not general enough. This practice leads to lower quality and non-reusable code. As bioinformatics analyses are increasingly complex and deal with ever more data, high quality code is needed to handle the complexities of the analyses reliably and productively. The solution to this is well designed and documented libraries. For example, SeqAn (Reinert et al., 2017) is a C++ library that implements algorithms and data structures commonly used in bioinformatics. Not all programmers are well versed in C++, so for users of widely used and accessible higher level programming languages such as Python, Biopython (Cock et al., 2009) is available as a set of Python modules with implementations of commonly needed algorithms. Here, we present the btllib library as an addition to this ecosystem with the goal of providing highly efficient, scalable, and ergonomic implementations of bioinformatics algorithms and data structures.
Related papers
arXiv (Cornell University), 2022
Non-sharable sensitive data collection and analysis in large-scale consortia for genomic research is complicated. Time-consuming issues in installing software arise due to different operating systems, software dependencies and running the software. Therefore, easier, more standardized, automated protocols and platforms can be a solution to overcome these issues. We have developed one such solution for genomic data analysis using software container technologies. The platform, "COGEDAP", consists of different software tools placed into Singularity containers with corresponding pipelines and instructions on how to perform genome-wide association studies (GWAS) and other genomic data analysis via corresponding tools. Using a provided helper script written in Python, users can obtain auto-generated scripts to conduct the desired analysis both on high-performance computing (HPC) systems and on personal computers. The analyses can be done by running these auto-generated scripts with the software containers. The helper script also performs minor re-formatting of the input/output data, so that the end user can work with a unified file format regardless of which genetic software is used for the analysis. COGEDAP is actively being used by users from different countries/projects to conduct their genomic data analyses. Thanks to this platform, users can easily run GWAS and other genomic analyses without spending much effort on software installation, data formats, and other technical requirements.
BMC bioinformatics, 2018
Automated bioinformatics workflows are more robust, easier to maintain, and results more reproducible when built with command-line utilities than with custom-coded scripts. Command-line utilities further benefit by relieving bioinformatics developers to learn the use of, or to interact directly with, biological software libraries. There is however a lack of command-line utilities that leverage popular Open Source biological software toolkits such as BioPerl ( http://bioperl.org ) to make many of the well-designed, robust, and routinely used biological classes available for a wider base of end users. Designed as standard utilities for UNIX-family operating systems, BpWrapper makes functionality of some of the most popular BioPerl modules readily accessible on the command line to novice as well as to experienced bioinformatics practitioners. The initial release of BpWrapper includes four utilities with concise command-line user interfaces, bioseq, bioaln, biotree, and biopop, speciali...
GigaScience
With the decreasing cost of sequencing and the rapid developments in genomics technologies and protocols, the need for validated bioinformatics software that enables efficient large-scale data processing is growing. Here we present GenPipes, a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for High Performance Computing clusters and the cloud. GenPipes already implements 12 benchmarked and scalable pipelines for various genomics applications, including RNA-Seq, ChIP-Seq, DNA-Seq, Methyl-Seq, Hi-C, capture Hi-C, metagenomics and PacBio long read assembly. The software is available under a GPLv3 open source license and is continuously updated to follow recent advances in genomics and bioinformatics. The framework has been already configured on several servers and a docker image is also available to facilitate additional installations. In summary, GenPipes offers genomic researchers a simple method to analyze different types of data, customizable to their needs and resources, as well as the flexibility to create their own workflows.
Algorithms for Molecular Biology, 2007
This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy to use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at
Bioinformation, 2010
Rapid genome sequencing enriched biological databases with enormous sequence data. Yet it remains a daunting task to unravel this information. However experimental and computational researchers lead their own way in analyzing sequence information. Here we introduce a standalone portable tool named "SeqCalc" that would assist the research personnel in computational sequence analysis and automated experimental calculations. Although several tools are available online for sequence analysis they serve only for one or two purposes. SeqCalc is a package of offline program, developed using Perl and TCL/Tk scripts that serve ten different applications. This tool would be an initiative to both experimental and computational researchers in their routine research. SeqCalc is executable in all windows operating systems.
BMC bioinformatics, 2006
A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/output methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications. We present Bio++, a set of Object Oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus. Implementation of...
2012
We present a new Python/C++ framework, BioLite, for implementing bioinformatics pipelines for Next-Generation Sequencing (NGS) data. BioLite tracks provenance of analyses, automates the collection and reporting of diagnostics (such as summary statistics and plots at intermediate stages), and profiles computational requirements. These diagnostics can be accessed across multiple stages of a pipeline, from other pipelines, and in HTML reports. Finally, we describe several use cases for diagnostics in our own analyses.
References (10)
- Afshinfard, A., Jackman, S. D., Wong, J., Coombe, L., Chu, J., Nikolic, V., Dilek, G., Malkoç, Y., Warren, R. L., & Birol, I. (2022). Physlr: Next-generation physical maps. DNA, 2(2), 116-130. https://doi.org/10.3390/dna2020009
- Chu, J., Mohamadi, H., Erhan, E., Tse, J., Chiu, R., Yeo, S., & Birol, I. (2020). Mismatch- tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex bloom filters. Proceedings of the National Academy of Sciences, 117 (29), 16961-16968. https://doi.org/10.1073/pnas.1903436117
- Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & Hoon, M. J. L. de. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics.
- Bioinformatics, 25(11), 1422-1423. https://doi.org/10.1093/bioinformatics/btp163
- Coombe, L., Li, J. X., Lo, T., Wong, J., Nikolic, V., Warren, R. L., & Birol, I. (2021). LongStitch: High-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics, 22(1). https://doi.org/10.1186/s12859-021-04451-7
- Coombe, L., Nikolić, V., Chu, J., Birol, I., & Warren, R. L. (2020). ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs. Bioinformatics, 36(12), 3885-3887. https://doi.org/10.1093/bioinformatics/btaa253
- Georgeson, P., Syme, A., Sloggett, C., Chung, J., Dashnow, H., Milton, M., Lonsdale, A., Powell, D., Seemann, T., & Pope, B. (2019). Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. GigaScience, 8(9). https://doi.org/10. 1093/gigascience/giz109
- Li, H. (2016). Seqtk. https://github.com/lh3/seqtk.
- Mohamadi, H., Chu, J., Vandervalk, B. P., & Birol, I. (2016). ntHash: recursive nucleotide hashing. Bioinformatics, 32(22), 3492-3494. https://doi.org/10.1093/bioinformatics/ btw397
- Reinert, K., Dadi, T. H., Ehrhardt, M., Hauswedell, H., Mehringer, S., Rahn, R., Kim, J., Pockrandt, C., Winkler, J., Siragusa, E., Urgese, G., & Weese, D. (2017). The SeqAn c++ template library for efficient sequence analysis: A resource for programmers. Journal of Biotechnology, 261, 157-168. https://doi.org/10.1016/j.jbiotec.2017.07.017
Vladimir Nikolić