Proteins are the workhorses in a living cell. They can perform a variety of tasks because of their diverse properties. Isoelectric point is just one of the many properties of proteins. Although we focus on isoelectric point (pI) in this chapter, the same large-scale comparative studies illustrated here can be performed on other properties as well. DAMBE (Xia 2013, 2017d) can take a set of protein sequences and compute pI for each sequence. It also implements many other functions for describing and characterizing protein sequences.
We used 94 RAPD primers of different nucleotide composition to probe the genomic differences between a highly virulent P. multocida strain and an attenuated vaccine strain derived from the virulent strain after culturing the latter under increasing temperature for approximately 14,400 generations. The GC content of the vaccine strain is significantly (P < 0.05) lower than that of the virulent strain, contrary to the popular hypothesis of covariation between the GC content and temperature. The frequencies of AA, TA, and TT dinucleotides were higher, and those of AT, GC, and CG dinucleotides were lower, in the vaccine strain than in the virulent strain. A statistic called genomic RAPD entropy is formulated to measure the randomness of the genome, and the genome of the vaccine strain is more random than that of the virulent strain. These differences between the virulent and vaccine strains are interpreted in terms of mutation and selection under increased culturing temperature. A me...
All vertebrate genomes are heavily methylated at CpG dinucleotide sites, and methylated CpG dinucleotides are prone to CpG→TpG mutations through spontaneous deamination. This leaves different footprints on coding and noncoding sequences. We capture these different fingerprints by five indices that can be used to discriminate between coding and non-coding (intron) sequences. We also show that a linear discriminant function derived from a training set of coding and intron sequences from human chromosome 22 can be successfully used in gene-finding of the zebrafish genome.
Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transition probabilities and evolutionary distances from substitution rate matrices. The derivation is essential in molecular phylogenetics and phylogenomics, and represents the only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., BLAST and FASTA). Three different approaches are frequently used for deriving transition probabilities and evolutionary distances: 1) The probability reasoning, 2) Solving partial differential equations, and 3) Matrix exponential and logarithm. The first approach demands the least amount of mathematical skills but offers the best way for conceptual understanding, and can often generate nice mathematical expressions of transition probabilities and evolutionary distances. This review represents the most systematic and comprehensive numerical illustration of the first approach.
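As a concrete illustration of what "transition probabilities and evolutionary distances" look like in closed form, here is a minimal sketch assuming the Jukes-Cantor (JC69) model, the simplest substitution model. The abstract above does not name a model, so the choice of JC69, the function names, and the parameter names `alpha` (rate of change to each of the three other nucleotides) and `t` (time) are assumptions made for illustration only.

```python
import math


def jc69_transition_probs(alpha, t):
    """Return (p_same, p_diff) under JC69 (assumed model, not from the abstract).

    p_same: probability that a site shows the same nucleotide after time t.
    p_diff: probability that it shows one specific different nucleotide.
    """
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


def jc69_distance(p):
    """Convert an observed proportion of differing sites p into a JC69 distance.

    The distance is the expected number of substitutions per site and
    corrects for multiple hits at the same site.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Under this model the expected proportion of differing sites is three times `p_diff`, and feeding that proportion to `jc69_distance` recovers `3 * alpha * t`, the expected number of substitutions per site.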
Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data, and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species. [alignment; 18S rRNA; RNA secondary structure; Indel; molecular phylogenetics; tetrapod phylogeny.]
Hidden Markov Models and Protein Secondary Structure Prediction
A hidden Markov model (HMM) is used to infer the hidden states of a Markov model from observed data. For example, intron and exon are hidden states that need to be inferred from the observed nucleotide sequences. Similarly, secondary structural elements such as alpha helices and beta sheets are hidden states that need to be inferred from observed amino acid sequences. The accuracy of an HMM in inferring hidden states depends on the transition probability matrix and the emission probability matrix derived from training the HMM with representative observations. If the hidden states differ strongly in their probabilities of transitioning into each other, and if their emission probabilities differ strongly from each other, then an HMM can be quite accurate. This chapter details the key algorithms used in HMM, such as the Viterbi algorithm for reconstructing the hidden states and the forward algorithm for computing the probability of the observed sequence of events. Both Viterbi and forward algori...
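The two algorithms named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the chapter's own implementation: the two-state exon/intron model and every probability in `START_P`, `TRANS_P`, and `EMIT_P` are hypothetical values chosen only so that the code runs.

```python
# Hypothetical two-state HMM; all numbers are illustrative assumptions.
STATES = ("exon", "intron")
START_P = {"exon": 0.5, "intron": 0.5}
TRANS_P = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT_P = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of the observed sequence, summed over all hidden paths."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {
            s: emit_p[s][symbol] * sum(f[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden path and its joint probability with obs."""
    v = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        pointers, new_v = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: v[r] * trans_p[r][s])
            pointers[s] = best_prev
            new_v[s] = v[best_prev] * trans_p[best_prev][s] * emit_p[s][symbol]
        back.append(pointers)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]
```

Calling `viterbi("GGCATT", STATES, START_P, TRANS_P, EMIT_P)` returns one state label per nucleotide together with the path probability. The sketch multiplies raw probabilities for readability; for long sequences a practical version would work with log probabilities or scaling to avoid numerical underflow.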
Comparative genomics was previously misguided by the naïve dogma that what is true in E. coli is also true in the elephant. With the rejection of such a dogma, comparative genomics has been positioned in proper evolutionary context. Here I numerically illustrate the application of phylogeny-based comparative methods in comparative genomics involving both continuous and discrete characters to solve problems from characterizing functional association of genes to detection of horizontal gene transfer and viral genome recombination, together with a detailed explanation and numerical illustration of statistical significance tests based on the false discovery rate (FDR). FDR methods are essential for multiple comparisons associated with almost any large-scale comparative genomic studies. I discuss the strengths and weaknesses of the methods and provide some guidelines on their proper applications.
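To make the FDR idea concrete, here is a minimal sketch assuming the Benjamini-Hochberg step-up procedure, one widely used FDR method. The abstract above does not say which FDR procedures it illustrates, so this choice, the function name, and the parameter `q` (the target false discovery rate) are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Benjamini-Hochberg step-up rule (assumed here for illustration):
    sort the m p-values, find the largest rank k with p_(k) <= (k / m) * q,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that fails its own rank threshold is still rejected whenever a larger p-value passes its threshold; this is what distinguishes the procedure from applying a per-test cutoff independently.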
Unique Shine–Dalgarno Sequences in Cyanobacteria and Chloroplasts Reveal Evolutionary Differences in Their Translation Initiation
Microorganisms require efficient translation to grow and replicate rapidly, and translation is often rate-limited by initiation. A prominent feature that facilitates translation initiation in bacteria is the Shine–Dalgarno (SD) sequence. However, there is much debate over its conservation in Cyanobacteria and in chloroplasts, which presumably originated from endosymbiosis of ancient Cyanobacteria. Elucidating the utilization of SD sequences in Cyanobacteria and in chloroplasts is therefore important for understanding whether 1) the role of SD sequences in cyanobacterial translation was reduced prior to chloroplast endosymbiosis or 2) translation in Cyanobacteria and in plastids has been subjected to different evolutionary pressures. To test these alternatives, we employed genomic, proteomic, and transcriptomic data to trace differences in SD usage among Synechocystis species, Microcystis aeruginosa, cyanophages, Nicotiana tabacum chloroplast, and Arabidopsis thaliana chloroplast. We corrected their m...
The rate of protein synthesis depends on both the rate of initiation of translation and the rate of elongation of the peptide chain. The rate of initiation depends on the encountering rate between ribosomes and mRNA; this rate in turn depends on the concentration of ribosomes and mRNA. Thus, patterns of codon usage that increase transcriptional efficiency should increase mRNA concentration, which in turn would increase the initiation rate and the rate of protein synthesis. An optimality model of the transcriptional process is presented with the prediction that the most frequently used ribonucleotide at the third codon sites in mRNA molecules should be the same as the most abundant ribonucleotide in the cellular matrix where mRNA is transcribed. This prediction is supported by four kinds of evidence. First, A-ending codons are the most frequently used synonymous codons in mitochondria, w...
The quality of a microarray experiment is measured by sensitivity and specificity which depend on hybridization efficiency and non-specific cross-hybridization. The length and GC% of probe sequences are known to strongly affect hybridization and cross-hybridization. However, the joint effect of both the length and GC% of the probe sequences on microarray signal intensity has not been systematically assessed. Here I use a set of yeast microarray data with the GC% of probe sequences varying from 12.5% to 68.75% and with the probe length varying from 27 to 40 nt to simultaneously assess both the effect of probe length and GC% on DNA hybridization. Both probe length and GC% have significant impact on signal intensity (SI) and a model derived from the data shows how changes in probe GC% can be compensated by the probe length and why such compensation did not work in some previous studies. SI increases sigmoidally with the probe GC% based on a data set where the probe length is constant. O...
The polyketide griseofulvin is a natural antifungal compound and research in griseofulvin has been key in establishing our current understanding of polyketide biosynthesis. Nevertheless, the griseofulvin gsf biosynthetic gene cluster (BGC) remains poorly understood in most fungal species, including Penicillium griseofulvum where griseofulvin was first isolated. To elucidate essential genes involved in griseofulvin biosynthesis, we performed third-generation sequencing to obtain the genome of Penicillium griseofulvum strain D-756. Furthermore, we gathered publicly available genomes of 11 other fungal species in which the gsf gene cluster was identified. In a comparative genome analysis, we annotated and compared the gsf BGC of all 12 fungal genomes. Our findings show no gene rearrangements at the gsf BGC. Furthermore, seven gsf genes are conserved in most genomes surveyed whereas the remaining six are poorly conserved. This study provides new insights into differences between gsf BGC and...
Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, m...
The design of Pfizer/BioNTech and Moderna mRNA vaccines involves many different types of optimizations. Proper optimization of vaccine mRNA can reduce the dosage required for each injection, leading to more efficient immunization programs. The mRNA components of the vaccine need to have a 5′-UTR to load ribosomes efficiently onto the mRNA for translation initiation, optimized codon usage for efficient translation elongation, and an optimal stop codon for efficient translation termination. Both the 5′-UTR and the downstream 3′-UTR should be optimized for mRNA stability. The replacement of uridine by N1-methylpseudouridine (Ψ) complicates some of these optimization processes because Ψ is more versatile in wobbling than U. Different optimizations can conflict with each other, and compromises would need to be made. I highlight the similarities and differences between Pfizer/BioNTech and Moderna mRNA vaccines and discuss the advantages and disadvantages of each to facilitate future vaccine improvement...
SARS-CoV-2 can transmit efficiently in humans, but it is less clear which other mammals are at risk of being infected. SARS-CoV-2 encodes a Spike (S) protein that binds to the human ACE2 receptor to mediate cell entry. A species with a human-like ACE2 receptor could therefore be at risk of being infected by SARS-CoV-2. We compared 132 mammalian ACE2 genes with one another and 17 coronavirus S proteins with one another. We showed that while global similarities reflected by whole ACE2 gene alignments are poor predictors of high-risk mammals, local similarities at key S protein-binding sites highlight several high-risk mammals that share good ACE2 homology with human. Bats are likely reservoirs of SARS-CoV-2, but there are other high-risk mammals that share better ACE2 homologies with human. Both SARS-CoV-2 and SARS-CoV are closely related to bat coronavirus. Yet, among host-specific coronaviruses infecting high-risk mammals, key ACE2-binding sites on S proteins share highest similarities between SARS-CoV-2...
Bioinformatics of Genome Regulation and Structure II
All vertebrate genomes are heavily methylated at CpG dinucleotide sites, and methylated CpG dinucleotides are prone to CpG→TpG mutations through spontaneous deamination. This leaves different footprints on coding and noncoding sequences. We capture these different fingerprints by five indices that can be used to discriminate between coding and non-coding (intron) sequences. We also show that a linear discriminant function derived from a training set of coding and intron sequences from human chromosome 22 can be successfully used in gene-finding of the zebrafish genome.
All dating studies involving SARS-CoV-2 are problematic. Previous studies have dated the most recent common ancestor (MRCA) between SARS-CoV-2 and its close relatives from bats and pangolins. However, the evolutionary rate thus derived is expected to differ from the rate estimated from sequence divergence of SARS-CoV-2 lineages. Here, I present dating results for the first time from a large phylogenetic tree with 86,582 high-quality full-length SARS-CoV-2 genomes. The tree contains 83,688 genomes with full specification of collection time. Such a large tree spanning a period of about 1.5 years offers an excellent opportunity for dating the MRCA of the sampled SARS-CoV-2 genomes. The MRCA is dated 16 August 2019, with the evolutionary rate estimated to be 0.05526 mutations/genome/day. The Pearson correlation coefficient (r) between the root-to-tip distance (D) and the collection time (T) is 0.86295. The NCBI tree also includes 10 SARS-CoV-2 genomes isolated from cats, collected over ...
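The relation between root-to-tip distance (D), collection time (T), evolutionary rate, and MRCA date can be sketched with an ordinary least-squares regression of D on T. This is a generic root-to-tip regression offered as an assumption about the kind of calculation involved; the abstract above is truncated and does not describe its exact dating method, and the function name and time units below are hypothetical.

```python
import math


def root_to_tip_regression(times, distances):
    """Least-squares fit of D = rate * (T - t_mrca).

    times: collection times (e.g. days since an arbitrary reference date).
    distances: root-to-tip distances (e.g. mutations per genome).
    Returns (rate, t_mrca, r): the slope, the time at which the fitted
    line reaches D = 0, and the Pearson correlation between T and D.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_dd = sum((d - mean_d) ** 2 for d in distances)
    s_td = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    rate = s_td / s_tt
    intercept = mean_d - rate * mean_t
    t_mrca = -intercept / rate  # x-intercept of the fitted line
    r = s_td / math.sqrt(s_tt * s_dd)
    return rate, t_mrca, r
```

With times measured in days, the slope is in distance units per day and `t_mrca` is expressed on the same day scale as the input, so it can be converted back to a calendar date by adding it to the reference date.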
Marsh Spot Disease and Its Causal Factor, Manganese Deficiency in Plants: A Historical and Prospective Review
Agricultural Sciences
This review provides an examination of the marsh spot disease in beans and the roles played by its causal factor, manganese (Mn) deficiency. The discovery of the marsh spot disease, its relation with Mn deficiency, and how it can be treated are discussed. Mn serves as a cofactor and a catalyst in various metabolic processes in different cell compartments, such as the oxygen-evolving complex of photosystem II (PSII) or reactive oxygen species scavenging. Some major quantitative trait loci (QTL) and putative candidate genes associated with Mn content in plants, especially in plant seeds, have been identified. Marsh spot disease in cranberry common bean is controlled by several major genes with significant additive and epistatic effects. They provide valuable clues for QTL candidate gene prediction and an improved understanding of the genetic mechanisms responsible for marsh spot resistance in plants.
Papers by Xuhua Xia