From Anderson Lab


Next Generation Sequencing Software

The development of Next-Generation Sequencing (NGS) has dramatically increased the throughput of traditional sequencing by using RNA reads to work backward, assembling a transcriptome from expressed fragments. These methods lower the cost of sequencing by analyzing vast quantities of RNA fragments, but they have introduced computational and storage obstacles. RNA sequencing produces large files, often in excess of 20 GB, that are unwieldy for the general researcher. Vast data resources have been created to extract the information contained in these massively parallel data, but these resources and their associated tools are heterogeneous and highly distributed. The scientist must therefore create and execute highly complex, customized analyses that involve gathering and organizing data from heterogeneous sources while interfacing with a variety of software tools, a task that is often infeasible without the support of computer specialists and significant hardware upgrades.

The objective of this work is to combine the strengths of the scientific workflow project Galaxy with traditional high performance computing resources and affordable cloud-based data storage that encourages collaboration. The specific goals are to create a Galaxy tool that executes Trinity on a remote cluster and to develop software to download, upload, and manage NGS results to and from the cloud. Trinity is a computationally expensive de novo assembler that processes large RNA files to produce a FASTA transcriptome. Our system facilitates broad-based collaboration and distribution by building around the Google Drive cloud storage solution, where processed RNA sequences can be shared and analyzed with a single upload and reused for multiple purposes. The system is also able to remotely execute Trinity and other NGS tools on traditional high performance computing infrastructure that is often available as a shared resource at universities. Specifically, we develop a Trinity-based workflow, executed in a heterogeneous environment with batch HPC resources, that produces the abundances of RNA fragments to discover patterns of expression.
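As an illustration of the remote-execution goal, the sketch below shows how a Galaxy tool wrapper might assemble a Trinity command line and wrap it in a PBS batch script for submission to a shared cluster. The file names, job settings, and resource limits here are hypothetical placeholders, not our production configuration.

```python
# Sketch of a wrapper step: build a Trinity de novo assembly command and
# embed it in a minimal PBS batch script for remote HPC submission.
# All paths, job names, and resource values are illustrative placeholders.

def trinity_command(left, right, output_dir, cpus=16, max_memory="50G"):
    """Assemble a Trinity command for paired-end FASTQ reads."""
    return (
        f"Trinity --seqType fq --left {left} --right {right} "
        f"--CPU {cpus} --max_memory {max_memory} --output {output_dir}"
    )

def batch_script(command, job_name="trinity_job", walltime="48:00:00"):
    """Wrap a shell command in a minimal PBS submission script."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l walltime={walltime}",
        "cd $PBS_O_WORKDIR",   # run from the submission directory
        command,
    ])

script = batch_script(trinity_command("reads_1.fq", "reads_2.fq", "trinity_out"))
print(script)
```

The resulting script would be transferred to the cluster and submitted (e.g., with `qsub`); the FASTA transcriptome it produces is what the system then pushes to Google Drive.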

Click here for more information on this project

Metabolomics Software


The metabolomics analysis toolbox (Metabolink) is a collection of software developed by the Bioinformatics Research Groups at the College of Charleston and Wright State University to support metabolomics research. It consists of GUI software for the various phases of analyzing the complex NMR spectra produced by biofluids (blood, urine, cerebrospinal fluid, tissue homogenates, etc.).


Step 1

You must install the MATLAB runtime environment before you can run this program. This only needs to be done once. The download depends on the operating system that you are running:

Step 2

There are two ways to download Metabolink:

  1. If you are interested in automatically syncing the programs included in Metabolink, then you can request automatic distribution using Google Drive by e-mailing Dr. Paul Anderson.
  2. Or you can download a current snapshot of the software:
    1. Windows (32 bit)
    2. Windows (64 bit)
    3. Mac - Mac users must also download colors.mat and copy the file to Macintosh HD.

Generalized Model For Metabolomic Analysis


As metabolomic technology expands, validated techniques for analyzing highly dimensional categorical data are becoming increasingly important. This manuscript presents a novel latent vector-based methodology for analyzing complex data sets with multiple groups that include both high and low doses using orthogonal projections to latent structures (OPLS) coupled with hierarchical clustering. This general methodology allows complex experimental designs (e.g., multiple dose and time combinations) to be encoded and directly compared. Further, it allows for the inclusion of low dose samples that do not exhibit a strong enough individual response to be modeled independently. A dose- and time-responsive metabolomic study was completed to evaluate and demonstrate this methodology. Single doses (0.1 to 100 mg/kg body weight) of α-naphthylisothiocyanate (ANIT), a common model of hepatic cholestasis, were administered orally in corn oil to male Fischer 344 rats. Urine samples were collected predose and daily through day-4 postdose. Blood samples were collected predose and on days 1 and 4 postdose. Urine samples were analyzed by 1H-NMR spectroscopy, and the spectra were adaptively binned to reduce dimensionality. The proposed methodology for NMR-based urinary metabolomics was sensitive enough to detect ANIT-induced effects with respect to both dose and time at doses below the threshold of clinical toxicity. A pattern of ANIT-dependent effects established at the highest dose was seen in the 50 mg/kg and 20 mg/kg dose groups, an effect not directly identifiable with individual PCA. Coupling the pattern found by the OPLS algorithm and hierarchical clustering revealed a relationship between the 100, 50 and 20 mg/kg dose groups, suggesting a characteristic effect of exposure. 
These studies demonstrate that a metabolomics approach with flexible binning of 1H spectra and appropriate application of multivariate analyses can reveal biologically relevant information about the temporal metabolic perturbations caused by exposure and toxicity.
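The clustering step described above can be illustrated with a short sketch: hierarchical (Ward) clustering applied to per-group latent scores. The score values below are synthetic stand-ins chosen only to mimic the qualitative pattern reported (the 20 and 50 mg/kg groups clustering with the 100 mg/kg group); they are not the study's data.

```python
# Illustrative sketch: after a latent-variable projection (e.g., OPLS)
# yields scores per dose group, hierarchical clustering groups the dose
# responses. Scores here are synthetic placeholders, not study data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

doses = [0.1, 1, 20, 50, 100]          # mg/kg body weight
scores = np.array([
    [0.1, 0.0],   # 0.1 mg/kg: near-control response
    [0.2, 0.1],   # 1 mg/kg
    [2.1, 1.8],   # 20 mg/kg: pattern resembling the high dose
    [2.6, 2.2],   # 50 mg/kg
    [3.0, 2.5],   # 100 mg/kg
])

# Ward linkage on Euclidean distances between latent-score vectors,
# then cut the dendrogram into two clusters.
Z = linkage(scores, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

for d, lab in zip(doses, labels):
    print(f"{d} mg/kg -> cluster {lab}")
```

With these placeholder scores, the two low-dose groups fall into one cluster and the 20, 50, and 100 mg/kg groups into another, mirroring the dose-pattern relationship the OPLS/clustering combination revealed.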

For more information click here.


More information on PCA coming soon.

Mahle, D. A., Anderson, P. E., DelRaso, N. J., Raymer, M. L., Neuforth, A. E., and Reo, N. V. (2010). A Generalized Model for Metabolomic Analyses: Application to Dose and Time Dependent Toxicity. Metabolomics, 7(2), 206-216.

Localized Deconvolution

The interpretation of nuclear magnetic resonance (NMR) experimental results for metabolomics studies requires intensive signal processing and multivariate data analysis techniques. Standard quantification techniques attempt to minimize effects from variations in peak positions caused by sample pH, ionic strength, and composition. These techniques fail to account for adjacent signals, which can lead to drastic quantification errors. Attempts at full spectrum deconvolution have been limited in adoption and development due to the computational resources required. Herein, we develop a novel localized deconvolution algorithm for general purpose quantification of NMR-based metabolomics studies. Localized deconvolution decreases average absolute quantification error by 97% and average relative quantification error by 88%. When applied to a 1H metabolomics study, the cross-validation metric, Q^2, improved by 16% through reduced within-group variability. This increase in accuracy comes at additional computational cost, which is overcome by translating the algorithm to the map-reduce design paradigm.
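The core idea, fitting overlapping neighbors jointly within a local window rather than integrating each peak in isolation, can be sketched numerically. This toy example assumes known peak positions and linewidths and uses a linear least-squares fit of Lorentzian lineshapes; the published algorithm is more general.

```python
# Toy sketch of localized deconvolution: model a local spectral window as
# a sum of Lorentzian lines at known positions and solve for all
# amplitudes jointly, so adjacent overlapping signals are accounted for.
# Positions/widths are assumed known here for simplicity.
import numpy as np

def lorentzian(x, center, width):
    """Unit-height Lorentzian lineshape."""
    return (width / 2) ** 2 / ((x - center) ** 2 + (width / 2) ** 2)

# Synthetic local window: two strongly overlapping peaks.
x = np.linspace(0, 1, 500)
true_amps = np.array([1.0, 0.6])
centers, width = [0.45, 0.55], 0.08
signal = sum(a * lorentzian(x, c, width) for a, c in zip(true_amps, centers))

# Joint linear least squares over the window resolves the overlap that
# naive per-peak integration would misattribute.
basis = np.column_stack([lorentzian(x, c, width) for c in centers])
est_amps, *_ = np.linalg.lstsq(basis, signal, rcond=None)
print(est_amps)  # recovers amplitudes close to [1.0, 0.6]
```

Because each window involves only a handful of adjacent peaks, these local fits are independent, which is exactly what makes the algorithm a natural fit for the map-reduce paradigm mentioned above.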

Click here for more information.

Anderson, P. E., Ranabahu, A. H., Mahle, D. A., Sheth, A. P., and DelRaso, N. J. (2012). Cloud-based Map-Reduce NMR Spectral Deconvolution: Adjacent Deconvolution. In press, BIOCOMP 2012.

Dynamic Adaptive Binning

The interpretation of nuclear magnetic resonance (NMR) experimental results for metabolomics studies requires intensive signal processing and multivariate data analysis techniques. A key step in this process is the quantification of spectral features, which is commonly accomplished by dividing an NMR spectrum into several hundred integral regions or bins. Binning attempts to minimize effects from variations in peak positions caused by sample pH, ionic strength, and composition, while reducing the dimensionality for multivariate statistical analyses. Herein we develop an improved novel spectral quantification technique, dynamic adaptive binning. With this technique, bin boundaries are determined by optimizing an objective function using a dynamic programming strategy. The objective function measures the quality of a bin configuration based on the number of peaks per bin. This technique shows a significant improvement over both traditional uniform binning and other adaptive binning techniques. This improvement is quantified via synthetic validation sets by analyzing an algorithm’s ability to create bins that do not contain more than a single peak and that maximize the distance from peak to bin boundary. The validation sets are developed by characterizing the salient distributions in experimental NMR spectroscopic data. Further, dynamic adaptive binning is applied to a 1H NMR-based experiment to monitor rat urinary metabolites to empirically demonstrate improved spectral quantification.
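The dynamic programming strategy can be sketched as follows. This toy version uses a simplified objective (each bin must contain exactly one peak; the score is the distance from the peak to its nearest boundary), which condenses the published number-of-peaks-per-bin objective; candidate boundaries would in practice come from local minima of the spectrum.

```python
# Toy dynamic-programming sketch of adaptive bin-boundary selection.
# Simplified objective: every bin holds exactly one peak, and we maximize
# the total distance from each peak to its nearest bin boundary.

def bin_score(peaks, lo, hi):
    """Score a candidate bin [lo, hi); invalid if it has != 1 peak."""
    inside = [p for p in peaks if lo <= p < hi]
    if len(inside) != 1:
        return float("-inf")
    return min(inside[0] - lo, hi - inside[0])

def adaptive_bins(peaks, candidates):
    """peaks, candidates: sorted positions; candidates include both ends."""
    n = len(candidates)
    best = [float("-inf")] * n   # best total score ending at candidate j
    prev = [None] * n            # backpointer for reconstruction
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            s = best[i] + bin_score(peaks, candidates[i], candidates[j])
            if s > best[j]:
                best[j], prev[j] = s, i
    cuts, j = [], n - 1          # backtrack from the right edge
    while j is not None:
        cuts.append(candidates[j])
        j = prev[j]
    return cuts[::-1]

cuts = adaptive_bins([1.0, 2.0, 3.0], [0.0, 0.4, 1.5, 1.6, 2.4, 3.5])
print(cuts)  # [0.0, 1.6, 2.4, 3.5]
```

Note how the DP prefers the boundary at 1.6 over 1.5: it yields a larger margin around the peak at 1.0 without reducing the margin around the peak at 2.0, the kind of global trade-off that uniform binning cannot make.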

Anderson, P. E., Raymer, M. L., Reo, N. V., DelRaso, N. J., and Doom, T. E. (2010). Dynamic Adaptive Binning: Dynamic quantification of nuclear magnetic resonance spectroscopic data. In press in the journal Metabolomics.

For more information click here.


Click here for documentation.

Contribute to the project

The source code for the project is hosted on GitHub. You can contribute at


The recent flood of Web-enabled and Web-based tools for eScience has provided scientists with a wide array of new methods to collect, report, analyze, and share their data. These Web-based technologies have demonstrated great potential to enable broader collaboration and to facilitate the sharing and re-use of experimental data. The wide availability of quality data sets and tested process flows is indeed a welcome addition for an eScientist, yet there are a number of challenges to overcome. In scientific domains such as bioinformatics, the lack of standardization of scientific methods, algorithms, and data sharing is perhaps the greatest challenge to broader adoption.

Many eScience applications, including workflows, have high performance computation requirements. Cloud computing is increasingly seen as a natural choice for such high-demand computing tasks because it requires minimal upfront investment in computational resources. Using computing clouds also avoids the complications of catering for periodic peak usage, frequent software and hardware updates, and the need for trained IT staff. However, these advantages come with their own set of challenges. The need to program for a specific cloud platform, also called vendor lock-in, is one of the major hindrances to the wide adoption of clouds, especially in the scientific domains.

We present SCALE, an eScience experimental data analysis platform that supports the following features to meet the above challenges:

  • enable easy collaboration across a community of scientists with minimal operational support from IT and computer scientists,
  • use cloud computing resources (private or public) in a platform-agnostic manner to provide high performance computing when applicable.

For more information see the SCALE page at METABOLINK.ORG.

A limited online tool is available here.

Ajith Ranabahu, Paul Anderson, and Amit Sheth. (2011). The Cloud Agnostic e-Science Analysis Platform, IEEE Internet Computing, vol. 15, no. 6, pp. 85-89, Nov./Dec. 2011, doi:10.1109/MIC.2011.159


GenoStat is a Java-based forensic DNA analysis application that generates DNA profile match statistics (random match probability (RMP), combined probability of inclusion (CPI), and sibling match probabilities) as well as performing mixture resolution (separating mixtures into their contributor components).
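For orientation, the sketch below shows the textbook Hardy-Weinberg forms of two of these statistics, without the theta (subpopulation) correction that real casework software typically applies; the allele frequencies are made up for illustration and this is not GenoStat's implementation.

```python
# Back-of-the-envelope sketch of two forensic match statistics.
# Textbook Hardy-Weinberg formulas, no subpopulation correction;
# allele frequencies below are illustrative placeholders.

def locus_rmp(p, q=None):
    """Random match probability at one locus:
    p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

def combined_rmp(locus_probs):
    """Product rule across independent loci."""
    result = 1.0
    for prob in locus_probs:
        result *= prob
    return result

def locus_cpi(allele_freqs):
    """Probability of inclusion at one locus: the squared sum of the
    frequencies of all alleles observed in the mixture. CPI is the
    product of this quantity across loci."""
    return sum(allele_freqs) ** 2

# Two-locus example: one heterozygote (2pq) and one homozygote (p^2).
rmp = combined_rmp([locus_rmp(0.1, 0.2), locus_rmp(0.05)])
print(rmp)  # 2*0.1*0.2 * 0.05**2 = 1e-4
```

Smaller values of RMP correspond to rarer profiles and therefore stronger evidence of identity, which is why the product rule across many loci yields such small numbers in practice.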

For more information

Cloud-based Laboratory Informatics Management System

The emergence of informatics as a field of study has brought about tremendous advances in the life sciences - the sequencing of the human genome; inferring the structure of proteins; identifying the activity of microRNA; and the list goes on. And yet, a remarkable characteristic of some of this pioneering work is that it is not “collaboration” at all in the traditional sense of the term. In the old model, scientists with a common interest would team up and address a problem together. But this sort of one-to-one collaboration is rapidly being replaced by something far more exciting. In the age of informatics experimental results from tens of thousands of similar experiments conducted by scientists worldwide are available in online libraries. These repositories serve as much more than simple archives. They are the digital playgrounds for computational scientists with innovative ideas for data visualization, analysis, and management.

The problem of data collection, exchange, analysis, and collaboration is not new. Current practices and available software lead individual labs to develop, test, and distribute their own architectures and analysis pipelines, which are often incompatible, domain specific, and inflexible. We propose to develop community-enabled cyberinfrastructure that will include four significant contributions: (1) a platform-as-a-service-based laboratory information management system built upon Google Apps; (2) a community-driven annotation system that facilitates peer review of online repositories; (3) a framework that supports the correctness guarantees of relational databases while leveraging the high availability of a NoSQL datastore; and (4) a semantic web interface with faceted search that allows scientists to leverage inference without data science expertise. The cumulative result will be a fast, dynamic, and cost-effective model for data collection, exchange, analysis, and collaboration that can be used for a variety of applications, such as machine learning, data mining, pattern recognition, and inference.

While the target and impact of this work are broad, we will begin by focusing on the field of high-throughput nuclear magnetic resonance (NMR) metabolomics - a relative newcomer to the informatics realm - that has yet to benefit from the development of a standard repository and query system for experimental data. This system will provide the first community-driven, unifying framework for NMR-based spectroscopic identification, annotation, and quantification. The utility of this software will be demonstrated through ongoing collaborations with the Air Force Research Laboratory, where datasets of human fatigue and toxicology will be analyzed (data available).
