Charleston Computational Genomics Group
Developing infrastructure and methodology to efficiently analyze genomic and bioinformatic data with the express purpose of training undergraduate students to become competent scientists in the field.
Distributed computing over several clusters. The BiRG lab has two clusters, one newly created for this project. The College of Charleston also has a traditional cluster. We will be distributing computation between these clusters and the one at MUSC.
1) Build cyber and hardware infrastructure
2) Develop novel software for the distributed computation
3) Use the system for genomic analysis
- The express purpose being the training of students in genomic and bioinformatic sciences to be utilized at local and foreign institutions.
The PI for the project is Paul Anderson of the College of Charleston Computer Science Department. Co-PIs are Andrew Shedlock of the College of Charleston Biology Department and Dennis Watson of MUSC's Department of Pathology and Laboratory Medicine. Other members of the team are Starr Hazard, who handles MUSC's cluster and computation, and Bob Wilson, Director of the genomics core. Current students working on the project are Jeremy Morgan, Connor Stanley, Matt Paul and Tori McCaffrey.
Data Mining and Machine Learning
While the development of cross-disciplinary fields, such as bioinformatics and data science, has changed scientific inquiry in many respects, the realization of their full potential has in many instances been hindered by the separation of algorithm developers and domain experts. Cloud-based, platform-independent services and software have emerged as a mechanism to lower the technical barriers of scientific computing. A byproduct of this trend, Learn2Mine is an open-source cloud-based informatics platform engineered for data exploration. Combining the machine learning techniques from Weka with a user interface that flattens the learning curve and improves on the ease of use of other software, Learn2Mine provides scientists with a single integrated platform to learn data mining techniques and apply them to their biological datasets. As scientists work through the online lessons, their confidence in their own analytical results will increase. Thus, scientists will be able to focus on evaluating methods for analyzing their data and use Learn2Mine’s integrated cloud storage to share data and results with collaborators and the community. The platform is also designed to teach data mining and machine learning techniques to computer scientists, allowing them to apply classifiers, clustering algorithms, etc. to datasets that have been shared by the community. This allows students with a variety of backgrounds to share and compare their results, thus providing context for their data mining applications while their knowledge of statistical techniques improves.
Learn2Mine utilizes a decomposition, or drill-down, method of input and output, as opposed to the workflow methods most prominent in typical systems such as RapidMiner, Taverna, and Galaxy. This lets Learn2Mine operate at higher levels of abstraction, encapsulating complex workflows based on their input and output. Scientists can customize shared templates for statistical analysis, combining algorithms and tools to better fit their data. For example, a scientist observing genetics data may need a custom template that integrates a Nearest Neighbor Algorithm with a Support Vector Machine in order to produce better results with the required utility. Novel customization is supported, and new templates created by scientists can persist indefinitely in Learn2Mine, in addition to being shared back to the community, allowing Learn2Mine to evolve. These capabilities and features are accessible through cloud-based software as a service that is currently being hosted on Google App Engine, with storage and sharing of biological datasets provided by integrating with Google Drive. In addition to these advantages, Learn2Mine also teaches data science through the implementation of a gameful experience, which allows students to visit the site and immediately start programmatic lessons with built-in bioinformatics practice datasets and specialized feedback. Classifiers, like abilities, can be unlocked by completing lessons, and unlocked abilities can be viewed in the skill tree on a student’s profile, linked through Google. First, the system’s initial implementation and associated feedback integration will be tested on its ability to analyze standard pedagogical datasets (e.g., the Iris dataset). Second, the utility of this software will be demonstrated through ongoing collaborations with the Air Force Research Laboratory, where metabolomics spectroscopic datasets of human fatigue and toxicology will be analyzed.
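To make the nearest neighbor idea mentioned above concrete, here is a minimal sketch of a k-nearest-neighbor classifier in plain Python. It is an illustration only and is not taken from Learn2Mine or Weka: the function name knn_predict and the toy measurements (loosely modeled on two Iris species) are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (petal length, petal width) -> species label.
toy_train = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((4.5, 1.5), "versicolor"),
    ((4.9, 1.5), "versicolor"),
]

print(knn_predict(toy_train, (1.6, 0.25)))
print(knn_predict(toy_train, (4.6, 1.45)))
```

A custom template of the kind described above could chain a step like this with a second classifier, but how Learn2Mine combines them is not specified here.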
Metabolomics is a relatively young field that requires intensive data analysis for interpretation of experimental results, quantifying and annotating hundreds of metabolite levels for each sample analyzed. Regardless of the data collection method, metabolomics experiments require substantial computational and statistical support, and researchers require significant infrastructure for storage, visualization, and peer-review of online repositories of this highly multidimensional data. Currently, individual labs spend time engineering their own systems and analysis pipelines, which are often inflexible, incompatible, and domain specific. For these reasons, we present a general purpose Google-based laboratory information management system (gLIMS) which will serve as a scalable community-driven annotation system for the field of metabolomics. This system combines the intuitive Google Drive interface and storage system with applications that empower the data, including data-visualization, annotation, and exposure to the semantic web. The system is flexible, efficient, available, and easy to integrate with scientific workflow software by providing a representational state transfer (RESTful) web service interface. This system is implemented using Google’s platform as a service (PaaS) technologies to produce a robust, available, scalable, and extensible lab information management system running entirely in-browser which facilitates collaboration on a worldwide scale.
After a user uploads his or her raw data files to Drive, gLIMS then reads, parses, and persists results back to Google Drive as individual files grouped into collections based on the associated metadata. The inverse operation is possible as well; gLIMS reassembles and formats the data for export. The metadata hierarchy is also represented with resource description framework (RDF) graphs for exposure to the semantic web, which may be queried using the protocol and RDF query language, SPARQL. The combined features of Google Drive and gLIMS reduce the need to think about developing resources and allow the researcher to focus on analysis. Our interface visualizes data collections in a folder-like hierarchy. Categories and subcategories can be browsed or searched, empowering researchers to filter irrelevant data. Google’s built-in access control lists make it easy to choose what to share and with whom.
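The grouping of parsed files into metadata-based collections, and the representation of that metadata as RDF-style triples, can be sketched roughly as follows. This is a simplified illustration and not the gLIMS implementation: the record layout, the metadata field names, and the http://example.org/glims/ base URI are all hypothetical, and no Google Drive calls are made.

```python
from collections import defaultdict


def group_by_metadata(records, keys):
    """Group parsed records into collections keyed by selected metadata fields."""
    collections = defaultdict(list)
    for record in records:
        collection_key = tuple(record["metadata"][k] for k in keys)
        collections[collection_key].append(record["name"])
    return dict(collections)


def to_triples(records, base="http://example.org/glims/"):
    """Emit one (subject, predicate, object) triple per metadata field."""
    triples = []
    for record in records:
        subject = f"<{base}file/{record['name']}>"
        for field, value in sorted(record["metadata"].items()):
            triples.append((subject, f"<{base}meta/{field}>", f'"{value}"'))
    return triples


# Hypothetical parsed records; the field names are assumptions.
records = [
    {"name": "s1.txt", "metadata": {"study": "fatigue", "tissue": "urine"}},
    {"name": "s2.txt", "metadata": {"study": "fatigue", "tissue": "urine"}},
    {"name": "s3.txt", "metadata": {"study": "tox", "tissue": "blood"}},
]

for key, names in group_by_metadata(records, ["study", "tissue"]).items():
    print(key, names)
for triple in to_triples(records):
    print(" ".join(triple), ".")
```

Triples in this shape could be loaded into an RDF store and queried with SPARQL; which vocabulary gLIMS actually uses is not stated here.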
In addition to the standard functionality mentioned above, gLIMS has some features which are particularly useful for metabolomics research. gLIMS graphing functionality provides visualization and annotation. Multiple data-sets can be graphed simultaneously to contrast experimental results. Plots can be annotated and the resulting metadata stored and added to the repository providing valuable labels for machine learning algorithms and other research. The system will aggregate the results from community experiments providing a large data repository to fuel machine learning, data-mining, pattern-recognition, and inference algorithms. The target and impact of this application are broad, but our initial focus is nuclear magnetic resonance (NMR) metabolomics. In particular, United States Air Force human fatigue and toxicology datasets are available as a preliminary demonstration of the usefulness of our system.
Next Generation Genomic Sequencing
The development of Next-Generation Sequencing, or NGS, has heightened the throughput of traditional sequencing by using RNA reads to work backward, assembling a transcriptome from expressed fragments. These methods are designed to lower the cost of DNA sequencing by analyzing vast quantities of RNA fragments but have led to computational and storage obstacles. RNA sequencing produces large files that can be in excess of 20 GB, making them unwieldy for the general researcher. Vast data resources have been created to extract the information contained in these massively parallel data. These resources and their associated tools are heterogeneous and highly distributed. This requires the scientist to create and execute highly complex customized analyses that involve gathering and organizing data from heterogeneous sources while interfacing with a variety of software tools. This is often infeasible without the support of computer specialists and significant hardware upgrades.
The objective of this work is to combine the strengths of the scientific workflow project, Galaxy, with traditional high performance computing resources and affordable cloud-based data storage that encourages collaboration. The specific goals are to create a Galaxy tool that executes Trinity on a remote cluster and to develop software to download, upload, and manage NGS results to and from the cloud. Trinity is a computationally expensive de novo assembler that processes large RNA files to produce a FASTA transcriptome. Our system facilitates broad-based collaboration and distribution by building around the Google Drive cloud storage solution, where the processing of RNA sequences can be both shared and analyzed with a single upload and be reused for multiple purposes. The system is also able to remotely execute Trinity and other NGS tools on traditional high performance computing infrastructure that is often available as a shared resource at universities. Specifically, we develop a Trinity-based workflow executed in a heterogeneous environment with batch HPC resources that produces the abundances of RNA fragments to discover patterns of expression.
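Since the assembler's output is a FASTA transcriptome, a downstream step often begins by reading that file. The following is a minimal sketch of a FASTA reader in plain Python, given only as an illustration: the function names and the two example contigs are hypothetical and do not come from the workflow described above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_lengths(lines):
    """Map each contig identifier (first word of the header) to its length."""
    return {header.split()[0]: len(seq) for header, seq in read_fasta(lines)}


# Hypothetical two-contig example, not real assembler output.
example = """>contig_1 hypothetical
ACGTACGT
ACGT
>contig_2
GGCC
""".splitlines()

print(contig_lengths(example))
```

For a real file, the same functions can be called on an open file handle, which is read line by line rather than loaded into memory at once.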
To demonstrate our workflow system with cloud-based storage and sharing, we will analyze duplicate samples of ovarian biopsies that have been generated by 72-base, paired-end sequencing (RNAseq) performed on an Illumina Genome Analyzer IIX. Sequencing was performed utilizing a balanced block design with pooled barcoded samples from 8 fish run in each lane and duplicate lanes employed as sequencing technical replicates. As the Illumina-based RNAseq approach exhibits much higher sensitivity for low abundance transcripts and far deeper sequencing coverage than the Roche 454 pyro-sequencing originally utilized, we expect that thousands of new ovarian gene transcripts will be revealed by this approach after RNAseq reads are quality filtered, assembled de novo into contigs and compared to the existing striped bass ovarian transcriptome, and especially to the tens of thousands of singleton sequences that we obtained but have not previously verified or published. This process will yield a far more comprehensive ovarian transcriptome that will represent the overwhelming majority of genes expressed in the ovary of striped bass during oocyte growth and maturation.
Metabolomics is the exhaustive characterization of metabolite concentrations in biofluids and tissues. The use of NMR and chromatography-linked mass spectrometry to assay metabolic profiles of tissue homogenates and biofluids has been increasingly recognized as a powerful tool for biological discovery. In recent years metabolomics techniques have been applied to a wide variety of diagnostic, preclinical, systems biology, and ecological studies. Working with Dr. Nick Reo's NMR spectroscopy lab at Wright State University, we are developing standards-based tools and web services for the pre-processing, normalization/standardization, exploratory and comparative analysis, and visualization of NMR spectra from biofluids.
NMR-based metabolomics has been used to associate an organism’s health status with its metabolite profile measured in biofluids (e.g., urine, blood, fecal and tissue extracts) or tissue biopsies. Coupled with multivariate data analyses, the 1H NMR-based metabolomics approach is a fast, accurate and reproducible analytical technique for visualization of biochemical changes in biofluids or tissues. This methodology involves correlating observed changes in metabolite levels to the biological effects related to physiological stimuli or genetic modification, toxicological, pathophysiological or environmental conditions. Studies have highlighted its potential for the successful identification and characterization of toxicity, metabolic pathways perturbed in various cancers, disease-related stages in chronic lymphocytic leukemia, and other pathophysiological conditions.
The recent flood of Web-enabled and Web-based tools for eScience has provided scientists with a wide array of new methods to collect, report, analyze and share their data. These Web-based technologies have demonstrated a great potential to enable broader collaboration and to facilitate the sharing and re-use of experimental data. The wide availability of quality data sets and tested process flows is indeed a welcome addition for an eScientist, yet there are a number of challenges to overcome. In scientific domains, such as bioinformatics, lack of standardization for scientific methods, algorithms, and data sharing can be seen as the greatest challenge for broader adoption.
Many eScience applications, including workflows, have high performance computation requirements. Cloud computing is increasingly seen as a natural choice for such high demand computing tasks because minimal upfront investments in computational resources are required. Using computing clouds also avoids the complications of catering for periodic peak usage, frequent software and hardware updates and the need for trained IT staff. However, these advantages come with their own set of challenges. The need to program for a specific cloud platform, also called vendor lock-in, is one of the major hindrances to the wide adoption of clouds, especially in the scientific domains.
We present SCALE, an eScience experimental data analysis platform that supports the following features to meet the above challenges:
- enable easy collaboration across a community of scientists with minimal operational support from IT and computer scientists,
- use cloud computing resources (private or public) in a platform agnostic manner to provide high performance computing when applicable.
As metabolomic technology expands, validated techniques for analyzing highly dimensional categorical data are becoming increasingly important. This manuscript presents a novel latent vector-based methodology for analyzing complex data sets with multiple groups that include both high and low doses using orthogonal projections to latent structures (OPLS) coupled with hierarchical clustering. This general methodology allows complex experimental designs (e.g., multiple dose and time combinations) to be encoded and directly compared. Further, it allows for the inclusion of low dose samples that do not exhibit a strong enough individual response to be modeled independently. A dose- and time-responsive metabolomic study was completed to evaluate and demonstrate this methodology. Single doses (0.1 to 100 mg/kg body weight) of α-naphthylisothiocyanate (ANIT), a common model of hepatic cholestasis, were administered orally in corn oil to male Fischer 344 rats. Urine samples were collected predose and daily through day-4 postdose. Blood samples were collected predose and on days 1 and 4 postdose. Urine samples were analyzed by 1H-NMR spectroscopy, and the spectra were adaptively binned to reduce dimensionality. The proposed methodology for NMR-based urinary metabolomics was sensitive enough to detect ANIT-induced effects with respect to both dose and time at doses below the threshold of clinical toxicity. A pattern of ANIT-dependent effects established at the highest dose was seen in the 50 mg/kg and 20 mg/kg dose groups, an effect not directly identifiable with individual PCA. Coupling the pattern found by the OPLS algorithm and hierarchical clustering revealed a relationship between the 100, 50 and 20 mg/kg dose groups, suggesting a characteristic effect of exposure. 
These studies demonstrate that the use of a metabolomics approach with flexible binning of 1H spectra and appropriate application of multivariate analyses can reveal biologically relevant information about the temporal metabolic perturbations caused by exposure and toxicity.
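Binning reduces the dimensionality of a spectrum by summing neighboring intensity values. The sketch below uses fixed-width (uniform) bins followed by normalization to total intensity. This is a deliberately simpler stand-in for the adaptive binning used in the study, whose bin boundaries follow the peaks in the data. The function names and the nine-point spectrum are hypothetical.

```python
def uniform_bin(intensities, bin_size):
    """Sum consecutive intensity values into fixed-width bins."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return [
        sum(intensities[start:start + bin_size])
        for start in range(0, len(intensities), bin_size)
    ]


def normalize_to_total(binned):
    """Scale bins so they sum to 1, making spectra comparable across samples."""
    total = sum(binned)
    if total == 0:
        return [0.0] * len(binned)
    return [value / total for value in binned]


# Hypothetical nine-point spectrum reduced to three bins.
spectrum = [0.0, 1.0, 3.0, 0.0, 2.0, 2.0, 0.0, 0.0, 2.0]
binned = uniform_bin(spectrum, 3)
print(binned)
print(normalize_to_total(binned))
```

The binned, normalized vectors are what a multivariate method such as PCA or OPLS would then take as input; those steps are not shown here.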
Mechanistically based computational models of the host immune response to biothreat agents are important tools in predicting the human response to infection. These in silico models can also help elucidate virulence mechanisms for high threat pathogens whose mechanisms of pathogenesis have yet to be well characterized. We are currently developing a global model of the host immune response to pathogens which is being linked with lung deposition models we have previously developed to simulate the outcome of in vivo aerosol exposure studies and extrapolate to human response predictions. The global immune system model structure incorporates cellular members of innate and adaptive immunity as well as cytokines. Our computational approach allows the actions of these members to be enhanced or suppressed to simulate mechanisms of immune subversion. We have applied the model to tularemia, simulating the innate response of mice to different Francisella tularensis strains (LVS, U112, and SchuS4). The approach is to train the models with experimental animal response data, and then input the human physiological parameters to extrapolate from the animal data. Time course profiles of macrophages, dendritic cells, neutrophils, monocytes, bacterial load, and pro-inflammatory cytokines were predictive of data from experimental mouse models of tularemia. Among our next steps is the performance of sensitivity analysis in order to identify those model parameters which have the greatest impact on the outcome predictions. This analysis may help generate new hypotheses in elucidating the F. tularensis strain-specific mechanisms of pathogenesis that contribute to the highly variable levels of virulence between type A and type B strains. 
Ultimately, this in silico model can be used to study questions that are difficult to answer using traditional animal model approaches, such as quantifying the risk posed by a potential human exposure scenario or predicting the efficacy of vaccines or therapeutics in humans.
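The sensitivity analysis mentioned among the next steps can be illustrated with a one-at-a-time scheme: perturb each parameter by a small relative amount and record the relative change in the model output. The sketch below applies this to a toy bacterial load model (logistic growth with a linear killing term). The toy model, its parameter names, and its values are assumptions made purely for illustration and are not the immune response model described above.

```python
def final_load(params, steps=1000, dt=0.01):
    """Euler-integrate a toy bacterial load model and return the final load."""
    load = params["initial_load"]
    for _ in range(steps):
        growth = params["growth_rate"] * load * (1.0 - load / params["capacity"])
        killing = params["kill_rate"] * load
        load += dt * (growth - killing)
    return load


def one_at_a_time_sensitivity(model, params, rel_step=0.1):
    """Relative change in model output per relative change in each parameter."""
    baseline = model(params)
    result = {}
    for name, value in params.items():
        # Perturb one parameter at a time, holding the others at baseline.
        perturbed = dict(params, **{name: value * (1.0 + rel_step)})
        result[name] = ((model(perturbed) - baseline) / baseline) / rel_step
    return result


# Hypothetical parameter values chosen only to make the example run.
toy_params = {
    "initial_load": 10.0,
    "growth_rate": 1.0,
    "capacity": 1000.0,
    "kill_rate": 0.4,
}
sensitivities = one_at_a_time_sensitivity(final_load, toy_params)
for name in sorted(sensitivities, key=lambda n: -abs(sensitivities[n])):
    print(name, round(sensitivities[name], 3))
```

Parameters with the largest absolute values would be the first candidates for closer study. One-at-a-time analysis ignores interactions between parameters, so a global method may be more appropriate for a model of realistic size.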