NGS Analysis lung cancer study

The majority of the work that we do at C2G2 has been differential expression analysis, so we have been working extensively to streamline the workflow and process. This analysis uses a set of samples from lung cancer patients, about half of whom had their cancer relapse after treatment. By finding which genes are differentially expressed between the two groups (relapse and no relapse), we can identify genes that may be useful for diagnosis or treatment. Detailed information on our workflow so far is provided below.

= C2G2's Reader's Digest -> Our Workflow =
An extended tutorial from [CONNOR STUFF] follows this section; this section itself is a summary of the steps involved in creating an RNA-seq workflow using the Tuxedo suite.

Here are the steps to make this RNA-seq workflow. On the Galaxy page, click on "Workflows" on the top bar. Click on "Create new workflow" and name it. Click on the workflow and then on "edit"; you will be presented with an empty workflow. Workflows work by feeding the output of one or more tools into the input of another tool, creating a chain that can extend indefinitely. The workflow is created first; the necessary input files are requested only when it is run.

With RNA-seq, we start with the RNA read files (fastq). After uploading those files (and renaming them to match the sample name and whether they are R1 or R2), we need to trim off their adapters, so the first tool is Trimmomatic. Click on the Trimmomatic tool within the Galaxy toolbar and it should appear in your workspace. Specify the options: in our workflow we used PE (paired-end) mode and our steps were "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36" {CONNOR EXPLAIN STEPS}. The resulting outputs are a log file, two unpaired read files, and two paired read files.
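As a rough aid for reading the Trimmomatic step string above, the sketch below splits it into named parameters. The field names are our own labels, chosen to follow the parameter descriptions in the Trimmomatic manual; they are not identifiers used by Trimmomatic or Galaxy. As we understand the manual, the steps run in the order given: ILLUMINACLIP removes adapter sequence found in the named FASTA file, LEADING and TRAILING cut bases below the given quality from the start and end of each read, SLIDINGWINDOW cuts the read once the average quality in a window of the given size drops below the threshold, and MINLEN drops reads shorter than the given length. Only the four required ILLUMINACLIP fields are handled here.

```python
# Labels for each step's colon-separated values (our own naming, an assumption).
STEP_FIELDS = {
    "ILLUMINACLIP": ["adapter_fasta", "seed_mismatches",
                     "palindrome_clip_threshold", "simple_clip_threshold"],
    "LEADING": ["min_quality"],
    "TRAILING": ["min_quality"],
    "SLIDINGWINDOW": ["window_size", "required_quality"],
    "MINLEN": ["min_length"],
}


def parse_trimmomatic_steps(step_string):
    """Return a list of (step_name, {field: value}) pairs in processing order."""
    parsed = []
    for token in step_string.split():
        name, *values = token.split(":")
        fields = STEP_FIELDS.get(name)
        if fields is None:
            raise ValueError(f"unknown step: {name}")
        if len(values) != len(fields):
            raise ValueError(f"{name} expects {len(fields)} values, got {len(values)}")
        parsed.append((name, dict(zip(fields, values))))
    return parsed


# The step string used in our workflow.
STEPS = ("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 "
         "SLIDINGWINDOW:4:20 MINLEN:36")

for step_name, params in parse_trimmomatic_steps(STEPS):
    print(step_name, params)
```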

The next tool is the FastQ Groomer. Add this tool to the workspace twice. Connect po1 (paired output 1) to the first Groomer and po2 to the second. In both tools, set the "quality scores type" appropriately; in our case we changed it to Illumina. {CONNOR EXPLAIN GROOMER}

The next tool is TopHat2. [EXPLAIN] Connect the first Groomer's output to the forward reads input of TopHat2, and the second Groomer's output to the reverse reads input. If your library is mate-paired, change the "library is mate-paired" option, as we did. There are 5 outputs of TopHat2, but we are only interested in the accepted hits (BAM file).

Our next tool is Cufflinks. [EXPLAIN] Connect accepted-hits to the aligned RNA-seq reads input and change your options. We set "perform quartile normalization", "Perform bias correction", and "multi-read correct" to YES. We set "Use Reference Annotation" to "use reference annotation" and "Reference sequence" to "history". Notice that this tool has 3 empty file inputs; you will supply these files when the workflow is run.

The next tool is CuffMerge [EXPLAIN]. However, because it uses the Cufflinks outputs of all our samples (21 of them), we first have to run our workflow once for each sample.
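For readers working outside Galaxy, the command-line cuffmerge takes a plain text file listing one assembled-transcripts GTF per line, which is why every sample has to go through Cufflinks first. The sketch below writes such a list for 21 samples and builds the matching command. The sample names, the `cufflinks_out` directory layout, and the helper names are hypothetical, chosen only for illustration; `transcripts.gtf` is the file name Cufflinks uses for its assembly output, and `-g` and `-s` are cuffmerge's reference annotation and reference sequence options.

```python
import os
import tempfile


def write_assembly_list(sample_names, manifest_path, cufflinks_root="cufflinks_out"):
    """Write one transcripts.gtf path per sample (one per line); return the paths."""
    gtf_paths = [f"{cufflinks_root}/{name}/transcripts.gtf" for name in sample_names]
    with open(manifest_path, "w") as handle:
        handle.write("\n".join(gtf_paths) + "\n")
    return gtf_paths


def cuffmerge_command(manifest_path, annotation="genes.gtf", genome="genome.fa"):
    """Build (but do not run) a cuffmerge command as an argument list."""
    return ["cuffmerge", "-g", annotation, "-s", genome, manifest_path]


# Hypothetical sample names: sample01 ... sample21.
samples = [f"sample{i:02d}" for i in range(1, 22)]

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    manifest = os.path.join(workdir, "assemblies.txt")
    listed = write_assembly_list(samples, manifest)
    print(len(listed), "assemblies listed")
    print(" ".join(cuffmerge_command(manifest)))
```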

Now that you have your workflow, click on "options" and "run". This is where you supply the input files. The first file is your sample_R1.fastq file and the second is your sample_R2.fastq file. The reference annotation is "genes.gtf" and the reference sequence is "genome.fa". [EXPLAIN]
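Since the workflow is run once per sample with an R1 and an R2 file, it can help to check the file naming before starting. The following is a minimal sketch, assuming files are named `<sample>_R1.fastq` and `<sample>_R2.fastq` as described above; the function name and the patient file names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches names such as "patientA_R1.fastq" (naming convention assumed from the text).
READ_FILE = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.fastq$")


def pair_reads(filenames):
    """Group fastq names into {sample: (R1_file, R2_file)}, sorted by sample name."""
    by_sample = defaultdict(dict)
    for filename in filenames:
        match = READ_FILE.match(filename)
        if match is None:
            raise ValueError(f"unexpected file name: {filename}")
        by_sample[match["sample"]][match["read"]] = filename

    pairs = {}
    for sample in sorted(by_sample):
        reads = by_sample[sample]
        missing = sorted({"R1", "R2"} - set(reads))
        if missing:
            raise ValueError(f"{sample} is missing {missing}")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


# Hypothetical file names, in arbitrary order.
files = ["patientB_R2.fastq", "patientA_R1.fastq",
         "patientA_R2.fastq", "patientB_R1.fastq"]

for sample, (forward, reverse) in pair_reads(files).items():
    print(sample, forward, reverse)
```

A sample with only one of its two files raises a `ValueError`, which is easier to catch here than after a failed workflow run.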

Next we ran CuffDiff on the merged files from CuffMerge and performed replicate analysis, which involved inputting the TopHat2 accepted_hits files according to their groups and adding new replicates to match our number of replicates. We set "perform quartile normalization", "Perform bias correction", and "multi-read correct" to YES, and "Reference sequence" to "history". We also used a "false discovery rate" of 0.05 and a "Min align count" of 10.
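To make the replicate grouping concrete, here is a sketch of how the same settings might look on the cuffdiff command line, where the replicates within a group are joined with commas and the groups are separated by spaces. The group labels, BAM file names, output directory, and function name are hypothetical. The quartile normalization setting is left out on purpose, because the corresponding flag differs between Cufflinks versions and we have not checked which one our Galaxy instance uses.

```python
def cuffdiff_command(merged_gtf, groups, genome="genome.fa",
                     fdr=0.05, min_align_count=10, out_dir="cuffdiff_out"):
    """Build (but do not run) a cuffdiff command.

    groups maps a condition label to the list of accepted_hits BAM files
    (replicates) that belong to that condition.
    """
    if len(groups) < 2:
        raise ValueError("differential expression needs at least two groups")
    labels = ",".join(groups)                                    # e.g. "relapse,no_relapse"
    replicate_args = [",".join(bams) for bams in groups.values()]
    return [
        "cuffdiff",
        "-o", out_dir,
        "-L", labels,                 # condition labels, in the same order as the BAM groups
        "--FDR", str(fdr),            # false discovery rate
        "-c", str(min_align_count),   # minimum alignment count
        "-b", genome,                 # bias correction against the reference sequence
        "-u",                         # multi-read correction
        merged_gtf,
        *replicate_args,
    ]


# Hypothetical groups and file names.
groups = {
    "relapse": ["relapse_1_accepted_hits.bam", "relapse_2_accepted_hits.bam"],
    "no_relapse": ["no_relapse_1_accepted_hits.bam", "no_relapse_2_accepted_hits.bam"],
}

print(" ".join(cuffdiff_command("merged.gtf", groups)))
```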

More things to discuss - ongoing
CuffMerge ran successfully. Can't access replicate gene expression levels from CuffMerge. Combine results from CuffMerge and Cufflinks for use in visualization tools: manually, or with an automatic tool?