mRNAseq normalization and assembly

Digital normalization

We’re going to do digital normalization on the mRNAseq data – this is a research output from my lab.

Make a working directory and link in the data:

cd
mkdir assembly
cd assembly
ln -fs ~/mrnaseq-demo/*.gz .

Run digital normalization:

normalize-by-median.py -p -k 20 -C 20 -N 4 -x 1e8  --savehash normC20k20.kh *.pe.qc.fq.gz

normalize-by-median.py -C 20 --loadhash normC20k20.kh --savehash normC20k20.kh *.se.qc.fq.gz

Do error trimming:

filter-abund.py -V normC20k20.kh *.keep

Extract remaining interleaved reads:

for i in *.pe.*.abundfilt;
do
   extract-paired-reads.py $i
done

Combine and rename and make the data files friendly:

for i in *.se.qc.fq.gz.keep.abundfilt
do
   pe_orphans=$(basename $i .se.qc.fq.gz.keep.abundfilt).pe.qc.fq.gz.keep.abundfilt.se
   newfile=$(basename $i .se.qc.fq.gz.keep.abundfilt).se.qc.keep.abundfilt.fq.gz
   cat $i $pe_orphans | gzip -c > $newfile
done

for i in *.abundfilt.pe
do
   newfile=$(basename $i .fq.gz.keep.abundfilt.pe).keep.abundfilt.fq
   mv $i $newfile
   gzip $newfile
done

Running the assembler

Type:

cd ~/assembly

for i in *.pe.qc.keep.abundfilt.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

gunzip -c *.se.qc.keep.abundfilt.fq.gz >> left.fq

to extract the files into ‘right’ and ‘left’; then run:

export PATH=$PATH:~/software/bowtie:~/software/samtools

~/software/trinity/Trinity.pl --left left.fq --right right.fq --seqType fq -JM 1G

This will produce an output file trinity_out_dir/Trinity.fasta.

Searching the resulting transcriptome

Do:

cp trinity_out_dir/Trinity.fasta ./assembly.fa
formatdb -i assembly.fa -o T -p F

and then let’s search for a gene; NP_075023, a zinc transporter, is a good one (because I know it’s in the assembly :).

You can go grab it yourself (go to http://www.ncbi.nlm.nih.gov/protein/NP_075023.2, send to file) OR find it in the file browser under Training Files, NGS bootcamp biologists, NP_0705023.fasta.

Once you have the file, you can now BLAST it against your assembly:

blastall -i NP_075023.fasta -d assembly.fa -p tblastn -o blast.txt -e 1e-12

And now do:

less blast.txt

to take a look at the results.