Session 1: command line & BLAST¶
Welcome!¶
Materials from the Next-Gen Sequence Analysis Workshop including the instructions on how to install and configure all software that you will need can be found here: http://ged.msu.edu/angus/tutorials-2013/
TODO:
- start up your virtual machine
- start up a browser
- start up a terminal!
- copying and pasting commands
- copying common files around.
Copying & pasting commands¶
Tips on copying and pasting commands:
- left mouse button to drag-select (copy) from this Web page;
- middle mouse button (push scroll wheel) to paste.
Try copy/pasting this into the terminal:
echo 'hello, world'
Try copying multiple lines at once:
ls
echo 'this is my directory'
Done! You’re now an expert!
BLAST¶
Download and install some useful scripts:
git clone https://github.com/ngs-docs/ngs-scripts ~/software/ngs-scripts
Create a working directory under the main user directory, and change to that working directory:
cd
mkdir blast
cd blast
(‘pwd’ should show that you are in /home/student/blast).
Download the E. coli MG1655 protein data set:
curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa
This grabs that URL and saves the contents of ‘NC_000913.faa’ to the local disk.
Grab a Prokka-generated set of proteins (see e.g. http://2013-caltech-workshop.readthedocs.org/en/latest/prokka-annotation.html):
curl -O http://athyra.idyll.org/~t/ecoli0104.faa
Let’s take a quick look at these files:
head ecoli0104.faa
head NC_000913.faa
These are FASTA protein files.
Format it for BLAST and run BLAST of the O104 protein set against the MG1655 protein set:
formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC -a 8
Look at the output file:
head -20 0104.x.NC
Let’s convert ‘em to a CSV file:
python ~/software/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv
This creates a file ‘0104.x.NC.csv’, which you can open in a spreadsheet program like Excel or Google Docs/Spreadsheet.
grep ^'>' fasta_file | wc -l
Reciprocal BLAST calculation¶
Do the reciprocal BLAST, too:
formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104 -a 8
Extract reciprocal best hit:
python ~/software/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv
This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations.