Identifying an organism using command line Bioinformatics?

Mr Nobody · May 31, 2017

I need to identify an organism using results generated from Illumina sequencing but I have no idea where to start.

I know that I need to check my results quality and then make an assembly but I don't know where to begin...

PS, I'm new to the whole command line Bioinformatics thing

The Helper · May 31, 2017

When trying to identify an organism, a De Novo Assembly needs to be constructed. For an experimental work flow for bioinformatics approaches, after sample collection, DNA extraction, library preparation and sequencing the data you have received, in this case DNA sequences, needs to have its quality checked.

There are many ways of doing is, one being the use of FastQC which allows for the analysis of sequenced DNA.

For assistance with FastQC reports - https://biof-edu.colorado.edu/videos/dowell-short-read-class/day-4/fastqc-manual

Phred Ascii score - http://www.drive5.com/usearch/manual/quality_score.html

After the quality of the sequences have been analysed, there tends to be regions of poor quality or the presence of adapters (As seen in the FastQC manual).

Typically, when using command line, a trimming tool called Trimmomatic is used for the removal of adapters and regions of the read that show poor quality. Since Trimmomatic is a Java application, no module needs to loaded and once called from its location, parameters can be set

(Eg. java –jar (application path) PE or SE –phred40 inputfile1 inputfile2 paired output file1 paired output file2 unpaired outfile1 unpaired outfile2 [Options]).

For assistance with Trimmomatic - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

For assistance with cutting adapters - https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=377

Cutadapt is also used for the removal of adapters and poor quality reads at either ends of the reads. The basic command-line for cutadapt is: cutadapt -a AAGTCAT -o output.fastq input.fastq

This will result in the removal of the adapter sequence AAGTCAT. All reads in the input file will also be present in the output file, however, some will be trimmed while others not.

Input file formats: FASTA (.fasta, .fa or .fna) and FASTQ (.fastq and .fq) Even when in a compressed file, cutadapt can read it

For assistance with cutadapt - https://media.readthedocs.org/pdf/cutadapt/stable/cutadapt.pdf

Once cut, reads will have to be filter, this can be done using fastq quality filter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html)

Eg. fastq_quality_filter –Q40 –v –q30 –p100 –i inputfile.fastq –o outputfile.fasq

Once the reads have been quality trimmed, assembly can begin. Again, there are multiple application for generating a De Novo Assembly but one of the most commonly used is Velvet. Velvet makes use of 2 “sub-applications” velveth and velvetg

Velveth is used to make a dataset which will be used by velvetg (generates 2 output files in a directory).

(Eg. velveth output_directory (kmerLength) –file_type(fastq) –separate –read_type(shortPaired) x2 input files)

Velvetg is used to generate the de Bruijn graph

(Eg. velvetg output_directory –ins_length 250 –min_contig_lgth 100 –exp_cov 100 –cov_cutoff 10)

Once assembled, the quality of the assembly can be checked by reading the “contigs.fa” file. This will allow you to see the N50 of the contigs (50% of the whole assembly contains contigs equal or larger than that value). The number of contigs that are supposed to be generates should be between 250 – 1000. If the kmer size is unknown, begin at 21 and optimise from there (remember when loading the module, some Velvet programmes have different programmes for different kmer length).

For assistance with Velvet - http://computing.bio.cam.ac.uk/local/doc/velvet.pdf

After constructing a De Novo assembly, the only thing left to do is run the assembly through BLAST in order to identify the organism. This can be done by loading the specific NCBI module followed by the script that will BLAST your assembly.

(Eg. blastn –db (programme path) –query inputFile.fa –num_threads 9 –max_target_seqs 5 – outfmt “(output format as seen in the weblink)” –out outputFile.txt)

For assistance with BLAST - https://www.ncbi.nlm.nih.gov/books/NBK279675/

I hope this help

LemurLady18 · May 31, 2017

This is a great explanation! Just to add, here is an example of a trimmomatic command that i used for my paired-end analysis data:

$java -jar <path to trimmomatic app> PE -phred33 read1_input.fq read2_input.fq read1_output.fastq read1_output_unpaired.fastq read2_output.fastq read2_output_unpaired.fastq ILLUMINACLIP:<path to adapters>/TruSeq-PE.fa:2:20:3 SLIDINGWINDOW:5:20 HEADCROP:15 MINLEN:40

PE.. if yours in Single-end you use SE

ILLUMINACLIP:

TruSeq2-PE.fa is the paired-end TruSeq adapters, because my library prep used these
2:20:3 means 2 mismatches are allowed before a match is rejected
2:20:3 means the match between the adapter ligated reads must be above 20
2:20:3 means match between any adapter sequence and read must be above 3

SLINDINGWINDOW:

5:20 means that cutting of the read will happen when the average quality over a window of 5 nucleotides is below Phred 20 - prevents removal of large good quality regions of reads when only one base is bad

HEADCROP:

15 - trims 15 nucleotides off the 5' end of the reads

MINLEN

40 - will discard reads that are less than 40 bases long after trimming has been done

I hope this also helps

Edited May 31, 2017 by LemurLady18

LemurLady18 · June 1, 2017

Just some things to note for velvet:

the output file of both velvetg and velveth commands need to be the same
keep cov_cutoff low and then play with min_contig_lgth, exp_cov and word-size to get optimum N50 and contig number

Just some things to note for NCBI blast:

in command line, in order to submit the command script using the contig file produced by velvet, you need to be "positioned" in the velvet output file, otherwise it won't be able to find the contig file (unless you specify the location using a relative root
to open the text file produced by blast, use the command "nano file.txt" and it will show up

weetBIX · June 1, 2017

Struggled to understand the difference between these

paired-end: interleaved or separate data

Interleaved: Reads of both ends of the sequencing run occur in one file. Read 2 of a pair directly follows read Read1

Separate: Forward and Reverse reads of a sequencing run occur in separate files. The order of reads in the file must be the same, i.e. the first read in the forward file and the first read in the reverse file are paired.

**swansont** · June 1, 2017

!

Moderator Note

I don't know why you are pretending to have a Q & A when at least three of you are posting from the same computer. This nonsense stops now.

You get one account per person.

Sign In

Identifying an organism using command line Bioinformatics?

Recommended Posts

Mr Nobody

The Helper

LemurLady18

LemurLady18

weetBIX

swansont

Browse

Activity

Important Information