Mr Nobody Posted May 31, 2017 Posted May 31, 2017 I need to identify an organism using results generated from Illumina sequencing but I have no idea where to start. I know that I need to check my results quality and then make an assembly but I don't know where to begin... PS, I'm new to the whole command line Bioinformatics thing
The Helper Posted May 31, 2017 Posted May 31, 2017 When trying to identify an organism, a De Novo Assembly needs to be constructed. For an experimental work flow for bioinformatics approaches, after sample collection, DNA extraction, library preparation and sequencing the data you have received, in this case DNA sequences, needs to have its quality checked. There are many ways of doing is, one being the use of FastQC which allows for the analysis of sequenced DNA. For assistance with FastQC reports - https://biof-edu.colorado.edu/videos/dowell-short-read-class/day-4/fastqc-manual Phred Ascii score - http://www.drive5.com/usearch/manual/quality_score.html After the quality of the sequences have been analysed, there tends to be regions of poor quality or the presence of adapters (As seen in the FastQC manual). Typically, when using command line, a trimming tool called Trimmomatic is used for the removal of adapters and regions of the read that show poor quality. Since Trimmomatic is a Java application, no module needs to loaded and once called from its location, parameters can be set (Eg. java –jar (application path) PE or SE –phred40 inputfile1 inputfile2 paired output file1 paired output file2 unpaired outfile1 unpaired outfile2 [Options]). For assistance with Trimmomatic - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf For assistance with cutting adapters - https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=377 Cutadapt is also used for the removal of adapters and poor quality reads at either ends of the reads. The basic command-line for cutadapt is: cutadapt -a AAGTCAT -o output.fastq input.fastq This will result in the removal of the adapter sequence AAGTCAT. All reads in the input file will also be present in the output file, however, some will be trimmed while others not. Input file formats: FASTA (.fasta, .fa or .fna) and FASTQ (.fastq and .fq) Even when in a compressed file, cutadapt can read it For assistance with cutadapt - https://media.readthedocs.org/pdf/cutadapt/stable/cutadapt.pdf Once cut, reads will have to be filter, this can be done using fastq quality filter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) Eg. fastq_quality_filter –Q40 –v –q30 –p100 –i inputfile.fastq –o outputfile.fasq Once the reads have been quality trimmed, assembly can begin. Again, there are multiple application for generating a De Novo Assembly but one of the most commonly used is Velvet. Velvet makes use of 2 “sub-applications” velveth and velvetg Velveth is used to make a dataset which will be used by velvetg (generates 2 output files in a directory). (Eg. velveth output_directory (kmerLength) –file_type(fastq) –separate –read_type(shortPaired) x2 input files)Velvetg is used to generate the de Bruijn graph (Eg. velvetg output_directory –ins_length 250 –min_contig_lgth 100 –exp_cov 100 –cov_cutoff 10) Once assembled, the quality of the assembly can be checked by reading the “contigs.fa” file. This will allow you to see the N50 of the contigs (50% of the whole assembly contains contigs equal or larger than that value). The number of contigs that are supposed to be generates should be between 250 – 1000. If the kmer size is unknown, begin at 21 and optimise from there (remember when loading the module, some Velvet programmes have different programmes for different kmer length). For assistance with Velvet - http://computing.bio.cam.ac.uk/local/doc/velvet.pdf After constructing a De Novo assembly, the only thing left to do is run the assembly through BLAST in order to identify the organism. This can be done by loading the specific NCBI module followed by the script that will BLAST your assembly. (Eg. blastn –db (programme path) –query inputFile.fa –num_threads 9 –max_target_seqs 5 – outfmt “(output format as seen in the weblink)” –out outputFile.txt) For assistance with BLAST - https://www.ncbi.nlm.nih.gov/books/NBK279675/ I hope this help 2
LemurLady18 Posted May 31, 2017 Posted May 31, 2017 (edited) This is a great explanation! Just to add, here is an example of a trimmomatic command that i used for my paired-end analysis data: $java -jar <path to trimmomatic app> PE -phred33 read1_input.fq read2_input.fq read1_output.fastq read1_output_unpaired.fastq read2_output.fastq read2_output_unpaired.fastq ILLUMINACLIP:<path to adapters>/TruSeq-PE.fa:2:20:3 SLIDINGWINDOW:5:20 HEADCROP:15 MINLEN:40 PE.. if yours in Single-end you use SE ILLUMINACLIP: TruSeq2-PE.fa is the paired-end TruSeq adapters, because my library prep used these 2:20:3 means 2 mismatches are allowed before a match is rejected 2:20:3 means the match between the adapter ligated reads must be above 20 2:20:3 means match between any adapter sequence and read must be above 3 SLINDINGWINDOW: 5:20 means that cutting of the read will happen when the average quality over a window of 5 nucleotides is below Phred 20 - prevents removal of large good quality regions of reads when only one base is badHEADCROP: 15 - trims 15 nucleotides off the 5' end of the readsMINLEN 40 - will discard reads that are less than 40 bases long after trimming has been doneI hope this also helps Edited May 31, 2017 by LemurLady18 1
LemurLady18 Posted June 1, 2017 Posted June 1, 2017 Just some things to note for velvet: the output file of both velvetg and velveth commands need to be the same keep cov_cutoff low and then play with min_contig_lgth, exp_cov and word-size to get optimum N50 and contig number Just some things to note for NCBI blast: in command line, in order to submit the command script using the contig file produced by velvet, you need to be "positioned" in the velvet output file, otherwise it won't be able to find the contig file (unless you specify the location using a relative root to open the text file produced by blast, use the command "nano file.txt" and it will show up
weetBIX Posted June 1, 2017 Posted June 1, 2017 Struggled to understand the difference between these paired-end: interleaved or separate data Interleaved: Reads of both ends of the sequencing run occur in one file. Read 2 of a pair directly follows read Read1 Separate: Forward and Reverse reads of a sequencing run occur in separate files. The order of reads in the file must be the same, i.e. the first read in the forward file and the first read in the reverse file are paired.
swansont Posted June 1, 2017 Posted June 1, 2017 ! Moderator Note I don't know why you are pretending to have a Q & A when at least three of you are posting from the same computer. This nonsense stops now.You get one account per person. 1
Recommended Posts