To identify an organism from sequence data, a de novo assembly first needs to be constructed. In a typical bioinformatics workflow, after sample collection, DNA extraction, library preparation and sequencing, the data you receive (in this case DNA sequences) need to have their quality checked.
There are many ways of doing this, one being FastQC, which produces a quality report for the sequenced DNA reads.
For assistance with FastQC reports - https://biof-edu.colorado.edu/videos/dowell-short-read-class/day-4/fastqc-manual
Phred Ascii score - http://www.drive5.com/usearch/manual/quality_score.html
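As a rough illustration of what the Phred ASCII scores mean, here is a minimal Python sketch that decodes a quality line. It assumes the common Phred+33 encoding (older Illumina data may use an offset of 64), and the quality string is made up for the example.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    # Each character encodes one score: score = ASCII code - offset.
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


quality_line = "II5+!"  # made-up quality string, assumed to be Phred+33
scores = phred_scores(quality_line)
print(scores)  # [40, 40, 20, 10, 0]
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.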
After the quality of the sequences has been analysed, there tend to be regions of poor quality or adapters present (as seen in the FastQC manual).
Typically, when using the command line, a trimming tool called Trimmomatic is used to remove adapters and regions of the read that show poor quality. Since Trimmomatic is a Java application, no module needs to be loaded; once it is called from its location, parameters can be set.
(Eg. java -jar (application path) PE or SE -phred33 inputfile1 inputfile2 paired_outfile1 unpaired_outfile1 paired_outfile2 unpaired_outfile2 [Options]). The quality encoding flag is either -phred33 or -phred64. In paired-end (PE) mode the paired and unpaired output files for read 1 come before those for read 2; single-end (SE) mode takes just one input file and one output file.
For assistance with Trimmomatic - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
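If you want to script the paired-end call, something like the following Python sketch could build the argument list. The jar path, file names and output prefix are placeholders, and the two trimming steps (SLIDINGWINDOW and MINLEN) are only example options, so check the manual for the ones that suit your data. The output order follows the Trimmomatic manual: paired then unpaired for read 1, then the same for read 2.

```python
def trimmomatic_pe_command(jar, in1, in2, out_prefix, steps, phred="-phred33"):
    """Return the argument list for a paired-end Trimmomatic run."""
    return [
        "java", "-jar", jar, "PE", phred,
        in1, in2,
        # Output order: read 1 paired, read 1 unpaired, read 2 paired, read 2 unpaired.
        f"{out_prefix}_1_paired.fastq", f"{out_prefix}_1_unpaired.fastq",
        f"{out_prefix}_2_paired.fastq", f"{out_prefix}_2_unpaired.fastq",
        *steps,
    ]


command = trimmomatic_pe_command(
    "trimmomatic.jar",           # placeholder path to the jar
    "reads_1.fastq",             # placeholder input files
    "reads_2.fastq",
    "sample",                    # placeholder output prefix
    ["SLIDINGWINDOW:4:20", "MINLEN:36"],  # example trimming steps only
)
print(" ".join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```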
For assistance with cutting adapters - https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=377
Cutadapt is also used to remove adapters and poor-quality bases at either end of the reads. The basic command line for cutadapt is: cutadapt -a AAGTCAT -o output.fastq input.fastq
This removes the adapter sequence AAGTCAT, and anything that follows it, from the 3' end of each read. All reads in the input file will also be present in the output file; some will have been trimmed and others will not.
Input file formats: FASTA (.fasta, .fa or .fna) and FASTQ (.fastq or .fq). Cutadapt can also read these files when they are compressed (for example .gz).
For assistance with cutadapt - https://media.readthedocs.org/pdf/cutadapt/stable/cutadapt.pdf
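To show what 3' adapter removal does to a read, here is a simplified Python sketch. It is not how cutadapt is implemented: it only looks for an exact, complete copy of the adapter, whereas cutadapt also tolerates mismatches and partial adapter matches at the end of the read. The reads are made up.

```python
def trim_3prime_adapter(read, adapter):
    """Remove the first exact occurrence of `adapter` and everything after it."""
    position = read.find(adapter)
    if position == -1:
        return read  # adapter not found: the read is passed through unchanged
    return read[:position]


reads = ["ACGTACGTAAGTCATTTGC", "GGGCCCTTTAAA"]  # made-up example reads
trimmed = [trim_3prime_adapter(r, "AAGTCAT") for r in reads]
print(trimmed)  # ['ACGTACGT', 'GGGCCCTTTAAA']
```

Both reads are still present afterwards; only the first one contained the adapter, so only that one is shortened.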
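Once cut, the reads will have to be filtered. This can be done using fastq_quality_filter from the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html)
Eg. fastq_quality_filter -Q33 -v -q 30 -p 100 -i inputfile.fastq -o outputfile.fastq
Here -q is the minimum quality score to keep, -p is the minimum percentage of bases in a read that must reach that score, and -Q33 tells the tool that the qualities are Phred+33 encoded.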
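The -q and -p options can be read as "keep a read only if at least p percent of its bases have a quality of q or higher". A minimal Python sketch of that rule is below; it assumes Phred+33 qualities, and the function name and quality strings are invented for illustration, not taken from the FASTX-Toolkit.

```python
def passes_quality_filter(quality_line, min_quality=30, min_percent=100, offset=33):
    """Return True if enough bases in the read reach the minimum quality."""
    scores = [ord(ch) - offset for ch in quality_line]
    if not scores:
        return False  # an empty read cannot pass
    good = sum(1 for q in scores if q >= min_quality)
    # Integer comparison of good / len(scores) >= min_percent / 100.
    return good * 100 >= min_percent * len(scores)


print(passes_quality_filter("IIII"))                   # True: every base is Q40
print(passes_quality_filter("III5"))                   # False: one base is Q20
print(passes_quality_filter("III5", min_percent=75))   # True: 3 of 4 bases pass
```

With -p 100, a single base below the threshold is enough to discard the whole read, so this is a strict setting.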
Once the reads have been quality trimmed, assembly can begin. Again, there are multiple applications for generating a de novo assembly, but one of the most commonly used is Velvet. Velvet makes use of 2 “sub-applications”: velveth and velvetg.
Velveth is used to make a dataset which will be used by velvetg (it generates 2 output files in a directory).
(Eg. velveth output_directory (kmerLength) -fastq -separate -shortPaired inputfile1.fastq inputfile2.fastq)
Velvetg is used to generate the de Bruijn graph and, from it, the contigs.
(Eg. velvetg output_directory -ins_length 250 -min_contig_lgth 100 -exp_cov 100 -cov_cutoff 10)
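To give a feel for what a de Bruijn graph is and why the kmer length matters, here is a textbook-style Python sketch. It is not Velvet's implementation (Velvet also handles reverse complements, coverage and error correction); it just links each (k-1)-mer to the (k-1)-mers that follow it in the reads. The reads and k are made up.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ACGTC", "CGTCA"]  # two made-up overlapping reads
graph = de_bruijn_edges(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Following the edges AC, CG, GT, TC, CA spells out ACGTCA, which is the sequence both reads came from. Reads shorter than k contribute no kmers at all, which is one reason the kmer length has to be tuned to the data.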
Once assembled, the quality of the assembly can be checked using the “contigs.fa” file. From the contig lengths you can work out the N50 of the contigs (contigs of this length or longer together contain at least 50% of the whole assembly). As a rough guide, the number of contigs generated should be between 250 and 1000. If the best kmer size is unknown, begin at 21 and optimise from there (remember when loading the module that some Velvet installations provide different programmes for different maximum kmer lengths).
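If you want to calculate the N50 and contig count yourself, a short Python sketch along these lines should work on a FASTA file such as contigs.fa. The function names are my own, and the value Velvet reports itself can differ slightly because Velvet measures lengths in kmers.

```python
def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = 0
    have_record = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if have_record:
                    lengths.append(current)  # close the previous contig
                current = 0
                have_record = True
            elif have_record:
                current += len(line)
    if have_record:
        lengths.append(current)  # close the last contig
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs


print(n50([100, 80, 50, 30, 20]))  # 80
# For a real assembly: lengths = read_contig_lengths("output_directory/contigs.fa")
# then print(len(lengths), n50(lengths))
```

In the small example the total is 280 bases; the two longest contigs (100 + 80) already cover more than half of that, so the N50 is 80.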
For assistance with Velvet - http://computing.bio.cam.ac.uk/local/doc/velvet.pdf
After constructing a de novo assembly, the only thing left to do is run the assembly through BLAST in order to identify the organism. This can be done by loading the specific NCBI module and then running the script that will BLAST your assembly.
(Eg. blastn -db (database path) -query inputFile.fa -num_threads 9 -max_target_seqs 5 -outfmt "(output format as seen in the weblink)" -out outputFile.txt)
For assistance with BLAST - https://www.ncbi.nlm.nih.gov/books/NBK279675/
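Once BLAST has finished, the output file still has to be read to see which organism each contig matches. Here is a small Python sketch that picks the highest-scoring hit per contig. It assumes you chose the tabular format (-outfmt 6) with its default 12 columns, which is my assumption and not necessarily the format you will use; the contig and subject names in the example are invented.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return the highest bit score hit for each query sequence."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        query = hit["qseqid"]
        if query not in best or float(hit["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = hit
    return best


# Invented example lines standing in for outputFile.txt.
example = io.StringIO(
    "NODE_1\tsubjectA\t99.5\t200\t1\t0\t1\t200\t10\t209\t1e-100\t365\n"
    "NODE_1\tsubjectB\t90.0\t200\t20\t0\t1\t200\t5\t204\t1e-60\t250\n"
    "NODE_2\tsubjectB\t97.0\t150\t4\t0\t1\t150\t30\t179\t2e-70\t260\n"
)
for query, hit in best_hits(example).items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

For a real run you would pass open("outputFile.txt") instead of the example text. If most contigs agree on the same subject, that is a good indication of which organism you have.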
I hope this helps.