The Helper

Members


Posts
2
Joined
May 31, 2017
Last visited
June 1, 2017

Everything posted by The Helper

Bioinformatic differential expression analysis

The Helper replied to Mr Nobody's topic in Genetics

For this type of analysis, once the RNA has been extracted and a library constructed, the library has to be sequenced and the reads aligned and assembled. Before any assembly can occur, the quality of the reads needs to be checked for adapter content and poor-quality regions. These can be trimmed using several different programmes. Once the reads are trimmed, assembly can begin.

On the command line, the Tuxedo Suite is one of the most commonly used toolsets for assembly and transcriptome analysis. Because RNA-seq samples many copies of highly expressed transcripts, expect a high level of overrepresented sequences and duplicates in the quality report.

Tuxedo Suite:

Bowtie - Allows for fast and simple alignment and forms the base of the Tophat alignment. Needs a reference genome (.fa), which is indexed so that reads can be aligned against it.

Tophat - Uses Bowtie and its index to align the RNA reads in a splice-aware way, which allows new splice junctions to be discovered. This is repeated for every read file you have. (Eg. tophat2 -p 5 --library-type fr-firststrand -o outputDirectory (Reference file name) inputFile.fq) Here (Reference file name) is the base name of the Bowtie index built from the reference.

Cufflinks - Assembles the aligned reads into transcripts. (Eg. cufflinks -g (reference .gtf file) -b (reference.fa) -u --library-type fr-firststrand -o outputDirectory inputfile.bam)

Cuffmerge - Merges multiple transcript assemblies into 1 file. To keep the command simple, a text file is usually made that lists the path to each .gtf file to be merged, one per line. (Eg. cuffmerge -o outputDirectory -g reference.gtf -s reference.fa pathfile.txt)

Cuffdiff - Differential expression analysis between conditions. Replicate .bam files for the same condition are separated by commas, and the conditions are separated by a space. (Eg. cuffdiff -p 5 -b reference.fa -u -o outputDirectory mergedfile.gtf CaseInputfiles.bam ControlInputfile.bam)

For assistance with Tophat - https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf https://ccb.jhu.edu/software/tophat/manual.shtml

For assistance with Cufflinks - https://www.google.co.za/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwin5rqd1pfUAhWIAMAKHUEfAH4QFgg2MAQ&url=https%3A%2F%2Fwww.researchgate.net%2Ffile.PostFileLoader.html%3Fid%3D544651e5d3df3edb2b8b463a%26assetKey%3DAS%253A273626954174476%25401442249157340&usg=AFQjCNHlfzwAeAOVgHNwH4gfae3r_YPjig&sig2=n5gn1ZMpiTTMiAojbmxadw
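The part of the Cuffdiff command that trips people up most is how the .bam files are grouped, so here is a small sketch of it in Python. It only builds the argument list and does not run anything. The function name and the file names (case_rep1.bam and so on) are made up for illustration; the flags are the ones used in the example above.

```python
import shlex


def build_cuffdiff_command(merged_gtf, reference_fasta, case_bams,
                           control_bams, output_dir, threads=5):
    """Build a cuffdiff argument list for a two-condition comparison.

    Hypothetical helper: replicates of one condition are joined with
    commas, and the two conditions are passed as separate arguments.
    """
    if not case_bams or not control_bams:
        raise ValueError("each condition needs at least one BAM file")
    return [
        "cuffdiff",
        "-p", str(threads),        # number of threads
        "-b", reference_fasta,     # reference used for bias correction
        "-u",                      # multi-read correction
        "-o", output_dir,
        merged_gtf,                # merged annotation from cuffmerge
        ",".join(case_bams),       # condition 1: comma-separated replicates
        ",".join(control_bams),    # condition 2: comma-separated replicates
    ]


# Illustrative file names only.
cmd = build_cuffdiff_command(
    "mergedfile.gtf",
    "reference.fa",
    ["case_rep1.bam", "case_rep2.bam"],
    ["control_rep1.bam", "control_rep2.bam"],
    "outputDirectory",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run on a machine where cuffdiff is installed.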
- June 1, 2017
- 4 replies
Identifying an organism using command line Bioinformatics?

The Helper replied to Mr Nobody's topic in Genetics

To identify an organism, a de novo assembly needs to be constructed. In a typical bioinformatics workflow, after sample collection, DNA extraction, library preparation and sequencing, the data you receive (in this case DNA sequences) needs to have its quality checked. There are many ways of doing this, one being FastQC, which produces a quality report for the sequenced reads.

For assistance with FastQC reports - https://biof-edu.colorado.edu/videos/dowell-short-read-class/day-4/fastqc-manual

Phred ASCII score - http://www.drive5.com/usearch/manual/quality_score.html

Once the quality of the sequences has been analysed, there tend to be regions of poor quality or adapters present (as seen in the FastQC manual). Typically, on the command line, a trimming tool called Trimmomatic is used to remove adapters and the regions of the reads that show poor quality. Since Trimmomatic is a Java application, no module needs to be loaded; once it is called from its location, the parameters can be set. (Eg. java -jar (application path) PE or SE -phred33 inputfile1 inputfile2 pairedOutputFile1 unpairedOutputFile1 pairedOutputFile2 unpairedOutputFile2 [Options]) Use -phred33 or -phred64 depending on the quality encoding of your reads.

For assistance with Trimmomatic - http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

For assistance with cutting adapters - https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=377

Cutadapt is also used to remove adapters and poor-quality bases from the ends of the reads. The basic command line for cutadapt is: cutadapt -a AAGTCAT -o output.fastq input.fastq This removes the adapter sequence AAGTCAT, along with anything that follows it, from the reads. All reads in the input file will also be present in the output file; some will have been trimmed and others left unchanged.
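To show what "some will be trimmed while others not" means in practice, here is a deliberately simplified sketch in Python. It is not how cutadapt is implemented: cutadapt tolerates mismatches and partial adapters at the end of a read, whereas this only looks for an exact, complete copy of the adapter. The function name and the two example reads are made up for illustration.

```python
def trim_3prime_adapter(read, quality, adapter):
    """Cut an exact adapter match and everything after it.

    Simplified illustration only. The quality string is cut at the
    same position so it stays the same length as the read.
    """
    position = read.find(adapter)
    if position == -1:
        # No adapter found: the read is written out unchanged.
        return read, quality
    return read[:position], quality[:position]


ADAPTER = "AAGTCAT"

# Two made-up reads: the first contains the adapter, the second does not.
records = [
    ("read1", "GGCTTAACCAAGTCATTTGA", "I" * 20),
    ("read2", "GGCTTAACCGGTTACGATCA", "I" * 20),
]

for name, read, quality in records:
    trimmed_read, trimmed_quality = trim_3prime_adapter(read, quality, ADAPTER)
    status = "trimmed" if len(trimmed_read) < len(read) else "unchanged"
    print(name, status, trimmed_read)
```

Both reads end up in the output; only the one carrying the adapter gets shorter.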
Input file formats: FASTA (.fasta, .fa or .fna) and FASTQ (.fastq or .fq). Cutadapt can also read compressed input files.

For assistance with cutadapt - https://media.readthedocs.org/pdf/cutadapt/stable/cutadapt.pdf

Once the adapters are cut, the reads have to be filtered by quality. This can be done using fastq_quality_filter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) Eg. fastq_quality_filter -Q33 -v -q 30 -p 100 -i inputfile.fastq -o outputfile.fastq Here -q is the minimum quality score to keep, -p is the minimum percentage of bases in a read that must reach that score, and -Q33 tells the tool the quality scores use the Phred+33 offset.

Once the reads have been quality trimmed, assembly can begin. Again, there are multiple applications for generating a de novo assembly, but one of the most commonly used is Velvet. Velvet makes use of 2 “sub-applications”, velveth and velvetg.

Velveth is used to make the dataset which will be used by velvetg (it generates 2 output files in a directory). (Eg. velveth output_directory (kmerLength) -fastq -shortPaired -separate inputfile1.fastq inputfile2.fastq)

Velvetg is used to build the de Bruijn graph and produce the contigs. (Eg. velvetg output_directory -ins_length 250 -min_contig_lgth 100 -exp_cov 100 -cov_cutoff 10)

Once assembled, the quality of the assembly can be checked using the “contigs.fa” file. From it you can work out the N50 of the contigs (contigs of this length or longer contain at least 50% of the total assembly length). As a rough guide, the number of contigs generated should be between 250 and 1000. If the best kmer size is unknown, begin at 21 and optimise from there (remember when loading the module that some Velvet installations provide different builds for different kmer lengths).

For assistance with Velvet - http://computing.bio.cam.ac.uk/local/doc/velvet.pdf

After constructing the de novo assembly, the only thing left to do is run the assembly through BLAST in order to identify the organism. This can be done by loading the specific NCBI module and then running the script that will BLAST your assembly. (Eg. blastn -db (database path) -query inputFile.fa -num_threads 9 -max_target_seqs 5 -outfmt "(output format as seen in the weblink)" -out outputFile.txt)

For assistance with BLAST - https://www.ncbi.nlm.nih.gov/books/NBK279675/

I hope this helps.
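For the N50 check on the contigs file mentioned above, this is a minimal sketch of the calculation in Python, assuming a plain FASTA file of contigs. The function names and the tiny in-memory example are made up for illustration; for a real assembly you would pass the lines of your own contigs.fa instead.

```python
def contig_lengths(fasta_lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new contig; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the contig length at which the running total, taken from
    the longest contig downwards, first reaches half the assembly."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Tiny made-up example: contigs of length 8, 6, 4 and 2 (20 bases in total).
example_fasta = [
    ">contig_1",
    "ACGT",
    "ACGT",
    ">contig_2",
    "ACGTAC",
    ">contig_3",
    "ACGT",
    ">contig_4",
    "AC",
]

lengths = contig_lengths(example_fasta)
print("contigs:", len(lengths))
print("N50:", n50(lengths))
```

In this example the two longest contigs (8 + 6 = 14) are the first to cover at least half of the 20 bases, so the N50 is 6. To use it on a real file, something like contig_lengths(open("contigs.fa")) would work.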
- May 31, 2017
- 5 replies
- 2
