Series of scripts to analyse 454 and Illumina sequences in SSU amplicon against MaarjAM database. This is pipeline that has been developed and sed in paper “Comparison of 454 and Illumina sequencing methods to study arbuscular mycorrhizal fungal community diversity” (Vasar et al.xxxx). Both 454 and Illumina pipelines are optimized on analyzing SSU sequences against MaarjAM database using BLAST based OTU picking approach. For 454, we assume that sequences are multiplexed into one file and for Illumina sequences are demultiplexed into separate fastq or fastq.tar.gz (packed) pairs for each sample as it is general approach.
Prerequisite software
BLAST+
USEARCH
FLASh
Python 2.7
Install software and clone project
First download ssu-pipeline
git clone https://github.com/ut-planteco/ssu-pipeline
Update repository and install mandatory software
sudo apt-get update
sudo apt-get install ncbi-blast+
wget https://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz/download
You need to download and install USEARCH from their homepage as they provide download link and license per e-mail.
Example datasets for 454 and Illumina are located in example_data.tar.gz and unpacking will create folders 454 and illumina. All the example commands are using these files. Unpack dataset with following command
tar -xzvf example_data.tar.gz
Inside maarjam folder is located MaarjAM database (status October 2016) with FASTA and BLAST+ formatted file formats that can be directly used to identify sequences. Use BLAST+ formatted files as it will allow to use multiple cores compared to only using FASTA file. You can format FASTA file into BLAST+ format as following
makeblastdb -in reference.fasta -dbtype nucl -title CustomDB -out
reference
Clean raw Illumina sequences
Clean Illumina sequences by defining the folder -folder where paired reads are located and provide forward and reverse for both primers with average quality. Make sure that demultiplexed file names coming from Illumina MiSeq platform are correct. Script will gather files named as SAMPLE_R1_001.fastq or SAMPLE_R1_001.fastq.tar.gz. Script will interleave correct forward and reverse reads together that can be easily used by FLASh software to pair them. Because FLASh output FASTQ, we need to convert it to FASTA to make it understandable for BLAST. We also define Illumina Nextera adapters first 10 nucleotides to remove sequences containing part of the adapter for forward -fadapter and reverse -radapter reads. As the example data is using tagmentation based Illumina, we do not need to define forward -fprimer and reverse -rprimer primers. Finally we define that average quality -quality for both reads needs to be at least 30 (0-40). In order to skip intermediate files, pipe each step into one command as following:
python pipeline_clean_illumina.py -folder illumina/ -fprimer "" -rprimer "" -fadapter CTGTCTCTTA -radapter CTGTCTCTTA -quality 30 | ~/applications/FLASH/flash -m 10 -M 300 --interleaved-input - -c | python pipeline_fastq_fasta.py > illumina.cleaned.fasta
Nextera adapters R1
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG and R2 TGTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG needs to be reverse complement. We only need to match first ten bases to find adapters. These 10 bases have been checked against MaarjAM database and no interference using short adapter sequence is found to catch false positives.
Command help
python pipeline_clean_illumina.py
-folder FOLDER [-fprimer SEQUENCE] [-rprimer SEQUENCE]
[-fadapter SEQUENCE] [-radapter SEQUENCE] [-quality QUALITY]
[-phred QUALITY]
arguments:
-folder FOLDER define FOLDER where FASTQ or FASTQ.tar.gz files are stored
-fprimer SEQUENCE define forward read primer
-rprimer SEQUENCE define reverse read primer
-fadapter SEQUENCE define adapter for forward read
-radapter SEQUENCE define adapter for reverse read
-quality QUALITY average quality of sequence to be accepted
-phred QUALITY FASTQ file phred quality score (33)
Correct strand of the sequences for Illumina
Tagmentation based Illumina produces sequences that are not in the same direction, but USEARCH software needs to have all the reads in same direction as the reference database, we need to change them into correct strand. All the sequences should start from NS31 primer and end with AML2 primer. To achieve this, we use MaarjAM database with our cleaned sequences and run BLAST+ software to identify strand of the sequences. Sequences identified as +/- by the BLAST+ needs to be reverse complemented.
blastn -query illumina.cleaned.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db maarjam/maarjam -outfmt 5 | python pipeline_parse_blast.py > illumina.strand.blast
Now run python script that reads BLAST results and fasta input to change direction of the sequences
python pipeline_correct_direction.py -f illumina.cleaned.fasta -b illumina.strand.blast > illumina.correct.fasta
Remove chimeric sequences
Once files are cleaned, we need to remove chimeric sequences that are introduced using PCR. We use USEARCH in reference database mode against MaarjAM database. Make sure to use correct input file for 454 and Illumina, 454.cleaned.fasta and illumina.correct.fasta respectivelly.
usearch -uchime_ref 454.cleaned.fasta -db maarjam/maarjam.fasta -nonchimeras 454.cf.fasta -strand plus usearch -uchime_ref illumina.correct.fasta -db maarjam/maarjam.fasta -nonchimeras illumina.cf.fasta -strand plus
Identify reads against reference database
Once we have removed chimeric reads, we can start identifying sequences using BLAST+ software and MaarjAM database.
blastn -query 454.cf.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db maarjam/maarjam -outfmt 5 | python pipeline_parse_blast.py > 454.cf.blast blastn -query illumina.cf.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db maarjam/maarjam -outfmt 5 | python pipeline_parse_blast.py > illumina.cf.blast
Summarize BLAST results
Finally, we can summarize BLAST result using parsed output. Providing FASTA file will output also nohit selection that can be used for further BLAST against additional databases. We use parameters -vs and -ve do define reference database variable region location. Because we use MaarjAM database in this example, all the referene sequences start after NS31 primer and variable region on the amplicon is located from 70nt to 300nt after the NS31 primer. We also define hit identity -i to be at least 97% and alignment length -l for the hit at least 95% to be counted as a hit.
python pipeline_summarize_blast.py -f 454.cf.fasta -b 454.cf.blast -i 97 -l 95 -t 0 -vs 70 -ve 300
python pipeline_summarize_blast.py -f illumina.cf.fasta -b illumina.cf.blast -i 97 -l 95 -t 0 -vs 70 -ve 300
Command help
python pipeline_summarize_blast.py -b BLAST_FILE [-f FASTA_FILE] -i IDENTITY[0-100] -l ALIGNMENT[0-100] [-vs VARIABLE_START] [-ve VARIABLE_END] -t BLAST_TYPE[0-2] arguments: -b BLAST_FILE BLAST tabulated output that was generated with pipeline_parseblast.py -f FASTA_FILE FASTA file to be used to output list of bad hits that did not match thresholds -i IDENTITY[0-100] hit identity in percentage to be accepted as a hit, recommended 97 -l ALIGNMENT[0-100] hit aliginment length in percentage to be accepted a hit, recommended 95 -vs VARIABLE_START reference sequence variable region start -ve VARIABLE_END reference sequence variable region end -t BLAST_TYPE[0-2] defines which section of the BLAST to be used to summarize results. 0 - suitable for MaarjAM, only last portion of hit description is used, 1 - all hit description is used, 2 - hit identificator is used
Final results
Two files are generated from previous step. One is named as .nohits.fasta, where sequences that did not get significant hit against reference database are written out and .tsv, where results are written as pivot table with samples and hits sorted in descending order.
BLAST against INSDC (Optional)
To identify nohits we can use INSDC database to understand what else the sequences are containing. We first need to download database from NCBI FTP server and conduct BLAST on the downloaded database. BLAST can be run with all the INSDC data partitions together (large memory usage) or if memory usage is limited, by separately.
Download INSDC database from NCBI FTP server.
Please check number of NT sequences in FTP to download them all, change 41 from below line accordingly. As the database contains only GenBank accessions, GenBank ID and short description, we need to download taxonomy information (gi_taxid_nucl.dmp.gz, taxdump.tar.gz) to build taxonomy tree for each hit.
for i in {00..41}; do echo "Downloading NT.$i"; wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.$i.tar.gz; tar xzvf nt.$i.tar.gz; done wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz gunzip gi_taxid_nucl.dmp.gz gunzip taxdump.tar.gz
Run BLAST against INSDC
To run BLAST with all the INSDC data partitions together, we can simply define database parameter in BLAST as nt or if we want to run them separately, we need to define all the partions one by one or use for loop.
blastn -query 454.cf.blast.i97.a95.nohits.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db nt -outfmt 5 | python pipeline_parse_blast.py > 454.nohits.blast
blastn -query illumina.cf.blast.i97.a95.nohits.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db nt -outfmt 5 | python pipeline_parse_blast.py > illumina.nohits.blast
or run separately and combine BLAST results together by selecting best hut for each sequence based on BLAST score (change number 41 accordinly to downloaded partitions)
for i in {00..41}; do blastn -query 454.cf.blast.i97.a95.nohits.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db nt.$i -outfmt 5 | python pipeline_parse_blast.py > 454.nohits.$i.blast; done less 454.nohits..blast | python pipeline_merge_blasts.py > 454.nohits.blast for i in {00..41}; do blastn -query illumina.cf.blast.i97.a95.nohits.fasta -evalue 1e-50 -max_target_seqs 1 -num_threads 4 -db nt.$i -outfmt 5 | python pipeline_parse_blast.py > illumina.nohits.$i.blast; done less illumina.nohits..blast | python pipeline_merge_blasts.py > illumina.nohits.blast
Summarize BLAST results
We use relaxed parameters to filter potential hits by reducing identity threshold to be at least 90% and length at least 90%. We also need to define files, where taxonomy information is stored for nt database. Warning: as the *.dmp files are relatively large and below script is not optimized, it can use large ammount of memory and will take time to process file gi_taxid_nucl.dmp.
python pipeline_summarize_gbblast.py -b 454.nohits.blast -i 90 -l 90 -ti gi_taxid_nucl.dmp -tt names.dmp -tn nodes.dmp
python pipeline_summarize_gbblast.py -b illumina.nohits.blast -i 90 -l 90 -ti gi_taxid_nucl.dmp -tt names.dmp -tn nodes.dmp
Command help
python pipeline_summarize_gbblast.py [-h] -b BLAST_FILE -ti ID_FILE -tt TAXONOMY_FILE -tn NODE_FILE -i IDENTITY[0-100] -l ALIGNMENT[0-100]
arguments:
-h, --help show this help message and exit
-b BLAST_FILE BLAST tabulated output that was generated with pipeline_parseblast.py
-ti ID_FILE Taxonomy file, where for each GenBank ID node ID is specified
-tt TAXONOMY_FILE Taxonomy file, where for each node ID scientific name is provided
-tn NODE_FILE Taxonomy file, where full tree node connections of node IDs are provided to build full taxonomy tree
-i IDENTITY[0-100] hit identity in percentage to be accepted as a hit, recommended 90
-l ALIGNMENT[0-100] hit aliginment length in percentage to be accepted a hit, recommended 90
Use following citation when using our python scripts:
Vasar, M., Andreson, R., Davison, J., Jairus, T., Moora, M., Remm, M., Young, J.P.W., Zobel, M. and Öpik, M., 2017. Increased sequencing depth does not increase captured diversity of arbuscular mycorrhizal fungi. Mycorrhiza, 27(8), pp.761-773. License: CC-BY