CUSHAW2 - fast, parallel and accurate gapped read alignment

Introduction

CUSHAW is a well-established leading next-generation sequencing read alignment software package based on multi-core and many-core computing. CUSHAW2, the second distribution of CUSHAW software package for next-generation sequencing read alignment, is a fast and parallel (multi-threaded and GPU-enabled) gapped read alignment to large genomes, such as the human genome. The performance evaluation, by aligning simulated and real datasets to the human genome, shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment,while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads.

CUSHAW2 is presented in the paper "Long read alignment based on maximal exact match seeds". This algorithm has been further accelerated using GPU computing and implemented as CUSHAW2-GPU (refer to the paper CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing).

Note:

How to produce all mappings per read? specify a very large integer value to options "-multi" and "-max_occ" simultaneously.

Downloads

CUSHAW2 (v2.4.3)
New features in the new CUSHAW2 v2.4 series are described here.We have evaluated the alignment quality, as well as the variant calling performance, using the public benchmark datasets in GCAT (Genome Comparison & Analytic Testing). Click the following links to get the results and comparisons with graphical interface.^NEW
CUSHAW2 (v2.1.11)
We have evaluated the alignment quality, as well as the variant calling performance, using the public benchmark datasets in GCAT (Genome Comparison & Analytic Testing). Click the following links to get the results and comparisons with graphical interface. ^NEW
The following list the ready-made BWT indexes for versions 2.1.x.
CUSHAW2-GPU (v2.1.8-r16)
This program (based on CUSHAW2 v2.1.8) is designed and optimized for Kepler-based GPUs, but still workable on earlier-generation Fermi-based ones. By default, CUSHAW2-GPU supports short-reads of lengths <= 320 and thus requires at most 4GB GPU device memory and about 6 GB peak resident host memory for the human genome. This program has been tested on Tesla K20 and Tesla C2075. Welcome any bug reports and suggesitons! The detailed implementation of the algorithm has been presented in the paper "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing" (accepted by IEEE Design & Test of Computers, 2013).
Genome indexer for CUSHAW2 version 2.1.x and CUSHAW2-GPU
Users can download this independent source code of CUSHAW2 genome indexer to build genome indices. Additionally, this code can also be available in the source tarball of CUSHAW2 and CUSHAW2-GPU.
Simulated reads
1. in the CUSHAW2 paper
2. in the CUSHAW2-GPU paper

Citation

Yongchao Liu and Bertil Schmidt: "Long read alignment based on maximal exact match seeds". Bioinformatics, 2012, 28(18): i318-324
Yongchao Liu and Bertil Schmidt: "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing". IEEE Design & Test, 2014, 31(1): 31-39

Other related papers

Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "A fast CUDA compatible short read aligner to large genomes". GPU Technology Conference 2012 (GTC 2012), San Jose, USA, 2012
Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform". Bioinformatics, 2012, 28(14): 1830-1837
Yongchao Liu, Bernt Popp, and Bertil Schmidt: "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding." PLOS ONE, 2014, 9(1):e86869
Yongchao Liu and Bertil Schmidt: "CUSHAW Software Package: harnessing CUDA-enabled GPUs for next generation sequencing read alignment". GPU Technology Conference 2014 (GTC 2014), San Jose, USA, 2014
Jorge González-Domínguez, Yongchao Liu, Bertil Schmidt: "Parallel and scalable short-read alignment on multi-core clusters using UPC++". PLoS One, 2016, 11(1): e0145490.
Yongchao Liu, Thomas Hankeln, and Bertil Schmidt: "Parallel and space-efficient construction of Burrows-Wheeler transform and suffix array for big genome data". IEEE Transactions on Computational Biology and Bioinformatics, 2016, 13(3): 592-598.
Yongchao Liu and Bertil Schmidt: "CUSHAW Suite: parallel and efficient algorithms for NGS read alignment". Algorithms for Next-Generations Sequencing Data: Techniques, Approaches and Applications, edited by Mourad Elloumi, Springer, 2017.
Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu and Bertil Schmidt: "PUNAS: a parallel ungapped-alignment-featured seed verification for next-generation sequencing read alignment", 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), 2017, pp. 52-61.

Parameters for all 2.1.x versions

Input:

-r <string> (the file name base for the reference genome)
-f <string> file1 [file2] (single-end sequence files in FASTA/FASTQ format)
-b <string> file1 [file2] (single-end sequence files in BAM format)
-s <string> file1 [file2] (single-end sequence files in SAM format)
-q <string> file1_1 file1_2 [file2_1 file2_2] (paired-end sequence files in FASTA/FASTQ format)

Output:

-o <string> (SAM output file path, default = STDOUT)
-unique <bool> (only output unique alignments [with mapping quality scores > 0], default = 0)
-multi <int> (output <= #int [1, 100] equivalent best alignments, default = 1)
-rgid <string> (read group identifier [tag RG:ID])
-rgsm <string> (read group sample name [tag RG:SM], required if #rgid is given)
-rglb <string> (read group library [tag RG:LD], ineffective if #rgid is not given)
-rgpl <string> (read group platform/technology [tag RG:PL], ineffective if #rigid is not given)
supported values: capillary, ls454, illumina, solid, helicos, iontorrent, and pacbio
-rgpu <string> (read group platform unit identifier [tag RG:PU], ineffective if #rgid is not given)
-rgcn <string> (name of sequencing center produced the read [tag RG:CN], ineffiective if #rgid is not given)
-rgds <string> (description about the reads [tag RG:DS], ineffiective if #rgid is not given)
-rgdt <string> (date on which the run was produced [tag RG:DT], ineffiective if #rgid is not given)
-rgpi <string> (predicated median insert size [tag RG:PI], ineffiective if #rgid is not given)

Scoring:

-match <int> (score for a match, default = 1)
-mismatch <int> (penalty for a mismatch, default = 3)
-gopen <int> (gap open penalty, default = 5)
-gext <int> (gap extension penalty, default = 2)

Align:

-sensitive (concerned more about the sensitivity, only using min_score)
-min_score <int> (minimal optimal local alignment score divided by matching score, default = 30)
-min_id <float> (minimal identity of optical local alignments, default = 0.90)
-min_ratio <float> (minimal ratio of reads in optimal local alignments, default = 0.80)

Seed:

-min_seed <int> (lower bound of minimal seed size, default = 13)
-max_seed <int> (upper bound of minimal seed size, default = 49)
-miss_prob <float> (missing probability to estimate the seed sizes, default = 0.04)
-max_occ <int> (maximal number of occurrences per seed, default = 1024)
-max_seeds_per_batch <int> (maximum number of seeds per batch for top hit selection, default = 16384)

Pairing:

-avg_ins <int> (insert size for paired-end reads, estimated from input if not specified)
-ins_std <int> (standard deviation of insert size, estimated from input if not specified)
-ins_npairs <int> (top number of read pairs for insert size estimation [#int times 0x10000], default = 1)
-ins_mapq <int> (minimal mapping quality score of a SE alignment for insert size estimation, default = 20)
-max_seedpairs <int> (maximal number of seed pairs per read pair, default = 10000000)
-no_rescue (do not rescue the mate using Smith-Waterman for an un-paired read)
-no_rescue_twice (do not attempt the second rescuing for an un-paired read)

Compute:

-t <int> (number of threads, default = 1)

Others:

-h (print out the usage of the program)

Installation and Usage

Installation from source code

Preparation

Users can configure CUSHAW2 to use SSE2, instead of SSE4, when SSSE3 is not available on your CPUs. This configuration can be done by changing "have_ssse3 = 1" to "have_ssse3 = 0" in the Makefile. If users do not know whether their CPUs support SSSE3 or not, please just simply change to "have_ssse3=0" in the makefile because SSE2 is supported in nearlly all Intel and AMD CPUs.
How to known when to modify the Makefile to determine the use of either SSSE3 or SSSE2?
- run command "cat /proc/cpuinfo" to check the CPU information. In the "flags" line, check the existence of word "ssse3". If existing, it means that your CPU support SSSE3 and otherwise, not support.
- When you failed to compile CUSHAW2, please first check whether it is caused by unidentified SSSE3 assembly instructions.

Source code

For all 2.1.x versions (including CUSHAW2-GPU):

the sub-directory "cushasw2_index" contains the source code for the BWT construction.
the sub-directory "src" contains the source code for CUSHAW2 aligner

For all 2.4.x versions:

The genome indexing and the alignment algorithms have been integrated into a single program.
Three commands have been given, namely "index", "align" and "calign" for the genome indexing construction, base-space alignment and color-space alignment, respectively.

Compile the genome indexer and read aligner

For all 2.1.x versions (including CUSHAW2-GPU):

Genome indexing
- Enter the sub-directory "cushaw2_index" in the source code tarball
- Type "make" command to generate the "cushaw2_index" executable binary.
Read aligner
- Type "make" command, in the root directory of the source code tarball, to compile the aligner "cushaw2".

For all 2.4.x versions:

Type "make" command in the root directory to compile the aligner.

Build the BWT and the FM-index

For all 2.1.x versions(including CUSHAW2-GPU):

use command "cushaw2_index -a bwtsw genome.fa" to create BWT and FM-index data for large genomes.
use command "cushaw2_index -a is genome.fa" for small genomes.

For all 2.4.x versions:

ust use command "cushaw2 index -a genome.fa" to construct BWT and FM-index for genomes, regardless of the genome size.

Typical Usage

For all 2.1.x versions (including CUSHAW2-GPU):

cushaw2 -r bwt_file_base -f infile1.fa -t 12 -o myalignment.sam
cushaw2 -r bwt_file_base -f infile1.fa infile2.fq -t 12 -o myalignment.sam
cushaw2 -r bwt_file_base -q infile_1.fq infile_1.fq -o myalignment.sam

For all 2.4.x versions:

cushaw2 align -r bwt_file_base -f infile1.fa -t 12 -o myalignment.sam
cushaw2 align -r bwt_file_base -f infile1.fa infile2.fq -t 12 -o myalignment.sam
cushaw2 align -r bwt_file_base -q infile_1.fq infile_1.fq -o myalignment.sam
cushaw2 calign -r bwt_file_base -q infile1.fa infile2.fq -t 12 -o myalignment.sam

Want all mappings per read?

specify a very large integer value to options "-multi" and "-max_occ" simultaneously. Please do not exceed the range of the signed integer type.

Important Notes:

gzip-Compressed FASTA and FASTQ formats, SAM and BAM foramts are supported as input.
When inputing multiple paired-end read files, the paired-end reads must have the same insert-size information.
The default scoring scheme is generally good enough for long read alignment. Certainly, better performance might be able to be obtained after making more efforts to finely tune the scoring scheme.
By default, only a single "best" alignment will be output for a single read. Users can get more top alignments using parameter "-multi".
Both aligned and unaligned reads are printed out to the SAM output file. In addition, for paired-end alignment, if an aligned read failed to be paired, it is outputted in single-end mode.
By default, CUSHAW2 estimates the insert size information from the input. The insert size is estimated from a fixed number of read pairs starting from the head of the inputs. This will take some extra time at startup time (e.g. takes about 1 minute using a single thread for the first 65536 100-bp read pairs). However, since this estimation is only conducted once, this extra time can be neligible. If users customize the insert size, this automatic estimation will be disabled, thus saving some time.
For all 2.4.x versions, we have masked all ambiguous bases in the reference. Users can disabling this masking by disable the macro "CHECK_UNKOWN_GENOME_BASES" in the Makefile while compiling.

Change Log

September 10, 2014 (v2.1.11)
1. Added read group tags information in the SAM output.
October 22, 2013 (v2.4.3)
1. Fixed a boundary-case bug in semi-global alignment.
May 02, 2013 (v2.4.2)
1. Making the SAM format consisitent with the requirements of Picard and GATK.
April 18, 2013 (v2.4.1 and v2.1.10)
1. Fixed a bug when rescuing alignments using a smaller minimal maximal exact match seed size. This possibly occurred for very short reads (e.g. 15bps).
April 4, 2013 (v2.4)
Some new features have been incorporated into this new version, described as follows.
1. Different reference genome indexing from previous versions. Type command "cushaw2 index" to get options for genome index building.
2. Support ABI SOLiD color-space single-end and paired-end/mate-paired alignment. Type command "cushaw2 calign" to get options for color-space alignment.
  By default, CUSHAW2 aligns color-space reads in a conservative manner, i.e. our aligner will only ouput unique alignments with <= 5 edit distances (an alignment is deemed to be unique if its mapping quality score >= 30, as the quite small read lengths). For paired-end/mate-paired alignment, the two alignments of both ends will be reported if either alignment is unique. To yield higher sensitivity, users can run the program in an aggressive manner by specifying parameters "-min_qual 0 -max_edit_dist -1". Use the option "-mode" to specify if the reads are paired-end or mate-paired.
3. Improved base-space single-end and paired-end alignment quality. Type "cushaw2 align" to get options for base-space alignment (e.g. Illumina, 454, Ion Torrent and PacBio).
4. Add a new option "-max_edit_dist" to allow users to specify the maximum edit distance of a reported alignment.
5. Each alignment in the SAM file has two tags given: NM (edit distance) and AS (alignment score).
March 21, 2013 (v2.1.9)
1. Add a new option "-max_seedpairs" for paired-end alignments to confine the maximal number of seed pairs for the seed-pairing heuristic.
2. Add a new option "-max_seeds_per_batch" to confine the peak memory footprint for top-hits selection.
3. Fixed a memory leak in versions 2.1.8 and 2.1.9
4. Deleted all versions (<2.1.8) from the homepage as they are obsolete.

Contact

If any questions or improvements, please feel free to contact Liu, Yongchao.

CUSHAW2 - fast, parallel and accurate gapped read alignment

Site Map

Project Links

List of My Software

Big Data

Machine Learning

Scientific Computing

Sequence Alignment

Motif Discovery

NGS Read Alignment

NGS Read Error Correction

NGS de novo Assembly

NGS SNV calling

NGS Metagenomics

Inspire Innovation

Introduction

Downloads

Citation

Other related papers

Parameters for all 2.1.x versions

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:

Installation and Usage

Installation from source code

Preparation

Source code

Compile the genome indexer and read aligner

Build the BWT and the FM-index

Typical Usage

Want all mappings per read?

Important Notes:

Change Log

Contact