Introduction

CUSHAW2 (the second distribution of the CUSHAW software package for next-generation sequencing read alignment) is a fast, parallel gapped read aligner for large genomes, such as the human genome. Performance evaluation, by aligning simulated and real datasets to the human genome, shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads.

CUSHAW2 is presented in the paper "Long read alignment based on maximal exact match seeds". The algorithm has been further accelerated using GPU computing and implemented as CUSHAW2-GPU (refer to the paper "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing").

Note:


Downloads


Citation

Other related papers


Parameters for all 2.1.x versions

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:


Installation and Usage

Installation from source code

Preparation

  1. CUSHAW2 can be configured to use SSE2 instead of SSSE3 when SSSE3 is not available on your CPU. To do so, change "have_ssse3 = 1" to "have_ssse3 = 0" in the Makefile. If you do not know whether your CPU supports SSSE3, simply set "have_ssse3 = 0" in the Makefile, because SSE2 is supported by nearly all Intel and AMD CPUs.

  2. How do you know whether to modify the Makefile to use SSSE3 or SSE2?
    • Run the command "cat /proc/cpuinfo" to check the CPU information. In the "flags" line, look for the word "ssse3". If it is present, your CPU supports SSSE3; otherwise, it does not.
    • If CUSHAW2 fails to compile, first check whether the failure is caused by unrecognized SSSE3 assembly instructions.
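
The check described above can also be scripted. The following is a minimal sketch, not part of CUSHAW2: the function names cpu_supports_ssse3 and set_have_ssse3 are hypothetical, and it assumes the Makefile contains a single assignment of the form "have_ssse3 = 1" or "have_ssse3 = 0" on its own line.

```python
import re


def cpu_supports_ssse3(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists ssse3."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the whole token so that "sse3" alone does not count.
        if key.strip() == "flags" and "ssse3" in value.split():
            return True
    return False


def set_have_ssse3(makefile_text: str, enabled: bool) -> str:
    """Rewrite the value of the have_ssse3 assignment in Makefile text.

    Assumption: the assignment starts a line and its value is an integer.
    """
    new_value = "1" if enabled else "0"
    return re.sub(
        r"^([ \t]*have_ssse3[ \t]*=[ \t]*)\d+",
        lambda m: m.group(1) + new_value,
        makefile_text,
        flags=re.MULTILINE,
    )
```

On Linux, the text passed to cpu_supports_ssse3 would typically be the contents of /proc/cpuinfo, and the result can be fed to set_have_ssse3 together with the contents of the Makefile before compiling.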

Source code

Compile the genome indexer and read aligner

Build the BWT and the FM-index

Typical Usage

Want all mappings per read?

  1. Specify a very large integer value for both the "-multi" and "-max_occ" options. Do not exceed the range of the signed integer type.
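
As a rough illustration of the limit mentioned above, the sketch below builds the two option/value pairs using the largest value of a 32-bit signed integer. Two things are assumptions rather than documented behavior: that the signed integer type is 32 bits wide, and that each option takes its value as the following command-line argument. The helper name all_mappings_options is hypothetical.

```python
from typing import List

# Assumption: the signed integer type is 32 bits wide.
INT32_MAX = 2**31 - 1


def all_mappings_options(value: int = INT32_MAX) -> List[str]:
    """Return "-multi" and "-max_occ" with the same large value.

    Raises ValueError if the value falls outside the assumed 32-bit range.
    """
    if not 0 < value <= INT32_MAX:
        raise ValueError("value must be in the range 1..2147483647")
    return ["-multi", str(value), "-max_occ", str(value)]
```

The returned list is meant to be appended to the rest of the aligner command line, which is not shown here.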

Important Notes:

  1. Gzip-compressed FASTA and FASTQ formats, as well as SAM and BAM formats, are supported as input.
  2. When inputting multiple paired-end read files, the paired-end reads must have the same insert-size information.
  3. The default scoring scheme is generally good enough for long read alignment. Better performance may be obtained by fine-tuning the scoring scheme.
  4. By default, only a single "best" alignment is output for each read. Users can get more top alignments using the "-multi" parameter.
  5. Both aligned and unaligned reads are written to the SAM output file. In addition, for paired-end alignment, if an aligned read fails to be paired, it is output in single-end mode.
  6. By default, CUSHAW2 estimates the insert size from the input, using a fixed number of read pairs from the head of the input files. This adds some time at startup (e.g. about 1 minute using a single thread for the first 65536 100-bp read pairs). However, since this estimation is conducted only once, the extra time is usually negligible. If users specify the insert size themselves, the automatic estimation is disabled, which saves this time.
  7. For all 2.4.x versions, all ambiguous bases in the reference are masked. Users can disable this masking by disabling the macro "CHECK_UNKOWN_GENOME_BASES" in the Makefile when compiling.
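
To make note 6 more concrete, the sketch below estimates an insert size from only the first read pairs of a stream. It is an illustration under stated assumptions, not CUSHAW2's actual estimator: it assumes the observed insert size of each pair is already available as an integer, and it summarizes them with a plain sample mean and population standard deviation. The function name estimate_insert_size is hypothetical; the default of 65536 pairs is taken from the example in note 6.

```python
from itertools import islice
from statistics import mean, pstdev
from typing import Iterable, Tuple


def estimate_insert_size(
    insert_sizes: Iterable[int], max_pairs: int = 65536
) -> Tuple[float, float]:
    """Estimate (mean, standard deviation) from the first max_pairs values.

    Only the head of the input is consumed, so the cost is paid once at
    startup regardless of how many read pairs follow.
    """
    head = list(islice(insert_sizes, max_pairs))
    if not head:
        raise ValueError("no read pairs available for estimation")
    return mean(head), pstdev(head)
```

Because islice stops after max_pairs items, the function also works on a generator that streams pairs from a large file without reading the rest of it.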

Change Log


Contact

If you have any questions or suggestions for improvement, please contact Liu Yongchao (Email: yliu860 (at) gatech (dot) edu).