Introduction
DUK is a fast, accurate, and memory-efficient DNA sequence matching tool. It reports whether a query sequence partially or totally matches any of the given reference sequences, but it does not report how the query matches a reference sequence. A common application is to split sequencing reads into small, manageable groups for downstream analysis when assessing the quality of a sequencing run, including contaminant removal (when the contaminant sequences are known), organelle genome separation, and assembly refinement.
It uses k-mer hashing to index the reference sequences and a Poisson model to calculate a p-value that estimates the reliability of each match. DUK is implemented in C++ using an object-oriented design.
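The idea of k-mer indexing plus a Poisson p-value can be illustrated with a minimal C++ sketch. This is not DUK's actual source: the function names (build_index, count_hits, is_match, poisson_tail) are hypothetical, and the sketch assumes that every k-mer of the reference is indexed, that the query is sampled every `step` bases, and that the expected number of chance hits (lambda) is supplied by the caller. How DUK itself applies the step size and derives the Poisson rate is not described in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical helper: store every k-mer of the reference in a hash set.
std::unordered_set<std::string> build_index(const std::string& ref, std::size_t k) {
    assert(k > 0);
    std::unordered_set<std::string> index;
    for (std::size_t i = 0; i + k <= ref.size(); ++i) {
        index.insert(ref.substr(i, k));
    }
    return index;
}

// Count the query k-mers, sampled every `step` bases, found in the index.
// Sampling the query (rather than the reference) is an assumption of this sketch.
std::size_t count_hits(const std::unordered_set<std::string>& index,
                       const std::string& query, std::size_t k, std::size_t step) {
    assert(k > 0 && step > 0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i + k <= query.size(); i += step) {
        if (index.count(query.substr(i, k)) > 0) {
            ++hits;
        }
    }
    return hits;
}

// A read is treated as matched when at least `cutoff` sampled k-mers hit.
bool is_match(std::size_t hits, std::size_t cutoff) {
    return hits >= cutoff;
}

// Poisson upper tail P(X >= hits) for X ~ Poisson(lambda), where lambda is
// the expected number of hits by chance (an input assumed by this sketch).
double poisson_tail(std::size_t hits, double lambda) {
    if (hits == 0) {
        return 1.0;
    }
    double term = std::exp(-lambda);  // P(X = 0)
    double cdf = term;                // accumulates P(X <= hits - 1)
    for (std::size_t i = 1; i < hits; ++i) {
        term *= lambda / static_cast<double>(i);
        cdf += term;
    }
    return std::max(0.0, 1.0 - cdf);
}
```

Under this sketch's sampling assumption, a 100-base read with k=16 and a step of 4 yields 22 sampled k-mers, so a larger step trades sensitivity for fewer hash lookups. A small p-value from poisson_tail means the observed number of hits is unlikely to arise by chance alone.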
- Requirements
- DUK should run on any 64-bit Linux/Unix machine with g++.
- Compile Instructions
- Type make to compile the tool:
>make
- Manual
- duk [options] ref.fa query
- OPTIONS:
-o, -output
    print log information to file; default is stdout.
-n, -nomatch
    write the unmatched reads to file; the option value - stands for standard output.
-m, -match
    write the matched reads to file.
-k, -kmer
    the k-mer size; default is 16.
-s, -step
    the step size; default is 4.
-c, -cutoff
    the cutoff threshold for matched reads; default is 1.
-h, -help
    print the help information.
- Identifies which reads in the query file match the sequences in ref.fa. ref.fa must be in FASTA format; the query can be in FASTQ or FASTA format. If no query file is given, the tool reads from standard input.
- k=16 and c=1 are recommended for small genomes, such as bacteria, or for adapter removal tasks. k=20 and c=2 are recommended for large genomes, such as plants.
- Example:
./duk Illumina.Artifacts.fa sample.fastq
- License
- The full DUK package is distributed under the FreeBSD License.