Trimmomatic
Trimmomatic is a command line tool that performs a variety of functions for Illumina paired-end and single-end sequencing data, including sequence trimming and adapter removal. It is a flexible and efficient NGS preprocessing tool, which greatly improves downstream applications.
Description
- Trimmomatic stand-alone command-line application
Trimmomatic is a fast, multithreaded command-line tool for preprocessing Illumina sequencing data. It was developed to clean raw sequence reads by removing low-quality bases and adapter sequences before downstream analyses such as mapping or assembly. Trimming your data correctly is an important first step that can significantly improve the accuracy and reliability of your results. The tool performs a series of user-defined trimming steps on FASTQ files. It can process both single-end (SE) and paired-end (PE) data and generate clean reads for analysis.
Trimmomatic is available at the official GitHub page https://github.com/usadellab/Trimmomatic/.
- Understanding the Trimmomatic trimming modes
With simple trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.
In paired-end data Trimmomatic uses a highly accurate method for adapter removal called palindrome trimming. This mode is specifically designed to handle a common scenario in library preparation where the DNA insert is shorter than the read length. When the DNA fragment being sequenced is shorter than the actual sequencing read length, the sequencing machinery reads through the entire fragment and continues into the adapter sequence ligated to the other end.
The palindrome mode utilises the properties of paired-end reads. The forward and reverse reads are essentially reverse complements of the same original DNA fragment.
- Initial Search
Trimmomatic first performs a standard search, looking for known adapter sequences in your provided FASTA file.
- Palindrome Alignment
After the initial search, it takes the read pairs and attempts to align them to each other. Because they originate from the same fragment, the overlapping sections should be perfect reverse complements.
- Accurate Clipping
A strong palindromic alignment between the two reads is an extremely reliable sign that both reads have run through the whole fragment and into the adapter. The tool can then precisely identify the start of the adapter sequence and clip both reads accurately.
This method is far more sensitive than the simple search for adapter sequences, as it uses the information from the read's partner to confirm the presence of adapter contamination.
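The reverse-complement relationship behind palindrome mode can be illustrated with a toy example. The following is a minimal sketch in plain POSIX shell, not Trimmomatic's own code: the 12-base fragment, the 16-base read length, and the adapter strings `GGGG` and `TTTT` are made-up placeholders, and the insert length is taken as known here, whereas Trimmomatic has to find it through alignment.

```shell
# Toy illustration only: sequences and adapter strings are made up.
revcomp() {
    # Complement each base with tr, then reverse the string with awk.
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA' |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

fragment="ACGTTGCAAGCT"   # 12-base insert, shorter than the 16-base reads
adapter_fwd="GGGG"        # placeholder for adapter bases seen by the forward read
adapter_rev="TTTT"        # placeholder for adapter bases seen by the reverse read

# Both reads run through the insert and continue into adapter sequence.
read_fwd="${fragment}${adapter_fwd}"
read_rev="$(revcomp "$fragment")${adapter_rev}"

insert_len=${#fragment}
fwd_insert=$(printf '%s\n' "$read_fwd" | cut -c "1-${insert_len}")
rev_insert=$(printf '%s\n' "$read_rev" | cut -c "1-${insert_len}")

# The insert parts of the two reads are reverse complements of each other.
if [ "$fwd_insert" = "$(revcomp "$rev_insert")" ]; then
    echo "reads overlap over ${insert_len} bases; the rest is adapter"
    echo "forward read clipped: $fwd_insert"
    echo "reverse read clipped: $rev_insert"
fi
```

Everything beyond the overlapping part of each read is adapter, which is why the overlap marks the clipping position for both reads at once.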
- How to run Trimmomatic
When you invoke trimmomatic.jar on the command line, the first parameter specifies whether the sequencing data to be trimmed is of the Paired-End (PE) or Single-End (SE) type.
In addition, there are several optional parameters available, for example to specify the number of CPU threads to use for multi-threading and the type of quality score encoding. All optional parameters are listed below.
Optional parameters
-threads <threads>
specifies the number of CPU threads to use for multi-threading
-phred33 | -phred64
specifies the quality score encoding. phred33 is the standard for modern Illumina data
-trimlog <log file>
writes a detailed log file
-summary <summary file>
writes a summary of trimming results to a file
-basein <template input file>
sets path to one of the paired-end input files (its mate is auto-detected)
-baseout <template output file>
sets template path used to generate the four paired-end output files
-validatePairs
performs an extra validation step on paired-end reads before trimming to ensure read pairs are consistent
-compressLevel <compression level>
sets the compression level for BZIP2/GZ output files (1=fastest, 9=best compression)
-compressStream | -compressBlock
specifies the compression mode. Block compression is the default
-quiet
suppresses progress output to the console
-version
prints the version tag
Finally, the processing pipeline is set up by selecting and parameterising at least one trimming step. All further trimming steps and their order are optional. The available trimming steps are executed in the order in which they are added to the command line. It is recommended that the trimming step ILLUMINACLIP, if required, is done as early as possible. The trimming works with FASTQ formatted files, either uncompressed or compressed (the gzip format is determined based on the .gz extension).
trimmomatic.jar running in paired-end (PE) mode requires two input FASTQ files (forward and reverse reads). This mode generates four output files: two for reads that are left as pairs and two for reads that are left as singletons.
java -jar <path to trimmomatic.jar> PE [optional parameters] \
<input_forward> <input_reverse> \
<output_forward_paired> <output_forward_unpaired> \
<output_reverse_paired> <output_reverse_unpaired> \
<step #1> [further optional trimming steps]
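As an illustration, a filled-in paired-end call might look like the sketch below. The jar path, adapter file, sample file names, thread count, and the step values (including SLIDINGWINDOW:4:15 and MINLEN:36) are assumptions chosen for the example, not recommendations from this document. The script only assembles and prints the command, so it can be inspected without Java or Trimmomatic being installed.

```shell
# Hypothetical paths and file names; adjust to your installation and data.
TRIMMOMATIC_JAR="trimmomatic.jar"
ADAPTERS="adapters.fa"
SAMPLE="sample"

# Two inputs, four outputs, then the trimming steps in execution order.
set -- java -jar "$TRIMMOMATIC_JAR" PE -threads 4 -phred33 \
    "${SAMPLE}_R1.fastq.gz" "${SAMPLE}_R2.fastq.gz" \
    "${SAMPLE}_R1.paired.fastq.gz" "${SAMPLE}_R1.unpaired.fastq.gz" \
    "${SAMPLE}_R2.paired.fastq.gz" "${SAMPLE}_R2.unpaired.fastq.gz" \
    "ILLUMINACLIP:${ADAPTERS}:2:30:10" LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

# Dry run: print the command that would be executed.
PE_CMD=$*
printf '%s\n' "$PE_CMD"

# To actually run it (requires Java and the Trimmomatic jar), uncomment:
# "$@"
```

The adapter clipping step is placed first, in line with the recommendation above to run ILLUMINACLIP as early as possible.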
trimmomatic.jar running in single-end (SE) mode requires one input FASTQ file and generates one output file.
java -jar <path to trimmomatic.jar> SE [optional parameters] \
<input> <output> \
<step #1> [further optional trimming steps]
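A single-end call follows the same pattern with one input and one output file. The sketch below is again a dry run with hypothetical file names and illustrative step values; it also shows the -summary option from the parameter list above.

```shell
# Hypothetical paths and file names; adjust to your installation and data.
TRIMMOMATIC_JAR="trimmomatic.jar"
ADAPTERS="adapters.fa"
SAMPLE="sample"

# One input, one output, then the trimming steps in execution order.
set -- java -jar "$TRIMMOMATIC_JAR" SE -phred33 \
    -summary "${SAMPLE}.summary.txt" \
    "${SAMPLE}.fastq.gz" "${SAMPLE}.trimmed.fastq.gz" \
    "ILLUMINACLIP:${ADAPTERS}:2:30:10" SLIDINGWINDOW:4:15 MINLEN:36

# Dry run: print the command that would be executed.
SE_CMD=$*
printf '%s\n' "$SE_CMD"

# To actually run it (requires Java and the Trimmomatic jar), uncomment:
# "$@"
```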
- Trimmomatic trimming steps
The FASTQ format encodes quality scores as phred+33 or phred+64 (depending on the Illumina pipeline used). For example, a score of 20 corresponds to a base call accuracy of 99% (1 error in 100) and a score of 40 to 99.99% (1 error in 10,000); for further details see the Wikipedia article on Phred quality scores. The <quality> parameter in several trimming steps is expected to be a numeric Phred quality score.
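The relationship between Phred scores, accuracy, and the two encodings can be checked with a few lines of shell. This is a small sketch using the standard definition of the Phred score (error probability 10^(-Q/10)); the helper names `phred_accuracy` and `decode_quality` are made up for this example.

```shell
# Base call accuracy in percent for a Phred score Q: 100 * (1 - 10^(-Q/10)).
phred_accuracy() {
    LC_ALL=C awk -v q="$1" 'BEGIN { printf "%.2f\n", 100 * (1 - 10 ^ (-q / 10)) }'
}

# Decode one FASTQ quality character: ASCII code minus the offset (33 or 64).
decode_quality() {
    code=$(printf '%d' "'$1")
    echo $((code - $2))
}

phred_accuracy 20      # prints 99.00
phred_accuracy 40      # prints 99.99
decode_quality I 33    # 'I' is ASCII 73, prints 40
decode_quality h 64    # 'h' is ASCII 104, prints 40
```

The same Phred score of 40 is therefore written as `I` in a phred+33 file and as `h` in a phred+64 file, which is why the encoding has to be specified or detected correctly.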
ILLUMINACLIP:<fastaWithAdapters>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>
[:<minAdapterLengthPalindrome>:keepBothReads]
cuts adapter sequences and other Illumina-specific technical sequences from the reads. This is the most common first step.
fastaWithAdapters | path to a fasta file containing all the adapters
seedMismatches | maximum number of sequence mismatches allowed in initial search for adapter (2 is a common value)
palindromeClipThreshold | minimum alignment score (match accuracy) required for the palindrome alignment between the two reads of a pair (30 is a common value)
simpleClipThreshold | minimum alignment score (match accuracy) required for a match between an adapter sequence and a read (10 is a common value)
minAdapterLengthPalindrome | minimum adapter length in palindrome mode (optional)
keepBothReads | specifies if both reads should be kept in palindrome mode (optional)
LEADING:<quality>
cuts off bases with a quality below the threshold from the 5'-end of a read
quality | minimum quality score required to keep a base, common values are between 3 and 20
TRAILING:<quality>
cuts off bases with a quality below the threshold from the 3'-end of a read
quality | minimum quality score required to keep a base, common values are between 3 and 20
HEADCROP:<length>
cuts a specified number of bases from the 5'-end of a read
length | number of bases to remove
TAILCROP:<length>
cuts a specified number of bases from the 3'-end of a read
length | number of bases to remove
CROP:<length>
trims a read to a specified length by cutting the 3'-end
length | number of bases to keep
SLIDINGWINDOW:<windowSize>:<quality>
scans the read from the 5'-end with a sliding window and cuts it once the average quality score within the window falls below a threshold value
windowSize | length of the sliding window
quality | average quality score required, common values are between 10 and 40
MAXINFO:<targetLength>:<strictness>
trims a read to maximize useful information, balancing for read length and quality
targetLength | user-defined ideal length for a read
strictness | specifies the trade-off between length and quality, numeric value between 0.0 and 1.0
MAXLEN:<length>
discards a read that is longer than a certain length
length | maximum permitted length
MINLEN:<length>
discards a read that is shorter than a certain length (after all previous trimming steps)
length | minimum required length
AVGQUAL:<quality>
discards a read if its average quality is below a threshold value
quality | minimum required average quality score, common values are between 10 and 30
BASECOUNT:<bases>:<minCount>:<maxCount>
discards a read if the frequency of a base (or multiple bases) is below a specified minimum or above a specified maximum
bases | specifies a base (or multiple bases), single string of concatenated characters, each representing a base (e.g. N or GC)
minCount | minimum required frequency of a base (or multiple bases) to keep a read
maxCount | maximum permitted frequency of a base (or multiple bases) to keep a read
TOPHRED33
converts quality scores to Phred+33 quality scores
TOPHRED64
converts quality scores to Phred+64 quality scores
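To make the SLIDINGWINDOW step described above more concrete, the sketch below applies a simplified version of the idea to a list of per-base Phred scores. It is an assumption-laden illustration, not Trimmomatic's implementation: the helper name `sliding_window_keep` is hypothetical, the read is cut at the start of the first window whose average falls below the threshold, and the real tool may differ in where exactly it places the cut.

```shell
# sliding_window_keep <windowSize> <quality> <q1> <q2> ...
# Prints how many bases would be kept, counted from the 5'-end.
sliding_window_keep() {
    win=$1
    min_q=$2
    shift 2
    printf '%s\n' "$*" | LC_ALL=C awk -v w="$win" -v t="$min_q" '{
        keep = NF                                  # default: keep the whole read
        for (start = 1; start + w - 1 <= NF; start++) {
            sum = 0
            for (i = start; i < start + w; i++) sum += $i
            if (sum / w < t) {                     # window average below threshold
                keep = start - 1                   # cut at the start of this window
                break
            }
        }
        print keep
    }'
}

# Made-up quality values that decay towards the 3'-end, window 4, threshold 15.
sliding_window_keep 4 15 38 38 37 36 30 22 12 8 5 2    # prints 5
```

In this example the window starting at base 6 is the first with an average below 15 (47 / 4 = 11.75), so only the first five bases are kept. A following MINLEN step would then decide whether a read this short is discarded.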