Illumina CASAVA-1.8 FASTQ Filter

Download, Installation, Usage, More Details, Alternatives, Galaxy Wrapper, License, Contact.

What is it?

The recent version of Illumina's CASAVA pipeline (Version 1.8) produces FASTQ files with both reads that pass filtering and reads that don't.

The new READ-ID (the @ line) contains many new fields; one of them indicates whether the read was filtered or not.

This program can filter FASTQ files produced by CASAVA 1.8, and keep/discard reads based on this filter flag.

Scroll down for more details.


Download

Version  Release Date  Source code                       Precompiled static binaries
0.1      05-Aug-2011   fastq_illumina_filter-0.1.tar.gz  Linux-x86_64 (64bit), Linux-i686 (32bit)


Installation

$ wget
$ tar -xzf fastq_illumina_filter-0.1.tar.gz
$ cd fastq_illumina_filter-0.1
$ make
$ sudo cp fastq_illumina_filter /usr/local/bin


Usage

$ fastq_illumina_filter -h
fastq_illumina_filter (version 0.1) - Filters a FASTQ file generated by CASAVA 1.8
Copyright (C) 2011 - A. Gordon  - Released under AGPLv3 or later.

Usage: fastq_illumina_filter  [--keep Y/N] [-NYhv] [-o OUTPUT] [INPUT]

   [INPUT]   = Input file. Reads from STDIN if no file is specified.
   [-o OUTPUT] = Output file. Default is STDOUT if no file is specified.
   --keep N  = Keep reads that were NOT filtered.
               (Reads that have 'N' in the read-ID line.)
   --keep Y  = Keep reads that were filtered.
               (Reads that have 'Y' in the read-ID line.)
   -N      = same as '--keep N'
   -Y      = save as '--keep Y'
   -v      = Report read counts to STDERR
             Use twice to show progress while procesing the file.
             When combined with '-o FILE', report goes to STDOUT.
   -h      = This helpful help screen

  Reads that were filtered (have 'Y' in the read-ID)
  are the LOW QUALITY reads. You most likely DO NOT want them.
  (In previous CASAVA version, those are reads that have not passed filtering)

  Reads that were NOT filtered are the better-quality reads.
  So using '-N' or '--keep N' is probably the option you want to use.

FASTQ files from CASAVA-1.8 Should have the following READ-ID format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Example 1:
  $ gunzip -c NA10831_ATCACG_L002_R1_001.fastq.gz | head -n 4
  @EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG

  $ gunzip -c NA10831_ATCACG_L002_R1_001.fastq.gz | fastq_illumina_filter -vvN | gzip > good_reads.fq.gz
  Processed 4,000,000 reads
  fastq_illumina_filter (--keep N) statistics:
  Input: 4,000,000 reads
  Output: 3,845,435 reads (96%)

Example 2 (The input and output files are not compressed):
  $ fastq_illumina_filter --keep N -v -v -o good_reads.fq NA10831_ATCACG_L002_R1_001.fastq
  Processed 4,000,000 reads
  fastq_illumina_filter (--keep N) statistics:
  Input: 4,000,000 reads
  Output: 3,845,435 reads (96%)
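
The READ-ID format described in the help text can be taken apart with a few lines of Python. This is only an illustrative sketch, not the program's own parser: the names ReadId and parse_read_id are made up for this example, and all fields are kept as plain strings.

```python
from typing import NamedTuple


class ReadId(NamedTuple):
    """The eleven fields of a CASAVA-1.8 READ-ID line (hypothetical helper type)."""
    instrument: str
    run_number: str
    flowcell_id: str
    lane: str
    tile: str
    x_pos: str
    y_pos: str
    read: str
    is_filtered: str      # 'Y' = filtered (low quality), 'N' = not filtered
    control_number: str
    index_sequence: str


def parse_read_id(line: str) -> ReadId:
    """Split a READ-ID line into its fields, raising ValueError if it is malformed."""
    line = line.rstrip("\n")
    if not line.startswith("@"):
        raise ValueError("READ-ID line must start with '@'")
    # The two field groups are separated by a single space.
    left, sep, right = line[1:].partition(" ")
    if not sep:
        raise ValueError("missing space between the two field groups")
    location = left.split(":")   # instrument .. y-pos
    flags = right.split(":")     # read .. index sequence
    if len(location) != 7 or len(flags) != 4:
        raise ValueError("unexpected number of ':'-separated fields")
    if flags[1] not in ("Y", "N"):
        raise ValueError("filter flag must be 'Y' or 'N'")
    return ReadId(*location, *flags)


rid = parse_read_id("@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG")
print(rid.lane, rid.is_filtered)
```

For the READ-ID from Example 1, the is_filtered field is 'N', so '--keep N' would keep that read.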

More details

The Illumina CASAVA-1.8 User guide (Part # 15011196 Rev B) has this to say about the new FASTQ file format:

NOTE: The grep command in the manual is WRONG. It will produce invalid FASTQ files.
The incorrect parameter is the "-A 4". It outputs an extra line after each match (4 trailing lines instead of 3), and it also inserts group-separator lines consisting of two dashes.
From GNU's GREP manual page:
       -A NUM, --after-context=NUM
              Print NUM lines of trailing context after matching lines. Places a line
              containing a group separator (--) between contiguous groups of matches.
              With the -o or --only-matching option, this has no effect and a warning is
              given.

Here's a sample file if you want to experiment with the new FASTQ file format.

Simpler Alternatives

You might think that compiling a C++ program just for this simple filtering is overkill.
No problem, here are some command-line alternatives:

GREP (correct way to use it)

$ cat input.fq |
  grep -A 3 '^@.* [^:]*:N:[^:]*:' |
  grep -v "^--$" > output.fq
(try it without the second 'grep', and you'll see the group-separator lines)


AWK

$ cat input.fq |
  awk -v FS=: '{ getline a; getline b; getline c;
                 if ($8=="N") {
                   print $0; print a; print b; print c;
                 }
               }' > output.fq
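
The same record-by-record logic as the awk one-liner can also be sketched in Python. This is a rough equivalent under the same assumptions (well-formed four-line records, filter flag in the 8th ':'-separated field of the READ-ID); the function name filter_reads and the sample reads are invented for illustration.

```python
from itertools import islice


def filter_reads(lines, keep="N"):
    """Yield the lines of every 4-line record whose filter flag equals `keep`."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))   # READ-ID, sequence, '+', qualities
        if not record:
            break
        # Same test as the awk command: 8th field when splitting on ':'.
        fields = record[0].split(":")
        if len(fields) > 7 and fields[7] == keep:
            yield from record


# Two made-up reads: the first is not filtered ('N'), the second is ('Y').
sample = [
    "@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG\n", "ACGT\n", "+\n", "IIII\n",
    "@EAS139:136:FC706VJ:2:5:1000:12851 1:Y:18:ATCACG\n", "TTTT\n", "+\n", "####\n",
]
kept = list(filter_reads(sample))
print("".join(kept), end="")
```

Like the grep and awk versions, this sketch does no validation: a malformed file is processed silently.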

So why use a compiled program?

  1. This program performs strict input validation on all lines, detecting invalid FASTQ files and reporting informative errors.
     The grep/awk commands above will process invalid files without any warning.
  2. This program prints a friendly report at the end, indicating how many reads passed filtering.
Granted, these are just niceties; none of the above is critical for filtering.
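
To give a feel for what per-line validation can catch, here is a small Python sketch. The checks are assumptions chosen for illustration (record length, '@' and '+' markers, matching sequence and quality lengths); they are not taken from the program's source, and the name validate_fastq is hypothetical.

```python
from itertools import islice


def validate_fastq(lines):
    """Return the number of records; raise ValueError naming the first bad line."""
    it = iter(lines)
    count = 0
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return count
        first = count * 4 + 1          # 1-based line number of this READ-ID
        if len(record) < 4:
            raise ValueError(f"line {first}: truncated record "
                             f"({len(record)} of 4 lines)")
        header, seq, plus, qual = record
        if not header.startswith("@"):
            raise ValueError(f"line {first}: READ-ID must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {first + 2}: expected a '+' line")
        if len(seq) != len(qual):
            raise ValueError(f"line {first + 3}: quality length differs "
                             "from sequence length")
        count += 1


good = ["@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG\n",
        "ACGT\n", "+\n", "IIII\n"]
print(validate_fastq(good))

# A stray group-separator line, as left behind by 'grep -A', is rejected.
try:
    validate_fastq(good + ["--\n"] + good)
except ValueError as err:
    print("invalid:", err)
```

A file containing grep's "--" separator lines fails at the first record that follows a separator, because the line where a READ-ID is expected does not start with '@'.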

Galaxy Wrapper

A Galaxy XML tool wrapper is included in the tar file (<tarball>/galaxy/tools/fastx_toolkit/fastq_illumina_filter.xml). Copy this file to your local Galaxy installation, and add the following to your tool_conf.xml file:
    <tool file="fastx_toolkit/fastq_illumina_filter.xml" />
[Figure: Galaxy interface]


License

AGPLv3 or later, of course.


Contact

gordon (at) cshl (dot) edu