Making FASTQ Fast

FASTQ has become the common file format for storing sequences from Next Generation Sequencing (NGS) experiments. It is a text format that stores each sequence together with its quality scores, one record after another. Reading these files is typically one of the first steps of a processing pipeline that examines the experimental results.
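To make the format concrete, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-line record layout (a header line starting with "@", the sequence, a separator line starting with "+", and the quality string) with no line wrapping. The names FastqRecord and read_fastq are made up for this illustration and are not part of the tool described here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header text without the leading '@'
    sequence: str   # base calls
    quality: str    # one ASCII-encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


# A one-record example held in memory instead of a file on disk.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
```

Because the reader is a generator, it holds only one record in memory at a time, which matters when a file contains millions of sequences.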

Initially, quality scores are evaluated statistically and measures such as GC content profiles are calculated. The results of these evaluations are then used to select or filter bases or even entire sequences.
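As a rough illustration of such measures, the sketch below computes a mean quality score and the GC fraction of a single read, then filters reads on mean quality. It assumes Phred+33 quality encoding (used by Sanger and current Illumina output); older files may need a different offset. The function names and the threshold of 20 are assumptions chosen for the example.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read; 0.0 for an empty read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C; 0.0 for an empty read."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical reads as (name, sequence, quality) tuples.
reads = [
    ("read1", "GGCC", "IIII"),  # 'I' decodes to Q40 under Phred+33
    ("read2", "ATAT", "####"),  # '#' decodes to Q2 under Phred+33
]

# Keep only the reads whose mean quality reaches the example threshold.
kept = [name for name, _, qual in reads if mean_quality(qual) >= 20]
```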

Calculating these measures takes time because each FASTQ file typically contains millions of sequences. One way to address this is an “analyze once, use many” approach: calculate the statistics and measures once and store them with the FASTQ data. Later uses of the data are then fast because the calculations have already been done.

A useful mechanism for storing the original FASTQ sequences and quality scores together with the statistical results is a database file. The tool that I have created reads FASTQ files and imports the sequences and quality scores into a corresponding SQLite database file that is easy to store, copy, and distribute.
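The import step can be sketched with Python's built-in sqlite3 module. This is an illustration only: the table name reads and its columns are assumptions, not the tool's actual schema, and an in-memory database stands in for the SQLite file.

```python
import sqlite3
from typing import Iterable, Tuple


def import_reads(conn: sqlite3.Connection,
                 records: Iterable[Tuple[str, str, str]]) -> int:
    """Insert (name, sequence, quality) tuples and return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reads (
               id       INTEGER PRIMARY KEY,
               name     TEXT NOT NULL,
               sequence TEXT NOT NULL,
               quality  TEXT NOT NULL
           )"""
    )
    # One transaction for the whole batch; committing per row would be
    # far slower for files with millions of reads.
    with conn:
        conn.executemany(
            "INSERT INTO reads (name, sequence, quality) VALUES (?, ?, ?)",
            records,
        )
    return conn.execute("SELECT COUNT(*) FROM reads").fetchone()[0]


# Use a file path instead of ":memory:" to get a database file on disk.
conn = sqlite3.connect(":memory:")
total = import_reads(conn, [
    ("read1", "GATTACA", "IIIIHHG"),
    ("read2", "CCGG", "IIII"),
])
```

Since executemany accepts any iterable, records can be streamed from a FASTQ reader without first loading the whole file into memory.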

Once the import is complete, a module analyzes the sequences statistically. The results of the analysis are stored in separate tables in the same SQLite file.
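A minimal sketch of this "analyze once, use many" pattern follows. It computes per-read statistics a single time, stores them in a separate table, and then answers a filtering question with a plain SQL query. The tables reads and read_stats, their columns, and the Phred+33 offset are all assumptions made for the example; the real tool may store different measures in a different layout.

```python
import sqlite3


def analyze(conn: sqlite3.Connection, offset: int = 33) -> None:
    """Compute per-read statistics once and store them in read_stats."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS read_stats (
               read_id      INTEGER PRIMARY KEY REFERENCES reads(id),
               length       INTEGER NOT NULL,
               gc_fraction  REAL NOT NULL,
               mean_quality REAL NOT NULL
           )"""
    )
    rows = []
    for read_id, seq, qual in conn.execute(
            "SELECT id, sequence, quality FROM reads"):
        length = len(seq)
        upper = seq.upper()
        gc = (upper.count("G") + upper.count("C")) / length if length else 0.0
        mean_q = (sum(ord(c) - offset for c in qual) / len(qual)
                  if qual else 0.0)
        rows.append((read_id, length, gc, mean_q))
    # A real importer would write in batches rather than collect all rows.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO read_stats "
            "(read_id, length, gc_fraction, mean_quality) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


# Small in-memory database standing in for an imported FASTQ file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, name TEXT, "
             "sequence TEXT, quality TEXT)")
conn.executemany(
    "INSERT INTO reads (name, sequence, quality) VALUES (?, ?, ?)",
    [("read1", "GGCC", "IIII"), ("read2", "ATAT", "!!!!")],
)
analyze(conn)

# Later uses only query the stored results; nothing is recalculated.
good = [name for (name,) in conn.execute(
    "SELECT r.name FROM reads AS r "
    "JOIN read_stats AS s ON s.read_id = r.id "
    "WHERE s.mean_quality >= 20 ORDER BY r.id")]
```

Keeping the statistics in their own table leaves the imported reads untouched, so the analysis can be rerun or extended without repeating the import.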

I have created a utility to read these SQLite database files. The utility currently runs under Microsoft Windows and provides ways to load and graphically display the analysis data.

The tool is currently a prototype, and I am looking for real-world use cases to guide how it can be expanded and enhanced. I am also developing utilities for batch processing of FASTQ files that will produce SQLite databases for several files at once.

If you are interested in trying this tool, please contact me using the following form: