Omics! Omics!: The Trouble with FASTQ

"Very few programs will accept either; the Ray assembler is a notable exception." #usability
I spend a lot of time working with sequencing data, and the most common format for such data is FASTQ. There is much to appreciate about FASTQ, but FASTQ data can also be troublesome.

For those who don’t spend their days with such data, a bit of historical backdrop. When I was first exposed to bioinformatics in the early 90s, there were many formats for sequence data. Pretty much every sequence analysis suite and public database had its own distinct format. The vast majority of these were designed for human readability first and computer parsing second, and tended to organize text into columns with carefully allotted spaces. Creating files in such formats was a dicey thing, because formal specifications of the formats were uncommon; you generally just wrote something that looked like what you had seen and tried it out. That often worked, but failures were rampant. One of my first graduate bioinformatics projects involved a large-scale parse of features from GenBank, and I ended up spending a lot of time detecting and curating various inconsistencies.

