Ephemeral assemblage: File formats of bioinformatic programs

It should be clear to any bioinformatician today that most file formats used by bioinformatic programs are either “plain text” or XML, the former being the more prominent. To mention just a few examples, we have the FASTA format, the GenBank format, the EMBL format, the BLAST plain text format, etc. The adoption of plain text as a standard format in scientific computer programs (in bioinformatics in particular) can be explained by the fact that a plain text file can be read immediately by humans, without programming data visualizers or using additional sophisticated programs. However, what is easy for humans to process is not so easy for a computer program to process (we normally call such a program a “parser”, from the Latin pars, “part”). For instance, multiple “white spaces”, “tabs” or “line breaks” can be ignored by a human reader, but a parser needs to know exactly how to handle all these situations.
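To make the whitespace problem concrete, here is a minimal sketch of a FASTA reader in Python. The function name parse_fasta and the sample data are only illustrative, and the rules it applies (skip blank lines, trim each line, drop spaces and tabs inside sequence lines) are assumptions about what one might want, not part of any official FASTA specification.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, tolerating stray whitespace."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()      # drop leading/trailing spaces and tabs
        if not line:                 # skip blank lines entirely
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # remove any spaces or tabs left inside the sequence line
            records[header] += "".join(line.split())
    return records


# Hypothetical input with trailing spaces, a tab and a blank line
sample = ">read1  \n ACGT\tACGT\n\n  TTGA \n>read2\nGGCC\n"
records = parse_fasta(sample)
print(records)
```

Every decision a human reader makes without thinking (is a blank line the end of a record? is a tab part of the sequence?) has to be written down explicitly, and each new text format needs its own set of such decisions.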

For a couple of years now, the XML standard has been used in some well-known bioinformatic tools (BLAST, for example) in order to make processing the outputs of these programs easier, and hence less error-prone. XML is designed mainly for documents; however, its widespread usage has caught the attention of some developers, and for this reason some programs have started to output data in XML as well as in plain text. Although this is an advance toward optimizing the processing of program outputs, I do not consider XML the best solution available today for this purpose, since these outputs are commonly not documents but highly repetitive chains of data. The repeated tags make the outputs so big that they can easily reach a size several times larger than the normal plain text outputs, depending on the length of the “tags” used in the XML description.

This matters particularly in bioinformatics, since we normally process DNA sequences, and as DNA sequencing technologies get cheaper we may now have millions of reads (a read is a short sequence of DNA). If we process them and produce a result for each one, we can expect huge output files.

Let’s see an example. Suppose we have some reads and we need to BLAST them, and that we are only interested in the best hit for each read. The XML output for each read will include the standard BLAST information, including the bit-score, a measure of how closely the read is related to the hit in the selected BLAST database, in terms of homology. The entry for the bit-score in the XML file looks like this:

<Hsp_bit-score>40.9604178096031</Hsp_bit-score>

As we can see, this single field adds 15+16=31 characters of tags for each read. If our file is stored in ASCII, these 31 extra characters use 31 bytes. In total, in a typical XML BLASTN output the tags add up to 1706 extra characters (1706 bytes). For the particular DNA sequence used in our example, the total XML BLASTN output was 2737 bytes, of which 1706 bytes were XML tags. That is an overhead of 62%: the tags take up about 1.65 times the space used by the remaining 1031 bytes of usable information.
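The arithmetic can be checked with a few lines of Python. This is just a sketch that recomputes the figures quoted above; the variable names are my own, and the byte counts 2737 and 1706 are taken from the text rather than measured by the script.

```python
# The bit-score entry shown above, split into its tags and its value
entry = "<Hsp_bit-score>40.9604178096031</Hsp_bit-score>"
open_tag = "<Hsp_bit-score>"
close_tag = "</Hsp_bit-score>"

tag_chars = len(open_tag) + len(close_tag)   # characters spent on tags
value_chars = len(entry) - tag_chars         # characters spent on the value

# Byte counts quoted in the text for one XML BLASTN result
total_bytes = 2737
tag_bytes = 1706
data_bytes = total_bytes - tag_bytes

tag_fraction = tag_bytes / total_bytes       # share of the file that is tags
tag_to_data = tag_bytes / data_bytes         # tag bytes per byte of data

print(f"tags in this entry: {tag_chars} chars, value: {value_chars} chars")
print(f"tag share of file: {tag_fraction:.1%}")
print(f"tag bytes per data byte: {tag_to_data:.2f}")
```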

From the above description one can quickly figure out that an optimum format to be used in bioinformatics would be one that meets the following requirements:

1. Readability without proprietary or complicated programs

2. Fast and easy to process by computer programs (oriented to mass-analysis)

3. Low overhead in the storing of the data

Massive amounts of data are typically stored in relational databases, and the data are queried through SQL (Structured Query Language). SQL is easy to learn and use, since its statements are quite close to the natural language a person would use for the same purpose. Until some years ago a relational database was exclusively a complex piece of software, made up of many files, configurations, processes and other computer resources, which made it an unthinkable choice as a “format” for storing simple program outputs. However, ten years ago Dr. D. Richard Hipp released SQLite, a complete relational database stored in one single file, with no separate server process to connect to and no complicated configuration. SQLite is also open-source, totally free and, as its name suggests, implements SQL as its query language.

In this way, SQLite meets requirement number 1: the data can be read using SQL, either from other programming languages or from the many free SQLite viewers that one can find on the Internet. Like any relational database, SQLite stores the data in tables, which can be indexed so that processing is very fast, meeting requirement number 2 above. Regarding the overhead of the stored data, we also have good results. Without going into the complexities of the internal SQLite format, we carried out the experiment of creating a table in SQLite with all the fields present in a typical XML BLASTN output. We created a table with columns named after the tags of this XML, and the single SQLite file containing this table used 2048 bytes. Then we inserted the same entry 1000 times (simulating 1000 different BLAST results) and the SQLite file used 1153024 bytes. Subtracting the initial 2048 bytes and dividing by 1000, we get about 1151 bytes for each BLAST result stored in the SQLite format. Remember that in XML we used 2737 bytes, so the overhead of storing results in SQLite is remarkably low.
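An experiment of this kind can be sketched with Python's built-in sqlite3 module. This is not the script used for the numbers above: the table has only four columns (an illustrative subset of the BLAST XML fields), the row values are made up, and the function name measure_bytes_per_row is hypothetical. The sizes it reports will differ from the ones quoted here, among other reasons because the page size depends on the SQLite version.

```python
import os
import sqlite3
import tempfile


def measure_bytes_per_row(n_rows=1000):
    """Return (empty_size, filled_size, bytes_per_row) for a small test table."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blast.sqlite")
        conn = sqlite3.connect(path)
        # Illustrative subset of columns, not the full BLASTN field list
        conn.execute(
            "CREATE TABLE BLASTOUTPUT ("
            "Hit_id TEXT, Hit_def TEXT, Hsp_bit_score REAL, Hsp_evalue REAL)"
        )
        conn.commit()
        empty_size = os.path.getsize(path)

        # Made-up row, inserted n_rows times to simulate n_rows results
        row = ("gi|0000|example", "hypothetical hit description",
               40.9604178096031, 1e-5)
        conn.executemany(
            "INSERT INTO BLASTOUTPUT VALUES (?, ?, ?, ?)", [row] * n_rows
        )
        conn.commit()
        conn.close()
        filled_size = os.path.getsize(path)
    return empty_size, filled_size, (filled_size - empty_size) / n_rows


empty_size, filled_size, per_row = measure_bytes_per_row()
print(f"empty: {empty_size} bytes, filled: {filled_size} bytes")
print(f"about {per_row:.0f} bytes per stored result")
```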

SQLite is also available for the main operating systems today, and an SQLite file created on Linux, for example, will be readable on Windows and MacOS. This makes it possible to use the SQLite format as output in Web services such as NCBI’s, EMBL’s, etc. Another important advantage of using SQLite as an output format is that we can extract only the desired results, without reading the whole output. For example, in BLASTN, if we need only results with a bit-score of 50 or higher, then we can use:

SELECT * FROM BLASTOUTPUT WHERE Hsp_bit_score >= 50;

If the column Hsp_bit_score is indexed (indexes are a standard feature of SQL tables) we get those rows very fast, skipping all the rest of the results. In contrast, with plain-text or XML outputs we need to read the whole output file and filter out the results we are interested in, using much more computer resources (CPU time and memory) and also programming time.
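Here is a minimal sketch of such a filtered query from Python, using the standard sqlite3 module and an in-memory database. The read_id column, the index name idx_bit_score and the generated scores are assumptions made for the example; only the table name BLASTOUTPUT and the column Hsp_bit_score come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BLASTOUTPUT (read_id TEXT, Hsp_bit_score REAL)")

# Fake results: 100 reads with bit-scores 0.0, 1.0, ..., 99.0
conn.executemany(
    "INSERT INTO BLASTOUTPUT VALUES (?, ?)",
    [(f"read{i}", float(i)) for i in range(100)],
)

# The index has to be created explicitly; SQLite does not add it by itself
conn.execute("CREATE INDEX idx_bit_score ON BLASTOUTPUT (Hsp_bit_score)")

# Fetch only the rows we care about
rows = conn.execute(
    "SELECT read_id, Hsp_bit_score FROM BLASTOUTPUT WHERE Hsp_bit_score >= 50"
).fetchall()
print(len(rows), "results with bit-score >= 50")

# Ask SQLite how it plans to run the query; the last field of each
# plan row is a human-readable description of the step
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM BLASTOUTPUT WHERE Hsp_bit_score >= 50"
).fetchall()
for step in plan:
    print(step[-1])
```

The EXPLAIN QUERY PLAN output is a convenient way to check whether the index is actually being used for a given query, since the planner is free to choose a full table scan when it estimates that to be cheaper.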

Each month new bioinformatic programs are published in the specialized journals, and with each of them comes a new text format for the output, for which new parsers have to be written. I believe all this could be avoided if programs stored their data in a format that is more efficient for these cases, especially in bioinformatics.
