It should be clear to any bioinformatician today that most file formats used
by the vast majority of bioinformatics programs are either “plain text” or
XML, the former being the most prominent. To mention just a few examples, we
have the FASTA format, the GenBank format, the EMBL format, the BLAST
plain-text format, etc. The adoption of plain text as a standard format in
scientific computer programs (in particular in bioinformatics) can be
explained by the fact that a plain text file can be read immediately by
humans, without the need to program data visualizers or to use additional
sophisticated programs. However, what is easy for humans to process is not so
easy for a computer program to process (we normally call such a program a
“parser”, from the Latin for “part”: pars). For instance, multiple “white
spaces”, “tabs” or “line breaks” can be ignored by a human reader, but a
parser needs to know exactly how to handle all these situations.
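As a small illustration of the whitespace problem, here is a minimal Python
sketch. The sample line and its three fields are made up for the example and
do not come from any real program output; the point is only that a naive split
and a whitespace-aware split see the same line very differently.

```python
# Hypothetical record: three fields separated by an irregular mix of
# spaces and tabs, ending with a line break.
line = "read_001   gi|12345\t \t40.96\n"

# Naive approach: split on a single space. Empty strings and stray
# tabs/newlines leak into the result.
naive_fields = line.split(" ")

# More robust approach: split() with no argument treats any run of
# whitespace (spaces, tabs, line breaks) as one separator and drops
# leading/trailing whitespace.
fields = line.split()

print(naive_fields)
print(fields)
```

A human reader sees three fields in that line at a glance; the parser only
does so once the whitespace rule has been spelled out explicitly.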
For a couple of years now, the XML standard has been used in some well-known
bioinformatics tools (BLAST, for example) in order to make the processing of
their outputs easier (and hence less error-prone). The design of XML is mainly
oriented to documents; however, its widespread usage has caught the attention
of some developers, and for this reason some programs have started to output
data in XML as well as in plain text. Although this is an advance toward
optimizing the processing of program outputs, I do not consider XML the best
solution available today for this purpose, since these outputs are commonly
not documents but highly repetitive chains of data. This makes the outputs so
big that they can easily reach a size several times larger than the normal
plain text outputs, depending on the size of the “tags” used in the XML
description.
This is an aspect to take into account particularly in bioinformatics, since
we normally process DNA sequences, and as DNA sequencing technologies get
cheaper we may now have millions of reads (short sequences of DNA). If we
process them and obtain a result for each one, we can expect huge output
files.
Let’s see an example. Suppose we have some reads and we need to BLAST them,
and that we are only interested in the best hit for each read. The XML output
for each read will include the standard BLAST information, among which is the
bit-score, a measure of how closely the read is related to the hit in the
selected BLAST database, in terms of homology. The entry for the bit-score in
the XML file looks like this:
<Hsp_bit-score>40.9604178096031</Hsp_bit-score>
As we can see, for each read this single field carries 15+16=31 additional
characters. If our file is stored in ASCII, these 31 extra characters use 31
bytes. In a typical XML BLASTN output, the extra characters add up to 1706
(1706 bytes). For the particular DNA sequence used in our example, the total
XML BLASTN output was 2737 bytes, of which 1706 bytes were XML tags. This is
an overhead of 62% of the file, i.e. the XML tags took about 1.65 times the
space used by the usable information (1031 bytes).
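The arithmetic can be checked with a few lines of Python. This is only a
sketch that recomputes the figures quoted in the text (2737 and 1706 bytes);
the variable names are chosen for the example.

```python
# Tag pair for the bit-score field, as shown in the XML entry.
open_tag = "<Hsp_bit-score>"
close_tag = "</Hsp_bit-score>"

# In ASCII every character takes one byte.
tag_bytes_per_field = len(open_tag) + len(close_tag)

# Figures quoted in the text for the example read.
total_bytes = 2737   # whole XML BLASTN output
tag_bytes = 1706     # bytes spent on XML tags
payload_bytes = total_bytes - tag_bytes

overhead_percent = round(100 * tag_bytes / total_bytes)
tags_vs_payload = round(tag_bytes / payload_bytes, 2)

print(tag_bytes_per_field)  # 31
print(payload_bytes)        # 1031
print(overhead_percent)     # 62
print(tags_vs_payload)      # 1.65
```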
From the above description one
can quickly figure out that an optimum format to be used in bioinformatics
would be one that meets the following requirements:
1. Readability without proprietary or complicated programs
2. Fast and easy to process by computer programs (oriented to mass-analysis)
3. Low overhead in the storing of the data
Massive amounts of data are typically stored in relational databases, and the
data is queried through SQL (Structured Query Language). SQL is very easy to
learn and use since its statements are quite close to the natural language
that a person would use for the same purpose. Until some years ago a
relational database was, exclusively, a complex piece of software, composed of
lots of files, configurations, processes and other computer resources, which
made it an unthinkable choice as a “format” for storing simple program
outputs. However, ten years ago Dr. D. Richard Hipp released SQLite, a
complete relational database stored in a single file, with no server process
to connect to and no complicated configuration. SQLite is also open-source,
totally free and, as its name suggests, uses SQL as its query language.
In this way, SQLite meets requirement number 1: the data can be read using
SQL, either from other programming languages or from the many free SQLite
viewers that one can find on the Internet. Like any relational database,
SQLite stores the data in tables, which can be indexed so that processing is
very fast, meeting requirement number 2 above. Regarding the overhead of the
stored data, the results are also good. Without going into the complexities of
the internal SQLite format, we carried out the experiment of creating a table
in SQLite with all the fields present in a typical XML BLASTN output. We
created a table with columns named after the tags of this XML, and the single
SQLite file containing this empty table used 2048 bytes. Then we inserted the
same entry 1000 times (simulating 1000 different BLAST results) and the SQLite
file grew to 1153024 bytes. Subtracting the initial 2048 bytes and dividing by
1000 gives 1151 bytes for each BLAST result stored in the SQLite format.
Remember that in XML we used 2737 bytes, so the overhead of storing results in
SQLite is remarkably low.
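A similar experiment can be sketched with Python's built-in sqlite3 module.
Everything specific here is an assumption made for illustration: the table
uses only a handful of BLAST XML fields rather than all of them, hyphens in
the tag names are replaced by underscores so the columns need no quoting, and
the row values are made up. The absolute file sizes also depend on the SQLite
version and its page size, so they are not expected to match the 2048 and
1153024 bytes reported above.

```python
import os
import sqlite3
import tempfile

# Hypothetical subset of BLAST XML fields (underscores instead of hyphens).
COLUMNS = {
    "Hit_id": "TEXT",
    "Hit_def": "TEXT",
    "Hsp_bit_score": "REAL",
    "Hsp_evalue": "REAL",
    "Hsp_qseq": "TEXT",
    "Hsp_hseq": "TEXT",
}

# Made-up entry, inserted repeatedly to simulate many BLAST results.
ROW = ("gi|0000|example", "made-up hit description",
       40.9604178096031, 1e-05, "ACGT" * 10, "ACGT" * 10)


def file_size_after_inserts(n_rows):
    """Create a fresh SQLite file, insert n_rows copies of ROW and
    return the resulting file size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blast.sqlite")
        conn = sqlite3.connect(path)
        column_defs = ", ".join(f"{name} {kind}" for name, kind in COLUMNS.items())
        conn.execute(f"CREATE TABLE BLASTOUTPUT ({column_defs})")
        placeholders = ", ".join("?" for _ in COLUMNS)
        conn.executemany(
            f"INSERT INTO BLASTOUTPUT VALUES ({placeholders})",
            [ROW] * n_rows,
        )
        conn.commit()
        conn.close()
        return os.path.getsize(path)


empty_size = file_size_after_inserts(0)
full_size = file_size_after_inserts(1000)

# Same calculation as in the text: remove the size of the empty
# database, then divide by the number of stored results.
bytes_per_result = (full_size - empty_size) / 1000

print(empty_size, full_size, bytes_per_result)
```

Running it with the real list of fields and a real BLAST result would give a
figure directly comparable with the size of the XML entry.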
SQLite is also available for the main operating systems in use today, and an
SQLite file created on Linux, for example, will be readable on Windows and
macOS. This makes it possible to use the SQLite format as output in Web
services like NCBI’s, EMBL’s, etc. Another important advantage of using SQLite
as an output format is that we can extract only the desired results, without
needing to read the whole output. For example in BLASTN, if we need only
results with a bit-score of 50 or higher, then we can use:
SELECT * FROM BLASTOUTPUT WHERE Hsp_bit_score >= 50;
If the column Hsp_bit_score is indexed (a common practice for SQL tables), we
get those rows very fast, skipping all the rest of the results. In contrast,
with plain-text or XML outputs we need to read the whole output file and
filter out the results we are interested in, using far more computer resources
(CPU time and memory) as well as programming time.
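The query can be tried from Python with the built-in sqlite3 module. This is a
minimal sketch under stated assumptions: the table has only two columns, the
read_id column, the index name idx_bit_score and the four rows are
hypothetical, and an in-memory database stands in for a real output file.

```python
import sqlite3

# In-memory database standing in for a BLAST output file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BLASTOUTPUT (read_id TEXT, Hsp_bit_score REAL)")

# Made-up results for four reads.
conn.executemany(
    "INSERT INTO BLASTOUTPUT VALUES (?, ?)",
    [("read_1", 40.96), ("read_2", 50.0), ("read_3", 73.1), ("read_4", 12.5)],
)

# Index the bit-score column so that range queries on it do not
# need to scan every row.
conn.execute("CREATE INDEX idx_bit_score ON BLASTOUTPUT (Hsp_bit_score)")

# Same filter as the SQL statement in the text, with the threshold
# passed as a parameter and the rows sorted by bit-score.
hits = conn.execute(
    "SELECT read_id, Hsp_bit_score FROM BLASTOUTPUT "
    "WHERE Hsp_bit_score >= ? ORDER BY Hsp_bit_score",
    (50,),
).fetchall()

print(hits)  # [('read_2', 50.0), ('read_3', 73.1)]
conn.close()
```

Prefixing the SELECT with EXPLAIN QUERY PLAN is a simple way to see whether
SQLite decides to use the index for a given query.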
Each month new bioinformatics programs are published in the specialized
journals, each one bringing a new text format for its output, and new parsers
have to be written in order to use those outputs. I believe all of this could
be avoided if programs stored their data in a format better suited to these
cases, especially in bioinformatics.