Exploring the structure of genomes and analyzing their evolution is essential to understanding how organisms adapt to biotic and abiotic environments. Indeed, the huge numbers of genetic variants derived from next-generation sequencing technologies may be used in approaches such as linkage disequilibrium calculation or haplotype reconstruction to serve a large spectrum of research purposes, including:
- Genome-wide association studies (GWAS) that connect genotypic variation to agriculturally or medically relevant phenotype variation;
- Detection of selection footprints;
- Identification of causal genes and mutations in general;
- Population structure and diversity analyses to infer population history and dynamics;
- Phylogenetic analyses.
To do so, this variation data needs to be filtered according to specific criteria, and polymorphisms need to be compared across user-defined sets of individuals in order to obtain genotyping data matrices suitable for the targeted analyses. Given the large amounts of data produced by modern sequencing technologies, computational challenges must be addressed to store such data efficiently, a prerequisite on which this filtering step depends.
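To make the filtering step concrete, here is a minimal sketch of the kind of operation involved: keeping only variants whose missing-data rate and minor allele frequency (MAF), computed over a user-defined set of individuals, pass given thresholds. The function name, thresholds, and genotype encoding (lists of 0/1 allele codes, `None` for missing calls) are illustrative assumptions, not Gigwa's actual data model.

```python
# Illustrative variant filter over a user-defined sample subset.
# Genotype encoding (hypothetical): sample -> list of allele codes
# (0 = reference, 1 = alternate) or None when the call is missing.

def passes_filters(genotypes, samples, max_missing=0.2, min_maf=0.05):
    """Return True if the variant passes missing-rate and MAF thresholds."""
    calls = [genotypes.get(s) for s in samples]
    missing = sum(1 for c in calls if c is None)
    if missing / len(samples) > max_missing:
        return False
    # Pool all alleles from non-missing calls (works for any ploidy).
    alleles = [a for c in calls if c is not None for a in c]
    if not alleles:
        return False
    alt_freq = sum(alleles) / len(alleles)
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf

variants = {
    "chr1:1234": {"s1": [0, 1], "s2": [0, 0], "s3": [1, 1]},
    "chr1:5678": {"s1": None, "s2": None, "s3": [0, 0]},
}
kept = [v for v, g in variants.items()
        if passes_filters(g, ["s1", "s2", "s3"])]
# "chr1:5678" is rejected: 2 of 3 calls are missing.
```

In practice such filters are combined (variant type, position range, annotation, genotype patterns), which is precisely what makes a graphical filtering interface valuable.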
The need for a tool such as Gigwa emerged while producing RNA-seq data within the scope of the ARCAD project. Variant calling undertaken by our team of bioinformaticians yielded fairly large VCF files that we wanted geneticists to be able to exploit autonomously. Most of these researchers are comfortable using Excel to manipulate such data, but Excel cannot cope with files of this size. Conversely, command-line tools such as VCFtools existed to address the size issue, but were not well suited to biologists.
We therefore identified the lack of a user-friendly tool offering a graphical interface for filtering large variation datasets. We assumed that, although the provided panel of filters could never be exhaustive, combined with the ability to export data into numerous popular formats it would prove helpful, at least as an intermediary solution for extracting subsets of data on which more specific analyses could then be conducted with ease.
We chose MongoDB as a storage layer because of its ability to handle large datasets and its support for complex queries via the Aggregation Framework. Having had some experience in developing genotyping data management applications, we designed a data model with the aim of being compatible with the VCF format, while remaining flexible enough to be able to syndicate data from different files into a unique database.
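The kind of query such a design enables can be sketched as a MongoDB aggregation pipeline. The field names (`chr`, `pos`, `type`) and the windowing logic below are hypothetical, not Gigwa's actual schema; executing the pipeline would require pymongo and a running MongoDB instance (e.g. `db.variants.aggregate(pipeline)`), so here we only build it.

```python
# Hypothetical Aggregation Framework pipeline: select variants in a genomic
# range, optionally by type, then count them per 100 kb window.

def build_variant_pipeline(chrom, start, end, variant_type=None):
    """Build an aggregation pipeline over an assumed 'variants' collection."""
    match = {"chr": chrom, "pos": {"$gte": start, "$lte": end}}
    if variant_type:
        match["type"] = variant_type  # e.g. "SNP" or "INDEL"
    return [
        {"$match": match},
        {"$sort": {"pos": 1}},
        # Group by 100 kb window to obtain per-window variant counts,
        # the kind of summary a density viewer needs.
        {"$group": {"_id": {"$floor": {"$divide": ["$pos", 100_000]}},
                    "count": {"$sum": 1}}},
    ]

pipeline = build_variant_pipeline("1", 1, 2_000_000, variant_type="SNP")
```

Expressing filters as server-side pipelines like this is what lets the database, rather than the client, do the heavy lifting on large datasets.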
Our initial prototype targeted a tool managed by a system administrator, offering a way to centralize data and share it across a community of users. But the need for a ‘standalone-like’ version also emerged, so we made efforts to provide a lightweight application that could also be used on any 64-bit workstation. While fine-tuning our initial prototype into a mature piece of software, we progressively included features suggested by users or reviewers. Among them:
- Filtering on functional annotations
- Ability to abort running queries
- Display of query progress
- Support for seven different export formats
- Easy connection with IGV for integration within a consistent genomic context
- No loss of phasing information when provided (VCF format only)
- Support for haploid, diploid, and polyploid data
- Online variant density viewing
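Two of the features above, phasing preservation and ploidy support, hinge on how VCF genotype (GT) fields are parsed: the VCF specification uses “|” to separate phased alleles and “/” for unphased ones, with “.” marking a missing allele. The following is a simplified illustration of such parsing, not Gigwa's actual import code:

```python
# Minimal VCF GT-field parser that preserves the phased/unphased distinction
# and handles any ploidy (haploid "1", diploid "0|1", polyploid "0/1/1", ...).

def parse_gt(gt_field):
    """Return (alleles, phased) from a VCF GT string such as '0|1' or '0/1'."""
    phased = "|" in gt_field
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt_field.split(sep)]
    return alleles, phased
```

Keeping the `phased` flag alongside the allele list is what allows phasing information to survive a round trip through the database and back into a VCF export.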
As we prepare to publish the first official version of Gigwa, we are working on the development of a GA4GH-compliant REST API. This enhancement will allow Gigwa to be used, for instance, as an intermediate storage layer between NGS pipelines and visualization tools. Further work planned for the near future includes:
- Docker packaging for easy setup in different environments;
- Further improvements on storage performance and data retrieval speed;
- Evaluating the option of wrapping the tool for integration in Galaxy.