Journal of Biomedical Semantics at BioHackathon 2011

Guest blog post by Mark Wilkinson and Philippe Rocca-Serra

 

Following in the footsteps of the Open-Bio Foundation’s original “BioHackathons”, Japan’s Database Center for Life Science (DBCLS) initiated a series of BioHackathons beginning in 2008. These yearly events provide an opportunity for open-source bioinformatics code projects to come together to share ideas and experiences, and to coordinate their efforts so that their respective tools work together more easily. A few weeks ago, the 2011 BioHackathon was held in Kyoto, Japan, sponsored by the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency (JST). In a natural progression, the theme of each BioHackathon has evolved from Web Service interoperability (2008), to integration of Web Services into visualization tools and mashups (2009), to the utilization and adoption of Linked Data standards (2010), to this year’s theme of the Semantic Web in life sciences. With more than 65 participants, the activities of the 2011 BioHackathon were diverse and numerous, but we will attempt to highlight some of the achievements here to showcase the excellent work that is done at these events.

General Information and How-To’s

A large number of participants were involved in discussions around “how do I set up my legacy database as a Semantic Web/Linked Data resource?”, and expressed curatorial concerns (How easily are RDF databases maintained? How does their responsiveness compare to that of traditional relational databases?). From these discussions came a series of tutorials, best-practice suggestions, and examples that can act as a resource for others who wish to begin publishing data using these new technologies: https://github.com/dbcls/bh11/wiki/ConstructionOfLinkedDataDB.

Uniform Resource Identifier (URI) Standards

To truly achieve the Semantic Web vision, there should be agreement on the identifier used for every entity in the bioinformatics space. While there are a number of “shared names” initiatives, none has been widely accepted to date. To this end, members of the newly established Identifiers.org project proposed a standard for identifier structure and resolution, backed by a funded curatorial process. In collaboration with curators from the Life Sciences Resource Names (LSRN) project, an agreement was reached about what Resource Description Framework (RDF) metadata should be returned when resolving an Identifiers.org URI, and the decision was made to sunset the competing LSRN initiative. In addition, the PSICQUIC, Bio2RDF and SADI projects all agreed to support Identifiers.org URIs in their infrastructure. We remain hopeful that the backing of these highly visible Semantic Web projects will lead to a much wider adoption of the Identifiers.org URI schema in the community, which would significantly advance the aims of the Semantic Web in Life Sciences.

Semantic Web Service Standards

The SADI project gave a one-day workshop on how to publish SADI-compliant Semantic Web Services, and subsequently worked with other groups on modeling data and services that were important to them. Of note was agreement on the OWL ontology describing a BLAST result, the most complex data structure modeled by the SADI project to date, and one that will act as a template for a large number of other algorithmic services in the bioinformatics space (e.g. HMMER). Care was taken to ensure that the model contained both curatorial information (database information, etc.) and the biological information semantically relating query sequences and hit sequences. Importantly, the goal was to represent the semantics of the information contained in the BLAST report, not the structure of the report. As such, the resulting data structures should be usable, verbatim, by downstream tools that consume a diverse array of data-types, including sequences, alignments, or species information. Several SADI-based BLAST services were published during the BioHackathon based on these models, and the Open-Bio participants began building support for these models into their tools.

Tooling

Open-Bio participants undertook a survey of Semantic Web technology support in each of their languages, and then focused on providing support for serializing and de-serializing RDF into their respective object models. Since the Open-Bio objects use the same object model for a wide range of similar data-types (e.g. EMBL vs. GenBank sequence files), achieving this goal would greatly facilitate interoperability by ensuring that all structured bioinformatics data has the same semantic representation regardless of its origin.

Visualization

The Cytoscape 3 visualization environment was targeted for enhancement of its ability to represent RDF data and OWL ontological information. Support for SPARQL was added to Cytoscape, so that the output of SPARQL CONSTRUCT queries can be visualized directly in the Cytoscape environment. This was tested over the RDF version of BioMart, and enhancements were made to simplify SPARQL query building. An open question remaining for this team was what to do with blank nodes in RDF (which are extremely common in, for example, the RDF representation of a BLAST report described above!).
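To make the blank-node question concrete, here is a minimal sketch of how the triples returned by a CONSTRUCT query might be turned into the node and edge lists a network viewer needs. It is not the Cytoscape implementation: the function names, the `example.org` URIs and the toy BLAST-like triples are all hypothetical, and terms are assumed to arrive as N-Triples-style strings (blank nodes prefixed with `_:`, literals wrapped in double quotes).

```python
from collections import namedtuple

# One edge per RDF triple: subject -> object, labelled by the predicate.
Edge = namedtuple("Edge", ["source", "label", "target"])


def classify(term):
    """Classify an N-Triples-style term as 'blank', 'literal' or 'uri'."""
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        return "literal"
    return "uri"


def triples_to_network(triples):
    """Turn (subject, predicate, object) triples into node and edge lists."""
    nodes = {}  # term -> kind, so a viewer can style or hide blank nodes
    edges = []
    for subject, predicate, obj in triples:
        for term in (subject, obj):
            nodes.setdefault(term, classify(term))
        edges.append(Edge(subject, predicate, obj))
    return nodes, edges


# Hypothetical triples, shaped like a fragment of a BLAST result in RDF.
triples = [
    ("http://example.org/query1", "http://example.org/hasHit", "_:b0"),
    ("_:b0", "http://example.org/hitSequence", "http://example.org/seqA"),
    ("_:b0", "http://example.org/eValue", '"1e-30"'),
]

nodes, edges = triples_to_network(triples)
blank_nodes = [term for term, kind in nodes.items() if kind == "blank"]
print(len(nodes), "nodes,", len(edges), "edges, blank nodes:", blank_nodes)
```

Tagging each node with its kind leaves the display decision to the viewer: a blank node could be drawn as an anonymous hub, or collapsed so that its neighbours are linked directly.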

Vocabularies

This year, the Ontology group focused on a number of practical cases, ranging from general RDF-to-OWL conversion to more specific conversions such as GFF3 file format to OWL. Groundwork was also laid for tools that carry out functional enrichment analysis with any OWL-formatted resource. This gave the group the opportunity to evaluate the various OWL reasoners now available. Another ‘tour-de-force’ achieved by the group consisted of exploring alignment by means of semantic features as if they were sequences. The work spanned four days, with almost two days required for the computation of transitive closures. Further work following the BioHackathon meeting will be needed to mine the results.
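For readers unfamiliar with the term, the transitive closure of a class hierarchy lists, for every term, all of its ancestors rather than just its direct parents. The sketch below is a plain breadth-first traversal over a hypothetical toy hierarchy; it is not the group's code, and real ontologies are large enough that the naive approach becomes expensive, as the two days of computation above suggest.

```python
from collections import deque


def transitive_closure(parents):
    """Map each term to the set of all its ancestors.

    `parents` maps a term to the list of its direct parents.
    """
    closure = {}
    for term in parents:
        seen = set()
        queue = deque(parents[term])
        while queue:
            current = queue.popleft()
            if current in seen:
                continue  # already visited via another path
            seen.add(current)
            queue.extend(parents.get(current, ()))
        closure[term] = seen
    return closure


# Hypothetical toy hierarchy, for illustration only.
parents = {
    "kinase": ["enzyme"],
    "enzyme": ["protein"],
    "protein": ["molecule"],
    "molecule": [],
}

closure = transitive_closure(parents)
print(sorted(closure["kinase"]))
```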

BioDBcore session – Resource Description and Discovery

The BioDBcore team was dedicated to reviewing and finalizing the descriptors that provide key information about database resources: licensing terms, access points and their protocols, and the nature of the datatypes stored. BioDBcore information will be made available as RDF graphs, building on the biositemap information model. This should ensure smooth exchange with existing registries. The BioDBcore meeting allowed alignment with Medals, the Japanese resource cataloguing effort. Cross-pollination occurred, and reliance on Identifiers.org URIs for referencing key information (taxonomic information, bibliographic records, annotation standards as provided by the Biosharing catalog) was also discussed. This will be channeled back for final vetting to the BioDBcore group, a joint effort of the International Society for Biocuration and the Biosharing initiative. Additional topics included record internationalization and dealing with information in languages other than English.

As a testament to the benefit of events such as the BioHackathon, the work on BioDBcore triggered the creation of an RDF data sharing requirement group, which surveyed the kind of information to embed in a named graph so that catalogs can be generated dynamically from distributed sources, based simply on metadata declarations.

G-language group

Following up on earlier work, the G-language group met again and explored two tracks. The first was an identifier conversion service that aims to simplify bioinformatics data integration tasks. This G-language-based REST service accepts common identifiers as input but, more interestingly, can also handle sequence input (via the BLAT algorithm), and returns associated identifiers and persistent URLs. All this information is available in classic flavors (e.g. GenBank or tab-delimited) but, most relevant to this audience, the RDF flavor is the sweetest and was the result of efforts carried out during BioHackathon 2011.

The second stroll took the group down the path of visualization, with the goal of providing creative methods to reduce the ‘hair-ball’ effect, which often limits the legibility of Linked Data visual renderings. To this end, the group applied filtering and enrichment techniques to an E. coli gene set to query the linked data space, taking advantage of the restauro-g.v2 service described above. Cramér’s V for nominal data and Spearman’s rank correlation for continuous data were used to assess relatedness of information, while a simple Fisher’s exact test was used to evaluate enrichment of the top 25% of continuous data (in comparison to all genes) against nominal data (categories). Implementation relied on the JavaScript InfoVis Toolkit, and the group is now looking into ways to make this service work with a JSON feed from a SPARQL endpoint. We look forward to their progress.
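As a rough illustration of two of the statistics named above, the following sketch computes Cramér’s V for a contingency table and a one-sided Fisher’s exact test for a 2x2 table, using only the Python standard library. It is not the group's implementation: the function names and the example counts are hypothetical, no continuity correction is applied, and every row and column of the table is assumed to have a non-zero total.

```python
from math import comb, sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test on [[a, b], [c, d]].

    Returns the probability of seeing at least `a` in the top-left cell
    when all row and column totals are held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, x) * comb(n - col1, row1 - x) for x in range(a, upper + 1)
    )
    return favourable / comb(n, row1)


# Hypothetical counts: rows = top 25% of genes vs. the rest,
# columns = genes inside vs. outside one category.
v_perfect = cramers_v([[10, 0], [0, 10]])   # fully associated table
v_none = cramers_v([[5, 5], [5, 5]])        # no association
p_value = fisher_enrichment_p(3, 1, 1, 3)
print(v_perfect, v_none, round(p_value, 4))
```

In an enrichment setting, a small p-value for a category would suggest that it is over-represented among the top-ranked genes and is therefore worth keeping in the rendered graph.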

BioHackathon 2011 is over but will probably be remembered as a turning point that saw many of the components required for realizing the Linked Data vision in the life science domain being created, refined or delivered by a group in which enthusiasm and humor mix with craftsmanship and patience. These will be captured in a "thematic series" of manuscripts written by the BioHackathon 2011 participants over the next few months, and published in Journal of Biomedical Semantics. While transitive closures, just like watermelons, were cracked by brute force, more subtle methods offered an exquisite assortment of technical solutions. Could this rival the incredible 14-course meal of Japanese gastronomy the participants had the opportunity to experience? Some of the attendees are still debating the issue. With all the work done this year, one can only eagerly await BioHackathon 2012 and use the winter months to refine use cases and questions to be thrown at this fast-evolving infrastructure.

The success of BioHackathon 2011 owes so much to the impeccable organization and sagacity of our Japanese hosts, who went to great lengths to ensure that all the cats were herded safely at all times, probably helped in their task by the careful watch of Inari. We are very grateful and can only say one thing: Okini.

 

Mark Wilkinson (The Wilkinson Laboratory; Editorial Board Member of Journal of Biomedical Semantics)
Philippe Rocca-Serra (Technical Project Leader of the Standards and Data Sharing Infrastructure)
