Imagine you would like to study a plant species of interest, say for its ability to supply food, fiber, fuel or just to gain a deeper scientific understanding. Two approaches immediately come to mind: an understanding of its underlying genetics and an understanding of plant phenomics, which spans how the genes express themselves across huge length scales from the microscopic to the whole plant itself.
It is widely recognized that characterizing a plant at the genetic level is now relatively easy. Advances in technology allow cheaper and faster genetic and genomic characterization than ever before, with comparisons often drawn to how much quicker the genotyping revolution has occurred compared to the evolution of modern electronics when one considers the cost of data processed. Of course, the assembly of this genetic data can be complicated by repetitive sequences, polyploidy, or other genetic complexities, but powerful tools have been created to deal with that. And since all living organisms share the same genetic code (different combinations of four nucleotide bases: adenine, cytosine, guanine and thymine), most of these tools are universal and can be used to study many different organisms and species. Once the analysis is done, encoding and sharing the data is also easy and well agreed upon within the genomics community. Then, if you would like to compare your species with other characterized ones, you’ll just have to go online and browse the existing resources (Gramene, TAIR, you name it).
Now let’s imagine that you would like to characterize the phenotype of your plant species.
The first decision you have to make is what type of phenotype data is most meaningful to you. Indeed, phenotype data come in all sorts of flavors and shapes: from simple leaf width and length, to morphometrics data, to biochemical profiles or leaf spectral data. Each data type comes with its unique technical constraints and “price tag”. Depending on your interest and budget, you already have to make a choice and confine your analysis to only a subset of phenotypes.
The second aspect of phenotypic choice is how to ensure that the data you collect is meaningful across location and time. Plants are extremely plastic: they grow and develop while strongly interacting with their direct environment. Unlike animals, whose shape and growth are relatively stable within one species, two genetically identical plants grown in different environments can present extremely large variations in size and shape. For the sake of comparison, imagine a pair of cloned cats: one of them grows up in Norway, the other in Italy. If, like plants, cat growth and development were highly dependent on temperature, the one raised in Italy would be three to four times bigger than the other. Well, this certainly doesn’t happen in cats, but it can in plants. So, when trying to characterise a plant phenotype, the environmental conditions and the timing of the phenotyping are of utmost importance.
Then, once you have chosen which traits you want to measure, as well as when and where to measure them, you’ll have to choose how to store, curate and encode this data depending on the kind of subsequent analysis. Almost no standard exists to store plant phenotypic data, and only a few platforms currently exist to share it, such as the Plant Trait Database or MetaboLights for metabolic phenotypes. Given how few platforms exist, if you want to compare your new data with existing data, you’ll have to dig into publications, find the supplemental data, which is most likely stored as PDF files (often at the publisher’s request), and then re-format the data so it matches yours. Only then will you be able to compare them. And this is one of the best cases, since most of the time such data is not shared, except in the form of processed data or charts.
Plant phenotyping, as a shared and accessible research community, is in many cases still in its infancy. The parallel development of a plethora of sensors and robotic platforms, both in the greenhouse and the field, has significantly increased the amount of phenotypic data that can be generated, moving the bottleneck from data acquisition to data storage and analysis. To overcome these bottlenecks and accelerate research based on plant phenomics, the plant phenotyping community needs access to a common platform with resources that make raw phenotypic data accessible in standardized formats, along with the tools to process such data. The GigaScience Thematic Series on “Plant Phenomics: Data Integration and Analysis” aims to address some of these challenges.
The challenges mentioned above are complex: plant phenotypic data are highly variable, either in their very nature (shape, protein content, color) or in their context (plant species, plant age, environment, ...). One of the main challenges that the community is now facing is the ability to integrate phenotypic datasets coming from different sources in order to make sense of the whole picture. How do we compare data from multiple sites, taken by multiple teams with different tools? How do we integrate data at the cellular, plant and field scale? Examples already exist, such as Terra-Ref (ARPA-E), which is attempting to create ‘gold’ standard ground-truthed phenomics datasets that compare multiple sensor types and will be freely available to the research community. But these examples are too few.
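To make the multi-site question concrete, here is a minimal Python sketch of what harmonizing two small datasets might look like. Everything in it is an assumption made for illustration: the site names, the field names, the values and the choice of millimetres as the common unit are hypothetical and do not come from any existing standard or platform.

```python
# Hypothetical records from two sites measuring the same trait (leaf length)
# with different field names and units. All names and values are illustrative.
SITE_A = [
    {"plant": "A-01", "leaf_len_mm": 52.0},
    {"plant": "A-02", "leaf_len_mm": 47.5},
]
SITE_B = [
    {"id": "B-01", "leaf_length_cm": 6.5},
    {"id": "B-02", "leaf_length_cm": 5.25},
]

# Conversion factors to the (assumed) common unit, millimetres.
TO_MM = {"mm": 1.0, "cm": 10.0}


def harmonize(records, site, id_field, value_field, unit):
    """Map site-specific records onto one shared schema with values in mm."""
    factor = TO_MM[unit]
    return [
        {
            "site": site,
            "plant_id": r[id_field],
            "leaf_length_mm": r[value_field] * factor,
        }
        for r in records
    ]


combined = (
    harmonize(SITE_A, "site_a", "plant", "leaf_len_mm", "mm")
    + harmonize(SITE_B, "site_b", "id", "leaf_length_cm", "cm")
)

for row in combined:
    print(row)
```

Even in this toy case, each team has to tell the script which field holds the identifier, which holds the value and which unit was used. With an agreed format, that information would travel with the data instead of being rediscovered for every comparison.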
Another challenge is more technical: how do we store such data? Different formats are already available: tabular files (CSV) or structured files (XML, JSON). How do we encode the metadata? What units do we use? These questions might seem childish, but by not answering them, the community is bound to navigate an ocean of ever-changing data formats and types. On the other hand, having such standards would allow researchers to advance further with their analyses. It would allow any new piece of information to be compared with previous ones (as is theoretically possible with genetic data). It would also allow the re-use of previously gathered datasets to better analyse new data, for instance by means of machine learning techniques, and would further enhance reproducibility. Another key advantage is that a common format to store phenomics data might attract more computational scientists to the field and encourage them to create better and more universal analysis pipelines.
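As a rough illustration of the storage question, the Python sketch below writes one measurement both as a structured JSON document and as a flat CSV row, keeping the metadata and the unit next to the value. The field names, the nesting and the dotted column naming are assumptions made for this example only; they are not an existing community standard.

```python
import csv
import io
import json

# One hypothetical measurement; every field name and value is illustrative.
record = {
    "species": "Zea mays",
    "plant_id": "plot3-plant12",
    "plant_age_days": 21,
    "environment": {"site": "greenhouse-1", "temperature_c": 24.0},
    "trait": "leaf_length",
    "value": 52.0,
    "unit": "mm",
}

# Structured form: JSON keeps the nested metadata as it is.
as_json = json.dumps(record, indent=2, sort_keys=True)


def flatten(d, prefix=""):
    """Flatten nested dicts into dotted column names for a tabular layout."""
    flat = {}
    for key, val in d.items():
        name = f"{prefix}{key}"
        if isinstance(val, dict):
            flat.update(flatten(val, prefix=name + "."))
        else:
            flat[name] = val
    return flat


# Tabular form: CSV needs a single flat row per measurement.
row = flatten(record)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Both forms carry the same information. The difference is that JSON preserves the nesting and the value types, while CSV turns every value into text and relies on a column-naming convention (here, `environment.site`) that readers of the file have to know about.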
Sharing is Caring
Finally comes the question of sharing such data. As of today, most publishers do not require data to be shared, although the situation is improving. Still, even when data is required, the preferred format is often not parsable by automated algorithms (some publishers still prefer supplemental data to be uploaded in PDF format). Support for data deposition in pre-print servers, which are quickly gaining popularity within the biological sciences, is still largely missing.
Having such valuable data scattered across the web is a significant waste of resources and money, and certainly a loss for plant science in general. Again, some examples exist, such as the Peer Reviewers’ Openness Initiative, which invites reviewers to obtain the raw data before performing any reviewing task. See also an Editorial on GigaScience’s open peer review.
Calling for Papers
In light of these challenges in plant phenomics, GigaScience has launched an open thematic series, “Plant Phenomics: Data Integration and Analyses”, guest edited by Dr Rubén Rellán Álvarez, Dr Guillaume Lobet, Dr Malia Gehan and Dr Srikant Srinivasan.
This comprehensive series aims to shed light on new advances, applications, and challenges, and to improve data sharing, integration, analyses and reproducibility in plant phenomics. We encourage the submission of Research Articles and Technical Notes, as well as Data Notes, which are papers that focus on the description of interesting plant phenomic datasets, curated and hosted in our database, GigaDB. We also consider Commentaries and thought-provoking Reviews in this area.
Potential topics include, but are not limited to:
- New methods in phenotype data collection, e.g. drones, imaging techniques
- New tools in phenotype data integration and analyses
- Organ-scale phenomics
- Databases, management and workflows
- Research or data wrapped in a containerized form, e.g. Docker, BioBox, VMs
For more information, please email firstname.lastname@example.org
References
- Michael P Pound, Alexandra J Burgess, Michael H Wilson, Jonathan A Atkinson, Marcus Griffiths, Aaron S Jackson, Adrian Bulat, Georgios Tzimiropoulos, Darren M Wells, Erik H Murchie, Tony P Pridmore, Andrew P French. Deep Machine Learning provides state-of-the-art performance in image-based plant phenotyping. bioRxiv 053033; doi: http://dx.doi.org/10.1101/053033
- PLOS One Data Availability. http://journals.plos.org/plosone/s/data-availability. Accessed 1 June 2016