Let’s leverage the impact of herbaria through deep learning

Across the world, museums and institutions maintain herbaria containing millions of plant specimens collected over hundreds of years. Yet this treasure trove of data remains relatively untapped because of the difficulty of identifying and classifying specimens. In this guest blog, José Carranza-Rojas, Erick Mata-Montero and Pierre Bonnet discuss their new research, published today in BMC Evolutionary Biology, that uses deep learning computer vision techniques to automate specimen identification and unlock the potential of the world’s herbaria.

In the last decade, automatic plant identification has become a key interdisciplinary research field that supports botanists in their efforts to identify and classify all plants on Earth. Computer vision applied to plant images has proven helpful for this task. Early work produced low-accuracy results under extremely controlled conditions, but the field has quickly evolved toward better and more accurate algorithms. As a result, we now have useful tools such as Pl@ntNet that allow ordinary citizens to identify plant species from images taken in the wild.

Around 3,000 institutions maintain herbaria around the world, with more than 350,000,000 specimens stored under controlled conditions. This reflects the accumulated conservation efforts of thousands of botanists around the globe over the centuries. By coupling herbarium sheet images with computer vision techniques, in particular deep learning, herbaria around the world would have a stronger case for the value and impact of maintaining and investing in these precious collections, as providers of a completely new dataset that is likely to be useful for automatic plant identification.

A new study by researchers from the Costa Rica Institute of Technology, CIRAD and INRIA does exactly this: it tackles the problem of automatic plant identification using deep learning techniques coupled with herbarium sheet images.

Herbarium specimens presentation, at the Botanical Research Institute of Texas (BRIT), Fort Worth, Texas

Automatic Plant Identification

Early automatic plant identification focused on leaf images, which simplified algorithm development and material preparation. It later moved to images of plants in the wild, covering additional organs such as flowers and fruits as well as the complex environments captured in the field. Thanks to the PlantCLEF evaluation campaign, algorithms have become more accurate and can now handle high levels of noise in the images, as well as larger numbers of images and species. A couple of years ago, the Pl@ntNet initiative, involving several French research institutions (INRIA, CIRAD, INRA and IRD), started using deep learning technology for its mobile plant identification application.

Deep Learning Technology

During the ImageNet computer vision competition in 2012, Alex Krizhevsky beat the previous state-of-the-art general-purpose identification systems by a considerable margin by using a Convolutional Neural Network for the first time. Since then, deep learning has become the de facto technology for artificial intelligence (AI) and computer vision in the computer science community. This technology allows predictive models to “learn” concepts by building on simpler concepts, in a way loosely similar to the human visual cortex.

Deep learning is not really a new technology. However, its full potential has only come within reach in recent years thanks to two main factors: the increased processing power of Graphics Processing Units (GPUs) and the availability of vast amounts of data. Models tend to require more data in order to generalize better to unseen examples. This is where herbarium sheets become relevant, as their digitization can help enlarge a global plant dataset for deep-learning-based plant identification.

Scaling up a global visual plant dataset

Applying deep learning to plant identification calls for a global dataset that covers, ideally, all known plant species, estimated at around 400,000 worldwide. As herbaria have hundreds of millions of specimens stored in their cabinets, it is logical to use images of those sheets to improve such a dataset. The improvement concerns both its size and its balance (machine learning algorithms tend to learn better when the number of images per class is roughly the same). Thanks to efforts such as iDigBio, which invests heavily in delivering and aggregating digital images, millions of digital specimen records are now publicly available and can be used to build such a global dataset. Encyclopedia of Life, GBIF, iNaturalist, and Pl@ntNet, among others, have also generated and published large amounts of plant images.
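
The point about balance can be made concrete with a short sketch. Assuming every image carries a species label, the Python snippet below counts images per class, reports how unbalanced the set is, and derives inverse-frequency class weights, one common remedy when adding more images of rare species is not possible. The function names and the sample species counts are hypothetical and purely illustrative; they are not taken from the study.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def class_weights(labels):
    """Inverse-frequency weights: rare species get proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {species: total / (n_classes * n) for species, n in counts.items()}


# Hypothetical toy dataset: one label per image (assumed data, not from the study).
labels = ["Quercus robur"] * 6 + ["Acer campestre"] * 3 + ["Ficus carica"] * 1

ratio = imbalance_ratio(labels)
weights = class_weights(labels)
print(ratio)
print(weights)
```

With these weights, each species contributes the same total weight to a training loss regardless of how many images it has, which is one way to approximate a balanced dataset.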

One of the problems with herbarium images is visual noise. Specimens are normally mounted on sheets without automated visual processing in mind. For instance, organs are juxtaposed, and elements such as labels also appear in the image. Deep learning has been shown to deal particularly well with visual noise and complex images, so now is the right time to try it on herbarium images.

Room dedicated to specimen preparation, at the Botanical Research Institute of Texas (BRIT), Fort Worth, Texas
Pierre Bonnet

Supporting Herbarium Institutions with Deep Learning

The study by Carranza-Rojas et al. shows that deep learning achieves good species identification accuracy when herbarium images are used for both training and testing. This could lead to a future identification system that supports botanists working in herbaria with the huge volume of visual data now available. The authors also show that models trained on herbarium images from one region of the world can yield knowledge that transfers to other regions. This is one of the first “transfer learning” experiments in the botanical context. It could be particularly useful for herbarium institutions that don’t yet have a large number of digitized images. Because some of the most biodiverse regions of the world are also the least surveyed, this transfer learning approach can be very important for improving the use of datasets on these floras.
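
To illustrate the general idea of transfer learning, here is a minimal, framework-free Python sketch. It is not the authors' method or network: the "frozen" layer is a hypothetical stand-in for features learned on images from a source region, the four two-value "images" are made-up data for a target region, and only a small new classifier head is trained on them.

```python
import math

# Hypothetical frozen layer: stands in for features already learned on
# herbarium images from a source region. These weights are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(pixels):
    """Apply the frozen layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, pixels)))
            for row in FROZEN_WEIGHTS]


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only a new logistic-regression head on target-region data."""
    feats = [extract_features(s) for s in samples]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(w * v for w, v in zip(weights, f)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - y  # gradient of the logistic loss w.r.t. z
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(head, pixels):
    """Return the predicted class (0 or 1) for one sample."""
    weights, bias = head
    f = extract_features(pixels)
    z = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if z > 0 else 0


# Made-up target-region data: two species (0 and 1), two samples each.
target_samples = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
target_labels = [0, 0, 1, 1]

head = train_head(target_samples, target_labels)
predictions = [predict(head, s) for s in target_samples]
print(predictions)
```

The design choice shown here is feature reuse: the expensive part of the model is kept fixed and only a small head is fitted, which is why the approach can help institutions with few digitized images. In practice, some or all of the pretrained layers are often fine-tuned as well.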

As a result of this study, herbarium institutions may gain additional value. Field explorers, botanists, taxonomists, technicians, and data managers have invested a huge amount of work that has generated very useful data, not only for the biological sciences but also for the computer science community. We hope that this work will open the door to stronger collaborations between these communities, particularly between Natural History Museums and Machine Learning / Computer Vision labs.
