Deep learning could discover new plant species hidden in centuries of herbarium data

Machine learning techniques excel at doing a good-enough job quickly in situations where there’s lots of data to grind through. It turns out that’s a great fit for backlogs of plant samples at herbariums and other repositories around the world, which have millions of the things waiting to be digitized and identified — including some that may be new to science.

There are thousands of such collections around the world, housing some 350 million specimens. It’s suspected that tens of thousands of new species may be hidden among them, but the labor cost of manually going through all the samples to double-check them, modernize the taxonomy and so on is prohibitive.

Not only that, but the valuable info in these slowly vanishing temples to the plant kingdom needs to be modernized in order to be of use to an increasingly digital-first scientific community.

Enter the deep learning system. The researchers, from the Costa Rica Institute of Technology and the French Agricultural Research Centre for International Development, felt the time was right to let loose the technology on this huge corpus of data.

They trained a plant-identification algorithm on a quarter million images of plant samples, and set it to work IDing new sheets. Its first guess matched the species picked by human experts 4 out of 5 times, and 90 percent of the time the correct species was among the algorithm’s top few guesses.
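The two figures above are what is usually called top-1 and top-k accuracy. As a rough illustration of how such scores are computed, here is a minimal Python sketch. The function name, the species and the five example sheets are all made up for the example; none of it comes from the study's data or code.

```python
def top_k_accuracy(ranked_guesses, expert_labels, k):
    """Fraction of sheets whose expert label appears in the model's first k guesses."""
    hits = sum(
        1
        for guesses, label in zip(ranked_guesses, expert_labels)
        if label in guesses[:k]
    )
    return hits / len(expert_labels)


# Hypothetical ranked guesses for five specimen sheets, best guess first.
ranked_guesses = [
    ["Quercus robur", "Quercus petraea", "Fagus sylvatica"],
    ["Acer campestre", "Acer platanoides", "Acer pseudoplatanus"],
    ["Fagus sylvatica", "Carpinus betulus", "Quercus robur"],
    ["Betula pendula", "Betula pubescens", "Alnus glutinosa"],
    ["Tilia cordata", "Tilia platyphyllos", "Ulmus glabra"],
]

# Hypothetical species assigned to the same sheets by human experts.
expert_labels = [
    "Quercus robur",
    "Acer platanoides",
    "Fagus sylvatica",
    "Betula pendula",
    "Tilia cordata",
]

top1 = top_k_accuracy(ranked_guesses, expert_labels, k=1)
top3 = top_k_accuracy(ranked_guesses, expert_labels, k=3)
print(f"top-1: {top1:.0%}, top-3: {top3:.0%}")
```

In this toy set the first guess is right on four of five sheets, and the one miss is rescued by the second guess, which is why the top-3 score comes out higher than the top-1 score.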

Depending on what discipline you’re in, those results may sound either good or bad. But this kind of work is as much art as it is science, and samples of a given species may vary so widely that two taxonomists may come to different conclusions. So getting it right most of the time on the first try is an excellent result. And anomalous results, of course, may indicate an unknown species and be flagged for extra attention.
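The article doesn't say how anomalous results would be picked out. One simple possibility, offered here purely as an assumption, is to flag any sheet where even the model's best guess gets a low confidence score. The sheet IDs, the scores and the 0.5 threshold in this sketch are hypothetical.

```python
def flag_for_review(predictions, min_confidence=0.5):
    """Return IDs of sheets whose best guess scores below min_confidence."""
    flagged = []
    for sheet_id, scores in predictions.items():
        best_species = max(scores, key=scores.get)
        if scores[best_species] < min_confidence:
            flagged.append(sheet_id)
    return flagged


# Hypothetical per-sheet confidence scores (assumed, not from the study).
predictions = {
    "sheet-001": {"Quercus robur": 0.92, "Quercus petraea": 0.06},
    "sheet-002": {"Acer campestre": 0.31, "Acer platanoides": 0.29},
    "sheet-003": {"Fagus sylvatica": 0.75, "Carpinus betulus": 0.20},
}

flagged = flag_for_review(predictions)
print(flagged)  # only sheet-002 has a best guess below the threshold
```

A flagged sheet isn't necessarily a new species. It could just as easily be a damaged or mislabeled sample, which is why it would go to a botanist rather than straight into the record.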

As a bonus, the researchers found that if the algorithm was trained on images from an herbarium in, say, France, it was still effective if applied to samples from Brazil. This effective transfer learning was a relief, since it means a new system doesn’t have to be created from scratch and tweaked for every collection or style of plant sample.

The system’s expertise did not, however, carry over to leaf scan pictures, like those you might use to ID a plant in the field. The process of drying and mounting simply produces too different a kind of image, and whatever the system “learned” didn’t apply to fresh leaves. That was expected, though, and anyway, effective systems for that side of the science are already in use.

And don’t worry, it’s not going to put the botanists out of work.

“People feel this kind of technology could be something that will decrease the value of botanical expertise,” study co-author Pierre Bonnet told Nature. “But this approach is only possible because it is based on the human expertise. It will never remove the human expertise.”

Now that the basics of the system have been established, the researchers are looking to expand it. Metadata about the plants, such as when and where they were collected, what phase of flowering or growth they were in, and so on could improve accuracy and create research opportunities — for example, systematically comparing how leaf sizes of a certain species have changed over a century of climate change. Similar systems geared toward fossils or animal samples will also be informed by the team’s work.

The research was published this week in the journal BMC Evolutionary Biology.