Document Type

Honors Thesis

Major

Computer Science

Advisor(s)

Chaofan Chen

Committee Members

Kate Beard-Tisdale, Salimeh Yasaei Sekeh, Roy Turner

Graduation Year

May 2024

Publication Date

Spring 5-2024

Abstract

This research aims to develop an interpretable and fast machine learning (ML) model for identifying species using environmental DNA (eDNA). eDNA analysis detects the presence or absence of species in an ecosystem by examining the DNA that animals naturally leave behind in water or soil. However, there can be millions of sequences to classify and the reference databases are sizeable, so traditional methods such as BLAST are slow. Convolutional neural networks (CNNs) have been shown to be 150 times faster at classifying sequences. In this work, we create a CNN that achieves 92.5% accuracy, surpassing the accuracy of previous work. Then, we add an interpretable layer based on the ProtoPNet from Chen et al. (2019). This new network learns prototypes: patterns of bases that distinguish between species. Its decisions are human-understandable because it makes predictions by feeding the similarity score for each prototype directly into a fully-connected layer without bias. A person can easily see the reason for any given prediction by looking at the prototype that contributed most strongly to it and the class with which that prototype is most associated. Training the network also generates new knowledge, because the network learns the parts of sequences that uniquely differentiate each species. Unlike the original ProtoPNet, we compare prototypes to both the raw input and the latent space, which makes prototypes more genuine and easier to visualize. We retrain the network to achieve 93.4% accuracy, showing that interpretability does not reduce accuracy. This work contributes to environmental monitoring and conservation efforts by generating new knowledge and providing a more efficient, understandable, and accurate method for identifying native and invasive species.
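The prediction step described in the abstract (prototype similarity scores passed through a bias-free fully-connected layer) can be illustrated with a minimal sketch. This is not the thesis implementation: the similarity measure (fraction of matching bases in the best-matching window of the raw input), the prototypes, the weights, and the class labels are all hypothetical, and the sketch omits the CNN, the latent-space comparison, and training.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as one-hot rows; unknown bases become all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def similarity(encoded, prototype):
    """Best match over all windows: fraction of matching bases, in [0, 1].

    Assumed similarity measure, chosen only for illustration.
    """
    k = len(prototype)
    best = 0.0
    for start in range(len(encoded) - k + 1):
        matches = sum(
            sum(x * p for x, p in zip(encoded[start + i], prototype[i]))
            for i in range(k)
        )
        best = max(best, matches / k)
    return best

def predict(seq, prototypes, weights, classes):
    """Score each class as a bias-free weighted sum of prototype similarities."""
    encoded = one_hot(seq)
    sims = [similarity(encoded, p) for p in prototypes]
    logits = [sum(w * s for w, s in zip(row, sims)) for row in weights]
    winner = max(range(len(classes)), key=lambda c: logits[c])
    # Explanation: the prototype contributing most to the winning class.
    contributions = [w * s for w, s in zip(weights[winner], sims)]
    top = max(range(len(prototypes)), key=lambda j: contributions[j])
    return classes[winner], top, sims

# Hypothetical toy data: two made-up prototypes and two made-up class labels.
prototypes = [one_hot("ACGT"), one_hot("TTGA")]
weights = [[1.0, -0.5],   # row 0: class 0 is supported by prototype 0
           [-0.5, 1.0]]   # row 1: class 1 is supported by prototype 1
classes = ["species_A", "species_B"]

label, top_prototype, sims = predict("GGACGTCC", prototypes, weights, classes)
print(label, top_prototype, sims)
```

Because the layer has no bias term, every logit is fully accounted for by the per-prototype contributions, so the largest contribution can be read off directly as the reason for the prediction.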
