Date of Award

Winter 12-18-2015

Level of Access Assigned by Author

Open-Access Thesis

Degree Name

Master of Science (MS)


Computer Science


Roy M. Turner

Second Committee Member

George Markowsky

Third Committee Member

Nicholas A. Giudice


Responding to email is a time-consuming task that most professions require. Many people find themselves answering the same questions over and over, replying with answers they have already written, either in whole or in part. In this thesis, the Automatic Mail Reply (AMR) system is implemented to help users compose these repeated responses. The system uses past email interactions and, through unsupervised statistical learning, attempts to recover relevant information that the user can draw on when writing a reply.

Three statistical learning models, term frequency-inverse document frequency (tf-idf), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA), are evaluated to find which approach works best for email document retrieval and similarity matching. AMR is built in the Python programming language to take advantage of tools and libraries built specifically for natural language processing and topic modeling. Datasets include the author’s work and personal email archives, the publicly available 20 Newsgroups dataset, and the recently released email archives of former U.S. Secretary of State Hillary Clinton from the Freedom of Information Act website. In addition to the different datasets and statistical modeling approaches, two software libraries, Gensim and scikit-learn, are also compared.
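
To make the retrieval task concrete, the following is a minimal sketch of how two of the three models (tf-idf and LSA) might rank past emails against a new message using scikit-learn. It is an illustration under stated assumptions, not the AMR implementation: the four sample emails, the query, the English stop-word list, and the choice of two LSA components are all invented for this example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for an archive of past emails (hypothetical data).
past_emails = [
    "How do I reset my account password?",
    "The meeting is moved to Thursday afternoon.",
    "Password reset instructions are on the help page.",
    "Please submit your expense report by Friday.",
]
new_email = "I forgot my password and need to reset it"

# tf-idf: weight each term by its frequency in a document and its rarity
# across the corpus, then compare documents by cosine similarity.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_docs = vectorizer.fit_transform(past_emails)
tfidf_query = vectorizer.transform([new_email])
tfidf_scores = cosine_similarity(tfidf_query, tfidf_docs).ravel()

# LSA: project the tf-idf vectors into a low-rank space with truncated SVD.
# Two components is an arbitrary choice for a corpus this small.
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_docs = lsa.fit_transform(tfidf_docs)
lsa_query = lsa.transform(tfidf_query)
lsa_scores = cosine_similarity(lsa_query, lsa_docs).ravel()

# Print the past emails in order of decreasing similarity under each model.
for name, scores in (("tf-idf", tfidf_scores), ("LSA", lsa_scores)):
    ranking = [int(i) for i in scores.argsort()[::-1]]
    print(name, ranking)
```

On a real archive the number of LSA components would need tuning, and an LDA model could be compared in the same way by swapping in a topic-distribution representation of each email.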

The outcome of this work is an initial version of the AMR system, which is freely available from the author’s GitHub page. The core components of AMR take an email corpus as input, build a model of that corpus through unsupervised learning, and predict useful replies to new email based on the model. These pieces could be used as a toolkit for many different purposes. Although the best topic modeling approach is not definitively determined, this thesis concludes that scikit-learn’s LSA implementation yields the most consistent results (p < 0.05) across the tested datasets. These results could inform future work on a more sophisticated product that accomplishes a range of machine learning tasks.
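
As an illustration of the first stage described above (turning an email corpus into past interactions), the sketch below pairs each message with its reply using only the Python standard library. It assumes that replies can be linked to their parents through the Message-ID and In-Reply-To headers; the helper names body_text and pair_replies and the two sample messages are hypothetical and are not taken from the AMR code.

```python
from email import message_from_string
from email.message import Message


def body_text(msg: Message) -> str:
    """Return the first text/plain part of a message as a string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace").strip()
    return ""


def pair_replies(raw_messages: list[str]) -> list[tuple[str, str]]:
    """Pair each message with the reply that cites it via In-Reply-To."""
    parsed = [message_from_string(raw) for raw in raw_messages]
    # Index every message by its Message-ID so replies can find their parent.
    by_id = {m["Message-ID"].strip(): m for m in parsed if m["Message-ID"]}
    pairs = []
    for reply in parsed:
        parent_id = reply["In-Reply-To"]
        parent = by_id.get(parent_id.strip()) if parent_id else None
        if parent is not None:
            pairs.append((body_text(parent), body_text(reply)))
    return pairs


# Two invented messages: a question and the reply that answers it.
raw_question = (
    "Message-ID: <q1@example.org>\n"
    "Subject: VPN access\n"
    "\n"
    "How do I get VPN access?\n"
)
raw_answer = (
    "Message-ID: <a1@example.org>\n"
    "In-Reply-To: <q1@example.org>\n"
    "Subject: Re: VPN access\n"
    "\n"
    "Fill in the access form on the intranet.\n"
)

pairs = pair_replies([raw_question, raw_answer])
print(pairs)
```

With pairs like these, a model would be fitted on the incoming messages, and the stored reply of the most similar past message would be offered to the user as a starting point.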