Date of Award
Summer 8-16-2024
Level of Access Assigned by Author
Open-Access Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Advisor
Terry S. Yoo
Second Committee Member
Salimeh Yasaei Sekeh
Third Committee Member
Amy Booth
Abstract
Sound plays a crucial role in our daily lives, allowing us to communicate, interact, and navigate the world. Our ability to hear, distinguish, and locate sounds comes from the complex structure of our ears and the brain's processing systems. Hearing loss, however, can impede communication and socialization, and the resulting cognitive overload can lead to isolation and depression. Hearing correction, in the form of hearing aids, can restore hearing, but overlapping conversations and sounds in a crowded environment remain difficult to follow. By shifting and boosting key frequencies, hearing aids make speech more audible for the user. Modern hearing devices can also connect to embedded devices, such as smartphones, allowing the end user to adjust the hearing aid. This thesis aims to move technology toward the goal of real-time speaker separation on embedded devices using a deep-learning approach.
Deep Learning models have been used for sound classification, Music Information Retrieval (MIR), and source separation tasks. Each model's design reflects its specific function, and the architectures are sometimes drastically different. For example, MIR networks may be more complex because they must accommodate instrument harmonics outside the speech frequencies, and the Digital Signal Processing (DSP) filtering built into the models also differs. Ideas from speech separation networks, such as masking, have been incorporated into MIR model design, but there are few examples of the reverse. Combining the background-noise filtering capabilities of MIR architectures with speech separation design enables generalized speech separation to work in all sound environments.
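To illustrate the masking idea shared by MIR and speech-separation architectures, the following is a minimal PyTorch sketch: a network predicts a per-bin mask over the magnitude spectrogram of the mixture and applies it to the complex STFT before resynthesis. The module, layer sizes, and FFT parameters are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn


class MaskingSeparator(nn.Module):
    """Toy mask-based separator: predict a [0, 1] mask over the magnitude
    spectrogram and apply it to suppress background sources."""

    def __init__(self, n_fft: int = 512, hidden: int = 256):
        super().__init__()
        self.n_fft = n_fft
        n_bins = n_fft // 2 + 1
        # Tiny mask estimator; real MIR/speech models use far deeper stacks.
        self.mask_net = nn.Sequential(
            nn.Linear(n_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_bins),
            nn.Sigmoid(),
        )

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples) time-domain waveform
        window = torch.hann_window(self.n_fft, device=mixture.device)
        spec = torch.stft(mixture, self.n_fft, window=window, return_complex=True)
        mag = spec.abs()                                        # (batch, bins, frames)
        mask = self.mask_net(mag.transpose(1, 2)).transpose(1, 2)
        masked = spec * mask                                    # keep the mixture phase
        return torch.istft(masked, self.n_fft, window=window, length=mixture.shape[-1])
```

The same masking pattern appears in both domains; what differs is the depth of the mask estimator and the DSP front end feeding it.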
Deep Learning models require large datasets to achieve worthwhile results. Current datasets are designed with either speech or music separation in mind, but none have been developed to combine the two. While this makes sense for a single task, it leaves out the chaotic nature of everyday life and of a generalized environment. Because of this, a new mixed speech-music dataset was created for training and evaluation, combining existing speech separation datasets with grocery-store music from Kmart. This modification to existing datasets introduces a way to train Deep Learning networks, through generalization, to be viable in real-world environments and to operate on everyday devices. In total, two datasets were used, both modified to incorporate the Kmart music.
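A sketch of the kind of data augmentation this describes appears below: overlaying a background-music clip onto a speech clip at a chosen speech-to-music ratio, assuming torchaudio for I/O. The function name, file paths, and default ratio are hypothetical; the thesis's actual mixing procedure and datasets are described in the full document.

```python
import torch
import torchaudio


def mix_speech_with_music(speech_path: str, music_path: str, snr_db: float = 5.0):
    """Overlay background music onto a speech clip at a target
    speech-to-music energy ratio (in dB)."""
    speech, sr = torchaudio.load(speech_path)
    music, sr_music = torchaudio.load(music_path)
    if sr_music != sr:
        music = torchaudio.functional.resample(music, sr_music, sr)

    # Downmix both clips to mono so the shapes line up.
    speech = speech.mean(dim=0, keepdim=True)
    music = music.mean(dim=0, keepdim=True)

    # Loop or trim the music so it covers the full speech clip.
    if music.shape[-1] < speech.shape[-1]:
        reps = -(-speech.shape[-1] // music.shape[-1])  # ceiling division
        music = music.repeat(1, reps)
    music = music[:, : speech.shape[-1]]

    # Scale the music to hit the requested speech-to-music ratio.
    speech_power = speech.pow(2).mean()
    music_power = music.pow(2).mean().clamp_min(1e-8)
    gain = torch.sqrt(speech_power / (music_power * 10 ** (snr_db / 10)))
    mixture = speech + gain * music
    return mixture, speech, sr  # network input, clean target, sample rate
```

Pairs produced this way (noisy mixture plus clean speech target) are what allow a separation network trained on speech-only corpora to generalize to music-filled environments.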
Various custom models were trained and evaluated on a mixture of the combined and existing datasets, but only a serialized network consisting of Demucs and DPRNN produced usable results. The results showed a gap between subjective and objective tests of separation capability, highlighting the need for a metric that captures human hearing perception for familiar speakers. The serialized network scored lower on objective measurements but produced higher-quality output under subjective evaluation.
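The sketch below shows the shape of such a serialized design and one common objective metric, scale-invariant SDR, assuming PyTorch. The two stages here are stand-in modules, not the actual Demucs and DPRNN implementations used in the thesis, and the metric is shown only to illustrate what an "objective measurement" computes.

```python
import torch
import torch.nn as nn


def si_sdr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB, a standard objective separation metric."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    scale = (estimate * reference).sum(-1, keepdim=True) / (
        reference.pow(2).sum(-1, keepdim=True) + eps
    )
    target = scale * reference
    noise = estimate - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))


class SerializedSeparator(nn.Module):
    """Chain a waveform-domain filtering stage (Demucs-style) into a
    dual-path separation stage (DPRNN-style). Both stages are placeholders
    for the real architectures."""

    def __init__(self, filter_stage: nn.Module, separation_stage: nn.Module):
        super().__init__()
        self.filter_stage = filter_stage
        self.separation_stage = separation_stage

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        filtered = self.filter_stage(mixture)    # remove music/background first
        return self.separation_stage(filtered)   # then separate the speakers
```

Metrics like SI-SDR reward waveform-level fidelity, which is one reason a network can score lower objectively while listeners still judge its output as cleaner.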
Despite the discrepancy between objective and subjective results, a generalized network capable of filtering background sounds warrants further improvement.
Recommended Citation
Zippert, Tristan R., "Towards Generalized Speech Separation For Hearing Aids: Deep Learning Approach For Combined Music and Speech" (2024). Electronic Theses and Dissertations. 4035.
https://digitalcommons.library.umaine.edu/etd/4035