Date of Award

Summer 8-16-2024

Level of Access Assigned by Author

Open-Access Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Advisor

Terry S. Yoo

Second Committee Member

Salimeh Yasaei Sekeh

Third Committee Member

Amy Booth

Abstract

Sound plays a crucial role in our daily lives, allowing us to communicate, interact, and navigate the world. Our ability to hear, distinguish, and locate sounds comes from the complex structure of our ears and the brain's processing systems. Hearing loss, however, can impede our ability to communicate and socialize, and the resulting cognitive overload can lead to isolation and depression. Hearing correction, in the form of hearing aids, can restore hearing, but overlapping conversations and sounds in crowded environments remain difficult to follow. By shifting and boosting key frequencies, hearing aids can make speech more audible for the user. Modern hearing devices can also connect to embedded devices, such as smartphones, allowing the end user to adjust the hearing aid. This thesis aims to advance technology toward the goal of speaker separation for real-time applications on embedded devices using a deep-learning approach.

Deep learning models have been used for sound classification, Music Information Retrieval (MIR), and source separation tasks. Each model's design reflects its specific function, and architectures can differ drastically. For example, music information retrieval networks may be more complex because they must accommodate instrument harmonics outside the speech frequencies. The filtering components of these models, built on Digital Signal Processing (DSP), also differ. Aspects of speech separation networks, such as masking, have been incorporated into MIR model design, but there are few examples of the reverse. Combining the background-noise filtering capabilities of MIR architectures with speech separation design enables generalized speech separation that can work across sound environments.
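
To make the masking idea mentioned above concrete, the sketch below applies a network-predicted soft mask to a mixture spectrogram and inverts the result back to audio. This is an illustration only, not the thesis's implementation; `mask_net` is a hypothetical model assumed to map magnitude spectrograms to masks between 0 and 1.

```python
import torch

def masked_separation(mixture, mask_net, n_fft=1024, hop=256):
    """Illustrative time-frequency masking for source separation.

    `mask_net` is a hypothetical network that predicts a soft mask
    (values in [0, 1]) from the magnitude spectrogram of the mixture.
    """
    window = torch.hann_window(n_fft)
    # Complex spectrogram of the mixed signal.
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    # Predict and apply the mask: keep the target source, suppress the rest.
    mask = mask_net(spec.abs())
    masked = spec * mask
    # Back to the time domain, trimmed to the original length.
    return torch.istft(masked, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])
```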

Deep learning models require large datasets to achieve worthwhile results. Current datasets are designed with either speech or music separation in mind, but none have been developed to combine the two. While this makes sense for a single task, it leaves out the chaotic nature of everyday life and generalized environments. Because of this, a new mixed speech-and-music dataset was created for training and evaluation, combining current speech separation datasets with in-store music from Kmart. This modification to existing datasets provides a way to train deep learning networks, through generalization, to be viable in real-world environments and to operate on everyday devices. In total, two datasets were used, each modified to incorporate the Kmart music.
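
As a rough illustration of how such a combined dataset could be assembled, the sketch below mixes a speech clip with background music at a chosen signal-to-noise ratio. The exact mixing procedure used in the thesis may differ; the SNR value and the tiling behavior here are assumptions.

```python
import torch

def mix_speech_with_music(speech, music, snr_db=5.0):
    """Mix a speech waveform with background music at a target SNR (dB).

    Both inputs are 1-D tensors at the same sample rate; the music is tiled
    or trimmed to the speech length, then scaled to hit the requested SNR.
    """
    if music.shape[-1] < speech.shape[-1]:
        reps = speech.shape[-1] // music.shape[-1] + 1
        music = music.repeat(reps)
    music = music[: speech.shape[-1]]

    speech_power = speech.pow(2).mean()
    music_power = music.pow(2).mean().clamp_min(1e-10)
    # Choose a gain so that 10 * log10(speech_power / scaled_music_power) == snr_db.
    gain = torch.sqrt(speech_power / (music_power * 10 ** (snr_db / 10)))
    return speech + gain * music
```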

Various custom models were trained and evaluated using a mixture of combined and existing datasets, but only a serialized network consisting of Demucs and DPRNN produced usable results. The results revealed a discrepancy between subjective and objective tests of separation capability, highlighting the need for a metric that captures human auditory perception of familiar speakers. The serialized network scored worse on objective measurements but produced higher-quality output under subjective evaluation.
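
The serialized arrangement described above can be pictured as one model feeding the next: a background-removal stage followed by a speaker-separation stage. The sketch below shows that cascade in outline only; `background_filter` and `speaker_separator` are hypothetical stand-ins for Demucs-like and DPRNN-like models, not the thesis's trained networks.

```python
import torch

class SerializedSeparator(torch.nn.Module):
    """Illustrative two-stage cascade: filter background sounds, then
    separate the remaining speech into individual speakers."""

    def __init__(self, background_filter, speaker_separator):
        super().__init__()
        self.background_filter = background_filter   # Demucs-like stage (hypothetical)
        self.speaker_separator = speaker_separator   # DPRNN-like stage (hypothetical)

    def forward(self, mixture):
        # Stage 1: suppress music and other non-speech background.
        speech_only = self.background_filter(mixture)
        # Stage 2: split the cleaned speech into per-speaker estimates.
        return self.speaker_separator(speech_only)
```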

Despite the discrepancy between the objective and subjective results, a generalized network capable of filtering background sounds warrants further development.
