Date of Award

Fall 12-15-2023

Level of Access Assigned by Author

Open-Access Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Advisor

Salimeh Yasaei Sekeh

Second Committee Member

Abram Magner

Third Committee Member

Chaofan Chen

Abstract

Graph summarization is a fundamental problem in data analysis that aims to distill large graph datasets into smaller yet informative representations. The challenge lies in producing compressed graphs that faithfully retain the structural information needed for downstream tasks. A recent advance introduces an optimal transport-based framework that allows a priori knowledge about the importance of nodes, edges, and attributes to be incorporated into the summarization process. However, the statistical properties of this framework remain largely unexplored.

This master's thesis studies graph summarization with a particular focus on the supervised setting, where the goal is to reduce graph size while preserving the information relevant to a class label. We use information-theoretic measures to quantify how much of this information is preserved. To give supervised graph summarization a theoretical foundation, we frame the problem as maximizing the Shannon mutual information between the summarized graph and the associated class label. We prove that this problem is NP-hard to approximate, which limits what can be expected of any proposed solution.

In response to this hardness result, we introduce a summarization method that integrates mutual information estimates into the optimal transport compression framework. These estimates capture relationships between the random variables associated with sample graphs and class labels. In empirical experiments on synthetic datasets and certain real-world scenarios, the proposed method yields significant improvements in classification accuracy and computational efficiency over prior approaches.

Beyond the empirical evaluation, this thesis analyzes the theoretical limitations of the optimal transport framework for supervised graph summarization. We show that this approach fails to satisfy a critical information monotonicity property, which sheds light on its practical and theoretical constraints. In summary, the thesis offers new insights into the statistical properties of an emerging optimal transport-based framework and proposes a method that combines information theory with optimal transport, with practical improvements and theoretical perspectives applicable across diverse application domains.
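
As a toy illustration of the objective described in the abstract (mutual information between a summary and a class label), the following minimal sketch computes a plug-in estimate of Shannon mutual information from paired discrete samples. It is not the method of the thesis: the function name `mutual_information`, the toy data, and the two summaries are all hypothetical, and neither the optimal transport framework nor the thesis's estimators are reproduced here.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)               # empirical marginal counts of X
    py = Counter(ys)               # empirical marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: each "graph" is reduced to one discrete feature
# (an edge count) and paired with a binary class label.
edge_counts = [1, 2, 3, 4, 1, 2, 3, 4]
labels      = [0, 0, 1, 1, 0, 0, 1, 1]

# Two hypothetical summaries of the same feature.
summary_a = [c > 2 for c in edge_counts]   # keeps the label-relevant split
summary_b = [c % 2 for c in edge_counts]   # discards the label-relevant split

mi_full = mutual_information(edge_counts, labels)
mi_a = mutual_information(summary_a, labels)
mi_b = mutual_information(summary_b, labels)

print(f"I(feature; label)   = {mi_full:.3f} bits")
print(f"I(summary_a; label) = {mi_a:.3f} bits")
print(f"I(summary_b; label) = {mi_b:.3f} bits")
```

In this toy setting, `summary_a` retains all of the label information carried by the original feature, while `summary_b` retains none of it, even though both compress the feature to a single bit. A supervised summarization objective of the kind described above would prefer the first summary.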
