Visualization of data with missing entries using non-linear dimensionality reduction

Supervisors: Lee, John Aldo ; Verleysen, Michel
Faculty: Ecole polytechnique de Louvain
Degree label: Master [120] : ingénieur civil en informatique, à finalité spécialisée
Abstract: This master's thesis studies dimensionality reduction for datasets with missing entries, extending prior work to higher dimensions. Gaussian Mixture Models can handle missing data by modeling the data distribution and generating pseudo-complete datasets through multiple imputation. In addition, extensions to handle and account for missing data are introduced. Because Gaussian Mixture Models are known to scale poorly in high dimensions, parametrizations designed to make the model more parsimonious are also implemented. From the imputed datasets, high-dimensional similarities are derived to compute low-dimensional embeddings. Finally, the performance of dimensionality reduction on the imputed datasets is assessed for increasing dimensionality and compared against alternative imputation techniques. While the approach is effective up to 50 dimensions, challenges arise in higher-dimensional spaces due to the curse of dimensionality and numerical approximations. Other parametrizations from the High Dimensional Data Clustering family show similar performance, which suggests relying on alternative imputation paradigms when dealing with very high dimensions.
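As an illustration of the multiple-imputation step mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes the mixture parameters (`weights`, `means`, `covs`) have already been fitted, that each row has at least one observed entry, and that full covariance matrices are used. The function name `impute_row` and all variable names are hypothetical. For a row with missing values, it picks a component according to the responsibilities computed from the observed entries, then samples the missing entries from that component's conditional Gaussian.

```python
import numpy as np

def impute_row(x, weights, means, covs, rng):
    """Draw one imputation of the NaN entries of x from a given GMM (sketch)."""
    miss = np.isnan(x)
    if not miss.any():
        return x.copy()
    obs = ~miss  # assumption: at least one entry is observed
    n_comp = len(weights)

    # Log-responsibilities from the observed marginal of each component.
    # The (2*pi) constant is identical across components and is omitted.
    log_resp = np.empty(n_comp)
    for k in range(n_comp):
        diff = x[obs] - means[k][obs]
        s_oo = covs[k][np.ix_(obs, obs)]
        _, logdet = np.linalg.slogdet(s_oo)
        maha = diff @ np.linalg.solve(s_oo, diff)
        log_resp[k] = np.log(weights[k]) - 0.5 * (logdet + maha)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()

    # Sample a component, then sample the missing block conditionally.
    k = rng.choice(n_comp, p=resp)
    s_oo = covs[k][np.ix_(obs, obs)]
    s_mo = covs[k][np.ix_(miss, obs)]
    s_mm = covs[k][np.ix_(miss, miss)]
    gain = np.linalg.solve(s_oo, s_mo.T).T  # equals s_mo @ inv(s_oo)
    cond_mean = means[k][miss] + gain @ (x[obs] - means[k][obs])
    cond_cov = s_mm - gain @ s_mo.T
    cond_cov = 0.5 * (cond_cov + cond_cov.T)  # enforce symmetry numerically

    out = x.copy()
    out[miss] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

# Toy usage with made-up parameters: two well-separated components in 3-D.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
covs = np.array([0.01 * np.eye(3), 0.01 * np.eye(3)])
row = np.array([10.1, np.nan, 9.9])

# Several draws for the same row give several pseudo-complete versions of it.
draws = [impute_row(row, weights, means, covs, rng) for _ in range(5)]
```

Repeating the draw for every incomplete row yields multiple pseudo-complete datasets, from which similarities can then be computed. The conditional sampling above requires solving a linear system per component and per missingness pattern, which is one reason full-covariance mixtures become costly as the dimension grows.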