Generative Manifold Learning for the Exploration of Partially Labeled Data

Raúl Cruz-Barbosa, Alfredo Vellido


In many real-world application problems, few data labels are available for supervised learning, which calls for semi-supervised methods. A manifold learning model, namely Generative Topographic Mapping (GTM), is the basis of the methods developed in this thesis. First, a variant of GTM that uses a graph approximation to the geodesic metric is defined. This model is capable of representing data of convoluted geometries. The standard GTM is modified to prioritize neighbourhood relationships along the generated manifold. This is accomplished by penalizing divergences between the Euclidean distances from the data points to the model prototypes and the corresponding geodesic distances along the manifold. The resulting Geodesic GTM (Geo-GTM) model is shown to improve the continuity and trustworthiness of the representation generated by the model, as well as to behave robustly in the presence of noise.
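The graph approximation to the geodesic metric mentioned above can be illustrated with a minimal sketch. It is not the Geo-GTM implementation: the k-nearest-neighbour graph construction, the use of Dijkstra's algorithm, and the names knn_graph and geodesic_distances are assumptions made for illustration only. Points are sampled along a half circle, where the straight-line distance between the two ends is much shorter than the distance along the curve.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph with Euclidean edge weights.

    Hypothetical helper: the graph construction used by Geo-GTM may differ.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Shortest-path lengths from source (Dijkstra), used here as an
    approximation to geodesic distance along the sampled manifold."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# 21 points sampled along a half circle of radius 1.
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, 0)

# Straight-line distance between the two ends of the arc.
euclid_end = math.dist(points[0], points[20])
# Graph distance between the same two ends, following the arc.
geo_end = geo[20]
```

In this toy case the graph distance between the two ends is close to the arc length (pi), whereas the Euclidean distance is the diameter (2). This gap is the kind of divergence that a geodesic-aware model can take into account.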
We then define a novel semi-supervised model, SS-Geo-GTM, that extends Geo-GTM to deal with partially labeled data. In SS-Geo-GTM, the model prototypes are linked by the nearest neighbour to the data manifold constructed by Geo-GTM. The resulting proximity graph is used as the basis for a class label propagation algorithm. The performance of SS-Geo-GTM is experimentally assessed, comparing favourably with that of a Euclidean distance-based counterpart and that of the alternative Laplacian Eigenmaps method. Finally, the developed model is applied to the analysis of a human brain tumour dataset (obtained by Nuclear Magnetic Resonance Spectroscopy), where the task is survival prognostic modelling.
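As a rough illustration of class label propagation over a proximity graph, the sketch below spreads labels outward from a few labeled nodes by weighted neighbour voting. This is a generic propagation scheme, not the SS-Geo-GTM algorithm itself; the function name propagate_labels, the graph encoding, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


def propagate_labels(graph, seed_labels, max_iter=100):
    """Spread labels from seed nodes over a weighted proximity graph.

    graph: dict mapping node -> {neighbour: similarity weight}.
    seed_labels: dict mapping node -> class label for the labeled nodes.
    Each pass assigns every unlabeled node the label with the largest total
    weight among its already-labeled neighbours. Assigned labels are never
    revised (an illustrative simplification).
    """
    labels = dict(seed_labels)
    for _ in range(max_iter):
        updates = {}
        for node in graph:
            if node in labels:
                continue
            votes = defaultdict(float)
            for neighbour, weight in graph[node].items():
                if neighbour in labels:
                    votes[labels[neighbour]] += weight
            if votes:
                # Ties are broken by sorted label order (labels must be sortable).
                updates[node] = max(sorted(votes), key=votes.get)
        if not updates:
            break  # nothing left to label, or no labeled neighbours reachable
        labels.update(updates)
    return labels


# Toy proximity graph: a chain of six prototypes with unit weights.
chain = {i: {} for i in range(6)}
for i in range(5):
    chain[i][i + 1] = 1.0
    chain[i + 1][i] = 1.0

# Only the two ends of the chain are labeled.
result = propagate_labels(chain, {0: "A", 5: "B"})
```

On this chain, the nodes closer to the "A" seed end up labeled "A" and those closer to the "B" seed end up labeled "B". In a real setting the edge weights would reflect proximity along the learned manifold rather than being uniform.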


Semi-supervised learning; Clustering; Generative Topographic Mapping; Exploratory Data Analysis
