Breast, Lung and Liver Cancer Classification from Structured and Unstructured Data

Beatriz A. Gonzalez-Beltrán, José A. Reyes-Ortiz, Erick E. Montelongo-Gonzalez


Cancer is a worldwide public health problem. Machine learning and deep learning techniques hold great promise in healthcare because they can analyze Electronic Health Records (EHRs), which contain large collections of structured and unstructured data. However, most research has focused on structured data, even though valuable information is also found in physicians’ plain-text notes. This paper therefore proposes an approach to classify breast, liver, and lung cancer from structured and unstructured data obtained from the MIMIC-II clinical database, using machine learning and deep learning techniques. In particular, the Paragraph Vector algorithm is used as a deep learning approach to text representation. The goal of this work is to help physicians in the early diagnosis of cancer. The proposed approach was tested on a balanced dataset of breast, liver, and lung cancer patient records. The structured and unstructured data are pre-processed, and the result is used as input to three machine learning models: Support Vector Machines (SVM), Multilayer Perceptron (MLP), and AdaBoost-SAMME. Scoring metrics for these models are then calculated under different training data configurations to select the best-performing model for classification. Results show that the best-performing model was the MLP, which achieved 89% precision using unstructured data.
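The model-selection step described above can be illustrated with a minimal sketch. It assumes that precision is macro-averaged over the three cancer classes, which the abstract does not state. The names `macro_precision` and `best_run`, the model and configuration labels, and all labels and predictions are hypothetical and invented for illustration; they are not the paper's code, data, or results.

```python
from collections import Counter

CLASSES = ("breast", "liver", "lung")


def macro_precision(y_true, y_pred, classes=CLASSES):
    """Average the per-class precision, TP / (TP + FP), over the classes."""
    true_positives = Counter()
    predicted = Counter()
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            true_positives[pred] += 1
    # Assumption: a class that is never predicted contributes 0.0.
    per_class = [
        true_positives[c] / predicted[c] if predicted[c] else 0.0
        for c in classes
    ]
    return sum(per_class) / len(classes)


def best_run(runs, y_true):
    """Return ((model, configuration), score) with the highest macro precision."""
    scores = {key: macro_precision(y_true, y_pred) for key, y_pred in runs.items()}
    best_key = max(scores, key=scores.get)
    return best_key, scores[best_key]


# Toy ground truth and predictions, invented purely for illustration.
y_true = ["breast", "breast", "liver", "liver", "lung", "lung"]
runs = {
    ("model_a", "unstructured"): ["breast", "liver", "liver", "liver", "lung", "breast"],
    ("model_b", "structured"): ["breast", "breast", "liver", "lung", "lung", "lung"],
}

if __name__ == "__main__":
    (model, configuration), score = best_run(runs, y_true)
    print(f"best: {model} on {configuration} data, macro precision {score:.3f}")
```

In practice, each entry in `runs` would hold the held-out predictions of one trained classifier under one training data configuration. A library routine for precision could replace the hand-written function.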


Cancer classification, structured and unstructured data, deep learning for unstructured data representation, machine learning models, electronic health records
