Machine Learning-Based Classification of Habanero Pepper Yield Using Mixed Metabolomic and Phenotypic Profile Features
Abstract
This study addresses the classification of habanero pepper yield with a machine learning model that pairs IBk (Instance-Based k-nearest neighbors) with the HEOM (Heterogeneous Euclidean-Overlap Metric). The pairing is designed for mixed-type data and integrates both numerical (metabolomic and morphological) and categorical (phenotypic) features. The dataset comprised 165 instances described by 58 features: 51 metabolites (sugars, amino acids, organic acids, bioactive compounds), four qualitative descriptors (race, cultivar, color, description), and three quantitative descriptors (fruit size). The target variable was binary, distinguishing high yield (>25 tons/ha) from low yield (<14 tons/ha), with a moderate class imbalance (imbalance ratio, IR = 1.75). Leave-One-Out Cross-Validation (LOOCV) was employed to ensure a robust and deterministic validation process. The IBk/HEOM algorithm achieved perfect classification (100% accuracy) with all 58 features for k ≤ 25, demonstrating the high discriminatory power of the selected biomarkers. From k = 26 onward, a progressive increase in False Positives (Type I errors) was observed, which is typically associated with decision boundary overlap and bias towards the majority class. Feature relevance analysis identified eight critical attributes (race, cultivar, fruit width, succinic acid, ferulic acid, ascorbic acid, guanosine, and NAD) that, by themselves, maintained optimal predictive performance up to k = 31, providing a direct path to a more parsimonious model and to lower field and laboratory costs. This work validates the utility of integrating mixed data from metabolomic biomarkers and phenotypic features. Because HEOM handles heterogeneous data natively, the framework eliminates the need for pre-processing transformations of the categorical features. The result is an inherently interpretable predictive tool suited to decision-making in agricultural and biochemical research.
Keywords
Mixed-data, yield prediction, metabolomic biomarkers, morphological-productive profiling
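To make the IBk/HEOM pairing and the LOOCV protocol described in the abstract concrete, the following is a minimal Python sketch, not the implementation used in the study. It assumes the standard HEOM definition: overlap distance (0 if equal, 1 otherwise) for categorical attributes, range-normalized absolute difference for numerical attributes, a distance of 1 when a value is missing, and the square root of the summed squared per-attribute distances. The function names (`heom`, `numeric_ranges`, `knn_predict`, `loocv_accuracy`) and the six toy records are hypothetical and invented purely for illustration; they are not data from the study.

```python
import math
from collections import Counter


def heom(a, b, ranges, categorical):
    """HEOM distance between two mixed-type records (tuples).

    categorical: set of column indices treated as categorical.
    ranges: dict mapping numerical column index -> (max - min) in the training set.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            d = 1.0                          # missing value: maximal distance
        elif j in categorical:
            d = 0.0 if x == y else 1.0       # overlap metric
        else:
            r = ranges[j]
            d = abs(x - y) / r if r > 0 else 0.0  # range-normalized difference
        total += d * d
    return math.sqrt(total)


def numeric_ranges(rows, categorical):
    """Range (max - min) of every numerical column, ignoring missing values."""
    ranges = {}
    for j in range(len(rows[0])):
        if j in categorical:
            continue
        vals = [r[j] for r in rows if r[j] is not None]
        ranges[j] = (max(vals) - min(vals)) if vals else 0.0
    return ranges


def knn_predict(query, rows, labels, k, ranges, categorical):
    """Majority vote among the k training records nearest to query under HEOM."""
    order = sorted(range(len(rows)),
                   key=lambda i: heom(query, rows[i], ranges, categorical))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(rows, labels, k, categorical):
    """Leave-one-out accuracy: each record is classified by all the others."""
    hits = 0
    for i in range(len(rows)):
        train = rows[:i] + rows[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        ranges = numeric_ranges(train, categorical)
        pred = knn_predict(rows[i], train, train_y, k, ranges, categorical)
        hits += pred == labels[i]
    return hits / len(rows)


# Hypothetical toy records: (cultivar, fruit width, one metabolite level).
toy_rows = [
    ("A", 3.0, 0.9), ("A", 3.2, 0.8), ("A", 3.1, 1.0),
    ("B", 2.0, 0.2), ("B", 2.1, 0.3), ("B", 1.9, 0.1),
]
toy_labels = ["high", "high", "high", "low", "low", "low"]
toy_categorical = {0}  # column 0 (cultivar) is categorical

acc_k1 = loocv_accuracy(toy_rows, toy_labels, k=1, categorical=toy_categorical)
acc_k5 = loocv_accuracy(toy_rows, toy_labels, k=5, categorical=toy_categorical)
print(acc_k1, acc_k5)
```

On this toy set, LOOCV is deterministic because no random split is involved. With k = 1 every held-out record is classified correctly, whereas with k = 5 each held-out record is outvoted three to two by the opposite class. This only illustrates the general mechanism by which a large k lets the more numerous neighbors dominate the vote; it does not reproduce the k = 26 threshold reported for the real dataset.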