Background

The interpretation of tandem mass spectrometry (MS/MS) data is a central challenge in metabolomics and a critical step for metabolite identification in applications spanning biomedicine, nutrition, environmental sciences, and chemical ecology. Currently, most metabolite annotations rely on spectral library matching, where experimental MS/MS spectra are compared against reference spectra of known compounds. However, existing spectral libraries are far from comprehensive, so a large proportion of detected MS/MS spectra remains unannotated. This unresolved fraction, commonly referred to as the dark metabolome, represents a major bottleneck in metabolomics research.

Recent advances in machine learning (ML) and deep learning (DL) offer promising alternatives that overcome these limitations by enabling structure-aware metabolite annotation beyond direct spectral matches. A key challenge in this context is transforming both MS/MS spectra and molecular structures into informative numerical representations that preserve relevant chemical information while reducing sparsity and dimensionality.

Project Description

This TFG (Treball de Fi de Grau, bachelor's thesis) project aims to explore and evaluate different strategies for encoding molecular structures and integrating them with ML/DL models for metabolite identification from MS/MS data. The student will work with real metabolomics datasets and investigate how different chemical representations affect model performance, robustness, and interpretability.

Specifically, the project will focus on:

• Exploring molecular structure encoding techniques, including:
  o Classical molecular fingerprints
  o Learned molecular embeddings (e.g., Mol2Vec-like approaches)
  o Graph-based representations of molecules
• Evaluating different machine learning and deep learning architectures (e.g., neural networks, convolutional models, graph neural networks) for metabolite annotation tasks.
• Assessing the strengths and limitations of each encoding–model combination in terms of accuracy, scalability, and generalization.
• Contributing to the understanding of how representation learning influences metabolite identification beyond spectral library matching.

Expected Outcomes

The project will provide practical experience at the interface of metabolomics, cheminformatics, and machine learning, and may contribute to the development or improvement of computational tools for metabolite annotation. Depending on progress, results may be suitable for inclusion in a scientific publication or open-source software.

Required Profile

This project is suitable for students with:

• Basic knowledge of machine learning
• Programming experience in Python and/or R
• Interest in metabolomics, cheminformatics, or applied AI

Experience with deep learning frameworks (e.g., PyTorch, TensorFlow) is an advantage but not mandatory.
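
As a rough illustration of the spectral library matching mentioned in the Background, the sketch below bins MS/MS peak lists into fixed-length intensity vectors and scores a query against reference spectra with cosine similarity. The bin width, m/z range, function names, compound names, and peak values are all illustrative assumptions and not part of the project specification; established tools typically use more refined scoring schemes.

```python
import math


def bin_spectrum(peaks, mz_min=0.0, mz_max=1000.0, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length vector.

    The range and bin width are arbitrary choices made for this sketch.
    """
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Guard against floating-point rounding at the upper edge.
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[idx] += intensity
    return vector


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_library_match(query_peaks, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    query_vec = bin_spectrum(query_peaks)
    scores = {
        name: cosine_similarity(query_vec, bin_spectrum(ref_peaks))
        for name, ref_peaks in library.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


# Made-up peak lists for two hypothetical reference compounds.
library = {
    "compound_A": [(85.02, 35.0), (127.05, 100.0), (145.04, 55.0)],
    "compound_B": [(91.05, 100.0), (119.08, 80.0)],
}
# Made-up query spectrum.
query = [(85.03, 40.0), (127.04, 100.0), (145.05, 60.0)]

name, score = best_library_match(query, library)
print(name, round(score, 3))
```

A query whose best score stays low against every reference would remain unannotated under this scheme, which is the situation the learned representations in this project aim to address.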
Bachelor's Degree in Biomedical Engineering, Double Bachelor's Degree in Computer Engineering and Biotechnology (GEI), Double Bachelor's Degree in Biomedical Engineering and Telecommunication Systems and Services Engineering (GEB)
Omics sciences and personalized medicine
Proposed
2026-01-20
Óscar Yanes Torrado
The project can be carried out in a flexible manner, either in hybrid mode (remote and in-person) or fully remotely, depending on the candidate’s preferences and circumstances.
Medium
No
No
No
No
| File | Description |