Recent advances in sequencing technologies have made population-scale genome analysis a reality, driving progress in modern biomedical research and healthcare. At this scale, genomic data analysis enables the discovery of valuable biological insights, such as novel genetic variations, gene expression patterns, genes, and regulatory elements. However, the exponential growth of genomic data poses significant challenges for the performance and scalability of genome analysis tools that must access large sequence databases. For instance, mapping long DNA sequences to a large reference genome is one of the most time-consuming steps in many genome sequencing analyses. In particular, seeding algorithms, which locate short DNA fragments in a reference genome, often become a major performance bottleneck in genome mapping tools. To speed up this step, many performance-critical tools rely on optimized hash tables to perform fast database lookups of DNA sequences. Although hash tables are well known for their efficiency, hash-table-based tools suffer from irregular memory access patterns and limited spatial locality, which makes them a poor fit for modern hardware. Recently proposed learned index strategies have shown promise in accelerating traditional data structures, such as hash tables, by using machine-learned models such as the recursive model index (RMI) to predict the location of keys and thereby reduce the number of memory accesses. Nevertheless, the performance of learned hash tables remains constrained by low instruction-level parallelism, poor cache locality, and underutilized memory bandwidth. This thesis proposes to analyze and characterize the performance of learned hash tables, identify their bottlenecks and limitations, and explore software and hardware acceleration strategies to unlock their full potential on modern high-performance architectures.
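To make the learned-index idea concrete, the following is a minimal sketch, not the method studied in the thesis. It assumes k-mers are packed into integers (2 bits per base), stored in a sorted array, and located with a single linear model plus a bounded local search. A real RMI uses a hierarchy of models; the one-stage model and the names `encode_kmer` and `LinearLearnedIndex` are hypothetical simplifications chosen for illustration.

```python
from bisect import bisect_left

# Assumed 2-bit encoding of DNA bases (illustrative choice).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


class LinearLearnedIndex:
    """One-stage learned index over a non-empty sorted list of unique keys.

    A linear model predicts the position of a key; the largest prediction
    error seen at build time bounds the window that is searched afterwards.
    """

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        lo, hi = sorted_keys[0], sorted_keys[-1]
        # Fit a line through the first and last key (a deliberately crude model).
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(self._predict(key) - pos) for pos, key in enumerate(sorted_keys)
        )

    def _predict(self, key):
        pos = int(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or -1 if absent."""
        guess = self._predict(key)
        left = max(guess - self.max_err, 0)
        right = min(guess + self.max_err + 1, len(self.keys))
        # Binary search restricted to the error-bounded window.
        pos = bisect_left(self.keys, key, left, right)
        return pos if pos < right and self.keys[pos] == key else -1


# Usage: index all distinct 4-mers of a toy reference and look one up.
reference = "ACGTACGTTGCA"
k = 4
kmers = sorted({encode_kmer(reference[i:i + k]) for i in range(len(reference) - k + 1)})
index = LinearLearnedIndex(kmers)
position = index.lookup(encode_kmer("GTTG"))
```

The design point this illustrates is that a lookup touches only a small, contiguous window around the predicted position, whereas a hash probe jumps to an effectively random location. How well this holds depends on how closely the model fits the key distribution, which this sketch does not evaluate.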
Double Degree in Computer Engineering and Biotechnology (GEI)
Computer Architecture
In Progress
2025-04-11
Carlos Molina Clemente
MONTSERRAT PALAZON BALMASEDA
Active
No
No
Yes
No