Computer Architect and Teaching Assistant
Email: victor.soria.pardos@upc.edu
Address: Campus Nord, Polytechnic University of Catalonia
C/Jordi Girona, 1-3, B6 Building
08034 Barcelona (Spain)
ABOUT ME
Víctor Soria Pardos received the BSc degree in Computer Science from the Universidad de Zaragoza, Spain, in 2019, and the MSc and PhD degrees in Computer Engineering from the Universitat Politècnica de Catalunya (UPC), Spain, in 2022 and 2026, respectively. Currently, he is a Teaching Assistant with the Computer Architecture Department (DAC) at UPC. His research interests include processor microarchitecture, memory hierarchy, cache coherence, and parallel computer architecture. He actively collaborates with the Grupo de Arquitectura de Computadores (gaZ) at the Universidad de Zaragoza.
PUBLICATIONS
2026
Proceedings Articles
Siracusa, Marco; Hsu, Olivia; Soria-Pardos, Víctor; Randall, Joshua; Grasset, Arnaud; Biscondi, Eric; Joseph, Doug; Allen, Randy; Kjolstad, Fredrik; Moretó Planas, Miquel; Armejach, Adrià
Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures Proceedings Article
In: Proceedings of the 22nd ACM International Symposium on Code Generation and Optimization (CGO), 2026.
@inproceedings{siracusa2025ember,
title = {Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures},
author = {Marco Siracusa and Olivia Hsu and Victor Soria-Pardos and Joshua Randall and Arnaud Grasset and Eric Biscondi and Doug Joseph and Randy Allen and Fredrik Kjolstad and Miquel Moretó Planas and Adrià Armejach},
url = {https://arxiv.org/pdf/2504.09870},
year = {2026},
date = {2026-01-01},
urldate = {2026-01-01},
booktitle = {Proceedings of the 22nd ACM International Symposium on Code Generation and Optimization, CGO},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2025
Proceedings Articles
Soria-Pardos, Víctor; Armejach, Adrià; Suárez, Darío; Martinot, Didier; Grasset, Arnaud; Moretó, Miquel
FLAMA: Architecting floating-point atomic memory operations for heterogeneous HPC systems Proceedings Article
In: 2025 28th Euromicro Conference on Digital System Design (DSD), pp. 435–442, IEEE, 2025.
@inproceedings{soria2025flama,
title = {FLAMA: Architecting floating-point atomic memory operations for heterogeneous HPC systems},
author = {Víctor Soria-Pardos and Adrià Armejach and Darío Suárez and Didier Martinot and Arnaud Grasset and Miquel Moretó},
url = {https://upcommons.upc.edu/server/api/core/bitstreams/9199c411-ce89-4327-a06b-bf21838aa8db/content},
doi = {10.1109/DSD67783.2025.00066},
year = {2025},
date = {2025-01-01},
urldate = {2025-01-01},
booktitle = {2025 28th Euromicro Conference on Digital System Design (DSD)},
pages = {435–442},
publisher = {IEEE},
abstract = {Current heterogeneous systems integrate general-purpose Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Neural Processing Units (NPUs). The efficient use of such systems requires a significant programming effort to distribute computation and synchronize across devices, which usually involves using Atomic Memory Operations (AMOs). Arm recently launched a floating-point Atomic Memory Operations (FAMOs) extension to perform atomic updates on floating-point data types specifically. This work characterizes and models heterogeneous architectures to understand how floating-point AMOs impact graph, Machine Learning (ML), and high-performance computing (HPC) workloads. Our analysis shows that many AMOs are performed on floating-point data, which modern systems execute using inefficient compare-and-swap (CAS) constructs. Therefore, replacing CAS-based constructs with FAMOs can improve a wide range of workloads. Moreover, we analyze the trade-offs of executing FAMOs at different memory hierarchy levels, either in private caches (near) or remotely in shared caches (far). We have extended the widely used AMBA CHI protocol to evaluate such FAMO support on a simulated chiplet-based heterogeneous architecture. While near FAMOs achieve an average 1.34× speed-up, far FAMOs reach an average 1.58× speed-up. We conclude that FAMOs can bridge the gap between CPU architecture and accelerators, enabling synchronization in key application domains.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
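The inefficient CAS-based construct mentioned in the abstract is easy to picture in code. The C++ sketch below is illustrative and not taken from the paper: it contrasts the pre-C++20 compare-exchange retry loop that compilers typically emit for atomic floating-point adds with the C++20 fetch_add overload for std::atomic<double>, which can lower to a single native floating-point AMO on ISAs that provide one (such as Arm's FAMO extension).

#include <atomic>

// Pre-C++20 idiom: emulate an atomic floating-point add with a
// compare-and-swap (CAS) retry loop -- the inefficient construct
// that native floating-point AMOs are designed to replace.
double cas_atomic_add(std::atomic<double>& target, double value) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak refreshes 'expected' with the
    // current value, so the loop retries with up-to-date data.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// C++20: fetch_add is defined for std::atomic<double>, letting the
// compiler emit one native floating-point AMO where the ISA has one.
double native_atomic_add(std::atomic<double>& target, double value) {
    return target.fetch_add(value, std::memory_order_relaxed);
}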
Soria-Pardos, Víctor; Armejach, Adrià; Mück, Tiago; Suárez Gracia, Darío; Joao, Jose; Moretó, Miquel
Delegato: Locality-Aware Atomic Memory Operations on Chiplets Proceedings Article
In: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, pp. 1793–1808, ACM, 2025.
@inproceedings{soria2025delegato,
title = {Delegato: Locality-Aware Atomic Memory Operations on Chiplets},
author = {Víctor Soria-Pardos and Adrià Armejach and Tiago Mück and Darío Suárez Gracia and Jose Joao and Miquel Moretó},
url = {https://dl.acm.org/doi/full/10.1145/3725843.3756030},
doi = {10.1145/3725843.3756030},
year = {2025},
date = {2025-01-01},
urldate = {2025-01-01},
booktitle = {Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture},
pages = {1793–1808},
publisher = {ACM},
abstract = {The irruption of chiplet-based architectures has been a game changer, enabling higher transistor integration and core counts in a single socket. However, chiplets impose higher and non-uniform memory access (NUMA) latencies than monolithic integration. This harms the efficiency of atomic memory operations (AMOs), which are fundamental to implementing fine-grained synchronization and concurrent data structures on large systems. AMOs are executed either near the core (near) or at a remote location within the cache hierarchy (far). On near AMOs, the core’s private cache fetches the target cache line in exclusiveness to modify it locally. Near AMOs cause significant data movement between private caches, especially harming parallel applications’ performance on chiplet-based architectures. Alternatively, far AMOs can alleviate the communication overhead by reducing data movement between processing elements. However, current multicore architectures only support one type of far AMO, which sends all updates to a single serialization point (centralized AMOs).
This work introduces two new types of far AMOs, delegated and migrating, that execute AMOs remotely without centralizing updates in a single point of the cache hierarchy. Combining centralized, delegated, and migrating AMOs allows the directory to select the best location to execute AMOs. Moreover, we propose Delegato, a tracing optimization to effectively transport usage information from private caches to the directory to predict the best atomic type to issue accurately. Additionally, we design a simple predictor on top of Delegato that seamlessly selects the best placement to perform AMOs based on the data access pattern and usage activity of cores. Our evaluation using gem5 shows that Delegato can speed up applications on average by 1.07× over centralized AMOs and by 1.13× over the state-of-the-art AMO predictor.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
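As a rough, hypothetical illustration of the placement decision the abstract describes (not the paper's actual mechanism), the sketch below shows how a directory might choose among centralized, delegated, and migrating execution based on tracked usage of a cache line; all field names and thresholds here are invented for the example.

#include <cstdint>

// Hypothetical directory-side placement choice for a far AMO.
enum class AmoPlacement { Centralized, Delegated, Migrating };

struct LineUsage {
    uint32_t sharer_count;  // distinct cores recently updating this line
    uint32_t owner_reuse;   // consecutive AMOs issued by the same core
};

AmoPlacement select_placement(const LineUsage& u) {
    if (u.owner_reuse > 4)    // one core dominates: move the line to it
        return AmoPlacement::Migrating;
    if (u.sharer_count > 8)   // heavy contention: serialize at one point
        return AmoPlacement::Centralized;
    return AmoPlacement::Delegated;  // moderate sharing: execute remotely
}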
2024
Journal Articles
López-Villellas, Lorién; Langarita-Benítez, Rubén; Badouh, Asaf; Soria-Pardos, Víctor; Aguado-Puig, Quim; López-Paradís, Guillem; Doblas, Max; Setoain, Javier; Kim, Chulho; Ono, Makoto; Armejach, Adrià; Marco-Sola, Santiago; Alastruey-Benedé, Jesús; Ibáñez, Pablo; Moretó, Miquel
GenArchBench: A genomics benchmark suite for arm HPC processors Journal Article
In: Future Generation Computer Systems, vol. 157, pp. 313-329, 2024, ISSN: 0167-739X.
@article{LOPEZVILLELLAS2024313,
title = {GenArchBench: A genomics benchmark suite for arm HPC processors},
author = {Lorién López-Villellas and Rubén Langarita-Benítez and Asaf Badouh and Víctor Soria-Pardos and Quim Aguado-Puig and Guillem López-Paradís and Max Doblas and Javier Setoain and Chulho Kim and Makoto Ono and Adrià Armejach and Santiago Marco-Sola and Jesús Alastruey-Benedé and Pablo Ibáñez and Miquel Moretó},
url = {https://www.sciencedirect.com/science/article/pii/S0167739X24001250},
doi = {10.1016/j.future.2024.03.050},
issn = {0167-739X},
year = {2024},
date = {2024-01-01},
journal = {Future Generation Computer Systems},
volume = {157},
pages = {313-329},
abstract = {Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm irruption in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While most genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extensions (SVE). This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. Overall, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically. Moreover, the porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. We present the optimizations implemented in each kernel and a detailed evaluation and comparison of their performance on four different HPC machines (i.e., A64FX, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome). Overall, the experimental evaluation shows that Graviton3 outperforms other machines on average. Moreover, we observed that the performance of the A64FX is significantly constrained by its small memory hierarchy and latencies. Additionally, as proof of concept, we study the performance of a production-ready tool that exploits two of the ported and optimized genomic kernels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
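To give a flavor of the vector-length-agnostic SVE style used when porting the kernels, here is a generic predicated loop written with standard ACLE intrinsics; it is a minimal illustration, not code from the suite.

#include <arm_sve.h>
#include <cstdint>

// Vector-length-agnostic SVE loop: svcntd() returns the number of doubles
// per vector at run time, and a predicate masks off the tail iterations.
void vec_add(const double* a, const double* b, double* c, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        svbool_t pg = svwhilelt_b64_s64(i, n);          // lanes where i+k < n
        svfloat64_t va = svld1_f64(pg, &a[i]);          // predicated loads
        svfloat64_t vb = svld1_f64(pg, &b[i]);
        svst1_f64(pg, &c[i], svadd_f64_x(pg, va, vb));  // add, then store
    }
}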
2023
Proceedings Articles
Soria-Pardos, Víctor; Armejach, Adrià; Mück, Tiago; Suárez-Gracia, Darío; Joao, José; Rico, Alejandro; Moretó, Miquel
DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations Proceedings Article
In: Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1–13, ACM, 2023.
@inproceedings{soria2023dynamo,
title = {DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations},
author = {Víctor Soria-Pardos and Adria Armejach and Tiago Mück and Dario Suárez-Gracia and José Joao and Alejandro Rico and Miquel Moretó},
url = {https://dl.acm.org/doi/abs/10.1145/3579371.3589065},
doi = {10.1145/3579371.3589065},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture},
pages = {1–13},
publisher = {ACM},
abstract = {With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far).
This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations.
Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
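A static coherence-state-based policy like the ones the abstract evaluates can be sketched in a few lines (an illustrative approximation, not the paper's exact policy definitions): execute the AMO near only when the requesting core already holds the line in a writable state, and far otherwise, so the private cache never pulls the line in just to update it.

// Hypothetical static AMO placement policy keyed on the coherence state
// of the target cache block in the requesting core's private cache.
enum class CoherenceState { Invalid, Shared, Exclusive, Modified };
enum class AmoLocation { Near, Far };

AmoLocation static_policy(CoherenceState s) {
    switch (s) {
        // The line is already writable locally: a near AMO is cheap.
        case CoherenceState::Modified:
        case CoherenceState::Exclusive:
            return AmoLocation::Near;
        // Otherwise avoid fetching the line; execute in the shared cache.
        default:
            return AmoLocation::Far;
    }
}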