Darío Suárez Gracia

66 entries « ‹ 1 of 14 › »

2025

Journal Articles

Pedrajas, Samuel Pérez; Resano, Javier; Gracia, Darío Suárez

BnnRV: Hardware and Software Optimizations for Weight Sampling in Bayesian Neural Networks on Edge RISC-V Cores

In: IEEE Transactions on Circuits and Systems for Artificial Intelligence, pp. 1–12, 2025, ISSN: 2996-6647.

Proceedings Articles

Soria-Pardos, Víctor; Armejach, Adrià; Suárez, Darío; Martinot, Didier; Grasset, Arnaud; Moretó, Miquel

FLAMA: Architecting floating-point atomic memory operations for heterogeneous HPC systems

In: 2025 28th Euromicro Conference on Digital System Design (DSD), pp. 435–442, IEEE, 2025.

@inproceedings{soria2025flama,

title = {FLAMA: Architecting floating-point atomic memory operations for heterogeneous HPC systems},

author = {Víctor Soria-Pardos and Adrià Armejach and Darío Suárez and Didier Martinot and Arnaud Grasset and Miquel Moretó},

url = {https://upcommons.upc.edu/server/api/core/bitstreams/9199c411-ce89-4327-a06b-bf21838aa8db/content},

doi = {10.1109/DSD67783.2025.00066},

year  = {2025},

date = {2025-01-01},

urldate = {2025-01-01},

booktitle = {2025 28th Euromicro Conference on Digital System Design (DSD)},

pages = {435--442},

publisher = {IEEE},

organization = {IEEE},

abstract = {Current heterogeneous systems integrate general-purpose Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Neural Processing Units (NPUs). The efficient use of such systems requires a significant programming effort to distribute computation and synchronize across devices, which usually involves using Atomic Memory Operations (AMOs). Arm recently launched a floating-point Atomic Memory Operations (FAMOs) extension to perform atomic updates on floating-point data types specifically. This work characterizes and models heterogeneous architectures to understand how floating-point AMOs impact graph, Machine Learning (ML), and high-performance computing (HPC) workloads. Our analysis shows that many AMOs are performed on floating-point data, which modern systems execute using inefficient compare-and-swap (CAS) constructs. Therefore, replacing CAS-based constructs with FAMOs can improve a wide range of workloads. Moreover, we analyze the trade-offs of executing FAMOs at different memory hierarchy levels, either in private caches (near) or remotely in shared caches (far). We have extended the widely used AMBA CHI protocol to evaluate such FAMO support on a simulated chiplet-based heterogeneous architecture. While near FAMOs achieve an average 1.34× speed-up, far FAMOs reach an average 1.58× speed-up. We conclude that FAMOs can bridge the gap between CPU architecture and accelerators, enabling synchronization in key application domains.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

Soria-Pardos, Víctor; Armejach, Adrià; Mück, Tiago; Gracia, Darío Suárez; Joao, Jose; Moretó, Miquel

Delegato: Locality-Aware Atomic Memory Operations on Chiplets

In: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, pp. 1793–1808, ACM, 2025.

@inproceedings{soria2025delegato,

title = {Delegato: Locality-Aware Atomic Memory Operations on Chiplets},

author = {Víctor Soria-Pardos and Adrià Armejach and Tiago Mück and Darío Suárez Gracia and Jose Joao and Miquel Moretó},

url = {https://dl.acm.org/doi/full/10.1145/3725843.3756030},

doi = {10.1145/3725843.3756030},

year  = {2025},

date = {2025-01-01},

urldate = {2025-01-01},

booktitle = {Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture},

pages = {1793--1808},

publisher = {ACM},

abstract = {The irruption of chiplet-based architectures has been a game changer, enabling higher transistor integration and core counts in a single socket. However, chiplets impose higher and non-uniform memory access (NUMA) latencies than monolithic integration. This harms the efficiency of atomic memory operations (AMOs), which are fundamental to implementing fine-grained synchronization and concurrent data structures on large systems. AMOs are executed either near the core (near) or at a remote location within the cache hierarchy (far). On near AMOs, the core’s private cache fetches the target cache line in exclusiveness to modify it locally. Near AMOs cause significant data movement between private caches, especially harming parallel applications’ performance on chiplet-based architectures. Alternatively, far AMOs can alleviate the communication overhead by reducing data movement between processing elements. However, current multicore architectures only support one type of far AMO, which sends all updates to a single serialization point (centralized AMOs).

This work introduces two new types of far AMOs, delegated and migrating, that execute AMOs remotely without centralizing updates in a single point of the cache hierarchy. Combining centralized, delegated, and migrating AMOs allows the directory to select the best location to execute AMOs. Moreover, we propose Delegato, a tracing optimization to effectively transport usage information from private caches to the directory to predict the best atomic type to issue accurately. Additionally, we design a simple predictor on top of Delegato that seamlessly selects the best placement to perform AMOs based on the data access pattern and usage activity of cores. Our evaluation using gem5 shows that Delegato can speed up applications on average by 1.07× over centralized AMOs and by 1.13× over the state-of-the-art AMO predictor.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

2024

Proceedings Articles

Pérez, Samuel; Resano, Javier; Gracia, Darío Suárez

Accelerating Bayesian Neural Networks on Low-Power Edge RISC-V Processors

In: 2024 IEEE 24th International Conference on Nanotechnology (NANO), pp. 507–512, 2024, ISSN: 1944-9380.

2023

Proceedings Articles

Soria-Pardos, Víctor; Armejach, Adria; Mück, Tiago; Suárez-Gracia, Dario; Joao, José; Rico, Alejandro; Moretó, Miquel

DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations

In: Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1–13, ACM, 2023.
