Compiling Strassen-like Matrix Multiplication Algorithms to Fast CUDA Kernels (PLDI 2026 - PLDI Research Papers)

Track

PLDI 2026 PLDI Research Papers

This program is tentative and subject to change.

Time Zone

Times are shown in (GMT-06:00) Mountain Time (US & Canada), the conference time zone.

When

Fri 19 Jun 2026 10:30 - 10:50 at Flatirons 4 - Compiler Optimization for Accelerators

Abstract

Matrix multiplication is a key operation in scientific computing and machine learning, and GPU libraries such as NVIDIA CUTLASS and cuBLAS provide optimized implementations of the cubic algorithm based on three nested loops. While sub-cubic algorithms, such as the Strassen algorithm and its variants, are theoretically faster, their recursive structure makes it challenging to implement them as efficient GPU kernels. As a result, existing approaches either perform excessive memory accesses or fail to overlap memory accesses with computation effectively, leading to performance below theoretical expectations.
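To make the contrast between the cubic algorithm and a sub-cubic one concrete, here is a minimal pure-Python sketch of one recursion level of the classic Strassen scheme on square matrices of even size. It is only an illustration of the arithmetic, not SubCuber's kernels or any library's API; the helper names (mat_add, mat_sub, mat_mul, split, strassen_one_level) are hypothetical and chosen for this example.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul(X, Y):
    """Classical cubic algorithm: three nested loops."""
    n, k, m = len(X), len(Y), len(Y[0])
    Z = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                Z[i][j] += X[i][p] * Y[p][j]
    return Z


def split(X):
    """Split a square matrix of even size into four quadrants."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen_one_level(A, B):
    """One recursion level of Strassen: 7 block products instead of 8."""
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven block products; each operand is a sum or difference of quadrants.
    M1 = mat_mul(mat_add(A11, A22), mat_add(B11, B22))
    M2 = mat_mul(mat_add(A21, A22), B11)
    M3 = mat_mul(A11, mat_sub(B12, B22))
    M4 = mat_mul(A22, mat_sub(B21, B11))
    M5 = mat_mul(mat_add(A11, A12), B22)
    M6 = mat_mul(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = mat_mul(mat_sub(A12, A22), mat_add(B21, B22))

    # Output quadrants are sums and differences of the seven products.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Every operand of a block product, and every output quadrant, is itself a matrix sum or difference. A naive implementation materializes each of these as a separate pass over memory, which is the kind of extra memory traffic the abstract refers to.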

This paper presents SubCuber, a domain-specific compiler that generates efficient CUDA kernels for Strassen-like matrix multiplication algorithms. SubCuber contains two novel CUDA kernels designed to minimize input loads from memory and to overlap computation with memory loads effectively. To generate efficient code, SubCuber constructs the dependency graph of a Strassen-like algorithm, selects efficient kernel schedules, and applies fusion strategies tailored to the recursion level, matrix sizes, and GPU. Our evaluation on NVIDIA A100 and H200 GPUs shows that, for both single- and half-precision floating-point matrix multiplication, SubCuber’s generated code outperforms state-of-the-art CUDA implementations of matrix multiplication and the Strassen algorithm. SubCuber is up to 12% faster than CUTLASS and cuBLAS with one recursion level and up to 22% faster with two recursion levels, while existing approaches are only up to 8% faster with one level and 16% faster with two levels. Furthermore, SubCuber makes matrix multiplication in language models such as Phi-4 14B, Qwen-3 32B, and LLaMA-3 405B up to 16% faster in inference scenarios.
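As background for the one-level and two-level figures, the short sketch below counts block multiplications for the classic Strassen scheme (7 products per 2x2 split, versus 8 for the cubic algorithm). It is a back-of-the-envelope count under that assumption, not the paper's evaluation method: it ignores the extra block additions and all memory traffic, and other Strassen-like algorithms use different product counts. The function names are hypothetical.

```python
def block_multiplications(levels):
    """Return (classical, strassen) block-product counts after `levels` splits."""
    classical = 8 ** levels  # cubic algorithm: 8 block products per 2x2 split
    strassen = 7 ** levels   # classic Strassen: 7 block products per 2x2 split
    return classical, strassen


def multiplications_saved(levels):
    """Fraction of block products avoided relative to the cubic algorithm."""
    classical, strassen = block_multiplications(levels)
    return 1 - strassen / classical


for k in (1, 2):
    c, s = block_multiplications(k)
    print(f"{k} level(s): {s} vs {c} block products, "
          f"{multiplications_saved(k):.1%} fewer")
# prints:
# 1 level(s): 7 vs 8 block products, 12.5% fewer
# 2 level(s): 49 vs 64 block products, 23.4% fewer
```

These fractions bound the arithmetic savings from above; how much of that turns into wall-clock speedup depends on how cheaply the additions and memory accesses can be scheduled.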

DOI

https://doi.org/10.1145/3808267