Compiling Strassen-like Matrix Multiplication Algorithms to Fast CUDA Kernels
This program is tentative and subject to change.
Matrix multiplication is a key operation in scientific computing and machine learning, and GPU libraries such as NVIDIA Cutlass and cuBLAS provide optimized implementations of the cubic algorithm built on three nested loops. Sub-cubic algorithms, such as the Strassen algorithm and its variants, are theoretically faster, but their recursive structure makes it challenging to implement efficient GPU kernels. As a result, existing approaches either perform excessive memory accesses or fail to overlap memory accesses with computation effectively, leading to sub-optimal performance compared to theoretical expectations.
This paper presents SubCuber, a domain-specific compiler that generates efficient CUDA kernels for Strassen-like matrix multiplication algorithms. SubCuber contains two novel CUDA kernels that are designed to minimize memory input loads and effectively overlap computation with memory loads. To generate efficient code, SubCuber constructs the dependency graph of a Strassen-like algorithm, selects efficient kernel schedules, and applies fusion strategies tailored to the recursion level, matrix sizes, and target GPU. Our evaluation on NVIDIA A100 and H200 GPUs shows that for both single- and half-precision floating point matrix multiplications, SubCuber’s generated code outperforms state-of-the-art CUDA implementations for matrix multiplication and the Strassen algorithm. SubCuber is up to 12% faster than Cutlass and cuBLAS with one recursion level and up to 22% faster with two recursion levels, whereas existing approaches are at most 8% faster with one level and 16% faster with two levels. Furthermore, SubCuber makes matrix multiplication in language models like Phi-4 14B, Qwen-3 32B, and LLaMA-3 405B up to 16% faster in inference scenarios.
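For readers unfamiliar with the recursion the abstract refers to, the following is a minimal sketch of one level of the classic Strassen algorithm in plain Python. It is not SubCuber's generated code or its CUDA kernels; the function names (`strassen_one_level`, `ideal_speedup`, and the helpers) are hypothetical and chosen only for illustration. The sketch assumes square matrices of even dimension stored as lists of lists. One level replaces the 8 block products of the cubic algorithm with 7, so the multiplication count alone bounds the gain at 8/7 (about 14%) for one level and 64/49 (about 31%) for two levels, before the cost of the extra block additions and memory traffic is counted.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def mat_mul_naive(X, Y):
    """Cubic matrix multiplication with three nested loops."""
    n, k, m = len(X), len(Y), len(Y[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += X[i][p] * Y[p][j]
    return C


def split(M):
    """Split a square, even-dimension matrix into four quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def strassen_one_level(A, B):
    """One recursion level of Strassen: 7 block products instead of 8."""
    n = len(A)
    assert n % 2 == 0 and all(len(r) == n for r in A + B), "need even square"
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    mul = mat_mul_naive  # base case; a deeper recursion would recurse here

    M1 = mul(mat_add(A11, A22), mat_add(B11, B22))
    M2 = mul(mat_add(A21, A22), B11)
    M3 = mul(A11, mat_sub(B12, B22))
    M4 = mul(A22, mat_sub(B21, B11))
    M5 = mul(mat_add(A11, A12), B22)
    M6 = mul(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = mul(mat_sub(A12, A22), mat_add(B21, B22))

    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    # Stitch the four result quadrants back into one matrix.
    top = [l + r for l, r in zip(C11, C12)]
    bottom = [l + r for l, r in zip(C21, C22)]
    return top + bottom


def ideal_speedup(levels):
    """Upper bound from multiplication counts only: (8/7) ** levels."""
    return (8 / 7) ** levels
```

The seven products depend only on sums and differences of input quadrants, and each output quadrant is a sum of a few products. Those extra additions are the source of the additional memory traffic that the abstract says existing GPU implementations handle poorly.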
Fri 19 Jun
Displayed time zone: Mountain Time (US & Canada)
10:30 - 12:10
10:30 | 20m | Talk | Compiling Strassen-like Matrix Multiplication Algorithms to Fast CUDA Kernels | PLDI Research Papers | Abhinav Jangda (Microsoft Research)
10:50 | 20m | Talk | Parameterized Algorithms and Complexity for Function Merging with Branch Reordering | PLDI Research Papers | Amir K. Goharshady (University of Oxford), Kerim Kochekov (Hong Kong University of Science and Technology), Tian Shu (Hong Kong University of Science and Technology), Ahmed Khaled Zaher (Hong Kong University of Science and Technology)
11:10 | 20m | Talk | NEURA: A Unified and Retargetable Compilation Framework for Coarse-Grained Reconfigurable Architectures | PLDI Research Papers | Shangkun Li (Hong Kong University of Science and Technology), Jinming Ge (Hong Kong University of Science and Technology), Diyuan Tao (Independent Researcher), Zeyu Li (Hong Kong University of Science and Technology), Jiawei Liang (Hong Kong University of Science and Technology), Linfeng Du (Hong Kong University of Science and Technology), Jiang Xu (Hong Kong University of Science and Technology (Guangzhou)), Wei Zhang (Hong Kong University of Science and Technology), Cheng Tan (Google; Arizona State University)
11:30 | 20m | Talk | Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs | PLDI Research Papers | Yifan Zhao (University of Illinois Urbana-Champaign), Egan Johnson (University of Illinois Urbana-Champaign), Prasanth Chatarasi (IBM Research), Vikram S. Adve (University of Illinois Urbana-Champaign), Sasa Misailovic (University of Illinois Urbana-Champaign)
11:50 | 20m | Talk | SparseZETA: Intelligent Auto-tuner for Designing High-Performance SpMV Programs | PLDI Research Papers | Zhen Du (Institute of Computing Technology at Chinese Academy of Sciences), Ying Liu (Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences), Xionghui Chen (Nanjing University), Yanbo Zhao (North Carolina State University), Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences), Huimin Cui (Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jiajia Li (North Carolina State University)
