Operator fusion, which combines multiple deep learning operators into a single kernel,
has become a key optimization for deep learning: it improves data reuse and reduces
global memory transfers.
However, existing tensor compilers struggle to fuse complex reduction computations
that involve loop-carried dependencies, such as those in attention mechanisms.

This paper introduces Neptune, a tensor compiler that performs advanced operator fusion
over sequences of reduction operators.
Neptune takes a new approach to operator fusion: it intentionally breaks some of the
existing dependencies and compensates by constructing algebraic correction expressions
that allow the fused kernel to still produce the correct result. Applying Neptune's
fusion to a plain attention operator yields kernels equivalent to FlashAttention and FlashDecoding.
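To make the idea concrete, the following is a minimal NumPy sketch (an illustration, not Neptune's generated code) of the kind of correction such fusion relies on: the softmax reduction in attention is split into blocks, the loop-carried dependency on the global row maximum is broken, and previously accumulated partial results are rescaled by an algebraic correction factor, the same mechanism that underlies FlashAttention's online softmax.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Attention with the softmax reduction computed block by block.

    Each block maintains a running row max `m` and running denominator
    `l`; when `m` changes, the already-accumulated output is rescaled
    by the correction factor exp(m_old - m_new).
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # partial score block
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)               # correction for stale max
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(axis=1)           # rescale old denominator
        O = O * corr[:, None] + P @ Vb         # rescale old accumulator
        m = m_new
    return O / l[:, None]
```

Here `corr = exp(m_old - m_new)` plays the role of the algebraic correction expression that compensates for the broken max dependency; Neptune's contribution is to construct such expressions for general sequences of reduction operators rather than hand-deriving them per kernel.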

On ten attention-based benchmarks, Neptune, starting from plain attention code
and a high-level scheduling template, outperforms existing compilers and libraries
such as Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention.
Across four different GPU architectures from NVIDIA and AMD,
Neptune-generated kernels achieve an average speedup of $1.35\times$ over the next best alternative,
with up to $2.65\times$ speedup on NVIDIA GPUs and up to $3.32\times$ on AMD GPUs,
demonstrating its effectiveness for deep learning workloads.