Modern machine learning relies heavily on custom kernels for performance; these kernels are often written in hardware-specific languages and accumulate technical debt. Helion addresses this by compiling a high-level Python domain-specific language (DSL) into optimized Triton code, automating low-level details and hardware-specific tuning. With its PyTorch-like syntax and autotuning engine, Helion delivers fast, portable performance while significantly reducing development effort. Helion is open source at https://github.com/pytorch/helion. This 4-hour tutorial will present Helion through a series of talks, demonstrations, and practical, hands-on development sessions.
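To give a flavor of the programming model, a simple elementwise-add kernel in Helion's PyTorch-like DSL looks roughly like the following. This is an illustrative sketch in the style of the examples in the repository; it requires the helion package and a supported GPU, so the exact decorator and `hl.tile` API shown here should be checked against the project documentation:

```python
import torch
import helion
import helion.language as hl

@helion.kernel()  # compiles this function to an autotuned Triton kernel
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    # The tile loop is the unit the compiler maps onto the GPU grid;
    # tile sizes are chosen by the autotuner, not the author.
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```

The author writes ordinary tensor indexing; block sizes, launch configuration, and other low-level choices are left to the compiler and autotuner.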
- Introduction to Helion (15 minutes): We will provide an overview of Helion, including its underlying motivation, programming model, overall design architecture, and various use cases.
- Autotuning in Helion (1 hour): A key feature of Helion is its scalable autotuning framework that explores a vast configuration space, where one Helion kernel can map to thousands of Triton kernels. In this session, we detail the configuration space that Helion explores, illustrate how different configurations map to Triton code, and examine the various search strategies that Helion utilizes, such as random sampling, evolutionary algorithms, and pattern search. Attendees will also have the opportunity for hands-on experience in autotuning Helion kernels.
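The search strategies named above can be sketched in a few lines of pure Python. The configuration space and cost function below are hypothetical stand-ins (Helion's real configs also cover loop orders, indexing strategies, and more, and the cost of a config is measured by compiling and timing the generated Triton kernel), but the random-sampling and pattern-search logic mirrors the general idea:

```python
import random

# Hypothetical two-knob configuration space; real Helion configs are larger.
SPACE = {
    "block_size": [16, 32, 64, 128, 256],
    "num_warps": [1, 2, 4, 8],
}

def cost(cfg):
    # Mock timing: pretend block_size=64, num_warps=4 is the optimum.
    return abs(cfg["block_size"] - 64) + 10 * abs(cfg["num_warps"] - 4)

def random_sample(n=20, seed=0):
    # Draw n random configs and keep the cheapest as a starting point.
    rng = random.Random(seed)
    return min(
        ({k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)),
        key=cost,
    )

def pattern_search(start):
    # Greedy neighbor search: move one knob at a time while cost improves.
    best = dict(start)
    improved = True
    while improved:
        improved = False
        for key, values in SPACE.items():
            i = values.index(best[key])
            for j in (i - 1, i + 1):
                if 0 <= j < len(values):
                    cand = {**best, key: values[j]}
                    if cost(cand) < cost(best):
                        best, improved = cand, True
    return best

best = pattern_search(random_sample())
```

An evolutionary search would extend this by keeping a population of configs and recombining the knobs of the best performers between generations.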
- Compiler Architecture and Integration with TorchInductor (45 minutes): The Helion compiler architecture progressively lowers Python functions into highly optimized Triton code, utilizing TorchInductor as its backend. The key stages of this compilation pipeline are Python AST parsing, type propagation, device IR lowering, a series of compiler passes, and finally code generation. We will detail the integration between Helion and TorchInductor, explaining how this interface enables Helion to target both GPU and non-GPU hardware and how users can incorporate their own custom backends.
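The first stage of this pipeline, Python AST parsing, can be illustrated with the standard `ast` module. The kernel source below is a hypothetical fragment (names like `tile_range` and `empty_like` are unresolved placeholders at this stage, which is exactly what a later type-propagation pass would resolve); the snippet simply locates the loop nest that would become device code:

```python
import ast

# Hypothetical kernel source in a DSL-like style; names are not
# resolved at parse time.
SRC = """
def add(x, y):
    out = empty_like(x)
    for tile in tile_range(out):
        out[tile] = x[tile] + y[tile]
    return out
"""

# Stage 1: parse the Python source into an abstract syntax tree.
tree = ast.parse(SRC)

# A compiler pass can now walk the tree to find the tile loops that
# will be lowered to device IR and, eventually, Triton code.
loops = [n for n in ast.walk(tree) if isinstance(n, ast.For)]

# The tensors subscripted inside the loop body are what a later
# type-propagation pass would assign device-tensor types to.
subscripts = {n.value.id for n in ast.walk(loops[0]) if isinstance(n, ast.Subscript)}
```

Helion's actual pipeline is of course far richer, but every later stage operates on analyses rooted in this parsed representation.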
- Distributed Support (45 minutes): We will describe how Helion enables kernel authors to express compute-communication fused kernels directly, using communication primitives and tile-level scheduling, yielding portable fused kernels without any hardware-specific code. In this session, we will demonstrate a fused kernel implemented in Helion, show how it maps to the underlying Triton code, and highlight the latency improvements achieved through compute-communication fusion.
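Why tile-level scheduling reduces latency can be seen in a toy pipeline model (not Helion code; the per-tile costs are made-up constants). Unfused execution computes all tiles and then communicates all tiles; fused execution sends each tile as soon as it is ready, so communication of one tile overlaps computation of the next:

```python
# Toy critical-path model of compute-communication fusion.
N_TILES = 8
COMPUTE = 2  # time units to compute one tile (assumed)
COMM = 3     # time units to communicate one tile (assumed)

def unfused_latency():
    # Compute every tile, then communicate every tile, sequentially.
    return N_TILES * COMPUTE + N_TILES * COMM

def fused_latency():
    # Each tile is sent as soon as it finishes; the communication link
    # is busy until the previous tile's transfer completes.
    comm_free = 0
    for i in range(N_TILES):
        ready = (i + 1) * COMPUTE             # tile i finishes computing
        comm_free = max(comm_free, ready) + COMM  # then occupies the link
    return comm_free
```

In this model the fused schedule hides almost all of the compute behind communication (26 vs. 40 time units); the session's real measurements show the analogous effect on GPUs with actual collectives.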
- Kernel Benchmarking (1 hour 15 minutes): Helion delivers competitive or superior performance across a diverse set of workloads compared to vendor libraries, hand-optimized kernels, and compiler-generated kernels. In this session, we will showcase a variety of real-world use cases (including attention kernels) and present comprehensive benchmarking results that highlight Helion's performance advantages. Attendees will also have the opportunity to participate hands-on, learning how to author and optimize their own Helion kernels through guided examples.
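A fair kernel comparison needs warmup (to exclude compilation and autotuning from the timing), repeated runs with a robust statistic, and a correctness check before any speed claim. The harness below sketches that methodology in pure Python; the two stand-in workloads are placeholders for a Helion kernel and a baseline that produce identical outputs:

```python
import time
import statistics

def bench(fn, *args, warmup=3, iters=20):
    # Warmup runs absorb one-time costs (JIT compilation, autotuning,
    # cache population) that should not count toward steady-state time.
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    # Median is robust to scheduler noise and outlier iterations.
    return statistics.median(times)

# Stand-in workloads; in the tutorial these would be a Helion kernel
# and a vendor or hand-written baseline.
data = list(range(10_000))

def baseline(xs):
    return sum(x * x for x in xs)

def candidate(xs):
    acc = 0
    for x in xs:
        acc += x * x
    return acc

# Always validate outputs match before comparing speed.
assert baseline(data) == candidate(data)
t_base, t_cand = bench(baseline, data), bench(candidate, data)
print(f"speedup: {t_base / t_cand:.2f}x")
```

For GPU kernels the same structure applies, with device synchronization around the timed region; the hands-on session applies this methodology to attendees' own Helion kernels.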