This program is tentative and subject to change.

Wed 17 Jun 2026 16:30 - 16:50 at Flatirons 3 - GPU Programming

In the CUDA programming model, data transfers on the default stream are synchronous, and, similarly, device kernels launched on the default stream cannot overlap with other kernel computations and data transfers. Overlapping execution can be enabled using asynchronous APIs and streams in CUDA. Using them, however, requires careful handling of data dependencies across multiple data-transfer calls, host operations, and kernel computations to ensure program correctness. Moreover, the numerous data-transfer and kernel calls in a program make it even more challenging to manually assign the appropriate stream identifier to each such call. This challenge remains daunting for non-expert programmers, who lack the right tools and expertise.

To address this, we propose sync2async, a novel optimization technique that transforms synchronous data transfers and kernel launches into non-default-stream asynchronous calls by allocating stream identifiers (and adding stream synchronizations at appropriate places) to maximize parallelizability while preserving dependencies. To identify sync2async opportunities and apply transformations, we introduce StreamAlloc, a data-flow-analysis-based framework with four components: (1) an inter-procedural compositional read-write analysis to identify variables read and written at call sites, (2) an intra-procedural flow-sensitive Can-Run-Asynchronously (CRA) analysis to detect data-transfer and kernel calls that can run asynchronously, (3) a Data Flow Stream Assignment (DFSA) algorithm to schedule such asynchronous calls onto different non-default streams, and (4) a transformation framework to apply sync2async and automatically optimize the input program. We have implemented StreamAlloc using LLVM/Clang. On P100, A4000, and A100 GPUs, sync2async achieves geomean speedups of 1.49x, 1.63x, and 2.02x over the baseline, respectively.
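To make the sync2async transformation concrete, the sketch below is a hand-written illustration (not StreamAlloc's actual output) of the idea the abstract describes: two independent transfer-plus-kernel chains are moved off the default stream onto distinct non-default streams so their copies and kernels can overlap, with synchronization added only where the host depends on the results. The `scale` kernel and the surrounding `run` function are hypothetical names introduced for this example.

```cuda
// Hand-written sketch of sync2async (not StreamAlloc output):
// independent call chains get distinct non-default streams.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run(float *hA, float *hB, float *dA, float *dB, int n) {
    size_t bytes = n * sizeof(float);

    // Before: default-stream calls serialize copies and kernels.
    //   cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    //   scale<<<(n + 255) / 256, 256>>>(dA, n);
    //   cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    //   scale<<<(n + 255) / 256, 256>>>(dB, n);

    // After: the A-chain and B-chain touch disjoint data, so each
    // gets its own stream (host buffers must be pinned, e.g. via
    // cudaMallocHost, for the copies to be truly asynchronous).
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);

    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(dB, n);

    // Synchronize only where the host next reads the results.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

The in-stream ordering preserves each chain's copy-before-kernel dependency, while the two streams let the hardware overlap one chain's transfer with the other's computation, which is the parallelism sync2async aims to expose automatically.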

Wed 17 Jun

Displayed time zone: Mountain Time (US & Canada)

15:50 - 17:30
15:50
20m
Talk
Kuiper: Correct and Efficient GPU Programming with Dependent Types and Separation Logic
PLDI Research Papers
Guido Martínez Microsoft Research, Bastian Köpcke TU Berlin, Jonáš Fiala ETH Zurich, Gabriel Ebner Microsoft Research, Tahina Ramananandro Microsoft Research, Michel Steuwer TU Berlin, Tyler Sorensen Microsoft Research, Nikhil Swamy Microsoft Research
16:10
20m
Talk
Modular GPU Programming with Typed Perspectives
PLDI Research Papers
Manya Bansal Massachusetts Institute of Technology, Daniel Sainati University of Pennsylvania, Joseph W. Cutler University of Pennsylvania, Saman Amarasinghe Massachusetts Institute of Technology, Jonathan Ragan-Kelley Massachusetts Institute of Technology
16:30
20m
Talk
[TOPLAS] StreamAlloc: A Framework for Analyzing and Transforming CUDA Code to Enable Asynchronous Execution
PLDI Research Papers
Soumik Kumar Basu IIT Hyderabad, Jyothi Vedurada IIT Hyderabad
16:50
20m
Talk
SIMT-Step Execution: A Flexible Operational Semantics For GPU Subgroup Behavior
PLDI Research Papers
Zheyuan Chen University of California at Santa Cruz, Naomi Rehman University of California at Santa Barbara, Guido Martínez Microsoft Research, Tyler Sorensen Microsoft Research; University of California at Santa Cruz
17:10
20m
Talk
Uniformity Analysis in the WebGPU Shading Language
PLDI Research Papers
James Lee-Jones Imperial College London, John Wickerson Imperial College London, Alastair F. Donaldson Imperial College London