
Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity
Introduces gpuFLOPBench, a benchmark containing 577 CUDA kernels to evaluate whether language models can predict floating-point operation counts without execution, revealing limitations in understanding hardware-specific performance details.

Abstract
This research introduces gpuFLOPBench, a benchmark of 577 CUDA kernels designed to evaluate whether language models can predict floating-point operation counts without executing the code. The study finds that modern LLMs handle straightforward kernels well but struggle significantly with kernels involving division, math functions, or shared subexpressions, which highlights fundamental limitations in their understanding of hardware-specific performance details. These findings bear directly on the use of LLMs for performance optimization and code analysis.
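To make the "straightforward" case concrete, here is a minimal Python sketch of the kind of hand count such a kernel allows. Everything in it is an illustrative assumption rather than part of the benchmark: the kernel body (`if (i < n) y[i] = a * x[i] + y[i];`, one element per thread), the function name `saxpy_fp32_flops`, and the convention of counting a fused multiply-add as two operations.

```python
def saxpy_fp32_flops(n: int, grid_dim: int, block_dim: int,
                     fma_as_two: bool = True) -> int:
    """Hand-count FP32 operations for a hypothetical one-element-per-thread
    kernel whose body is `if (i < n) y[i] = a * x[i] + y[i];`.

    Assumption: the counting convention (FMA as 2 ops or 1) is a free choice
    here and is not taken from the benchmark.
    """
    launched = grid_dim * block_dim       # total threads in the launch
    active = min(n, launched)             # threads that pass the bounds guard
    per_thread = 2 if fma_as_two else 1   # multiply + add, or one fused FMA
    return active * per_thread


n = 1000
block = 256
grid = (n + block - 1) // block           # ceiling division: enough blocks to cover n
print(saxpy_fp32_flops(n, grid, block))   # 1000 active threads * 2 ops each
```

The count here follows directly from the source text and the launch configuration. That stops being true for the harder cases the abstract names: division and math-library calls are generally lowered by the compiler into multi-instruction sequences, and shared subexpressions may be computed once rather than once per occurrence, so the executed operation count depends on compiler and hardware behavior that the source alone does not show.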