Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Introduces gpuFLOPBench, a benchmark containing 577 CUDA kernels to evaluate whether language models can predict floating-point operation counts without execution, revealing limitations in understanding hardware-specific performance details.

Published on December 4, 2025

Abstract

This research introduces gpuFLOPBench, a benchmark containing 577 CUDA kernels designed to evaluate whether language models can predict floating-point operation counts without execution. The study reveals that modern LLMs handle straightforward kernels well but struggle significantly with complex scenarios involving division, math functions, or shared subexpressions—highlighting fundamental limitations in understanding hardware-specific performance details. These findings have important implications for using LLMs in performance optimization and code analysis tasks.
