
MONOCODER: Domain-Specific Code Language Model for HPC Codes and Tasks
MONOCODER is a smaller, HPC-specific language model that outperforms larger, general-purpose LLMs on HPC code tasks, including both code generation and code comprehension.

Abstract
This paper introduces MonoCoder, a domain-specific language model designed for high-performance computing (HPC) tasks. The researchers developed MonoCoder by pre-training on HPCorpus, an HPC-specific dataset of C and C++ programs mined from GitHub. Despite being significantly smaller than comparable models, MonoCoder demonstrates superior performance on HPC-related tasks, outperforming larger state-of-the-art multilingual LLMs on normalized perplexity tests and delivering competitive CodeBLEU scores for parallel and high-performance code generation. This research demonstrates that specialized, domain-focused models can achieve better results with fewer parameters than general-purpose alternatives.
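The abstract cites normalized perplexity as the headline metric. As a rough illustration, length-normalized perplexity is the exponential of the mean negative log-probability the model assigns to each token; normalizing by token count makes scores comparable across snippets of different lengths. The sketch below is a generic illustration of that metric, not the paper's evaluation code, and assumes per-token log-probabilities are already available from some model.

```python
import math

def normalized_perplexity(token_logprobs):
    """Length-normalized perplexity: exp of the mean negative log-probability.

    Dividing by the token count normalizes for sequence length, so scores
    are comparable across code snippets (and models) of different sizes.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical example: a model assigning probability 0.5 to each of 4 tokens
lps = [math.log(0.5)] * 4
print(normalized_perplexity(lps))  # → 2.0
```

A lower score means the model finds the held-out code less surprising; the paper's comparison is that MonoCoder achieves lower normalized perplexity on HPC code than much larger general-purpose models.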
Related Research
UniPar: A Unified LLM-Based Framework for Parallel and Accelerated Code Translation in HPC
Introduces UniPar, an evaluation framework for assessing how large language models translate between parallel programming languages, achieving 69% compilation success and 33% functional correctness through fine-tuning and compiler-guided repair.
Workflows vs Agents for Code Translation
Compares structured workflows versus agentic approaches for MATLAB-to-HDL translation, showing that agentic methods with the Model Context Protocol increase simulation reach rates by over 20 percentage points on mid-sized models.