Oral
- Coding With “Enemy”: Can Human Developers Detect AI Agent Sabotage?
- ProgramBench: Can Language Models Rebuild Programs From Scratch?
- Hawkeye: Hardware-Aware GPU Kernel Optimization with Minimal Supervision
- Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model
- DevBench: An Interaction-Grounded Benchmark for Code Completion Models
Poster
- TritonRL: Training LLMs to Think and Code Triton Without Cheating
- SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
- SkillFlow: Scalable and Efficient Agent Skill Retrieval System
- MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection
- Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
- Orchestrating LLMs as Hierarchical Multi-Agent Reinforcement Learning System for Automotive Software Development
- BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
- Detecting Functional Memorization in Code Language Models
- SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
- Step Rejection Fine-Tuning: A Practical Distillation Recipe
- Language-based Trial and Error Falls Behind in the Era of Experience
- Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
- GameDevBench: Evaluating Agentic Capabilities Through Game Development
- Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
- SynKer: Synthesize, Kernelize, Reinforce - Teaching GPU Kernel Generation to Small Language Models
- The Scaffold Effect in Coding Agents: Harness Choice as a Hidden Variable in Coding-Agent Evaluation
- Teaching LLMs Program Semantics via Symbolic Execution Traces
- LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
- How can we assess human-agent interactions? Case studies in software agent design
- Evolutionary Multi-Task Optimization for LLM-Guided Discovery
- Comparing Developer and LLM Biases in Code Evaluation
- Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software
- Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
- RepoGuardBench: Repository-Borne Prompt Injection Attacks and Lightweight Defenses for Local Coding Agents
- ParallelKernelBench: Can LLMs Write Fast Multi-GPU Kernels?
- Beyond Pass@k: Developing Human-Centric Evaluation Metrics for AI Pair Programmers
- Beyond Lexical Similarity: A Benchmark for Evaluating Code Documentation Agents
- Frozen Inner Looping Is Brittle for Code Generation with BitNet b1.58
- DeRL-SWE: Decoupled Reinforcement Learning with Nested Credit Assignment for Software Engineering
- Do AI Agents Write Less Maintainable Code Than Human Developers?
- Interactive Benchmarking of Scientific Coding Agents for Spatial Transcriptomics Alignment
- Making Execution Time a Trainable Reward for Code Generation
- JAXBench: Benchmarking Autonomous TPU Kernel Optimization
- Matching Decompilation as a Verifier-Guided Task for Human-Centered Coding Agents
- Understanding and supporting how developers prompt for LLM-powered code editing in practice
- Loop or Leap? Benchmarking Iterative vs Recursive Reasoning in Code LLMs with LRLBench
- ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis
- Agentic Neural Architecture Search
- SWE-chat: Coding Agent Interactions From Real Users in the Wild
- Safety Drift in Human-Centered Coding Agents: Suppression vs. Representation Shaping After Benign Adaptation
- Improving Code Efficiency with Iterative Refinement using LLMs
- CodeCast: Context-Conditional Code Generation for Multimodal Time Series Forecasting
- VeriBench: An End-to-End Formal Verification Benchmark for AI Coding Agents in Lean 4
- AI Coding Benchmarks Need Proofs, Not Just Tests
- LLM-Based Rust Code Generation with On-the-Fly Compiler Feedback
- kAgent: An execution-guided crash resolution agent for the Linux kernel
- Scalable and Transparent Attribution for Human-AI Collaborative Code
- KForge: A Multi-Agent System for Cross-Platform Kernel Synthesis
- Execution-Grounded Agents: Enforcing Physical Constraints in AI Code Generation via Oracle Search
- Don’t Let Gains FADE: Breaking Down Policy Gradient Weights in RL
- Hidden Positives: Why Code Retrieval Benchmarks Underestimate Model Quality
- SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks
- Steerability via constraints: a substrate for scalable oversight of coding agents
- On Data Engineering for Scaling LLM Terminal Capabilities
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
- Learning Bug Context for PyTorch-to-JAX Translation with LLMs
- Certifying the Judge: Falsifiable Properties for LLM-Based Evaluation of Formal Code
- Rapid Fixes, Gradual Failures: Exploring Iterative Self-Correction Dynamics in Large Language Models for Program Synthesis
- Selective Code Generation under Correctness and Security Risk
- Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
- How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs
- optimize_anything: Unified Text Optimization can Outperform Specialized Systems
- CRCA-Context: Counterfactual Robustness of Repository Context Retrieval Under Equivalent Issue Descriptions
- Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution
- Test-Time Scaling with Weak Verifiers via Self-Play
- SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization
- SLEEP: Simulated Future Learning Environments for Automated Evolution of Heuristic Portfolios
- Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering
- Adversarial Review: Structured Disagreement for Grounded Agentic Code Review
- Towards Evaluation of Implicit Software World Models in Coding LLMs
- Don’t Claim Benchmark-Oriented Optimization Improves General Coding Capability — Diverse Evaluation Is Required
- Benchmarks as Software: A Case Study on Terminal-Bench
- Can LLMs Detect Benchmark Defects? A Meta-Benchmark from Benchmark Updates
- From Search to Policy: Thompson Sampling Trees for Robust LLM Code Refinement
- Interpreting Code Correctness in Language Models through Activation Steering
- An Empirical Study of Proactive Coding Assistants in Real-World Software Development
- V1: Unifying Generation and Self-Verification for Parallel Reasoners
- Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
- NKI-Agent: Domain-Specific Fine-Tuning and Agentic Tool Use for Neuron Kernel Generation
- VibeSWEBench: Can AI Co-Worker Agents Do Real-World Software Engineering by Vibe Coding?
- ExVerus: Verus Proof Repair via Counterexample Reasoning
- Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
- QuantumSemEval: Benchmarking LLMs’ Understanding of Quantum Programs via Semantic Equivalence Checking