Papers

Oral Presentations

  1. EquiBench: Benchmarking Large Language Models’ Understanding of Program Semantics via Equivalence Checking

  2. Constrained Decoding of Diffusion LLMs with Context-Free Grammars

  3. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

  4. SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

  5. Training LLM Agents to Empower Humans

Poster Presentations

  1. Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

  2. Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces

  3. CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

  4. Improving Assembly Code Performance with Large Language Models via Reinforcement Learning

  5. SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas

  6. VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation

  7. Where’s the Bug? Attention Probing for Scalable Fault Localization

  8. Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

  9. Demystify the Potential of Large Language Models as General-Purpose Surrogate Code Executors

  10. Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

  11. SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

  12. CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs

  13. CoDyn: Dynamic LLM Routing for Coding Tasks

  14. Cyber-Zero: Training Cybersecurity Agents without Runtime

  15. Training Language Model Agents to Find Vulnerabilities with CTF-Dojo

  16. BUILD-BENCH: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

  17. ChopChop: Semantically Constraining the Code Output of Language Models

  18. Refactoring Codebases through Library Design

  19. A Note on the Code Quality Score System: LLMs for Maintainable Large Codebases

  20. A Matter of Representation: Towards Graph-Based Abstract Code Generation

  21. Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

  22. LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

  23. FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

  24. Thyme: Think Beyond Images

  25. Is Your Benchmark Still Useful? Dynamic Benchmarking for Code Language Models

  26. Ensuring Functional Correctness of Large Code Models with Selective Generation

  27. Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

  28. The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management

  29. Random Baselines for Simple Code Problems are Competitive with Code Evolution

  30. Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

  31. GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

  32. Code2Video: A Code-centric Paradigm for Educational Video Generation

  33. CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

  34. Advancing Environment Setup LLMs through Online Reinforcement Learning

  35. RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

  36. Diff-XYZ: A Benchmark for Evaluating Diff Understanding

  37. Learning From Design Procedure To Generate CAD Programs for Data Augmentation

  38. Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

  39. Efficient Code Embeddings from Code Generation Models

  40. SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

  41. SubtaskEval: Benchmarking LLMs on Competitive Programming Subtasks

  42. HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning

  43. Deep-Reproducer: From Paper Understanding to Code Generation

  44. Workflows vs Agents for Code Translation

  45. Can Test-Time Compute Help LLMs Write Low-Resource Parallel Code Better?

  46. Learning to Solve and Verify: A Self-Play Framework for Mutually Improving Code and Test Generation

  47. Astra: A Multi-Agent System for GPU Kernel Performance Optimization

  48. The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

  49. LLM-Driven Multi-step Translation from C to Rust using Static Analysis

  50. MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

  51. R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents

  52. Security Knowledge Dilution in Large Language Models: How Irrelevant Context Degrades Critical Domain Expertise

  53. In-Context Learning for Esoteric Programming Languages: Evaluating and Enhancing LLM Reasoning Without Fine-Tuning

  54. Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem

  55. DevBench: Beyond Accuracy: Realistic and Diagnostic Evaluation of Code Generation Models

  56. pydra: Probing Code Representations With Synthetic Clones and Bugs

  57. SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

  58. Good-Enough Structured Generation: A Case Study on JSON Schema

  59. HardTests: Synthesizing High-Quality Test Cases for LLM Coding

  60. Asm2SrcEval: Evaluating Large Language Models for Assembly to Source Code Translation

  61. STACKFEED: Structured Textual Actor-Critic Knowledge base editing with FEEDback

  62. Agint: Agentic Graph Compilation for Software Engineering Agents

  63. DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

  64. Adapting Language Models for Low-Resource Programming Languages