
👥 AgentIF Team • 📚 AgentIF Paper • Code • 📊 AgentIF Dataset

We introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics: (1) Realistic: constructed from 50 real-world agentic applications. (2) Long: instructions average 1,723 words, with a maximum of 15,630 words. (3) Complex: instructions average 11.9 constraints each, covering diverse constraint types such as tool specifications and condition constraints.

[Figure: instruction length distribution in AgentIF, and success rates of several representative LLMs across the proposed constraint dimensions]

An example instruction from AgentIF:

[Figure: example AgentIF instruction]

Leaderboard

Metrics

  • Constraint Success Rate (CSR) measures the proportion of individual constraints that are correctly satisfied by the model’s response.
  • Instruction Success Rate (ISR) measures the proportion of instructions for which all constraints are satisfied.
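
To make the two metrics concrete, here is a minimal Python sketch (not the official scoring script) that computes CSR and ISR from per-constraint pass/fail judgments; the `results` structure is a hypothetical intermediate format.

from typing import Dict, List

def compute_csr_isr(results: List[List[bool]]) -> Dict[str, float]:
    # results[i][j] is True if constraint j of instruction i is satisfied.
    # Illustrative sketch only, not the official AgentIF scorer.
    total_constraints = sum(len(r) for r in results)
    satisfied_constraints = sum(sum(r) for r in results)
    csr = satisfied_constraints / total_constraints          # fraction of constraints satisfied
    isr = sum(all(r) for r in results) / len(results)        # fraction of instructions with all constraints satisfied
    return {"CSR": csr, "ISR": isr}

# Example: two instructions with 3 and 2 annotated constraints
print(compute_csr_isr([[True, True, False], [True, True]]))
# -> {'CSR': 0.8, 'ISR': 0.5}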

Performance Across Constraint Categories

  • Click a model to see its latest results, or follow this link to the repository of all results: Results
| Models | Vanilla | Condition | Example | Formatting | Semantic | Tool | ISR | CSR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [T] o1-mini | 59.8 | 37.5 | 80.8 | 66.1 | 59.1 | 43.2 | 26.9 | 59.8 |
| [N] GPT-4o | 58.0 | 35.1 | 80.8 | 65.8 | 56.5 | 43.2 | 26.4 | 58.5 |
| [N] Qwen3-32B | 57.5 | 41.1 | 80.6 | 57.7 | 62.5 | 45.7 | 24.9 | 58.4 |
| [T] QwQ-32B | 57.5 | 35.6 | 82.7 | 61.4 | 59.4 | 43.2 | 27.2 | 58.1 |
| [T] DeepSeek-R1 | 56.1 | 41.4 | 87.0 | 61.4 | 58.9 | 44.4 | 22.2 | 57.9 |
| [T] GLM-Z1-32B | 56.7 | 37.9 | 83.6 | 60.2 | 59.6 | 43.1 | 23.8 | 57.8 |
| [N] DeepSeek-V3 | 54.9 | 41.5 | 84.5 | 59.3 | 58.9 | 40.8 | 21.9 | 56.7 |
| [N] Claude-3-5-Sonnet | 57.3 | 36.9 | 69.2 | 61.5 | 56.0 | 43.3 | 24.9 | 56.6 |
| [N] Meta-Llama-3.1-70B-Instruct | 55.1 | 35.0 | 84.3 | 61.6 | 55.6 | 42.8 | 20.9 | 56.3 |
| [T] DeepSeek-R1-Distill-Qwen-32B | 54.5 | 39.6 | 73.1 | 55.7 | 57.2 | 45.2 | 20.7 | 55.1 |
| [T] DeepSeek-R1-Distill-Llama-70B | 55.4 | 37.7 | 69.2 | 56.5 | 56.6 | 44.1 | 19.9 | 55.0 |
| [N] Meta-Llama-3.1-8B-Instruct | 53.5 | 36.6 | 71.4 | 55.6 | 54.8 | 43.5 | 19.9 | 53.6 |
| [S] Crab-DPO-7B | 48.3 | 24.3 | 57.5 | 48.8 | 47.4 | 41.9 | 10.1 | 47.2 |
| [N] Mistral-7B-Instruct-v0.3 | 47.9 | 29.2 | 53.8 | 47.0 | 48.6 | 39.8 | 11.5 | 46.8 |
| [S] Conifer-DPO-7B | 45.6 | 27.0 | 50.5 | 42.0 | 46.9 | 41.8 | 10.7 | 44.3 |

Success rates (%) of various proprietary and open-source LLMs on AgentIF, sorted by CSR in descending order. Vanilla, Condition, and Example are constraint presentation dimensions; Formatting, Semantic, and Tool are constraint types. [N] denotes non-thinking models, [T] denotes thinking models, and [S] denotes models explicitly designed for instruction following by the academic community.

Evaluation

For each instruction, we annotate the associated constraints and corresponding evaluation methods, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation.
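
The sketch below is a rough illustration of how these three evaluation modes can be dispatched, not the repository's actual evaluator. It assumes the `evaluation` entries of the data format shown further down; the `check(response)` convention for code snippets, the `{response}` placeholder, and the YES/NO judge parsing are assumptions made purely for illustration.

from openai import OpenAI

# Judge model matches the evaluator recommended below (gpt-4o-2024-11-20).
judge = OpenAI(base_url="https://api.openai.com/v1", api_key="<your_api_key>")

def run_code_check(snippet: str, response: str) -> bool:
    # Code-based evaluation: execute the stored snippet, which is assumed
    # here to define a function `check(response) -> bool`.
    namespace: dict = {}
    exec(snippet, namespace)
    return bool(namespace["check"](response))

def ask_llm_judge(prompt_template: str, response: str) -> bool:
    # LLM-based evaluation: fill the evaluation prompt with the model response
    # and ask the judge for a verdict (placeholder and YES/NO parsing are assumptions).
    prompt = prompt_template.replace("{response}", response)
    reply = judge.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
    )
    return "yes" in reply.choices[0].message.content.lower()

def evaluate_constraint(constraint: dict, response: str) -> bool:
    # A constraint passes only if every attached evaluation method passes;
    # hybrid code-LLM evaluation is a constraint carrying both entry types.
    verdicts = []
    for method in constraint["evaluation"]:
        if method["type"] == "code":
            verdicts.append(run_code_check(method["exec"], response))
        elif method["type"] == "llm":
            verdicts.append(ask_llm_judge(method["exec"], response))
    return all(verdicts)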

How to evaluate

  1. Clone the repository to your local environment. The necessary data is already included, so no additional data preparation is required.
     git clone https://github.com/THU-KEG/AgentIF.git
    
  2. (Optional) To evaluate a model hosted locally, deploy it using vLLM. Use a command similar to the following:
     CUDA_VISIBLE_DEVICES=<CUDA_ID> vllm serve "<your_model_path>" \
         --served-model-name <your_model_name> \
         --port 8008 \
         --tensor-parallel-size <num_gpus> \
         --max-model-len 32000 \
         --gpu-memory-utilization 0.9
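
     Once the server is running, you can optionally sanity-check it with a quick OpenAI-compatible request before launching the full evaluation (a minimal Python sketch; the model name is the placeholder from the command above, and the port matches it):

     from openai import OpenAI

     # Local vLLM servers expose an OpenAI-compatible API; no real key is needed.
     client = OpenAI(base_url="http://localhost:8008/v1", api_key="EMPTY")
     reply = client.chat.completions.create(
         model="<your_model_name>",
         messages=[{"role": "user", "content": "Hello"}],
     )
     print(reply.choices[0].message.content)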
    
  3. Specify the target model and the evaluator in the run.sh file. To reproduce our results, we recommend using gpt-4o-2024-11-20 as the evaluator.

    Model_Name=""             # Name of the model to evaluate
    Model_Name_URL=""         # Endpoint of the model (e.g., OpenAI API URL or local vLLM URL)
    Model_Name_API_Key="EMPTY" # Set to "EMPTY" for local vLLM; otherwise, provide your API key
    
    Evaluator_Model_Backbone=""  # Name of the evaluator model; use `gpt-4o-2024-11-20` for reproducibility
    Evaluator_URL=""             # Base URL of the evaluator; use `https://api.openai.com/v1` to match our setup
    Evaluator_API_Key=""         # API key for the evaluator
    
  4. Then run the script to start the evaluation.

     sh run.sh
    

Data Format

Each data instance in AgentIF is structured as follows:

{
  "input": [
    { "role": "system", "content": "..." },
    { "role": "user",   "content": "..." }
  ],
  "constraints": [
    {
      "id": 0,
      "desc": "...",                // Constraint description
      "other_info": {               // Auxiliary information for evaluation
        "...": "..."
      },
      "dimension": "...",           // Constraint Presentation Type
      "type": "...",                // Constraint Type
      "is_meta": false,             // Whether it is a meta-constraint
      "evaluation": [               // Evaluation Method
        {
          "type": "llm",            // LLM-based evaluation
          "required_keys": ["response"],
          "exec": "..."             // Evaluation prompt for LLM
        },
        {
          "type": "code",           // Code-based evaluation
          "exec": "..."             // Executable code snippet
        }
      ]
    }
  ]
}
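
As a quick usage illustration (assuming the data is stored as JSON Lines; the file name below is hypothetical), each instance's `input` messages are what the evaluated model receives, and each entry in `constraints` carries its own evaluation methods:

import json

# Hypothetical file name; point this at the actual data file in the repository.
with open("agentif_data.jsonl", encoding="utf-8") as f:
    instances = [json.loads(line) for line in f]

instance = instances[0]
messages = instance["input"]          # system + user messages for the evaluated model
for constraint in instance["constraints"]:
    methods = [m["type"] for m in constraint["evaluation"]]
    print(constraint["id"], constraint["dimension"], constraint["type"], methods)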

Citation

@misc{qi2025agentifbenchmarkinginstructionfollowing,
      title={AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios}, 
      author={Yunjia Qi and Hao Peng and Xiaozhi Wang and Amy Xin and Youfeng Liu and Bin Xu and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2505.16944},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16944}, 
}