MCiteBench: A Multimodal Benchmark for Generating Text with Citations

1Fudan University  2Shanghai University

Abstract

In this paper, we introduce MCiteBench, the first benchmark designed to assess the ability of multimodal large language models (MLLMs) to generate text with citations in multimodal contexts. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. Experimental results reveal that MLLMs struggle to reliably ground their outputs when handling multimodal input. Further analysis uncovers a systematic modality bias and shows how models internally rely on different sources when generating citations, offering insights into model behavior and guiding future directions for multimodal citation tasks.

Example of Multimodal Citation Tasks

The model takes a multimodal corpus as input and generates responses with explicit citations.
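
For illustration, the sketch below shows what a single input/output pair might look like; the field names and structure are hypothetical and do not reflect the benchmark's actual data schema.

# Hypothetical illustration of the task format (not the benchmark's actual schema):
# the model receives a question plus several multimodal sources and must cite the
# sources that support its answer using markers such as "[1]" or "Figure 3".
example = {
    "question": "How does performance change as model size increases?",
    "sources": [
        {"id": "[1]", "type": "text", "content": "Paragraph from Section 4 ..."},
        {"id": "Figure 3", "type": "figure", "content": "<image>"},
        {"id": "Table 2", "type": "table", "content": "| Model | Acc | ..."},
    ],
    "reference_answer": "Accuracy improves with scale, as reported in [1] and visualized in Figure 3.",
}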

Dataset

MCiteBench comprises 3,000 data samples for evaluating the ability of MLLMs to generate text with citations, extracted from 1,749 academic papers with an average of 1.72 questions per paper. Among these, 2,000 are Explanation tasks that require detailed evidence analysis and often lead to long-form answers, while 1,000 are Locating tasks that focus on direct evidence identification. The evidence is balanced across modalities, with 1,243 textual, 1,474 visual (including 941 figures and 533 tables), and 283 mixed-modality sources, ensuring diverse multimodal attribution scenarios.

Main Results

  • Smaller open-source models achieve lower Citation F1 scores and struggle to select evidence that adequately supports their responses. They also perform poorly at selecting evidence that directly answers the query, as reflected in their low Source F1 and Source Exact Match scores.
  • As model size increases, we observe an improvement in citation performance, suggesting that scaling model size enhances attribution capability.
  • In comparison, GPT-4o achieves an 84.24% Citation F1 score on single-source Explanation questions, demonstrating strong citation quality. However, it struggles with source reliability: its Source Exact Match scores remain low at 24.50% for single-source and 21.27% for multi-source settings. This indicates that even state-of-the-art models cannot consistently cite the evidence that directly answers the query, underscoring the difficulty of precise citation in multimodal contexts.
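
For concreteness, here is a minimal sketch of how the source-level metrics could be computed, assuming Source F1 and Source Exact Match are set-overlap scores between the sources a model cites and the gold evidence sources; Citation F1 additionally requires judging whether each cited source actually supports the generated statement, which is not shown here.

def source_scores(predicted: set[str], gold: set[str]) -> dict[str, float]:
    # Exact Match: the cited set must equal the gold evidence set exactly.
    em = float(predicted == gold)
    if not predicted or not gold:
        return {"source_f1": em, "source_em": em}
    tp = len(predicted & gold)                      # correctly cited sources
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"source_f1": f1, "source_em": em}

# Example: the model cites "[1]" and "Figure 3", but only "[1]" is gold evidence.
print(source_scores({"[1]", "Figure 3"}, {"[1]"}))  # source_f1 ≈ 0.67, source_em = 0.0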

RQ1: Can MLLMs Accurately Identify the Source Needed to Answer a Question?

Generating text with citations can be abstracted into a two-stage process:

  1. generating a response
  2. mapping that response to the appropriate supporting input sources by producing attribution tokens such as “[1]” or “Figure 3”.
Instead of requiring the model to generate an answer and then attribute it, we directly probe the second stage: the model is asked to select which source would be most helpful for answering a given question. This evaluates its ability to identify relevant sources based solely on the question, without relying on model-generated answers or intermediate claims. Despite this seemingly simplified setting, no model achieves more than 60% accuracy, highlighting the persistent difficulty of grounding questions in the correct source.
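
A minimal sketch of how such a source-selection probe could be run; the prompt wording and option labeling below are illustrative assumptions, not the benchmark's exact protocol.

def build_source_selection_prompt(question: str, sources: list[str]) -> str:
    # Enumerate candidate sources with the same index style used for citations.
    lines = [f"Question: {question}", "Candidate sources:"]
    for i, src in enumerate(sources, start=1):
        lines.append(f"[{i}] {src}")
    lines.append("Reply with only the index (e.g., [2]) of the single most helpful source.")
    return "\n".join(lines)

def selection_accuracy(predicted_indices: list[str], gold_indices: list[str]) -> float:
    # Exact-match accuracy over the predicted source indices.
    correct = sum(p.strip() == g.strip() for p, g in zip(predicted_indices, gold_indices))
    return correct / len(gold_indices) if gold_indices else 0.0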

RQ2: Does Modality Influence Citation Performance?

We analyze model performance on instances whose evidence comes from mixed modalities. Most models achieve high Source EM scores when the ground-truth evidence is textual but perform poorly when it is visual. This suggests that although MLLMs can process multimodal inputs, they are better at aligning with textual evidence than at accurately citing visual information when generating responses.

Using Qwen2-VL-7B as the test model, we compute the attention distribution over multimodal inputs at each layer by averaging attention scores across heads and normalizing by the token length of each input source. While the model processes all modalities, it prioritizes textual content and utilizes it more effectively than visual data.
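
A minimal sketch of this kind of aggregation, assuming per-layer attention tensors are already available (e.g., from a forward pass with output_attentions=True in Hugging Face Transformers) along with the token span of each input source; model loading and span bookkeeping are omitted.

import torch

def per_source_attention(attentions, source_spans, query_positions):
    # attentions:      tuple of per-layer tensors, each [batch, heads, seq, seq]
    # source_spans:    {source_id: (start, end)} token ranges of each input source
    # query_positions: token positions whose outgoing attention we aggregate
    results = {}
    for layer_idx, layer_attn in enumerate(attentions):
        head_avg = layer_attn[0].mean(dim=0)        # average over heads -> [seq, seq]
        rows = head_avg[query_positions]            # [num_query_positions, seq]
        per_source = {}
        for src, (start, end) in source_spans.items():
            mass = rows[:, start:end].sum(dim=-1).mean().item()
            per_source[src] = mass / (end - start)  # normalize by source token length
        results[layer_idx] = per_source
    return results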

RQ3: What Do Models Look At When Generating Citations?

We analyze the attention distribution of Qwen2-VL-7B when generating source-identifying tokens (e.g., "[1]", "Figure 2"). Notably, the distractors are sampled from unrelated papers, meaning they provide no useful information for answering the question.
The model’s attention heatmap reveals an intriguing pattern: even when the response is based entirely on a specific piece of evidence, the model’s attention is not focused solely on it. Specifically, we examine its behavior when predicting the next token after "According to Figure " in its response. The model’s attention remains high on textual index positions (e.g., "[1]", "[2]"), even though the context suggests it should focus on the figure evidence. This suggests that while the model correctly cites the source, it maintains a broader contextual awareness by attending to multiple potential evidence sources.
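
To probe a single decoding step like the one above, one can reuse the same aggregation at the position of the token predicted right after "According to Figure "; this sketch assumes the same attentions and source_spans conventions as the previous snippet.

def attention_at_citation_step(attentions, source_spans, citation_position):
    # Length-normalized attention mass that one decoding position (e.g., the token
    # predicted after "According to Figure ") assigns to each input source,
    # averaged over layers so the scores are comparable across sources.
    scores = {src: 0.0 for src in source_spans}
    for layer_attn in attentions:
        row = layer_attn[0].mean(dim=0)[citation_position]   # [seq]
        for src, (start, end) in source_spans.items():
            scores[src] += row[start:end].sum().item() / (end - start)
    return {src: total / len(attentions) for src, total in scores.items()}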

BibTeX

@article{hu2025mcitebench,
  title={MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs},
  author={Hu, Caiyu and Zhang, Yikai and Zhu, Tinghui and Ye, Yiwei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2503.02589},
  year={2025}
}