MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs

¹Fudan University  ²Shanghai University

Abstract

We introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. We comprehensively evaluate models along multiple dimensions, including citation quality, source reliability, and answer accuracy. Through extensive experiments, we observe that MLLMs struggle with multimodal citation text generation. We further conduct in-depth analyses of model performance, revealing that the bottleneck lies in attributing responses to the correct sources rather than in understanding the multimodal content.

Example of Multimodal Citation Text Generation

The model takes a multimodal corpus as input and generates responses with explicit citations.

Dataset

MCiteBench comprises 3,000 multimodal citation text generation samples extracted from 1,749 academic papers, an average of 1.72 questions per paper. Of these, 2,000 are Explanation questions that require detailed evidence analysis and typically elicit long-form answers, while 1,000 are Locating questions that focus on direct evidence identification. The evidence is balanced across modalities, with 1,243 textual, 1,474 visual (941 figures and 533 tables), and 283 mixed-modality sources, ensuring diverse multimodal attribution scenarios.
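
To make the task format concrete, the sketch below shows one possible way to lay out a single sample; the field names are illustrative assumptions, not the benchmark's released schema.

# Hypothetical layout of a single MCiteBench-style sample.
# Field names are illustrative assumptions, not the released schema.
sample = {
    "question_type": "Explanation",  # or "Locating"
    "question": "Why does the proposed method outperform the baseline?",
    "sources": [
        {"id": "[1]", "modality": "text", "content": "...paragraph text..."},
        {"id": "Figure 2", "modality": "figure", "content": "figures/fig2.png"},
        {"id": "Table 1", "modality": "table", "content": "...table rows..."},
    ],
    "gold_evidence": ["Figure 2"],  # ground-truth attributable source(s)
    "answer": "The method improves accuracy because ... [Figure 2].",
}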

Main Results

  • Smaller open-source models achieve lower Citation F1 scores and struggle to select evidence that adequately supports their responses. They also perform poorly at selecting evidence that directly answers the query, as reflected by their low Source F1 and Source Exact Match scores (see the metric sketch after this list).
  • As model size increases, we observe an improvement in citation performance, suggesting that scaling model size enhances attribution capability.
  • In comparison, GPT-4o achieves an 84.24% Citation F1 score on single-source Explanation questions, demonstrating strong citation quality. However, it struggles with source reliability, with Source Exact Match scores remaining low at 24.50% for single-source and 21.27% for multi-source settings. This indicates that even state-of-the-art models struggle to consistently cite evidence that is directly relevant to answering the query, underscoring the difficulty of precise citation in multimodal contexts.
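
For reference, the sketch below shows one common way to compute set-based citation metrics of this kind, assuming predicted and gold citations are sets of source identifiers; MCiteBench's exact metric definitions may differ.

# Minimal set-based citation metrics; assumes predictions and gold labels are
# sets of source identifiers such as "[1]" or "Figure 2".
def precision_recall_f1(pred: set, gold: set):
    """Set-based precision, recall, and F1 over cited source identifiers."""
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def source_exact_match(pred: set, gold: set) -> float:
    """1.0 only when the cited sources match the gold evidence exactly."""
    return float(pred == gold)

# Example: the model cites [1] and Figure 2, but only Figure 2 is gold evidence.
p, r, f1 = precision_recall_f1({"[1]", "Figure 2"}, {"Figure 2"})  # 0.5, 1.0, 0.67
em = source_exact_match({"[1]", "Figure 2"}, {"Figure 2"})         # 0.0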

RQ1: Do Models Prefer Specific Modalities for Citation?

We analyze model performance on instances whose evidence comes from mixed modalities. Most models achieve high Source EM scores when the ground-truth evidence is textual but perform poorly when it is visual. This suggests that although MLLMs can process multimodal inputs, they align with textual evidence more readily than they accurately cite visual information when generating responses.

Using Qwen2-VL-7B as the test model, we compute the attention distribution over the multimodal inputs by averaging attention scores across heads and layers and normalizing by the token length of each input source. While the model processes all modalities, it prioritizes textual content and uses it more effectively than visual data.
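
A rough sketch of this kind of analysis is shown below. It assumes attentions have already been obtained from a Hugging Face-style forward pass with output_attentions=True and that token spans for each input source have been located by the caller; it illustrates the procedure described above and is not the paper's actual analysis code.

import torch

def source_attention_distribution(attentions, source_spans):
    """Average attention each input source receives, normalized by its token length.

    attentions:   tuple over layers of [batch, heads, seq_len, seq_len] tensors,
                  e.g. from model(..., output_attentions=True) with batch size 1.
    source_spans: dict mapping a source id (e.g. "[1]", "Figure 2") to the
                  (start, end) token positions of that source in the input;
                  building these spans is left to the caller.
    """
    # Stack layers, then average over layers, heads, and query positions,
    # leaving the attention mass received by each key (input) position.
    attn = torch.stack(attentions).mean(dim=(0, 2, 3)).squeeze(0)  # [seq_len]
    scores = {}
    for src, (start, end) in source_spans.items():
        # Mean over the span normalizes by the source's token length,
        # so long text blocks are not favored by size alone.
        scores[src] = attn[start:end].mean().item()
    total = sum(scores.values())
    return {src: s / total for src, s in scores.items()}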

RQ2: How Do Models Allocate Attention When Generating Source Citations?

We analyze the attention distribution of Qwen2-VL-7B when generating source-identifying tokens (e.g., "[1]", "Figure 2"). Notably, the distractors are sampled from unrelated papers, meaning they provide no useful information for answering the question.
The model’s attention heatmap reveals an intriguing pattern: even when the response is based entirely on a specific piece of evidence, the model’s attention is not focused solely on it. When generating the token after "According to Figure ", the model’s attention remains high on textual index positions (e.g., "[1]", "[2]"), even though the context suggests it should focus on figure evidence. This suggests that while the model correctly cites the source, it maintains a broader contextual awareness by attending to multiple potential evidence sources.
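
The per-step variant of the same idea might look like the sketch below. It assumes generation is run with attentions returned and reuses the hypothetical source_attention_distribution helper and source_spans bookkeeping from the previous sketch; citation_token_step (the decoding step that emits the citation token) is likewise assumed to be located by the caller.

# Sketch: inspect where attention goes at the decoding step that emits a
# source-identifying token (e.g. the token after "According to Figure ").
# Assumes `model`, `inputs`, `source_spans`, and the helper above already exist,
# and that `citation_token_step` has been located by the caller.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    output_attentions=True,
    return_dict_in_generate=True,
)
step_attn = outputs.attentions[citation_token_step]  # tuple over layers, each [batch, heads, 1, seq_so_far]
per_source = source_attention_distribution(step_attn, source_spans)
print(per_source)  # e.g. {"[1]": ..., "[2]": ..., "Figure 2": ...}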

RQ3: Understanding or Attribution: What is the Bottleneck?

On one hand, multimodal understanding requires the model to grasp the meaning of the inputs and use them to generate a coherent response. On the other hand, multimodal attribution demands that the model's generated response can be traced back to specific, verifiable sources in the input.
We construct single-source Locating QA pairs and sample distractors from unrelated papers. Understanding questions use a 4-option multiple-choice format, while attribution questions use a 5-option multiple-choice format, with each option corresponding to one of the five information sources in the input.
While models achieve over 90% accuracy on understanding tasks, they perform worse on attribution questions. This suggests that the bottleneck lies not in multimodal understanding but in multimodal attribution. This finding underscores a fundamental limitation in current multimodal models: they can process and understand multimodal inputs well but struggle to attribute outputs to the correct evidence accurately.
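
To make the two probe formats concrete, a hypothetical pair of prompts might look like the following; the wording and options are our own illustration, not the benchmark's actual templates.

# Hypothetical probe templates; the exact wording in MCiteBench may differ.
understanding_prompt = (
    "Based on the sources above, which conclusion does the paper draw?\n"
    "A. ...  B. ...  C. ...  D. ..."  # 4 answer options about the content itself
)
attribution_prompt = (
    "Which source supports the statement above?\n"
    "A. [1]  B. [2]  C. [3]  D. Figure 1  E. Table 1"  # 5 options, one per input source
)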

BibTeX

@article{hu2025mcitebench,
  title={MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs},
  author={Hu, Caiyu and Zhang, Yikai and Zhu, Tinghui and Ye, Yiwei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2503.02589},
  year={2025}
}