Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Dongping Chen1,2*, Ruoxi Chen2*, Shu Pu2*, Zhaoyi Liu2*, Yanru Wu2*, Caixi Chen2*, Benlin Liu1, Yue Huang3, Yao Wan2, Pan Zhou2, Ranjay Krishna1
1University of Washington, 2Huazhong University of Science and Technology, 3University of Notre Dame
ISG-Bench Overview

Takeaways

Many real-world user queries (e.g., "How do I make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly on generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.

Abstract

Unified Transformer-based models have enabled simultaneous multimodal understanding and generation, showing promise in unifying vision and language tasks with interleaved text-and-image generation. However, assessing the performance of multimodal interleaved generation remains unexplored and challenging due to the complexity of interleaved content. In this paper, we design an automatic multi-granular evaluation framework, Interleaved Scene Graph (ISG), with four levels to evaluate accurate interleaved generation: it converts multimodal queries into atomic questions and then performs visual question answering for precise verification. Moreover, we propose ISG-Bench, the first multimodal interleaved benchmark with concrete generation requirements, consisting of 1,150 samples across 21 text-and-image generation tasks. Additionally, we pioneer a compositional agent framework, ISG-Agent, to explore the upper bound of interleaved generation with an agent workflow. In our experiments, we conduct a multi-granular evaluation of ISG, demonstrating its potential for automatically evaluating interleaved generation consistently with ground truth and human preferences. Furthermore, comprehensive assessments of 10 interleaved generative frameworks reveal that unified models still lack basic accurate instruction-following capabilities, falling short even on structural requirements. Additionally, our ISG-Agent outperforms other compositional frameworks in interleaved generation at various levels but still struggles with vision-dominated tasks. Our work offers valuable insights for advancing future research in interleaved text-and-image generation.

Interleaved Scene Graph: The Evaluation Framework

We introduce Interleaved Scene Graph (ISG), a multi-granular, automatic evaluation framework for interleaved text-and-image generation, which assesses responses across four levels, detailed as follows:
  1. Structure: Our method uses a language model to predict the structure of interleaved multimedia outputs based on mixed image-text inputs. We then evaluate if generated answers match these predicted structural requirements.
  2. Block: We assess block-level relations in interleaved content by converting queries into atomic block-to-block questions, which are then evaluated using visual question answering (VQA) techniques. This process involves representing prompts as subject-object-relation tuples and generating questions from these tuples, which a multimodal language model evaluates by providing yes/no answers and numerical scores; a minimal sketch of this block-level pipeline appears after this list.
  3. Image: We evaluate image-level interleaved generation by transforming multimodal queries into dependency-aware tuples of entities, relations, and attributes, linked to specific generated images. This approach is particularly useful for vision-dominant tasks like style transfer. We then use a language model to generate dependent questions, which are evaluated via a visual question answering module to assess image generation quality.
  4. Holistic: Our holistic evaluation uses a multimodal language model as a judge, inputting the query, response, and human-annotated golden answer. This approach, which builds on previous methods, incorporates an "Analyze-then-Judge" Chain-of-Thought process. The result is a more human-aligned evaluation that assesses generation quality, text-image alignment, and overall helpfulness, producing a comprehensive score.
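
To make the block-level pipeline concrete, the sketch below shows one possible implementation against an OpenAI-compatible GPT-4o endpoint. The prompts, the `extract_block_tuples`/`tuple_to_question`/`vqa_verify` helpers, and the expected JSON tuple layout are illustrative assumptions rather than ISG's exact implementation.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible GPT-4o endpoint

client = OpenAI()

def extract_block_tuples(query: str) -> list[dict]:
    """Decompose a multimodal instruction into subject-relation-object tuples
    over the answer's text/image blocks. The prompt and the expected
    {"tuples": [...]} JSON layout are illustrative assumptions."""
    prompt = (
        "Decompose the following interleaved-generation instruction into a JSON "
        'object {"tuples": [{"subject": ..., "relation": ..., "object": ...}]}, '
        "where subject and object are answer blocks such as 'image 2' or 'text 3'.\n\n"
        + query
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["tuples"]

def tuple_to_question(t: dict) -> str:
    """Turn one tuple into an atomic question about the generated blocks."""
    return (f"Does {t['subject']} satisfy the relation '{t['relation']}' "
            f"with respect to {t['object']}? Answer Yes or No, then give a 1-10 score.")

def vqa_verify(question: str, image_url: str, text_block: str) -> str:
    """Verify one atomic question with a multimodal judge (GPT-4o here)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{question}\nText block: {text_block}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content  # e.g. "Yes. Score: 8"
```

In practice, each generated answer block referenced by a tuple would be routed to `vqa_verify`, and the yes/no verdicts and 1-10 scores aggregated per task, mirroring the block-level rows of the meta-evaluation table below.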

ISG-Bench: The Benchmark

ISG-Bench is our benchmark for interleaved text-and-image generation, featuring 1,150 samples across 21 subtasks in 8 scenarios. It evaluates multimodal understanding, generation, and instruction-following. All queries are vision-language dependent, with golden reference answers. Samples were carefully curated using cross-validation and similarity filtering.
  1. Data Collection and Quality Control: Our collection process integrates high-quality visual metadata from existing datasets and our own collections, paired with curated natural language queries that reference these images and specify output structures. We leverage MLLMs for initial answer generation, while human annotators refine the answers and create free-form queries and golden answers to prevent data contamination. This process ultimately yields a diverse, high-quality interleaved benchmark validated through cross-annotation for consistency and accuracy.
  2. Other Dimensions: Our ISG-Bench categorization extends beyond task definitions to include two additional dimensions. Modality Domination classifies tasks by their primary output modality (Vision, Language, or Both). Accurate Image Generation assesses the specificity of image generation requirements in answers, ranging from concrete referential outputs to creative tasks without strict guidelines; these varied requirements are evaluated with a multimodal DSG across task types. An illustrative sample schema follows this list.
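
For illustration only, a single ISG-Bench sample might look like the following; the field names and layout here are hypothetical simplifications of the description above, not the released data format.

```python
# Hypothetical, simplified view of one ISG-Bench sample; the released
# data format may differ.
sample = {
    "id": "style_transfer_0042",
    "category": "Image-based",                 # one of the 8 scenarios
    "subtask": "Style Transfer",               # one of the 21 subtasks
    "modality_domination": "Vision",           # Vision / Language / Both
    "accurate_image_generation": True,         # concrete referential output vs. creative
    "query": [                                 # vision-language dependent instruction
        {"type": "text", "text": "Redraw the photo below in Van Gogh's style, "
                                 "then explain each change in one sentence."},
        {"type": "image", "path": "inputs/0042_photo.jpg"},
    ],
    "structure": ["image", "text"],            # required output block order
    "golden_answer": [                         # human-annotated reference
        {"type": "image", "path": "golden/0042_vangogh.jpg"},
        {"type": "text", "text": "Brush strokes are swirled ..."},
    ],
}
```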

ISG-Agent: The Agent Framework

ISG-Agent, our pioneering framework for interleaved text-and-image generation, addresses the challenges faced by unified generation models through a three-component system: (1) Planning, which interprets multimodal queries and generates tool-usage plans; (2) Tool usage, which executes appropriate tools with detailed logs for text and image generation; and (3) Refinement, which reviews and enhances generated content by addressing errors and improving coherence. This "Plan-Execute-Refine" pipeline ensures outputs that closely adhere to user instructions while autonomously handling diverse tasks, leveraging multimodal language models and specialized tools to create cohesive, text-image-aligned content that goes beyond discrete blocks.
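
The control flow of such a pipeline can be sketched as below. This is a minimal sketch under the stated assumptions: the `planner`, `tools`, and `refiner` callables and the feedback format are placeholders, not ISG-Agent's actual interfaces.

```python
# Minimal sketch of a "Plan-Execute-Refine" loop; planner, tool registry,
# and refiner are placeholder callables, not ISG-Agent's real interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    tool: str          # e.g. "text_gen", "sd3", "ultraedit", "sv3d"
    instruction: str   # what the tool should produce
    inputs: list       # references to earlier outputs or query images

def run_agent(query, planner: Callable, tools: dict[str, Callable],
              refiner: Callable, max_rounds: int = 3):
    """Plan tool usage, execute each step with logs, then iteratively refine."""
    plan: list[Step] = planner(query)                 # (1) Planning
    outputs, logs = [], []
    for step in plan:                                 # (2) Tool usage
        result = tools[step.tool](step.instruction, step.inputs, outputs)
        outputs.append(result)
        logs.append({"tool": step.tool, "instruction": step.instruction})
    for _ in range(max_rounds):                       # (3) Refinement
        feedback = refiner(query, outputs, logs)      # review coherence / errors
        if feedback.get("ok"):
            break
        for idx, fix in feedback["fixes"].items():    # re-run only flagged steps
            step = plan[idx]
            outputs[idx] = tools[step.tool](fix, step.inputs, outputs)
    return outputs  # interleaved list of text and image blocks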

Evaluating ISG with Human Annotations

We leverage GPT-4o for question generation and VQA in the ISG framework. Its performance is evaluated against human-annotated ground truth obtained via cross-validation, using varied sample sizes and metrics. A generated question is deemed correct based on subject/object matching and BERTScore. The VQA module employs an "Analyze-then-Judge" framework with both 1-10 scoring and Yes-or-No options. Ablation studies examine vision inputs versus image captions and few-shot prompting. MLLM-as-a-Judge evaluation uses human agreement as the metric; a sketch of these meta-evaluation metrics is given below.
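
The following sketch shows how the agreement metrics reported in the table could be computed with the `bert_score` and `scipy` packages; the 0.9 BERTScore threshold and the exact matching rules are assumptions for illustration, not the paper's settings.

```python
# Illustrative meta-evaluation metrics; the 0.9 threshold and matching rules
# are assumptions, not the paper's code.
from bert_score import score as bert_score
from scipy.stats import pearsonr

def question_match_rate(generated: list[str], references: list[str],
                        threshold: float = 0.9) -> float:
    """Fraction of generated questions whose BERTScore F1 against the
    human-written question exceeds a threshold (part of the Acc+BS criterion)."""
    _, _, f1 = bert_score(generated, references, lang="en")
    return float((f1 >= threshold).float().mean())

def vqa_score_agreement(model_scores: list[float],
                        human_scores: list[float]) -> float:
    """Pearson correlation between MLLM 1-10 scores and human ratings
    (the Pearson rows in the table; the paper reports p < 0.005)."""
    r, _p = pearsonr(model_scores, human_scores)
    return float(r)

def yes_no_accuracy(model_answers: list[str], human_answers: list[str]) -> float:
    """Accuracy of binary Yes/No verdicts against human annotations."""
    hits = [m.strip().lower() == h.strip().lower()
            for m, h in zip(model_answers, human_answers)]
    return sum(hits) / len(hits)
```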
Evaluating ISG with human annotations. All results for Pearson similarity have a p-value lower than 0.005. **Q-Gen:** question generation module; **Acc+BS:** Accuracy and BERTScore for block and question matching, respectively. Task columns: Style, Prog., 3D, and Dec. are Image tasks; I-T C., Temp., and VST are Image-Language tasks; VQA is a Language task.

| Eval Level | Eval Task | Metric | Size | Avg. | Style | Prog. | 3D | Dec. | I-T C. | Temp. | VST | VQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Structure | Direct Match | Accuracy | 1,150 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Block | Q-Gen | Acc+BS | 1,150 | 0.967 | 0.955 | 0.988 | 0.890 | 0.970 | 0.993 | 0.980 | 0.980 | 0.980 |
| Block | VQA Score | Pearson | 1,092 | 0.718 | 0.482 | 0.529 | 0.581 | 0.850 | 0.778 | 0.816 | 0.873 | 0.835 |
| Block | VQA YesNo | Pearson | 1,092 | 0.446 | 0.169 | 0.386 | 0.528 | 0.382 | 0.555 | 0.388 | 0.634 | 0.529 |
| Image | Q-Gen | Acc+BS | 1,150 | 0.811 | 0.949 | 0.761 | 0.553 | 0.925 | 0.884 | 0.817 | 0.792 | - |
| Image | VQA YesNo | Accuracy | 4,871 | 0.907 | 0.851 | 0.873 | 0.863 | 0.937 | 0.968 | 0.921 | 0.934 | - |
| Holistic | w. GT | Agreement | 260 | 0.730 | 0.720 | 0.620 | 0.660 | 0.600 | 0.950 | 0.750 | 0.640 | 0.900 |
| Holistic | w.o. GT | Agreement | 260 | 0.537 | 0.600 | 0.460 | 0.450 | 0.400 | 0.900 | 0.600 | 0.370 | 0.800 |

Benchmarking Interleaved Text-and-Image Generation

We evaluate 10 frameworks capable of generating interleaved text-and-image content: recently released unified models (Show-o, Anole, MiniGPT-5, CoMM-MiniGPT-5, and SEED-LLaMA), as well as two compositional settings that use Gemini-1.5-Pro (GeminiTeam, 2023) or Claude-3.5-Sonnet as the multimodal perceptor and SD3 as the generator, with SD2.1 for an ablation study. For our ISG-Agent, we use GPT-4o for the planning and verification agents and Claude-3.5-Sonnet as the tool selector, with SD3 as the image generator and multiple specialized tools (UltraEdit, DynamiCrafter, SV3D, DreamMover).

Empirical Results

ISG demonstrates commendable performance across all tasks

Each module within ISG aligns well with human annotation. For structure, ISG exhibits consistent excellence across all tasks, indicating robust potential for capturing structural requirements in interleaved generation instructions. In both the Q-Gen and VQA modules, ISG successfully extracts fine-grained requirements with high fidelity to ground truth. For the VQA module, the scoring approach consistently outperforms the "Yes-or-No" method, suggesting that more nuanced judgments align better with human evaluations. Vision-guided tasks consistently underperform compared to other tasks, with a noticeable decline in both the Q-Gen and VQA modules, underscoring the challenges in automatically evaluating fine-grained aspects of interleaved text-and-image generation. In holistic evaluation, leveraging a golden answer significantly outperforms the zero-shot judging setting of MLLMs, especially in vision-guided tasks, yielding an average 20% improvement in human agreement.

Ablation Study on Vision Input and Few-shot Prompting

We evaluate ISG under two additional conditions, vision input and few-shot examples, for a more comprehensive study. Multimodal input has varied effects on block-level and image-level question generation, with a slight enhancement at the image level. In addition, few-shot in-context learning provides a dramatic enhancement on both tasks, improving performance by more than 30% in block-level and 10% in image-level tasks, especially in vision-language-guided tasks, by constraining the requirements placed on the predicted generative content. For language-guided tasks, few-shot learning brings a 70% enhancement in block-level performance, further demonstrating that an accurate evaluation framework can be established for this type of creative generation task.

Unified models underperform in accurate interleaved generation

All unified models exhibit significant deficiencies in following instructions to generate interleaved text-and-image content. Many models produced only one to three images, while some failed to generate any images at all. Consequently, these models could not be subjected to block-level and image-level evaluation protocols. In terms of holistic evaluation, the models demonstrated superior capabilities in language-dominant tasks while notably underperforming in vision-dominant tasks. This disparity further supports the hypothesis that current training datasets for unified models lack sufficient vision-dominant instruction-tuning samples, such as those for "Style Transfer" and "Image Decomposition". Notably, Show-o, one of the first unified autoregressive models, stands out in producing the correct generation structure but falls short in generating high-quality responses due to hallucinations, generating images related to system prompts instead of the user's instructions. Moreover, Anole also shows potential in interleaved generation and achieves state-of-the-art results among unified models, suggesting the potential efficacy of its advanced architecture.

ISG-Agent outperforms in vision-dominated tasks

ISG-Agent strictly follows users' requirements to generate interleaved content, achieving results comparable to human golden answers across various tasks at both the block and image levels, especially in vision-dominated tasks such as "Style Transfer" and "3D Scene". The state-of-the-art results in "Progressive Transformation" also demonstrate good coherence of the image content, even when compared with human-collected answers. Although LLM+Diffusion frameworks fall short in accurate instruction-following, they achieve state-of-the-art results in holistic evaluation on some language-dominated tasks, demonstrating the high quality of their textual generation.

Vision-dominated tasks challenge all models.

Because compositional frameworks perceive and generate images separately rather than end to end, they inherently struggle with tasks such as accurate image editing. On the other hand, although unified models can in principle understand and generate images end to end and advertise capabilities in vision generative tasks such as "Image Generation" or "Image Editing", they fall short in understanding multimodal queries well enough to generate interleaved content with multiple images. The best unified model, Anole, fails to understand the output format and deviates from the context of input images, demonstrating a deficiency in vision in-context learning for image generation.

MLLM-as-a-Judge cannot evaluate fine-grained accurate generation.

The inconsistency between holistic evaluation results and those at the three fine-grained levels reveals a notable limitation of MLLM-as-a-Judge in comprehensively assessing responses, even when provided with both the user's instruction and the correct golden answer. Specifically, the judge MLLM struggles to evaluate responses according to fine-grained criteria, such as output structure (including image count) and the detailed text-image relationships stipulated in the prompt. Furthermore, our analysis of the results uncovers an inherent bias within MLLM-as-a-Judge, namely an "image-quality bias", where higher scores are consistently awarded to responses featuring higher-quality image content, even when these responses violate the user's instructional requirements and the judging guidelines. This bias demonstrates that MLLM-as-a-Judge, even when provided with a golden answer, still cannot properly perform accurate assessments of interleaved responses against specified requirements.

Enhanced components improve general response quality.

The comparison between two image generation models and the ablation study on tools consistently demonstrate superior performance across task levels when enhanced components are employed, underscoring the importance of advanced tools in producing more accurate and higher-fidelity content. Furthermore, the incorporation of a refinement module significantly improves text-image alignment, substantially enhancing both block-level and holistic performance, which highlights the potential of optimizing individual components to achieve precise interleaved generation within a compositional framework.

Discussion and Future Work

Improving Unified Models with Advanced Interleaved Datasets.

Our results highlight the potential of unified autoregressive model structures like Anole and Show-o, while revealing substantial room for improvement in their instruction following and accurate generation capabilities. This underscores the need for dedicated interleaved datasets, particularly for vision-dominant tasks. Current datasets, limited to unimodal tasks or loosely aligned vision-language pairs, inadequately address the challenges of generating coherent interleaved content. Additionally, existing interleaved datasets are predominantly language-centric, failing to establish robust vision-language dependencies crucial for enhanced multimodal understanding and generation. In this context, our compositional agent, ISG-AGENT, shows promise as a pipeline for synthetic interleaved instruction tuning and vision-centric data, potentially advancing the development of unified generative models.

Improving Evaluation Framework for Transparency and Reliability.

Although we have carefully built the entire benchmark from scratch with cross-validation and evaluated the reliability of these generative models in the question generation and VQA modules, concluding that it is practical to use them as evaluators, the trustworthiness of LLMs remains a concern, as they still make mistakes during evaluation. Moreover, due to their inherent structure, their evaluation lacks transparent and interpretable results. Therefore, a future direction lies in reducing reliance on AI models in the evaluation process, as in Task Me Anything, by synthetically generating questions paired with answers to evaluate model performance with the highest truthfulness and confidence.

A Flexible and Integrative Compositional Strategy.

In this study, we explore a compositional agent strategy that integrates diverse model modules to generate interleaved multimodal content. Experimental results indicate that further enhancing each sub-module's performance may significantly improve overall generative capabilities. Consequently, the compositional model not only demonstrates high flexibility and adaptability but also serves as a pivotal component in the advancement of unified models, particularly by functioning as a synthetic data pipeline to facilitate interleaved dataset construction. By leveraging high-quality generated content, such a synthetic dataset can further augment the generalization capabilities of unified multimodal models. Thus, its application not only contributes to exploring the upper performance bounds of current models but also provides valuable insights and guidance for the design and optimization of future unified models.

Trustworthiness of Interleaved Generation.

While ISG-Bench provides a strong foundation for evaluating accurate multimodal interleaved generation, a critical yet underexplored aspect is the trustworthiness of these models. Evaluating trustworthiness for interleaved generation presents several key challenges: (1) Previous research mainly focuses on single-modality generative models (e.g., LLMs), while challenges spanning text and image are not well addressed. (2) Another significant challenge is assessing the robustness of interleaved generation models against adversarial inputs (e.g., jailbreak attacks) or unexpected variations in prompts. These models may produce misleading or harmful outputs when manipulated through subtle alterations in the input text or images. Evaluating a model's resistance to such attacks is particularly difficult in a multimodal setting, as an attack could target just one modality (e.g., a slight change in a word or a pixel) and still cause cascading effects on the overall output.

Acknowledgement

This project is a follow-up to MLLM-as-a-Judge. This work is partially funded by Toyota Motor Corporation. We would also like to thank Jieyu Zhang, Weikai Huang, and Zixian Ma for their insightful feedback and support.

BibTeX

Interleaved Scene Graph Team