Findings of ACL 2026

Measuring Cross-Market Generative Ability of
Vision–Language Models via Movie Poster Transcreation

Youyuan Lin · Yuan Li · Yahan Yu · Fei Cheng · Shinya Nishida · Chenhui Chu

Kyoto University

Abstract

Cross-market image transcreation requires preserving movie identity while adapting to market-specific design preferences and multilingual typography — a challenge that goes far beyond simple translation. We introduce MPTc-Bench, a benchmark of 582 aligned poster pairs spanning 34 target markets, sourced from Douban, Eiga, and IMDb. We define two task variants: Surface (text-centric localisation) and Deep (preference-level style adaptation), and propose a two-stage planner–editor pipeline. Evaluation combines information-retention checks, LLM-as-a-judge aesthetic scoring, and objective visual similarity signals. Our experiments reveal a substantial gap between model outputs and human-crafted target-market posters — particularly in faithful text rendering — and show that the strongest Gemini image-editing endpoints currently define the frontier across task variants.

Dataset

MPTc-Bench

Carefully curated aligned poster pairs with rich metadata, filtered from over 4,500 cross-market candidates using perceptual hashing and GLOBE cultural clustering.

582 Aligned poster pairs

34 Target markets

2 Task variants

7 Image editors benchmarked


Global coverage: 34 target markets across Asia, Europe, North and South America. Market selection is informed by GLOBE cultural clustering to ensure diversity.
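The page does not say how GLOBE clusters were applied beyond informing market selection. As a rough illustration of cluster-aware selection, the sketch below groups candidate markets by GLOBE societal cluster and reports which clusters have no market yet. `GLOBE_CLUSTER` is a small hand-written excerpt of GLOBE country assignments chosen only for illustration. It is not the benchmark's list of 34 markets, and `cluster_coverage` is a hypothetical helper.

```python
from collections import defaultdict

# The ten GLOBE societal clusters.
ALL_CLUSTERS = {
    "Anglo", "Latin Europe", "Nordic Europe", "Germanic Europe", "Eastern Europe",
    "Latin America", "Sub-Saharan Africa", "Middle East", "Southern Asia", "Confucian Asia",
}

# Illustrative excerpt of country -> GLOBE cluster (NOT the benchmark's market list).
GLOBE_CLUSTER = {
    "United States": "Anglo", "Australia": "Anglo",
    "France": "Latin Europe", "Italy": "Latin Europe", "Spain": "Latin Europe",
    "Germany": "Germanic Europe", "Netherlands": "Germanic Europe",
    "Sweden": "Nordic Europe", "Poland": "Eastern Europe", "Russia": "Eastern Europe",
    "Brazil": "Latin America", "Mexico": "Latin America",
    "India": "Southern Asia", "Turkey": "Middle East",
    "China": "Confucian Asia", "Japan": "Confucian Asia", "South Korea": "Confucian Asia",
}


def cluster_coverage(markets):
    """Group markets by GLOBE cluster and list the clusters that no market covers."""
    by_cluster = defaultdict(list)
    for market in markets:
        # Markets missing from the excerpt are collected under "Unmapped".
        by_cluster[GLOBE_CLUSTER.get(market, "Unmapped")].append(market)
    missing = sorted(ALL_CLUSTERS - set(by_cluster))
    return dict(by_cluster), missing
```

For example, `cluster_coverage(["Japan", "China", "United States", "France"])` groups Japan and China under Confucian Asia and leaves seven of the ten clusters uncovered, which flags where additional markets would add cultural diversity.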

Access the Dataset

🤗 Hugging Face Hub · 📦 Zenodo (DOI)

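import json
import urllib.request

# Browse the files: https://github.com/minamotooRin/mptc-bench/tree/main/data
BASE_URL = "https://raw.githubusercontent.com/minamotooRin/mptc-bench/main/data"

records = {}  # task variant ("surface" / "deep") -> list of JSON records
for split in ["surface", "deep"]:
    filename = f"mptcbench_{split}.jsonl"
    urllib.request.urlretrieve(f"{BASE_URL}/{filename}", filename)  # download a local copy
    with open(filename, encoding="utf-8") as f:
        records[split] = [json.loads(line) for line in f if line.strip()]  # one JSON object per line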

Benchmark

Task Design & Evaluation

Poster pairs are split into two task levels by the perceptual hash (pHash) distance between the source and target posters; the two levels capture fundamentally different localisation challenges.

Surface Transcreation (pHash distance < 12)

The source and target posters share the same layout and visual composition. The model must translate title and tagline text, adapt typography, and localise any text-overlay elements — without altering the visual design.

Deep Transcreation (pHash distance > 30)

Source and target posters differ substantially in visual design. The model must recompose the layout and adjust character poses, colour palette, and graphic motifs to match the cultural and aesthetic preferences of the target market.

Surface (top) vs. Deep (bottom) examples: Surface tasks preserve composition; Deep tasks require full visual redesign.
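A minimal sketch of the pHash-based split, assuming both posters have already been converted to grayscale and resized to 32×32 pixels (nested lists of numbers). The exact pHash implementation used for the benchmark is not stated here; this one roughly follows the common recipe found in the `imagehash` package: take a 2-D DCT, keep the 8×8 lowest-frequency coefficients, and set each bit by comparing it against their median. The thresholds 12 and 30 come from the task definitions above. Treating pairs whose distance falls between them as excluded, and all helper names, are assumptions.

```python
import math
from statistics import median

HASH_SIZE = 8   # 8x8 low-frequency block -> 64-bit hash
IMG_SIZE = 32   # assumption: inputs are 32x32 grayscale matrices


def dct_2d(pixels):
    """Unnormalised 2-D DCT-II of an n x n matrix (separable: rows, then columns)."""
    n = len(pixels)
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)] for u in range(n)]
    rows = [[sum(row[x] * cos[u][x] for x in range(n)) for u in range(n)] for row in pixels]
    return [[sum(rows[y][v] * cos[u][y] for y in range(n)) for v in range(n)] for u in range(n)]


def phash(pixels):
    """64 bits: is each low-frequency DCT coefficient above the block's median?"""
    coeffs = dct_2d(pixels)
    low = [coeffs[u][v] for u in range(HASH_SIZE) for v in range(HASH_SIZE)]
    threshold = median(low)
    return [c > threshold for c in low]


def hamming(h1, h2):
    """Number of differing bits between two equal-length hashes."""
    return sum(a != b for a, b in zip(h1, h2))


def task_variant(src, tgt, surface_below=12, deep_above=30):
    """Return 'surface', 'deep', or None for pairs in the middle distance band."""
    for img in (src, tgt):
        if len(img) != IMG_SIZE or any(len(row) != IMG_SIZE for row in img):
            raise ValueError("expected 32x32 grayscale matrices")
    distance = hamming(phash(src), phash(tgt))
    if distance < surface_below:
        return "surface"
    if distance > deep_above:
        return "deep"
    return None  # assumption: mid-distance pairs are left out of both variants
```

A poster compared with itself has distance 0 and lands in `"surface"`. On real images one would more likely call an existing implementation such as `imagehash.phash`; the pure-Python DCT here only keeps the sketch dependency-free.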

EVALUATION ALONG THREE DIMENSIONS


Basic Info

Title fidelity (chrF) plus genre and year quiz accuracy. Measures whether key movie metadata is preserved in the transcreated poster.
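Title fidelity is reported as chrF, a character n-gram F-score. The page does not give the exact implementation or settings, so the sketch below is a simplified sentence-level chrF under common defaults: character n-grams up to length 6, whitespace ignored, and β = 2 so that recall weighs more than precision. Precision and recall are averaged over n-gram orders before the F-score is taken. Library implementations such as sacreBLEU may differ in edge-case handling, so scores from this sketch should not be expected to match reported numbers exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale.

    Averages character n-gram precision and recall over n = 1..max_n,
    skipping orders where either string is too short, then combines them
    into an F-beta score.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        matches = sum((hyp & ref).values())  # clipped n-gram overlap
        precisions.append(matches / sum(hyp.values()))
        recalls.append(matches / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

In this setting the hypothesis would be the title text read back from the poster and the reference the expected title; a title missing one character still scores well below 100 because the recall term is weighted more heavily.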


Aesthetic & Adaptation

LLM-as-a-judge direct scoring (1–5) for visual quality and cultural appropriateness, plus pairwise win rate against the human GT poster.
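The judge prompt, reply format, and tie handling are not described on this page. The sketch below shows one plausible way to turn judge outputs into the two reported numbers, assuming the judge answers in free text containing a single 1–5 score and, for pairwise comparisons, returns a verdict label. `parse_score`, `direct_score`, `win_rate`, and the labels `"model"`, `"gt"`, `"tie"` are all hypothetical.

```python
import re
from statistics import mean
from typing import List, Optional


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in a judge reply, or None if absent."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def direct_score(judge_replies: List[str]) -> Optional[float]:
    """Mean 1-5 score over replies that contained a parsable score."""
    scores = [s for s in map(parse_score, judge_replies) if s is not None]
    return mean(scores) if scores else None


def win_rate(verdicts: List[str]) -> float:
    """Percentage of pairwise comparisons won against the human GT poster.

    Each verdict is 'model', 'gt', or 'tie'; counting a tie as half a win
    is an assumption, not something the benchmark description specifies.
    """
    if not verdicts:
        return 0.0
    points = {"model": 1.0, "tie": 0.5, "gt": 0.0}
    return 100 * sum(points[v] for v in verdicts) / len(verdicts)
```

Pairwise judges are often queried a second time with the two posters in swapped order to reduce position bias; that is a common practice in LLM-as-a-judge setups, not a detail confirmed here.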


Objective Similarity

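CLIP cosine similarity; FID between transcreated outputs (MTC) and ground-truth posters (GT↔MTC) and between outputs and source posters (SRC↔MTC); and PP-OCRv5 text-layout IoU for Surface tasks.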
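How the text-layout IoU is computed is not spelled out here. One plausible reading, sketched below, compares the regions covered by OCR-detected text boxes on two posters: each set of boxes is treated as a union of axis-aligned rectangles, and IoU is the overlap area divided by the union area. The `(x0, y0, x1, y1)` box format and reducing PP-OCRv5's detected text polygons to axis-aligned boxes are assumptions. A cosine-similarity helper for precomputed embeddings (for example, CLIP image embeddings) is included alongside it.

```python
import math


def layout_iou(boxes_a, boxes_b):
    """IoU of the regions covered by two sets of (x0, y0, x1, y1) text boxes.

    Coordinate compression: the plane is cut at every box edge, and each
    resulting cell is tested (via its centre) for membership in either set,
    so overlapping boxes within one set are not double-counted.
    """
    boxes = list(boxes_a) + list(boxes_b)
    xs = sorted({x for b in boxes for x in (b[0], b[2])})
    ys = sorted({y for b in boxes for y in (b[1], b[3])})

    def covered(bs, cx, cy):
        return any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in bs)

    inter = union = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            in_a, in_b = covered(boxes_a, cx, cy), covered(boxes_b, cx, cy)
            area = (x1 - x0) * (y1 - y0)
            if in_a and in_b:
                inter += area
            if in_a or in_b:
                union += area
    return inter / union if union else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Two title boxes of equal size shifted by half their width give an IoU of 1/3: the shared half is one third of the combined area.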

Results

Leaderboard

The Human GT row (greyed) is the upper-bound reference. Systems are listed by API endpoint where available. Aesthetic and Adaptation scores are on a 1–5 scale; Win Rate and Title chrF are percentages.

Key Finding: Text Rendering Gap

Planners generate accurate title translations, but image editors often fail to render them faithfully. Diffusion-based editors lose 30–50% of title fidelity across this planner→editor handoff.
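The two-stage planner–editor design can be pictured as follows. This is a structural sketch only: the planner, editor, and text reader are passed in as plain callables, and `TranscreationPlan`, its fields, and `transcreate` are hypothetical names, not the benchmark's code. It illustrates where the rendering gap appears: the planner's `title` is exact text, whereas the title on the edited poster is only observable after an OCR-style reader extracts it again.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TranscreationPlan:
    """Hypothetical planner output handed to the image editor."""
    market: str
    title: str        # translated title the editor is asked to render
    tagline: str
    style_notes: str  # free-text layout/palette guidance (mostly for Deep tasks)


Planner = Callable[[bytes, str], TranscreationPlan]  # (source poster, market) -> plan
Editor = Callable[[bytes, TranscreationPlan], bytes]  # (source poster, plan) -> new poster
Reader = Callable[[bytes], str]                      # poster -> title read back (e.g. OCR)


def transcreate(poster: bytes, market: str, planner: Planner,
                editor: Editor, reader: Reader) -> Tuple[TranscreationPlan, bytes, str]:
    """Run planner then editor; return the plan, the edited poster, and the rendered title."""
    plan = planner(poster, market)
    edited = editor(poster, plan)
    rendered_title = reader(edited)  # may differ from plan.title: the planner->editor gap
    return plan, edited, rendered_title
```

Comparing `plan.title` with `rendered_title` (for instance with chrF) separates translation errors made by the planner from rendering errors made by the editor.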

Title fidelity loss from planner to image editor

Title chrF: planned translation vs. visually rendered output. Gemini image-editing endpoints nearly close the gap; diffusion models suffer severe losses.
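The loss is reported as a percentage of title fidelity. Assuming it is the relative drop from the planner's title chrF to the chrF of the title read back from the rendered poster, with both scored against the same reference, the aggregation could look like the sketch below. `relative_title_loss` is a hypothetical helper; whether the paper compares corpus-level means, as done here, or averages per-poster ratios is not stated on this page.

```python
from statistics import mean


def relative_title_loss(planned_chrf, rendered_chrf):
    """Fraction of title fidelity lost between planner text and rendered poster text.

    planned_chrf / rendered_chrf: per-poster chrF scores (0-100) for the same posters.
    Returns (mean planned - mean rendered) / mean planned.
    """
    if len(planned_chrf) != len(rendered_chrf) or not planned_chrf:
        raise ValueError("need two equal-length, non-empty score lists")
    planned, rendered = mean(planned_chrf), mean(rendered_chrf)
    return (planned - rendered) / planned if planned else 0.0
```

With toy scores (not benchmark results), a planner mean of 70 and a rendered mean of 35 gives a loss of 0.5, i.e. 50%.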

Aesthetic score vs. Adaptation score for all evaluated systems (Deep task). Gemini image-editing endpoints are closest to the human GT adaptation band.

Citation

Cite This Work

If you use MPTc-Bench in your research, please cite our paper.

@inproceedings{lin-etal-2026-mptc,
  title     = {{MPT}c-Bench: Measuring Cross-market Generative Ability of
               Vision-Language Models via Movie Poster Transcreation},
  author    = {Lin, Youyuan and Li, Yuan and Yu, Yahan and
               Cheng, Fei and Nishida, Shin{'}ya and Chu, Chenhui},
  booktitle = {Findings of the Association for Computational Linguistics:
               ACL 2026},
  year      = {2026},
  address   = {San Diego, California, United States},
  publisher = {Association for Computational Linguistics},
  pages     = {37897--37913},
  url       = {https://aclanthology.org/2026.findings-acl.1889/}
}
