Kyoto University
Cross-market image transcreation requires preserving movie identity while adapting to market-specific design preferences and multilingual typography — a challenge that goes far beyond simple translation. We introduce MPTc-Bench, a benchmark of 582 aligned poster pairs spanning 34 target markets, sourced from Douban, Eiga, and IMDb. We define two task variants: Surface (text-centric localisation) and Deep (preference-level style adaptation), and propose a two-stage planner–editor pipeline. Evaluation combines information-retention checks, LLM-as-a-judge aesthetic scoring, and objective visual similarity signals. Our experiments reveal a substantial gap between model outputs and human-crafted target-market posters — particularly in faithful text rendering — and show that Gemini 3 currently leads across both task variants.
Carefully curated aligned poster pairs with rich metadata, filtered from over 4,500 cross-market candidates using perceptual hashing and GLOBE cultural clustering.
Global coverage: 34 target markets across Asia, Europe, North and South America. Market selection is informed by GLOBE cultural clustering to ensure diversity.
# Download: https://github.com/minamotooRin/mptc-bench/tree/main/data
import urllib.request

# Fetch both task splits as JSONL files into the working directory.
for split in ["surface", "deep"]:
    url = f"https://raw.githubusercontent.com/minamotooRin/mptc-bench/main/data/mptcbench_{split}.jsonl"
    urllib.request.urlretrieve(url, f"mptcbench_{split}.jsonl")
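Once a split file such as mptcbench_surface.jsonl is available locally, it can be read one JSON object per line. The sketch below is a minimal reader that makes no assumption about field names, since the record schema is not described here; the helper name read_jsonl is illustrative and not part of the benchmark's tooling.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumes the Surface split was downloaded to the working directory.
    local = Path("mptcbench_surface.jsonl")
    if local.exists():
        records = read_jsonl(local)
        print(f"{len(records)} records loaded")
        if records and isinstance(records[0], dict):
            # Inspect the actual field names rather than guessing them.
            print("fields:", sorted(records[0]))
```

Printing the keys of the first record is a quick way to discover the metadata fields before writing any task-specific code.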
Poster pairs are split into two task levels using perceptual hash (pHash) distance, capturing fundamentally different localisation challenges.
Surface: The source and target posters share the same layout and visual composition. The model must translate title and tagline text, adapt typography, and localise any text-overlay elements, all without altering the visual design.
Deep: Source and target posters differ substantially in visual design. The model must re-compose the layout and adjust character poses, colour palette, and graphic motifs to match the cultural and aesthetic preferences of the target market.
Surface (top) vs. Deep (bottom) examples: Surface tasks preserve composition; Deep tasks require full visual redesign.
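The pHash-based split can be illustrated with a small sketch. It assumes each poster already has a 64-bit perceptual hash stored as a 16-character hex string (for instance, as produced by a perceptual hashing library) and that pairs at or below a distance threshold count as Surface. The threshold value of 10 and the function names are hypothetical; the cutoff actually used for the benchmark is not stated here.

```python
# Hypothetical cutoff: pairs whose hashes differ in at most this many bits
# are treated as Surface. The benchmark's real threshold is not given here.
SURFACE_MAX_DISTANCE = 10


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two hex-encoded perceptual hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def task_level(source_hash: str, target_hash: str,
               threshold: int = SURFACE_MAX_DISTANCE) -> str:
    """Assign a poster pair to 'surface' or 'deep' by pHash distance."""
    distance = hamming_distance(source_hash, target_hash)
    return "surface" if distance <= threshold else "deep"


if __name__ == "__main__":
    # Made-up hashes for illustration only.
    source = "ffd8c0a0b0989c9e"
    near_copy = "ffd8c0a0b0989c9f"   # differs from source in one bit
    redesign = "00273f5f4f676361"    # bitwise complement of source
    print(task_level(source, near_copy))
    print(task_level(source, redesign))
```

A small distance means the two images are near-duplicates at the structural level, which matches the Surface case where only text and typography change. A large distance signals a redesigned composition, which matches the Deep case.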
Human GT row (greyed) shows the upper-bound reference. Aesthetic and Adaptation scores are on a 1–5 scale; Win Rate and Title chrF are percentages.
Planners generate accurate title translations, but image editors often fail to render them faithfully. Diffusion-based models lose 30–50% of title fidelity in this planner→editor gap.
Title chrF: planned translation vs. visually rendered output. Gemini 3 nearly closes the gap; diffusion models suffer severe losses.
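To make the planner-to-editor gap concrete, the sketch below computes a simplified chrF (a character n-gram F-score with beta = 2, averaged over n-gram orders 1 to 6) for a planned title and for the text read back from a rendered poster, then reports the relative loss. This is an illustrative approximation, not the evaluation code behind the reported numbers. The example titles are made up, and how the rendered text is extracted from the image (for example by OCR) is outside the sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str,
         max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale (orders too long for a string are skipped)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def relative_loss(planned_score: float, rendered_score: float) -> float:
    """Fraction of the planned chrF that is lost after rendering."""
    if planned_score == 0:
        return 0.0
    return (planned_score - rendered_score) / planned_score


if __name__ == "__main__":
    reference = "Le Voyage de Chihiro"   # made-up target-market title
    planned = "Le Voyage de Chihiro"     # planner output
    rendered = "Le Voyagc de Chihro"     # text as it appears in the edited image
    planned_score = chrf(planned, reference)
    rendered_score = chrf(rendered, reference)
    print(f"planned chrF:  {planned_score:.1f}")
    print(f"rendered chrF: {rendered_score:.1f}")
    print(f"relative loss: {relative_loss(planned_score, rendered_score):.0%}")
```

Scoring the planned and rendered strings against the same reference separates translation errors from rendering errors: a high planned score with a low rendered score points at the image editor rather than the planner.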
Aesthetic score vs. Adaptation score for all evaluated systems (Deep task). Gemini 3 is the only system that matches or exceeds GT adaptation.
Three case studies showing SRC (original poster), GT (human target-market poster), and top model outputs.
If you use MPTc-Bench in your research, please cite our paper.
@inproceedings{lin2025mptcbench,
title = {Measuring Cross-Market Generative Ability of
Vision--Language Models via Movie Poster Transcreation},
author = {Lin, Youyuan and Li, Yuan and Yu, Yahan and
Cheng, Fei and Nishida, Shinya and Chu, Chenhui},
booktitle = {Proceedings of the 63rd Annual Meeting of the
Association for Computational Linguistics (ACL)},
year = {2025},
note = {Under review}
}