ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing

Anonymous Authors
Submitted to NeurIPS 2026 Track on Evaluations and Datasets
Paper under double-blind review. Author list and affiliations updated after deanonymization.

Source video vs. our two variants and every baseline, scene by scene

ViTeX-Bench teaser: source-target task and three-axis evaluation
The video scene text editing task and ViTeX-Bench protocol. Given a source video V, a text-region mask M, and a source–target string pair (s_src, s_tgt), the task is to edit only the masked scene text. Representative outputs from four method families exhibit distinct failure modes, and ViTeX-Bench scores each edit along three axes: text correctness, visual quality, and edit locality.

Abstract

Video scene text editing aims to replace text appearing on scene surfaces in a video, such as storefront signs, whiteboards, and product labels, while preserving the surrounding content, motion, and camera dynamics. Although scene text editing has been extensively studied for still images, its video counterpart remains underdeveloped. Existing resources do not provide paired edits over real-world videos, current protocols rarely measure whether the edited region reads as the target string over time, and there is no large-scale open-source benchmark for systematic comparison. To close these gaps, we introduce ViTeX-Bench, a comprehensive benchmark suite for high-fidelity and temporally consistent video scene text editing. Specifically, we first build ViTeX-Dataset, containing 387 real-world 720p videos with precise text-region masks and editing instructions; the 230-clip training split provides the first such resource with paired editing results, built through a semi-automatic pipeline, and the remaining 157 clips form the evaluation set for standardized benchmarking. In ViTeX-Bench, we propose a three-axis evaluation protocol (text correctness, visual quality, and edit locality) totalling 13 metrics that combine character-level OCR signals with motion and preservation metrics. Our benchmarking experiments indicate that all eight leading video editing models, both public and commercial, fail in distinct ways: they either flicker or drift over time, or simply fail to produce the requested edit. To further validate the efficacy of our proposed dataset, we train ViTeX-Edit-14B on the paired training split with a motion-aligned glyph-video conditioning stream. We show that, trained on only 230 high-quality paired videos, ViTeX-Edit-14B achieves the strongest CharAcc among video-native editors (0.688, +11.1% over VideoPainter) and leads three ranked temporal metrics, reducing the previous best by 2.1% (Flicker_f), 6.6% (Flicker_c), and 1.9% (Warp_c), respectively.

A coordinated three-part release

Our contributions, paraphrased from the paper introduction.

📊

ViTeX-Dataset

The first paired real-video resource for video scene text editing, releasing both a training split with paired edited references and an evaluation split. It closes the paired-data gap left open by image-only datasets and by STRIVE, which offers only synthetic paired data and unpaired real-world data.

🎯

ViTeX-Bench

A task-specific evaluation protocol that scores character-level correctness, temporal visual quality, and edit locality on three complementary axes, released with a cross-family reference grid covering all four practical strategies.

🤖

ViTeX-Edit-14B

An open-source reference editor that augments Wan2.1-VACE-14B with a glyph-video conditioning pathway for target-character identity and source-aligned motion. We fine-tune it on the ViTeX-Dataset training split and release the trained weights as a reproducible baseline for the public leaderboard.

ViTeX-Dataset construction

A semi-automatic pipeline turns a curated 387-clip pool into paired training tuples.

387 total clips · 230 training · 157 frozen test · 4 scripts · 120 frames per clip · 720p at 24 fps
ViTeX-Dataset construction pipeline
Construction pipeline. Four assets are produced for every retained clip: the text-region mask (SAM 3 keypoint segmentation), the source–target string pair (Qwen3-VL-32B-Instruct), the clean background video (removal-1.3B, a Wan2.1-VACE-1.3B fine-tune), and the first-frame target-text patch (Nano Banana Pro). They are composed into the paired edited video Ṽ via Strategy A (alpha composition, static-text clips only) or Strategy B (a fine-tuned PISCO-based inserter, all clips). Static clips run both strategies, and the annotator keeps whichever output is judged better.
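As a rough illustration of what alpha composition means for a static-text clip, the sketch below blends one target-text patch into every frame of a clean background video. The function name, the array layout, and the assumption that the patch is already placed on a full-frame canvas are ours; the page does not describe the actual Strategy A implementation.

```python
import numpy as np


def alpha_composite(background: np.ndarray, patch: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend one static text patch into every frame of a background video.

    Hypothetical helper, not the released pipeline.
    background: (T, H, W, 3) float array in [0, 1], the clean background video
    patch:      (H, W, 3) float array in [0, 1], target text on a full-frame canvas
    alpha:      (H, W) float array in [0, 1], 1 inside the text region, 0 outside
    """
    a = alpha[None, :, :, None]  # broadcast over the time and channel axes
    return a * patch[None] + (1.0 - a) * background


# Toy example: a 2-frame black video with a white patch in the centre.
bg = np.zeros((2, 4, 4, 3))
patch = np.ones((4, 4, 3))
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0
out = alpha_composite(bg, patch, alpha)
print(out.shape)
```

Because the same patch and alpha are reused for every frame, a blend like this can only be correct when the text does not move, which is consistent with Strategy A being restricted to static-text clips.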

Evaluation protocol

Three orthogonal axes and thirteen complementary metrics; the vector is the unit of report.

ViTeX-Bench three axes, 13 metrics. Text correctness: SeqAcc, CharAcc, TTS. Visual quality: Flicker, Warp, MUSIQ. Edit locality: PSNR, SSIM, LPIPS, DreamSim.

Each method is reported as a thirteen-metric vector along three orthogonal axes. No cross-axis aggregate is published, because no axis substitutes for another: a method with the wrong characters is not redeemed by visual quality, and a method that re-renders the entire scene is not redeemed by text accuracy. Two methods with similar overall numbers can differ on which axis they fail. For ranking purposes only, the public leaderboard sorts on TextScore = ∛(SeqAcc · CharAcc · TTS), the geometric mean of the three text-correctness primitives. SeqAcc = 0 collapses TextScore to zero, the intended semantics for methods that never produce the requested target string. The remaining ten primitives appear next to TextScore on every leaderboard row, so the full vector is always visible.
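The ranking key can be reproduced from any leaderboard row with a few lines of Python. This is a minimal sketch: the function name and the range check are our own additions, and the inputs are the rounded values published on this page, so the result matches the published TextScore only up to rounding.

```python
def text_score(seq_acc: float, char_acc: float, tts: float) -> float:
    """Geometric mean of the three text-correctness primitives."""
    for name, value in (("seq_acc", seq_acc), ("char_acc", char_acc), ("tts", tts)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (seq_acc * char_acc * tts) ** (1.0 / 3.0)


# Inputs copied from the leaderboard rows on this page.
textctrl = text_score(0.475, 0.733, 0.511)   # published TextScore: 0.5624
vace = text_score(0.000, 0.298, 0.690)       # SeqAcc = 0 forces the score to 0
print(round(textctrl, 4), vace)
```

The second call shows the collapse described above: a single zero factor zeroes the geometric mean regardless of how high CharAcc and TTS are.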

Public leaderboard

Ranked by TextScore, the geometric mean of the three text-correctness primitives.

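| # | Method | Fam | Src | TextScore ↑ | SeqAcc ↑ | CharAcc ↑ | TTS ↑ | Flk_f ↓ | Flk_c ↓ | Wp_f ↓ | Wp_c ↓ | MUSIQ_f ↑ | MUSIQ_c ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DSim ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TextCtrl | A | Admin | 0.5624 | 0.475 | 0.733 | 0.511 | 3.80 | 4.29 | 1.59 | 2.09 | 70.32 | 42.77 | 41.14 | 0.994 | 0.008 | 0.004 |
| 2 | ViTeX-Edit-14B (Composite) | ours | Admin | 0.5410 | 0.345 | 0.689 | 0.666 | 3.73 | 3.83 | 1.51 | 1.56 | 70.27 | 44.94 | 42.95 | 0.992 | 0.006 | 0.002 |
| 3 | ViTeX-Edit-14B | ours | Admin | 0.5338 | 0.341 | 0.688 | 0.648 | 3.27 | 3.42 | 1.55 | 1.53 | 69.64 | 43.53 | 29.08 | 0.951 | 0.060 | 0.024 |
| 4 | VideoPainter† | C | Admin | 0.5151 | 0.365 | 0.619 | 0.606 | 2.38† | 2.62† | 2.93† | 3.35† | 67.16 | 40.59 | 28.56 | 0.915 | 0.104 | 0.024 |
| 5 | FLUX-Text | A | Admin | 0.5023 | 0.528 | 0.737 | 0.326 | 5.11 | 14.81 | 3.03 | 13.01 | 70.26 | 43.85 | 31.49 | 0.975 | 0.029 | 0.012 |
| 6 | RS-STE | A | Admin | 0.4908 | 0.354 | 0.626 | 0.534 | 3.73 | 3.66 | 1.61 | 1.81 | 69.57 | 34.26 | 37.00 | 0.983 | 0.024 | 0.007 |
| 7 | AnyText2 | A | Admin | 0.4074 | 0.280 | 0.633 | 0.382 | 3.34 | 4.95 | 2.04 | 3.95 | 66.68 | 41.65 | 25.56 | 0.905 | 0.092 | 0.043 |
| 8 | TextCtrl + AnyV2V | B | Admin | 0.1649 | 0.057 | 0.308 | 0.257 | 4.98 | 4.98 | 4.11 | 3.97 | 69.41 | 33.85 | 21.08 | 0.785 | 0.225 | 0.073 |
| - | Identity (sanity) | - | Admin | 0.0000 | 0.000 | 0.317 | 0.760 | 3.72 | 3.68 | 1.46 | 1.27 | 70.33 | 45.12 | 100.00 | 1.000 | 0.000 | 0.000 |
| - | Wan2.1-VACE-14B | C | Admin | 0.0000 | 0.000 | 0.298 | 0.690 | 3.78 | 3.84 | 1.69 | 1.56 | 70.54 | 45.26 | 35.21 | 0.976 | 0.022 | 0.007 |
| - | Kling Video 3.0 Omni | D | Admin | 0.0000 | 0.000 | 0.208 | 0.641 | 4.25 | 4.08 | 3.12 | 2.90 | 72.23 | 47.75 | 21.18 | 0.843 | 0.176 | 0.061 |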

ViTeX-Edit-14B: open reference model

Wan2.1-VACE-14B extended with a glyph-video conditioning pathway, fine-tuned on the 230-clip training split.

ViTeX-Edit-14B architecture
Architecture. Three streams enter the model: (i) the target text s_tgt via the frozen uMT5-XXL text encoder; (ii) the source video V and binary mask M via the standard VACE Video Condition Unit; and (iii) a glyph video built from s_tgt rendered in the source font and warped along the source-text trajectory tracked by CoTracker3, encoded by a frozen Wan VAE and compressed by a learnable GlyphEncoder into glyph tokens E_G. Every VACE block queries E_G through an added zero-initialized ConditionCrossAttention layer, so training learns to route glyph information across the eight injection depths.
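The practical effect of zero initialization can be shown with a toy single-head cross-attention layer in NumPy. Everything here is a simplified assumption (class name, single head, no normalization, random weights); the page does not specify the internals of ConditionCrossAttention. The sketch only demonstrates the property that a zero-initialized output projection leaves the block's hidden states untouched at the start of fine-tuning.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ZeroInitCrossAttention:
    """Toy residual cross-attention whose output projection starts at zero."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_o = np.zeros((dim, dim))  # zero init: the layer is a no-op before training

    def __call__(self, hidden: np.ndarray, glyph_tokens: np.ndarray) -> np.ndarray:
        q = hidden @ self.w_q                             # (N, dim) queries from the block
        k = glyph_tokens @ self.w_k                       # (M, dim) keys from glyph tokens
        v = glyph_tokens @ self.w_v                       # (M, dim) values from glyph tokens
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M) attention weights
        return hidden + (attn @ v) @ self.w_o             # residual update


rng = np.random.default_rng(1)
hidden = rng.normal(size=(6, 8))   # stand-in for one block's hidden states
glyph = rng.normal(size=(4, 8))    # stand-in for the glyph tokens E_G
layer = ZeroInitCrossAttention(dim=8)
print(np.allclose(layer(hidden, glyph), hidden))
```

Starting from an identity mapping means the pretrained backbone's behavior is preserved at step zero, and the glyph pathway contributes only as the output projection moves away from zero during training.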

Why three axes, jointly

Four diagnosed baseline failures, one per method family. Each is invisible from a single visual impression or a single metric column, and each is recovered by a specific axis or pair of axes.

A
TTS alone is not an editing-quality indicator
Wan2.1-VACE-14B (zero-shot), target NEW → OLD
Source clip with text-region mask overlay
Wan2.1-VACE-14B output: text is unchanged

The instruction is to change NEW to OLD, but Wan2.1-VACE leaves the masked region essentially unchanged from the source, so the rendered text still reads NEW. TTS is high (consecutive frames decode to the same string) and the source-faithful PSNR is also high, so a benchmark that relied on temporal stability or pixel-level fidelity alone would score this favorably. SeqAcc = 0 on this clip surfaces the failure: the model never produced the requested target.
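To make this failure concrete, the sketch below scores a clip whose per-frame OCR output never changes. The three metric definitions are simplified stand-ins that we assume for illustration (exact-match rate, one minus normalized edit distance, and agreement between consecutive frames); the benchmark's actual SeqAcc, CharAcc, and TTS definitions are given in the paper, not on this page.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def seq_acc(frames: List[str], target: str) -> float:
    """ASSUMED stand-in: share of frames whose OCR string equals the target."""
    return sum(f == target for f in frames) / len(frames)


def char_acc(frames: List[str], target: str) -> float:
    """ASSUMED stand-in: mean of 1 - normalized edit distance to the target."""
    scores = [
        1.0 - edit_distance(f, target) / max(len(f), len(target), 1)
        for f in frames
    ]
    return sum(scores) / len(scores)


def temporal_stability(frames: List[str]) -> float:
    """ASSUMED stand-in for TTS: share of consecutive frame pairs that agree."""
    if len(frames) < 2:
        return 1.0
    pairs = list(zip(frames, frames[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical OCR readout: every one of the 120 frames still reads the source string.
unchanged = ["NEW"] * 120
print(seq_acc(unchanged, "OLD"), char_acc(unchanged, "OLD"), temporal_stability(unchanged))
```

Under these stand-in definitions the unedited clip is perfectly stable over time yet never matches the target, which is the pattern the paragraph above describes.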

B
Per-frame editors cannot win correctness and temporal stability simultaneously
FLUX-Text
Source clip with text-region mask overlay
FLUX-Text output: characters flicker and drift between frames

Family-A methods take the SeqAcc and CharAcc lead because each frame is edited by a strong static scene-text editor, but the same frame-independence that yields character precision removes any temporal coupling. FLUX-Text registers near-correct characters on individual frames yet cycles through different glyph shapes between frames, so high SeqAcc / CharAcc co-occur with low TTS and an extreme Flicker_c spike. This is a structural property of per-frame editing rather than a tunable defect.

C
High visual quality is not correctness
Kling Video 3.0 Omni, target SOC → COC, edited as ZOO
Source clip with text-region mask overlay
Kling output: visually plausible but reads ZOO instead of COC

Kling produces a high-quality clip (both MUSIQ metrics lead the table), but the rendered glyphs spell ZOO rather than the requested target COC. A benchmark that aggregated visual quality without conditioning on correctness would put Kling at the top of the leaderboard; the three-axis structure of ViTeX-Bench makes this divergence explicit.

D
First-frame propagation collapses correctness and locality together
TextCtrl + AnyV2V
Source clip with text-region mask overlay
TextCtrl + AnyV2V output: flicker, drift, and off-mask scene corruption

TextCtrl on its own reaches strong SeqAcc with near-perfect locality. Pairing it with AnyV2V to lift that per-frame edit into a video drops SeqAcc by an order of magnitude and produces the worst PSNR / SSIM / LPIPS in the table simultaneously: as the propagation drifts away from the first-frame edit, the unmasked scene drifts with it and the rendered text loses temporal stability. Reporting either correctness or locality in isolation would mask half of the regression.
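A locality check of the kind discussed here can be sketched as PSNR computed only over pixels outside the text-region mask. The function name, the off-mask restriction, and the 100 dB cap are our assumptions (the cap is chosen so that an unchanged clip reports 100.00, as the Identity row does); the benchmark's exact preservation metrics may be computed differently.

```python
import numpy as np


def off_mask_psnr(source: np.ndarray, edited: np.ndarray, mask: np.ndarray,
                  cap: float = 100.0) -> float:
    """PSNR over pixels outside the text-region mask (hypothetical helper).

    source, edited: (T, H, W, 3) float arrays in [0, 1]
    mask:           (T, H, W) boolean array, True inside the edited text region
    cap:            ASSUMED upper bound returned when the off-mask pixels are identical
    """
    keep = ~mask
    if not keep.any():
        raise ValueError("mask covers every pixel; nothing to measure")
    diff = (source - edited)[keep]       # (K, 3) off-mask pixel differences
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return cap
    return float(min(cap, 10.0 * np.log10(1.0 / mse)))


source = np.zeros((2, 4, 4, 3))
mask = np.zeros((2, 4, 4), dtype=bool)
mask[:, 1:3, 1:3] = True

local_edit = source.copy()
local_edit[mask] = 1.0                   # change only the masked text region
global_edit = source + 0.1               # shift every pixel, including off-mask ones

print(off_mask_psnr(source, local_edit, mask), off_mask_psnr(source, global_edit, mask))
```

An edit confined to the mask leaves this score at the cap, while a method that re-renders the whole scene is penalized even if its text is correct, which is why locality is reported alongside correctness rather than folded into it.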

BibTeX

@misc{vitexbench2026,
  title  = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
  author = {Anonymous},
  year   = {2026},
  note   = {Submitted to NeurIPS 2026 Track on Evaluations and Datasets.
            Author list and DOI updated after deanonymization.},
  url    = {https://vitex-bench.github.io/}
}