CRAFT: Continuous Reasoning and Agentic Feedback Tuning
for Multimodal Text-to-Image Generation

Results Sample 1

Parti-Prompts: 1.6K+ long, complex prompts testing spatial reasoning, composition, and text rendering.

The same neural network with thinking mode "OFF" and "ON"

Comparison 1
Baseline CRAFT
The eye of the planet Jupiter.
Comparison 2
Baseline CRAFT
A hyperrealistic photo of glass-made samgyeopsal placed on a grill.
Comparison 3
Baseline CRAFT
Ultra-realistic Roronoa Zoro portrait.
Comparison 4
Baseline CRAFT
Portrait Pic, mysterious silhouette of a majestic wolf standing on a cliff at sunset, created with a dreamy double exposure effect showing a misty forest inside its silhouette, combined with expressive oil brushstrokes for texture and depth, cinematic lighting, surreal atmosphere, amazing depth, double exposure, surreal, intricately detailed, perfect balanced, deep fine borders, artistic photorealism, highly detailed, masterpiece, award winning, double exposure.
Comparison 5
Baseline CRAFT
Leela replaces Ripley: tall athletic cyclops woman with purple ponytail, white tank top, black trousers, shoulder harness, sweaty and battle-worn, holding Nibbler (small black three-eyed alien in red cape and white shorts, worried and clinging to her). Leela grips a detailed sci-fi pulse rifle matching the Aliens weapon silhouette but in clean Futurama cartoon linework and simple cel-shading. Blue backlight haze, drifting smoke, cinematic composition with dramatic rim light on Leela’s face. Background: industrial sci-fi corridor from Aliens, foggy blue-grey tones and subtle reflections. Keep Leela and Nibbler exactly on-model to Futurama’s style.
Comparison 6
Baseline CRAFT
Create an illustration-style diagram about “A shoulder-stretch routine you can do in 3 minutes.”
Comparison 7
Baseline CRAFT
Create a single cinematic illustration that visually represents the following poem, capturing its emotions, metaphors, and atmosphere: “Good night mother, Good night father, Kiss your little son. Good night sister, Good night brother, Good night everyone.”
Comparison 8
Baseline CRAFT
A giant cobra snake made from salad.
Comparison 9
Baseline CRAFT
Generate a series of six candid, documentary-style photos of this Indonesian president in office, in the rice fields, and partying with other presidents.
Comparison 10
Baseline CRAFT
Generate a 4-panel comic about the hardships of an embedded engineer.
Comparison 11
Baseline CRAFT
Funny anatomical diagram of dog pet, with humorous annotations.

Abstract

CRAFT brings explicit “thinking” into image generation. Instead of generating images in a single shot, it verifies visual constraints, fixes only what fails, and stops when all constraints are satisfied — improving reliability without retraining or scaling models.

CRAFT (Continuous Reasoning and Agentic Feedback Tuning) is a training-free, model-agnostic framework for text-to-image generation and image editing. It converts a user prompt into dependency-structured visual constraints, verifies generated images using a vision–language model, and applies targeted prompt updates only where constraints fail.

CRAFT runs entirely at inference time and works on top of existing image generators, requiring no retraining or architectural changes. With minimal overhead, it consistently improves compositional accuracy, text rendering, and preference-based quality across multiple model families — allowing small and inexpensive models to approach the performance of much larger systems.

Results Sample 1

Cost-quality trade-off at 2048×2048 resolution. CRAFT cost includes two generations plus $0.00128 reasoning overhead (VLM + LLM). Cost multiplier shown in parentheses.
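The cost accounting in this caption can be sketched as simple arithmetic: two generations plus the fixed $0.00128 reasoning overhead, divided by the price of a single baseline generation. The per-image price used below is a hypothetical value chosen for illustration, not a figure from the evaluation, and the function names are assumptions.

```python
# Fixed reasoning overhead (VLM + LLM) stated in the caption above.
REASONING_OVERHEAD_USD = 0.00128


def craft_cost(generation_price_usd: float, generations: int = 2) -> float:
    """Cost of one CRAFT run: `generations` image generations plus the overhead."""
    if generation_price_usd <= 0:
        raise ValueError("generation price must be positive")
    return generations * generation_price_usd + REASONING_OVERHEAD_USD


def cost_multiplier(generation_price_usd: float) -> float:
    """CRAFT cost relative to a single baseline generation."""
    return craft_cost(generation_price_usd) / generation_price_usd


if __name__ == "__main__":
    example_price = 0.01  # hypothetical price per image, for illustration only
    print(f"CRAFT cost: {craft_cost(example_price):.5f} USD")
    print(f"Multiplier: x{cost_multiplier(example_price):.3f}")
```

Because the overhead is a fixed amount, the multiplier stays close to 2 for expensive generators and rises above it only when a single generation costs about as little as the overhead itself.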

Methodology

CRAFT operates as a small multimodal orchestrator: given a user prompt (1), it decomposes the intended scene into explicit visual questions (2), evaluates generated images (4) with a VLM through visual-question answering (5), and rewrites the prompt (8) with an LLM to fix the detected inconsistencies.

At each iteration, the system alternates between generation (4), reasoning via VQ responses (5, 7), and correction of the current prompt (3, 8), using deterministic visual questions (DVQ) as a stable signal for constraint satisfaction. A comparator (6) checks whether all constraints are satisfied; the loop continues until they are met or a small iteration budget is exhausted.

CRAFT is fully model-agnostic and works with any T2I backbone (diffusion, autoregressive, or API-based). It requires no training, introduces minimal overhead, and reliably improves compositional accuracy, text rendering, spatial relations, and artifact detection across a wide range of models. In essence, CRAFT acts as an embedded agentic orchestrator for image generation: a lightweight reasoning supervisor that ensures the final output faithfully matches the user’s intent (loop 1→2→3→4→5→6→7→8).
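The control flow described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the four callables (`decompose`, `generate`, `answer`, `rewrite`) are hypothetical stand-ins for the question generator, the T2I backbone, the VLM, and the LLM rewriter, and the default budget of three iterations is arbitrary.

```python
from typing import Any, Callable, List, Tuple


def craft_loop(
    prompt: str,
    decompose: Callable[[str], List[str]],     # (2) prompt -> yes/no visual questions
    generate: Callable[[str], Any],            # (4) prompt -> image
    answer: Callable[[Any, str], bool],        # (5) VLM answers one question about the image
    rewrite: Callable[[str, List[str]], str],  # (8) LLM rewrites the prompt given failures
    max_iters: int = 3,                        # assumed iteration budget
) -> Tuple[Any, str, List[str]]:
    """Return (image, prompt that produced it, questions that still fail)."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    questions = decompose(prompt)
    current = prompt
    for step in range(max_iters):
        image = generate(current)
        failed = [q for q in questions if not answer(image, q)]
        # (6) comparator: stop when every constraint holds or the budget is spent.
        if not failed or step == max_iters - 1:
            break
        current = rewrite(current, failed)
    return image, current, failed


# Toy stand-ins so the sketch runs without any model: the "image" is just text.
def toy_decompose(prompt: str) -> List[str]:
    return ["cobra", "salad"]


def toy_generate(prompt: str) -> str:
    return prompt.lower()


def toy_answer(image: str, question: str) -> bool:
    return question in image


def toy_rewrite(prompt: str, failed: List[str]) -> str:
    return prompt + " made of " + " and ".join(failed)


if __name__ == "__main__":
    image, final_prompt, failed = craft_loop(
        "A giant cobra", toy_decompose, toy_generate, toy_answer, toy_rewrite
    )
    print(final_prompt, failed)
```

Passing the model calls in as functions mirrors the model-agnostic claim: the loop itself does not depend on which generator, VLM, or LLM is plugged in.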

Method Diagram

Results

CRAFT improves prompt–image alignment on DSG1K and Parti-Prompts across all evaluated T2I models. It consistently increases VQA and DSG scores, with the largest gains on weaker backbones, and is strongly preferred in automatic side-by-side evaluations. The method is training-free, model-agnostic, and effective with only a single reasoning iteration.

DSG1K Evaluation

DSG1K: ~1K prompts evaluated via structured yes/no visual questions for fine-grained compositional correctness.
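As a rough illustration of how structured yes/no questions can be turned into a score, the sketch below counts a question as satisfied only when its own answer is "yes" and every question it depends on is also satisfied. This dependency rule and all names here are assumptions made for the example; the exact scoring used for DSG1K is not specified in this text.

```python
from typing import Dict, List, Tuple


def dependency_aware_score(
    answers: Dict[str, bool],
    parents: Dict[str, List[str]],
) -> float:
    """Fraction of questions that hold together with all of their ancestors."""
    if not answers:
        raise ValueError("at least one question is required")

    def holds(question: str, path: Tuple[str, ...] = ()) -> bool:
        if question in path:
            raise ValueError(f"cyclic dependency at {question!r}")
        if not answers[question]:
            return False
        return all(holds(p, path + (question,)) for p in parents.get(question, []))

    return sum(holds(q) for q in answers) / len(answers)


if __name__ == "__main__":
    # Hypothetical VLM answers for a "wolf on a cliff at sunset" prompt.
    answers = {"wolf": True, "wolf_on_cliff": False, "cliff_at_sunset": True}
    parents = {"wolf_on_cliff": ["wolf"], "cliff_at_sunset": ["wolf_on_cliff"]}
    print(dependency_aware_score(answers, parents))
```

In this example the sunset question is answered "yes" but is not credited, because the question it depends on failed.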

Results Sample 1

Parti-Prompts

Parti-Prompts: 1.6K+ long, complex prompts testing spatial reasoning, composition, and text rendering.

Results Sample 3

Human Evaluation

Human evaluation further confirms these trends. Across representative architectures (Z-Image-Turbo, FLUX-2 Pro, FLUX-Dev, and FLUX-Schnell), annotators prefer CRAFT over the baseline by wide margins.

Results Sample 5

BibTeX

@article{craft,
  title={CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation},
  author={V. Kovalev and A. Kuvshinov and A. Buzovkin and D. Pokidov and D. Timonin},
  journal={arXiv preprint arXiv:2512.20362},
  year={2025}
}