The core idea: the agent draws a part, looks at the canvas, reads the next text instruction, and draws again. Editability comes for free from this loop: swap out any part's strokes, regenerate them from a new description, and the rest of the sketch stays intact.
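The loop described above can be sketched in a few lines. This is a minimal illustration, not the released code: `draw_part` is a hypothetical stand-in for the VLM agent, and the canvas is modeled as a plain list of (part, stroke) pairs rather than a rendered image.

```python
from typing import Callable, List, Tuple

Stroke = str                       # an SVG path "d" string
Canvas = List[Tuple[str, Stroke]]  # (part name, stroke) pairs, in drawing order
# Hypothetical agent interface: (canvas so far, part description) -> new strokes
DrawFn = Callable[[Canvas, str], List[Stroke]]


def sketch_by_parts(draw_part: DrawFn, instructions: List[Tuple[str, str]]) -> Canvas:
    """Run one turn per part; each turn sees everything drawn before it."""
    canvas: Canvas = []
    for part, description in instructions:
        # Pass a copy so the agent can inspect the canvas but not mutate it.
        strokes = draw_part(list(canvas), description)
        canvas.extend((part, s) for s in strokes)
    return canvas


def toy_agent(canvas: Canvas, description: str) -> List[Stroke]:
    """Placeholder for the VLM: one horizontal line per turn, offset by canvas size."""
    y = 10 * (len(canvas) + 1)
    return [f"M 0 {y} L 100 {y}"]


sketch = sketch_by_parts(toy_agent, [("body", "a round body"), ("tail", "a curled tail")])
```

Because every stroke is stored with the part that produced it, the result already carries the information that later edits need.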
We train a VLM agent that draws vector sketches one part at a time. It's powered by three new ingredients: a generic VLM-driven annotation pipeline that labels semantic parts in any vector sketch, ControlSketch-Part — a part-annotated sketch dataset — and multi-turn process-reward GRPO, an RL algorithm that rewards the agent at every intermediate step.
A generic propose-critique-refine loop that splits any vector sketch into semantic parts and assigns every SVG path to a part.
High-quality part-annotated sketches with captions, part descriptions, and path-to-part maps — plus a multi-turn benchmark.
Dense per-step DreamSim rewards close the gap between oracle intermediate states and self-generated ones.
Three ingredients. One agent that actually draws one part at a time.
A VLM loops through propose → critique → refine → assign paths → verify with a color-coded diagnostic. Works on any vector sketch dataset.
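The control flow of such a loop might look like the following. Every callable here (`propose`, `critique`, `refine`, `assign`) is a hypothetical placeholder for a VLM call, and the final check is a simple coverage test standing in for the color-coded diagnostic, whose details are not given on this page.

```python
from typing import Callable, Dict, List


def annotate_parts(
    paths: List[str],
    propose: Callable[[List[str]], List[str]],                 # paths -> part names
    critique: Callable[[List[str]], List[str]],                # part names -> issues
    refine: Callable[[List[str], List[str]], List[str]],       # (parts, issues) -> parts
    assign: Callable[[List[str], List[str]], Dict[int, str]],  # -> path index -> part
    max_rounds: int = 3,
) -> Dict[int, str]:
    """Propose parts, refine until the critic is satisfied, then assign every path."""
    parts = propose(paths)
    for _ in range(max_rounds):
        issues = critique(parts)
        if not issues:
            break
        parts = refine(parts, issues)
    mapping = assign(paths, parts)
    # Verification stand-in: every path must map to a known part.
    missing = [i for i in range(len(paths)) if mapping.get(i) not in parts]
    if missing:
        raise ValueError(f"unassigned or unknown-part paths: {missing}")
    return mapping


# Toy stubs in place of real VLM calls.
paths = ["M0 0 L1 1", "M2 2 L3 3", "M4 4 L5 5"]
mapping = annotate_parts(
    paths,
    propose=lambda p: ["body", "body", "tail"],
    critique=lambda parts: ["duplicate part"] if len(set(parts)) < len(parts) else [],
    refine=lambda parts, issues: sorted(set(parts)),
    assign=lambda p, parts: {i: parts[min(i, len(parts) - 1)] for i in range(len(p))},
)
```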

35K high-quality sketches, each enriched with a short caption, per-part descriptions, and a full path-to-part map.

Qwen3-VL-30B-A3B + LoRA. Each RL turn draws one part; DreamSim on the partial render gives dense, per-step rewards that close the oracle-vs-self-generation gap.
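One way to picture the dense reward: after each turn, the partial render is compared with a reference partial sketch, and rewards are normalized across a group of rollouts, as GRPO does with its group mean and standard deviation. The sketch below is an illustration under stated assumptions, not the paper's training code: `distance` is a hypothetical perceptual metric standing in for DreamSim, the reward is taken to be the negative distance, and normalizing separately at each turn is one plausible reading of "multi-turn process-reward GRPO".

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def per_turn_advantages(
    rollouts: Sequence[Sequence[object]],  # rollouts[g][t]: partial render of rollout g after turn t
    references: Sequence[object],          # references[t]: reference partial render after turn t
    distance: Callable[[object, object], float],  # hypothetical stand-in for DreamSim
    eps: float = 1e-8,
) -> List[List[float]]:
    """Return advantages[g][t]: the turn-t reward, normalized across the group."""
    n_turns = len(references)
    rewards = [[-distance(r[t], references[t]) for t in range(n_turns)] for r in rollouts]
    advantages = [[0.0] * n_turns for _ in rollouts]
    for t in range(n_turns):
        column = [row[t] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, reward in enumerate(column):
            advantages[g][t] = (reward - mu) / (sigma + eps)
    return advantages


# Toy example: "renders" are numbers and the distance is an absolute difference.
adv = per_turn_advantages(
    rollouts=[[1.0, 2.0], [3.0, 2.0]],
    references=[1.0, 2.0],
    distance=lambda a, b: abs(a - b),
)
```

In the toy example the two rollouts differ only at the first turn, so only that turn receives a non-zero advantage; a single end-of-episode reward could not localize the difference this way.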

Part-level edit control.
Every stroke is tagged with a part, so you can swap, remove, or regenerate any region with a new text description — without redoing the whole sketch.
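Because each stroke carries a part tag, an edit reduces to filtering and replacing entries. A minimal sketch, assuming a sketch is stored as a list of (part, SVG path) pairs; this representation, the `data-part` attribute, and the function names are illustrative, not the dataset's actual schema.

```python
from typing import List, Tuple

TaggedPath = Tuple[str, str]  # (part name, SVG path "d" string); illustrative schema


def remove_part(sketch: List[TaggedPath], part: str) -> List[TaggedPath]:
    """Drop every stroke tagged with `part`; all other strokes keep their order."""
    return [(p, d) for p, d in sketch if p != part]


def swap_part(sketch: List[TaggedPath], part: str, new_paths: List[str]) -> List[TaggedPath]:
    """Replace the strokes of `part`; the new strokes are appended, so they draw last."""
    return remove_part(sketch, part) + [(part, d) for d in new_paths]


def to_svg(sketch: List[TaggedPath], size: int = 100) -> str:
    """Serialize to SVG, recording each path's part in a data attribute."""
    body = "".join(
        f'<path d="{d}" data-part="{p}" fill="none" stroke="black"/>' for p, d in sketch
    )
    return f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">{body}</svg>'


cat = [("body", "M 10 50 L 90 50"), ("tail", "M 90 50 L 99 30"), ("body", "M 10 60 L 90 60")]
edited = swap_part(cat, "tail", ["M 90 50 Q 95 20 80 10"])
```

In the full system the replacement strokes would come from the agent, given the new text description and the remaining parts as context; here they are supplied by hand.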
@article{du2026sketch,
title = {Teaching an Agent to Sketch One Part at a Time},
author = {Du, Xiaodan and Xu, Ruize and Yunis, David and Vinker, Yael and Shakhnarovich, Greg},
journal = {arXiv preprint arXiv:2603.19500},
year = {2026}
}