Teaching an Agent to Sketch One Part at a Time

The core idea: the agent draws a part, looks at the canvas, reads the next text instruction, and draws again. This loop makes the sketch editable for free: swap out any part's strokes, regenerate them from a new description, and the rest stays intact.

TL;DR

We train a VLM agent that draws vector sketches one part at a time. It's powered by three new ingredients: a generic VLM-driven annotation pipeline that labels semantic parts in any vector sketch, ControlSketch-Part — a part-annotated sketch dataset — and multi-turn process-reward GRPO, an RL algorithm that rewards the agent at every intermediate step.

01 · Data pipeline

Automated part annotation

A generic propose-critique-refine loop that splits any vector sketch into semantic parts and assigns every SVG path to a part.

02 · Dataset

ControlSketch-Part

High-quality part-annotated sketches with captions, part descriptions, and path-to-part maps — plus a multi-turn benchmark.

03 · Training

Multi-turn process-reward GRPO

Dense per-step DreamSim rewards close the gap between oracle intermediate states and self-generated ones.

How it works

Three ingredients. One agent that actually draws one part at a time.

A pipeline that labels parts automatically

A VLM loops through propose → critique → refine → assign paths → verify with a color-coded diagnostic. Works on any vector sketch dataset.
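To make the final "verify" step concrete, here is a minimal sketch of a color-coded diagnostic: each path is redrawn in a color chosen by its assigned part, so a reviewer (human or VLM) can spot misassigned strokes at a glance. The function name, the palette, and the gray fallback for unassigned paths are assumptions for illustration, not the pipeline's actual rendering code.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical palette; the colors used by the real pipeline are not specified.
PALETTE = ["#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4", "#46f0f0"]


def color_coded_svg(paths, path_to_part, size=512):
    """Render every path in a color determined by its assigned part.

    paths:        list of SVG path `d` strings.
    path_to_part: dict mapping path index -> part name.
    Paths with no assignment are drawn in gray so they stand out.
    """
    parts = sorted(set(path_to_part.values()))
    color_of = {p: PALETTE[i % len(PALETTE)] for i, p in enumerate(parts)}
    body = []
    for i, d in enumerate(paths):
        color = color_of.get(path_to_part.get(i), "#999999")
        body.append(
            f'<path d={quoteattr(d)} stroke="{color}" fill="none" stroke-width="2"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">'
        + "".join(body)
        + "</svg>"
    )
```

The returned string can be rasterized with any SVG renderer and shown next to the original sketch for the verification pass.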

A dataset: ControlSketch-Part

35K high-quality sketches, each enriched with a short caption, per-part descriptions, and a full path-to-part map.
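One way to picture a single record is sketched below. The class and field names are hypothetical and do not reflect the dataset's actual schema. Only the three kinds of annotation (caption, per-part descriptions, path-to-part map) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PartAnnotatedSketch:
    caption: str                        # short caption for the whole sketch
    paths: list[str]                    # SVG path `d` strings, in drawing order
    part_descriptions: dict[str, str]   # part name -> text description
    path_to_part: dict[int, str]        # path index -> part name

    def validate(self) -> None:
        """Check that every path is assigned to a described part."""
        for i in range(len(self.paths)):
            part = self.path_to_part.get(i)
            if part is None:
                raise ValueError(f"path {i} has no part")
            if part not in self.part_descriptions:
                raise ValueError(f"path {i} maps to undescribed part {part!r}")

    def paths_of(self, part: str) -> list[str]:
        """Return the strokes belonging to one part, in drawing order."""
        return [d for i, d in enumerate(self.paths) if self.path_to_part[i] == part]
```

A full path-to-part map in this sense means `validate()` passes: no stroke is left unassigned, and every assigned part has a description.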


Training: SFT then multi-turn process-reward GRPO

Qwen3-VL-30B-A3B + LoRA. Each RL turn draws one part; DreamSim on the partial render gives dense, per-step rewards that close the oracle-vs-self-generation gap.
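The sketch below illustrates what "dense, per-step rewards" could look like once the rewards are in hand. It assumes each rollout in a group already has one scalar reward per turn (for example, derived from DreamSim on the partial render), and it applies the usual GRPO group normalization separately at each turn. Normalizing per turn is an assumption made here for illustration, as is the function name. The page does not spell out the exact advantage formula.

```python
from statistics import mean, pstdev


def per_turn_advantages(rewards, eps=1e-6):
    """Group-normalize rewards at each turn.

    rewards[g][t] is the reward of rollout g at turn t (one part per turn).
    Returns advantages with the same shape: at every turn, each rollout is
    scored relative to the other rollouts in its group.
    """
    num_turns = len(rewards[0])
    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, r in enumerate(column):
            # eps keeps the division finite when all rollouts tie at this turn.
            advantages[g][t] = (r - mu) / (sigma + eps)
    return advantages
```

Compared with a single reward on the finished sketch, a per-turn signal like this tells the policy which part went wrong, not just that the final drawing fell short.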

Editable by design

Part-level edit control.

Every stroke is tagged with a part, so you can swap, remove, or regenerate any region with a new text description — without redoing the whole sketch.
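Because the part tag lives on each stroke, an edit reduces to filtering and appending. The helper below is a minimal sketch with hypothetical names. In practice `new_paths` would come from the agent redrawing the part from a new description, whereas here it is just a list of SVG path strings.

```python
def replace_part(paths, path_to_part, part, new_paths):
    """Swap the strokes of one part and leave every other stroke untouched.

    paths:        list of SVG path `d` strings.
    path_to_part: dict mapping path index -> part name.
    part:         name of the part to replace.
    new_paths:    replacement strokes (pass [] to simply remove the part).
    Returns a new (paths, path_to_part) pair; the inputs are not modified.
    """
    kept = [(d, path_to_part[i]) for i, d in enumerate(paths) if path_to_part[i] != part]
    out_paths = [d for d, _ in kept] + list(new_paths)
    out_map = {i: p for i, (_, p) in enumerate(kept)}
    # Replacement strokes are appended after the kept ones and tagged with `part`.
    out_map.update({len(kept) + j: part for j in range(len(new_paths))})
    return out_paths, out_map
```

For example, `replace_part(paths, path_to_part, "tail", [])` drops the tail, while passing freshly generated strokes regenerates it in place of the old one.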

Progressive editing examples. Left: the same part descriptions applied from a different starting point yield different outputs. Right: swapping one early part description keeps the differences local.

Results

See it side-by-side

**Ours** produces smooth paths, natural style, and identifiable parts. Baselines tend toward simple geometric primitives or suffer from compounding errors. Class labels are shown for reference only and are not used by our model.


By the numbers

Long-CLIP cosine similarity: the full model (SFT + RL) beats every baseline and the SFT-only variant.

User study (3,092 pairwise comparisons): ours is preferred in every baseline comparison, in both final-output and step-by-step modes.

BibTeX

@article{du2026sketch,
  title   = {Teaching an Agent to Sketch One Part at a Time},
  author  = {Du, Xiaodan and Xu, Ruize and Yunis, David and Vinker, Yael and Shakhnarovich, Greg},
  journal = {arXiv preprint arXiv:2603.19500},
  year    = {2026}
}