Autoregressive Generation


We also explore adapting our framework to autoregressive sketch generation, which enables interactive drawing scenarios that are difficult to support with diffusion-based models. The autoregressive model produces visually coherent sketches with a clear stroke-by-stroke progression, though with slightly lower visual fidelity than the diffusion-based approach.
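
To make the difference from the diffusion setting concrete, the sketch below shows a minimal autoregressive rollout in which each frame is predicted from the frames generated so far, so strokes accumulate on the canvas. The predictor interface, frame count, and canvas size are placeholder assumptions for illustration, not our actual implementation.

```python
import numpy as np

# Hypothetical stand-in for an autoregressive next-frame predictor.
# A real model would condition on the text prompt and all previous frames;
# here we simply return a copy of the last frame so the loop runs end to end.
def predict_next_frame(prompt: str, frames: list) -> np.ndarray:
    return frames[-1].copy()

def generate_sketch_video(prompt: str, num_frames: int = 32,
                          size: int = 256) -> list:
    """Autoregressive rollout: each new frame is predicted from the frames
    generated so far, so strokes accumulate on the canvas over time."""
    canvas = np.full((size, size), 255, dtype=np.uint8)  # blank white canvas
    frames = [canvas]
    for _ in range(num_frames - 1):
        frames.append(predict_next_frame(prompt, frames))
    return frames

video = generate_sketch_video("a roaring tiger")
print(len(video), video[0].shape)  # 32 (256, 256)
```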

Interactive Co-drawing


This application uses our autoregressive method, and the video is shown at real-time speed.
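
Schematically, co-drawing alternates turns on a shared canvas: the user adds strokes, then the model continues autoregressively from the updated canvas. The helpers below are stubs standing in for the real drawing interface and video model; this is an assumed sketch of the interaction loop, not our actual code.

```python
import numpy as np

# Stubs standing in for the real drawing UI and the autoregressive video model,
# kept trivial so the interaction loop itself is runnable.
def get_user_strokes(canvas: np.ndarray) -> np.ndarray:
    return canvas  # in the real app: strokes captured from the drawing canvas

def model_continue(prompt: str, canvas: np.ndarray, n_frames: int) -> np.ndarray:
    return canvas.copy()  # in the real app: autoregressive continuation frames

def co_draw(prompt: str, turns: int = 4, size: int = 256) -> np.ndarray:
    """Turn-taking on a shared canvas: the user adds strokes, then the model
    continues the drawing conditioned on the updated canvas."""
    canvas = np.full((size, size), 255, dtype=np.uint8)
    for _ in range(turns):
        canvas = get_user_strokes(canvas)            # user's turn
        canvas = model_continue(prompt, canvas, 8)   # model's turn
    return canvas
```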

Comparisons to Prior Work


Comparison of sketch generation progress across methods. Wan2.1 produces near-static outputs with limited temporal progression. PaintsUndo reveals detailed structures early due to its undo-based formulation, but generates painting-like results rather than vector sketches. SketchAgent better follows human drawing order but often yields overly simplistic and less recognizable outputs. Our method closely matches human sketching progression while achieving higher final quality, producing semantically structured and detailed sketches.

Ablation


We find that full two-stage training is necessary for both reliable ordering control and the desired sketch appearance. Training on synthetic shapes alone improves ordering consistency but yields primitive-looking strokes with weaker recognizability. Training on real sketches alone improves visual style but often violates the specified order. Combining both stages transfers ordering fidelity into the sketch domain and delivers the best overall results.
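
For illustration, a schematic version of this two-stage schedule might look like the following; the step counts and learning rates are hypothetical placeholders, and `train_step` stands in for whichever video-model update is used.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Stage:
    name: str
    dataset: Iterable   # videos paired with the stroke order to follow
    steps: int          # placeholder step counts, not the values we used
    lr: float           # placeholder learning rates

def train_two_stage(train_step: Callable, synthetic_shapes, real_sketches):
    """Schematic two-stage schedule: stage 1 instils ordering control on
    synthetic shape sequences, stage 2 fine-tunes on real sketches to carry
    that control into the target visual style."""
    schedule = [
        Stage("synthetic_shapes", synthetic_shapes, steps=10_000, lr=1e-4),
        Stage("real_sketches",    real_sketches,    steps=5_000,  lr=5e-5),
    ]
    for stage in schedule:
        for batch in itertools.islice(itertools.cycle(stage.dataset), stage.steps):
            train_step(batch, lr=stage.lr)
```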

Limitations


Multiple strokes per frame

Operating in pixel space provides less explicit structural control than parametric stroke representations, which can occasionally lead to violations of sketching constraints, such as multiple strokes appearing within a single frame.
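
One simple way to surface this failure mode is a pixel-space diagnostic that thresholds the ink in consecutive frames and counts connected components among the newly added pixels; frames that add more than one component likely contain multiple strokes. The sketch below is a hypothetical post-hoc check (the ink threshold and the one-component-per-stroke assumption are ours), not part of the model.

```python
import numpy as np
from scipy import ndimage

def new_stroke_count(prev_frame: np.ndarray, cur_frame: np.ndarray,
                     ink_threshold: int = 128) -> int:
    """Count connected components of ink added between two consecutive frames.
    Frames are assumed to be grayscale uint8 with dark strokes on white paper;
    the threshold and the one-component-per-stroke assumption are rough guesses."""
    prev_ink = prev_frame < ink_threshold
    cur_ink = cur_frame < ink_threshold
    new_ink = cur_ink & ~prev_ink                 # pixels drawn in this frame
    _, num_components = ndimage.label(new_ink)    # each component ~ one stroke
    return num_components

def flag_multi_stroke_frames(frames: list) -> list:
    """Indices of frames that appear to add more than one stroke at once."""
    return [i for i in range(1, len(frames))
            if new_stroke_count(frames[i - 1], frames[i]) > 1]
```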

Prompt adherence

Prompt adherence is not guaranteed. When the model has a strong visual prior, it may deviate from the instructions. For example, in the "tiger roaring" prompt, the model changes the action late in the video and introduces color.

Limited knowledge

Performance also depends on the underlying video model’s concept knowledge, which is more limited than that of LLMs for specialized domains such as mathematics.

AR quality gap

Finally, while we demonstrate autoregressive sketch generation, the resulting outputs do not yet match the visual quality of the diffusion-based model, reflecting the fact that autoregressive video models are currently less mature than their diffusion-based counterparts.