Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Abstract

We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling.

Unlike conventional approaches that predict the next-step action(s) forward in time, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure.
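The two-part structure above (keyframe first, then backward autoregression) can be illustrated with a minimal sketch. It is not the authors' implementation: `generate_backward`, `toy_keyframe`, and `toy_previous` are hypothetical names, the learned decoder is replaced by hand-written toy functions, and a 2D position stands in for a full action.

```python
from typing import Callable, List, Sequence

Action = List[float]  # toy 2D position; a stand-in for a full pose plus gripper state


def generate_backward(
    predict_keyframe: Callable[[Sequence[float]], Action],
    predict_previous: Callable[[Sequence[float], List[Action]], Action],
    observation: Sequence[float],
    num_steps: int,
) -> List[Action]:
    """Generate a trajectory from goal to start, then return it in execution order."""
    # (1) The first token is the keyframe action that encodes the task goal.
    chain = [predict_keyframe(observation)]
    # (2) Each later token is conditioned on the keyframe and all actions generated so far.
    while len(chain) < num_steps:
        chain.append(predict_previous(observation, chain))
    # The chain runs goal -> start, so reverse it before execution.
    return chain[::-1]


# Toy stand-ins for the learned model (hypothetical, for illustration only).
def toy_keyframe(observation: Sequence[float]) -> Action:
    # Assumption: the goal is readable from the last two entries of the observation.
    return list(observation[2:4])


def toy_previous(observation: Sequence[float], chain: List[Action]) -> Action:
    # Move halfway from the most recently generated action back toward the current pose.
    current, latest = observation[0:2], chain[-1]
    return [l + 0.5 * (c - l) for l, c in zip(latest, current)]


obs = [0.0, 0.0, 1.0, 2.0]  # [current_x, current_y, goal_x, goal_y]
trajectory = generate_backward(toy_keyframe, toy_previous, obs, num_steps=4)
print(trajectory)  # last element is the keyframe action
```

Because every generated action is derived from the keyframe, the whole chain shifts when the goal shifts, which is one way to read the global-to-local constraint described above.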

As a result, CoA exhibits strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.

Framework

Chain-of-Action is built on trajectory autoregressive modeling. The left part illustrates the network architecture, with notation shown for the training stage, and the right part illustrates the execution process. The model encodes visual and proprioceptive observations, and an autoregressive decoder generates actions in reverse order, starting from a predicted keyframe action. For clarity, the keyframe action $a_T$ is shown in green, and subsequent steps are visualized with a gradual color transition.
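One way to write down the reverse generation order is as a factorization over the action sequence. The symbols $o$ (the encoded observations), $p_\theta$ (the decoder), and the probabilistic notation itself are assumptions introduced here for illustration; with continuous action tokens the model may equally be read as a deterministic predictor.

$$
p_\theta(a_T, a_{T-1}, \dots, a_1 \mid o) \;=\; p_\theta(a_T \mid o)\,\prod_{t=1}^{T-1} p_\theta(a_t \mid a_{t+1}, \dots, a_T,\, o)
$$

The first factor is the keyframe prediction, and the remaining factors are evaluated from $t = T-1$ down to $t = 1$, so every earlier action is conditioned on the goal-encoding action $a_T$.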

Results

60 tasks on RLBench

Average success rate: CoA 0.552, ACT 0.389, DP 0.326

Spatial Generalization

Study of spatial generalization on the Push Button task. Gray crosses indicate the 100 training samples. Colored dots represent test samples: green for success, red for failure. The black dashed line separates 50 interpolation samples (in-distribution) from 50 extrapolation samples (out-of-distribution).

Task Demonstrations

Hockey

Open Microwave

Take Shoes Out of Box

Put Shoes in Box

Take Plate Off Dish Rack

Put Plate in Dish Rack

Put Rubbish in Bin

Hang Frame on Hanger

Toilet Seat Up

Stack Wine

Lamp On

Reach and Drag

Put Books on Bookshelf

Slide Block to Target

Put Rubbish in Bin (Alt)

Open Drawer

Generated Trajectory Visualization

The trajectory is regenerated after every single executed step, in a closed-loop manner. The 2D traces shown stand for the full 6-DoF pose and gripper state.
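The closed-loop scheme described here, replanning after each executed step, can be sketched as follows. This is a toy illustration under stated assumptions: `plan_backward` is a hypothetical stand-in for the policy, the environment is assumed to track each commanded action perfectly, and the tolerance and step limit are arbitrary.

```python
from typing import List


def plan_backward(current: List[float], goal: List[float], num_steps: int = 3) -> List[List[float]]:
    """Hypothetical stand-in for the policy: build a goal-to-start chain, return it start-to-goal."""
    chain = [list(goal)]
    while len(chain) < num_steps:
        latest = chain[-1]
        # Move halfway from the latest generated action back toward the current pose.
        chain.append([l + 0.5 * (c - l) for l, c in zip(latest, current)])
    return chain[::-1]


def run_closed_loop(start: List[float], goal: List[float],
                    max_env_steps: int = 50, tol: float = 1e-2) -> List[List[float]]:
    pose = list(start)
    executed = []
    for _ in range(max_env_steps):
        if max(abs(p - g) for p, g in zip(pose, goal)) < tol:
            break  # close enough to the goal
        trajectory = plan_backward(pose, goal)  # regenerate from the latest observation
        pose = trajectory[0]                    # execute only the first action, then replan
        executed.append(pose)
    return executed


goal = [1.0, 2.0]
executed = run_closed_loop(start=[0.0, 0.0], goal=goal)
print(len(executed), executed[-1])
```

Only the first action of each regenerated trajectory is executed, so any disturbance between steps is absorbed by the next replanning pass.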

Take Lid Off Saucepan

Sweep to Dustpan

Stack Wine

Reach Target

Push Button

Press Switch

Pick Up Cup

Open Drawer

Open Box

Turn Tap

FAQ

Is the keyframe action (i.e., goal pose) predicted or provided?

Both the keyframe action and subsequent actions are predicted. They are unified within the same action space and generated through autoregressive modeling. The keyframe action is obtained using a learnable start-of-sequence token.

How does CoA differ from traditional methods (such as pose estimation and planning)?

CoA offers greater flexibility and can handle more complex tasks. It is environment-aware, capable of executing actions in a closed-loop manner, and does not depend on high-quality 3D perception. Overall, CoA is a visuo-motor policy algorithm that can be compared to ACT and DP.

Have you tried providing the keyframe action to ACT?

Yes, we have experimented with providing the keyframe action to ACT, but the improvement was not significant. Our ablation study showed that both action chain modeling and using the keyframe action as the start of the sequence are necessary.

How do you ensure that the generated actions end at the gripper's starting position?

We design a dynamic stopping mechanism that halts the generation process once the gripper reaches its starting position.
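A minimal sketch of such a stopping rule is shown below. The answer above does not spell out the criterion, so the distance check, the `stop_radius` threshold, the `max_len` cap, and the toy `toy_step_back` generator are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Action = List[float]


def generate_until_start(
    keyframe: Sequence[float],
    step_back: Callable[[List[Action]], Action],
    start_pose: Sequence[float],
    stop_radius: float = 0.05,  # assumed threshold, not taken from the paper
    max_len: int = 100,         # safety cap on the trajectory length
) -> List[Action]:
    """Generate backward from the keyframe and stop once the chain reaches the start pose."""
    chain = [list(keyframe)]
    while len(chain) < max_len:
        if math.dist(chain[-1], start_pose) <= stop_radius:
            break  # the generated chain has reached the gripper's starting position
        chain.append(step_back(chain))
    return chain[::-1]  # execution order: start -> goal


def toy_step_back(chain: List[Action], start_pose=(0.0, 0.0), step: float = 0.25) -> Action:
    # Hypothetical stand-in for the decoder: take a fixed-length step toward the start pose.
    latest = chain[-1]
    d = math.dist(latest, start_pose)
    if d <= step:
        return list(start_pose)
    return [l + step * (s - l) / d for l, s in zip(latest, start_pose)]


start = (0.0, 0.0)
near = generate_until_start([1.0, 0.0], toy_step_back, start)
far = generate_until_start([3.0, 0.0], toy_step_back, start)
print(len(near), len(far))  # the farther goal yields the longer trajectory
```

The trajectory length is not fixed in advance: it follows from how far the keyframe action lies from the starting pose, which matches the variable-length generation mentioned in the abstract.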

BibTeX

@article{zhang2025chainofaction,
  author    = {Zhang, Wenbo and Hu, Tianrun and Qiao, Yanyuan and Zhang, Hanbo and Qin, Yuchu and Li, Yang and Liu, Jiajun and Kong, Tao and Liu, Lingqiao and Ma, Xiao},
  title     = {Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation},
  journal   = {arxiv},
  year      = {2025},
}

TL;DR: Chain-of-Action generates actions from goal to start, and this simple reformulation alone enhances spatial generalization: NO TRICKS, NO MORE DATA, JUST BY MODELING.
