Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Peng Jin1,2,3, Hao Li1,2,3, Zesen Cheng1,3, Kehan Li1,3,
Runyi Yu1,3, Chang Liu4, Xiangyang Ji4, Li Yuan1,2,3, Jie Chen1,2,3,
1School of Electronic and Computer Engineering, Peking University
2Peng Cheng Laboratory
3AI for Science (AI4S)-Preferred Program, Peking University
4Department of Automation and BNRist, Tsinghua University
Teaser image.

Our method (GuidedMotion) empowers users to combine preferred local actions freely, generating motions that align with their mental imagery.

Abstract

Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity by sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method allows various local actions to be combined seamlessly and their guiding weights to be adjusted continuously, accommodating diverse user preferences, which may be of value to the community.

Video

Summary

Problems

  • Existing text-to-motion generation methods primarily focus on directly synthesizing global motions from language instructions. However, they offer only limited control over the generated motions.
  • Typically, generating a motion that faithfully corresponds to our mental imagery requires numerous iterations of editing a prompt, reviewing the resulting motion, and then adjusting the prompt accordingly.

Our Solutions

  • We propose to employ reference local actions as control signals in the global motion generation process.
  • These control signals steer the generated global motion toward the characteristics of the reference local actions, including their movement trajectories and body postures.
  • Users can seamlessly combine their preferred local actions, exerting precise control over the resulting global motion to align with the characteristics of those chosen local actions.

Our Pipeline

In GuidedMotion, we provide an automatic local action sampling method, which deconstructs the original motion description into multiple local action descriptions and uses a text-to-motion model to generate the reference local actions. Subsequently, we leverage graph attention networks to estimate the guiding weight of each local action in the overall motion synthesis. To enhance generation stability, we divide the motion diffusion process for synthesizing global motion into three stages (a rough code sketch follows the pipeline figure below):

  • In the initial diffusion stage, we denoise the Gaussian noise conditioned on the original motion description, providing a good initialization for the subsequent stage.
  • In the second diffusion stage, we apply local-action gradients derived from an energy function to provide conditional guidance, aligning the generated motion with the characteristics of the reference local actions.
  • In the final diffusion stage, we further refine the generated motion so that it conforms to the original motion description rather than solely adhering to the reference local actions.
Method pipeline.
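The two sketches below are minimal PyTorch illustrations of these ideas, not the released GuidedMotion code. The first uses plain self-attention over a fully connected action graph as a simplified stand-in for the graph attention network that estimates guiding weights; the second shows a DDPM-style sampling loop in which the middle stage applies the gradient of a weighted-L2 energy toward the reference local actions. All module names, interfaces, stage boundaries (guide_start, guide_end), and tensor shapes are assumptions made for illustration.

import torch
import torch.nn as nn

class ActionWeightEstimator(nn.Module):
    """Illustrative guiding-weight estimator: each local action embedding is a
    node of a fully connected graph; self-attention stands in for graph
    attention, and the attended features are scored and normalized so the
    weights sum to one. Dimensions and architecture are assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, action_emb):              # action_emb: [num_actions, dim]
        x = action_emb.unsqueeze(0)             # add batch dim -> [1, N, dim]
        x, _ = self.attn(x, x, x)               # message passing among actions
        logits = self.score(x).squeeze(-1)      # [1, N]
        return torch.softmax(logits, dim=-1).squeeze(0)   # guiding weights, [N]

The sampling sketch mirrors the three stages above: text-conditioned denoising first, energy-gradient guidance toward the reference local actions in the middle, and text-conditioned refinement at the end.

@torch.no_grad()
def denoise_step(model, x_t, t, text_emb, betas, alphas_cumprod):
    """One text-conditioned DDPM reverse step (standard posterior mean);
    the denoiser interface model(x_t, t, text_emb) is assumed."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alphas_cumprod[t]
    eps = model(x_t, t, text_emb)               # predicted noise
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)

def local_action_energy(x, ref_actions, weights):
    """Illustrative energy: weighted L2 distance between the current motion
    estimate and each reference local action (assumed pre-aligned in time)."""
    energy = x.new_zeros(())
    for ref, w in zip(ref_actions, weights):
        energy = energy + w * ((x - ref) ** 2).mean()
    return energy

def sample_motion(model, text_emb, ref_actions, weights, betas, alphas_cumprod,
                  T=1000, guide_start=800, guide_end=200, guide_scale=1.0,
                  shape=(1, 196, 263)):         # e.g. HumanML3D: 196 frames x 263 dims
    x_t = torch.randn(shape)
    for t in reversed(range(T)):
        # Stage 2: steer x_t with the gradient of the local-action energy.
        if guide_end <= t < guide_start:
            x_in = x_t.detach().requires_grad_(True)
            energy = local_action_energy(x_in, ref_actions, weights)
            grad = torch.autograd.grad(energy, x_in)[0]
            x_t = (x_t - guide_scale * grad).detach()
        # Stages 1 and 3 (and every step): denoise on the full description.
        x_t = denoise_step(model, x_t, t, text_emb, betas, alphas_cumprod)
    return x_t

In the actual pipeline, the reference local actions are themselves generated by a text-to-motion model from the decomposed descriptions, and the paper's energy function is richer than the raw-pose distance used here; the sketch only conveys the overall control flow.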

BibTeX

@inproceedings{guidedmotion,
  title={Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation},
  author={Jin, Peng and Li, Hao and Cheng, Zesen and Li, Kehan and Yu, Runyi and Liu, Chang and Ji, Xiangyang and Yuan, Li and Chen, Jie},
  booktitle={ECCV},
  year={2024}
}