Unsupervised learning of object structure and dynamics from videos

Matthias Minderer^*, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

Google Research
^*Google AI Resident

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

Supplemental Videos

Video generation quality across models (Human3.6M)

Comparison of video generation quality across models. The marker on the left is green for observed frames and red for predicted frames. Columns show different examples.

Abbreviations:
SVG: Stochastic Video Generation with a Learned Prior (Denton & Fergus, 2018).
Struct-VRNN: Our full model (structured representation, stochastic dynamics, best-of-many-samples objective).
CNN-VRNN: Model without structured representation.
Struct-RNN: Model with deterministic dynamics.
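
The best-of-many-samples objective named above can be illustrated with a small NumPy sketch. It assumes the usual formulation of this objective: draw several candidate futures from the stochastic model, score each against the ground truth, and train only on the lowest-loss sample. The function name `best_of_many_loss`, the mean-squared-error scoring, and the `sample_fn` callable are hypothetical choices for illustration and are not taken from the released model.

```python
import numpy as np


def best_of_many_loss(sample_fn, target, num_samples, rng):
    """Return the lowest per-sample loss and the index of that sample.

    sample_fn: hypothetical callable that takes `rng` and returns one sampled
        prediction with the same shape as `target`.
    target: ground-truth array (e.g. future keypoints or frames).
    num_samples: number of candidate futures to draw.
    rng: a numpy.random.Generator passed through to `sample_fn`.
    """
    target = np.asarray(target, dtype=float)
    # Score every sampled future with a mean squared error (an assumption).
    losses = np.array(
        [np.mean((sample_fn(rng) - target) ** 2) for _ in range(num_samples)]
    )
    # Keep only the best sample; in training, gradients would flow through
    # this sample alone.
    best_index = int(np.argmin(losses))
    return float(losses[best_index]), best_index


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.zeros((4, 2))
    # Toy stochastic "model": the target plus Gaussian noise.
    noisy_sample = lambda r: target + r.normal(scale=0.5, size=target.shape)
    loss, index = best_of_many_loss(noisy_sample, target, num_samples=8, rng=rng)
    print(f"best sample: {index}, loss: {loss:.4f}")
```

Because only the closest sample is penalized, the remaining samples are free to cover other plausible futures, which is the intended effect on sample diversity.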

Sample diversity (Human3.6M)

Videos in the same row were conditioned on the same observed frames.

Abbreviations:
Struct-VRNN: Our full model (structured representation, stochastic dynamics, best-of-many-samples objective).
No BoM: Struct-VRNN without best-of-many-samples objective.

Example 1

Example 2

Example 3

Example 4

Example 5

Example 6

Example 7

Example 8

Example 9

Example 10

Keypoint manipulation (Human3.6M)

Keypoints for each limb were manually identified based on the left-most image. Keypoints for a single limb were then manipulated by rotating them around the joint of the limb, while holding the other keypoints static. Columns show different examples.
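
The manipulation described above amounts to a 2D rotation of a subset of keypoints around a pivot. The sketch below shows one way to write it, assuming keypoints are stored as a `(K, 2)` array of `(x, y)` coordinates. The function name `rotate_limb_keypoints` and its arguments are hypothetical; the limb indices and joint position would come from the manual identification step.

```python
import numpy as np


def rotate_limb_keypoints(keypoints, limb_indices, joint_xy, angle_rad):
    """Rotate the selected keypoints around a joint; leave the rest static.

    keypoints: (K, 2) array of (x, y) coordinates.
    limb_indices: indices of the keypoints that belong to the limb.
    joint_xy: (x, y) position of the joint used as the pivot.
    angle_rad: rotation angle in radians (counter-clockwise when y points up).
    """
    keypoints = np.asarray(keypoints, dtype=float)
    pivot = np.asarray(joint_xy, dtype=float)
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[cos_a, -sin_a], [sin_a, cos_a]])

    rotated = keypoints.copy()
    # Shift to the pivot, rotate, and shift back; other rows are unchanged.
    rotated[limb_indices] = (keypoints[limb_indices] - pivot) @ rotation.T + pivot
    return rotated


if __name__ == "__main__":
    points = np.array([[1.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
    # Rotate the first two keypoints by 90 degrees around the origin.
    print(rotate_limb_keypoints(points, [0, 1], (0.0, 0.0), np.pi / 2))
```

In image coordinates, where y points down, a positive angle appears as a clockwise rotation on screen. The manipulated keypoints would then be passed to the decoder together with the reference frame to render the edited image.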

Video generation quality across models (Basketball)

Comparison of video generation quality across models. The marker on the left is green for observed frames and red for predicted frames. Each column shows a different example.

Action-conditional video generation quality (DMCS)

Video generation quality for the DeepMind Control Suite dataset. A single model was trained on data from all tasks. Columns show different examples.

Abbreviations:
Struct-VRNN: Our full model (structured representation, stochastic dynamics, best-of-many-samples objective).
CNN-VRNN: Model without structured representation.