Research on 2D and 3D generative models typically focuses on the final artifact being created, e.g., an image or a 3D structure. Unlike 2D image generation, the generation of 3D objects in the real world is commonly constrained by the process and order in which the object is constructed. For instance, gravity needs to be taken into account when building a block tower.
In this paper, we explore the prediction of ordered actions to construct 3D objects. Instead of predicting actions based on physical constraints, we propose learning through observing human actions. To enable large-scale data collection, we use the Minecraft environment. We introduce 3D-Craft, a new dataset of 2,500 Minecraft houses, each built by human players sequentially from scratch. To learn from these human action sequences, we propose an order-aware 3D generative model called VoxelCNN. In contrast to other 3D generative models, which either have no explicit order (e.g., holistic generation with 3DGAN) or follow a simple heuristic order (e.g., raster-scan), VoxelCNN is trained to imitate human building order with spatial awareness. We also transfer the learned order to other datasets such as ShapeNet. The 3D-Craft dataset, models, and benchmark system will be made publicly available, which may inspire new directions for future research.
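To make the notion of build order concrete, the following is a minimal toy sketch, not the VoxelCNN model or the 3D-Craft data format. It assumes a build is a list of integer `(x, y, z)` block positions with `y` as height, and the helper names `raster_scan_order` and `is_supported_order` are hypothetical. It shows a raster-scan ordering and a simple gravity-style check of the kind the block-tower example suggests.

```python
from typing import List, Tuple

# Assumption: a block is an integer grid position (x, y, z), with y as height.
Block = Tuple[int, int, int]


def raster_scan_order(blocks: List[Block]) -> List[Block]:
    """Heuristic order: sweep layer by layer (y), then row (z), then column (x)."""
    return sorted(blocks, key=lambda b: (b[1], b[2], b[0]))


def is_supported_order(order: List[Block]) -> bool:
    """Toy gravity check: every block above ground level (y > 0) must be
    placed after the block directly beneath it."""
    placed = set()
    for x, y, z in order:
        if y > 0 and (x, y - 1, z) not in placed:
            return False
        placed.add((x, y, z))
    return True


# A three-block tower listed in an arbitrary (top-first) order.
tower = [(0, 2, 0), (0, 0, 0), (0, 1, 0)]
bottom_up = raster_scan_order(tower)
```

Here the top-first listing fails the support check while the raster-scan ordering passes it. A fixed sweep of this kind respects the toy constraint but carries no information about how people actually build, which is the gap an order-aware model trained on human sequences is meant to address.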