We introduce a deep reinforcement learning method that learns to control articulated humanoid bodies so that they closely imitate given target motions when simulated in a physics simulator. The target motion, which may not have been seen by the agent and can be noisy, is supplied at runtime. Our method can recover balance from moderate external disturbances and keep imitating the target motion. When subjected to large disturbances that cause the humanoid to fall, our method can control the character to get up and resume tracking the motion. Our method is trained to imitate mocap clips from the CMU motion capture database and a number of other publicly available databases. We use a state-of-the-art deep reinforcement learning algorithm to learn to dynamically control the gains of PD controllers, whose target angles are derived from the mocap clip, and to apply corrective torques, with the goal of imitating the provided motion clip as closely as possible. Both the simulation and the learning algorithms are parallelized and run on the GPU. We demonstrate that the proposed method can control the character to imitate a wide variety of motions such as running, walking, dancing, jumping, kicking, punching, and standing up.
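To make the control scheme concrete, the minimal sketch below shows how a per-joint torque could combine a PD term toward a mocap-derived target angle, a policy-scaled gain, and a learned corrective torque. The function name, gain values, and damping rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def joint_torque(q, qdot, q_target, kp_scale, tau_corrective,
                 kp_base=300.0, kd_base=30.0):
    """Illustrative per-joint torque: a PD term toward the mocap target angle,
    with the proportional gain modulated by the policy output, plus a learned
    corrective torque. All constants are placeholders (assumptions)."""
    kp = kp_base * kp_scale                       # policy dynamically scales stiffness
    kd = kd_base * np.sqrt(max(kp_scale, 1e-6))   # keep damping roughly consistent (assumption)
    return kp * (q_target - q) - kd * qdot + tau_corrective
```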
Simulating believable virtual crowds has been an important research topic in many fields, including film, computer games, urban engineering, and behavioral science. One of the key capabilities agents should have is navigation: reaching goals without colliding with other agents or obstacles. The key challenge is that the environment changes dynamically, as the current decision of an agent can largely affect the future state of the other agents as well as its own. Recently, reinforcement learning with deep neural networks has shown remarkable results in sequential decision-making problems. With the power of convolutional neural networks, elaborate control from visual sensory inputs has also become possible. In this paper, we present an agent-based deep reinforcement learning approach to navigation, in which a simple reward function alone enables agents to navigate various complex scenarios. Our method does so with a single unified policy for every scenario, making scenario-specific parameter tuning unnecessary. We show the effectiveness of our method in a variety of scenarios and settings.
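As an illustration of the kind of simple reward function the abstract refers to, the sketch below rewards progress toward the goal and penalizes collisions and elapsed time; the terms and weights are placeholders, not the paper's actual reward.

```python
import numpy as np

def navigation_reward(pos, prev_pos, goal, collided,
                      w_progress=1.0, w_collision=2.5, w_time=0.01):
    """Hypothetical navigation reward: positive for moving closer to the goal,
    a penalty for collisions, and a small per-step time penalty."""
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    reward = w_progress * progress - w_time
    if collided:
        reward -= w_collision
    return reward
```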
Juggling is a physical skill that consists in keeping one or several objects in continuous motion in the air by tossing and catching them. Jugglers need great dexterity to control their throws and catches, which require speed, accuracy, and synchronization. The more balls are juggled, the stronger these qualities must be to achieve the performance. This complex skill is a good challenge for realistic physics-based simulation, which could help jugglers evaluate the feasibility of their tricks. Such a simulation has to understand the different notations used in juggling and apply the mathematical theory of juggling to reproduce them. In this paper, we present a deep reinforcement learning method for both the catching and the throwing tasks, and we combine them to recreate the whole juggling process. Our character is able to react accurately and with enough speed and power to juggle up to 7 balls, even with external forces applied to it.
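The mathematical theory of juggling mentioned here is commonly expressed with siteswap notation; the short sketch below checks the validity of a vanilla siteswap pattern and derives its ball count (a standard result, not code from the paper).

```python
def siteswap_ball_count(pattern: str) -> int:
    """Validate a vanilla siteswap pattern and return its ball count.
    For a valid pattern, the average throw value equals the number of balls,
    e.g. '531' and '441' both describe 3-ball tricks."""
    throws = [int(c) for c in pattern]
    n = len(throws)
    # A pattern is valid iff all landing beats (i + throw) mod n are distinct.
    if len({(i + t) % n for i, t in enumerate(throws)}) != n:
        raise ValueError(f"{pattern!r} is not a valid siteswap")
    return sum(throws) // n
```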
While the topic of virtual cinematography has essentially focused on the problem of computing the best viewpoint in a virtual environment given a number of objects placed beforehand, the question of how to place the objects in the environment in relation to the camera (referred to as staging in the film industry) has received little attention. This paper first proposes a staging language for both characters and cameras that extends existing cinematography languages with multiple cameras and character staging. Second, the paper proposes techniques to operationalize and solve staging specifications in a 3D virtual environment. The novelty lies in exploring how to position the characters and the cameras simultaneously while maintaining a number of spatial relationships specific to cinematography. We demonstrate the relevance of our approach through a number of simple and complex examples.
Video games enable the representation and control of characters that can move agilely through virtual environments. However, the detached character interaction they propose, often through a push-button metaphor, is far from the satisfying feeling of grasping and moving physical toys. In this paper, we propose a new interaction metaphor that reduces the gap between physical toys and virtual characters. The user moves a smartphone around, and a puppet that responds in real time to the manipulations is seen through the screen. The virtual character moves to follow the user's gestures, as if it were attached to the phone via a rigid stick. This yields a natural interaction, similar to moving a physical toy, and the puppet feels alive because its movements are augmented with compelling animations. Using the smartphone, our method ties the control of the character and the camera together into a single interaction mechanism. We validate our system by presenting an application in Augmented Reality.
Hands deserve particular attention in virtual reality (VR) applications because they represent our primary means of interacting with the environment. Although marker-based motion capture with inverse kinematics works adequately for full-body tracking, it is less reliable for small body parts such as hands and fingers, which are often occluded when captured optically, leading VR professionals to rely on additional systems (e.g., inertial trackers). We present a machine learning pipeline to track hands and fingers using solely a motion capture system based on cameras and active markers. Our finger animation is performed by a predictive model based on neural networks trained on a movement dataset acquired from several subjects with a complementary capture system. We employ a two-stage pipeline, which first resolves occlusions and then recovers all joint transformations. We show that our method compares favorably to inverse kinematics by automatically inferring the constraints from the data, provides a natural reconstruction of postures, and handles occlusions better than three proposed baselines.
Retargeting motion from one character to another is a key process in computer animation. It enables reusing animations designed for one character to animate another, or making performance-driven animation faithful to what the user has performed. Previous work has mainly focused on retargeting skeleton animations, whereas the contextual meaning of a motion is mainly linked to the relationship between body surfaces, such as the contact of the palm with the belly. In this paper we propose a new context-aware motion retargeting framework, based on deforming a target character to mimic the source character's poses using harmonic mapping. We also introduce the idea of a Context Graph, which models local interactions between surfaces of the source character that should be preserved in the target character in order to ensure fidelity of the pose. In this approach, no rigging is required, as we directly manipulate the surfaces, which makes the process fully automatic. Our results demonstrate the relevance of this automatic, rigging-free approach on motions with complex contacts and interactions between the character's surfaces.
Time-of-flight point cloud acquisition systems have grown in precision and robustness over the past few years. However, even subtle motion can induce significant distortions due to the long acquisition time. In contrast, there exist sensors that produce depth maps at a higher frame rate, but they suffer from low resolution and accuracy. In this paper, we correct distortions produced by small motions in time-of-flight acquisitions, and even output a corrected animated sequence, by combining a slow but high-resolution time-of-flight LiDAR system with a fast but low-resolution consumer depth sensor. We cast the problem as a curve-to-volume registration, viewing the LiDAR point cloud as a curve in 4-dimensional spacetime and the captured low-resolution depth video as a 4-dimensional spacetime volume. Our approach starts by registering both captured sequences in 4D in a coarse-to-fine manner. It then computes an optical flow between the low-resolution frames and finally transfers high-resolution details by advecting along the flow. We demonstrate the efficiency of our approach on both synthetic data, on which we can compute registration errors, and real data.
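The final detail-transfer step can be pictured as warping a high-resolution detail map along the estimated flow; the sketch below is a generic backward-warping routine under that reading, with the flow sign convention and sampling choices as assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def advect_details(detail_map, flow):
    """Warp a high-resolution detail map (e.g. per-pixel depth offsets) along a
    dense optical flow field estimated between low-resolution frames.
    flow[..., 0] is the x-displacement, flow[..., 1] the y-displacement."""
    h, w = detail_map.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Backward warping: sample the detail map at the flow's source locations.
    coords = np.stack([ys - flow[..., 1], xs - flow[..., 0]])
    return map_coordinates(detail_map, coords, order=1, mode='nearest')
```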
Stop motion animation is the traditional craft of giving life to handmade models. The unique look and feel of this art form is hard to reproduce with 3D computer-generated techniques, owing to the unexpected details that appear from frame to frame and to the sometimes choppy appearance of the character movement. The artist's task can be overwhelming, as they have to reshape a character into hundreds of poses to obtain just a few seconds of animation. The results are usually applied in 2D media such as films or platform games, so character features that took a lot of effort to create remain unseen. We propose a novel system that allows the creation of 3D stop-motion-like animations from 3D character shapes reconstructed from multi-view images. Given two or more reconstructed shapes from key frames, our method uses a combination of non-rigid registration and as-rigid-as-possible interpolation to generate plausible in-between shapes. This significantly reduces the artist's workload, since far fewer poses are required. The reconstructed and interpolated shapes, with complete 3D geometry, can be manipulated further through deformation techniques. The resulting shapes can then be used as animated characters in games or fused with 2D animation frames for enhanced stop motion films.
We explore the potential of learned autocompletion methods for synthesizing animated motions from input keyframes. Our model uses an autoregressive two-layer recurrent neural network that is conditioned on target keyframes. The model is trained on the motion characteristics of example motions and keyframes sampled from those motions. Given a set of desired keyframes, the trained model is then capable of generating motion sequences that interpolate the keyframes while following the style of the examples observed in the training corpus. We demonstrate our method on a hopping lamp, using a diverse set of hops from a physics-based model as training data. The model can then synthesize new hops based on a diverse range of keyframes. We discuss the strengths and weaknesses of this type of approach in some detail.
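A minimal sketch of what an autoregressive, keyframe-conditioned two-layer recurrent model could look like is given below (in PyTorch); the input features, layer sizes, and delta-pose output are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class KeyframeConditionedRNN(nn.Module):
    """Two-layer LSTM that, at each step, consumes the previous pose, the next
    target keyframe, and the time remaining until that keyframe, and predicts
    a pose update. Purely illustrative."""
    def __init__(self, pose_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(2 * pose_dim + 1, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, prev_pose, target_key, time_to_key, state=None):
        # prev_pose, target_key: (batch, pose_dim); time_to_key: (batch, 1)
        x = torch.cat([prev_pose, target_key, time_to_key], dim=-1).unsqueeze(1)
        h, state = self.rnn(x, state)
        return prev_pose + self.out(h.squeeze(1)), state  # autoregressive delta
```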
Learning a couple dance such as salsa is challenging, as it requires correctly assimilating and understanding all the required parameters. In this paper, we propose a set of music-related motion features (MMF) that allow describing, analysing, and classifying salsa dancing couples according to their learning stage (beginner, intermediate, and expert). These dance qualities were derived from a systematic review of the literature, cross-linked with interviews of teachers and professionals in the field of social dance. We investigated how to extract these MMF from musical data and the 3D movements of dancers, and propose a new algorithm to compute them. For the presented study, we recorded a motion capture database (SALSA) of 26 different couples with varying skill levels dancing at 10 different tempos (260 clips). Each recorded clip contains a basic-step sequence and an extended improvisation sequence, two minutes in total, captured at 120 frames per second. We finally use the proposed algorithm to analyse and classify these 26 couples into three learning levels, which validates some of the proposed music-related motion features and gives insights into others.
When designing environments in immersive virtual reality, virtual humans are often used to enrich them. In this paper, we investigate factors arising from the use of virtual crowds that may encourage more user participation in virtual reality scenarios. In particular, we examine whether implementing responsive virtual crowd behaviors toward the participant provides cues that increase the feeling of presence. The second factor we investigate is whether assigning appearance characteristics to a virtual crowd enables users not only to identify as socially related to the virtual characters, but also to behave as such in a virtual environment, a factor we refer to as Group membership. We present an experiment in a Virtual Environment (VE) populated with a virtual crowd and featuring a violent incident in which the user could intervene, aimed at determining how these factors contribute to the enhancement of plausibility and the feeling of presence. Our results show that in IVR, virtual crowds with Responsive behavior can increase the feeling of Presence, since the user tends to intervene more when the virtual crowd is responsive toward them and when the user is socially related to the incident's victim.
We present a computational spatial analytics tool for designing environments that better support human-related factors. Our system performs both static and dynamic analyses: the first relates to the building geometry and organization, while the second additionally considers crowd movement in the space. The results are presented to designers in the form of numerical values, traces, and heat maps displayed on top of the floor plan. We demonstrate our approach with a user study in which novice architects tested the proposed approach to iteratively improve a building's accessibility in real time with respect to a selected number of static and dynamic metrics. The results indicate that the users were able to successfully improve their design solutions and thus generate more human-aware environments. The usability and effectiveness of the tool were also measured, yielding positive scores. The modular and flexible nature of the tool enables further extension to incorporate additional static and dynamic spatial metrics.
The development of autonomous agents for wayfinding tasks has long relied on naive, omniscient models of navigation. The simplicity of these models improves the scalability of crowd simulations, but limits the utility of such simulations to the visualization of general behaviors. This restricted scope does not allow for the observation of more nuanced, individualized behaviors. In this paper, we demonstrate a novel framework for agent simulations that does not rely on omniscience. Instead, each agent is equipped with a memory architecture that enables wayfinding by maintaining a cognitive map of the space the agent has explored. Based on findings from simulation studies, cognitive science, and psychology, we describe a wayfinding procedure that simulates human behavior and human cognitive processes, incorporating landmark navigation, path integration, and memory. This cognitive approach makes observations of agent behavior more comparable to those of human behavior.
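The following toy data structure illustrates the non-omniscient idea: the agent can only plan over places it has actually observed and must explore otherwise. The class and its methods are illustrative assumptions, not the paper's memory architecture.

```python
from collections import deque

class CognitiveMap:
    """Toy cognitive map: remembered places and connections only."""
    def __init__(self):
        self.edges = {}  # place -> set of neighboring places seen so far

    def observe(self, place, neighbors):
        """Record a visited place and the neighboring places perceived from it."""
        self.edges.setdefault(place, set()).update(neighbors)
        for n in neighbors:
            self.edges.setdefault(n, set()).add(place)

    def plan(self, start, goal):
        """Breadth-first search restricted to remembered places;
        returns None if the goal is not yet in memory (exploration needed)."""
        frontier, parents = deque([start]), {start: None}
        while frontier:
            p = frontier.popleft()
            if p == goal:
                path = []
                while p is not None:
                    path.append(p)
                    p = parents[p]
                return path[::-1]
            for n in self.edges.get(p, ()):
                if n not in parents:
                    parents[n] = p
                    frontier.append(n)
        return None
```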
This paper presents a virtual reality experiment in which two participants share both the virtual and the physical space while performing a collaborative task. We are interested in studying the differences in human locomotor behavior between the real world and the VR scenario. For that purpose, participants performed the experiment in both the real and the virtual scenarios. In the VR case, participants could see both their own animated avatar and the avatar of the other participant in the environment. As they moved, we stored their trajectories to obtain information on speeds, clearance distances, and task completion times. For the VR scenario, we also wanted to evaluate whether the users were aware of subtle differences in the avatars' animations and footstep sounds. We ran the same experiment under three conditions: (1) synchronizing the avatar's feet animation and footstep sounds with the movement of the participant; (2) synchronizing the animation but not the sound; and (3) synchronizing neither. The results show significant differences in the users' presence questionnaires and also different trends in their locomotor behavior between the real-world and the VR scenarios. However, the subtle differences in animation and sound tested in our experiment had no impact on the results of the presence questionnaires, although they had a small impact on locomotor behavior in terms of task completion times and clearance distances kept while crossing paths.
We present a multi-segment foot model and its control method for the simulation of realistic bipedal behaviors. The ground reaction force is the only source of control for a biped that stands and walks on its feet. The foot is the body part that interacts with the ground and transmits the appropriate actuation to the body. Foot anatomy features 26 bones and many more muscles that play an important role in weight transmission, balancing posture, and assisting ambulation. Previously, the foot model was often simplified to one or two rigid bodies connected by a revolute joint. We propose a new foot model consisting of multiple segments to accurately reproduce the human foot's shape and functionality. Based on the new model, we developed a foot pose controller that can reproduce foot postures that are generally not available from motion capture data. We demonstrate the validity of our foot model and the effectiveness of our foot controller with a variety of foot motions in a physics-based simulation.
Within the manufacturing industry, digital modelling activities, and the simulation of human motion in particular, have emerged over the last decades. For the use case of walk path planning, however, recent path planning approaches reveal drawbacks in terms of realism and naturalness of motion. Moreover, the generation of variant-rich travel routes by modeling the statistical nature of human motion has not yet been explored. In order to contribute to a better prediction quality of planning models, this paper therefore presents an approach for realistically simulating the walk paths of single subjects. To take the variability of human locomotion into consideration, a statistical model describing human motion in a two-dimensional bird's-eye view is presented. This model is generated from a comprehensive database (20,000 steps) of captured human motion and covers a wide range of gait variants. To obtain short, collision-free trajectories, this approach is combined with a path planning algorithm. The utilized hybrid A* path planner can be regarded as an orchestration instance, stitching together successive left and right steps drawn from the statistical motion model. Although initially designed for industrial purposes, this method can be applied to a wide range of use cases beyond automotive walk path planning. To underline the benefits of the proposed approach, the motion planner's technical performance is demonstrated in an evaluation.
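To illustrate how a bird's-eye statistical step model can be chained into a walk path, the sketch below alternates left and right steps whose lengths and heading changes are drawn from simple Gaussians; the distribution parameters are placeholders, not values fitted to the paper's 20,000-step database, and the hybrid A* orchestration is omitted.

```python
import numpy as np

def sample_walk_path(n_steps: int, seed: int = 0) -> np.ndarray:
    """Chain alternating left/right steps sampled from placeholder Gaussian
    distributions into a 2D bird's-eye trajectory."""
    rng = np.random.default_rng(seed)
    pos, heading = np.zeros(2), 0.0
    path = [pos.copy()]
    for i in range(n_steps):
        length = rng.normal(0.7, 0.05)                        # step length in meters (assumed)
        turn = rng.normal(0.0, np.deg2rad(4.0))               # random heading change
        turn += np.deg2rad(1.0) * (1 if i % 2 == 0 else -1)   # slight left/right asymmetry
        heading += turn
        pos = pos + length * np.array([np.cos(heading), np.sin(heading)])
        path.append(pos.copy())
    return np.array(path)
```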
This paper proposes a real-time approach to animating a character walking in different environments in a video game setting. The technique combines a physics-based model with procedural animation and motion editing. Highlighting environmental interaction in fluids, the paper leverages simplified drag forces to drive realistic changes to existing locomotion data. To demonstrate the generalizability of the approach, we also generate forces from impulses arising from interactions in the environment.
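The simplified drag referred to above is typically the standard quadratic drag law; a minimal sketch of applying it to a limb segment moving through a fluid is given below, with all parameter values as placeholders.

```python
import numpy as np

def drag_force(velocity, fluid_density=1000.0, drag_coeff=1.0, area=0.05):
    """Quadratic drag, F = -0.5 * rho * Cd * A * |v| * v, opposing the segment's
    velocity relative to the fluid. Density here is water; Cd and area are
    illustrative values for a limb segment."""
    speed = np.linalg.norm(velocity)
    return -0.5 * fluid_density * drag_coeff * area * speed * velocity
```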
Recently, significant technological developments in head-mounted displays and tracking systems have boosted the widespread use of immersive virtual reality in the manufacturing industry. Regardless of the respective use case, however, methods to ensure the validity of the human motion performed in such virtual environments remain largely unaddressed. In the context of human locomotion, previous work presents models quantifying behavioral differences between virtual reality and the real world. However, those findings have not been used in control experiments to counterbalance VR-induced performance modulations post hoc. Consequently, the prediction quality of previous analyses is not known. This paper bridges this gap by testing such a behavior model in the context of an independent experiment (n = 10). The evaluation shows that a model derived from the literature can indeed be used to post-hoc correct temporal disparities between locomotion in the real world and in virtual reality.
Virtual characters have been employed for many purposes, including interacting with players of serious games with the aim of increasing engagement. These characters are often embodied conversational agents playing diverse roles, such as demonstrators, guides, teachers, or interviewers. Recently, much research has been conducted into properties that affect the realism and plausibility of virtual characters, but it is less clear whether the inclusion of interactive agents in serious applications can enhance a user's engagement with the application, or indeed increase its efficacy. As a first step towards answering these questions, we conducted a study in which a Virtual Learning Environment was used to examine the effect of employing a virtual character to deliver a lesson. To investigate whether increased familiarity between the player and the character would help achieve learning outcomes, we allowed participants to customize the physical appearance of the character. We used direct and indirect measures to assess engagement and learning: we measured knowledge retention via a test at the end of the lesson to ascertain learning, and also measured participants' perceived engagement with the lesson. Our findings show that a virtual character can be an effective learning aid, leading to heightened engagement and knowledge retention. However, allowing participants to customize the character's appearance inhibited engagement, contrary to our expectations.