In this paper we propose a method for estimating geometry, lighting and albedo from a single image of an uncontrolled outdoor scene. To do so, we combine state-of-the-art deep learning based methods for single image depth estimation and inverse rendering. The depth estimate provides coarse geometry that is refined using the inverse rendered surface normal estimates. Combined with the inverse rendered albedo map, this provides a model that can be used for novel view synthesis with both viewpoint and lighting changes. We show that, on uncontrolled outdoor images, our approach yields geometry that is qualitatively superior to that of the depth estimation network alone and that the resulting models can be re-illuminated without artefacts.
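As a rough illustration of how normal estimates can refine a coarse depth map (an assumed formulation, not necessarily the authors' exact one), the normals can be converted to target depth gradients and the refined depth obtained as a screened Poisson solution. The NumPy sketch below uses a simple Jacobi iteration and periodic boundaries for brevity; the function name, the weight lam and the gradient convention are our own choices.

```python
import numpy as np

def refine_depth_with_normals(coarse_depth, normals, iters=500, lam=0.1):
    # Target depth gradients implied by the normals (assuming n = (nx, ny, nz)
    # with nz pointing towards the camera): z_x = -nx/nz, z_y = -ny/nz.
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    nz = np.clip(nz, 1e-3, None)                 # avoid division by zero
    p, q = -nx / nz, -ny / nz

    # Divergence of the target gradient field (backward differences).
    div = (p - np.roll(p, 1, axis=1)) + (q - np.roll(q, 1, axis=0))

    # Jacobi iterations on the screened Poisson equation
    #   laplacian(z) - lam * z = div - lam * coarse_depth,
    # with periodic boundaries (np.roll) for brevity.
    z = coarse_depth.copy()
    for _ in range(iters):
        neighbours = (np.roll(z, 1, 0) + np.roll(z, -1, 0) +
                      np.roll(z, 1, 1) + np.roll(z, -1, 1))
        z = (neighbours - div + lam * coarse_depth) / (4.0 + lam)
    return z
```

The result keeps the coarse depth at low frequencies while adopting the high-frequency detail implied by the inverse-rendered normals.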
Style transfer is one of the most creative applications of deep learning, recreating images in the style of a work of art. However, most current deep networks concentrate on style and neglect low-level components of content such as edges and shapes, which define the objects in the stylized image. In this study, we present a scheme that uses skip connections and takes semantic information as an additional input, which preserves these low-level components in great detail. To understand the contribution of the added components, in addition to an ablation study, we constrain the hyper-parameters of all skip connections to analyse how each layer influences the stylized image. Our models are trained on COCO-Stuff images with their semantic maps and tested without them. We also compare our work to previous methods. Our method outperforms them in retaining recognisable content details while applying a strong style, thanks to the skip connections and, in particular, the semantic information.
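As a minimal illustration of weighted skip connections with a semantic-map input (an assumed PyTorch sketch, not the authors' architecture; the class name SkipStyleNet and the per-skip weights alphas are ours):

```python
import torch
import torch.nn as nn

class SkipStyleNet(nn.Module):
    """Toy encoder-decoder stylization network: each skip connection is scaled
    by a constrained weight alpha_i so the influence of each layer on the
    stylized image can be probed; the semantic map is concatenated to the RGB
    input as an extra channel."""

    def __init__(self, alphas=(0.5, 0.5)):
        super().__init__()
        self.alphas = alphas
        self.enc1 = nn.Sequential(nn.Conv2d(4, 32, 3, 2, 1), nn.ReLU())   # RGB + semantic map
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(32, 3, 4, 2, 1)

    def forward(self, rgb, semantic):
        x = torch.cat([rgb, semantic], dim=1)      # semantic map as an extra channel
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3) + self.alphas[1] * e2   # weighted skip, encoder level 2
        d2 = self.dec2(d3) + self.alphas[0] * e1   # weighted skip, encoder level 1
        return torch.tanh(self.dec1(d2))
```

Setting individual alphas to zero, as in an ablation, shows how much low-level detail each skip contributes.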
In computer graphics and virtual environment development, a large portion of time is spent creating assets, one of these being the terrain, which usually forms the basis of many large graphical worlds. The texturing of height maps is usually performed as a post-processing step, with software requiring access to the height and gradient of the terrain in order to generate a set of conditions for colouring slopes, flats, mountains and so on. Further additions, such as biomes, specify the predominant texturing a region should exhibit, such as grass, snow or dirt, much like the real world. Combined with a height-map generation algorithm, these methods can create impressive terrain renders that look visually stunning, yet they can appear somewhat repetitive. Previous work has explored variants of Generative Adversarial Networks for learning elevation data from real-world height data sets. In this paper, a method is proposed for learning not only the height-map values but also the corresponding satellite image of a specific region. This data is used to train a non-spatially-dependent generative adversarial network, which can produce an endless number of variants of a specific region. The textured outputs are measured using existing similarity metrics and compared to the original region, yielding strong results. Additionally, a visual and statistical comparison with other deep learning image synthesis techniques is performed. The network outputs are also rendered in a 3D graphics engine and visualised in the paper. The method produces convincing outputs when compared directly with the training region, providing a tool that can generate many different variants of the target terrain, ideally suited to a developer wanting a large number of terrains with a specific structure.
We present a method for augmenting hand-drawn characters and creatures with global illumination effects. Given only a single-view drawing, we use a novel CNN to predict a high-quality normal map at the same resolution. The predicted normals are then used as a guide to inflate the drawing into a 3D proxy mesh that is visually consistent with the input and suitable for augmenting the 2D art with convincing global illumination effects while keeping the hand-drawn look and feel. Along with this paper, a new high-resolution dataset of line drawings with corresponding ground-truth normal and depth maps will be shared. We validate our CNN by comparing our neural predictions qualitatively and quantitatively with the recent state of the art, show results for various hand-drawn images and animations, and compare with alternative modeling approaches.
The lack of information in line art makes user-guided colorization a challenging task for computer vision. Recent contributions from the deep learning community based on Generative Adversarial Networks (GANs) have shown impressive results compared to previous techniques. These methods use user-provided color hints to condition the network. The current state of the art can generalize and produce realistic, precise colorizations thanks to a custom dataset and a new model with its training pipeline. Nevertheless, that approach relies on randomly sampled pixels as color hints during training. In this contribution, we therefore introduce a stroke-simulation-based approach for hint generation, making the model more robust to messy inputs. We also propose a new, cleaner dataset, and explore the use of a double-generator GAN to improve visual fidelity.
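A hypothetical sketch of what stroke-simulated hint sampling could look like (the function, its parameters and the random-walk model are ours, not the paper's simulator): instead of isolated random pixels, each hint is a short, slightly curved stroke whose pixels copy the ground-truth colour.

```python
import numpy as np

def simulate_stroke_hints(color_image, n_strokes=10, max_len=12, thickness=2):
    """Return a hint image and a binary mask of hinted pixels, where each hint
    is a short random-walk stroke stamped with a single colour."""
    h, w, _ = color_image.shape
    hints = np.zeros_like(color_image)
    mask = np.zeros((h, w), dtype=bool)
    rng = np.random.default_rng()
    for _ in range(n_strokes):
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        colour = color_image[y, x]                    # one colour per stroke
        direction = rng.uniform(0, 2 * np.pi)
        for _ in range(int(rng.integers(3, max_len))):
            direction += rng.normal(0, 0.3)           # slight curvature
            y = int(np.clip(y + np.sin(direction), 0, h - 1))
            x = int(np.clip(x + np.cos(direction), 0, w - 1))
            y0, y1 = max(0, y - thickness // 2), min(h, y + thickness // 2 + 1)
            x0, x1 = max(0, x - thickness // 2), min(w, x + thickness // 2 + 1)
            hints[y0:y1, x0:x1] = colour
            mask[y0:y1, x0:x1] = True
    return hints, mask
```

Training with such strokes rather than single pixels is what makes the conditioning closer to real, messy user input.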
Many 2D cel animations and 3DCG cel-style animations are still being actively produced in Japan. Cel animation has its own perspective, which differs from that of a general 3D scene. In this paper, we focus on the perspective generated by the compositing process and reproduce it by moving the vanishing point and adjusting the perspective strength. In the proposed method, we modify the projection matrix. This makes it possible to change the appearance of the projected 2D shape in real time without changing the 3D shape of the model. The user controls the matrix using three 3D points called the vanishing guidance point, virtual viewpoint, and model focus point. A matrix is generated for each model. In addition, to operate the proposed method with the current major shading algorithms, we provide modified shadow mapping and rim lighting methods. Furthermore, we consider effective hierarchical structures for using the proposed method in general computer graphics software. The results confirm that this method reproduces the features of cel animation. Moreover, they show that the processing speed meets the performance required for real-time rendering.
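To make the projection-matrix idea concrete, here is a hypothetical NumPy sketch of an OpenGL-style matrix with a movable vanishing point and adjustable perspective strength. The parameters vp_shift, strength and focus_dist are our own stand-ins for the paper's guidance points, and the depth row is left standard for brevity; this is not the paper's exact matrix.

```python
import numpy as np

def cel_projection(fov_y, aspect, near, far,
                   vp_shift=(0.0, 0.0), strength=1.0, focus_dist=5.0):
    """vp_shift skews the frustum so lines parallel to the view axis converge
    at a shifted point in normalized device coordinates; strength blends the
    perspective divide between full perspective (1.0) and a flatter projection
    (towards 0.0), keeping objects at focus_dist at a constant size."""
    f = 1.0 / np.tan(fov_y / 2.0)
    return np.array([
        [f / aspect, 0.0,  vp_shift[0], 0.0],
        [0.0,        f,    vp_shift[1], 0.0],
        [0.0,        0.0,  (far + near) / (near - far), 2.0 * far * near / (near - far)],
        # w = strength * (-z_eye) + (1 - strength) * focus_dist
        [0.0,        0.0, -strength, (1.0 - strength) * focus_dist],
    ])
```

Because only the projection matrix changes, the 3D model itself is untouched and the effect can be updated per frame.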
Deformation transfer (DT) transfers the action of a source object to a target object. In computer graphics, the geometry of a real-world object is captured by a geometric model; skeletons, triangular meshes, quad meshes, polygon meshes and hybrid meshes are examples of such models. In this paper, we propose a universal representation for describing the geometric model of an object. Using this representation, we pose the problem of DT as guided pose deformation followed by Poisson interpolation (PI). The proposed vector graph (VG) representation makes this framework applicable to both skeletons and hybrid meshes. In the first phase, the pose deformations between the reference pose and the other poses of the source sequence are computed. In the second phase, these deformations are transferred to the target reference pose. In the final phase, the temporal gradients of the reconstructed target sequence are refined using PI. The objective of this final phase is to adjust temporal properties, such as the motion trajectory, to match the source sequence. The planarity requirement on the faces of non-triangular meshes is added as a constraint term in the cost function for mesh reconstruction. Qualitative and quantitative comparisons of the proposed method with two state-of-the-art methods show the effectiveness of the proposed framework.
Many 3D reconstruction techniques are based on the assumption of prior knowledge of the object’s surface reflectance, which severely restricts the scope of scenes that can be reconstructed. In contrast, Helmholtz Stereopsis (HS) employs Helmholtz Reciprocity to compute the scene geometry regardless of its Bidirectional Reflectance Distribution Function (BRDF). Despite this advantage, most HS implementations to date have been limited to 2.5D reconstruction, with the few extensions to full 3D being generally limited to a local refinement due to the nature of the optimisers they rely on. In this paper, we propose a novel approach to full 3D HS based on Markov Random Field (MRF) optimisation. After defining a solution space that contains the surface of the object, the energy function to be minimised is computed based on the HS quality measure and a normal consistency term computed across neighbouring surface points. This new method offers several key advantages with respect to previous work: the optimisation is performed globally instead of locally; a more discriminative energy function is used, allowing for better and faster convergence; a novel visibility handling approach to take advantage of Helmholtz reciprocity is proposed; and surface integration is performed implicitly as part of the optimisation process, thereby avoiding the need for an additional step. The approach is evaluated on both synthetic and real scenes, with an analysis of the sensitivity to input noise performed in the synthetic case. Accurate results are obtained on both types of scenes. Further, experimental results indicate that the proposed approach significantly outperforms previous work in terms of geometric and normal accuracy.
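To make the energy structure concrete, here is a minimal sketch of such an MRF energy (illustrative only; the variable names and the cosine-based normal-consistency penalty are our assumptions, not the paper's exact terms). Each site chooses a candidate surface point; the unary cost is its Helmholtz-stereopsis quality measure and the pairwise cost penalises disagreement between neighbouring normals.

```python
import numpy as np

def mrf_energy(labels, unary, normals, neighbours, lam=1.0):
    """labels[i] selects a candidate surface point for site i, unary[i, l] is
    its HS quality cost, normals[i, l] its estimated normal, and neighbours
    lists index pairs of adjacent sites."""
    data_term = sum(unary[i, labels[i]] for i in range(len(labels)))
    smoothness = 0.0
    for i, j in neighbours:
        ni = normals[i, labels[i]]
        nj = normals[j, labels[j]]
        cos_angle = np.clip(np.dot(ni, nj), -1.0, 1.0)
        smoothness += 1.0 - cos_angle            # zero when normals agree
    return data_term + lam * smoothness
```

A global discrete optimiser (e.g. graph cuts or belief propagation) over this energy then yields the surface directly, which is what removes the need for a separate integration step.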
We present an algorithm for progressive mesh registration to provide temporal consistency, simplify temporal texture editing and optimize the data rate of 3D mesh sequences recorded in a volumetric capture studio.
The capture pipeline produces a sequence of manifold meshes with varying connectivity. We split the sequence into groups of frames that share connectivity and choose a keyframe in each group that is progressively deformed to approximate the surfaces of the adjacent meshes. Our algorithm uses a coarse-to-fine ICP approach, which makes it robust against large deformations in the scene while preserving small details. Applying a deformation graph constrains the transformations to be locally as-rigid-as-possible, while allowing the method to work with arbitrary natural objects in the scene, not just humans.
We show how to robustly track sequences of human actors with varying clothing over hundreds of frames recorded in a volumetric capture studio. We verify our results with a publicly available dataset of more than 40,000 frames. Our mesh registration takes less than five seconds per frame on a single desktop machine and has been successfully integrated into a volumetric capture pipeline for commercial use.
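The deformation-graph constraint above can be made concrete with an embedded-deformation-style energy (an assumed formulation in the spirit of as-rigid-as-possible registration, not necessarily the one used here; the names and weights are ours). Each graph node carries an affine matrix and a translation; the rotation term keeps the matrices close to rotations, the regularisation term makes neighbouring nodes predict each other consistently, and the fit term pulls nodes towards ICP correspondences.

```python
import numpy as np

def embedded_deformation_energy(nodes, rotations, translations, edges,
                                targets=None, w_rot=1.0, w_reg=10.0, w_fit=0.1):
    """nodes: (N, 3) node positions; rotations: list of 3x3 matrices;
    translations: (N, 3); edges: list of (j, k) index pairs;
    targets: optional {node index: target position} from ICP."""
    e_rot = 0.0
    for R in rotations:
        # deviation of R from orthonormality (i.e. from a rigid rotation)
        e_rot += np.sum((R.T @ R - np.eye(3)) ** 2)

    e_reg = 0.0
    for j, k in edges:
        pred = rotations[j] @ (nodes[k] - nodes[j]) + nodes[j] + translations[j]
        actual = nodes[k] + translations[k]
        e_reg += np.sum((pred - actual) ** 2)

    e_fit = 0.0
    if targets is not None:
        for j, target in targets.items():
            e_fit += np.sum((nodes[j] + translations[j] - target) ** 2)

    return w_rot * e_rot + w_reg * e_reg + w_fit * e_fit
```

Minimising this energy at each ICP level, from coarse to fine, is what keeps the keyframe deformation locally rigid while still following large scene motion.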
With the advent of 360° film narratives, traditional tools and techniques used for storytelling are being reconsidered. VR cinema, as a narrative medium, gives users the liberty to choose where to look and to change their point of view constantly. This freedom to frame the visual content themselves challenges storytellers to carefully guide users so as to convey a narrative effectively. Thus, researchers and filmmakers exploring VR cinema are evaluating new storytelling methods to create effective user experiences. In this paper, we present, through empirical analysis, the significance of perceptual cues in VR cinema and their impact on guiding the users’ attention to different plot points in the narrative. The study examines experiential fidelity using “Dragonfly”, a 360° film created using the existing guidelines for VR cinema. We posit that the insights derived will help better understand the evolving grammar of VR storytelling. We also present a set of additional guidelines for the effective planning of perceptual cues in VR cinema.
Previsualization is a process within the pre-production stage of filmmaking where filmmakers can visually plan specific scenes, including camera work, lighting and character movements. The costs of computer-graphics-based effects are substantial within film production; using previsualization, these scenes can be planned in detail to reduce the amount of effects work in the later production phases. We develop and assess a prototype for collaborative previsualization in virtual reality, where multiple filmmakers can be present in a shared virtual environment and work together creatively while physically apart. A within-group study with 20 filmmakers shows that the use of virtual reality for distributed, collaborative previsualization is useful for real-life pre-production purposes.
We describe the implementation of and early results from a system that automatically composes picture-synched musical soundtracks for videos and movies. We use the phrase picture-synched to mean that the structure of the automatically composed music is determined by visual events in the input movie, i.e. the final music is synchronised to visual events and features such as cut transitions or within-shot key-frame events. Our system combines automated video analysis and computer-generated music-composition techniques to create unique soundtracks in response to the video input, and can be thought of as an initial step in creating a computerised replacement for a human composer writing music to fit the picture-locked edit of a movie. Working only from the video information in the movie, key features are extracted from the input video, using video analysis techniques, which are then fed into a machine-learning-based music generation tool, to compose a piece of music from scratch. The resulting soundtrack is tied to video features, such as scene transition markers and scene-level energy values, and is unique to the input video. Although the system we describe here is only a preliminary proof-of-concept, user evaluations of the output of the system have been positive.
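As an illustration of the kind of video analysis involved, a simple cut-transition detector can threshold the histogram distance between consecutive frames (an assumed OpenCV sketch of one possible feature extractor, not the authors' pipeline; the threshold value is arbitrary).

```python
import cv2

def detect_cut_transitions(video_path, threshold=0.4):
    """Return a list of timestamps (seconds) where a cut transition is likely,
    based on the Bhattacharyya distance between colour histograms of
    consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cuts, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # near 0 for similar frames, near 1 across a hard cut
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                cuts.append(frame_idx / fps)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return cuts
```

Markers of this kind, together with scene-level energy values, are the sort of structural features that can be handed to the music generator.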
This paper introduces a new method for adding synthetic atmospheric effects, such as haze and fog, to real RGBD images. The given depth maps are used to compute per-pixel transmission and spatial frequency values, which determine the local contrast and blur, based on physical models of atmospheric absorption and scattering. A fast 2D inhomogeneous diffusion algorithm is developed, which is capable of computing and rendering the effects in real time. The necessary pre-processing methods, including sky identification and matting, are also explained. A GPU implementation is described, and evaluated on a range of RGBD data, including that from outdoor lidar and indoor structured light systems.
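For the transmission part of such a model, the standard absorption/scattering formulation gives a very compact sketch: t(x) = exp(-beta * d(x)) and I = J * t + A * (1 - t), where J is the clean image, d the depth and A the airlight colour. The parameter names are ours, and the blur/diffusion stage and sky matting are not reproduced here.

```python
import numpy as np

def add_haze(rgb, depth, beta=0.8, airlight=(0.9, 0.9, 0.95)):
    """rgb: float image in [0, 1], shape (H, W, 3); depth: metric depth (H, W).
    Returns the hazed image using per-pixel transmission t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)[..., None]          # per-pixel transmission
    A = np.asarray(airlight)[None, None, :]       # airlight colour
    return rgb * t + A * (1.0 - t)
```

Larger beta values simulate denser fog, with distant pixels converging towards the airlight colour.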