The Chimera is a mythological hybrid creature composed of different animal parts. A chimera's movements depend strongly on the spatial and temporal alignments of its constituent parts. In this paper, we present a novel algorithm that creates and animates chimeras by dynamically reassembling source characters and their movements. Our algorithm exploits a two-network architecture: a part assembler and a dynamic controller. The part assembler is a supervised learning layer that searches for the spatial alignment among body parts, assuming that the temporal alignment is provided. The dynamic controller is a reinforcement learning layer that learns a robust control policy for a wide variety of potential temporal alignments. These two layers are tightly intertwined and learned simultaneously. The chimera animation generated by our algorithm is energy-efficient and expressive in terms of describing weight shifting, balancing, and full-body coordination. We demonstrate the versatility of our algorithm by generating the motor skills of a large variety of chimeras from limited source characters.
In this paper, we introduce ControlVAE, a novel model-based framework for learning generative motion control policies based on variational autoencoders (VAE). Our framework can learn a rich and flexible latent representation of skills and a skill-conditioned generative control policy from a diverse set of unorganized motion sequences, which enables the generation of realistic human behaviors by sampling in the latent space and allows high-level control policies to reuse the learned skills to accomplish a variety of downstream tasks. In the training of ControlVAE, we employ a learnable world model to realize direct supervision of the latent space and the control policy. This world model effectively captures the unknown dynamics of the simulation system, enabling efficient model-based learning of high-level downstream tasks. We also learn a state-conditional prior distribution in the VAE-based generative control policy, which generates a skill embedding that outperforms the non-conditional priors in downstream tasks. We demonstrate the effectiveness of ControlVAE using a diverse set of tasks, which allows realistic and interactive control of the simulated characters.
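To make the sampling path described above concrete, here is a minimal, hedged sketch of how a state-conditional prior can produce a skill latent that then conditions the control policy. The network sizes and the names ConditionalPrior and SkillPolicy are illustrative assumptions, not the released ControlVAE code.

```python
# Hedged sketch: a state-conditional Gaussian prior over skill latents,
# followed by a skill-conditioned policy that outputs an action.
import torch
import torch.nn as nn

class ConditionalPrior(nn.Module):
    def __init__(self, state_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * latent_dim))

    def forward(self, state):
        mu, log_std = self.net(state).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * log_std.exp()   # reparameterized sample

class SkillPolicy(nn.Module):
    def __init__(self, state_dim=64, latent_dim=32, action_dim=36):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

prior, policy = ConditionalPrior(), SkillPolicy()
state = torch.randn(1, 64)          # current character state (dimensions assumed)
z = prior(state)                    # sample a skill from the learned conditional prior
action = policy(state, z)           # skill-conditioned control output
print(action.shape)
```

In the full method, the same latent space is also exposed to high-level task policies, which pick z instead of sampling it from the prior.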
We present a deep learning-based framework to synthesize motion in-betweening in a two-stage manner. Given some context frames and a target frame, the system can generate plausible transitions with variable lengths in a non-autoregressive fashion. The framework consists of two Transformer Encoder-based networks operating in two stages: in the first stage a Context Transformer is designed to generate rough transitions based on the context, and in the second stage a Detail Transformer is employed to refine motion details. Compared to existing Transformer-based methods, which either use a complete Transformer Encoder-Decoder architecture or additional 1D convolutions to generate motion transitions, our framework achieves superior performance with fewer trainable parameters by leveraging only the Transformer Encoder and the masked self-attention mechanism. To enhance the generalization of our Transformer-based framework, we further introduce Keyframe Positional Encoding and Learned Relative Positional Encoding to make our method robust in synthesizing longer transitions that exceed the maximum transition length seen during training. Our framework is also artist-friendly, supporting full and partial pose constraints within the transition and giving artists fine control over the synthesized results. We benchmark our framework on the LAFAN1 dataset, and experiments show that our method outperforms the current state-of-the-art methods by a large margin (an average of 16% for normal-length sequences and 55% for excessive-length sequences). Our method trains faster than the RNN-based method and achieves a four-fold speedup during inference. We implement our framework into a production-ready tool inside an animation authoring software and conduct a pilot study to validate the practical value of our method.
In this paper, we propose to compute Voronoi diagrams over mesh surfaces driven by an arbitrary geodesic distance solver, assuming that the input is a triangle mesh together with a collection of sites $P = \{p_i\}_{i=1}^{m}$ on the surface. We propose two key techniques to solve this problem. First, as the partition is determined by minimizing the $m$ distance fields, each rooted at a source site, we suggest keeping one or more distance triples per triangle that may help determine the Voronoi bisectors when a mark-and-sweep geodesic algorithm is used to predict the multi-source distance field. Second, rather than keeping the distance itself at a mesh vertex, we use the squared distance to characterize the linear change of the distance field restricted to a triangle, which is proved to induce an exact VD when the base surface reduces to a planar triangle mesh. In particular, our algorithm also supports the Euclidean distance, which can handle thin-sheet models (e.g., a leaf) and runs faster than the traditional restricted Voronoi diagram (RVD) algorithm. It is easily extended to various variants of surface-based Voronoi diagrams, including (1) the surface-based power diagram, (2) constrained Voronoi diagrams with curve-type breaklines, and (3) curve-type generators. We conduct extensive experiments to validate the ability to approximate the exact VD in different distance-driven scenarios.
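The squared-distance observation is easiest to see in the planar Euclidean case mentioned above: the difference of squared distances to two sites is a linear function over a triangle, so its sign changes along the edges pinpoint exactly where the bisector crosses. The sketch below illustrates only this planar, two-site special case; it is not the paper's mark-and-sweep pipeline.

```python
# Minimal sketch: exact Voronoi bisector segment inside one planar triangle,
# recovered from per-vertex squared Euclidean distances to two sites.
import numpy as np

def bisector_segment(tri, site_a, site_b):
    """tri: (3, 3) triangle vertices; site_a, site_b: (3,) site positions.
    Returns the clipped bisector segment, or None if it misses the triangle."""
    # d is linear over the triangle because the quadratic terms cancel.
    d = np.array([np.sum((v - site_a) ** 2) - np.sum((v - site_b) ** 2) for v in tri])
    pts = []
    for i in range(3):
        j = (i + 1) % 3
        if d[i] * d[j] < 0.0:                 # sign change along edge i -> j
            t = d[i] / (d[i] - d[j])          # linear interpolation of the zero crossing
            pts.append(tri[i] + t * (tri[j] - tri[i]))
    return (pts[0], pts[1]) if len(pts) == 2 else None

tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(bisector_segment(tri, np.array([0.1, 0.1, 0.0]), np.array([0.9, 0.1, 0.0])))
# -> the segment x = 0.5 clipped to the triangle, as expected for these sites
```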
We present SHRED, a method for 3D SHape REgion Decomposition. SHRED takes a 3D point cloud as input and uses learned local operations to produce a segmentation that approximates fine-grained part instances. We endow SHRED with three decomposition operations: splitting regions, fixing the boundaries between regions, and merging regions together. Modules are trained independently and locally, allowing SHRED to generate high-quality segmentations for categories not seen during training. We train and evaluate SHRED with fine-grained segmentations from PartNet; using its merge-threshold hyperparameter, we show that SHRED produces segmentations that better respect ground-truth annotations compared with baseline methods, at any desired decomposition granularity. Finally, we demonstrate that SHRED is useful for downstream applications, outperforming all baselines on zero-shot fine-grained part instance segmentation and few-shot fine-grained semantic segmentation when combined with methods that learn to label shape regions.
With the development of 3D applications, the point cloud, as a spatial description easily acquired by sensors, has been widely used in areas such as SLAM and 3D reconstruction. Point Cloud Compression (PCC) has also attracted growing attention as a primary step before point clouds are transferred and stored, and geometry compression is an important component of PCC that compresses the points' geometric structure. However, existing non-learning-based geometry compression methods are often limited by manually pre-defined compression rules. Although learning-based compression methods can significantly improve performance by learning compression rules from data, they still have some defects. Voxel-based compression networks introduce precision errors due to the voxelization operations, while point-based methods may have relatively weak robustness and are mainly designed for sparse point clouds. In this work, we propose a novel learning-based point cloud compression framework named 3D Point Cloud Geometry Quantization Compression Network (3QNet), which overcomes the robustness limitation of existing point-based methods and can handle dense points. By learning a codebook of common structural features from simple and sparse shapes, 3QNet can efficiently deal with many kinds of point clouds. Experiments on object models, indoor scenes, and outdoor scans show that 3QNet achieves better compression performance than many representative methods.
We propose a novel framework for computing the medial axis transform of 3D shapes while preserving their medial features via the restricted power diagram (RPD). Medial features, including external features such as the sharp edges and corners of the input mesh surface and internal features such as the seams and junctions of the medial axis, are important shape descriptors both topologically and geometrically. However, existing medial axis approximation methods fail to capture and preserve them due to the fundamental under-sampling in the vicinity of medial features and the difficulty of building their correct connections. In this paper we use the RPD of medial spheres and its affiliated structures to address these challenges. The dual structure of the RPD provides the connectivity of medial spheres. The surfacic restricted power cell (RPC) of each medial sphere provides the tangential surface regions that the sphere is in contact with. The connected components (CC) of the surfacic RPC classify each sphere as lying on a medial sheet, a seam, or a junction. They allow us to detect insufficient sphere sampling around medial features and to develop necessary conditions to preserve them. Using this RPD-based framework, we are able to construct high-quality medial meshes with features preserved. Compared with existing sampling-based or voxel-based methods, our method is the first that can preserve not only external features but also internal features of medial axes.
Traditional differentiable rendering approaches are usually hard to converge in inverse rendering optimizations, especially when the initial and target object locations are far apart. Inspired by Lagrangian fluid simulation, we present a novel differentiable rendering method to address this problem. We associate each screen-space pixel with the visible 3D geometric point covered by the center of the pixel and compute derivatives on geometric points rather than on pixels. We refer to the associated geometric points as point proxies of pixels. For each point proxy, we compute its 5D RGBXY derivatives, which measure how its 3D RGB color and 2D projected screen-space position change with respect to scene parameters. Furthermore, in order to capture global and long-range object motions, we utilize optimal transport based pixel matching to design a more sophisticated loss function. We have conducted experiments to evaluate the effectiveness of our proposed method on various inverse rendering applications and have demonstrated superior convergence behavior compared to state-of-the-art baselines.
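As a rough illustration of the matching side of such a loss (not the paper's exact formulation, and without the derivative machinery), the toy sketch below pairs rendered and target pixels by their 5D RGBXY features using an exact assignment solver on a small image; the summed matched cost then plays the role of the long-range loss.

```python
# Toy sketch, assumed setup: optimal-transport-style pixel matching in RGBXY space.
import numpy as np
from scipy.optimize import linear_sum_assignment

def rgbxy(img, xy_weight=1.0):
    """Stack RGB with normalized pixel coordinates -> (H*W, 5) features."""
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.stack([xs / w, ys / h], axis=-1) * xy_weight
    return np.concatenate([img, xy], axis=-1).reshape(-1, 5)

def ot_matching_loss(rendered, target, xy_weight=1.0):
    a, b = rgbxy(rendered, xy_weight), rgbxy(target, xy_weight)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # dense 5D cost
    rows, cols = linear_sum_assignment(cost)   # exact matching for uniform weights
    return cost[rows, cols].sum(), cols

rendered = np.random.rand(8, 8, 3)
target = np.roll(rendered, shift=3, axis=1)    # the "object" has moved 3 pixels
loss, matching = ot_matching_loss(rendered, target)
print(loss)
```

In the actual method, gradients would flow through the matched point proxies' RGBXY derivatives rather than through raw pixels.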
Cameras with a finite aperture diameter exhibit defocus for scene elements that are not at the focus distance, and have only a limited depth of field within which objects appear acceptably sharp. In this work we address the problem of applying inverse rendering techniques to input data that exhibits such defocus blurring. We present differentiable depth-of-field rendering techniques that are applicable to both rasterization-based methods using mesh representations, as well as ray-marching-based methods using either explicit [Yu et al. 2021] or implicit volumetric radiance fields [Mildenhall et al. 2020]. Our approach learns significantly sharper scene reconstructions on data containing blur due to depth of field, and recovers aperture and focus distance parameters that result in plausible forward-rendered images. We show applications to macro photography, where typical lens configurations result in a very narrow depth of field, and to multi-camera video capture, where maintaining sharp focus across a large capture volume for a moving subject is difficult.
Pixel reconstruction filters play an important role in physics-based rendering and have been thoroughly studied. In physics-based differentiable rendering, however, the proper treatment of pixel filters remains largely under-explored. We present a new technique to efficiently differentiate pixel reconstruction filters based on the path-space formulation. Specifically, we formulate the pixel boundary integral that models discontinuities in pixel filters and introduce new antithetic sampling methods that support differentiable path sampling methods, such as adjoint particle tracing and bidirectional path tracing. We demonstrate both the need and efficacy of antithetic sampling when estimating this integral, and we evaluate its effectiveness across several differentiable- and inverse-rendering settings.
We present an approach to decompose cartoon animation videos into a set of "sprites" --- the basic units of digital cartoons that depict the contents and transforms of each animated object. The sprites in real-world cartoons are unique: artists may draw arbitrary sprite animations for expressiveness, where the animated content is often complicated, irregular, and challenging; alternatively, artists may also reduce their workload by tweening and adjusting sprites, or even reuse static sprites, in which case the transformations are relatively regular and simple. Based on these observations, we propose a sprite decomposition framework using Pixel Multilayer Perceptrons (Pixel MLPs) in which the estimation of each sprite is conditioned on and guided by all other sprites. In this way, once the relatively regular and simple sprites are resolved, the decomposition of the remaining "challenging" sprites can be simplified and eased with the guidance of the other sprites. We call this method "sprite-from-sprite" cartoon decomposition. We study ablative architectures of our framework, and our user study demonstrates that our results are the most preferred in 19/20 cases.
Pixel art is a unique art style with the appearance of low-resolution images. In this paper, we propose a data-driven pixelization method that can produce sharp and crisp cell effects with controllable cell sizes. Our approach overcomes the limitation of existing learning-based methods in cell size control by introducing a reference pixel art to explicitly regularize the cell structure. In particular, the cell structure features of the reference pixel art are used as an auxiliary input for the pixelization process and for measuring the style similarity between the generated result and the reference pixel art. Furthermore, we disentangle the pixelization process into cell-aware and aliasing-aware stages, mitigating the ambiguities in the joint learning of cell size, aliasing effect, and color assignment. To train our model, we construct a dedicated pixel art dataset and augment it with different cell sizes and different degrees of anti-aliasing effects. Extensive experiments demonstrate its superior performance over state-of-the-art methods in terms of cell sharpness and perceptual expressiveness. We also show promising results of video game pixelization for the first time. Code and dataset are available at https://github.com/WuZongWei6/Pixelization.
StageMix is a mixed video that is created by concatenating the segments from various performance videos of an identical song in a visually smooth manner by matching the main subject's silhouette presented in the frame. We introduce PopStage, which allows users to generate a StageMix automatically. PopStage is designed based on the StageMix Editing Guideline that we established by interviewing creators as well as observing their workflows. PopStage consists of two main steps: finding an editing path and generating a transition effect at a transition point. Using a reward function that favors visual connection and the optimality of transition timing across the videos, we obtain the optimal path that maximizes the sum of rewards through dynamic programming. Given the optimal path, PopStage then aligns the silhouettes of the main subject from the transitioning video pair to enhance the visual connection at the transition point. The virtual camera view is next optimized to remove the black areas that are often created due to the transformation needed for silhouette alignment, while reducing pixel loss. In this process, we enforce the view to be the maximum size while maintaining the temporal continuity across the frames. Experimental results show that PopStage can generate a StageMix of a similar quality to those produced by professional creators in a highly reduced production time.
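The editing-path search lends itself to a compact dynamic program. The sketch below is a simplified, assumed reward model (state = which source video is shown at each frame, with a reward for each stay/switch decision), not PopStage's actual reward function, but it shows the maximize-sum-of-rewards recursion.

```python
# Minimal DP sketch for an editing path over V source videos and T frames.
import numpy as np

def optimal_edit_path(reward):
    """reward: (V, V, T); reward[u, v, t] scores showing video v at frame t
    after showing video u at frame t-1 (u == v means no cut).
    Returns the max-reward video index per frame."""
    V, _, T = reward.shape
    best = np.zeros((V, T))
    prev = np.zeros((V, T), dtype=int)
    for t in range(1, T):
        # cand[u, v]: best score ending in u at t-1, then showing v at t
        cand = best[:, t - 1][:, None] + reward[:, :, t]
        prev[:, t] = np.argmax(cand, axis=0)
        best[:, t] = np.max(cand, axis=0)
    path = [int(np.argmax(best[:, T - 1]))]
    for t in range(T - 1, 0, -1):       # backtrack from the last frame
        path.append(int(prev[path[-1], t]))
    return path[::-1]

rng = np.random.default_rng(0)
print(optimal_edit_path(rng.random((3, 3, 10))))
```

In PopStage, the per-transition reward additionally encodes silhouette similarity and transition-timing optimality, and the chosen path is followed by silhouette alignment and camera-view optimization.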
High-quality HDRIs (High Dynamic Range Images), typically HDR panoramas, are one of the most popular ways to create photorealistic lighting and 360-degree reflections of 3D scenes in graphics. Given the difficulty of capturing HDRIs, a versatile and controllable generative model is highly desired, where layman users can intuitively control the generation process. However, existing state-of-the-art methods still struggle to synthesize high-quality panoramas for complex scenes. In this work, we propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data. Given a free-form text as the description of the scene, we synthesize the corresponding HDRI with two dedicated steps: 1) text-driven panorama generation in low dynamic range (LDR) and low resolution (LR), and 2) super-resolution inverse tone mapping to scale up the LDR panorama both in resolution and dynamic range. Specifically, to achieve zero-shot text-driven panorama generation, we first build dual codebooks as the discrete representation for diverse environmental textures. Then, driven by the pre-trained Contrastive Language-Image Pre-training (CLIP) model, a text-conditioned global sampler learns to sample holistic semantics from the global codebook according to the input text. Furthermore, a structure-aware local sampler learns to synthesize LDR panoramas patch-by-patch, guided by holistic semantics. To achieve super-resolution inverse tone mapping, we derive a continuous representation of 360-degree imaging from the LDR panorama as a set of structured latent codes anchored to the sphere. This continuous representation enables a versatile module to upscale the resolution and dynamic range simultaneously. Extensive experiments demonstrate the superior capability of Text2Light in generating high-quality HDR panoramas. In addition, we show the feasibility of our work in realistic rendering and immersive VR.
We propose a novel multi-view camera pipeline for the reconstruction and registration of dynamic clothing. Our proposed method relies on a specifically designed pattern that allows for precise video tracking in each camera view. We triangulate the tracked points and register the cloth surface in a fine-grained geometric resolution and low localization error. Compared to state-of-the-art methods, our registration exhibits stable correspondence, tracking the same points on the deforming cloth surface along the temporal sequence. As an application, we demonstrate how the use of our registration pipeline greatly improves state-of-the-art pose-based drivable cloth models. Furthermore, we propose a novel model, Garment Avatar, for driving cloth from a dense tracking signal which is obtained from two opposing camera views. The method produces realistic reconstructions which are faithful to the actual geometry of the deforming cloth. In this setting, the user wears a garment with our custom pattern which enables our driving model to reconstruct the geometry. Our code and data are available at https://github.com/HalimiOshri/Pattern-Based-Cloth-Registration-and-Sparse-View-Animation. The released data includes our pattern and registered mesh sequences containing four different subjects and 15k frames in total.
We introduce the first learning-based reconstructability predictor to improve view and path planning for large-scale 3D urban scene acquisition using unmanned drones. In contrast to previous heuristic approaches, our method learns a model that explicitly predicts how well a 3D urban scene will be reconstructed from a set of viewpoints. To make such a model trainable and simultaneously applicable to drone path planning, we simulate the proxy-based 3D scene reconstruction during training to set up the prediction. Specifically, the neural network we design is trained to predict the scene reconstructability as a function of the proxy geometry, a set of viewpoints, and optionally a series of scene images acquired in flight. To reconstruct a new urban scene, we first build the 3D scene proxy, then rely on the reconstruction quality and uncertainty measures predicted by our network, based on the proxy geometry, to guide the drone path planning. We demonstrate that our data-driven reconstructability predictions are more closely correlated with the true reconstruction quality than prior heuristic measures. Further, our learned predictor can be easily integrated into existing path planners to yield improvements. Finally, we devise a new iterative view planning framework based on the learned reconstructability and show the superior performance of the new planner when reconstructing both synthetic and real scenes.
When conducting autonomous scanning for the online reconstruction of unknown indoor environments, robots have to be competent at both exploring the scene structure and reconstructing objects with high quality. Our key observation is that these tasks demand different scanning properties of the robots: rapid movement and far vision for global exploration, and slow movement and narrow vision for local object reconstruction, which we refer to as two scanning modes: explorer and reconstructor, respectively. When multiple robots are required to collaborate for efficient exploration and fine-grained reconstruction, the questions of when to generate and how to assign those tasks must be carefully answered. Therefore, we propose a novel asynchronous collaborative autoscanning method with mode switching, which generates two kinds of scanning tasks with associated scanning modes, i.e., exploration tasks with the explorer mode and reconstruction tasks with the reconstructor mode, and assigns them to the robots to execute in an asynchronous collaborative manner, greatly boosting scanning efficiency and reconstruction quality. The task assignment is optimized by solving a modified Multi-Depot Multiple Traveling Salesman Problem (MDMTSP). Moreover, to further enhance collaboration and increase efficiency, we propose a task-flow model that activates the task generation and assignment process immediately when any robot finishes all its tasks, with no need to wait for all other robots to complete the tasks assigned in the previous iteration. Extensive experiments show the importance of each key component of our method and its superiority over previous methods in scanning efficiency and reconstruction quality.
We present a spectral measurement approach for the bulk optical properties of translucent materials using only low-cost components. We focus on the translucent inks used in full-color 3D printing and develop a technique with high spectral resolution, which is important for accurate color reproduction. We enable this by developing a new acquisition technique for the three unknown material parameters, namely the absorption coefficient, the scattering coefficient, and the phase function anisotropy factor, that requires only three point measurements with a spectrometer. In essence, our technique is based on finding a three-dimensional appearance map, computed using Monte Carlo rendering, that allows the conversion between the three observables and the material parameters. Our measurement setup works without laboratory equipment or expensive optical components. We validate our results on a 3D printed color checker with various ink combinations. Our work paves a path toward more accurate appearance modeling and fabrication even for low-budget environments or affordable embedding into other devices.
We present a novel semantic model for the human head defined with a neural radiance field. The 3D-consistent head model consists of a set of disentangled and interpretable bases and can be driven by low-dimensional expression coefficients. Thanks to the powerful representation ability of neural radiance fields, the constructed model can represent complex facial attributes, including hair and worn accessories, which cannot be represented by traditional mesh blendshapes. To construct the personalized semantic facial model, we propose to define the bases as several multi-level voxel fields. With a short monocular RGB video as input, our method can construct the subject's semantic facial NeRF model in only ten to twenty minutes and can render a photorealistic human head image in tens of milliseconds given an expression coefficient and view direction. With this novel representation, we apply it to tasks such as facial retargeting and expression editing. Experimental results demonstrate its strong representation ability and training/inference speed. Demo videos and released code are provided on our project page: https://ustc3dv.github.io/NeRFBlendShape/
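The blendshape-like driving boils down to a coefficient-weighted combination of learned voxel bases followed by interpolation at query points. The sketch below is a hedged illustration of that step only (grid sizes assumed, single level, MLP decoder omitted), not the released implementation.

```python
# Hedged sketch: blend voxel-grid bases by expression coefficients, then
# trilinearly sample the blended field at query points.
import numpy as np
from scipy.ndimage import map_coordinates

def blend_and_sample(basis_grids, expr_coeffs, query_pts):
    """basis_grids: (K, D, D, D) learned voxel bases; expr_coeffs: (K,);
    query_pts: (N, 3) in [0, 1]^3. Returns (N,) sampled feature values."""
    field = np.tensordot(expr_coeffs, basis_grids, axes=1)   # (D, D, D) blended field
    D = field.shape[0]
    coords = query_pts.T * (D - 1)                           # to voxel coordinates, shape (3, N)
    return map_coordinates(field, coords, order=1, mode='nearest')  # trilinear sampling

K, D = 4, 16
bases = np.random.rand(K, D, D, D)
coeffs = np.array([0.6, 0.2, 0.1, 0.1])    # per-frame expression coefficients
pts = np.random.rand(100, 3)
print(blend_and_sample(bases, coeffs, pts).shape)
```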
View-dependent effects such as reflections pose a substantial challenge for image-based and neural rendering algorithms. Above all, curved reflectors are particularly hard, as they lead to highly non-linear reflection flows as the camera moves. We introduce a new point-based representation to compute Neural Point Catacaustics allowing novel-view synthesis of scenes with curved reflectors, from a set of casually-captured input photos. At the core of our method is a neural warp field that models catacaustic trajectories of reflections, so complex specular effects can be rendered using efficient point splatting in conjunction with a neural renderer. One of our key contributions is the explicit representation of reflections with a reflection point cloud which is displaced by the neural warp field, and a primary point cloud which is optimized to represent the rest of the scene. After a short manual annotation step, our approach allows interactive high-quality renderings of novel views with accurate reflection flow. Additionally, the explicit representation of reflection flow supports several forms of scene manipulation in captured scenes, such as reflection editing, cloning of specular objects, reflection tracking across views, and comfortable stereo viewing. We provide the source code and other supplemental material on https://repo-sam.inria.fr/fungraph/neural_catacaustics/
Reproducing physically-based global illumination (GI) effects has been a long-standing demand for many real-time graphical applications. In pursuit of this goal, many recent engines resort to some form of light probes baked in a precomputation stage. Unfortunately, the GI effects stemming from the precomputed probes are rather limited due to constraints in probe storage, representation, and query. In this paper, we propose a new method for probe-based GI rendering which can generate a wide range of GI effects, including glossy reflection with multiple bounces, in complex scenes. The key contributions behind our work include a gradient-based search algorithm and a neural image reconstruction method. The search algorithm is designed to reproject the probes' contents to any query viewpoint, without introducing parallax errors, and converges quickly to the optimal solution. The neural image reconstruction method, based on a dedicated neural network and several G-buffers, recovers high-quality images from inputs degraded by the limited resolution or (potentially) low sampling rate of the probes. This neural method makes the generation of light probes efficient. Moreover, a temporal reprojection strategy and a temporal loss are employed to improve temporal stability for animation sequences. The whole pipeline runs in real time (>30 frames per second) even for high-resolution (1920×1080) outputs, thanks to the fast convergence of the gradient-based search algorithm and the lightweight design of the neural network. Extensive experiments on multiple complex scenes show the superiority of our method over the state of the art.
Generating high-quality artistic portrait videos is an important and desirable task in computer graphics and vision. Although a series of successful portrait image toonification models built upon the powerful StyleGAN have been proposed, these image-oriented methods have obvious limitations when applied to videos, such as the fixed frame size, the requirement of face alignment, missing non-facial details and temporal inconsistency. In this work, we investigate the challenging controllable high-resolution portrait video style transfer by introducing a novel VToonify framework. Specifically, VToonify leverages the mid- and high-resolution layers of StyleGAN to render high-quality artistic portraits based on the multi-scale content features extracted by an encoder to better preserve the frame details. The resulting fully convolutional architecture accepts non-aligned faces in videos of variable size as input, contributing to complete face regions with natural motions in the output. Our framework is compatible with existing StyleGAN-based image toonification models to extend them to video toonification, and inherits appealing features of these models for flexible style control on color and intensity. This work presents two instantiations of VToonify built upon Toonify and DualStyleGAN for collection-based and exemplar-based portrait video style transfer, respectively. Extensive experimental results demonstrate the effectiveness of our proposed VToonify framework over existing methods in generating high-quality and temporally-coherent artistic portrait videos with flexible style controls. Code and pretrained models are available at our project page: www.mmlab-ntu.com/project/vtoonify/.
Colorization is multimodal by nature and challenges existing frameworks to achieve colorful and structurally consistent results. Even sophisticated autoregressive models struggle to maintain long-distance color consistency due to the fragility of sequential dependence. To overcome this challenge, we propose a novel colorization framework that disentangles color multimodality and structural consistency through global color anchors, so that both aspects can be learned effectively. Our key insight is that several carefully located anchors can approximately represent the color distribution of an image, and that, conditioned on the anchor colors, we can predict the image colors in a deterministic manner by exploiting internal correlation. To this end, we construct a colorization model with dual branches, where the color modeler predicts the color distribution for the anchor color representation, and the color generator predicts the pixel colors by referring to the sampled anchor colors. Importantly, the anchors are located under two principles, color independence and global coverage, realized with clustering analysis on deep color features. To simplify the computation, we adopt soft superpixel segmentation to reduce the number of image primitives, which still nicely preserves reversibility to the pixel-wise representation. Extensive experiments show that our method achieves notable superiority over various mainstream frameworks in perceptual quality. Thanks to the anchor-based color representation, our model also has the flexibility to support diverse and controllable colorization.
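To give a flavor of the anchor-location step, the hedged sketch below clusters per-superpixel deep color features and picks the superpixel closest to each cluster center as an anchor; the clusters encourage color independence while spreading anchors across clusters gives global coverage. The exact clustering and feature details of the paper are assumed, not reproduced.

```python
# Illustrative sketch: anchor selection by clustering deep color features.
import numpy as np
from sklearn.cluster import KMeans

def locate_anchors(superpixel_feats, n_anchors=8, seed=0):
    """superpixel_feats: (S, C) deep color features, one row per superpixel.
    Returns indices of the selected anchor superpixels."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(superpixel_feats)
    anchors = []
    for k in range(n_anchors):
        members = np.where(km.labels_ == k)[0]
        dist = np.linalg.norm(superpixel_feats[members] - km.cluster_centers_[k], axis=1)
        anchors.append(int(members[np.argmin(dist)]))   # representative of this color mode
    return anchors

feats = np.random.rand(200, 64)     # e.g. 200 superpixels with 64-D deep features
print(locate_anchors(feats))
```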
We propose the first unified framework UniColor to support colorization in multiple modalities, including both unconditional and conditional ones, such as stroke, exemplar, text, and even a mix of them. Rather than learning a separate model for each type of condition, we introduce a two-stage colorization framework for incorporating various conditions into a single model. In the first stage, multi-modal conditions are converted into a common representation of hint points. Particularly, we propose a novel CLIP-based method to convert the text to hint points. In the second stage, we propose a Transformer-based network composed of Chroma-VQGAN and Hybrid-Transformer to generate diverse and high-quality colorization results conditioned on hint points. Both qualitative and quantitative comparisons demonstrate that our method outperforms state-of-the-art methods in every control modality and further enables multi-modal colorization that was not feasible before. Moreover, we design an interactive interface showing the effectiveness of our unified framework in practical usage, including automatic colorization, hybrid-control colorization, local recolorization, and iterative color editing. Our code and models are available at https://luckyhzt.github.io/unicolor.
We introduce MyStyle, a personalized deep generative prior trained with a few shots of an individual. MyStyle allows us to reconstruct, enhance and edit images of a specific person, such that the output is faithful to the person's key facial characteristics. Given a small reference set of portrait images of a person (~100), we tune the weights of a pretrained StyleGAN face generator to form a local, low-dimensional, personalized manifold in the latent space. We show that this manifold constitutes a personalized region that spans latent codes associated with diverse portrait images of the individual. Moreover, we demonstrate that we obtain a personalized generative prior, and propose a unified approach to apply it to various ill-posed image enhancement problems, such as inpainting and super-resolution, as well as semantic editing. Using the personalized generative prior, we obtain outputs that exhibit high fidelity to the input images and are also faithful to the key facial characteristics of the individual in the reference set. We demonstrate our method with fair-use images of numerous widely recognizable individuals for whom we have the prior knowledge needed for a qualitative evaluation of the expected outcome. We evaluate our approach against few-shot baselines and show that our personalized prior, quantitatively and qualitatively, outperforms state-of-the-art alternatives.
We propose a model that extracts a sketch from a colorized image in such a way that the extracted sketch has a line style similar to a given reference sketch while preserving the visual content identically to the colorized image. Authentic sketches drawn by artists have various sketch styles that add visual interest and contribute feeling to the sketch. However, existing sketch-extraction methods generate sketches with only one style. Moreover, existing style transfer models fail to transfer sketch styles because they are mostly designed to transfer the textures of a source style image instead of the sparse line styles of a reference sketch. Because the data volumes needed for standard training of translation systems are lacking, the core of our GAN-based solution is a self-reference sketch style generator that produces various reference sketches with a similar style but different spatial layouts. We use independent attention modules to detect the edges of a colorized image and a reference sketch, as well as the visual correspondences between them. We apply several loss terms to imitate the style and enforce sparsity in the extracted sketches. Our sketch-extraction method results in a close imitation of a reference sketch style drawn by an artist and outperforms all baseline methods. Using our method, we produce a synthetic dataset representing various sketch styles and improve the performance of auto-colorization models, which are in high demand in comics. The validity of our approach is confirmed via qualitative and quantitative evaluations.
Production-level workflows for producing convincing 3D dynamic human faces have long relied on an assortment of labor-intensive tools for geometry and texture generation, motion capture and rigging, and expression synthesis. Recent neural approaches automate individual components, but the corresponding latent representations cannot provide artists with the explicit controls offered by conventional tools. In this paper, we present a new learning-based, video-driven approach for generating dynamic facial geometries with high-quality physically-based assets. For data collection, we construct a hybrid multiview-photometric capture stage, coupled with ultra-fast video cameras, to obtain raw 3D facial assets. We then model the facial expression, geometry and physically-based textures using separate VAEs, imposing a global MLP-based expression mapping across the latent spaces of the respective networks to preserve characteristics across the respective attributes. We also model the delta information as wrinkle maps for the physically-based textures, achieving high-quality 4K dynamic textures. We demonstrate our approach in high-fidelity performer-specific facial capture and cross-identity facial motion retargeting. In addition, our multi-VAE-based neural asset, along with fast adaptation schemes, can be deployed to handle in-the-wild videos. Furthermore, we motivate the utility of our explicit facial disentangling strategy by providing various promising physically-based editing results with high realism. Comprehensive experiments show that our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics. For the rhythm, our system contains a robust rhythm-based segmentation pipeline to ensure the temporal coherence between the vocalization and gestures explicitly. For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.
Battery life is an increasingly urgent challenge for today's untethered VR and AR devices. However, the power efficiency of head-mounted displays is naturally at odds with growing computational requirements driven by better resolution, refresh rate, and dynamic range, all of which reduce the sustained usage time of untethered AR/VR devices. For instance, the Oculus Quest 2, on a fully charged battery, can sustain only 2 to 3 hours of operation. Prior display power reduction techniques mostly target smartphone displays. Directly applying smartphone display power reduction techniques, however, degrades the visual perception in AR/VR with noticeable artifacts. For instance, the "power-saving mode" on smartphones uniformly lowers the pixel luminance across the display and, as a result, presents an overall darkened visual perception to users if directly applied to VR content.
Our key insight is that VR display power reduction must be cognizant of the gaze-contingent nature of high field-of-view VR displays. To that end, we present a gaze-contingent system that, without degrading luminance, minimizes the display power consumption while preserving high visual fidelity when users actively view immersive video sequences. This is enabled by constructing 1) a gaze-contingent color discrimination model through psychophysical studies, and 2) a display power model (with respect to pixel color) through real-device measurements. Critically, due to the careful design decisions made in constructing the two models, our algorithm is cast as a constrained optimization problem with a closed-form solution, which can be implemented as a real-time, image-space shader. We evaluate our system using a series of psychophysical studies and large-scale analyses on natural images. Experiment results show that our system reduces the display power by as much as 24% (14% on average) with little to no perceptual fidelity degradation.
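To convey the shape of the closed-form, per-pixel optimization, the toy sketch below minimizes a power model that is assumed linear in RGB subject to an assumed per-pixel color-discrimination ellipsoid; it omits the luminance-preservation aspect and the measured models of the actual system, and only illustrates why a closed form makes an image-space shader feasible.

```python
# Toy sketch (assumed models): minimize k . c subject to
# (c - c0)^T A (c - c0) <= 1, which has the closed-form solution below.
import numpy as np

def power_optimal_color(c0, A, k):
    """c0: (3,) original linear RGB; A: (3, 3) discrimination ellipsoid
    (wider in the periphery); k: (3,) per-channel display power weights."""
    step = np.linalg.solve(A, k)                 # A^{-1} k
    c = c0 - step / np.sqrt(k @ step)            # move onto the ellipsoid boundary
    return np.clip(c, 0.0, 1.0)

c0 = np.array([0.7, 0.5, 0.6])
A = np.diag([400.0, 400.0, 100.0])               # blue shift is least noticeable here
k = np.array([0.2, 0.3, 0.5])                    # e.g. blue subpixels draw the most power
print(power_optimal_color(c0, A, k))
```

Because the solution is an explicit expression of per-pixel quantities, it maps directly to a real-time fragment shader, which matches the system description above.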
Natural interaction between multiple users within a shared virtual environment (VE) relies on each user's awareness of the current position of their interaction partners. This, however, cannot be guaranteed when users employ noncontinuous locomotion techniques, such as teleportation, which may cause confusion among bystanders.
In this paper, we pursue two approaches to create a pleasant experience for both the moving user and the bystanders observing that movement. First, we will introduce a Smart Avatar system that delivers continuous full-body human representations for noncontinuous locomotion in shared virtual reality (VR) spaces. Smart Avatars imitate their assigned user's real-world movements when close-by and autonomously navigate to their user when the distance between them exceeds a certain threshold, i.e., after the user teleports. As part of the Smart Avatar system, we implemented four avatar transition techniques and compared them to conventional avatar locomotion in a user study, revealing significant positive effects on the observers' spatial awareness, as well as pragmatic and hedonic quality scores.
Second, we introduce the concept of Stuttered Locomotion, which can be applied to any continuous locomotion method. By converting a continuous movement into short-interval teleport steps, we provide the merits of non-continuous locomotion for the moving user while observers can easily keep track of their path. Thus, while the experience for observers is similarly positive as with continuous motion, a user study confirmed that Stuttered Locomotion can significantly reduce the occurrence of cybersickness symptoms for the moving user, making it an attractive choice for shared VEs. We will discuss the potential of Smart Avatars and Stuttered Locomotion for shared VR experiences, both when applied individually and in combination.
Holographic displays promise to deliver unprecedented display capabilities in augmented reality applications, featuring a wide field of view, a wide color gamut, high spatial resolution, and depth cues, all in a compact form factor. While emerging holographic display approaches have been successful in achieving large étendue and high image quality as seen by a camera, the large étendue also reveals a problem that makes existing displays impractical: the sampling of the holographic field by the eye pupil. Existing methods have not investigated this issue due to the lack of displays with large enough étendue, and, as such, they suffer from severe artifacts with varying eye pupil size and location.
We show that the holographic field as sampled by the eye pupil is highly varying for existing display setups, and we propose pupil-aware holography that maximizes the perceptual image quality irrespective of the size, location, and orientation of the eye pupil in a near-eye holographic display. We validate the proposed approach both in simulations and on a prototype holographic display and show that our method eliminates severe artifacts and significantly outperforms existing approaches.
Recent years have seen growing interest in 3D human face modeling due to its wide applications in digital humans, character generation and animation. Existing approaches overwhelmingly emphasize modeling the exterior shapes, textures and skin properties of faces, ignoring the inherent correlation between inner skeletal structures and appearance. In this paper, we present SCULPTOR, 3D face creations with Skeleton Consistency Using a Learned Parametric facial generaTOR, aiming to facilitate the easy creation of both anatomically correct and visually convincing face models via a hybrid parametric-physical representation. At the core of SCULPTOR is LUCY, the first large-scale shape-skeleton face dataset, built in collaboration with plastic surgeons. Named after the fossil of one of the oldest known human ancestors, our LUCY dataset contains high-quality Computed Tomography (CT) scans of the complete human head before and after orthognathic surgeries, which are critical for evaluating surgery results. LUCY consists of 144 scans of 72 subjects (31 male and 41 female), where each subject has two CT scans taken pre- and post-orthognathic operation. Based on our LUCY dataset, we learn a novel skeleton-consistent parametric facial generator, SCULPTOR, which can create unique and nuanced facial features that help define a character while maintaining physiological soundness. SCULPTOR jointly models the skull, face geometry and face appearance under a unified data-driven framework by separating the depiction of a 3D face into shape, pose and facial expression blend shapes. SCULPTOR preserves both anatomic correctness and visual realism in facial generation tasks compared with existing methods. Finally, we showcase the robustness and effectiveness of SCULPTOR in a variety of applications not seen before, such as archaeological skeletal facial completion, bone-aware character fusion, skull inference from images, face generation with lipo-level change, and facial animation.
We present Recurrent Feature Alignment (ReFA), an end-to-end neural network for the very rapid creation of production-grade face assets from multi-view images. ReFA is on par with the industrial pipelines in quality for producing accurate, complete, registered, and textured assets directly applicable to physically-based rendering, but produces the asset end-to-end, fully automatically at a significantly faster speed at 4.5 FPS, which is unprecedented among neural-based techniques. Our method represents face geometry as a position map in the UV space. The network first extracts per-pixel features in both the multi-view image space and the UV space. A recurrent module then iteratively optimizes the geometry by projecting the image-space features to the UV space and comparing them with a reference UV-space feature. The optimized geometry then provides pixel-aligned signals for the inference of high-resolution textures. Experiments have validated that ReFA achieves a median error of 0.603mm in geometry reconstruction, is robust to extreme pose and expression, and excels in sparse-view settings. We believe that the progress achieved by our network enables lightweight, fast face assets acquisition that significantly boosts the downstream applications, such as avatar creation and facial performance capture. It will also enable massive database capturing for deep learning purposes.
In this work we take a novel perception-centered approach to quantify distortions on 3D geometry of faces, to which humans are particularly sensitive. We generated a dataset, composed of 100 high-quality and demographically-balanced face scans. We then subjected these meshes to distortions that cover relevant use cases in computer graphics, and conducted a large-scale perceptual study to subjectively evaluate them. Our dataset consists of over 84,000 quality comparisons, making it the largest ever psychophysical dataset for geometric distortions. Finally, we demonstrated how our data can be used for applications like metrics, compression, and level-of-detail rendering.
We propose LaplacianFusion, a novel approach that reconstructs detailed and controllable 3D clothed-human body shapes from an input depth or 3D point cloud sequence. The key idea of our approach is to use Laplacian coordinates, well-known differential coordinates that have been used for mesh editing, for representing the local structures contained in the input scans, instead of implicit 3D functions or vertex displacements used previously. Our approach reconstructs a controllable base mesh using SMPL, and learns a surface function that predicts Laplacian coordinates representing surface details on the base mesh. For a given pose, we first build and subdivide a base mesh, which is a deformed SMPL template, and then estimate Laplacian coordinates for the mesh vertices using the surface function. The final reconstruction for the pose is obtained by integrating the estimated Laplacian coordinates as a whole. Experimental results show that our approach based on Laplacian coordinates successfully reconstructs more visually pleasing shape details than previous methods. The approach also enables various surface detail manipulations, such as detail transfer and enhancement.
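The differential-coordinate machinery that LaplacianFusion builds on is standard and compact enough to sketch. The example below uses a uniform graph Laplacian (a simplification; the paper's weighting and learned surface function are not reproduced): Laplacian coordinates delta = L v encode local detail, and positions are recovered by a sparse least-squares solve with a few anchored vertices.

```python
# Self-contained sketch: Laplacian coordinates and their integration back to positions.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def uniform_laplacian(n_verts, faces):
    rows, cols = [], []
    for f in faces:
        for i in range(3):
            rows += [f[i], f[(i + 1) % 3]]
            cols += [f[(i + 1) % 3], f[i]]
    A = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_verts, n_verts)).tocsr()
    A.data[:] = 1.0                                    # de-duplicate shared edges
    deg = np.asarray(A.sum(axis=1)).ravel()
    return sp.diags(deg) - A                           # combinatorial Laplacian

def reconstruct(L, delta, anchor_idx, anchor_pos, w=10.0):
    """argmin_v ||L v - delta||^2 + w^2 ||v[anchors] - anchor_pos||^2, per coordinate."""
    n = L.shape[0]
    C = sp.coo_matrix((np.full(len(anchor_idx), w),
                       (np.arange(len(anchor_idx)), anchor_idx)),
                      shape=(len(anchor_idx), n))
    A = sp.vstack([L, C]).tocsr()
    out = np.zeros((n, 3))
    for d in range(3):
        b = np.concatenate([delta[:, d], w * anchor_pos[:, d]])
        out[:, d] = lsqr(A, b)[0]
    return out

# tiny example: a unit quad made of two triangles
V = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
F = [(0, 1, 2), (0, 2, 3)]
L = uniform_laplacian(4, F)
delta = np.asarray(L @ V)              # Laplacian coordinates (the "surface detail")
print(reconstruct(L, delta, [0, 1], V[:2]))   # recovers the quad from delta + 2 anchors
```

In LaplacianFusion, the delta values are predicted by a learned surface function on a subdivided SMPL base mesh, which is what enables detail transfer and enhancement.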
3D Morphable models of the human body capture variations among subjects and are useful in reconstruction and editing applications. Current dental models use an explicit mesh scene representation and model only the teeth, ignoring the gum. In this work, we present the first parametric 3D morphable dental model for both teeth and gum. Our model uses an implicit scene representation and is learned from rigidly aligned scans. It is based on a component-wise representation for each tooth and the gum, together with a learnable latent code for each of such components. It also learns a template shape thus enabling several applications such as segmentation, interpolation and tooth replacement. Our reconstruction quality is on par with the most advanced global implicit representations while enabling novel applications. The code will be available at https://github.com/cong-yi/DMM
The trade-off between speed and fidelity in cloth simulation is a fundamental computational problem in computer graphics and computational design. Coarse cloth models provide the interactive performance required by designers, but they cannot be simulated at higher resolutions ("up-resed") without introducing simulation artifacts and/or unpredicted outcomes, such as different folds, wrinkles and drapes. But how can a coarse simulation predict the result of an unconstrained, high-resolution simulation that has not yet been run?
We propose Progressive Cloth Simulation (PCS), a new forward simulation method for efficient preview of cloth quasistatics on exceedingly coarse triangle meshes with consistent and progressive improvement over a hierarchy of increasingly higher-resolution models. PCS provides an efficient coarse previewing simulation method that predicts the coarse-scale folds and wrinkles that will be generated by a corresponding converged, high-fidelity C-IPC simulation of the cloth drape's equilibrium. For each preview PCS can generate an increasing-resolution sequence of consistent models that progress towards this converged solution. This successive improvement can then be interrupted at any point, for example, whenever design parameters are updated. PCS then ensures feasibility at all resolutions, so that predicted solutions remain intersection-free and capture the complex folding and buckling behaviors of frictionally contacting cloth.
Realistic dynamic garments on animated characters have many AR/VR applications. While authoring such dynamic garment geometry remains a challenging task, data-driven simulation provides an attractive alternative, especially if it can be controlled simply by the motion of the underlying character. In this work, we focus on motion-guided dynamic 3D garments, especially loose garments. In a data-driven setup, we first learn a generative space of plausible garment geometries. Then, we learn a mapping to this space that captures the motion-dependent dynamic deformations, conditioned on the previous state of the garment as well as its relative position with respect to the underlying body. Technically, we model garment dynamics, driven by the input character motion, by predicting per-frame local displacements in a canonical state of the garment that is enriched with frame-dependent skinning weights to bring the garment into the global space. We resolve any remaining per-frame collisions by predicting residual local displacements. The resulting garment geometry is used as history to enable iterative roll-out prediction. We demonstrate plausible generalization to unseen body shapes and motion inputs, and show improvements over multiple state-of-the-art alternatives. Code and data are released at https://geometry.cs.ucl.ac.uk/projects/2022/MotionDeepGarment/
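The posing step described above (canonical displacements plus skinning into the global space) reduces to standard linear blend skinning. The sketch below shows that generic machinery under assumed shapes; the displacement and weight predictors themselves are the learned parts and are not reproduced here.

```python
# Hedged sketch: apply predicted canonical displacements, then linear blend skinning.
import numpy as np

def pose_garment(canonical_verts, local_disp, skin_weights, bone_transforms):
    """canonical_verts: (V, 3); local_disp: (V, 3) predicted per-frame displacements;
    skin_weights: (V, J) frame-dependent weights; bone_transforms: (J, 4, 4) bone matrices."""
    v = canonical_verts + local_disp                              # enrich the canonical state
    v_h = np.concatenate([v, np.ones((v.shape[0], 1))], axis=1)   # homogeneous coordinates
    T = np.einsum('vj,jab->vab', skin_weights, bone_transforms)   # per-vertex blended transform
    return np.einsum('vab,vb->va', T, v_h)[:, :3]                 # posed global positions

V, J = 5, 2
verts = np.random.rand(V, 3)
disp = 0.01 * np.random.randn(V, 3)
weights = np.random.dirichlet(np.ones(J), size=V)     # rows sum to 1
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, 3] = [0.0, 0.1, 0.0]                     # translate the second bone
print(pose_garment(verts, disp, weights, bones).shape)
```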
We present a general framework for the garment animation problem through unsupervised deep learning inspired by physically based simulation. Existing trends in the literature already explore this possibility. Nonetheless, these approaches do not handle cloth dynamics. Here, we propose the first methodology able to learn realistic cloth dynamics in an unsupervised manner and, thereby, a general formulation for neural cloth simulation. The key to achieving this is to adapt an existing optimization scheme for motion from simulation-based methodologies to deep learning. Then, by analyzing the nature of the problem, we devise an architecture able to automatically disentangle static and dynamic cloth subspaces by design. We show how this improves model performance. Additionally, this opens the possibility of a novel motion augmentation technique that greatly improves generalization. Finally, we show that it also allows us to control the level of motion in the predictions. This is a useful tool for artists that has not been available before. We provide a detailed analysis of the problem to establish the foundations of neural cloth simulation and to guide future research into the specifics of this domain.
Real-world fabrics often possess complicated nonlinear, anisotropic bending stiffness properties. Measuring the physical parameters of such properties for physics-based simulation is difficult yet unnecessary, due to the persistent existence of numerical errors in simulation technology. In this work, we propose to adopt a simulation-in-the-loop strategy: instead of measuring the physical parameters, we estimate the simulation parameters to minimize the discrepancy between reality and simulation. This strategy offers good flexibility in test setups, but the associated optimization problem is computationally expensive to solve by numerical methods. Our solution is to train a regression-based neural network for inferring bending stiffness parameters, directly from drape features captured in the real world. Specifically, we choose the Cusick drape test method and treat multiple-view depth images as the feature vector. To effectively and efficiently train our network, we develop a highly expressive and physically validated bending stiffness model, and we use the traditional cantilever test to collect the parameters of this model for 618 real-world fabrics. Given the whole parameter data set, we then construct a parameter subspace, generate new samples within the sub-space, and finally simulate and augment synthetic data for training purposes. The experiment shows that our trained system can replace cantilever tests for quick, reliable and effective estimation of simulation-ready parameters. Thanks to the use of the system, our simulator can now faithfully simulate bending effects comparable to those in the real world.
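As a rough sense of the regression component, the sketch below maps multi-view Cusick-drape depth images to bending-stiffness parameters with a small CNN; the architecture, channel counts, and parameter dimensionality are assumptions for illustration, and the training data would come from simulated drapes sampled in the learned parameter subspace as described above.

```python
# Illustrative sketch (assumed architecture): depth views -> stiffness parameters.
import torch
import torch.nn as nn

class StiffnessRegressor(nn.Module):
    def __init__(self, n_views=4, n_params=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_views, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                     # global pooling over the drape
        self.head = nn.Linear(128, n_params)

    def forward(self, depth_views):                      # (B, n_views, H, W)
        return self.head(self.features(depth_views).flatten(1))

model = StiffnessRegressor()
depth = torch.rand(2, 4, 128, 128)     # two fabrics, four depth views each
target = torch.rand(2, 8)              # simulation-ready bending parameters
loss = nn.functional.mse_loss(model(depth), target)
loss.backward()
print(loss.item())
```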
Despite recent progress in developing animatable full-body avatars, realistic modeling of clothing - one of the core aspects of human self-expression - remains an open challenge. State-of-the-art physical simulation methods can generate realistically behaving clothing geometry at interactive rates. Modeling photorealistic appearance, however, usually requires physically-based rendering which is too expensive for interactive applications. On the other hand, data-driven deep appearance models are capable of efficiently producing realistic appearance, but struggle at synthesizing geometry of highly dynamic clothing and handling challenging body-clothing configurations. To this end, we introduce pose-driven avatars with explicit modeling of clothing that exhibit both photorealistic appearance learned from real-world data and realistic clothing dynamics. The key idea is to introduce a neural clothing appearance model that operates on top of explicit geometry: at training time we use high-fidelity tracking, whereas at animation time we rely on physically simulated geometry. Our core contribution is a physically-inspired appearance network, capable of generating photorealistic appearance with view-dependent and dynamic shadowing effects even for unseen body-clothing configurations. We conduct a thorough evaluation of our model and demonstrate diverse animation results on several subjects and different types of clothing. Unlike previous work on photorealistic full-body avatars, our approach can produce much richer dynamics and more realistic deformations even for many examples of loose clothing. We also demonstrate that our formulation naturally allows clothing to be used with avatars of different people while staying fully animatable, thus enabling, for the first time, photorealistic avatars with novel clothing.
Hair rendering has been a focal point of attention in computer graphics for the last couple of decades. However, there have been few contributions to the modeling and rendering of the natural hair aging phenomenon. We present a new technique that simulates the process of hair graying and hair thinning on digital models due to aging. Given a 3D human head model with hair, we first compute a segmentation of the head using K-means since hair aging occurs at different rates in distinct head parts. Hair graying is simulated according to recent biological knowledge on aging factors for hairs, and hair thinning decreases hair diameters linearly with time. Our system is biologically inspired, supports facial hair, both genders and many ethnicities, and is compatible with different lengths of hair strands. Our real-time results resemble real-life hair aging, accomplished by simulating the stochastic nature of the process and the gradual decrease of melanin.
Existing generative models for 3D shapes are typically trained on a large 3D dataset, often of a specific object category. In this paper, we investigate a deep generative model that learns from only a single reference 3D shape. Specifically, we present a multi-scale GAN-based model designed to capture the input shape's geometric features across a range of spatial scales. To avoid large memory and computational cost induced by operating on the 3D volume, we build our generator atop the tri-plane hybrid representation, which requires only 2D convolutions. We train our generative model on a voxel pyramid of the reference shape, without the need for any external supervision or manual annotation. Once trained, our model can generate diverse and high-quality 3D shapes possibly of different sizes and aspect ratios. The resulting shapes present variations across different scales, and at the same time retain the global structure of the reference shape. Through extensive evaluation, both qualitative and quantitative, we demonstrate that our model can generate 3D shapes of various types.
Exact 3D path generation is a fundamental problem of designing a mechanism to make a point exactly move along a prescribed 3D path, driven by a single actuator. Existing mechanisms are insufficient to address this problem. Planar linkages and their combinations with gears and/or plate cams can only generate 2D paths while 1-DOF spatial linkages can only generate 3D paths with rather simple shapes. In this paper, we present a new 3D cam-linkage mechanism, consisting of two 3D cams and five links, for exactly generating a continuous 3D path. To design a 3D cam-linkage mechanism, we first model a 3-DOF 5-bar spatial linkage to exactly generate a prescribed 3D path and then reduce the spatial linkage's DOFs from 3 to 1 by composing the linkage with two 3D cam-follower mechanisms. Our computational approach optimizes the 3D cam-linkage mechanism's topology and geometry to minimize the mechanism's total weight while ensuring smooth, collision-free, and singularity-free motion. We show that our 3D cam-linkage mechanism is able to exactly generate a continuous 3D path with arbitrary shape and a finite number of C0 points, evaluate the mechanism's kinematic performance with 3D printed prototypes, and demonstrate that the mechanism can be generalized for exact 3D motion generation.
We present a novel neural surface reconstruction method called NeuralRoom for reconstructing room-sized indoor scenes directly from a set of 2D images. Recently, implicit neural representations have become a promising way to reconstruct surfaces from multiview images due to their high-quality results and simplicity. However, implicit neural representations usually cannot reconstruct indoor scenes well because they suffer from severe shape-radiance ambiguity. We assume that the indoor scene consists of texture-rich and flat texture-less regions. In texture-rich regions, multiview stereo can obtain accurate results. In flat texture-less regions, normal estimation networks usually produce good normal estimates. Based on the above observations, we reduce the possible spatial variation range of implicit neural surfaces using reliable geometric priors to alleviate shape-radiance ambiguity. Specifically, we use multiview stereo results to limit the NeuralRoom optimization space and reliable geometric priors to guide its training. NeuralRoom then produces a neural scene representation that renders images consistent with the input training images. In addition, we propose a smoothing method called perturbation-residual restrictions to improve the accuracy and completeness of the flat region, which assumes that sampling points on a local surface patch should have the same normal and similar distance to the observation center. Experiments on the ScanNet dataset show that our method can reconstruct the texture-less area of indoor scenes while maintaining the accuracy of detailed regions. We also apply NeuralRoom to more advanced multiview reconstruction algorithms and significantly improve their reconstruction quality.
We introduce a statistical extension of the classic Poisson Surface Reconstruction algorithm for recovering shapes from 3D point clouds. Instead of outputting an implicit function, we represent the reconstructed shape as a modified Gaussian Process, which allows us to conduct statistical queries (e.g., the likelihood of a point in space being on the surface or inside a solid). We show that this perspective: improves PSR's integration into the online scanning process, broadens its application realm, and opens the door to other lines of research such as applying task-specific priors.
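To make the kind of statistical query described above concrete, the sketch below fits a generic Gaussian-Process regressor to signed implicit-function samples and evaluates the probability that a query point lies inside the solid. The RBF kernel, the toy sphere data, and all hyperparameters are illustrative assumptions; the paper derives its Gaussian Process from the Poisson reconstruction formulation itself rather than from a generic kernel.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, ell=0.5, sigma=1.0):
    # Squared-exponential kernel between point sets A (n,3) and B (m,3).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)

def gp_implicit_posterior(X_obs, f_obs, X_query, noise=1e-2):
    """Posterior mean/variance of an implicit function (f<0 inside, f>0 outside)
    observed at X_obs and queried at X_query; standard GP regression equations."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_query, X_obs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_obs))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.clip(rbf_kernel(X_query, X_query).diagonal() - (v * v).sum(0), 1e-12, None)
    return mean, var

# Toy data: points just inside / outside a unit sphere, with signed offsets.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 3)); P /= np.linalg.norm(P, axis=1, keepdims=True)
X_obs = np.vstack([0.9 * P, 1.1 * P])
f_obs = np.concatenate([-0.1 * np.ones(200), 0.1 * np.ones(200)])
X_query = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
mean, var = gp_implicit_posterior(X_obs, f_obs, X_query)
p_inside = norm.cdf(0.0, loc=mean, scale=np.sqrt(var))  # P(f(x) < 0)
print(p_inside)  # higher inside-probability at the origin than far outside
```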
Feature lines are important geometric cues in characterizing the structure of a CAD model. Despite great progress in both explicit reconstruction and implicit reconstruction, it remains a challenging task to reconstruct a polygonal surface equipped with feature lines, especially when the input point cloud is noisy and lacks faithful normal vectors. In this paper, we develop a multistage algorithm, named RFEPS, to address this challenge. The key steps include (1) denoising the point cloud based on the assumption of local planarity, (2) identifying the feature-line zone by optimization of discrete optimal transport, (3) augmenting the point set so that sufficiently many additional points are generated on potential geometry edges, and (4) generating a polygonal surface that interpolates the augmented point set based on the restricted power diagram. We demonstrate through extensive experiments that RFEPS, benefiting from the edge-point augmentation and the feature-preserving explicit reconstruction, outperforms state-of-the-art methods in reconstruction quality, especially in the ability to reconstruct missing feature lines.
To reconstruct meshes from the widely-available 3D point cloud data, implicit shape representation is among the primary choices as an intermediate form due to its superior representation power and robustness in topological optimizations. Although different parameterizations of the implicit fields have been explored to model the underlying geometry, there is no explicit mechanism to ensure the fitting tightness of the surface to the input. In response, we present NeuralGalerkin, a neural Galerkin-method-based solver for reconstructing highly accurate surfaces from input point clouds. NeuralGalerkin internally discretizes the target implicit field as a linear combination of a set of spatially-varying basis functions inferred by an adaptive sparse convolution neural network. It then differentiably solves, in closed form within a single forward pass, a variational problem that incorporates both positional and normal constraints from the data, closely respecting the raw input points. The reconstructed surface extracted from the implicit interpolants is hence very accurate and incorporates useful inductive biases benefiting from the training data. Extensive evaluations on various datasets demonstrate our method's promising reconstruction performance and scalability.
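As a rough illustration of the closed-form solve described above, the sketch below fits a tiny 2D implicit field as a linear combination of fixed Gaussian radial bases, with positional constraints f(p_i) ≈ 0 and normal constraints ∇f(p_i) ≈ n_i solved by regularized linear least squares. The fixed basis, the circle data, and the bandwidth are stand-ins: the actual method infers spatially-varying bases with an adaptive sparse convolutional network and operates in 3D.

```python
import numpy as np

def gaussian_basis(x, centers, h=0.3):
    """Values and gradients of Gaussian radial basis functions at points x (n,2)."""
    diff = x[:, None, :] - centers[None, :, :]           # (n, m, 2)
    r2 = (diff ** 2).sum(-1)                             # (n, m)
    phi = np.exp(-r2 / (2 * h**2))                       # (n, m)
    dphi = -diff / h**2 * phi[..., None]                 # (n, m, 2)
    return phi, dphi

def solve_coefficients(points, normals, centers, lam=1e-6):
    """Closed-form least squares for f(p_i) ~ 0 and grad f(p_i) ~ n_i."""
    phi, dphi = gaussian_basis(points, centers)
    n, m = phi.shape
    A = np.vstack([phi,                  # positional constraint rows
                   dphi[:, :, 0],        # x-component of the normal constraints
                   dphi[:, :, 1]])       # y-component of the normal constraints
    b = np.concatenate([np.zeros(n), normals[:, 0], normals[:, 1]])
    return np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ b)

# Toy input: samples of the unit circle with outward normals.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
points = np.c_[np.cos(t), np.sin(t)]
normals = points.copy()
centers = points[::4]                    # a crude fixed basis layout
c = solve_coefficients(points, normals, centers)

queries = np.array([[0.9, 0.0], [1.0, 0.0], [1.1, 0.0]])
phi, _ = gaussian_basis(queries, centers)
print(phi @ c)   # should increase from just inside to just outside the circle
```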
We introduce DeepJoin, an automated approach to generate high-resolution repairs for fractured shapes using deep neural networks. Existing approaches to perform automated shape repair operate exclusively on symmetric objects, require a complete proxy shape, or predict restoration shapes using low-resolution voxels which are too coarse for physical repair. We generate a high-resolution restoration shape by inferring a corresponding complete shape and a break surface from an input fractured shape. We present a novel implicit shape representation for fractured shape repair that combines the occupancy function, signed distance function, and normal field. We demonstrate repairs using our approach for synthetically fractured objects from ShapeNet, 3D scans from the Google Scanned Objects dataset, objects in the style of ancient Greek pottery from the QP Cultural Heritage dataset, and real fractured objects. We outperform six baseline approaches in terms of chamfer distance and normal consistency. Unlike existing approaches and restorations generated using subtraction, DeepJoin restorations do not exhibit surface artifacts and join closely to the fractured region of the fractured shape. Our code is available at: https://github.com/Terascale-All-sensing-Research-Studio/DeepJoin.
Given a portrait image of a person and an environment map of the target lighting, portrait relighting aims to re-illuminate the person in the image as if the person appeared in an environment with the target lighting. To achieve high-quality results, recent methods rely on deep learning. An effective approach is to supervise the training of deep neural networks with a high-fidelity dataset of desired input-output pairs, captured with a light stage. However, acquiring such data requires an expensive special capture rig and time-consuming efforts, limiting access to only a few resourceful laboratories. To address the limitation, we propose a new approach that can perform on par with the state-of-the-art (SOTA) relighting methods without requiring a light stage. Our approach is based on the realization that a successful relighting of a portrait image depends on two conditions. First, the method needs to mimic the behaviors of physically-based relighting. Second, the output has to be photorealistic. To meet the first condition, we propose to train the relighting network with training data generated by a virtual light stage that performs physically-based rendering on various 3D synthetic humans under different environment maps. To meet the second condition, we develop a novel synthetic-to-real approach to bring photorealism to the relighting network output. In addition to achieving SOTA results, our approach offers several advantages over the prior methods, including controllable glares on glasses and more temporally-consistent results for relighting videos.
The advancements in hardware have drawn more attention than ever to high-quality offline rendering with modern stream processors, both in the industry and in research fields. However, graphics APIs are fragmented and existing shading languages lack high-level constructs such as polymorphism, which adds complexity to developing and maintaining cross-platform high-performance renderers. We present LuisaRender, a high-performance rendering framework for modern stream-architecture hardware. Our main contribution is an expressive C++-embedded DSL for kernel programming with JIT code generation and compilation. We also implement a unified runtime layer with resource wrappers and an optimized Monte Carlo renderer. Experiments on test scenes show that LuisaRender achieves much higher performance than existing research renderers on modern graphics hardware, e.g., 5--11× faster than PBRT-v4 and 4--16× faster than Mitsuba 3.
Streaming rendered 3D content over a network to a thin client device, such as a phone or a VR/AR headset, brings high-fidelity graphics to platforms where it would not normally be possible due to thermal, power, or cost constraints. Streamed 3D content must be transmitted with a representation that is robust to both latency and potential network dropouts. Transmitting a video stream and reprojecting to correct for changing viewpoints fails in the presence of disocclusion events; streaming scene geometry and performing high-quality rendering on the client is not possible on limited-power mobile GPUs. To balance the competing goals of disocclusion robustness and minimal client workload, we introduce QuadStream, a new streaming content representation that reduces motion-to-photon latency by allowing clients to efficiently render novel views without artifacts caused by disocclusion events. Motivated by traditional macroblock approaches to video codec design, we decompose the scene seen from positions in a view cell into a series of quad proxies, or view-aligned quads from multiple views. By operating on a rasterized G-Buffer, our approach is independent of the representation used for the scene itself; the resulting QuadStream is an approximate geometric representation of the scene that can be reconstructed by a thin client to render both the current view and nearby adjacent views. Our technical contributions are an efficient parallel quad generation, merging, and packing strategy for proxy views covering potential client movement in a scene; a packing and encoding strategy that allows masked quads with depth information to be transmitted as a frame-coherent stream; and an efficient rendering approach for rendering our QuadStream representation into entirely novel views on thin clients. We show that our approach achieves superior quality compared both to video data streaming methods and to geometry-based streaming.
The practical deployment of Neural Radiance Fields (NeRF) in rendering applications faces several challenges, with the most critical one being low rendering speed on even high-end graphics processing units (GPUs). In this paper, we present ICARUS, a specialized accelerator architecture tailored for NeRF rendering. Unlike GPUs using general purpose computing and memory architectures for NeRF, ICARUS executes the complete NeRF pipeline using dedicated plenoptic cores (PLCore) consisting of a positional encoding unit (PEU), a multi-layer perceptron (MLP) engine, and a volume rendering unit (VRU). A PLCore takes in positions and directions and renders the corresponding pixel colors without any intermediate data going off-chip for temporary storage and exchange, which can be time- and power-consuming. To implement the most expensive component of NeRF, i.e., the MLP, we transform the fully connected operations to approximated reconfigurable multiple constant multiplications (MCMs), where common subexpressions are shared across different multiplications to improve the computation efficiency. We build a prototype ICARUS using Synopsys HAPS-80 S104, a field programmable gate array (FPGA)-based prototyping system for large-scale integrated circuits and systems design. We evaluate the power-performance-area (PPA) of a PLCore using 40nm LP CMOS technology. Working at 400 MHz, a single PLCore occupies 16.5 mm² and consumes 282.8 mW, translating to 0.105 µJ/sample. The results are compared with those of GPU and tensor processing unit (TPU) implementations.
We have recently seen tremendous progress in neural advances for photo-real human modeling and rendering. However, it is still challenging to integrate them into an existing mesh-based pipeline for downstream applications. In this paper, we present a comprehensive neural approach for high-quality reconstruction, compression, and rendering of human performances from dense multi-view videos. Our core intuition is to bridge the traditional animated mesh workflow with a new class of highly efficient neural techniques. We first introduce a neural surface reconstructor for high-quality surface generation in minutes. It marries the implicit volumetric rendering of the truncated signed distance field (TSDF) with multi-resolution hash encoding. We further propose a hybrid neural tracker to generate animated meshes, which combines explicit non-rigid tracking with implicit dynamic deformation in a self-supervised framework. The former provides the coarse warping back into the canonical space, while the latter implicit one further predicts the displacements using the 4D hash encoding as in our reconstructor. Then, we discuss the rendering schemes using the obtained animated meshes, ranging from dynamic texturing to lumigraph rendering under various bandwidth settings. To strike an intricate balance between quality and bandwidth, we propose a hierarchical solution by first rendering 6 virtual views covering the performer and then conducting occlusion-aware neural texture blending. We demonstrate the efficacy of our approach in a variety of mesh-based applications and photo-realistic free-view experiences on various platforms, e.g., inserting virtual human performances into real environments through mobile AR or immersively watching talent shows with VR headsets.
Implicit radiance functions have emerged as a powerful scene representation for reconstructing and rendering photo-realistic views of a 3D scene. These representations, however, suffer from poor editability. On the other hand, explicit representations such as polygonal meshes allow easy editing but are not as suitable for reconstructing accurate details in dynamic human heads, such as fine facial features, hair, teeth, and eyes. In this work, we present Neural Parameterization (NeP), a hybrid representation that provides the advantages of both implicit and explicit methods. NeP is capable of photo-realistic rendering while allowing fine-grained editing of the scene geometry and appearance. We first disentangle the geometry and appearance by parameterizing the 3D geometry into 2D texture space. We enable geometric editability by introducing an explicit linear deformation blending layer. The deformation is controlled by a set of sparse key points, which can be explicitly and intuitively displaced to edit the geometry. For appearance, we develop a hybrid 2D texture consisting of an explicit texture map for easy editing and implicit view- and time-dependent residuals to model temporal and view variations. We compare our method to several reconstruction and editing baselines. The results show that NeP achieves almost the same level of rendering accuracy while maintaining high editability.
Photorealistic digital re-aging of faces in video is becoming increasingly common in entertainment and advertising. But the predominant 2D painting workflow often requires frame-by-frame manual work that can take days to accomplish, even by skilled artists. Although research on facial image re-aging has attempted to automate and solve this problem, current techniques are of little practical use as they typically suffer from facial identity loss, poor resolution, and unstable results across subsequent video frames. In this paper, we present the first practical, fully-automatic and production-ready method for re-aging faces in video images. Our first key insight is in addressing the problem of collecting longitudinal training data for learning to re-age faces over extended periods of time, a task that is nearly impossible to accomplish for a large number of real people. We show how such a longitudinal dataset can be constructed by leveraging the current state-of-the-art in facial re-aging that, although failing on real images, does provide photoreal re-aging results on synthetic faces. Our second key insight is then to leverage such synthetic data and formulate facial re-aging as a practical image-to-image translation task that can be performed by training a well-understood U-Net architecture, without the need for more complex network designs. We demonstrate how the simple U-Net, surprisingly, allows us to advance the state of the art for re-aging real faces on video, with unprecedented temporal stability and preservation of facial identity across variable expressions, viewpoints, and lighting conditions. Finally, our new face re-aging network (FRAN) incorporates simple and intuitive mechanisms that provide artists with localized control and creative freedom to direct and fine-tune the re-aging effect, a feature that is critically important in real production pipelines and often overlooked in related research work.
Image processing pipelines are ubiquitous and we rely on them either directly, by filtering or adjusting an image post-capture, or indirectly, as image signal processing (ISP) pipelines on broadly deployed camera systems. Used by artists, photographers, system engineers, and for downstream vision tasks, traditional image processing pipelines feature complex algorithmic branches developed over decades. Recently, image-to-image networks have made great strides in image processing, style transfer, and semantic understanding. The differentiable nature of these networks allows them to fit a large corpus of data; however, they do not allow for intuitive, fine-grained controls that photographers find in modern photo-finishing tools.
This work closes that gap and presents an approach to making complex photo-finishing pipelines differentiable, allowing legacy algorithms to be trained akin to neural networks using first-order optimization methods. By concatenating tailored network proxy models of individual processing steps (e.g. white-balance, tone-mapping, color tuning), we can model a non-differentiable reference image finishing pipeline more faithfully than existing proxy image-to-image network models. We validate the method for several diverse applications, including photo and video style transfer, slider regression for commercial camera ISPs, photography-driven neural demosaicking, and adversarial photo-editing.
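The paragraphs above describe chaining per-stage network proxies so that slider parameters become differentiable. The sketch below, assuming PyTorch, shows that pattern in miniature: a hypothetical StageProxy module stands in for one finishing step conditioned on its sliders, and slider values are recovered for a target image by first-order optimization through the chain. Architecture, parameter counts, and data are placeholders, not the proxies trained in the paper, and the proxies here are untrained random networks purely to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class StageProxy(nn.Module):
    """A tiny proxy for one processing stage (e.g., white balance or tone mapping),
    conditioned on that stage's slider parameters."""
    def __init__(self, n_params):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_params, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, img, params):
        b, _, h, w = img.shape
        p = params.view(b, -1, 1, 1).expand(b, params.shape[-1], h, w)
        return img + self.net(torch.cat([img, p], dim=1))   # residual correction

# A proxy pipeline with hypothetical per-stage slider counts.
stages = nn.ModuleList([StageProxy(3), StageProxy(2), StageProxy(4)])

def run_pipeline(raw, slider_sets):
    x = raw
    for stage, sliders in zip(stages, slider_sets):
        x = stage(x, sliders)
    return x

# Slider regression: given a target finished image, recover slider values by
# gradient descent through the differentiable proxy chain.
raw = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
sliders = [torch.zeros(1, n, requires_grad=True) for n in (3, 2, 4)]
opt = torch.optim.Adam(sliders, lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.l1_loss(run_pipeline(raw, sliders), target)
    loss.backward()
    opt.step()
```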
Fluidic devices are crucial components in many industrial applications involving fluid mechanics. Computational design of a high-performance fluidic system faces multifaceted challenges regarding its geometric representation and physical accuracy. We present a novel topology optimization method to design fluidic devices in a Stokes flow context. A key feature of our approach is its ability to accommodate a broad spectrum of boundary conditions at the solid-fluid interface. Our key contribution is an anisotropic and differentiable constitutive model that unifies the representation of different phases and boundary conditions in a Stokes model, enabling a topology optimization method that can synthesize novel structures with accurate boundary conditions from a background grid discretization. We demonstrate the efficacy of our approach by conducting several fluidic system design tasks with over four million design parameters.
We present a novel Monte Carlo-based fluid simulation approach capable of pointwise and stochastic estimation of fluid motion. Drawing on the Feynman-Kac representation of the vorticity transport equation, we propose a recursive Monte Carlo estimator of the Biot-Savart law and extend it with a stream function formulation that allows us to treat free-slip boundary conditions using a Walk-on-Spheres algorithm. Inspired by the Monte Carlo literature in rendering, we design and compare variance reduction schemes suited to a fluid simulation context for the first time, show the approach's applicability to complex boundary settings, and detail a simple and practical implementation with temporal grid caching. We validate the correctness of our approach via quantitative and qualitative evaluations - across a range of settings and domain geometries - and thoroughly explore its parameters' design space. Finally, we provide an in-depth discussion of several axes of future work building on this new numerical simulation modality.
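The Walk-on-Spheres routine mentioned above is easiest to see on the classic problem it was designed for. The sketch below estimates a harmonic function (Laplace's equation with Dirichlet data) pointwise on the unit disk; the paper builds a more elaborate recursive estimator for its stream-function formulation on top of this kind of walk, which is not reproduced here. The domain, boundary data, and tolerances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Domain: unit disk. Distance to the boundary and Dirichlet data g.
def dist_to_boundary(x):
    return 1.0 - np.linalg.norm(x)

def g(x):
    return x[0] * x[1]          # x*y is harmonic, so the interior solution equals g

def walk_on_spheres(x0, eps=1e-3, max_steps=200):
    """One sample of u(x0) for Laplace's equation with Dirichlet data g."""
    x = np.array(x0, dtype=float)
    for _ in range(max_steps):
        r = dist_to_boundary(x)
        if r < eps:
            break
        theta = rng.uniform(0.0, 2.0 * np.pi)
        x = x + r * np.array([np.cos(theta), np.sin(theta)])  # jump to the sphere surface
    return g(x)

x0 = (0.3, 0.4)
estimate = np.mean([walk_on_spheres(x0) for _ in range(20000)])
print(estimate, 0.3 * 0.4)      # Monte Carlo estimate vs. the exact harmonic value
```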
This paper presents a new representation of curve dynamics, with applications to vortex filaments in fluid dynamics. Instead of representing these filaments with explicit curve geometry and Lagrangian equations of motion, we represent curves implicitly with a new codimension-2 level set description. Our implicit representation admits several redundant mathematical degrees of freedom in both the configuration and the dynamics of the curves, which can be tailored specifically to improve numerical robustness, in contrast to naive approaches for implicit curve dynamics that suffer from overwhelming numerical stability problems. Furthermore, we note how these hidden degrees of freedom perfectly map to a Clebsch representation in fluid dynamics. Motivated by these observations, we introduce untwisted level set functions and non-swirling dynamics which successfully regularize sources of numerical instability, particularly in the twisting modes around curve filaments. A consequence is a novel simulation method which produces stable dynamics for large numbers of interacting vortex filaments and effortlessly handles topological changes and re-connection events.
We present a new octree-based neighborhood search method for SPH simulation. A speedup of up to 1.9x is observed in comparison to state-of-the-art methods which rely on uniform grids. While our method focuses on maximizing performance in fixed-radius SPH simulations, we show that it can also be used in scenarios where the particle support radius is not constant thanks to the adaptive nature of the octree acceleration structure.
Neighborhood search methods typically consist of an acceleration structure that prunes the space of possible particle neighbor pairs, followed by direct distance comparisons between the remaining particle pairs. Previous works have focused on minimizing the number of comparisons. However, in an effort to minimize the actual computation time, we find that distance comparisons exhibit very high throughput on modern CPUs. By permitting more comparisons than strictly necessary, the time spent on preparing and searching the acceleration structure can be reduced, yielding a net positive speedup. The choice of an octree acceleration structure, instead of the uniform grid typically used in fixed-radius methods, ensures balanced computational tasks. This both benefits parallelism and provides consistently high computational intensity for the distance comparisons. We present a detailed account of high-level considerations that, together with low-level decisions, enable high throughput for performance-critical parts of the algorithm.
Finally, we demonstrate the high performance of our algorithm on a number of large-scale fixed-radius SPH benchmarks and show in experiments with a support radius ratio up to 3 that our method is also effective in multi-resolution SPH simulations.
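For context on what a fixed-radius neighborhood search computes, the sketch below implements the uniform-grid baseline discussed above: hash points into cells the size of the support radius, then run direct distance comparisons against the 27 surrounding cells. The octree structure and the throughput tuning described in the paper are not reproduced; function names and the random data are illustrative.

```python
import numpy as np
from collections import defaultdict

def build_grid(points, h):
    """Hash each point into a uniform grid with cell size h (the support radius)."""
    cells = defaultdict(list)
    keys = np.floor(points / h).astype(np.int64)
    for i, k in enumerate(map(tuple, keys)):
        cells[k].append(i)
    return cells, keys

def fixed_radius_neighbors(points, h):
    """For each point, indices of all points within distance h (including itself)."""
    cells, keys = build_grid(points, h)
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    neighbors = []
    for i, k in enumerate(map(tuple, keys)):
        candidates = [j for o in offsets
                      for j in cells.get((k[0] + o[0], k[1] + o[1], k[2] + o[2]), [])]
        cand = np.array(candidates)
        d2 = ((points[cand] - points[i]) ** 2).sum(axis=1)   # direct distance comparisons
        neighbors.append(cand[d2 <= h * h])
    return neighbors

pts = np.random.default_rng(2).uniform(size=(5000, 3))
nbrs = fixed_radius_neighbors(pts, h=0.05)
print(sum(len(n) for n in nbrs) / len(pts))   # average neighbor count
```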
We propose to augment standard grid-based fluid solvers with pointwise divergence-free velocity interpolation, thereby ensuring exact incompressibility down to the sub-cell level. Our method takes as input a discretely divergence-free velocity field generated by a staggered grid pressure projection, and first recovers a corresponding discrete vector potential. Instead of solving a costly vector Poisson problem for the potential, we develop a fast parallel sweeping strategy to find a candidate potential and apply a gauge transformation to enforce the Coulomb gauge condition and thereby make it numerically smooth. Interpolating this discrete potential generates a point-wise vector potential whose analytical curl is a pointwise incompressible velocity field. Our method further supports irregular solid geometry through the use of level set-based cut-cells and a novel Curl-Noise-inspired potential ramping procedure that simultaneously offers strictly non-penetrating velocities and incompressibility. Experimental comparisons demonstrate that the vector potential reconstruction procedure at the heart of our approach is consistently faster than prior such reconstruction schemes, especially those that solve vector Poisson problems. Moreover, in exchange for its modest extra cost, our overall Curl-Flow framework produces significantly improved particle trajectories that closely respect irregular obstacles, do not suffer from spurious sources or sinks, and yield superior particle distributions over time.
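A hedged 2D analogue of the idea above: in 2D the vector potential reduces to a scalar stream function ψ, and any smooth interpolant of ψ yields a velocity u = (∂ψ/∂y, -∂ψ/∂x) that is divergence-free wherever the interpolant is twice differentiable, not just in the discrete per-cell sense. The sketch uses an off-the-shelf bicubic spline and a synthetic field; the paper's 3D vector potential recovery, gauge transformation, and cut-cell treatment are beyond this illustration.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# A stream function sampled on a grid (synthetic example field).
n = 64
xs = np.linspace(0.0, 1.0, n)
ys = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(xs, ys, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.sin(2 * np.pi * Y)

# Smooth (bicubic) interpolant of psi; its analytic partial derivatives give
# a pointwise divergence-free velocity field.
spline = RectBivariateSpline(xs, ys, psi, kx=3, ky=3)

def velocity(px, py):
    u = spline.ev(px, py, dx=0, dy=1)     # u =  d(psi)/dy
    v = -spline.ev(px, py, dx=1, dy=0)    # v = -d(psi)/dx
    return u, v

# Advect a few particles with RK2 using the pointwise divergence-free velocity.
P = np.random.default_rng(3).uniform(0.1, 0.9, size=(100, 2))
dt = 0.01
for _ in range(100):
    u1, v1 = velocity(P[:, 0], P[:, 1])
    u2, v2 = velocity(P[:, 0] + 0.5 * dt * u1, P[:, 1] + 0.5 * dt * v1)
    P[:, 0] += dt * u2
    P[:, 1] += dt * v2
```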
This paper presents a novel approach to simulating surface tension flow within a position-based dynamics (PBD) framework. We enhance the conventional PBD fluid method in terms of its surface representation and constraint enforcement to furnish support for the simulation of interfacial phenomena driven by strong surface tension and contact dynamics. The key component of our framework is an on-the-fly local meshing algorithm to build the local geometry around each surface particle. Based on this local mesh structure, we devise novel surface constraints that can be integrated seamlessly into a PBD framework to model strong surface tension effects. We demonstrate the efficacy of our approach by simulating a multitude of surface tension flow examples exhibiting intricate interfacial dynamics of films and drops, which were all infeasible for a traditional PBD method.
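For readers unfamiliar with the PBD machinery that the new surface constraints plug into, the sketch below shows the standard position-based dynamics loop with a plain two-particle distance constraint (predict positions, iteratively project constraints, update velocities). The on-the-fly local meshing and surface-tension constraints of the paper are not part of this sketch; the pendulum setup and all parameters are illustrative.

```python
import numpy as np

def project_distance(p1, p2, w1, w2, rest_len):
    """Standard PBD projection of a two-particle distance constraint."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    if dist < 1e-12 or (w1 + w2) == 0.0:
        return p1, p2
    n = d / dist
    corr = (dist - rest_len) / (w1 + w2)
    return p1 + w1 * corr * n, p2 - w2 * corr * n

def pbd_step(x, v, inv_mass, constraints, dt=1e-2, iters=10):
    """One PBD step: predict positions, project constraints, update velocities."""
    g = np.array([0.0, -9.8, 0.0])
    v = v + dt * g * (inv_mass[:, None] > 0)      # external forces; pinned particles stay put
    x_pred = x + dt * v
    for _ in range(iters):
        for (i, j, rest) in constraints:
            x_pred[i], x_pred[j] = project_distance(
                x_pred[i], x_pred[j], inv_mass[i], inv_mass[j], rest)
    v_new = (x_pred - x) / dt
    return x_pred, v_new

# Toy setup: a two-particle pendulum with the first particle pinned.
x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v = np.zeros_like(x)
inv_mass = np.array([0.0, 1.0])
constraints = [(0, 1, 1.0)]
for _ in range(300):
    x, v = pbd_step(x, v, inv_mass, constraints)
print(np.linalg.norm(x[1] - x[0]))   # stays close to the rest length of 1.0
```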
There currently exist two main approaches to reproducing visual appearance using Machine Learning (ML): The first is training models that generalize over different instances of a problem, e.g., different images of a dataset. As one-shot approaches, these offer fast inference, but often fall short in quality. The second approach does not train models that generalize across tasks, but rather over-fits a single instance of a problem, e.g., a flash image of a material. These methods offer high quality but take a long time to train. We suggest combining both techniques end-to-end using meta-learning: We over-fit onto a single problem instance in an inner loop, while also learning how to do so efficiently in an outer loop across many exemplars. To this end, we derive the required formalism that allows applying meta-learning to a wide range of visual appearance reproduction problems: textures, Bidirectional Reflectance Distribution Functions (BRDFs), spatially-varying BRDFs (svBRDFs), illumination or the entire light transport of a scene. The effects of meta-learning parameters on several different aspects of visual appearance are analyzed in our framework, and specific guidance for different tasks is provided. Metappearance enables visual quality that is similar to over-fit approaches in only a fraction of their runtime while keeping the adaptivity of general models.
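The inner/outer-loop idea above can be sketched with a first-order meta-learning update (Reptile-style), which is simpler than, and not identical to, the paper's formulation: over-fit a tiny model to each sampled instance in the inner loop, then nudge the shared initialization toward the adapted weights so that over-fitting new instances becomes fast. The linear "appearance" model, the synthetic tasks, and all step sizes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
W_base = rng.normal(size=(8, 3))          # shared structure across instances

def make_task():
    """A toy 'instance' to over-fit: a perturbed linear map from features to RGB."""
    W_true = W_base + 0.1 * rng.normal(size=(8, 3))
    X = rng.normal(size=(64, 8))
    return X, X @ W_true

def inner_overfit(W, X, Y, steps=5, lr=0.1):
    """Inner loop: over-fit the model to a single instance by gradient descent."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ W - Y) / len(X)
        W = W - lr * grad
    return W

# Outer loop (first-order): move the shared initialization toward task-adapted weights.
W_meta = np.zeros((8, 3))
for _ in range(1000):
    X, Y = make_task()
    W_task = inner_overfit(W_meta.copy(), X, Y)
    W_meta += 0.2 * (W_task - W_meta)

# Test time: a few inner steps from the learned initialization fit a new instance well.
X_new, Y_new = make_task()
W_fast = inner_overfit(W_meta.copy(), X_new, Y_new, steps=5)
print(np.mean((X_new @ W_fast - Y_new) ** 2))
```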
We present MIPNet, a novel approach for SVBRDF mipmapping which preserves material appearance under varying view distances and lighting conditions. As in classical mipmapping, our method explicitly encodes the multiscale appearance of materials in an SVBRDF mipmap pyramid. To do so, we use a tensor-based representation, amenable to gradient-based optimization, for encoding anisotropy that is compatible with existing real-time rendering engines. Instead of relying on a simple texture patch average for each channel independently, we propose a cascaded architecture of multilayer perceptrons to approximate the material appearance using only the fixed material channels. Our neural model learns simple mipmapping filters using a differentiable rendering pipeline based on a rendering loss and is able to transfer signal from normal to anisotropic roughness. As a result, we obtain a drop-in replacement for standard material mipmapping, offering a significant improvement in appearance preservation while still boiling down to a single per-pixel mipmap texture fetch. We report extensive experiments on two distinct BRDF models.
We port Boolean set operations between 2D shapes to surfaces of any genus, with any number of open boundaries. We combine shapes bounded by sets of freely intersecting loops, consisting of geodesic lines and cubic Bézier splines lying on a surface. We compute the arrangement of shapes directly on the surface and assign integer labels to the cells of this arrangement. Unlike in the Euclidean case, some arrangements on a manifold may be inconsistent. We detect inconsistent arrangements and help the user to resolve them. Also, we extend to the manifold setting recent work on Boundary-Sampled Halfspaces, thus supporting operations more general than standard Booleans, which are well defined on inconsistent arrangements, too. Our implementation discretizes the input shapes into polylines at an arbitrary resolution, independent of the level of resolution of the underlying mesh. We resolve the arrangement inside each triangle of the mesh independently and combine the results to reconstruct both the boundaries and the interior of each cell in the arrangement. We reconstruct the control points of curves bounding cells, in order to free the result from discretization and provide an output in vector format. We support interactive usage, editing shapes consisting of up to 100k line segments on meshes of up to 1M triangles.
Boolean operations are among the most used paradigms to create and edit digital shapes. Despite being conceptually simple, the computation of mesh Booleans is notoriously challenging. The main issues come from numerical approximations that make the detection and processing of intersection points inconsistent and unreliable, exposing implementations based on floating point arithmetic to many kinds of degeneracy and failure. Numerical methods based on rational numbers or exact geometric predicates have the needed robustness guarantees, which are achieved at the cost of increased computation times that, to date, have restricted the use of robust mesh Booleans to offline applications. We introduce an algorithm for Boolean operations with robustness guarantees that is capable of operating at interactive frame rates on meshes with up to 200K triangles. We evaluate our tool thoroughly, considering not only interactive applications but also batch processing of large collections of meshes, processing of huge meshes containing millions of elements and variadic Booleans of hundreds of shapes altogether. In all these experiments, we consistently outperform prior robust floating point methods by at least one order of magnitude.
We present a novel method for blending hierarchical layouts with semantic labels. The core of our method is a hierarchical structure correspondence algorithm, which recursively finds optimal substructure correspondences, achieving a globally optimal correspondence between a pair of hierarchical layouts. This correspondence is consistent with the structures of both layouts, allowing us to define the union of the layouts' structures. The resulting compound structure helps extract intermediate layout structures, from which blended layouts can be generated via an optimization approach. The correspondence also defines a similarity measure between layouts in a hierarchically structured view. Our method provides a new way to create novel layouts. The introduced structural similarity measure regularizes the layouts in a hyperspace. We demonstrate two applications in this paper, i.e., exploratory design of novel layouts and sketch-based layout retrieval, and test them on a magazine layout dataset. The effectiveness and feasibility of these two applications are confirmed by user feedback and extensive results. The code is available at https://github.com/lyf7115/LayoutBlending.
We propose a novel technique to compose new 3D animated models, such as videogame characters, by combining pieces from existing ones. Our method works on production-ready rigged, skinned, and animated 3D models to reassemble new ones. We exploit mix-and-match operations on the skeletons to trigger the automatic creation of a new mesh, linked to the new skeleton by a set of skinning weights and complete with a set of animations. The resulting model preserves the quality of the input meshings (which can be quad-dominant and semi-regular), skinning weights (inducing believable deformation), and animations, featuring coherent movements of the new skeleton.
Our method enables content creators to reuse valuable, carefully designed assets by assembling new ready-to-use characters while preserving most of the hand-crafted subtleties of models authored by digital artists. As shown in the accompanying video, it allows for drastically cutting the time needed to obtain the final result.
We introduce a novel approach to describe mesh generation, mesh adaptation, and geometric modeling algorithms relying on changing mesh connectivity using a high-level abstraction. The main motivation is to enable easy customization and development of these algorithms via a declarative specification consisting of a set of per-element invariants, operation scheduling, and attribute transfer for each editing operation.
We demonstrate that widely used algorithms for editing surfaces and volumes can be compactly expressed with our abstraction, and their implementation within our framework is simple, automatically parallelizable on shared-memory architectures, and guaranteed to satisfy the prescribed invariants. These algorithms are readable and easy to customize for specific use cases. We introduce a software library implementing this abstraction and providing automatic shared-memory parallelization.
Meshes are an indispensable representation in many graphics applications because they provide conformal spatial discretizations. However, mesh-based operations are often slow due to unstructured memory access patterns. We propose MeshTaichi, a novel mesh compiler that provides an intuitive programming model for efficient mesh-based operations. Our programming model hides the complex indexing system from users and allows users to write mesh-based operations using reference-style neighborhood queries. Our compiler achieves its high performance by exploiting data locality. We partition input meshes and prepare the required relations by inspecting users' code at compile time. During run time, we further utilize on-chip memory (shared memory on GPU and L1 cache on CPU) to access the required attributes of mesh elements efficiently. Our compiler decouples low-level optimization options from computations, so that users can explore different localized data attributes and different memory orderings without changing their computation code. As a result, users can write concise code using our programming model to generate efficient mesh-based computations on both CPU and GPU backends. We test MeshTaichi on a variety of physically-based simulation and geometry processing applications with both triangle and tetrahedron meshes. MeshTaichi achieves a consistent speedup ranging from 1.4× to 6×, compared to state-of-the-art mesh data structures and compilers.
We present a highly efficient and robust method for free-boundary flattening of disk-like triangle meshes in a globally injective manner. We show that by restricting the solution to a low-dimensional subspace of harmonic maps, we can dramatically accelerate the process while obtaining a low-distortion result. The algorithm consists of two main steps: a linear subspace construction, and a nonlinear, nonconvex optimization for finding a low-distortion globally injective map within that subspace. The complexity of the first step dominates the algorithm's runtime and is merely that of solving a linear system. We combine recent results for computing locally-and-globally injective maps with those of harmonic maps into a conceptually simple algorithm that guarantees global injectivity. We demonstrate the great efficiency of our method over a dataset of 100 large-scale models with more than 2M triangles each. On these high-resolution meshes, our algorithm is on average 10 times faster than the state-of-the-art Efficient Bijective Parameterizations (EBP) method [Su et al. 2020], and more than 20 times faster on challenging examples (Figures 1 and 11). The speedup over [Jiang et al. 2017; Smith and Schaefer 2015] is even more dramatic.
We introduce a framework for representing face-based directional fields of an arbitrary piecewise-polynomial order. Our framework is based on a primal-dual decomposition of fields, where the exact component of a field is the gradient of a piecewise-polynomial conforming function, and the coexact component is defined as the adjoint of a dimensionally-consistent discrete curl operator. Our novel formulation sidesteps the difficult problem of constructing high-order non-conforming function spaces, and makes it simple to harness the flexibility of higher-order finite elements for directional-field processing. Our representation is structure-preserving, and draws on principles from finite-element exterior calculus. We demonstrate its benefits for applications such as Helmholtz-Hodge decomposition, smooth PolyVector fields, the vector heat method, and seamless parameterization.
Simultaneous coupling of diverse physical systems poses significant computational challenges in terms of speed, quality, and stability. Rather than treating all components with a single discretization methodology (e.g., smoothed particles, material point method, Eulerian grid, etc.) that is ill-suited to some components, our solver, ElastoMonolith, addresses three-way interactions among standard particle-in-cell-based viscous and inviscid fluids, Lagrangian mesh-based deformable bodies, and rigid bodies. While prior methods often treat some terms explicitly or in a decoupled fashion for efficiency, often at the cost of robustness or stability, we demonstrate the effectiveness of a strong coupling approach that expresses all of the relevant physics within one consistent and unified optimization problem, including fluid pressure and viscosity, elasticity of the deformables, frictional solid-solid contact, and solid-fluid interface conditions. We further develop a numerical solver to tackle this difficult optimization problem, incorporating projected Newton, an active set method, and a transformation of the inner linear system matrix to ensure symmetric positive definiteness. Our experimental evaluations show that our framework can achieve high quality coupling results that avoid artifacts such as volume loss, instability, sticky contacts, and spurious interpenetrations.
We propose a novel solid-fluid coupling method to capture the subtle hydrophobic and hydrophilic interactions between liquid, solid, and air at their multi-phase junctions. The key component of our approach is a Lagrangian model that tackles the coupling, evolution, and equilibrium of dynamic contact lines evolving on the interface between surface-tension fluid and deformable objects. This contact-line model captures an ensemble of small-scale geometric and physical processes, including dynamic waterfront tracking, local momentum transfer and force balance, and interfacial tension calculation. On top of this contact-line model, we further develop a mesh-based level set method to evolve the three-phase T-junction on a deformable solid surface. Our dynamic contact-line model, in conjunction with its monolithic coupling system, unifies the simulation of various hydrophobic and hydrophilic solid-fluid-interaction phenomena and enables a broad range of challenging small-scale elastocapillary phenomena that were previously difficult or impractical to solve, such as elastocapillary origami and self-assembly, dynamic contact angles of drops, capillary adhesion, as well as wetting and splashing on vibrating surfaces.
Artistically controlling fluids has always been a challenging task. Recently, volumetric Neural Style Transfer (NST) techniques have been used to artistically manipulate smoke simulation data with 2D images. In this work, we revisit previous volumetric NST techniques for smoke, proposing a suite of upgrades that enable stylizations that are significantly faster, simpler, more controllable and less prone to artifacts. Moreover, the energy minimization solved by previous methods is camera dependent. To avoid this, a computationally expensive iterative optimization over multiple views sampled around the original simulation is needed, which can take up to several minutes per frame. We propose a simple feed-forward neural network architecture that is able to infer view-independent stylizations that are three orders of magnitude faster than its optimization-based counterpart.
We introduce a novel differentiable hybrid traffic simulator, which simulates traffic using a hybrid model of both macroscopic and microscopic models and can be directly integrated into a neural network for traffic control and flow optimization. This is the first differentiable traffic simulator for macroscopic and hybrid models that can compute gradients for traffic states across time steps and inhomogeneous lanes. To compute the gradient flow between two types of traffic models in a hybrid framework, we present a novel intermediate conversion component that bridges the lanes in a differentiable manner as well. We also show that we can use analytical gradients to accelerate the overall process and enhance scalability. Thanks to these gradients, our simulator can provide more efficient and scalable solutions for complex learning and control problems posed in traffic engineering than other existing algorithms. Refer to https://sites.google.com/umd.edu/diff-hybrid-traffic-sim for our project.
We propose an adaptive sampling and reconstruction method for offline Monte Carlo rendering. Our method produces sampling maps constrained by a user-defined budget that minimize the expected future denoising error. Compared to other state-of-the-art methods, which produce the necessary training data on the fly by composing pre-rendered images, our method samples from analytic noise distributions instead. These distributions are compact and closely approximate the pixel value distributions stemming from Monte Carlo rendering. Our method can efficiently sample training data by leveraging only a few per-pixel statistics of the target distribution, which provides several benefits over the current state of the art. Most notably, our analytic distributions' modeling accuracy and sampling efficiency increase with sample count, essential for high-quality offline rendering. Although our distributions are approximate, our method supports joint end-to-end training of the sampling and denoising networks. Finally, we propose the addition of a global summary module to our architecture that accumulates valuable information from image regions outside of the network's receptive field. This information discourages sub-optimal decisions based on local information. Our evaluation against other state-of-the-art neural sampling methods demonstrates denoising quality and data efficiency improvements.
Among the various approaches for producing point distributions with blue noise spectrum, we argue for an optimization framework using Gaussian kernels. We show that with a wise selection of optimization parameters, this approach attains unprecedented quality, provably surpassing the current state of the art attained by the optimal transport (BNOT) approach. Further, we show that our algorithm scales smoothly and feasibly to high dimensions while maintaining the same quality, realizing unprecedented high-quality high-dimensional blue noise sets. Finally, we show an extension to adaptive sampling.
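A bare-bones version of the kernel idea above: treat the point set as free variables and descend the gradient of a sum of pairwise Gaussian repulsion terms on the unit torus. The bandwidth, step size, and iteration count below are ad-hoc placeholders; the paper's contribution is precisely the principled choice of such optimization parameters, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(5)
n, dim = 256, 2
sigma = 0.5 / np.sqrt(n)             # heuristic bandwidth ~ half the mean point spacing
X = rng.uniform(size=(n, dim))       # random start in the unit torus

def kernel_gradient(X, sigma):
    """Gradient of E = sum_{i != j} exp(-|xi - xj|^2 / (2 sigma^2)) with toroidal distances."""
    diff = X[:, None, :] - X[None, :, :]
    diff -= np.round(diff)                        # wrap to the nearest periodic image
    w = np.exp(-(diff ** 2).sum(-1) / (2 * sigma**2))
    np.fill_diagonal(w, 0.0)
    return -2.0 * (w[..., None] * diff).sum(axis=1) / sigma**2

step = 0.2 * sigma**2
for _ in range(500):
    X -= step * kernel_gradient(X, sigma)         # gradient descent = mutual repulsion
    X %= 1.0                                      # keep points on the unit torus
```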
We propose a multi-class point optimization formulation based on continuous Wasserstein barycenters. Our formulation is designed to handle hundreds to thousands of optimization objectives and comes with a practical optimization scheme. We demonstrate the effectiveness of our framework on various sampling applications like stippling, object placement, and Monte-Carlo integration. We derive a multi-class error bound for perceptual rendering error, which can be minimized using our optimization. We provide source code at https://github.com/iribis/filtered-sliced-optimal-transport.
Unbiased rendering algorithms such as path tracing produce accurate images given a huge number of samples, but in practice, the techniques often leave visually distracting artifacts (i.e., noise) in their rendered images due to a limited time budget. A favored approach for mitigating the noise problem is applying learning-based denoisers to unbiased but noisy rendered images and suppressing the noise while preserving image details. However, such denoising techniques typically introduce a systematic error, i.e., the denoising bias, which does not decline as rapidly when increasing the sample size, unlike the other type of error, i.e., variance. This can lead to slow numerical convergence of the denoising techniques. We propose a new combination framework built upon the James-Stein (JS) estimator, which merges a pair of unbiased and biased rendered images, e.g., a path-traced image and its denoised result. Unlike existing post-correction techniques for image denoising, our framework helps an input denoiser have lower errors than its unbiased input without relying on accurate estimation of per-pixel denoising errors. We demonstrate that our framework based on the well-established JS theories allows us to improve the error reduction rates of state-of-the-art learning-based denoisers more robustly than recent post-denoisers.
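For intuition, the sketch below applies the textbook positive-part James-Stein estimator per image block: shrink the unbiased (noisy) estimate toward the biased (denoised) one by a factor driven by the ratio of the noise variance to the observed squared difference. The paper's framework is considerably more involved (it does not assume the variance is known, nor independence of the two inputs); this is only the classical building block, run on synthetic data.

```python
import numpy as np

def james_stein_combine(noisy, denoised, var, block=8):
    """Positive-part James-Stein shrinkage of `noisy` toward `denoised`,
    applied independently per (block x block) region. `var` is the per-pixel
    variance of the unbiased estimate (assumed known here)."""
    out = np.empty_like(noisy)
    h, w = noisy.shape[:2]
    for y in range(0, h, block):
        for x in range(0, w, block):
            ny = noisy[y:y + block, x:x + block]
            dy = denoised[y:y + block, x:x + block]
            vy = var[y:y + block, x:x + block]
            d = ny.size                                   # dimension of the block
            diff = ny - dy
            shrink = max(0.0, 1.0 - (d - 2) * vy.mean() / max((diff ** 2).sum(), 1e-12))
            out[y:y + block, x:x + block] = dy + shrink * diff
    return out

# Toy example with a known ground truth (synthetic data).
rng = np.random.default_rng(6)
gt = np.tile(np.linspace(0, 1, 64), (64, 1))
var = np.full_like(gt, 0.05 ** 2)
noisy = gt + rng.normal(scale=0.05, size=gt.shape)        # unbiased, high variance
denoised = gt + 0.02                                      # biased, low variance
combined = james_stein_combine(noisy, denoised, var)
print(np.mean((noisy - gt) ** 2), np.mean((combined - gt) ** 2))  # combined error is lower here
```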
Achieving a pure-compression stress state is considered central to the form-finding of shell structures. However, the pure-compression assumption restricts the geometry of the structure's plan in that any free boundary edges cannot bulge outward. Allowing both tension and compression is essential so that overhanging leaves can stretch out toward the sky. When performing tension-compression mixed form-finding, a problem with boundary condition (BC) compatibility arises. Since the form-finding equation is hyperbolic, boundary information propagates along the asymptotic lines of the stress function. If conflicting BC data is prescribed at either end of an asymptotic line, the problem becomes ill-posed. This requires a user of a form-finding method to know the solution in advance. By contrast, pure-tension or pure-compression problems are elliptic and always give solutions under any BCs sufficient to restrain rigid motion. To solve the form-finding problem for tension-compression mixed shells, we focus on the Airy stress function, which describes the stress field in a shell. Rather than taking the stress function as given, we instead treat both the stress function and the shell as unknowns. This doubles the solution variables, turning the problem into one that has an infinity of different solutions. By enforcing equilibrium in the shell interior and prescribing the correct matching pairs of BCs to both the stress function and the shell, a stress function and shell can be simultaneously found such that equilibrium is satisfied everywhere in the shell interior and thus automatically has compatible BCs by construction. The problem of a potentially over-constrained form-finding is thus avoided by expanding the solution space and creating an under-determined problem. By varying inputs and repeatedly searching for stress function-shell pairs that fall within the solution space, a user can interactively explore the possible forms of tension-compression mixed shells under the given plan of the shell.
Multiple sketch datasets have been proposed to understand how people draw 3D objects. However, such datasets are often of small scale and cover a small set of objects or categories. In addition, these datasets contain freehand sketches mostly from expert users, making it difficult to compare the drawings by expert and novice users, while such comparisons are critical in informing more effective sketch-based interfaces for either user group. These observations motivate us to analyze how differently people with and without adequate drawing skills sketch 3D objects. We invited 70 novice users and 38 expert users to sketch 136 3D objects, which were presented as 362 images rendered from multiple views. This leads to a new dataset of 3,620 freehand multi-view sketches, which are registered with their corresponding 3D objects under certain views. Our dataset is an order of magnitude larger than the existing datasets. We analyze the collected data at three levels, i.e., sketch-level, stroke-level, and pixel-level, under both spatial and temporal characteristics, and within and across groups of creators. We found that the drawings by professionals and novices show significant differences at the stroke level, both intrinsically and extrinsically. We demonstrate the usefulness of our dataset in two applications: (i) freehand-style sketch synthesis, and (ii) posing it as a potential benchmark for sketch-based 3D reconstruction. Our dataset and code are available at https://chufengxiao.github.io/DifferSketching/.
In this paper, we propose a novel optimization-based method to estimate the reflectance properties of a near planar surface from a single input image. Specifically, we perform test-time optimization by directly updating the parameters of a neural network to minimize the test error. Since single image SVBRDF estimation is a highly ill-posed problem, such an optimization is prone to overfitting. Our main contribution is to address this problem by introducing a training mechanism that takes the test-time optimization into account. Specifically, we train our network by minimizing the training loss after one or more gradient updates with the test loss. By training the network in this manner, we ensure that the network does not overfit to the input image during the test-time optimization process. Additionally, we propose a learned reflectance loss to augment the typically used rendering loss during the test-time optimization. We do so by using an auxiliary network that estimates pseudo ground truth reflectance parameters and train it in combination with the main network. Our approach is able to converge with a small number of iterations of the test-time optimization and produces better results compared to the state-of-the-art methods.
The median filter is a simple yet powerful noise reduction technique that is extensively applied in image, signal, and speech processing. It can effectively remove impulsive noise while preserving the content of the image by taking the median of neighboring pixels; thus, it has various applications, such as restoration of a damaged image and facial beautification. The median filter is typically implemented in one of two major approaches: the histogram-based method, which requires O(1) computation time per pixel with respect to the kernel radius r, and the sorting-based method, which requires approximately O(r²) computation time per pixel but has a light constant factor. These are used differently depending on the kernel radius and the number of bits in the image. However, the computation time is still slow, particularly when the kernel radius is in the mid to large range.
This paper introduces a novel and efficient median filter with constant O(1) complexity with respect to kernel size, using the wavelet matrix data structure, which has previously been applied to query-based searches on one-dimensional data. We extend the original wavelet matrix to two-dimensional data for application to computer graphics problems. The objective of this study is to achieve high-speed median filter computation in parallel computing environments with many threads (i.e., GPUs). Our GPU implementation is an order of magnitude faster than the histogram method for 8-bit images. Unlike traditional histogram methods, which suffer from significant computational overhead, the proposed method can handle images with high pixel depth (e.g., 16- and 32-bit high dynamic range images). When the kernel radius is greater than 12 for 8-bit images, the proposed method outperforms the other median filter computation methods.
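For reference, the sketch below implements the simple sliding-histogram (Huang-style) median filter for 8-bit images, the kind of histogram-based baseline discussed above. It costs O(r) histogram updates plus a 256-bin scan per pixel, so it is not the constant-time variant, and the paper's wavelet-matrix method and GPU implementation are not reproduced.

```python
import numpy as np

def median_filter_8bit(img, r):
    """Huang-style sliding-histogram median filter for a single-channel uint8 image.
    The window is (2r+1) x (2r+1); borders are handled by edge clamping."""
    h, w = img.shape
    pad = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    half = ((2 * r + 1) ** 2) // 2 + 1            # rank of the median within the window
    for y in range(h):
        hist = np.zeros(256, dtype=np.int32)
        # Initialize the histogram for the first window of this row.
        np.add.at(hist, pad[y:y + 2 * r + 1, 0:2 * r + 1].ravel(), 1)
        out[y, 0] = _median_from_hist(hist, half)
        for x in range(1, w):
            # Slide right: remove the leftmost column, add the new rightmost column.
            np.add.at(hist, pad[y:y + 2 * r + 1, x - 1], -1)
            np.add.at(hist, pad[y:y + 2 * r + 1, x + 2 * r], 1)
            out[y, x] = _median_from_hist(hist, half)
    return out

def _median_from_hist(hist, half):
    # Scan the 256 bins until the cumulative count reaches the median rank.
    return int(np.searchsorted(np.cumsum(hist), half))

img = np.random.default_rng(7).integers(0, 256, size=(128, 128), dtype=np.uint8)
print(median_filter_8bit(img, r=3).shape)
```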
While tremendous advances in visual and auditory realism have been made for virtual and augmented reality (VR/AR), introducing a plausible sense of physicality into the virtual world remains challenging. Closing the gap between real-world physicality and immersive virtual experience requires a closed interaction loop: applying user-exerted physical forces to the virtual environment and generating haptic sensations back to the users. However, existing VR/AR solutions either completely ignore the force inputs from the users or rely on obtrusive sensing devices that compromise user experience.
By identifying users' muscle activation patterns while engaging in VR/AR, we design a learning-based neural interface for natural and intuitive force inputs. Specifically, we show that lightweight electromyography sensors, resting non-invasively on users' forearm skin, inform and establish a robust understanding of their complex hand activities. Fuelled by a neural-network-based model, our interface can decode finger-wise forces in real-time with 3.3% mean error, and generalize to new users with little calibration. Through an interactive psychophysical study, we show that human perception of virtual objects' physical properties, such as stiffness, can be significantly enhanced by our interface. We further demonstrate that our interface enables ubiquitous control via finger tapping. Ultimately, we envision our findings to push forward research towards more realistic physicality in future VR/AR.
We propose Neural Brushstroke Engine, the first method to apply deep generative models to learn a distribution of interactive drawing tools. Our conditional GAN model learns the latent space of drawing styles from a small set (about 200) of unlabeled images in different media. Once trained, a single model can texturize stroke patches drawn by the artist, emulating a diverse collection of brush styles in the latent space. In order to enable interactive painting on a canvas of arbitrary size, we design a painting engine able to support real-time seamless patch-based generation, while allowing artists direct control of stroke shape, color and thickness. We show that the latent space learned by our model generalizes to unseen drawing and more experimental styles (e.g. beads) by embedding real styles into the latent space. We explore other applications of the continuous latent space, such as optimizing brushes to enable painting in the style of an existing artwork, automatic line drawing stylization, brush interpolation, and even natural language search over a continuous space of drawing tools. Our prototype received positive feedback from a small group of digital artists.
Existing 3D-aware facial generation methods face a dilemma in quality versus editability: they either generate editable results in low resolution, or high-quality ones with no editing flexibility. In this work, we propose a new approach that brings the best of both worlds together. Our system consists of three major components: (1) a 3D-semantics-aware generative model that produces view-consistent, disentangled face images and semantic masks; (2) a hybrid GAN inversion approach that initializes the latent codes from the semantic and texture encoder, and further optimizes them for faithful reconstruction; and (3) a canonical editor that enables efficient manipulation of semantic masks in canonical view and produces high-quality editing results. Our approach supports many applications, e.g., free-view face drawing, editing, and style control. Both quantitative and qualitative results show that our method reaches the state of the art in terms of photorealism, faithfulness, and efficiency.
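The hybrid inversion step can be sketched as follows, assuming stand-in generator G and encoder E modules (hypothetical, not the paper's networks): the encoder provides an initial latent code, which is then refined by gradient descent on a reconstruction loss.

```python
import torch

def invert(G, E, target, steps=200, lr=0.01):
    z = E(target).detach().clone().requires_grad_(True)   # encoder initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.l1_loss(G(z), target)  # add perceptual/semantic terms in practice
        loss.backward()
        opt.step()
    return z

# Toy usage with stand-in modules (illustration only):
G = torch.nn.Linear(64, 3 * 32 * 32)                      # pretend generator: latent -> image
E = torch.nn.Linear(3 * 32 * 32, 64)                      # pretend encoder: image -> latent
target = torch.rand(1, 3 * 32 * 32)
z_star = invert(G, E, target)
```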
We tackle the problem of estimating correspondences from a general marker, such as a movie poster, to an image that captures such a marker. Conventionally, this problem is addressed by fitting a homography model based on sparse feature matching. However, such methods can only handle plane-like markers, and sparse features do not sufficiently exploit appearance information. In this paper, we propose NeuralMarker, a novel framework that trains a neural network to estimate dense marker correspondences under various challenging conditions, such as marker deformation and harsh lighting. Deep learning has demonstrated excellent performance in correspondence learning when provided with sufficient training data; however, annotating pixel-wise dense correspondences for marker-image pairs is prohibitively expensive. We observe that the challenges of marker correspondence estimation come from two separate aspects: geometry variation and appearance variation. We therefore design two components in NeuralMarker that address these two challenges. First, we create a synthetic dataset, FlyingMarkers, containing marker-image pairs with ground-truth dense correspondences. Training with FlyingMarkers encourages the neural network to capture various marker motions. Second, we propose the novel Symmetric Epipolar Distance (SED) loss, which enables learning dense correspondences from posed images. Learning with the SED loss and cross-lighting posed images collected via Structure-from-Motion (SfM), NeuralMarker is remarkably robust in harsh lighting environments and avoids synthetic-image bias. In addition, we propose a novel marker-correspondence evaluation method that circumvents the need for annotations on real marker-image pairs, and we create a new benchmark. We show that NeuralMarker significantly outperforms previous methods and enables new interesting applications, including Augmented Reality (AR) and video editing.
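The standard symmetric epipolar distance on which such a loss can be built is shown below; the paper's exact SED loss formulation and weighting may differ (an assumption), but this conveys the geometric quantity involved.

```python
import numpy as np

def symmetric_epipolar_distance(x1, x2, F):
    """Symmetric epipolar distance for correspondences x1 <-> x2 (N x 2 pixel
    coordinates) under a fundamental matrix F (3 x 3)."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])                   # homogeneous coordinates in image 1
    p2 = np.hstack([x2, ones])                   # homogeneous coordinates in image 2
    l2 = p1 @ F.T                                # epipolar lines in image 2 (F x1)
    l1 = p2 @ F                                  # epipolar lines in image 1 (F^T x2)
    algebraic = np.sum(p2 * l2, axis=1) ** 2     # (x2^T F x1)^2
    return algebraic * (1.0 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
                        + 1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2))

# Example with random correspondences and a random rank-2 fundamental matrix:
rng = np.random.default_rng(0)
F = rng.standard_normal((3, 3))
U, s, Vt = np.linalg.svd(F)
F = U @ np.diag([s[0], s[1], 0.0]) @ Vt          # enforce rank 2
x1 = rng.uniform(0, 640, (20, 2))
x2 = rng.uniform(0, 640, (20, 2))
errors = symmetric_epipolar_distance(x1, x2, F)
```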
We propose a simple and practical approach for incorporating the effects of muscle inertia, which has been ignored by previous musculoskeletal simulators in both graphics and biomechanics. We approximate the inertia of the muscle by assuming that muscle mass is distributed along the centerline of the muscle. We express the motion of the musculotendons in terms of the motion of the skeletal joints using a chain of Jacobians, so that at the top level, only the reduced degrees of freedom of the skeleton are used to completely drive both bones and musculotendons. Our approach can handle all commonly used musculotendon path types, including those with multiple path points and wrapping surfaces. For muscle paths involving wrapping surfaces, we use neural networks to model the Jacobians, trained using existing wrapping surface libraries, which allows us to effectively handle the Jacobian discontinuities that occur when musculotendon paths collide with wrapping surfaces. We demonstrate support for higher-order time integrators, complex joints, inverse dynamics, Hill-type muscle models, and differentiability. In the limit, as the muscle mass is reduced to zero, our approach gracefully degrades to traditional simulators without support for muscle inertia. Finally, it is possible to mix and match inertial and non-inertial musculotendons, depending on the application.
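The way lumped muscle mass enters the reduced-coordinate dynamics can be sketched as follows: given the Jacobian J_i of each centerline mass point with respect to the skeletal DOFs q, a point mass m_i contributes m_i * J_i^T J_i to the generalized mass matrix. Velocity-product terms and the construction of the Jacobians themselves (via path-point chains or the neural-network approximation for wrapping surfaces) are outside this sketch, and all sizes below are placeholders.

```python
import numpy as np

def add_muscle_inertia(M_skel, jacobians, masses):
    """Add the generalized inertia of lumped muscle mass points to a skeletal mass matrix."""
    M = M_skel.copy()
    for J, m in zip(jacobians, masses):          # J: 3 x ndof Jacobian, m: point mass (kg)
        M += m * (J.T @ J)                       # from kinetic energy 0.5 * m * |J q_dot|^2
    return M

ndof = 10
M_skel = np.eye(ndof)                            # placeholder skeletal mass matrix
jacobians = [np.random.rand(3, ndof) for _ in range(5)]
masses = [0.02] * 5                              # 20 g per lumped point (illustrative)
M_total = add_muscle_inertia(M_skel, jacobians, masses)
```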
Precise modeling of the hand's internal musculoskeletal anatomy has largely been limited to individual poses and has not been extended to continuous volumetric motion of the hand anatomy across the hand's entire range of motion. This is for a good reason: hand anatomy and its motion are extremely complex and cannot be predicted merely from the anatomy in a single pose. We give a method to simulate the volumetric shape of the hand's musculoskeletal organs in any pose within the hand's range of motion, producing external hand shapes and internal organ shapes that match ground-truth optical scans and medical images (MRI) in multiple scanned poses. We achieve this by combining MRI images in multiple hand poses with FEM multibody nonlinear elastoplastic simulation. Our system models bones, muscles, tendons, joint ligaments, and fat as separate volumetric organs that mechanically interact through contacts and attachments, and whose shapes match the medical images (MRI) in the MRI-scanned hand poses. The match to MRI is achieved by incorporating pose-space deformation and plastic strains into the simulation. We show how to do this in a non-intrusive manner that retains all the simulation benefits, namely the ability to prescribe realistic material properties, generalize to arbitrary poses, preserve volume, and obey contacts and attachments. We use our method to produce volumetric renders of the internal anatomy of the human hand in motion, and to compute and render highly realistic hand surface shapes. We evaluate our method by comparing it to optical scans, and demonstrate that it substantially decreases the error, both qualitatively and quantitatively, compared to previous work. We test our method on five complex hand sequences, generated either with keyframe animation or with performance animation using modern hand-tracking techniques.
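Pose-space deformation, which the method incorporates, can be illustrated in its classic RBF form: corrective quantities measured in a few example poses are interpolated over pose space. The snippet below uses made-up data and a Gaussian kernel purely as illustration; the paper applies the idea inside an FEM simulation together with plastic strains.

```python
import numpy as np

def rbf_system(example_poses, sigma=1.0):
    """Precompute the inverse Gaussian-RBF kernel matrix for the example poses (K x P)."""
    d2 = ((example_poses[:, None, :] - example_poses[None, :, :]) ** 2).sum(-1)
    return np.linalg.inv(np.exp(-d2 / (2 * sigma ** 2)))

def psd_correction(pose, example_poses, corrections, Phi_inv, sigma=1.0):
    """Interpolate per-example corrections (K x D) at a query pose (P,)."""
    phi = np.exp(-((example_poses - pose) ** 2).sum(-1) / (2 * sigma ** 2))
    return (Phi_inv @ phi) @ corrections          # RBF interpolation weights times corrections

K, P, D = 4, 6, 30                                # examples, pose dims, correction dims (illustrative)
poses = np.random.rand(K, P)
corrections = np.random.rand(K, D)                # e.g., plastic-strain or displacement parameters
Phi_inv = rbf_system(poses)
delta = psd_correction(np.random.rand(P), poses, corrections, Phi_inv)
```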
Objects with different shapes can dissolve in significantly different ways inside a solution. Predicting different shapes' dissolution dynamics is an important problem especially in pharmaceutics. More important and challenging, however, is controlling the dissolution via shape, i.e., designing shapes that lead to a desired release behavior of materials in a solvent over a specific time. Here, we tackle this challenge by introducing a computational inverse design pipeline. We begin by introducing a simple, physically-inspired differentiable forward model of dissolution. We then formulate our inverse design as a PDE-constrained topology optimization that has access to analytical derivatives obtained via sensitivity analysis. Furthermore, we incorporate fabricability terms in the optimization objective that enable physically realizing our designs. We thoroughly analyze our approach on a diverse set of examples via both simulation and fabrication.
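A toy stand-in for this pipeline (not the paper's physics): per-voxel solid fractions parameterized by a sigmoid dissolve at voxel-specific rates, and autodiff gradients drive the design toward a target release curve. The real method uses a PDE-constrained formulation with sensitivity-analysis derivatives and fabricability terms; everything below is an illustrative assumption.

```python
import torch

torch.manual_seed(0)
n_vox, n_steps = 128, 50
rates = torch.rand(n_vox)                        # fixed per-voxel dissolution rates (toy "exposure")
t = torch.linspace(0.0, 5.0, n_steps)
target = torch.linspace(0.0, 1.0, n_steps)       # desired normalized release curve (linear release)

theta = torch.zeros(n_vox, requires_grad=True)   # design variables
opt = torch.optim.Adam([theta], lr=0.05)
for it in range(500):
    s = torch.sigmoid(theta)                     # per-voxel solid fraction (the "shape")
    # Each voxel releases s_i * (1 - exp(-rate_i * t)); sum over voxels and normalize by mass.
    released = (s[None, :] * (1 - torch.exp(-rates[None, :] * t[:, None]))).sum(dim=1)
    released = released / (s.sum() + 1e-8)
    loss = ((released - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```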
The isotropic As-Rigid-As-Possible (ARAP) energy has been popular for shape editing, mesh parametrisation, and soft-body simulation for almost two decades. However, a formulation using Cauchy-Green (CG) invariants has remained elusive, due to a rotation-polluted trace term that cannot be directly expressed using these invariants. We show how this incongruent trace term can be understood via an implicit relationship to the CG invariants. Our analysis reveals this relationship to be a polynomial whose roots equate to the trace term, and whose derivatives also give rise to closed-form expressions for the Hessian, guaranteeing positive semi-definiteness for a fast and concise Newton-type implicit time integration. A consequence of this analysis is a novel analytical formulation for computing rotations and singular values of deformation-gradient tensors without explicit or numerical factorization, which is significant: it yields up to a 3.5× speedup and accelerates energy function evaluation, reducing solver time. We validate our energy formulation through experiments and comparisons, demonstrating that the resulting eigendecomposition using the CG invariants is equivalent to existing ARAP formulations. We thus reveal the isotropic ARAP energy to be a member of the "Cauchy-Green club", meaning that it can indeed be defined using CG invariants and that the closed-form expressions of the resulting Hessian are shared with other energies written in their terms.
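For reference, with the polar decomposition F = RS and the right Cauchy-Green tensor C = F^T F, the isotropic ARAP energy and a standard identity for the problematic trace term can be written as below. The quartic is one way to express an implicit, polynomial relationship between tr(S) and the CG invariants of the kind the abstract alludes to; whether it matches the paper's exact formulation is an assumption.

```latex
% ARAP energy in terms of the deformation gradient F = RS, with C = F^T F:
\Psi_{\text{ARAP}}(F) = \mu \,\|F - R\|_F^2
                      = \mu \left( \operatorname{tr}(C) - 2\,\operatorname{tr}(S) + 3 \right)

% With s = \operatorname{tr}(S) = \sigma_1 + \sigma_2 + \sigma_3 and the CG invariants
% I_C = \operatorname{tr}(C),\quad
% II_C = \sigma_1^2\sigma_2^2 + \sigma_2^2\sigma_3^2 + \sigma_3^2\sigma_1^2,\quad
% III_C = \det(C),
% the trace term is a root of a quartic written purely in these invariants:
s^4 - 2\,I_C\, s^2 - 8\sqrt{III_C}\; s + \left( I_C^2 - 4\,II_C \right) = 0
```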
We present a novel implicit representation, the neural halfspace representation (NH-Rep), to convert manifold B-Rep solids to implicit representations. NH-Rep is a Boolean tree built on a set of implicit functions represented by the neural network, and the composite Boolean function is capable of representing solid geometry while preserving sharp features. We propose an efficient algorithm to extract the Boolean tree from a manifold B-Rep solid and devise a neural-network-based optimization approach to compute the implicit functions. We demonstrate the high quality offered by our conversion algorithm on ten thousand manifold B-Rep CAD models containing various curved patches, including NURBS, and the superiority of our learning approach over other representative implicit conversion algorithms in terms of surface reconstruction, sharp feature preservation, signed distance field approximation, and robustness to various surface geometries, as well as a set of applications supported by NH-Rep.
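The Boolean composition of implicit functions can be sketched with min/max operators, assuming the convention f(x) < 0 inside; the analytic spheres below are stand-ins for the learned implicit functions, and the paper's actual composition and sharp-feature handling may differ in detail.

```python
import numpy as np

def sphere(c, r):
    return lambda x: np.linalg.norm(x - c, axis=-1) - r   # signed distance to a sphere

def union(*fs):
    return lambda x: np.minimum.reduce([f(x) for f in fs])

def intersection(*fs):
    return lambda x: np.maximum.reduce([f(x) for f in fs])

def difference(f, g):                                      # f minus g
    return lambda x: np.maximum(f(x), -g(x))

# A tiny Boolean tree: (A union B) minus C
A = sphere(np.array([0.0, 0.0, 0.0]), 1.0)
B = sphere(np.array([1.0, 0.0, 0.0]), 1.0)
C = sphere(np.array([0.5, 0.0, 0.0]), 0.4)
solid = difference(union(A, B), C)

pts = np.random.uniform(-2, 2, size=(5, 3))
print(solid(pts))                                          # negative values are inside the solid
```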
Multi-axis motion introduces more degrees of freedom into the 3D printing process, enabling different fabrication objectives by accumulating material layers upon curved layers. An open challenge is how to effectively generate curved layers that satisfy multiple objectives simultaneously. This paper presents a general slicing framework for achieving multiple fabrication objectives, including support-free printing, strength reinforcement, and surface quality. These objectives are formulated as local printing directions that vary within the volume of a solid and are achieved by computing a rotation-driven deformation of the input model. The height field of the deformed model is mapped to a scalar field on its original shape, the isosurfaces of which give the curved layers for multi-axis 3D printing. The deformation can be effectively optimized with the help of quaternion fields to achieve the fabrication objectives. The effectiveness of our method has been verified on a variety of models.
Assembly planning is the core of automating product assembly, maintenance, and recycling in modern industrial manufacturing. Despite its importance and long history of research, planning for mechanical assemblies given the final assembled state remains a challenging problem, due to the complexity of dealing with arbitrary 3D shapes and the highly constrained motions required for real-world assemblies. In this work, we propose a novel method to efficiently plan physically plausible assembly motions and sequences for real-world assemblies. Our method leverages the assembly-by-disassembly principle and physics-based simulation to efficiently explore a reduced search space. To evaluate the generality of our method, we define a large-scale dataset consisting of thousands of physically valid industrial assemblies that require a variety of assembly motions. Our experiments on this new benchmark demonstrate that we achieve a state-of-the-art success rate and the highest computational efficiency compared to other baseline algorithms. Our method also generalizes to rotational assemblies (e.g., screws and puzzles) and solves 80-part assemblies within several minutes.
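The assembly-by-disassembly principle can be sketched as a greedy search over removable parts, with the physics-based feasibility check abstracted into a placeholder; the actual method explores sequences and motions with a simulator, and every name and motion string below is illustrative.

```python
def can_remove(part, remaining):
    """Placeholder: return a removal motion for `part` given the other parts,
    or None if no feasible motion exists. A real check would query a physics simulator."""
    return "straight-line +z" if part != "base" or len(remaining) == 1 else None

def plan_disassembly(parts):
    parts = set(parts)
    sequence = []
    while len(parts) > 1:
        for part in sorted(parts):                      # greedily try each remaining part
            motion = can_remove(part, parts - {part})
            if motion is not None:
                sequence.append((part, motion))
                parts.remove(part)
                break
        else:
            return None                                 # no removable part: planning failed
    sequence.append((parts.pop(), "done"))
    return sequence

assembly_seq = plan_disassembly(["base", "gear", "shaft", "cover"])
print(list(reversed(assembly_seq)))                     # assembly order = reversed disassembly
```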
Concept sketches are ubiquitous in industrial design, as they allow designers to quickly depict imaginary 3D objects. To construct their sketches with accurate perspective, designers rely on longstanding drawing techniques, including the use of auxiliary construction lines to identify midpoints of perspective planes, to align points vertically and horizontally, and to project planar curves from one perspective plane to another. We present a method to synthesize such construction lines from CAD sequences. Importantly, our method balances the presence of construction lines with overall clutter, such that the resulting sketch is both well-constructed and readable, as professional designers are trained to do. In addition to generating sketches that are visually similar to real ones, we apply our method to synthesize a large quantity of paired sketches and normal maps, and show that the resulting dataset can be used to train a neural network to infer normals from concept sketches.