Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. Existing T2V methods, however, directly optimize the control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometric constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing combinations of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is https://intchous.github.io/T2V-NPR.
We introduce an algorithm for sketch vectorization that achieves state-of-the-art accuracy and is capable of handling complex sketches. We approach sketch vectorization as a surface extraction task from an unsigned distance field, implemented using a two-stage neural network and a dual contouring domain post-processing algorithm. The first stage extracts unsigned distance fields from an input raster image. The second stage consists of an improved neural dual contouring network that is more robust to noisy input and more sensitive to line geometry. To address the issue of under-sampling inherent in grid-based surface extraction approaches, we explicitly predict undersampling and keypoint maps. These are used in our post-processing algorithm to resolve sharp features and multi-way junctions. The keypoint and undersampling maps are naturally controllable, which we demonstrate in an interactive topology refinement interface. Our proposed approach produces far more accurate vectorizations on complex input than previous approaches, while maintaining efficient running times.
Point containment queries for regions bounded by watertight geometric surfaces, i.e., closed and without self-intersections, can be evaluated straightforwardly with a number of well-studied algorithms. When this assumption on domain geometry is not met, such methods are either unusable, or prone to misclassifications that can lead to cascading errors in downstream applications. More robust point classification schemes based on generalized winding numbers have been proposed, as they are indifferent to these imperfections. However, existing algorithms are limited to point clouds and collections of linear elements. We extend this methodology to encompass more general curved shapes with an algorithm that evaluates the winding number scalar field over unstructured collections of rational parametric curves. In particular, we evaluate the winding number for each curve independently, making the derived containment query robust to how the curves are arranged. We ensure geometric fidelity in our queries by treating each curve as equivalent to an adaptively constructed polyline that provably has the same generalized winding number at the point of interest. Our algorithm is numerically stable for points that are arbitrarily close to the model, and explicitly treats points that are coincident with curves. We demonstrate the improvements in computational performance granted by this method over conventional techniques as well as the robustness induced by its application.
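To make the polyline equivalence concrete: in 2D, the generalized winding number of a polyline at a query point is the sum of the signed angles subtended by its segments, divided by 2*pi. The sketch below is a minimal illustration of that baseline quantity (it is not the paper's adaptive curve-to-polyline construction, and the polyline used here is purely an assumption for the example):

```python
import numpy as np

def winding_number(query, polyline):
    """Generalized winding number of a 2D polyline at `query`.

    Sums the signed angle subtended by each segment, divided by 2*pi.
    For a closed, non-self-intersecting CCW loop this is ~1 inside, ~0 outside.
    """
    w = 0.0
    for a, b in zip(polyline[:-1], polyline[1:]):
        u, v = a - query, b - query
        # signed angle between u and v via atan2(cross, dot)
        w += np.arctan2(u[0] * v[1] - u[1] * v[0], np.dot(u, v))
    return w / (2.0 * np.pi)

# Example: unit square, traversed counter-clockwise and closed.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
print(winding_number(np.array([0.5, 0.5]), square))  # ~1.0 (inside)
print(winding_number(np.array([2.0, 0.5]), square))  # ~0.0 (outside)
```

For imperfect geometry (open gaps, overlaps), the segments simply contribute their fractional angles, which is what makes winding-number-based containment degrade gracefully rather than fail outright.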
Recent advancements in 2D diffusion models allow appearance generation on untextured raw meshes. These methods create RGB textures by distilling a 2D diffusion model, but the resulting textures often contain unwanted baked-in shading effects that lead to unrealistic rendering in downstream applications. Generating Physically Based Rendering (PBR) materials instead of just RGB textures would be a promising solution. However, directly distilling the PBR material parameters from 2D diffusion models still suffers from incorrect material decomposition, such as baked-in shading effects in albedo. We introduce DreamMat, an innovative approach to resolve the aforementioned problem and generate high-quality PBR materials from text descriptions. We find that the main reason for the incorrect material distillation is that large-scale 2D diffusion models are only trained to generate final shading colors, resulting in insufficient constraints on material decomposition during distillation. To tackle this problem, we first finetune a new light-aware 2D diffusion model to condition on a given lighting environment and generate the shading results under this specific lighting condition. Then, by applying the same environment lights in the material distillation, DreamMat can generate high-quality PBR materials that are not only consistent with the given geometry but also free from any baked-in shading effects in albedo. Extensive experiments demonstrate that the materials produced by our method exhibit greater visual appeal to users and achieve significantly superior rendering quality compared to baseline methods, making them preferable for downstream tasks such as game and film production.
We present a novel approach for single-image mesh texturing, which employs a diffusion model with judicious conditioning to seamlessly transfer an object's texture from a single RGB image to a given 3D mesh object. We do not assume that the two objects belong to the same category, and even if they do, there can be significant discrepancies in their geometry and part proportions. Our method aims to rectify the discrepancies by conditioning a pre-trained Stable Diffusion generator with edges describing the mesh through ControlNet, and features extracted from the input image using IP-Adapter to generate textures that respect the underlying geometry of the mesh and the input texture without any optimization or training. We also introduce Image Inversion, a novel technique to quickly personalize the diffusion model for a single concept using a single image, for cases where the pre-trained IP-Adapter falls short in capturing all the details from the input image faithfully. Experimental results demonstrate the efficiency and effectiveness of our edge-aware single-image mesh texturing approach, coined EASI-Tex, in preserving the details of the input texture on diverse 3D objects, while respecting their geometry.
Code: https://github.com/sairajk/easi-tex
Numerous scientific and engineering applications require solutions to boundary value problems (BVPs) involving elliptic partial differential equations, such as the Laplace or Poisson equations, on geometrically intricate domains. We develop a Monte Carlo method for solving such BVPs with arbitrary first-order linear boundary conditions---Dirichlet, Neumann, and Robin. Our method directly generalizes the walk on stars (WoSt) algorithm, which previously tackled only the first two types of boundary conditions, with a few simple modifications. Unlike conventional numerical methods, WoSt does not need finite element meshing or global solves. Similar to Monte Carlo rendering, it instead computes pointwise solution estimates by simulating random walks along star-shaped regions inside the BVP domain, using efficient ray-intersection and distance queries. To ensure WoSt produces bounded-variance estimates in the presence of Robin boundary conditions, we show that it is sufficient to modify how WoSt selects the size of these star-shaped regions. Our generalized WoSt algorithm reduces estimation error by orders of magnitude relative to alternative grid-free methods such as the walk on boundary algorithm. We also develop bidirectional and boundary value caching strategies to further reduce estimation error. Our algorithm is trivial to parallelize, scales sublinearly with increasing geometric detail, and enables progressive and view-dependent evaluation.
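For orientation, the sketch below shows the classic walk-on-spheres recursion for a 2D Laplace problem with pure Dirichlet data, the special case that walk on stars generalizes by replacing spheres with star-shaped regions and supporting Neumann and Robin data; `distance_to_boundary` and `boundary_value` are hypothetical callables standing in for the geometric queries mentioned above:

```python
import numpy as np

def walk_on_spheres(x, distance_to_boundary, boundary_value,
                    eps=1e-3, max_steps=1000, rng=np.random.default_rng()):
    """One-sample walk-on-spheres estimate of u(x) for Laplace's equation in 2D
    with Dirichlet boundary data (the special case that WoSt generalizes).

    distance_to_boundary(p) -> distance from p to the domain boundary
    boundary_value(p)       -> Dirichlet value g near the closest boundary point
    """
    p = np.asarray(x, dtype=float)
    for _ in range(max_steps):
        r = distance_to_boundary(p)
        if r < eps:                                    # close enough: read off boundary data
            return boundary_value(p)
        theta = rng.uniform(0.0, 2.0 * np.pi)          # uniform point on the largest
        p = p + r * np.array([np.cos(theta), np.sin(theta)])  # empty sphere around p
    return boundary_value(p)                           # fallback if the walk did not terminate

def estimate(x, *args, n=256, **kwargs):
    """Average n independent walks into a pointwise solution estimate."""
    return np.mean([walk_on_spheres(x, *args, **kwargs) for _ in range(n)])
```

As the abstract notes, the Robin case additionally requires limiting the size of the sampled regions so that the resulting estimates keep bounded variance.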
This paper presents a practical and general approach for computing barycentric coordinates through stochastic sampling. Our key insight is a reformulation of the kernel integral defining barycentric coordinates into a weighted least-squares minimization that enables Monte Carlo integration without sacrificing linear precision. Our method can thus compute barycentric coordinates directly at the points of interest, both inside and outside the cage, using just proximity queries to the cage such as closest points and ray intersections. As a result, we can evaluate barycentric coordinates for a large variety of cage representations (from quadrangulated surface meshes to parametric curves) seamlessly, bypassing any volumetric discretization or custom solves. To address the archetypal noise induced by sample-based estimates, we also introduce a denoising scheme tailored to barycentric coordinates. We demonstrate the efficiency and flexibility of our formulation by implementing a stochastic generation of harmonic coordinates, mean-value coordinates, and positive mean-value coordinates.
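To make the linear-precision requirement concrete: barycentric coordinates must sum to one and reproduce the query point. One simple way to impose this on noisy per-vertex estimates is a minimal-norm equality-constrained projection, shown below; this is only an illustrative stand-in, not the paper's weighted least-squares formulation:

```python
import numpy as np

def project_to_linear_precision(weights, cage_vertices, query):
    """Project noisy coordinate estimates onto the linear-precision constraints
        sum_i w_i = 1   and   sum_i w_i * v_i = query
    using the minimal-norm correction (equality-constrained least squares)."""
    n = len(weights)
    A = np.vstack([np.ones((1, n)), cage_vertices.T])      # (1+d) x n constraint matrix
    b = np.concatenate([[1.0], query])
    residual = b - A @ weights
    correction = A.T @ np.linalg.solve(A @ A.T, residual)   # minimal-norm fix
    return weights + correction

# Toy example: a triangle cage in 2D with perturbed exact coordinates.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p = np.array([0.2, 0.3])
noisy = np.array([0.5, 0.2, 0.3]) + 0.05 * np.random.randn(3)
w = project_to_linear_precision(noisy, V, p)
print(w.sum(), w @ V)   # -> 1.0 and p, up to floating point
```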
We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into a hybrid neural field: a tri-plane containing the geometry features, followed by a Multi-Layer Perceptron (MLP) that decodes the signed distance values. A variational auto-encoder is employed to compress the tri-planes into a latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation.
To expand a scene during generation, one needs only to append empty blocks to overlap with the current scene and extrapolate existing latent tri-planes to populate new blocks. The extrapolation is done by conditioning the generation process with the feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.
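As a rough sketch of the per-block hybrid neural field described above, a tri-plane stores learnable 2D feature maps on the xy, xz, and yz planes; features are bilinearly sampled at each query point and decoded by a small MLP into a signed distance. The resolution, channel count, and MLP below are illustrative assumptions, not BlockFusion's actual configuration:

```python
import torch
import torch.nn.functional as F

class TriplaneSDF(torch.nn.Module):
    def __init__(self, res=64, channels=16):
        super().__init__()
        # three learnable feature planes: xy, xz, yz
        self.planes = torch.nn.Parameter(0.01 * torch.randn(3, channels, res, res))
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 * channels, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 1))

    def forward(self, pts):                      # pts: (N, 3) in [-1, 1]^3
        coords = torch.stack([pts[:, [0, 1]],    # project onto xy
                              pts[:, [0, 2]],    # xz
                              pts[:, [1, 2]]])   # yz  -> (3, N, 2)
        grid = coords.unsqueeze(2)               # (3, N, 1, 2) for grid_sample
        feats = F.grid_sample(self.planes, grid, align_corners=True)  # (3, C, N, 1)
        feats = feats.squeeze(-1).permute(2, 0, 1).reshape(pts.shape[0], -1)  # (N, 3C)
        return self.mlp(feats)                   # (N, 1) signed distance

sdf = TriplaneSDF()
print(sdf(torch.rand(8, 3) * 2 - 1).shape)       # torch.Size([8, 1])
```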
Existing text-based 3D generation methods generate attractive results but lack detailed geometry control. Sketches, known for their conciseness and expressiveness, have contributed to intuitive 3D modeling but are confined to producing texture-less mesh models within predefined categories. Integrating sketch and text simultaneously for 3D generation promises enhanced control over geometry and appearance but faces challenges from 2D-to-3D translation ambiguity and multi-modal condition integration. Moreover, further editing of 3D models in arbitrary views will give users more freedom to customize their models. However, it is difficult to achieve high generation quality, preserve unedited regions, and manage proper interactions between shape components. To solve the above issues, we propose a text-driven 3D content generation and editing method, SketchDream, which supports NeRF generation from given hand-drawn sketches and achieves free-view sketch-based local editing. To tackle the 2D-to-3D ambiguity challenge, we introduce a sketch-based multi-view image generation diffusion model, which leverages depth guidance to establish spatial correspondence. A 3D ControlNet with a 3D attention module is utilized to control multi-view images and ensure their 3D consistency. To support local editing, we further propose a coarse-to-fine editing approach: the coarse stage analyzes component interactions and provides 3D masks to label edited regions, while the fine stage generates realistic results with refined details by local enhancement. Extensive experiments validate that our method generates higher-quality results compared with a combination of 2D ControlNet and image-to-3D generation techniques and achieves detailed control compared with existing diffusion-based 3D editing approaches.
Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry priors and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are available at https://diffposetalk.github.io.
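The classifier-free guidance mentioned above follows the standard recipe: evaluate the denoiser with and without the speech-and-style conditioning and extrapolate between the two predictions by a guidance scale. A generic sketch (the denoiser signature is an assumption):

```python
import torch

def cfg_predict(denoiser, x_t, t, cond, guidance_scale=2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by `guidance_scale`."""
    eps_cond = denoiser(x_t, t, cond)      # conditioned on speech + style embedding
    eps_uncond = denoiser(x_t, t, None)    # conditioning dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With a guidance scale of 1 this reduces to the plain conditional prediction; larger scales trade diversity for adherence to the conditioning.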
We present S3, a novel approach to generating expressive, animator-centric 3D head and eye animation of characters in conversation. Given speech audio, a Directorial script and a cinematographic 3D scene as input, we automatically output the animated 3D rotation of each character's head and eyes. S3 distills animation and psycho-linguistic insights into a novel modular framework for conversational gaze capturing: audio-driven rhythmic head motion; narrative script-driven emblematic head and eye gestures; and gaze trajectories computed from audio-driven gaze focus/aversion and 3D visual scene salience. Our evaluation is four-fold: we quantitatively validate our algorithm against ground truth data and baseline alternatives; we conduct a perceptual study showing our results to compare favourably to prior art; we present examples of animator control and critique of S3 output; and present a large number of compelling and varied animations of conversational gaze.
The efficiency of inverse optimization in physically based differentiable rendering heavily depends on the variance of Monte Carlo estimation. Despite recent advancements emphasizing the necessity of tailored differential sampling strategies, general-purpose approaches remain unexplored.
In this paper, we investigate the interplay between local sampling decisions and the estimation of light path derivatives. Considering that modern differentiable rendering algorithms share the same path for estimating differential radiance and ordinary radiance, we demonstrate that conventional guiding approaches, conditioned solely on the last vertex, cannot attain the optimal sampling density for these derivative estimates. Instead, a mixture of different sampling distributions is required, where the weights are conditioned on all the previously sampled vertices in the path. To put our theory into practice, we implement a conditional mixture path guiding method that explicitly computes optimal weights on the fly. Furthermore, we show how to perform positivization to eliminate sign variance and how to extend the method to scenes with millions of parameters.
To the best of our knowledge, this is the first generic framework for applying path guiding to differentiable rendering. Extensive experiments demonstrate that our method achieves nearly one order of magnitude improvements over state-of-the-art methods in terms of variance reduction in gradient estimation and errors of inverse optimization. The implementation of our proposed method is available at https://github.com/mollnn/conditional-mixture.
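To illustrate the mixture idea in isolation, the sketch below draws samples from a weighted mixture of proposal densities and forms the corresponding unbiased estimator; in the paper the mixture weights would additionally be conditioned on the previously sampled path vertices, which is omitted in this generic stand-in:

```python
import numpy as np

def sample_mixture(weights, samplers, pdfs, rng=np.random.default_rng()):
    """Draw one sample from the mixture sum_k w_k p_k and return (x, pdf(x)).

    weights  : non-negative mixture weights summing to 1
    samplers : list of callables, samplers[k]() -> sample from p_k
    pdfs     : list of callables, pdfs[k](x) -> density of p_k at x
    """
    k = rng.choice(len(weights), p=weights)
    x = samplers[k]()
    pdf = sum(w * p(x) for w, p in zip(weights, pdfs))   # full mixture density
    return x, pdf

def mc_estimate(f, weights, samplers, pdfs, n=1024):
    """Unbiased Monte Carlo estimate of the integral of f using the mixture."""
    total = 0.0
    for _ in range(n):
        x, pdf = sample_mixture(weights, samplers, pdfs)
        total += f(x) / pdf
    return total / n
```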
Gradient-based optimization is now ubiquitous across graphics, but unfortunately cannot be applied to problems with undefined or zero gradients. To circumvent this issue, the loss function can be manually replaced by a "surrogate" that has similar minima but is differentiable. Our proposed framework, ZeroGrads, automates this process by learning a neural approximation of the objective function, which in turn can be used to differentiate through arbitrary black-box graphics pipelines. We train the surrogate on an actively smoothed version of the objective and encourage locality, focusing the surrogate's capacity on what matters at the current training episode. The fitting is performed online, alongside the parameter optimization, and self-supervised, without pre-computed data or pre-trained models. As sampling the objective is expensive (it requires a full rendering or simulator run), we devise an efficient sampling scheme that allows for tractable run-times and competitive performance with little overhead. We demonstrate optimizing diverse non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling, or optimal control in physics-driven animation. In contrast to other derivative-free algorithms, our approach scales well to higher dimensions, which we demonstrate on problems with up to 35k interlinked variables.
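A stripped-down version of the surrogate idea looks roughly like the following: sample the black-box objective around the current parameters, fit a small neural proxy, and descend along the proxy's gradient. The tiny MLP, sampling scale, and inner-loop counts are illustrative assumptions, and the active smoothing and locality scheduling described above are omitted:

```python
import torch

def surrogate_step(theta, black_box_loss, surrogate, surr_opt,
                   n_samples=32, sigma=0.05, lr=1e-2):
    """One optimization step through a learned surrogate of a non-differentiable loss.

    theta          : current parameters, torch tensor of shape (d,)
    black_box_loss : callable (d,) tensor -> float, not differentiable
    surrogate      : small MLP mapping (d,) -> (1,), trained online, e.g.
                     torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(),
                                         torch.nn.Linear(64, 1))
    surr_opt       : optimizer over surrogate.parameters()
    """
    # 1) Fit the surrogate on samples drawn around the current parameters.
    xs = theta.detach() + sigma * torch.randn(n_samples, theta.numel())
    ys = torch.tensor([black_box_loss(x) for x in xs]).unsqueeze(1)
    for _ in range(20):
        surr_opt.zero_grad()
        fit_loss = torch.nn.functional.mse_loss(surrogate(xs), ys)
        fit_loss.backward()
        surr_opt.step()
    # 2) Descend along the surrogate's gradient instead of the (unavailable) true one.
    theta = theta.detach().requires_grad_(True)
    surrogate(theta.unsqueeze(0)).sum().backward()
    return (theta - lr * theta.grad).detach()
```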
Learning from multi-view images using neural implicit signed distance functions shows impressive performance on 3D reconstruction of opaque objects. However, existing methods struggle to reconstruct accurate geometry when applied to translucent objects due to the non-negligible bias in their rendering function. To address these inaccuracies, we reparameterize the density function of the neural radiance field by incorporating an estimated constant extinction coefficient. This modification forms the basis of our framework, which is geared towards high-fidelity surface reconstruction and novel-view synthesis of translucent objects. Our framework contains two stages. In the reconstruction stage, we introduce a novel weight function to achieve accurate surface geometry reconstruction. Following the recovery of geometry, the second stage learns the distinct scattering properties of the participating media to enhance rendering. A comprehensive dataset, comprising both synthetic and real translucent objects, has been built for conducting extensive experiments. Experiments reveal that our method outperforms existing approaches in terms of reconstruction and novel-view synthesis.
Despite recent advances in reconstructing organic models with neural signed distance functions (SDFs), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each approximately developable even around feature lines. Our approach, named NeurCADRecon, is self-supervised, and its loss includes a developability term that encourages the Gaussian curvature toward zero while ensuring fidelity to the input points (see the teaser figure). Noticing that the Gaussian curvature is non-zero at tip points, we introduce a double-trough curve to tolerate the existence of these tip points. Furthermore, we develop a dynamic sampling strategy to deal with situations where the given points are incomplete or too sparse. Since our resulting neural SDFs can clearly manifest sharp feature points/lines, one can easily extract the feature-aligned triangle mesh from the SDF and then decompose it into smooth surface patches, greatly reducing the difficulty of recovering the parametric CAD design. A comprehensive comparison with existing state-of-the-art methods shows the significant advantage of our approach in reconstructing faithful CAD shapes.
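For context, a developability penalty of this kind needs the Gaussian curvature of the implicit surface at sample points. One standard closed form for an implicit function f uses its gradient g and Hessian H through a bordered determinant, K = -det([[H, g], [g^T, 0]]) / |g|^4; the sketch below evaluates it with automatic differentiation and is an illustration only, not necessarily the paper's implementation:

```python
import torch

def gaussian_curvature(f, p):
    """Gaussian curvature of the level set of f at point p (shape (3,)),
    via the standard implicit-surface formula K = -det([[H, g], [g^T, 0]]) / |g|^4."""
    p = p.clone().requires_grad_(True)
    g = torch.autograd.grad(f(p), p, create_graph=True)[0]              # gradient
    H = torch.stack([torch.autograd.grad(g[i], p, retain_graph=True)[0]
                     for i in range(3)])                                 # Hessian
    M = torch.zeros(4, 4)
    M[:3, :3], M[:3, 3], M[3, :3] = H, g, g                              # bordered Hessian
    return -torch.det(M) / g.norm() ** 4

# Unit-sphere implicit: K should be 1 everywhere on the surface.
sphere = lambda x: x.dot(x) - 1.0
print(gaussian_curvature(sphere, torch.tensor([0.6, 0.8, 0.0])))         # ~1.0
```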
Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
This paper introduces a novel physically-based vortex fluid model for films, aimed at accurately simulating cascading vortical structures on deforming thin films. Central to our approach is a novel mechanism that decomposes the film's tangential velocity into circulation and dilatation components. These components are then evolved using a hybrid particle-mesh method, enabling the effective reconstruction of three-dimensional tangential velocities and seamlessly integrating surfactant and thickness dynamics into a unified framework. By coupling the tangential flow with its normal component and a surface-tension model, our method is particularly adept at depicting complex interactions between in-plane vortices and out-of-plane physical phenomena, such as gravity, surfactant dynamics, and solid boundaries, leading to highly realistic simulations of complex thin-film dynamics with an unprecedented level of vortical detail and physical realism.
We introduce a multi-material non-manifold mesh-based surface tracking algorithm that converts self-intersections into topological changes. Our algorithm generalizes prior work on manifold surface tracking with topological changes: it preserves surface features like mesh-based methods, and it robustly handles topological changes like level set methods. Our method also offers improved efficiency and robustness over the state of the art. We demonstrate the effectiveness of the approach on a range of examples, including complex soap film simulations with thousands of interacting bubbles, and boolean unions of non-manifold meshes consisting of millions of triangles.
Despite its visual appeal, the simulation of separated multiphase flows (i.e., streams of fluids separated by interfaces) faces numerous challenges in accurately reproducing complex behaviors such as gurgling, wetting, or bubbling. These difficulties are especially pronounced for high Reynolds numbers and large density variations between fluids, most likely explaining why they have received comparatively little attention in Computer Graphics compared to single- or two-phase flows. In this paper, we present a full LBM solver for multifluid simulation. We derive a conservative phase field model with which the spatial presence of each fluid or phase is encoded to allow for the simulation of miscible, immiscible and even partially-miscible fluids, while the temporal evolution of the phases is performed using a D3Q7 lattice-Boltzmann discretization. The velocity field, handled through the recent high-order moment-encoded LBM (HOME-LBM) framework to minimize its memory footprint, is simulated via a velocity-based distribution stored on a D3Q27 or D3Q19 discretization to offer accuracy and stability for large density ratios even in turbulent scenarios, while coupling with the phases through pressure, viscosity, and interfacial forces is achieved by leveraging the diffuse encoding of interfaces. The resulting solver addresses a number of limitations of kinetic methods in both computational fluid dynamics and computer graphics: it offers a fast, accurate, and low-memory fluid solver enabling efficient turbulent multiphase simulations free of the typical oscillatory pressure behavior near boundaries. We present several numerical benchmarks, examples and comparisons of multiphase flows to demonstrate our solver's visual complexity, accuracy, and realism.
This paper introduces a novel Induce-on-Boundary (IoB) solver designed to address the magnetostatic governing equations of ferrofluids. The IoB solver is based on a single-layer potential and utilizes only the surface point cloud of the object, offering a lightweight, fast, and accurate solution for calculating magnetic fields. Compared to existing methods, it eliminates the need for complex linear system solvers and maintains minimal computational complexity. Moreover, it can be seamlessly integrated into conventional fluid simulators without compromising boundary conditions. Through extensive theoretical analysis and experiments, we validate both the convergence and scalability of the IoB solver, achieving state-of-the-art performance. Additionally, a straightforward coupling approach is proposed and executed to showcase the solver's effectiveness when integrated into a grid-based fluid simulation pipeline, allowing for realistic simulations of representative ferrofluid instabilities.
Given a sequence of poses of a body, we study the motion resulting when the body is immersed in a (possibly) moving, incompressible medium. With the poses given, say, by an animator, the governing second-order ordinary differential equations are those of a rigid body with time-dependent inertia acted upon by various forces. Some of these forces, like lift and drag, depend on the motion of the body in the surrounding medium. Additionally, the inertia must encode the effect of the medium through its added mass. We derive the corresponding dynamics equations which generalize the standard rigid body dynamics equations. All forces are based on local computations using only physical parameters such as mass density. Notably, we approximate the effect of the medium on the body through local computations, avoiding any global simulation of the medium. Consequently, the system of equations we must integrate in time is only 6 dimensional (rotation and translation). Our proposed algorithm displays linear complexity and captures intricate natural phenomena that depend on body-fluid interactions.
Erosion simulation is a common approach used for generating and authoring mountainous terrains. While water is considered the primary erosion factor, its simulation fails to capture steep slopes near the ridges. In these low-drainage areas, erosion is often approximated with slope-reducing erosion, which yields unrealistically uniform slopes. However, geomorphology research has observed that another process dominates low-drainage areas: erosion by debris flow, which is a mixture of mud and rocks triggered by strong climatic events. We propose a new method to capture the interactions between debris flow and fluvial erosion thanks to a new mathematical formulation for debris flow erosion derived from geomorphology and a unified GPU algorithm for erosion and deposition. In particular, we observe that sediment and debris deposition tend to intersect river paths, which motivates the design of a new, approximate flow routing algorithm on the GPU to estimate the water path out of these newly formed depressions. We demonstrate that debris flow carves distinct patterns in the form of erosive scars on steep slopes and cones of deposited debris competing with fluvial erosion downstream.
We propose a new structure called a higher-order shell, which is composed of a set of triangular prisms. Each triangular prism is enveloped by three Bézier triangles (top, middle, and bottom) and three side surfaces, each of which is trimmed from a bilinear surface. Moreover, we define a continuous vector field to smoothly and bijectively transfer attributes between two surfaces inside the shell. Since the higher-order shell has several hard construction constraints, we apply an interior-point strategy to robustly and automatically construct a higher-order shell for an input mesh. Specifically, the strategy starts from a valid linear shell with a small thickness. Then, the shell is optimized until the specified thickness is reached, where explicit checks ensure that the constraints are always satisfied. We extensively test our method on more than 8300 models, demonstrating its robustness and performance. Compared to state-of-the-art methods, our bijective projection is smoother, and the space between the shell and input mesh is more uniform.
Directional fields, including unit vector, line, and cross fields, are essential tools in the geometry processing toolkit. The topology of directional fields is characterized by their singularities. While singularities play an important role in downstream applications such as meshing, existing methods for computing directional fields either require them to be specified in advance, ignore them altogether, or treat them as zeros of a relaxed field. While fields are ill-defined at their singularities, the graphs of directional fields with singularities are well-defined surfaces in a circle bundle. By lifting optimization of fields to optimization over their graphs, we can exploit a natural convex relaxation to a minimal section problem over the space of currents in the bundle. This relaxation treats singularities as first-class citizens, expressing the relationship between fields and singularities as an explicit boundary condition. As curvature frustrates finite element discretization of the bundle, we devise a hybrid spectral method for representing and optimizing minimal sections. Our method supports field optimization on both flat and curved domains and enables more precise control over singularity placement.
We introduce a conceptually simple and efficient algorithm for seamless parametrization, a key element in constructing quad layouts and texture charts on surfaces. More specifically, we consider the construction of parametrizations with prescribed holonomy signatures, i.e., a set of angles at singularities and rotations along homology loops; preserving these is essential for constructing parametrizations that follow an input field, as well as for user control of the parametrization structure. Our algorithm performs exceptionally well on a large dataset based on Thingi10k [Zhou and Jacobson 2016] (16,156 meshes), as well as on the challenging smaller dataset of [Myles et al. 2014], converging, on average, in 9 iterations. Although the algorithm lacks a formal mathematical guarantee, the presented empirical evidence and the connections between convex optimization and closely related algorithms suggest that such a guarantee may be established for this algorithm in the future.
Novel view synthesis has seen major advances in recent years, with 3D Gaussian splatting offering an excellent level of visual quality, fast training, and real-time rendering. However, the resources needed for training and rendering inevitably limit the size of the captured scenes that can be represented with good visual quality. We introduce a hierarchy of 3D Gaussians that preserves visual quality for very large scenes, while offering an efficient Level-of-Detail (LOD) solution for rendering distant content, with effective level selection and smooth transitions between levels. We introduce a divide-and-conquer approach that allows us to train very large scenes in independent chunks. We consolidate the chunks into a hierarchy that can be optimized to further improve the visual quality of Gaussians merged into intermediate nodes. Very large captures typically have sparse coverage of the scene, presenting many challenges to the original 3D Gaussian splatting training method; we adapt and regularize training to account for these issues. We present a complete solution that enables real-time rendering of very large scenes and can adapt to available resources thanks to our LOD method. We show results for captured scenes with up to tens of thousands of images acquired with a simple and affordable rig, covering trajectories of up to several kilometers and lasting up to one hour.
Recent techniques for real-time view synthesis have rapidly advanced in fidelity and speed, and modern methods are capable of rendering near-photorealistic scenes at interactive frame rates. At the same time, a tension has arisen between explicit scene representations amenable to rasterization and neural fields built on ray marching, with state-of-the-art instances of the latter surpassing the former in quality while being prohibitively expensive for real-time applications. We introduce SMERF, a view synthesis approach that achieves state-of-the-art accuracy among real-time methods on large scenes with footprints up to 300 m² at a volumetric resolution of 3.5 mm³. Our method is built upon two primary contributions: a hierarchical model partitioning scheme, which increases model capacity while constraining compute and memory consumption, and a distillation training strategy that simultaneously yields high fidelity and internal consistency. Our method enables full six degrees of freedom navigation in a web browser and renders in real-time on commodity smartphones and laptops. Extensive experiments show that our method exceeds the state-of-the-art in real-time novel view synthesis by 0.78 dB on standard benchmarks and 1.78 dB on large scenes, renders frames three orders of magnitude faster than state-of-the-art radiance field models, and achieves real-time performance across a wide variety of commodity devices, including smartphones. We encourage readers to explore these models interactively at our project website: https://smerf-3d.github.io.
Gaussian Splatting has emerged as a prominent model for constructing 3D representations from images across diverse domains. However, the efficiency of the 3D Gaussian Splatting rendering pipeline relies on several simplifications. Notably, reducing each 3D Gaussian to a 2D splat with a single view-space depth introduces popping and blending artifacts during view rotation. Addressing this issue requires accurate per-pixel depth computation, yet a full per-pixel sort proves excessively costly compared to a global sort operation. In this paper, we present a novel hierarchical rasterization approach that systematically re-sorts and culls splats with minimal processing overhead. Our software rasterizer effectively eliminates popping artifacts and view inconsistencies, as demonstrated through both quantitative and qualitative measurements. Simultaneously, our method mitigates the potential for view-dependent effects to be "cheated" through popping, ensuring a more authentic representation. Despite eliminating this cheating, our approach achieves comparable quantitative results on test images, while increasing the consistency of novel view synthesis in motion. Due to its design, our hierarchical approach is only 4% slower on average than the original Gaussian Splatting. Notably, enforcing consistency enables a reduction in the number of Gaussians by approximately half with nearly identical quality and view-consistency. Consequently, rendering performance is nearly doubled, making our approach 1.6x faster than the original Gaussian Splatting, with a 50% reduction in memory requirements. Our renderer is publicly available at https://github.com/r4dl/StopThePop.
Foveated rendering takes advantage of the reduced spatial sensitivity in peripheral vision to greatly reduce rendering cost without noticeable spatial quality degradation. Due to its benefits, it has emerged as a key enabler for real-time high-quality virtual and augmented realities. Interestingly though, a large body of work advocates that a key role of peripheral vision may be motion detection, yet foveated rendering lowers the image quality in these regions, which may impact our ability to detect and quantify motion. The problem is critical for immersive simulations where the ability to detect and quantify movement drives actions and decisions. In this work, we diverge from the contemporary approach towards the goal of foveated graphics, and demonstrate that a loss of high-frequency spatial details in the periphery inhibits motion perception, leading to underestimating motion cues such as velocity. Furthermore, inspired by an interesting visual illusion, we design a perceptually motivated real-time technique that synthesizes controlled spatio-temporal motion energy to offset the loss in motion perception. Finally, we perform user experiments demonstrating our method's effectiveness in recovering motion cues without introducing objectionable quality degradation.
Virtual reality has ushered in a revolutionary era of immersive content perception. However, a persistent challenge in dynamic environments is the occurrence of cybersickness arising from a conflict between visual and vestibular cues. Prior techniques have demonstrated that limiting illusory self-motion, so-called vection, by blurring the peripheral part of images, introducing tunnel vision, or altering the camera path can effectively reduce the problem. Unfortunately, these methods often alter the user's experience with visible changes to the content. In this paper, we propose a new technique for reducing vection and combating cybersickness by subtly lowering the screen-space speed of objects in the user's peripheral vision. The method is motivated by our hypothesis that small modifications to the objects' velocity in the periphery and geometrical distortions in the peripheral vision can remain unnoticeable yet lead to reduced vection. This paper describes the experiments supporting this hypothesis and derives its limits. Furthermore, we present a method that exploits these findings by introducing subtle, screen-space geometrical distortions to animation frames to counteract the motion contributing to vection. We implement the method as a real-time post-processing step that can be integrated into existing rendering frameworks. The final validation of the technique and comparison to an alternative approach confirms its effectiveness in reducing cybersickness.
Display power consumption is an emerging concern for untethered devices. This goes double for extended reality (XR) displays, spanning augmented and virtual reality, which target high refresh rates and high resolutions while conforming to an ergonomically light form factor. A number of image mapping techniques have been proposed to extend battery usage. However, there is currently no comprehensive quantitative understanding of how the power savings provided by these methods compare to their impact on visual quality. We set out to answer this question.
To this end, we present a perceptual evaluation of algorithms (PEA) for power optimization in XR displays (PODs). Consolidating a portfolio of six power-saving display mapping approaches, we begin by performing a large-scale perceptual study to understand the impact of each method on perceived quality in the wild. This results in a unified quality score for each technique, scaled in just-objectionable-difference (JOD) units. In parallel, each technique is analyzed using hardware-accurate power models.
The resulting JOD-to-Milliwatt transfer function provides a first-of-its-kind look into tradeoffs offered by display mapping techniques, and can be directly employed to make architectural decisions for power budgets on XR displays. Finally, we leverage our study data and power models to address important display power applications like the choice of display primary, power implications of eye tracking, and more.
Holographic near-eye displays are a promising technology to solve long-standing challenges in virtual and augmented reality display systems. Over the last few years, many different computer-generated holography (CGH) algorithms have been proposed that are supervised by different types of target content, such as 2.5D RGB-depth maps, 3D focal stacks, and 4D light fields. It is unclear, however, what the perceptual implications are of the choice of algorithm and target content type. In this work, we build a perceptual testbed of a full-color, high-quality holographic near-eye display. Under natural viewing conditions, we examine the effects of various CGH supervision formats and conduct user studies to assess their perceptual impacts on 3D realism. Our results indicate that CGH algorithms designed for specific viewpoints exhibit noticeable deficiencies in achieving 3D realism. In contrast, holograms incorporating parallax cues consistently outperform other formats across different viewing conditions, including the center of the eyebox. This finding is particularly interesting and suggests that the inclusion of parallax cues in CGH rendering plays a crucial role in enhancing the overall quality of the holographic experience. This work represents an initial stride towards delivering a perceptually realistic 3D experience with holographic near-eye displays.
Navigating topological transitions in cellular mechanical systems is a significant challenge for existing simulation methods. While abstract models lack predictive capabilities at the cellular level, explicit network representations struggle with topology changes, and per-cell representations are computationally too demanding for large-scale simulations. To address these challenges, we propose a novel cell-centered approach based on differentiable Voronoi diagrams. Representing each cell with a Voronoi site, our method defines shape and topology of the interface network implicitly. In this way, we substantially reduce the number of problem variables, eliminate the need for explicit contact handling, and ensure continuous geometry changes during topological transitions. Closed-form derivatives of network positions facilitate simulation with Newton-type methods for a wide range of per-cell energies. Finally, we extend our differentiable Voronoi diagrams to enable coupling with arbitrary rigid and deformable boundaries. We apply our approach to a diverse set of examples, highlighting splitting and merging of cells as well as neighborhood changes. We illustrate applications to inverse problems by matching soap foam simulations to real-world images. Comparative analysis with explicit cell models reveals that our method achieves qualitatively comparable results at significantly faster computation times.
Wildfires are a complex physical phenomenon that involves the combustion of a variety of flammable materials ranging from fallen leaves and dried twigs to decomposing organic material and living flora. All these materials can potentially act as fuel with different properties that determine the progress and severity of a wildfire. In this paper, we propose a novel approach for simulating the dynamic interaction between the varying components of a wildfire, including processes of convection, combustion and heat transfer between vegetation, soil and atmosphere. We propose a novel representation of vegetation that includes detailed branch geometry, fuel moisture, and distribution of grass, fine fuel, and duff. Furthermore, we model the ignition, generation, and transport of fire by firebrands and embers. This allows simulating and rendering virtual 3D wildfires that realistically capture key aspects of the process, such as progressions from ground to crown fires, the impact of embers carried by wind, and the effects of fire barriers and other human intervention methods. We evaluate our approach through numerous experiments and based on comparisons to real-world wildfire data.
Cyclones are large-scale phenomena that result from complex heat and water transfer processes in the atmosphere, as well as from the interaction of multiple hydrometeors, i.e., water and ice particles. When cyclones make landfall, they are considered natural disasters and spawn dread and awe alike. We propose a physically-based approach to describe the 3D development of cyclones in a visually convincing and physically plausible manner. Our approach allows us to capture large-scale heat and water continuity, turbulent microphysical dynamics of hydrometeors, and mesoscale cyclonic processes within the planetary boundary layer. Modeling these processes enables us to simulate multiple hurricane and tornado phenomena. We evaluate our simulations quantitatively by comparing to real data from storm soundings and observations of hurricane landfall from climatology research. Additionally, qualitative comparisons to previous methods are performed to validate the different parts of our scheme. In summary, our model simulates cyclogenesis in a comprehensive way that allows us to interactively render animations of some of the most complex weather events.
Apparel's significant role in human appearance underscores the importance of garment digitalization for digital human creation. Recent advances in 3D content creation are pivotal to this goal; nonetheless, garment generation from text guidance is still nascent. We introduce a text-driven 3D garment generation framework, DressCode, which aims to democratize design for novices and offers immense potential in fashion design, virtual try-on, and digital human creation. We first introduce SewingGPT, a GPT-based architecture integrating cross-attention with text-conditioned embedding to generate sewing patterns with text guidance. We then tailor a pre-trained Stable Diffusion model to generate tile-based Physically-based Rendering (PBR) textures for the garments. By leveraging a large language model, our framework generates CG-friendly garments through natural language interaction. It also facilitates pattern completion and texture editing, streamlining the design process through user-friendly interaction. This framework fosters innovation by allowing creators to freely experiment with designs and incorporate unique elements into their work. With comprehensive evaluations and comparisons with other state-of-the-art methods, our method showcases superior quality and alignment with input prompts. User studies further validate our high-quality rendering results, highlighting its practical utility and potential in production settings. Our project page is https://IHe-KaiI.github.io/DressCode/.
Simulating high-resolution cloth poses computational challenges in real-time applications. In the gaming industry, the proxy mesh technique offers an alternative: it simulates a simplified low-resolution cloth geometry, the proxy mesh, whose dynamics drive the detailed high-resolution geometry, the visual mesh, through Linear Blended Skinning (LBS). However, generating a suitable proxy mesh with appropriate skinning weights from a given visual mesh is non-trivial, often requiring skilled artists several days of fine-tuning. This paper presents an automatic pipeline to convert an ill-conditioned high-resolution visual mesh into a single-layer low-poly proxy mesh. Given that the input visual mesh may not be simulation-ready, our approach then simulates the proxy mesh based on specific use scenarios and optimizes the skinning weights, relying on differentiable skinning with several well-designed loss functions to ensure the skinned visual mesh appears plausible in the final simulation. We have tested our method on various challenging cloth models, demonstrating its robustness and effectiveness.
The rapid advancement of digital fashion and generative AI technology calls for an automated approach to transform digital sewing patterns into well-fitted garments on human avatars. When given a sewing pattern with its associated sewing relationships, the primary challenge is to establish an initial arrangement of sewing pieces that is free from folding and intersections. This setup enables a physics-based simulator to seamlessly stitch them into a digital garment, avoiding undesirable local minima. To achieve this, we harness AI classification, heuristics, and numerical optimization. This has led to the development of an innovative hybrid system that minimizes the need for user intervention in the initialization of garment pieces. The seeding process of our system involves the training of a classification network for selecting seed pieces, followed by solving an optimization problem to determine their positions and shapes. Subsequently, an iterative selection-arrangement procedure automates the selection of pattern pieces and employs a phased initialization approach to mitigate local minima associated with numerical optimization. Our experiments confirm the reliability, efficiency, and scalability of our system when handling intricate garments with multiple layers and numerous pieces. According to our findings, 68 percent of garments can be initialized with zero user intervention, while the remaining garments can be easily corrected through user operations.
The generation of global illumination in real time has been a long-standing challenge in the graphics community, particularly in dynamic scenes with complex illumination. Recent neural rendering techniques have shown great promise by utilizing neural networks to represent the illumination of scenes and then decoding the final radiance. However, incorporating object parameters into the representation may limit their effectiveness in handling fully dynamic scenes. This work presents a neural rendering approach, dubbed LightFormer, that can generate realistic global illumination for fully dynamic scenes, including dynamic lighting, materials, cameras, and animated objects, in real time. Inspired by classic many-lights methods, the proposed approach focuses on the neural representation of light sources in the scene rather than the entire scene, leading to better overall generalizability. The neural prediction is achieved by leveraging the virtual point lights and shading cues for each light. Specifically, two stages are explored. In the light encoding stage, each light generates a set of virtual point lights in the scene, which are then encoded into an implicit neural light representation, along with screen-space shading cues like visibility. In the light gathering stage, a pixel-light attention mechanism composites all light representations for each shading point. Given the geometry and material representation, in tandem with the composed light representations of all lights, a lightweight neural network predicts the final radiance. Experimental results demonstrate that the proposed LightFormer can yield reasonable and realistic global illumination in fully dynamic scenes with real-time performance.
We propose a novel Particle Flow Map (PFM) method to enable accurate long-range advection for incompressible fluid simulation. The foundation of our method is the observation that a particle trajectory generated in a forward simulation naturally embodies a perfect flow map. Centered on this concept, we have developed an Eulerian-Lagrangian framework comprising four essential components: Lagrangian particles for a natural and precise representation of bidirectional flow maps; a dual-scale map representation to accommodate the mapping of various flow quantities; a particle-to-grid interpolation scheme for accurate quantity transfer from particles to grid nodes; and a hybrid impulse-based solver to enforce incompressibility on the grid. The efficacy of PFM has been demonstrated through various simulation scenarios, highlighting the evolution of complex vortical structures and the details of turbulent flows. Notably, compared to the recent Neural Flow Maps (NFM) method, PFM reduces computing time by up to 49 times and memory consumption by up to 41%, while enhancing vorticity preservation as evidenced in various tests like leapfrog, vortex tube, and turbulent flow.
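As one small, generic building block of such a hybrid Eulerian-Lagrangian pipeline, a particle-to-grid transfer with linear (tent) kernels can be sketched in 1D as below; this illustrates the generic P2G idea only, not PFM's dual-scale map representation:

```python
import numpy as np

def particle_to_grid_1d(particle_x, particle_q, grid_n, dx):
    """Transfer a particle-carried quantity q to grid nodes with linear kernels."""
    grid_q = np.zeros(grid_n)
    grid_w = np.zeros(grid_n)
    for x, q in zip(particle_x, particle_q):
        i = int(np.floor(x / dx))                # left node of the containing cell
        t = x / dx - i                           # fractional position in the cell
        for node, w in ((i, 1.0 - t), (i + 1, t)):
            if 0 <= node < grid_n:
                grid_q[node] += w * q
                grid_w[node] += w
    mask = grid_w > 0
    grid_q[mask] /= grid_w[mask]                 # normalize by accumulated weights
    return grid_q
```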
The method of fundamental solutions (MFS) and its associated boundary element method (BEM) have gained popularity in computer graphics due to the reduced dimensionality they offer: for three-dimensional linear problems, they only require variables on the domain boundary to solve and evaluate the solution throughout space, making them a valuable tool in a wide variety of applications. However, MFS and BEM have poor computational scalability and huge memory requirements for large-scale problems, limiting their applicability and efficiency in practice. By leveraging connections with Gaussian Processes and exploiting the sparse structure of the inverses of boundary integral matrices, we introduce a variational preconditioner that can be computed via a sparse inverse-Cholesky factorization in a massively parallel manner. We show that applying our preconditioner to the Preconditioned Conjugate Gradient algorithm greatly improves the efficiency of MFS or BEM solves, up to four orders of magnitude in our series of tests.
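For reference, the preconditioned conjugate gradient iteration into which such a preconditioner plugs is short. In the sketch below, `apply_A` stands in for the (possibly matrix-free) boundary integral operator and `apply_Minv` for the sparse inverse-Cholesky preconditioner; both names are assumptions for illustration:

```python
import numpy as np

def pcg(apply_A, b, apply_Minv, x0=None, tol=1e-8, max_iter=500):
    """Preconditioned conjugate gradient for a symmetric positive-definite system A x = b."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - apply_A(x)
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```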
The behavior of a rigid body primarily depends on its mass moments, which consist of the mass, center of mass, and moments of inertia. It is possible to manipulate these quantities without altering the geometric appearance of an object by introducing cavities in its interior. Algorithms that find cavities of suitable shapes and sizes have enabled the computational design of spinning tops, yo-yos, wheels, buoys, and statically balanced objects. Previous work is based, for example, on topology optimization on voxel grids, which introduces a large number of optimization variables and box constraints, or offset surface computation, which cannot guarantee that solutions to a feasible problem will always be found.
In this work, we provide a mathematical analysis of constrained topology optimization problems that depend only on mass moments. This class of problems covers, among others, all applications mentioned above. Our main result is to show that no matter the outer shape of the rigid body to be optimized or the optimization objective and constraints considered, the optimal solution always features a quadric-shaped interface between material and cavities. This proves that optimal interfaces are always ellipsoids, hyperboloids, paraboloids, or one of a few degenerate cases, such as planes.
This insight lets us replace a difficult topology optimization problem with a provably equivalent non-linear equation system in a small number (<10) of variables, which represent the coefficients of the quadric. This system can be solved in a few seconds for most examples, provides insights into the geometric structure of many specific applications, and lets us describe their solution properties. Finally, our method integrates seamlessly into modern fabrication workflows because our solutions are analytical surfaces that are native to the CAD domain.
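A toy illustration of this reduction, restricted for brevity to a spherical sub-family of quadrics and to smoothed Monte Carlo mass moments; the targets, sampling scheme, and solver below are assumptions for illustration, not the paper's method.

```python
# Illustrative reduction (not the paper's solver): restrict the material/cavity
# interface to a sub-family of quadrics q(x) = |x|^2 + b_z z + c, estimate mass
# moments by smoothed Monte Carlo over the outer shape, and solve the resulting
# small nonlinear system for a target mass and center of mass.
import numpy as np
from scipy.optimize import least_squares
from scipy.special import expit

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(200_000, 3))
in_outer = np.linalg.norm(pts, axis=1) <= 1.0        # outer shape: unit ball

def moments(theta, eps=0.02):
    b_z, c = theta
    q = (pts ** 2).sum(axis=1) + b_z * pts[:, 2] + c  # quadric; keep material where q >= 0
    w = in_outer * expit(q / eps)                     # smoothed material indicator
    mass = w.mean() * 8.0                             # unit density x sampling-box volume
    com_z = (w * pts[:, 2]).sum() / w.sum()
    return mass, com_z

def residual(theta, target=(2.5, -0.15)):             # target (mass, z of center of mass)
    mass, com_z = moments(theta)
    return [mass - target[0], com_z - target[1]]

sol = least_squares(residual, x0=[0.0, -0.3], diff_step=1e-3)
print(sol.x, moments(sol.x))
```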
We present X-SLAM, a real-time dense differentiable SLAM system that leverages the complex-step finite difference (CSFD) method for efficient calculation of numerical derivatives, bypassing the need for a large-scale computational graph. The key to our approach is treating the SLAM process as a differentiable function, enabling the calculation of the derivatives of important SLAM parameters through Taylor series expansion within the complex domain. Our system allows for the real-time calculation of not just the gradient, but also higher-order differentiation. This facilitates the use of high-order optimizers to achieve better accuracy and faster convergence. Building on X-SLAM, we implemented end-to-end optimization frameworks for two important tasks: camera relocalization in wide outdoor scenes and active robotic scanning in complex indoor environments. Comprehensive evaluations on public benchmarks and intricate real scenes underscore the improvements in the accuracy of camera relocalization and the efficiency of robotic navigation achieved through our task-aware optimization. The code and data are available at https://gapszju.github.io/X-SLAM.
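The complex-step idea the system builds on can be demonstrated in a few lines; higher-order derivatives require extensions (e.g., multicomplex arithmetic) not shown here.

```python
# Minimal demonstration of the complex-step finite difference (CSFD) idea: for a
# real-analytic f, Im f(x + ih) / h gives the first derivative without subtractive
# cancellation, so h can be made extremely small (e.g. 1e-30).
import numpy as np

def f(x):
    # Any real-analytic scalar function works; this one is a toy stand-in.
    return np.sin(x) * np.exp(0.5 * x) + x ** 3

def csfd_grad(f, x, h=1e-30):
    return np.imag(f(x + 1j * h)) / h

x = 0.7
exact = np.cos(x) * np.exp(0.5 * x) + 0.5 * np.sin(x) * np.exp(0.5 * x) + 3 * x ** 2
print(csfd_grad(f, x), exact)   # agree to machine precision
```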
In mesh simplification, common requirements like accuracy, triangle quality, and feature alignment are often considered a trade-off. Existing algorithms concentrate on just one or a few specific aspects of these requirements. For example, the well-known Quadric Error Metrics (QEM) approach [Garland and Heckbert 1997] prioritizes accuracy and can preserve strong feature lines/points as well, but falls short in ensuring high triangle quality and may degrade weak features that are not as distinctive as strong ones. In this paper, we propose a smooth functional that simultaneously considers all of these requirements. The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) [Du et al. 1999] energy term, with the variables being a set of movable points lying on the surface. The former inherits the spirit of QEM but operates in a continuous setting, while the latter encourages even point distribution, allowing various surface metrics. We further introduce a decaying weight to automatically balance the two terms. We selected 100 CAD models from the ABC dataset [Koch et al. 2019], along with 21 organic models, to compare the existing mesh simplification algorithms with ours. Experimental results reveal an important observation: the introduction of a decaying weight effectively reduces the conflict between the two terms and enables the alignment of weak features. This distinctive feature sets our approach apart from most existing mesh simplification methods and demonstrates significant potential in shape understanding. Please refer to the teaser figure for illustration.
In architecture, shapes of surfaces that can withstand gravity with no bending action are considered ideal for shell structures. Those shells have special geometries through which they can stream gravitational force toward the ground via stresses strictly tangent to the surface, making them highly efficient. The process of finding these special forms is called form-finding. Recently, [Miki and Mitchell 2022] presented a method to reliably produce mixed tension-compression continuum shells, a type of shells known to be especially difficult to form-find. The key to this method was to use the concept of the Airy stress function to derive a valid bending-free shell shape by iterating on both the shell shape and the Airy stress function; this turns a problem that is over-constrained in general into a problem with many solutions.
In [Miki and Mitchell 2022], it was proposed that the method could also be used to design grid shells by tracing curves on a continuum shell such that the resulting grid has bars that are both bending-free and form flat panels, a property useful for construction of real grid shells made of glass and steel. However, this special type of grid is not guaranteed to exist in general on a mixed tension-compression shell, even when the shell is in bending-free equilibrium [Miki and Mitchell 2023]. Additional conditions must be imposed on the shell shape to guarantee the existence of simultaneously bending-free and conjugate grid directions. The current study resolves the existence issue by adding alignment conditions.
We consider several practical curve alignment conditions: alignment with the lines of curvature of the shell, approximate alignment with a bidirectional set of user-prescribed guide curves, and exact alignment with a single direction of user-prescribed guide curves. We report that the variable projection method originally used to solve the form-finding problem in the work of [Miki and Mitchell 2022] can be successfully extended to solve the newly introduced alignment conditions, and conclude with results for several practical design examples. To our knowledge, this is the first method that can take a user-input grid and find a "nearby" grid that is both flat-panelled and in bending-free equilibrium for the general case of mixed tension-compression grid shells.
We present a method for generating a simplicial (e.g., triangular or tetrahedral) grid to enable adaptive discretization of implicit shapes defined by a vector function. Such shapes, which we call implicit complexes, are generalizations of implicit surfaces and useful for representing non-smooth and non-manifold structures. While adaptive grid generation has been extensively studied for polygonizing implicit surfaces, few methods are designed for implicit complexes. Our method can generate adaptive grids for several implicit complexes, including arrangements of implicit surfaces, CSG shapes, material interfaces, and curve networks. Importantly, our method adapts the grid to the geometry of not only the implicit surfaces but also their lower-dimensional intersections. We demonstrate how our method enables efficient and detail-preserving discretization of non-trivial implicit shapes.
In this paper, we introduce a new method for the task of interaction transfer. Given an example interaction between a source object and an agent, our method can automatically infer both surface and spatial relationships for the agent and target objects within the same category, yielding more accurate and valid transfers. Specifically, our method characterizes the example interaction using a combined spatial and surface representation. We correspond the agent points and object points related to the representation to the target object space using a learned spatial and surface correspondence field, which represents objects as deformed and rotated signed distance fields. With the corresponded points, an optimization is performed under the constraints of our spatial and surface interaction representation and additional regularization. Experiments conducted on human-chair and hand-mug interaction transfer tasks show that our approach can handle larger geometry and topology variations between source and target shapes, significantly outperforming state-of-the-art methods.
In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.
We introduce a novel neural network-based computational pipeline as a representation-agnostic slicer for multi-axis 3D printing. This advanced slicer can work on models with diverse representations and intricate topology. The approach involves employing neural networks to establish a deformation mapping, defining a scalar field in the space surrounding an input model. Isosurfaces are subsequently extracted from this field to generate curved layers for 3D printing. Creating a differentiable pipeline enables us to optimize the mapping through loss functions directly defined on the field gradients as the local printing directions. New loss functions have been introduced to meet the manufacturing objectives of support-free and strength reinforcement. Our new computation pipeline relies less on the initial values of the field and can generate slicing results with significantly improved performance.
We present a computational approach for modeling the mechanical behavior of flexible scaled sheet materials---3D-printed hard scales embedded in a soft substrate. Balancing strength and flexibility, these structured materials find applications in protective gear, soft robotics, and 3D-printed fashion. To unlock their full potential, however, we must unravel the complex relation between scale pattern and mechanical properties. To address this problem, we propose a contact-aware homogenization approach that distills native-level simulation data into a novel macromechanical model. This macro-model combines piecewise-quadratic uniaxial fits with polar interpolation using circular harmonics, allowing for efficient simulation of large-scale patterns. We apply our approach to explore the space of isohedral scale patterns, revealing a diverse range of anisotropic and nonlinear material behaviors. Through an extensive set of experiments, we show that our models reproduce various scale-level effects while offering good qualitative agreement with physical prototypes on the macro-level.
Surface-based inflatables are composed of two thin layers of nearly inextensible sheet material joined together along carefully selected fusing curves. During inflation, pressure forces separate the two sheets to maximize the enclosed volume. The fusing curves restrict this expansion, leading to a spatially varying in-plane contraction and hence metric frustration. The inflated structure settles into a 3D equilibrium that balances pressure forces with the internal elastic forces of the sheets.
We present a computational framework for analyzing and designing surface-based inflatable structures with arbitrary fusing patterns. Our approach employs numerical homogenization to characterize the behavior of parametric families of periodic inflatable patch geometries, which can then be combined to tessellate the sheet with smoothly varying patterns. We propose a novel parametrization of the underlying deformation space that allows accurate, efficient, and systematic analysis of the stretching and bending behavior of inflated patches with potentially open boundaries.
We apply our homogenization algorithm to create a database of geometrically diverse fusing patterns spanning a wide range of material properties and deformation characteristics. This database is employed in an inverse design algorithm that solves for fusing curves to best approximate a given input target surface. Local patches are selected and blended to form a global network of curves based on a geometric flattening algorithm. These fusing curves are then further optimized to minimize the distance of the deployed structure to the target surface. We show that this approach offers greater flexibility to approximate given target geometries compared to previous work while significantly improving structural performance.
We introduce solid knitting, a new fabrication technique that combines the layer-by-layer volumetric approach of 3D printing with the topologically-entwined stitch structure of knitting to produce solid 3D objects. We define the basic building blocks of solid knitting and demonstrate a working prototype of a solid knitting machine controlled by a low-level instruction language, along with a volumetric design tool for creating machine-knittable patterns. Solid knitting uses a course-wale-layer structure, where every loop in a solid-knit object passes through both a loop from the previous layer and a loop from the previous course. Our machine uses two beds of latch needles to create stitches like a conventional V-bed knitting machine, but augments these needles with a pair of rotating hook arrays to provide storage locations for all of the loops in one layer of the object. It can autonomously produce solid-knit prisms of arbitrary length, although it requires manual intervention to cast on the first layer and bind off the final row. Our design tool allows users to create solid knitting patterns by connecting elementary stitches; objects designed in our interface can---after basic topological checks and constraint propagation---be exported as a sequence of instructions for fabrication on the solid knitting machine. We validate our solid knitting hardware and software on prism examples, detail the mechanical errors which we have encountered, and discuss potential extensions to the capability of our solid knitting machine.
We present a novel method for realizing freeform surfaces with pieces of flat fabric, where curvature is created by stitching together points on the fabric using a technique known as smocking. Smocking is renowned for producing intricate geometric textures with voluminous pleats. However, it has been mostly used to realize flat shapes or manually designed, limited classes of curved surfaces. Our method combines the computation of directional fields with continuous optimization of a Tangram graph in the plane, which together allow us to realize surfaces of arbitrary topology and curvature with smocking patterns of diverse symmetries. Given a target surface and the desired smocking pattern, our method outputs a corresponding 2D smocking pattern that can be fabricated by sewing specified points together. The resulting textile fabrication approximates the target shape and exhibits visually pleasing pleats. We validate our method through physical fabrication of various smocked examples.
In this paper, we address the problem of markerless multi-modal human motion capture, especially for string performance capture, which involves inherently subtle hand-string contacts and intricate movements. To fulfill this goal, we first collect a dataset, named String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. Moreover, to acquire the detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals for solving detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner without imposing any external devices on performers, eliminating the risk of introducing distortion into such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results through our innovative audio-guided approach, simultaneously clarifying the contact relationship between the performer and the instrument, as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance, covering fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications and guidance for string instrument pedagogy, animation, and virtual concerts, as well as for both musical performance analysis and generation. Our code and SPD dataset are available at https://github.com/Yitongishere/string_performance.
Computing intrinsic distances on discrete surfaces is at the heart of many minimization problems in geometry processing and beyond. Solving these problems is extremely challenging as it demands the computation of on-surface distances along with their derivatives. We present a novel approach for intrinsic minimization of distance-based objectives defined on triangle meshes. Using a variational formulation of shortest-path geodesics, we compute first and second-order distance derivatives based on the implicit function theorem, thus opening the door to efficient Newton-type minimization solvers. We demonstrate our differentiable geodesic distance framework on a wide range of examples, including geodesic networks and membranes on surfaces of arbitrary genus, two-way coupling between hosting surface and embedded system, differentiable geodesic Voronoi diagrams, and efficient computation of Karcher means on complex shapes. Our analysis shows that second-order descent methods based on our differentiable geodesics outperform existing first-order and quasi-Newton methods by large margins.
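The differentiation step can be summarized, in generic notation that is not necessarily the paper's, by the implicit function theorem applied to the optimality conditions of the shortest-path formulation; second derivatives follow by differentiating once more.

```latex
% Let g(\gamma, \theta) = 0 denote the variational optimality conditions of a
% shortest geodesic path \gamma for surface/design parameters \theta. If
% \partial g / \partial \gamma is invertible, the path and any distance-based
% objective d(\gamma(\theta)) are differentiable via the implicit function theorem:
\[
\frac{\mathrm{d}\gamma}{\mathrm{d}\theta}
  = -\left(\frac{\partial g}{\partial \gamma}\right)^{-1}
     \frac{\partial g}{\partial \theta},
\qquad
\frac{\mathrm{d}d}{\mathrm{d}\theta}
  = -\frac{\partial d}{\partial \gamma}
     \left(\frac{\partial g}{\partial \gamma}\right)^{-1}
     \frac{\partial g}{\partial \theta}.
\]
```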
We introduce a method for approximating the signed distance function (SDF) of geometry corrupted by holes, noise, or self-intersections. The method implicitly defines a completed version of the shape, rather than explicitly repairing the given input. Our starting point is a modified version of the heat method for geodesic distance, which diffuses normal vectors rather than a scalar distribution. This formulation provides robustness akin to generalized winding numbers (GWN), but provides a distance function rather than just an inside/outside classification. Our formulation also offers several features not common to classic distance algorithms, such as the ability to simultaneously fit multiple level sets, a notion of distance for geometry that does not topologically bound any region, and the ability to mix and match signed and unsigned distance. The method can be applied in any dimension and to any spatial discretization, including triangle meshes, tet meshes, point clouds, polygonal meshes, voxelized surfaces, and regular grids. We evaluate the method on several challenging examples, implementing normal offsets and other morphological operations directly on imperfect curve and surface data. In many cases we also obtain an inside/outside classification dramatically more robust than the one provided by GWN.
Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual Reality (VR). To emulate authentic communication, avatar animation needs to be efficient and accurate: able to capture both extreme and subtle expressions within a few milliseconds to sustain the rhythm of natural conversations. The oblique and incomplete views of the face, variability in the donning of headsets, and illumination variation due to the environment are some of the unique challenges in generalization to unseen faces. In this paper, we present a method that can animate a photorealistic avatar in realtime from head-mounted cameras (HMCs) on a consumer VR headset. We present a self-supervised learning approach, based on a cross-view reconstruction objective, that enables generalization to unseen users. We present a lightweight expression calibration mechanism that increases accuracy with minimal additional cost to run-time efficiency. We present an improved parameterization for precise ground-truth generation that provides robustness to environmental variation. The resulting system produces accurate facial animation for unseen users wearing VR headsets in realtime. We compare our approach to prior face-encoding methods demonstrating significant improvements in both quantitative metrics and qualitative results.
Physically-based simulation is a powerful approach for 3D facial animation as the resulting deformations are governed by physical constraints, making it easy to resolve self-collisions, respond to external forces, and perform realistic anatomy edits. Today's methods are data-driven, where the actuations for finite elements are inferred from captured skin geometry. Unfortunately, these approaches have not been widely adopted due to the complexity of initializing the material space and learning the deformation model for each character separately, which often requires a skilled artist followed by lengthy network training. In this work, we aim to make physics-based facial animation more accessible by proposing a generalized physical face model that we learn from a large 3D face dataset. Once trained, our model can be quickly fit to any unseen identity and produce a ready-to-animate physical face model automatically. Fitting is as easy as providing a single 3D face scan, or even a single face image. After fitting, we offer intuitive animation controls, as well as the ability to retarget animations across characters. All the while, the resulting animations allow for physical effects like collision avoidance, gravity, paralysis, bone reshaping, and more.
Strand-based hair simulations have recently become increasingly popular for a range of real-time applications. However, accurately simulating the full number of hair strands remains challenging. A commonly employed technique involves simulating a subset of guide hairs to capture the overall behavior of the hairstyle. Details are then enriched by interpolation using linear skinning. Hair interpolation enables fast real-time simulations but frequently leads to various artifacts during runtime. As the skinning weights are often pre-computed, substantial variations between the initial and deformed shapes of the hair can cause severe deviations in fine hair geometry. Straight hairs may become kinked, and curly hairs may become zigzags.
This work introduces a novel physics-driven hair interpolation scheme that utilizes existing simulated guide hair data. Instead of directly operating on positions, we interpolate the internal forces from the guide hairs before efficiently reconstructing the rendered hairs based on their material model. We formulate our problem as a constraint satisfaction problem for which we present an efficient solution. Further practical considerations are addressed using regularization terms that regulate penetration avoidance and drift correction. We have tested various hairstyles to illustrate that our approach can generate visually plausible rendered hairs with only a few guide hairs and minimal computational overhead, amounting to only about 20% of the cost of conventional linear hair interpolation. This efficiency underscores the practical viability of our method for real-time applications.
We propose a generalization of the rendering equation that captures both the realistic light transport of physically-based rendering (PBR) and a subset of non-photorealistic rendering (NPR) stylizations in a principled manner. The proposed formulation is based on the key observation that both classical transport and certain NPR stylizations can be modeled as a function of expectation. Given this observation, we generalize the recursive integrals of the rendering equation to recursive functions of expectation. As estimating functions of expectation can be challenging, especially recursive ones, we provide a toolkit for unbiased and biased estimation comprising prior work, general strategies, and a novel build-your-own strategy for constructing more complex unbiased estimators from simpler unbiased estimators. We then use this toolkit to construct a complete estimator for the proposed recursive formulation, and implement a sampling algorithm that is both conceptually simple and leverages many of the components of an ordinary path tracer. To demonstrate the practicality of the proposed method we showcase how it captures several existing stylizations like color mapping, cel shading, and cross-hatching, fuses NPR and PBR visuals, and allows us to explore visuals that were previously challenging under existing formulations.
Robust light transport algorithms, particularly bidirectional path tracing (BDPT), face significant challenges when dealing with paths involving specular or highly glossy surfaces. BDPT constructs the full path by connecting sub-paths traced individually from the light source and camera. However, it remains difficult to sample such paths by connecting vertices on specular and glossy surfaces with narrow-lobed BSDFs, as these lobes impose severe constraints on the feasible sampling directions. To address this issue, we propose a novel approach, called proxy sampling, that enables efficient sub-path connection for these challenging paths. When a low-contribution specular/glossy connection occurs, we drop the problematic neighboring vertex next to the specular/glossy vertex from the original path, then retrace an alternative sub-path as a proxy to complete the path. This newly constructed complete path ensures that the connection adheres to the constraint of the narrow lobe within the BSDF of the specular/glossy surface. Unbiased reciprocal estimation is the key to our method: we estimate the reciprocal of the probability density function (PDF) to ensure unbiased rendering. We derive the reciprocal estimation method and provide an efficiency-optimized setting for efficient sampling and connection. Our method provides a robust tool for substituting problematic paths with favorable alternatives while ensuring unbiasedness. We validate this approach within probabilistic connections BDPT for addressing specular-involved difficult paths. Experimental results demonstrate the effectiveness and efficiency of our approach, showcasing high-performance rendering capabilities across diverse settings.
Recent advancements in spatiotemporal reservoir resampling (ReSTIR) leverage sample reuse from neighbors to efficiently evaluate the path integral. Like rasterization, ReSTIR methods implicitly assume a pinhole camera and evaluate the light arriving at a pixel through a single predetermined subpixel location at a time (e.g., the pixel center). This prevents efficient path reuse in and near pixels with high-frequency details.
We introduce Area ReSTIR, extending ReSTIR reservoirs to also integrate each pixel's 4D ray space, including 2D areas on the film and lens. We design novel subpixel-tracking temporal reuse and shift mappings that maximize resampling quality in such regions. This robustifies ReSTIR against high-frequency content, letting us importance sample subpixel and lens coordinates and efficiently render antialiasing and depth of field.
Sphere tracing is a fast and high-quality method for visualizing surfaces encoded by signed distance functions (SDFs). We introduce a similar method for a completely different class of surfaces encoded by harmonic functions, opening up rich new possibilities for visual computing. Our starting point is similar in spirit to sphere tracing: using conservative Harnack bounds on the growth of harmonic functions, we develop a Harnack tracing algorithm for visualizing level sets of harmonic functions, including those that are angle-valued and exhibit singularities. The method takes much larger steps than naïve ray marching, avoids numerical issues common to generic root finding methods and, like sphere tracing, only needs pointwise evaluation of the function at each step. For many use cases, the method is fast enough to run in real time in a shader program. We use it to visualize smooth surfaces directly from point clouds (via Poisson surface reconstruction) or polygon soup (via generalized winding numbers) without linear solves or mesh extraction. We also use it to visualize nonplanar polygons (possibly with holes), surfaces from architectural geometry, mesh "exoskeletons", and key mathematical objects including knots, links, spherical harmonics, and Riemann surfaces. Finally we show that, at least in theory, Harnack tracing provides an alternative mechanism for visualizing arbitrary implicit surfaces.
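A minimal sketch of the idea for a positive harmonic function in 3D: the classical Harnack inequality on a ball of radius R yields a conservative step size toward a target level set. The step formula below is derived from that inequality and may differ in detail from the paper's; the harmonic function and ray setup are illustrative.

```python
# Harnack-style tracing of the level set u = c for a simple positive harmonic
# function u(x) = 1 / |x - p|, whose c-level set is a sphere of radius 1/c.
import numpy as np

p = np.array([0.0, 0.0, 2.0])            # point "source"; u is harmonic away from p

def u(x):
    return 1.0 / np.linalg.norm(x - p)

def harnack_step(v, c, R):
    # Largest r such that the 3D Harnack bounds on a ball of radius R forbid a
    # positive harmonic function with value v at the center from reaching c.
    a = v / c
    return 0.5 * R * abs(a + 2.0 - np.sqrt(a * a + 8.0 * a))

def trace(origin, direction, c, t_max=10.0, tol=1e-6):
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    while t < t_max:
        x = origin + t * direction
        v = u(x)
        if abs(v - c) < tol * c:
            return t                      # hit the level set u = c
        R = np.linalg.norm(x - p)         # u is harmonic and positive on this ball
        t += max(harnack_step(v, c, R), 1e-9)
    return None

c = 2.0                                   # level set: sphere of radius 1/c = 0.5 around p
t_hit = trace(np.zeros(3), np.array([0.0, 0.0, 1.0]), c)
print(t_hit)                              # expect ~1.5 (= |p| - 1/c along this ray)
```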
We present an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.
The ability to represent artworks as stacks of layers is fundamental to modern graphics design, as it allows artists to easily separate visual elements, edit them in isolation, and blend them to achieve rich visual effects. Despite their ubiquity in 2D painting software, layers have not yet made their way to VR painting, where users paint strokes directly in 3D space by gesturing a 6-degrees-of-freedom controller. But while the concept of a stack of 2D layers was inspired by real-world layers in cell animation, what should 3D layers be? We propose to define 3D-Layers as groups of 3D strokes, and we distinguish the ones that represent 3D geometry from the ones that represent color modifications of the geometry. We call the former substrate layers and the latter appearance layers. Strokes in appearance layers modify the color of the substrate strokes they intersect. Thanks to this distinction, artists can define sequences of color modifications as stacks of appearance layers, and edit each layer independently to finely control the final color of the substrate. We have integrated 3D-Layers into a VR painting application and we evaluate its flexibility and expressiveness by conducting a usability study with experienced VR artists.
Natural gestures are crucial for mid-air interaction, but predicting and managing muscle fatigue is challenging. Existing torque-based models are limited in their ability to model above-shoulder interactions and to account for fatigue recovery. We introduce a new hybrid model, NICER, which combines a torque-based approach with a new term derived from the empirical measurement of muscle contraction and a recovery factor to account for decreasing fatigue during rest. We evaluated NICER in a mid-air selection task using two interaction methods with different degrees of perceived fatigue. Results show that NICER can accurately model above-shoulder interactions as well as reflect fatigue recovery during rest periods. Moreover, both interaction methods show a stronger correlation with subjective fatigue measurement (ρ = 0.978/0.976) than a previous model, Cumulative Fatigue (ρ = 0.966/0.923), confirming that NICER is a powerful analytical tool to predict fatigue across a variety of gesture-based interactive applications.
Mutual-capacitive sensing is the most common technology for detecting multi-touch, especially on flat and simple curvature surfaces. Its extension to a more complex shape is still challenging, as a uniform distribution of sensing electrodes is required for consistent touch sensitivity across the surface. To overcome this problem, we propose a method to adapt the sensor layout of common capacitive multi-touch sensors to more complex 3D surfaces, ensuring high-resolution, robust multi-touch detection. The method automatically computes a grid of transmitter and receiver electrodes with as regular a distribution as possible over a general 3D shape. It starts with the computation of a proxy geometry by quad meshing used to place the electrodes through the dual-edge graph. It then arranges electrodes on the surface to minimize the number of touch controllers required for capacitive sensing and the number of input/output pins to connect the electrodes with the controllers. We reach these objectives using a new simplification and clustering algorithm for a regular quad-patch layout. The reduced patch layout is used to optimize the routing of all the structures (surface grooves and internal pipes) needed to host all electrodes on the surface and inside the object's volume, considering the geometric constraints of the 3D shape. Finally, we print the 3D object prototype ready to be equipped with the electrodes. We analyze the performance of the proposed quad layout simplification and clustering algorithm using different quad meshings and characterize the signal quality and accuracy of the capacitive touch sensor for different non-planar geometries. The tested prototypes show precise and robust multi-touch detection with good Signal-to-Noise Ratio and spatial accuracy of about 1 mm.
We propose Progressive Dynamics, a coarse-to-fine, level-of-detail simulation method for the physics-based animation of complex frictionally contacting thin shell and cloth dynamics. Progressive Dynamics provides tight-matching consistency and progressive improvement across levels, with comparable quality and realism to high-fidelity, IPC-based shell simulations [Li et al. 2021] at finest resolutions. Together these features enable an efficient animation-design pipeline with predictive coarse-resolution previews providing rapid design iterations for a final, to-be-generated, high-resolution animation. In contrast, previously, to design such scenes with comparable dynamics would require prohibitively slow design iterations via repeated direct simulations on high-resolution meshes. We evaluate and demonstrate Progressive Dynamics's features over a wide range of challenging stress-tests, benchmarks, and animation design tasks. Here Progressive Dynamics efficiently computes consistent previews at costs comparable to coarsest-level direct simulations. Its matching progressive refinements across levels then generate rich, high-resolution animations with high-speed dynamics, impacts, and the complex detailing of the dynamic wrinkling, folding, and sliding of frictionally contacting thin shells and fabrics.
Creating super-resolution cloth animations, which refine coarse cloth meshes with fine wrinkle details, faces challenges in preserving spatial consistency and temporal coherence across frames. In this paper, we introduce a general framework to address these issues, leveraging two core modules. The first module interleaves a simulator and a corrector. The simulator handles cloth dynamics, while the corrector rectifies differences in low-frequency features across various resolutions. This interleaving ensures prompt correction of spatial errors from the coarse simulation, effectively preventing their temporal propagation. The second module performs mesh-based super-resolution for detailed wrinkle enhancements. We decompose garment meshes into overlapping patches for adaptability to various styles and geometric continuity. Our method achieves an 8× improvement in resolution for cloth animations. We showcase the effectiveness of our method through diverse animation examples, including simple cloth pieces and intricate garments.
Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive yet imperfect annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task - both the depth-scale ambiguity in monocular perception and the inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We learn a probabilistic model through diffusion, modeling likely distributions of shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
While free-hand sketching has long served as an efficient representation to convey characteristics of an object, sketches are often subjective, deviating significantly from realistic representations. Moreover, sketches are not consistent across arbitrary viewpoints, making it hard to capture 3D shapes. We propose 3Doodle, which generates descriptive and view-consistent sketch images given multi-view images of the target object. Our method is based on the idea that a set of 3D strokes can efficiently represent 3D structural information and render view-consistent 2D sketches. We express 2D sketches as a union of view-independent and view-dependent components. 3D cubic Bézier curves indicate view-independent 3D feature lines, while contours of superquadrics express a smooth outline of the volume across varying viewpoints. Our pipeline directly optimizes the parameters of 3D stroke primitives to minimize perceptual losses in a fully differentiable manner. The resulting sparse set of 3D strokes can be rendered as abstract sketches containing essential 3D characteristic shapes of various objects. We demonstrate that 3Doodle can faithfully express concepts of the original images compared with recent sketch generation approaches.
We introduce a novel method for acquiring boundary representations (B-Reps) of 3D CAD models which involves a two-step process: it first applies a spatial partitioning, referred to as the "split", followed by a "fit" operation to derive a single primitive within each partition. Specifically, our partitioning aims to produce the classical Voronoi diagram of the set of ground-truth (GT) B-Rep primitives. In contrast to prior B-Rep constructions which were bottom-up, either via direct primitive fitting or point clustering, our Split-and-Fit approach is top-down and structure-aware, since a Voronoi partition explicitly reveals both the number of and the connections between the primitives. We design a neural network to predict the Voronoi diagram from an input point cloud or distance field via a binary classification. We show that our network, coined NVD-Net for neural Voronoi diagrams, can effectively learn Voronoi partitions for CAD models from training data and exhibits superior generalization capabilities. Extensive experiments and evaluation demonstrate that the resulting B-Reps, consisting of parametric surfaces, curves, and vertices, are more plausible than those obtained by existing alternatives, with significant improvements in reconstruction quality. Code will be released on https://github.com/yilinliu77/NVDNet.
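The supervision target can be pictured with a brute-force stand-in: label every query point by its nearest ground-truth primitive, which is exactly the Voronoi partition the network is trained to predict. The primitive types, distances, and grid below are illustrative, not the paper's data pipeline.

```python
# Sketch of the supervision target: the Voronoi partition of space induced by a set
# of surface primitives, i.e. each query point is labeled by its nearest primitive.
import numpy as np

def plane_distance(points, normal, offset):
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.abs(points @ n - offset)

def sphere_distance(points, center, radius):
    return np.abs(np.linalg.norm(points - center, axis=1) - radius)

primitives = [
    lambda p: plane_distance(p, normal=[0, 0, 1], offset=0.0),
    lambda p: plane_distance(p, normal=[1, 0, 0], offset=0.5),
    lambda p: sphere_distance(p, center=np.array([0.0, 0.5, 0.5]), radius=0.3),
]

# Dense query grid standing in for the network's input point cloud / distance field.
g = np.linspace(-1.0, 1.0, 32)
grid = np.stack(np.meshgrid(g, g, g, indexing='ij'), axis=-1).reshape(-1, 3)

dists = np.stack([d(grid) for d in primitives], axis=1)   # (N, num_primitives)
voronoi_label = dists.argmin(axis=1)                      # per-point Voronoi cell id
print(np.bincount(voronoi_label))
```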
Across many scientific disciplines, the pursuit of ever higher grid resolutions leads to a severe scalability problem in scientific computing. Feature extraction is a commonly chosen approach to reduce the amount of information from dense fields down to geometric primitives that further enable a quantitative analysis. Examples of common features are isolines, extremal lines, or vortex corelines. Due to the rising complexity of the observed phenomena, or in the event of discretization issues with the data, a straightforward application of textbook feature definitions is unfortunately insufficient. Thus, feature extraction from spatial data often requires substantial pre- or post-processing to either clean up the results or to include additional domain knowledge about the feature in question. Such separate pre- or post-processing of features not only leads to suboptimal and incomparable solutions, but also results in many specialized feature extraction algorithms arising in the different application domains. In this paper, we establish a mathematical language that not only encompasses commonly used feature definitions but also provides a set of regularizers that can be applied across the bounds of individual application domains. By using the language of variational calculus, we treat features as variational minimizers, which can be combined and regularized as needed. Our formulation not only encompasses existing feature definitions as special cases but also opens the path to novel feature definitions. This work lays the foundations for many new research directions regarding formal definitions, data representations, and numerical extraction algorithms.
In the field of trajectory generation for objects, ensuring continuous collision-free motion remains a major challenge, especially for non-convex geometries and complex environments. Previous methods either oversimplify object shapes, which results in a sacrifice of feasible space, or rely on discrete sampling, which suffers from the "tunnel effect". To address these limitations, we propose a novel hierarchical trajectory generation pipeline, which utilizes the Swept Volume Signed Distance Field (SVSDF) to guide trajectory optimization for Continuous Collision Avoidance (CCA). Our interdisciplinary approach, blending techniques from graphics and robotics, exhibits outstanding effectiveness in solving this problem. We formulate the computation of the SVSDF as a Generalized Semi-Infinite Programming model, and we solve for the numerical solutions at query points implicitly, thereby eliminating the need for explicit reconstruction of the surface. Our algorithm has been validated in a variety of complex scenarios and applies to robots of various dynamics, including both rigid and deformable shapes. It demonstrates exceptional universality and superior CCA performance compared to typical algorithms. The code will be released at https://github.com/ZJU-FAST-Lab/Implicit-SVSDF-Planner for the benefit of the community.
We introduce an improved version of the micrograin BSDF model [Lucas et al. 2023] for the rendering of anisotropic porous layers. Our approach leverages the properties of micrograins to take into account the correlation between their height and normal, as well as the correlation between the light and view directions. This allows us to derive an exact analytical expression for the Geometrical Attenuation Factor (GAF), summarizing shadowing and masking inside the porous layer. This fully-correlated GAF is then used to define appropriate mixing weights to blend the BSDFs of the porous and base layers. Furthermore, by generalizing the micrograin shape to anisotropy, combined with their fully-correlated GAF, our improved BSDF model produces effects specific to porous layers such as retro-reflection visible on dust layers at grazing angles or height and color correlation that can be found on rusty materials. Finally, we demonstrate very close matches between our BSDF model and light transport simulations realized with explicit instances of micrograins, thus validating our model.
Stochastic geometry models have enjoyed immense success in graphics for modeling interactions of light with complex phenomena such as participating media, rough surfaces, fibers, and more. Although each of these models operates on the same principle of replacing intricate geometry by a random process and deriving the average light transport across all instances thereof, they are each tailored to one specific application and are fundamentally distinct. Each type of stochastic geometry present in the scene is firmly encapsulated in its own appearance model, with its own statistics and light transport average, and no cross-talk between different models or deterministic and stochastic geometry is possible.
In this paper, we derive a theory of light transport on stochastic implicit surfaces, a geometry model capable of expressing deterministic geometry, microfacet surfaces, participating media, and an exciting new continuum in between containing aggregate appearance, non-classical media, and more. Our model naturally supports spatial correlations, missing from most existing stochastic models.
Our theory paves the way for tractable rendering of scenes in which all geometry is described by the same stochastic model, while leaving ample future work for developing efficient sampling and rendering algorithms.
Free-space diffractions are an optical phenomenon where light appears to "bend" around the geometric edges and corners of scene objects. In this paper we present an efficient method to simulate such effects. We derive an edge-based formulation of Fraunhofer diffraction, which is well suited to the common (triangular) geometric meshes used in computer graphics. Our method dynamically constructs a free-space diffraction BSDF by considering the geometry around the intersection point of a ray of light with an object, and we present an importance sampling strategy for these BSDFs. Our method is unique in requiring only ray tracing to produce free-space diffractions, works with general meshes, requires no geometry preprocessing, and is designed to work with path tracers with a linear rendering equation. We show that we are able to reproduce accurate diffraction lobes, and, in contrast to any existing method, are able to handle complex, real-world geometry. This work serves to connect free-space diffractions to the efficient path tracing tools from computer graphics.
Procedural noise is a fundamental component of computer graphics pipelines, offering a flexible way to generate textures that exhibit "natural" random variation. Many different types of noise exist, each produced by a separate algorithm. In this paper, we present a single generative model which can learn to generate multiple types of noise as well as blend between them. In addition, it is capable of producing spatially-varying noise blends despite not having access to such data for training. These features are enabled by training a denoising diffusion model using a novel combination of data augmentation and network conditioning techniques. Like procedural noise generators, the model's behavior is controllable via interpretable parameters plus a source of randomness. We use our model to produce a variety of visually compelling noise textures. We also present an application of our model to improving inverse procedural material design; using our model in place of fixed-type noise nodes in a procedural material graph results in higher-fidelity material reconstructions without needing to know the type of noise in advance. Open-sourced materials can be found at https://armanmaesumi.github.io/onenoise/
Position based dynamics [Müller et al. 2007] is a powerful technique for simulating a variety of materials. Its primary strength is its robustness when run with limited computational budget. Even though PBD is based on the projection of static constraints, it does not work well for quasistatic problems. This is particularly relevant since the efficient creation of large data sets of plausible, but not necessarily accurate elastic equilibria is of increasing importance with the emergence of quasistatic neural networks [Bailey et al. 2018; Chentanez et al. 2020; Jin et al. 2022; Luo et al. 2020]. Recent work [Macklin et al. 2016] has shown that PBD can be related to the Gauss-Seidel approximation of a Lagrange multiplier formulation of backward Euler time stepping, where each constraint is solved/projected independently of the others in an iterative fashion. We show that a position-based, rather than constraint-based nonlinear Gauss-Seidel approach resolves a number of issues with PBD, particularly in the quasistatic setting. Our approach retains the essential PBD feature of stable behavior with constrained computational budgets, but also allows for convergent behavior with expanded budgets. We demonstrate the efficacy of our method on a variety of representative hyperelastic problems and show that successive over-relaxation (SOR), Chebyshev, and multiresolution-based acceleration can all be easily applied.
We introduce vertex block descent, a block coordinate descent solution for the variational form of implicit Euler through vertex-level Gauss-Seidel iterations. It operates with local vertex position updates that achieve reductions in global variational energy with maximized parallelism. This forms a physics solver that can achieve numerical convergence with unconditional stability and exceptional computation performance. It can also fit in a given computation budget by simply limiting the iteration count while maintaining its stability and superior convergence rate.
We present and evaluate our method in the context of elastic body dynamics, providing details of all essential components and showing that it outperforms alternative techniques. In addition, we discuss and show examples of how our method can be used for other simulation systems, including particle-based simulations and rigid bodies.
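A minimal 2D mass-spring sketch of vertex-level coordinate descent on the implicit-Euler variational energy is shown below; each vertex takes a diagonally preconditioned local step, standing in for the paper's per-vertex block solve and parallel vertex coloring. All constants and the material model are illustrative assumptions.

```python
# Gauss-Seidel sweeps of per-vertex updates that each decrease the implicit-Euler
# variational energy  E(x) = sum_i m/(2 h^2)|x_i - y_i|^2 + sum_s k/2 (|x_a - x_b| - L)^2.
import numpy as np

n, h, m, k, g = 6, 1.0 / 60.0, 1.0, 500.0, np.array([0.0, -9.8])
x = np.stack([np.linspace(0, 1, n), np.zeros(n)], axis=1)     # a hanging chain
v = np.zeros_like(x)
springs = [(i, i + 1) for i in range(n - 1)]
rest = [np.linalg.norm(x[a] - x[b]) for a, b in springs]

for step in range(120):                       # time steps
    y = x + h * v + h * h * g                 # inertial target positions
    x_new = x.copy()
    for _ in range(20):                       # Gauss-Seidel sweeps over vertices
        for i in range(1, n):                 # vertex 0 is pinned
            grad = (m / h**2) * (x_new[i] - y[i])
            diag = m / h**2
            for (a, b), L in zip(springs, rest):
                if i in (a, b):
                    j = b if i == a else a
                    d = x_new[i] - x_new[j]
                    ln = np.linalg.norm(d) + 1e-12
                    grad += k * (ln - L) * d / ln
                    diag += k
            x_new[i] -= grad / diag           # local, energy-decreasing vertex update
    v = (x_new - x) / h
    x = x_new
print(x[-1])                                  # free end hangs below the pinned end
```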
The proliferation of 3D representations, from explicit meshes to implicit neural fields and more, motivates the need for simulators agnostic to representation. We present a data-, mesh-, and grid-free solution for elastic simulation for any object in any geometric representation undergoing large, nonlinear deformations. We note that every standard geometric representation can be reduced to an occupancy function queried at any point in space, and we define a simulator atop this common interface. For each object, we fit a small implicit neural network encoding spatially varying weights that act as a reduced deformation basis. These weights are trained to learn physically significant motions in the object via random perturbations. Our loss ensures we find a weight-space basis that best minimizes deformation energy by stochastically evaluating elastic energies through Monte Carlo sampling of the deformation volume. At runtime, we simulate in the reduced basis and sample the deformations back to the original domain. Our experiments demonstrate the versatility, accuracy, and speed of this approach on data including signed distance functions, point clouds, neural primitives, tomography scans, radiance fields, Gaussian splats, surface meshes, and volume meshes, as well as showing a variety of material energies, contact models, and time integration schemes.
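The core recipe (reduce any representation to an occupancy query, deform with a small reduced basis, and estimate elastic energy stochastically) can be sketched as follows; the occupancy, weight functions, and energy below are simple stand-ins, not the learned components.

```python
# Sketch: occupancy query + skinning-like reduced deformation + Monte Carlo energy.
import numpy as np

rng = np.random.default_rng(0)

def occupancy(p):                       # any representation boils down to this query
    return np.linalg.norm(p, axis=1) <= 1.0

centers = rng.uniform(-1, 1, size=(8, 3))          # handle locations of the reduced basis
def weights(p):                                     # smooth, spatially varying weights
    w = np.exp(-4.0 * np.linalg.norm(p[:, None, :] - centers[None], axis=2) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def deform(p, T):                                   # T: (8, 3, 4) per-handle affine transforms
    ph = np.concatenate([p, np.ones((len(p), 1))], axis=1)
    per_handle = np.einsum('hij,nj->nhi', T, ph)
    return np.einsum('nh,nhi->ni', weights(p), per_handle)

def energy(T, n_samples=20_000, eps=1e-3):
    p = rng.uniform(-1, 1, size=(n_samples, 3))
    p = p[occupancy(p)]                             # Monte Carlo samples of the volume
    # Finite-difference deformation gradient F, then a simple St.-Venant-like measure
    # |F^T F - I|^2 standing in for the elastic material model.
    F = np.stack([(deform(p + eps * e, T) - deform(p - eps * e, T)) / (2 * eps)
                  for e in np.eye(3)], axis=2)
    C = np.einsum('nij,nik->njk', F, F)             # right Cauchy-Green tensor F^T F
    return np.mean(np.sum((C - np.eye(3)) ** 2, axis=(1, 2)))

T_rest = np.tile(np.hstack([np.eye(3), np.zeros((3, 1))]), (8, 1, 1))
T_def = T_rest + 0.05 * rng.normal(size=T_rest.shape)
print(energy(T_rest), energy(T_def))                # ~0 at rest, positive when deformed
```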
We present a comprehensive neural network to model the deformation of human soft tissues including muscle, tendon, fat and skin. Our approach provides kinematic and active correctives to linear blend skinning [Magnenat-Thalmann et al. 1989] that enhance the realism of soft tissue deformation at modest computational cost. Our network accounts for deformations induced by changes in the underlying skeletal joint state as well as the active contractile state of relevant muscles. Training is done to approximate quasistatic equilibria produced from physics-based simulation of hyperelastic soft tissues in close contact. We use a layered approach to equilibrium data generation where deformation of muscle is computed first, followed by an inner skin/fascia layer, and lastly a fat layer between the fascia and outer skin. We show that a simple network model which decouples the dependence on skeletal kinematics and muscle activation state can produce compelling behaviors with modest training data burden. Active contraction of muscles is estimated using inverse dynamics where muscle moment arms are accurately predicted using the neural network to model kinematic musculotendon geometry. Results demonstrate the ability to accurately replicate compelling musculoskeletal and skin deformation behaviors over a representative range of motions, including the effects of added weights in body building motions.
This paper presents BrepGen, a diffusion-based generative approach that directly outputs a Boundary representation (B-rep) Computer-Aided Design (CAD) model. BrepGen represents a B-rep model as a novel structured latent geometry in a hierarchical tree. With the root node representing a whole CAD solid, each element of a B-rep model (i.e., a face, an edge, or a vertex) progressively turns into a child-node from top to bottom. B-rep geometry information goes into the nodes as the global bounding box of each primitive along with a latent code describing the local geometric shape. The B-rep topology information is implicitly represented by node duplication. When two faces share an edge, the edge curve will appear twice in the tree, and a T-junction vertex with three incident edges appears six times in the tree with identical node features. Starting from the root and progressing to the leaf, BrepGen employs Transformer-based diffusion models to sequentially denoise node features while duplicated nodes are detected and merged, recovering the B-Rep topology information. Extensive experiments show that BrepGen advances the task of CAD B-rep generation, surpassing existing methods on various benchmarks. Results on our newly collected furniture dataset further showcase its exceptional capability in generating complicated geometry. While previous methods were limited to generating simple prismatic shapes, BrepGen incorporates free-form and doubly-curved surfaces for the first time. Additional applications of BrepGen include CAD autocomplete and design interpolation. The code, pretrained models, and dataset are available at https://github.com/samxuxiang/BrepGen.
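The hierarchical latent tree described above can be pictured with a small illustrative data structure (names assumed, not taken from the released code): each node carries a global bounding box and a latent code for its local geometry, shared edges and vertices are duplicated under each parent, and topology is recovered by merging near-identical nodes.

```python
# Illustrative node structure and a greedy stand-in for the duplicate-merging step.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Node:
    bbox: np.ndarray                      # (6,) global axis-aligned bounding box
    latent: np.ndarray                    # latent code of the local geometry
    children: List["Node"] = field(default_factory=list)

def merge_duplicates(nodes: List[Node], tol: float = 1e-3) -> List[Node]:
    """Nodes whose features nearly coincide are treated as the same topological
    entity (e.g. an edge shared by two faces)."""
    merged: List[Node] = []
    for n in nodes:
        for m in merged:
            if (np.linalg.norm(n.bbox - m.bbox) < tol
                    and np.linalg.norm(n.latent - m.latent) < tol):
                break
        else:
            merged.append(n)
    return merged

# Two faces sharing one edge: the edge node appears once under each face.
shared_edge = Node(bbox=np.zeros(6), latent=np.ones(8))
face_a = Node(bbox=np.arange(6.0), latent=np.zeros(8), children=[shared_edge])
face_b = Node(bbox=np.arange(6.0) + 1, latent=np.zeros(8),
              children=[Node(shared_edge.bbox.copy(), shared_edge.latent.copy())])
solid = Node(bbox=np.zeros(6), latent=np.zeros(8), children=[face_a, face_b])
print(len(merge_duplicates(face_a.children + face_b.children)))   # -> 1 unique edge
```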
In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.
Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region. With the image prompt, users can conveniently specify the detailed appearance/style of the target content as a complement to the text description, enabling accurate control of the appearance. Specifically, TIP-Editor employs a stepwise 2D personalization strategy to better learn the representation of the existing scene and the reference image, in which a localization loss is proposed to encourage correct object placement as specified by the bounding box. Additionally, TIP-Editor utilizes explicit and flexible 3D Gaussian splatting (GS) as the 3D representation to facilitate local editing while keeping the background unchanged. Extensive experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region, consistently outperforming the baselines in both editing quality and alignment to the prompts, qualitatively and quantitatively.
Texture modeling and synthesis are essential for enhancing the realism of virtual environments. Methods that directly synthesize textures in 3D offer distinct advantages over UV-mapping-based methods, as they can create seamless textures and align more closely with the way textures form in nature. We propose Mesh Neural Cellular Automata (MeshNCA), a method that directly synthesizes dynamic textures on 3D meshes without requiring any UV maps. MeshNCA is a generalized type of cellular automata that can operate on a set of cells arranged on non-grid structures such as the vertices of a 3D mesh. MeshNCA accommodates multi-modal supervision and can be trained using different targets such as images, text prompts, and motion vector fields. Trained only on an icosphere mesh, MeshNCA shows remarkable test-time generalization and can synthesize textures on unseen meshes in real time. We conduct qualitative and quantitative comparisons to demonstrate that MeshNCA outperforms other 3D texture synthesis methods in terms of generalization and producing high-quality textures. Moreover, we introduce a way of grafting trained MeshNCA instances, enabling interpolation between textures. MeshNCA allows several user interactions including texture density/orientation controls, grafting/regenerate brushes, and motion speed/direction controls. Finally, we implement the forward pass of our MeshNCA model using the WebGL shading language and showcase our trained models in an online interactive demo, which is accessible on personal computers and smartphones and is available at https://meshnca.github.io/.
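The core update of a cellular automaton operating on mesh vertices rather than a pixel grid can be sketched as below: each cell perceives its one-ring neighborhood and applies a small residual MLP update. The perception scheme (mean of neighbor states and their difference) and layer sizes are illustrative assumptions, not MeshNCA's exact rule.

```python
# A minimal sketch of a cellular-automaton update on the vertices of a mesh.
import torch
import torch.nn as nn

class VertexNCA(nn.Module):
    def __init__(self, state_dim=16, hidden=96):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, state_dim))

    def forward(self, states, neighbors):
        """states: (V, C) per-vertex cell states; neighbors: list of index tensors,
        one per vertex, giving its one-ring neighborhood."""
        mean_nbr = torch.stack([states[idx].mean(dim=0) for idx in neighbors])
        perception = torch.cat([states, mean_nbr - states], dim=-1)  # cell + local-difference proxy
        return states + self.update(perception)                      # residual CA step
```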
Metropolis Light Transport (MLT) is a global illumination algorithm that is well-known for rendering challenging scenes with intricate light paths. However, MLT methods tend to produce unpredictable correlation artifacts in images, which can introduce visual inconsistencies for animation rendering. This drawback also makes it challenging to denoise MLT renderings while maintaining temporal stability. We tackle this issue with modern learning-based methods and build a sequence denoiser combining the recurrent connections with the cutting-edge vision transformer architecture. We demonstrate that our sophisticated denoiser can consistently improve the quality and temporal stability of MLT renderings with difficult light paths. Our method is efficient and scalable for complex scene renderings that require high sample counts.
Physically based differentiable rendering allows an accurate light transport simulation to be differentiated with respect to the rendering input, i.e., scene parameters, and it enables inferring scene parameters from target images, e.g., photos or synthetic images, via an iterative optimization. However, this inverse Monte Carlo rendering inherits the fundamental problem of Monte Carlo integration, i.e., noise, resulting in slow optimization convergence. An appealing approach to addressing such noise is exploiting an image denoiser to improve optimization convergence. Unfortunately, the direct adoption of existing image denoisers designed for ordinary rendering scenarios can drive the optimization into undesirable local minima due to denoising bias. This motivates us to formulate a new image denoiser specialized for inverse rendering. Unlike existing image denoisers, we conduct our denoising by considering the target images, i.e., information specific to inverse rendering. For our target-aware denoising, we determine our denoising weights via a linear regression technique using the target. We demonstrate that our denoiser enables inverse rendering optimization to infer scene parameters robustly through a diverse set of tests.
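One simple instance of setting denoising weights by linear regression against a guide image, in the spirit of the target-aware weighting described above, is a windowed guided-filter-style regression; this is a hedged sketch, not the paper's estimator, and the window radius and regularizer are assumptions.

```python
# A minimal sketch: per-pixel windowed linear regression of the noisy rendering
# against a guide image, q = a * guide + b, used as a denoised estimate.
import numpy as np

def regression_denoise(noisy, guide, radius=4, eps=1e-3):
    """noisy, guide: (H, W) grayscale arrays; fit q = a * guide + b per window."""
    H, W = noisy.shape
    out = np.zeros_like(noisy)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            g = guide[y0:y1, x0:x1].ravel()
            p = noisy[y0:y1, x0:x1].ravel()
            cov = ((g - g.mean()) * (p - p.mean())).mean()
            a = cov / (g.var() + eps)          # regression slope
            b = p.mean() - a * g.mean()        # regression intercept
            out[y, x] = a * guide[y, x] + b    # denoised estimate at the center pixel
    return out
```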
We propose a real-time path guiding method, Voxel Path Guiding (VXPG), that significantly improves fitting efficiency under a limited sampling budget. Our key idea is to use a spatial irradiance voxel data structure across all shading points to guide the location of path vertices. For each frame, we first populate the voxel data structure with irradiance and geometry information. To sample from the data structure for a shading point, we need to select a voxel with high contribution to that point. To importance sample the voxels while taking visibility into consideration, we adapt techniques from offline many-lights rendering by clustering pairs of shading points and voxels. Finally, we unbiasedly sample within the selected voxel while taking the geometry inside the voxel into consideration. Our experiments show that VXPG achieves significantly lower perceptual error compared to other real-time path guiding and virtual point light methods under equal-time comparison. Furthermore, our method does not rely on temporal information, but can be used together with temporal reuse sampling techniques such as ReSTIR to further improve sampling efficiency.
Finding valid light paths that involve specular vertices in Monte Carlo rendering requires solving many non-linear, transcendental equations in high-dimensional space. Existing approaches heavily rely on Newton iterations in path space, which are limited to obtaining at most a single solution each time and easily diverge when initialized with improper seeds.
We propose specular polynomials, a Newton iteration-free methodology for finding a complete set of admissible specular paths connecting two arbitrary endpoints in a scene. The core is a reformulation of specular constraints into polynomial systems, which makes it possible to reduce the task to a univariate root-finding problem. We first derive bivariate systems utilizing rational coordinate mapping between the coordinates of consecutive vertices. Subsequently, we adopt the hidden variable resultant method for variable elimination, converting the problem into finding zeros of the determinant of univariate matrix polynomials. This can be effectively solved through Laplacian expansion for one bounce and a bisection solver for more bounces.
Our solution is generic, completely deterministic, accurate for the case of one bounce, and GPU-friendly. We develop efficient CPU and GPU implementations and apply them to challenging glints and caustic rendering. Experiments on various scenarios demonstrate the superiority of specular polynomial-based solutions compared to Newton iteration-based counterparts. Our implementation is available at https://github.com/mollnn/spoly.
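The final root-finding step described above, locating zeros of the determinant of a univariate matrix polynomial by bisection over bracketed sign changes, can be sketched as follows; the sampling density and tolerance are illustrative assumptions.

```python
# A minimal sketch: find zeros of det(M(t)) on [0, 1], where M(t) is a matrix
# polynomial, by sampling for sign changes and refining each bracket with bisection.
import numpy as np

def det_poly(coeffs, t):
    """coeffs: list of (n, n) matrices; M(t) = sum_k coeffs[k] * t**k."""
    M = sum(c * (t ** k) for k, c in enumerate(coeffs))
    return np.linalg.det(M)

def find_roots(coeffs, t0=0.0, t1=1.0, samples=256, tol=1e-10):
    ts = np.linspace(t0, t1, samples)
    vals = np.array([det_poly(coeffs, t) for t in ts])
    roots = []
    for a, b, fa, fb in zip(ts[:-1], ts[1:], vals[:-1], vals[1:]):
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0.0:                      # bracketed sign change
            while b - a > tol:
                m = 0.5 * (a + b)
                fm = det_poly(coeffs, m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            roots.append(0.5 * (a + b))
    return roots
```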
The objective of polarization rendering is to simulate the interaction of light with materials exhibiting polarization-dependent behavior. However, integrating polarization into rendering is challenging and increases computational costs significantly. The primary difficulty lies in efficiently modeling and computing the complex reflection phenomena associated with polarized light. Specifically, frequency-domain analysis, essential for efficient environment lighting and storage of complex light interactions, is lacking. To efficiently simulate and reproduce polarized light interactions using frequency-domain techniques, we address the challenge of maintaining continuity in polarized light transport represented by Stokes vectors within angular domains. The conventional spherical harmonics method cannot effectively handle continuity and rotation invariance for Stokes vectors. To overcome this, we develop a new method called polarized spherical harmonics (PSH) based on spin-weighted spherical harmonics theory. Our method provides a rotation-invariant representation of Stokes vector fields. Furthermore, we introduce frequency-domain formulations of polarized rendering equations and spherical convolution based on PSH. We first define spherical convolution on Stokes vector fields in the angular domain, which also enables efficient computation of polarized light transport as nearly an entry-wise product in the frequency domain. Our frequency-domain formulation, including spherical convolution, leads to the first real-time polarization rendering technique under polarized environmental illumination, named precomputed polarized radiance transfer, built on our polarized spherical harmonics. Results demonstrate that our method can effectively and accurately simulate and reproduce polarized light interactions in complex reflection phenomena, including polarized environmental illumination and soft shadows.
Genetic studies indicate that more than 50% of women are genetically tetrachromatic, expressing four distinct types of color photoreceptors (cone cells) in the retina. At least one functional tetrachromat has been identified in laboratory tests. We hypothesize that there is a large latent group in the population capable of fundamentally richer color experience, but we are not yet aware of this group because of a lack of tetrachromatic colors in the visual environment. This paper develops theory and engineering practice for fabricating tetrachromatic colors and potentially identifying tetrachromatic color vision in the wild. First, we apply general d-dimensional color theory to derive and compute all the key color structures of human tetrachromacy for the first time, including its 4D space of possible object colors, its 3D space of chromaticities, and a predicted 2D sphere of tetrachromatic hues. We compare this predicted hue sphere to the familiar hue circle of trichromatic color, extending the theory to predict how the higher-dimensional topology produces an expanded color experience for tetrachromats. Second, we derive the four reflectance functions for the ideal tetrachromatic inkset, analogous to the well-known CMY printing basis for trichromacy. Third, we develop a method for prototyping tetrachromatic printers using a library of fountain pen inks and a multi-pass inkjet printing platform. Fourth, we generalize existing color tests - sensitive hue ordering tests and rapid isochromatic plate screening tests - to higher-dimensional vision, and prototype variants of these tests for identifying and characterizing tetrachromacy in the wild.
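The 4D tetrachromatic color of a reflectance is obtained, as in trichromacy, by integrating the reflectance against an illuminant and each of the four cone sensitivities. The sketch below uses Gaussian stand-ins for the cone sensitivities and a flat illuminant, which are assumptions for illustration only.

```python
# A minimal sketch: each of the four color coordinates is the integral of
# reflectance x illuminant x cone sensitivity over the visible spectrum.
import numpy as np

wavelengths = np.linspace(400.0, 700.0, 301)                 # nm
dlam = wavelengths[1] - wavelengths[0]
illuminant = np.ones_like(wavelengths)                       # flat illuminant (assumption)
cone_peaks = [420.0, 500.0, 535.0, 560.0]                    # four cone classes (assumption)
cones = np.stack([np.exp(-0.5 * ((wavelengths - p) / 30.0) ** 2) for p in cone_peaks])

def tetra_color(reflectance):
    """reflectance sampled on `wavelengths`; returns a 4-vector of cone responses."""
    return (cones * illuminant * reflectance).sum(axis=1) * dlam

# Example: a reflectance ramp favoring long wavelengths.
print(tetra_color((wavelengths - 400.0) / 300.0))
```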
ColorVideoVDP is a video and image quality metric that models spatial and temporal aspects of vision for both luminance and color. The metric is built on novel psychophysical models of chromatic spatiotemporal contrast sensitivity and cross-channel contrast masking. It accounts for the viewing conditions as well as the geometric and photometric characteristics of the display. It was trained to predict common video-streaming distortions (e.g., video compression, rescaling, and transmission errors) and also 8 new distortion types related to AR/VR displays (e.g., light source and waveguide non-uniformities). To address the latter application, we collected our novel XR-Display-Artifact-Video quality dataset (XR-DAVID), comprising 336 distorted videos. Extensive testing on XR-DAVID, as well as several datasets from the literature, indicates a significant gain in prediction performance compared to existing metrics. ColorVideoVDP opens the doors to many novel applications that require the joint automated spatiotemporal assessment of luminance and color distortions, including video streaming, display specification and design, visual comparison of results, and perceptually-guided quality optimization. The code for the metric can be found at https://github.com/gfxdisp/ColorVideoVDP.
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. Our project webpage is available at https://analogist2d.github.io.
The real world exhibits rich structure and detail across many scales of observation. It is difficult, however, to capture and represent a broad spectrum of scales using ordinary images. We devise a novel paradigm for learning a representation that captures an orders-of-magnitude variety of scales from an unstructured collection of ordinary images. We treat this collection as a distribution of scale-space slices to be learned using adversarial training, and additionally enforce coherency across slices. Our approach relies on a multiscale generator with carefully injected procedural frequency content, which allows us to interactively explore the emerging continuous scale space. Training across vastly different scales poses challenges regarding stability, which we tackle using a supervision scheme that involves careful sampling of scales. We show that our generator can be used as a multiscale generative model, and for reconstructions of scale spaces from unstructured patches. Significantly outperforming the state of the art, we demonstrate zoom-in factors of up to 256x with high quality and scale consistency.
Computer Graphics has a long history in the design of effective algorithms for handling contact and friction between solid objects. For the sake of simplicity and versatility, most methods rely on low-order primitives such as line segments or triangles, both for the detection and the response stages. In this paper we carefully analyse, in the case of fibre systems, the impact of such choices on the retrieved contact forces. We highlight the presence of artifacts in the force response that are tightly related to the low-order geometry used for contact detection. Our analysis draws upon thorough comparisons between the high-order super-helix model and the low-order discrete elastic rod model. These reveal that when coupled to a low-order, segment-based detection scheme, both models yield spurious jumps in the contact force profile. Moreover, these artifacts are shown to be all the more visible as the geometry of fibres at contact is curved. In order to remove such artifacts we develop an accurate high-order detection scheme between two smooth curves, which relies on an efficient adaptive pruning strategy. We use this algorithm to detect contact between super-helices at high precision, allowing us to recover, in the range of wavy to highly curly fibres, much smoother force profiles during sliding motion than with a classical segment-based strategy. Furthermore, we show that our approach offers better scaling properties in terms of efficiency vs. precision compared to segment-based approaches, making it attractive for applications where accurate and reliable forces are desired. Finally, we demonstrate the robustness and accuracy of our fully high-order approach on a challenging hair combing scenario.
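The adaptive pruning idea for detecting proximity between two smooth parametric curves can be sketched as a recursive interval subdivision that discards interval pairs whose bounds cannot come within the contact radius. The sampled bounding boxes, subdivision depth, and inflation rule below are illustrative assumptions; a production implementation would use conservative bounds such as control-polygon hulls.

```python
# A minimal sketch of adaptive pruning for curve-curve contact detection.
import numpy as np

def interval_box(curve, a, b, n=8, pad=0.0):
    # Approximate bounding box from samples; real code would use conservative bounds.
    pts = np.array([curve(t) for t in np.linspace(a, b, n)])
    return pts.min(axis=0) - pad, pts.max(axis=0) + pad

def boxes_overlap(lo1, hi1, lo2, hi2):
    return np.all(lo1 <= hi2) and np.all(lo2 <= hi1)

def close_pairs(c1, c2, radius, i1=(0.0, 1.0), i2=(0.0, 1.0), depth=0, max_depth=10):
    lo1, hi1 = interval_box(c1, *i1, pad=radius)   # inflate by the contact radius
    lo2, hi2 = interval_box(c2, *i2)
    if not boxes_overlap(lo1, hi1, lo2, hi2):
        return []                                   # prune this pair of intervals
    if depth == max_depth:
        return [(i1, i2)]                           # candidate contact region
    m1, m2 = 0.5 * sum(i1), 0.5 * sum(i2)
    out = []
    for s1 in ((i1[0], m1), (m1, i1[1])):
        for s2 in ((i2[0], m2), (m2, i2[1])):
            out += close_pairs(c1, c2, radius, s1, s2, depth + 1, max_depth)
    return out
```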
We present fVDB, a novel GPU-optimized framework for deep learning on large-scale 3D data. fVDB provides a complete set of differentiable primitives to build deep learning architectures for common tasks in 3D learning such as convolution, pooling, attention, ray-tracing, meshing, etc. fVDB simultaneously provides a much larger feature set (primitives and operators) than established frameworks with no loss in efficiency: our operators match or exceed the performance of other frameworks with narrower scope. Furthermore, fVDB can process datasets with much larger footprint and spatial resolution than prior works, while providing a competitive memory footprint on small inputs. To achieve this combination of versatility and performance, fVDB relies on a single novel VDB index grid acceleration structure paired with several key innovations including GPU accelerated sparse grid construction, convolution using tensorcores, fast ray tracing kernels using a Hierarchical Digital Differential Analyzer algorithm (HDDA), and jagged tensors. Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines, and we demonstrate its effectiveness on a number of representative tasks such as large-scale point-cloud segmentation, high resolution 3D generative modeling, unbounded scale Neural Radiance Fields, and large-scale point cloud reconstruction.
Gaussian scale spaces are a cornerstone of signal representation and processing, with applications in filtering, multiscale analysis, anti-aliasing, and many more. However, obtaining such a scale space is costly and cumbersome, in particular for continuous representations such as neural fields. We present an efficient and lightweight method to learn the fully continuous, anisotropic Gaussian scale space of an arbitrary signal. Based on Fourier feature modulation and Lipschitz bounding, our approach is trained self-supervised, i.e., training does not require any manual filtering. Our neural Gaussian scale-space fields faithfully capture multiscale representations across a broad range of modalities, and support a diverse set of applications. These include images, geometry, light-stage data, texture anti-aliasing, and multiscale optimization.
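One way Fourier-feature modulation can realize a Gaussian scale space: filtering a sinusoid sin(2πb·x) with a Gaussian of standard deviation σ scales it by exp(-2π²σ²|b|²), so attenuating each feature by that factor corresponds to blurring the encoded signal. The random frequency matrix and the downstream network that would consume these features are assumptions in this sketch.

```python
# A minimal sketch of scale-conditioned Fourier features via per-frequency damping.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(64, 2))          # random Fourier frequencies (assumption)

def fourier_features(x, sigma=0.0):
    """x: (N, 2) coordinates; sigma: standard deviation of the Gaussian filter."""
    phases = 2.0 * np.pi * x @ B.T                               # (N, 64)
    damping = np.exp(-0.5 * (sigma ** 2) * (2.0 * np.pi) ** 2
                     * np.sum(B ** 2, axis=1))                   # per-frequency attenuation
    return np.concatenate([damping * np.sin(phases),
                           damping * np.cos(phases)], axis=-1)
```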
Recent advances in generative imagery have brought forth outpainting and inpainting models that can produce high-quality, plausible image content in unknown regions. However, the content these models hallucinate is necessarily inauthentic, since they are unaware of the true scene. In this work, we propose RealFill, a novel generative approach for image completion that fills in missing regions of an image with the content that should have been there. RealFill is a generative inpainting model that is personalized using only a few reference images of a scene. These reference images do not have to be aligned with the target image, and can be taken with drastically varying viewpoints, lighting conditions, camera apertures, or image styles. Once personalized, RealFill is able to complete a target image with visually compelling contents that are faithful to the original scene. We evaluate RealFill on a new image completion benchmark that covers a set of diverse and challenging scenarios, and find that it outperforms existing approaches by a large margin. Project page: https://realfill.github.io.
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics. To address this challenge, we develop a generative retrieval framework based on a large language model. This framework efficiently retrieves suitable semantic gesture candidates from a motion library in response to the input speech. To construct this motion library, we summarize a comprehensive list of commonly used semantic gestures based on findings in linguistics, and we collect a high-quality motion dataset encompassing both body and hand movements. We also design a novel GPT-based model with strong generalization capabilities to audio, capable of generating high-quality gestures that match the rhythm of speech. Furthermore, we propose a semantic alignment mechanism to efficiently align the retrieved semantic gestures with the GPT's output, ensuring the naturalness of the final animation. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit, as evidenced by a comprehensive collection of examples. User studies confirm the quality and human-likeness of our results, and show that our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin. We will release the code and dataset for academic research.
Procedural animation has seen widespread use in the design of expressive walking gaits for virtual characters. While similar tools could breathe life into robotic characters, existing techniques are largely unaware of the kinematic and dynamic constraints imposed by physical robots. In this paper, we propose a system for the artist-directed authoring of stylized bipedal walking gaits, tailored for execution on robotic characters. The artist interfaces with an interactive editing tool that generates the desired character motion in realtime, either on the physical or simulated robot, using a model-based control stack. Each walking style is encoded as a set of sample parameters which are translated into whole-body reference trajectories using the proposed procedural animation technique. In order to generalize the stylized gait over a continuous range of input velocities, we employ a phase-space blending strategy that interpolates a set of example walk cycles authored by the animator while preserving contact constraints. To demonstrate the utility of our approach, we animate gaits for a custom, free-walking robotic character, and show, with two additional in-simulation examples, how our procedural animation technique generalizes to bipeds with different degrees of freedom, proportions, and mass distributions.
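The phase-space blending described above can be sketched as two nested interpolations, within an authored walk cycle by gait phase and between cycles by commanded velocity; the pose representation and the purely linear blend below are simplifying assumptions, and contact-preserving alignment is omitted.

```python
# A minimal sketch of blending example walk cycles over a continuous velocity range.
import numpy as np

def sample_cycle(cycle, phase):
    """cycle: (K, D) poses over one gait period; phase in [0, 1)."""
    k = phase * len(cycle)
    i0, a = int(k) % len(cycle), k - int(k)
    i1 = (i0 + 1) % len(cycle)
    return (1.0 - a) * cycle[i0] + a * cycle[i1]

def blended_pose(cycles, velocities, v, phase):
    """cycles: list of (K, D) arrays authored at increasing `velocities` (array)."""
    i = int(np.clip(np.searchsorted(velocities, v) - 1, 0, len(cycles) - 2))
    w = np.clip((v - velocities[i]) / (velocities[i + 1] - velocities[i]), 0.0, 1.0)
    return (1.0 - w) * sample_cycle(cycles[i], phase) + w * sample_cycle(cycles[i + 1], phase)
```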
As a natural extension to the harmonic coordinates, the biharmonic coordinates have been found superior for planar shape and image manipulation with an enriched deformation space. However, the 3D biharmonic coordinates and their derivatives have remained unexplored. In this work, we derive closed-form expressions for biharmonic coordinates and their derivatives for 3D triangular cages. The core of our derivation lies in computing the closed-form expressions for the integral of the Euclidean distance over a triangle and its derivatives. The derived 3D biharmonic coordinates not only fill a missing component in methods of generalized barycentric coordinates but also pave the way for various interesting applications in practice, including producing a family of biharmonic deformations, solving variational shape deformations, and even unlocking the closed-form expressions for recently-introduced Somigliana coordinates for both fast and accurate evaluations.
In this paper, we introduce a novel approach for computing conformal deformations in ℝ³ while minimizing curvature-based energies. Curvature-based energies serve as fundamental tools in geometry processing, essential for tasks such as surface fairing, deformation, and approximation using developable or cone metric surfaces. However, accurately computing the geometric embedding, especially for the latter, has been a challenging endeavor. The complexity arises from inherent numerical instabilities in curvature estimation and the intricate nature of differentiating these energies. To address these challenges, we concentrate on conformal deformations, leveraging the curvature tensor as the primary variable in our model. This strategic choice renders curvature-based energies easily applicable, mitigating previous manipulation difficulties. Our key contribution lies in identifying a previously unknown integrability condition that establishes a connection between conformal deformations and changes in curvature. We use this insight to deform surfaces of arbitrary genus, aiming to minimize bending energies or prescribe Gaussian curvature while adhering to positional constraints.
This paper develops a shape space framework for collision-aware geometric modeling, where basic geometric operations automatically avoid inter-penetration. Shape spaces are a powerful tool for surface modeling, shape analysis, nonrigid motion planning, and animation, but past formulations permit nonphysical intersections. Our framework augments an existing shape space using a repulsive energy such that collision avoidance becomes a first-class property, encoded in the Riemannian metric itself. In turn, tasks like intersection-free shape interpolation or motion extrapolation amount to simply computing geodesic paths via standard numerical algorithms. To make optimization practical, we develop an adaptive collision penalty that prevents mesh self-intersection, and converges to a meaningful limit energy under refinement. The final algorithms apply to any category of shape, and do not require a dataset of examples, training, rigging, nor any other prior information. For instance, to interpolate between two shapes we need only a single pair of meshes with the same connectivity. We evaluate our method on a variety of challenging examples from modeling and animation.
While conventional cameras offer versatility for applications ranging from amateur photography to autonomous driving, computational cameras allow for domain-specific adaptation. Cameras with co-designed optics and image processing algorithms enable high-dynamic-range image recovery, depth estimation, and hyperspectral imaging through optically encoding scene information that is otherwise undetected by conventional cameras. However, this optical encoding creates a challenging inverse reconstruction problem for conventional image recovery, and often lowers the overall photographic quality. Thus, computational cameras with domain-specific optics have only been adopted in a few specialized applications where the captured information cannot be acquired in other ways. In this work, we investigate a method that combines two optical systems into one to tackle this challenge. We split the aperture of a conventional camera into two halves: one which applies an application-specific modulation to the incident light via a diffractive optical element to produce a coded image capture, and one which applies no modulation to produce a conventional image capture. Co-designing the phase modulation of the split aperture with a dual-pixel sensor allows us to simultaneously capture these coded and uncoded images without increasing the physical or computational footprint. With an uncoded conventional image alongside the optically coded image in hand, we investigate image reconstruction methods that are conditioned on the conventional image, making it possible to eliminate artifacts and compute costs that existing methods struggle with. We assess the proposed method with 2-in-1 cameras for optical high-dynamic-range reconstruction, monocular depth estimation, and hyperspectral imaging, comparing favorably to all tested methods in all applications.
Translating motions from a real user onto a virtual embodied avatar is a key challenge for character animation in the metaverse. In this work, we present a novel generative framework that enables mapping from a set of sparse sensor signals to a full body avatar motion in real-time while faithfully preserving the motion context of the user. In contrast to existing techniques that require training a motion prior and its mapping from control to motion separately, our framework is able to learn the motion manifold as well as how to sample from it at the same time in an end-to-end manner. To achieve that, we introduce a technique called codebook matching which matches the probability distribution between two categorical codebooks for the inputs and outputs for synthesizing the character motions. We demonstrate this technique can successfully handle ambiguity in motion generation and produce high quality character controllers from unstructured motion capture data. Our method is especially useful for interactive applications like virtual reality or video games where high accuracy and responsiveness are needed.
Real-time character control is an essential component for interactive experiences, with a broad range of applications, including physics simulations, video games, and virtual reality. The success of diffusion models for image synthesis has led to the use of these models for motion synthesis. However, the majority of these motion diffusion models are primarily designed for offline applications, where space-time models are used to synthesize an entire sequence of frames simultaneously with a pre-specified length. To enable real-time motion synthesis with diffusion models that allow time-varying controls, we propose A-MDM (Auto-regressive Motion Diffusion Model). Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on the previous frame. Despite its streamlined network architecture, which uses simple MLPs, our framework is capable of generating diverse, long-horizon, and high-fidelity motion sequences. Furthermore, we introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning. These techniques enable a pre-trained A-MDM to be efficiently adapted for a variety of new downstream tasks. We conduct a comprehensive suite of experiments to demonstrate the effectiveness of A-MDM, and compare its performance against state-of-the-art auto-regressive methods.
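An auto-regressive rollout with a per-frame DDPM-style reverse process can be sketched as below: every new frame is denoised from Gaussian noise conditioned on the previous frame. The `denoiser` network, its conditioning interface, and the noise schedule are illustrative assumptions, not A-MDM's released implementation.

```python
# A minimal sketch of auto-regressive motion generation with per-frame reverse diffusion.
import torch

def rollout(denoiser, first_pose, num_frames, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    frames, prev = [first_pose], first_pose
    for _ in range(num_frames):
        x = torch.randn_like(prev)                           # start each frame from pure noise
        for t in reversed(range(steps)):
            eps = denoiser(x, prev, torch.tensor(t))         # predict noise, conditioned on prev frame
            x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        frames.append(x)
        prev = x
    return torch.stack(frames)
```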
In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant motion representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate the versatility of MoConVQ through several applications: universal tracking control from various motion sources, interactive character control with latent motion representations using supervised learning, physics-based motion generation from natural language descriptions using the GPT framework, and, most interestingly, seamless integration with large language models (LLMs) with in-context learning to tackle complex and abstract tasks.
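The discrete representation rests on the standard VQ-VAE quantization step, which can be sketched as snapping each encoder output to its nearest codebook entry with a straight-through gradient; codebook size and dimensionality are assumptions.

```python
# A minimal sketch of vector quantization with a straight-through estimator.
import torch

def quantize(z, codebook):
    """z: (N, D) encoder outputs; codebook: (K, D) learnable embeddings."""
    dists = torch.cdist(z, codebook)                 # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                        # discrete codes
    z_q = codebook[idx]
    z_q = z + (z_q - z).detach()                     # straight-through gradient to the encoder
    return z_q, idx
```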
Modeling high-resolution terrains is a perennial challenge in the creation of virtual worlds. In this paper, we focus on the amplification of a low-resolution input terrain into a high-resolution, hydrologically consistent terrain featuring complex patterns, using a multi-scale approach. Our framework combines the best of both worlds, relying on physics-inspired erosion models producing consistent erosion landmarks and introducing control at different scales, thus bridging the gap between physics-based erosion simulations and multi-scale procedural modeling. The method uses a fast and accurate approximation of different simulations, including thermal erosion, stream-power erosion, and deposition, performed at different scales to obtain a range of effects. Our approach provides landscape designers with tools for amplifying mountain ranges and valleys with consistent details.
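A textbook thermal-erosion step of the kind approximated above moves material from a cell to lower neighbors whenever the local slope exceeds a talus threshold; the talus angle, transfer rate, and 4-connected neighborhood below are illustrative assumptions rather than the paper's calibrated model.

```python
# A minimal sketch of one thermal-erosion step on a heightfield (mass conserving).
import numpy as np

def thermal_erosion_step(h, talus=0.01, rate=0.5):
    out = h.copy()
    H, W = h.shape
    for y in range(H):
        for x in range(W):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W:
                    drop = h[y, x] - h[ny, nx]
                    if drop > talus:                       # slope exceeds the repose threshold
                        move = rate * (drop - talus) / 4.0
                        out[y, x] -= move                  # remove material here...
                        out[ny, nx] += move                # ...and deposit it downhill
    return out
```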
Generating realistic models of trees and plants is a complex problem because of the vast variety of shapes trees can form. Procedural modeling algorithms are popular for defining branching structures, and their expressive power has steadily increased by incorporating more biological findings. Most existing methods focus on defining the branching structure of trees based on skeletal graphs, while the surface mesh of branches is most commonly defined as simple cylinders. One critical open problem is defining and controlling the complex details observed in real trees. This paper aims to advance tree modeling by proposing a strand-based volumetric representation for tree models. Strands are fixed-size volumetric pipes that define the branching structure. By leveraging strands, our approach captures the lateral development of trees. We combine the strands with a novel branch development formulation that allows us to locally inject vigor and reshape the tree model. Moreover, we define a set of editing operators for tree primary and lateral development that enables users to interactively generate complex tree models with unprecedented detail and minimal effort.
We introduce a fast, robust, and user-controllable algorithm to generate surface-filling curves. We compute these curves through the gradient flow of a simple sparse energy, making our method several orders of magnitude faster than previous works. Our algorithm makes minimal assumptions on the topology and resolution of the input surface, achieving improved robustness. Our framework provides tuneable parameters that guide the shape of the output curve, making it ideal for interactive design applications.
Neural Radiance Fields (NeRF) achieves unprecedented performance in novel view synthesis by utilizing multi-view consistency. When capturing multiple inputs, image signal processing (ISP) in modern cameras will independently enhance them through exposure adjustment, color correction, local tone mapping, etc. While this processing greatly improves image quality, it often breaks the multi-view consistency assumption, leading to "floaters" in the reconstructed radiance fields. To address this concern without compromising visual aesthetics, we aim to first disentangle the enhancement by ISP at the NeRF training stage and re-apply user-desired enhancements to the reconstructed radiance fields at the finishing stage. Furthermore, to make the re-applied enhancements consistent between novel views, we need to perform image signal processing in 3D space (i.e., "3D ISP"). For this goal, we adopt the bilateral grid, a locally-affine model, as a generalized representation of ISP processing. Specifically, we optimize per-view 3D bilateral grids with radiance fields to approximate the effects of camera pipelines for each input view. To achieve user-adjustable 3D finishing, we propose to learn a low-rank 4D bilateral grid from a given single-view edit, lifting photo enhancements to the whole 3D scene. We demonstrate that our approach can boost the visual quality of novel view synthesis by effectively removing floaters and performing enhancements from user retouching. The source code and our data are available at: https://bilarfpro.github.io.
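The locally-affine bilateral grid at the heart of this representation can be sketched as a 3D lattice over image position and luminance whose cells store affine color transforms; applying the grid means looking up a transform per pixel and applying it to the pixel color. The grid resolution and the nearest-neighbor (rather than trilinear) lookup below are simplifying assumptions.

```python
# A minimal sketch of applying a locally-affine bilateral grid to an image.
import numpy as np

def apply_bilateral_grid(img, grid):
    """img: (H, W, 3) in [0, 1]; grid: (Gy, Gx, Gz, 3, 4) of affine color transforms."""
    H, W, _ = img.shape
    Gy, Gx, Gz = grid.shape[:3]
    lum = img.mean(axis=-1)                                          # guide channel
    ys = np.minimum(np.arange(H)[:, None] * Gy // H, Gy - 1).repeat(W, axis=1)
    xs = np.minimum(np.arange(W)[None, :] * Gx // W, Gx - 1).repeat(H, axis=0)
    zs = np.minimum((lum * Gz).astype(int), Gz - 1)
    A = grid[ys, xs, zs]                                             # (H, W, 3, 4) per-pixel affine
    homo = np.concatenate([img, np.ones((H, W, 1))], axis=-1)        # homogeneous color
    return np.einsum('hwij,hwj->hwi', A, homo)
```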
While surface-based view synthesis algorithms are appealing due to their low computational requirements, they often struggle to reproduce thin structures. In contrast, more expensive methods that model the scene's geometry as a volumetric density field (e.g. NeRF) excel at reconstructing fine geometric detail. However, density fields often represent geometry in a "fuzzy" manner, which hinders exact localization of the surface. In this work, we modify density fields to encourage them to converge towards surfaces, without compromising their ability to reconstruct thin structures. First, we employ a discrete opacity grid representation instead of a continuous density field, which allows opacity values to discontinuously transition from zero to one at the surface. Second, we anti-alias by casting multiple rays per pixel, which allows occlusion boundaries and subpixel structures to be modelled without using semi-transparent voxels. Third, we minimize the binary entropy of the opacity values, which facilitates the extraction of surface geometry by encouraging opacity values to binarize towards the end of training. Lastly, we develop a fusion-based meshing strategy followed by mesh simplification and appearance model fitting. The compact meshes produced by our model can be rendered in real-time on mobile devices and achieve significantly higher view synthesis quality compared to existing mesh-based approaches. Our interactive webdemo is available at https://binary-opacity-grid.github.io.
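The binary-entropy regularizer on opacities takes its minimum when every opacity is near 0 or 1, which is what drives the field toward a hard surface; a minimal sketch follows, with the clamping epsilon as an assumption.

```python
# A minimal sketch of a binary-entropy loss that pushes opacities toward 0 or 1.
import torch

def binary_entropy_loss(opacity, eps=1e-6):
    o = opacity.clamp(eps, 1.0 - eps)                    # avoid log(0)
    return -(o * torch.log(o) + (1.0 - o) * torch.log(1.0 - o)).mean()
```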
Reconstructing objects with realistic materials from multi-view images is challenging because it is highly ill-posed. Although neural reconstruction approaches have exhibited impressive reconstruction ability, they are designed for objects with specific materials (e.g., diffuse or specular materials). To this end, we propose a novel framework for robust geometry and material reconstruction, where the geometry is expressed with the implicit signed distance field (SDF) encoded by a tensorial representation, namely TensoSDF. At the core of our method is the roughness-aware incorporation of the radiance and reflectance fields, which enables a robust reconstruction of objects with arbitrary reflective materials. Furthermore, the tensorial representation enhances geometry details in the reconstructed surface and reduces the training time. Finally, we estimate the materials using an explicit mesh for efficient intersection computation and an implicit SDF for accurate representation. Consequently, our method can achieve more robust geometry reconstruction, outperform previous works in terms of relighting quality, and reduce training time by 50% and inference time by 70%. Codes and datasets are available at https://github.com/Riga2/TensoSDF.
We present a sensor that can measure light and wirelessly communicate the measurement, without the need for an external power source or a battery. Our sensor, called cricket, harvests energy from incident light. It is asleep for most of the time and transmits a short and strong radio frequency chirp when its harvested energy reaches a specific level. The carrier frequency of each cricket is fixed and reveals its identity, and the duration between consecutive chirps is a measure of the incident light level. We have characterized the radiometric response function, signal-to-noise ratio and dynamic range of cricket. We have experimentally verified that cricket can be miniaturized at the expense of increasing the duration between chirps. We show that a cube with a cricket on each of its sides can be used to estimate the centroid of any complex illumination, which has value in applications such as solar tracking. We also demonstrate the use of crickets for creating untethered sensor arrays that can produce video and control lighting for energy conservation. Finally, we modified cricket's circuit to develop battery-free electronic sunglasses that can instantly adapt to environmental illumination.
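Since each chirp is emitted when a fixed amount of harvested energy accumulates, the incident light level is inversely proportional to the interval between consecutive chirps; the sketch below decodes relative irradiance from chirp timestamps, with the per-chirp energy constant as an uncalibrated assumption.

```python
# A minimal sketch of the receiver-side readout: light level ~ energy / chirp interval.
def light_level_from_chirps(chirp_times, energy_per_chirp=1.0):
    """chirp_times: sorted timestamps (seconds) of chirps from one cricket."""
    intervals = [b - a for a, b in zip(chirp_times[:-1], chirp_times[1:])]
    return [energy_per_chirp / dt for dt in intervals]   # relative irradiance per interval
```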
Illusion-knit fabrics reveal distinct patterns or images depending on the viewing angle. Artists have manually achieved this effect by exploiting "microgeometry," i.e., small differences in stitch heights. However, past work in computational 3D knitting does not model or exploit designs based on stitch height variation. This paper establishes a foundation for exploring illusion knitting in the context of computational design and fabrication. We observe that the design space is highly constrained, elucidate these constraints, and derive strategies for developing effective, machine-knittable illusion patterns. We partially automate these strategies in a new interactive design tool that reduces difficult patterning tasks to familiar image editing tasks. Illusion patterns also uncover new fabrication challenges regarding mixed colorwork and texture; we describe new algorithms for mitigating fabrication failures and ensuring high-quality knit results.
This paper presents a computational pipeline for creating personalized, physical LEGO® figurines from user-input portrait photos. The generated figurine is an assembly of coherently-connected LEGO® bricks detailed with uv-printed decals, capturing prominent features such as hairstyle, clothing style, and garment color, and also intricate details such as logos, text, and patterns. This task is non-trivial due to the substantial domain gap between unconstrained user photos and the stylistically-consistent LEGO® figurine models. To ensure assemble-ability by LEGO® bricks while capturing prominent features and intricate details, we design a three-stage pipeline: (i) we formulate a CLIP-guided retrieval approach to connect the domains of user photos and LEGO® figurines, then output physically-assemble-able LEGO® figurines with decals excluded; (ii) we then synthesize decals on the figurines via a symmetric U-Net architecture conditioned on appearance features extracted from user photos; and (iii) we next reproject and uv-print the decals on associated LEGO® bricks for physical model production. We evaluate the effectiveness of our method against eight hundred expert-designed figurines, using a comprehensive set of metrics, which include a novel GPT-4V-based evaluation metric, demonstrating superior performance of our method in visual quality and resemblance to input photos. Also, we show our method's robustness by generating LEGO® figurines from diverse inputs and physically fabricating and assembling several of them.
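The CLIP-guided retrieval in stage (i) can be sketched as embedding the user photo and renders of a figurine library with an off-the-shelf CLIP model and ranking by cosine similarity; the use of the public openai/CLIP package and raw image embeddings, rather than the paper's tuned features, is an assumption.

```python
# A minimal sketch of CLIP-based retrieval of the closest figurine to a user photo.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)        # unit-length embedding

def retrieve(photo_path, figurine_paths, top_k=5):
    query = embed(photo_path)
    library = torch.cat([embed(p) for p in figurine_paths])
    scores = (query @ library.T).squeeze(0)              # cosine similarities
    best = scores.topk(min(top_k, len(figurine_paths))).indices
    return [figurine_paths[i] for i in best]
```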