We introduce a computational pipeline for simulating and designing C-shells, a new class of planar-to-spatial deployable linkage structures. A C-shell is composed of curved flexible beams connected at rotational joints that can be assembled in a stress-free planar configuration. When actuated, the elastic beams deform and the assembly deploys towards the target 3D shape.
We propose two alternative computational design approaches for C-shells: (i) Forward exploration simulates the deployed shape from a planar beam layout provided by the user. Once a satisfactory overall shape is found, a subsequent design optimization adapts the beam geometry to reduce the elastic energy of the linkage while preserving the target shape. (ii) Inverse design is facilitated by a new geometric flattening method that takes a design surface as input and computes an initial layout of piecewise straight linkage beams. Our design optimization algorithm then calculates the smooth curved beams to best reproduce the target shape at minimal elastic energy.
We find that C-shells offer a rich space for design and show several studies that highlight new shape topologies that cannot be achieved with existing deployable linkage structures.
Kirchhoff-Love shells are commonly used in many branches of engineering, including in computer graphics, but have so far been simulated only under limited nonlinear material options. We derive the Kirchhoff-Love thin-shell mechanical energy for an arbitrary 3D volumetric hyperelastic material, including isotropic materials, anisotropic materials, and materials whose energy includes both even and odd powers of the principal stretches. We do this by starting with any 3D hyperelastic material and then analytically computing the corresponding thin-shell energy limit. This explicitly identifies and separates in-plane stretching and bending terms, and avoids numerical quadrature. Thus, in-plane stretching and bending are shown to originate from one and the same process (volumetric elasticity of thin objects), rather than from two separate processes as traditionally done in cloth simulation. Because we can simulate materials that include both even and odd powers of stretches, we can accommodate standard mesh distortion energies previously employed for 3D solid simulations, such as Symmetric ARAP and Co-rotational materials. We relate the terms of our energy to those of prior work on Kirchhoff-Love thin-shells in computer graphics that assumed small in-plane stretches, and demonstrate the visual difference due to the presence of our exact stretching and bending terms. Furthermore, our formulation allows us to categorize all distinct hyperelastic Kirchhoff-Love thin-shell energies. Specifically, we prove that for Kirchhoff-Love thin-shells, the space of all hyperelastic materials collapses to the space of two-dimensional hyperelastic materials. This observation enables us to create an interface for the design of thin-shell Kirchhoff-Love mechanical energies, which in turn enables us to create thin-shell materials that exhibit arbitrary stiffness profiles under large deformations.
In this paper, we address two limitations of dihedral-angle-based discrete bending (DAB) models: the indefiniteness of their energy Hessian and their vulnerability to geometric degeneracies. To tackle the indefiniteness issue, we present novel analytic expressions for the eigensystem of a DAB energy Hessian. Our expressions reveal that DAB models typically have four positive, four negative, and four zero eigenvalues. Using these expressions, we can efficiently and analytically project an indefinite DAB energy Hessian to a positive semi-definite one. To enhance the stability of DAB models at degenerate geometries, we propose rectifying their indefinite geometric stiffness matrix with orthotropic geometric stiffness matrices whose adaptive parameters are calculated from our analytic eigensystem. Among the twelve motion modes of a dihedral element, our resulting Hessian retains only the desirable bending modes, in contrast to the exact Hessian with the original geometric stiffness (which keeps undesirable altitude-changing modes), the Gauss-Newton approximation without geometric stiffness (which keeps all modes), and projected Hessians with inappropriate geometric stiffness (which keep none). Additionally, we suggest adjusting the compression stiffness according to Kirchhoff-Love thin-plate theory to avoid over-compression. Our method not only ensures positive semi-definiteness but also avoids the instability caused by large bending forces at degenerate geometries. To demonstrate the benefit of our approach, we show comparisons against existing methods on the simulation of cloth and thin plates in challenging examples.
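The analytic eigensystem described above replaces the generic numerical projection otherwise needed to make an indefinite element Hessian usable in Newton-type solvers. For reference only, here is a minimal numpy sketch of that generic eigenvalue-clamping projection for a 12x12 dihedral-element Hessian; it is not the paper's closed-form method, and the matrix size and tolerance are illustrative assumptions.

```python
import numpy as np

def project_psd_numeric(H, eps=0.0):
    # Generic eigenvalue-clamping projection of a 12x12 bending Hessian.
    # The paper replaces this numerical step with closed-form eigenpairs.
    H = 0.5 * (H + H.T)         # symmetrize to guard against round-off
    w, V = np.linalg.eigh(H)    # eigendecomposition of the symmetric Hessian
    w = np.maximum(w, eps)      # clamp negative eigenvalues
    return (V * w) @ V.T        # reassemble the PSD approximation
```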
Thin shell structures exhibit complex behaviors critical for modeling and design across wide-ranging applications. Capturing their mechanical response requires finely detailed, high-resolution meshes. Corresponding simulations for predicting equilibria with these meshes are expensive, whereas coarse-mesh simulations can be fast but generate unacceptable artifacts and inaccuracies. The recently proposed progressive simulation framework [Zhang et al. 2022] offers a promising avenue to address these limitations with consistent and progressively improving simulation over a hierarchy of increasingly higher-resolution models. Unfortunately, it is currently severely limited in application to meshes and shapes generated via Loop subdivision.
We propose Progressive Shells Quasistatics to extend progressive simulation to the high-fidelity modeling and design of all input shell (and plate) geometries with unstructured (as well as structured) triangle meshes. To do so, we construct a fine-to-coarse hierarchy with a novel nonlinear prolongation operator custom-suited for curved-surface simulation that is rest-shape preserving, supports complex curved boundaries, and enables the reconstruction of detailed geometries from coarse-level meshes. Then, to enable convergent, high-quality solutions with robust contact handling, we propose a new, safe, and efficient shape-preserving upsampling method that ensures non-intersection and strain limits during refinement. With these core contributions, Progressive Shells Quasistatics enables, for the first time, wide generality for progressive simulation, including support for arbitrary curved-shell geometries, progressive collision objects, curved boundaries, and unstructured triangle meshes - all while ensuring that preview and final solutions remain free of intersections. We demonstrate these features across a wide range of stress tests in which progressive simulation captures the wrinkling, folding, twisting, and buckling behaviors of frictionally contacting thin shells, with orders-of-magnitude speed-ups over direct fine-resolution simulation.
Motivated by humans' ability to adapt existing skills when learning new ones, this paper presents AdaptNet, an approach for modifying the latent space of existing policies so that new behaviors can be learned from similar tasks much more quickly than learning from scratch. Building on top of a given reinforcement learning controller, AdaptNet uses a two-tier hierarchy that augments the original state embedding to support modest changes in a behavior and further modifies the policy network layers to make more substantive changes. The technique is shown to be effective for adapting existing physics-based controllers to a wide range of new locomotion styles, new task targets, changes in character morphology, and extensive changes in environment. Furthermore, it exhibits a significant increase in learning efficiency, as indicated by greatly reduced training times compared to training from scratch or using other approaches that modify existing policies. Code is available at https://motion-lab.github.io/AdaptNet.
Recent advances in learning reusable motion priors have demonstrated their effectiveness in generating naturalistic behaviors. In this paper, we propose a new learning framework in this paradigm for controlling physics-based characters with improved motion quality and diversity over existing methods. The proposed method uses reinforcement learning (RL) to initially track and imitate life-like movements from unstructured motion clips using the discrete information bottleneck, as adopted in the Vector Quantized Variational AutoEncoder (VQ-VAE). This structure compresses the most relevant information from the motion clips into a compact yet informative latent space, i.e., a discrete space over vector quantized codes. By sampling codes in the space from a trained categorical prior distribution, high-quality life-like behaviors can be generated, similar to the usage of VQ-VAE in computer vision. Although this prior distribution can be trained with the supervision of the encoder's output, it follows the original motion clip distribution in the dataset and could lead to imbalanced behaviors in our setting. To address the issue, we further propose a technique named prior shifting to adjust the prior distribution using curiosity-driven RL. The resulting distribution is demonstrated to offer sufficient behavioral diversity and to significantly facilitate upper-level policy learning for downstream tasks. We conduct comprehensive experiments using humanoid characters on two challenging downstream tasks: sword-and-shield striking and a two-player boxing game. Our results demonstrate that the proposed framework is capable of controlling the character to perform high-quality movements in terms of behavioral strategies, diversity, and realism. Videos, code, and data are available at https://tencent-roboticsx.github.io/NCP/.
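For readers unfamiliar with the sampling step this abstract relies on, the following is a minimal sketch of drawing behavior latents from a categorical prior over vector-quantized codes. The codebook, prior, and dimensions are random placeholders, and the paper's prior-shifting technique is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

num_codes, code_dim = 512, 64
codebook = rng.normal(size=(num_codes, code_dim))   # stand-in for learned VQ codes
prior_logits = rng.normal(size=num_codes)           # stand-in for the learned categorical prior
prior = np.exp(prior_logits) / np.exp(prior_logits).sum()

def sample_latents(T):
    # Draw a sequence of discrete codes from the categorical prior and look up
    # their embeddings; a decoder/policy would map these latents to motions.
    idx = rng.choice(num_codes, size=T, p=prior)
    return codebook[idx]

z = sample_latents(T=30)   # (30, 64) latent sequence
```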
Differentiable physics simulation has shown its efficacy in inverse design problems. Given the pervasiveness of diverse interactions between fluids and solids in everyday life, a differentiable simulator for the inverse design of the motion of rigid objects under two-way fluid-rigid coupling is also in demand. There are two main challenges in developing a differentiable two-way fluid-solid coupling simulator for rigid body control tasks: the ubiquitous, discontinuous contacts in fluid-solid interactions, and the high computational cost of gradient formulation due to the large number of degrees of freedom (DoF) of fluid dynamics. In this work, we propose a novel differentiable SPH-based two-way fluid-rigid coupling simulator to address these challenges. Our aim is to provide a differentiable SPH simulator that incorporates a unified particle representation for both fluids and solids. However, naively differentiating the forward simulation of the particle system leads to gradient explosion. We investigate this instability in differentiating the SPH-based fluid-rigid coupling simulator and present a feasible gradient computation scheme that addresses its differentiability. In addition, we propose an efficient method to compute the gradient of the fluid-rigid coupling without incurring the high computational cost of differentiating the entire high-DoF fluid system. We show the efficacy, scalability, and extensibility of our method in various challenging rigid body control tasks with diverse fluid-rigid interactions and multi-rigid contacts, achieving up to an order of magnitude speedup in optimization over baseline methods in our experiments.
We present a set of operators to perform modifications, in particular collapses and splits, in volumetric cell complexes which are discretely embedded in a background mesh. Topological integrity and geometric embedding validity are carefully maintained. We apply these operators strategically to volumetric block decompositions, so-called T-meshes or base complexes, in the context of hexahedral mesh generation. This allows circumventing the expensive and unreliable global volumetric remapping step in the versatile meshing pipeline based on 3D integer-grid maps. In essence, we reduce this step to simpler local cube mapping problems, for which reliable solutions are available. As a consequence, the robustness of the mesh generation process is increased, especially when targeting coarse or block-structured hexahedral meshes. We furthermore extend this pipeline to support feature alignment constraints, and systematically respect these throughout, enabling the generation of meshes that align to points, curves, and surfaces of special interest, whether on the boundary or in the interior of the domain.
We present a numerically robust algorithm for computing the constrained Delaunay tetrahedrization (CDT) of a piecewise-linear complex, which has a 100% success rate on the 4408 valid models in the Thingi10k dataset.
We build on the underlying theory of the well-known tetgen software, but use a floating-point implementation based on indirect geometric predicates to implicitly represent Steiner points: this new approach dramatically simplifies the implementation, removing the need for ad-hoc tolerances in geometric operations. Our approach leads to a robust and parameter-free implementation, with an empirically manageable number of added Steiner points. Furthermore, our algorithm addresses a major gap in tetgen's theory which may lead to algorithmic failure on valid models, even when assuming perfect precision in the calculations.
Our output tetrahedrization conforms with the input geometry without approximations. We can further round our output to floating-point coordinates for downstream applications, which almost always results in valid floating-point meshes unless the input triangulation is very close to being degenerate.
We present a method for the generation of higher-order tetrahedral meshes. In contrast to previous methods, the curved tetrahedral elements are guaranteed to be free of degeneracies and inversions while conforming exactly to prescribed piecewise polynomial surfaces, such as domain boundaries or material interfaces. Arbitrary polynomial order is supported. Algorithmically, the polynomial input surfaces are first covered by a single layer of carefully constructed curved elements using a recursive refinement procedure that provably avoids degeneracies and inversions. These tetrahedral elements are designed such that the remaining space is bounded piecewise linearly. In this way, our method effectively reduces the curved meshing problem to the classical problem of linear mesh generation (for the remaining space).
The property of a surface being developable can be expressed in different equivalent ways: by vanishing Gauss curvature, or by the existence of isometric mappings to planar domains. Computational contributions to this topic range from special parametrizations to discrete-isometric mappings. However, so far a local criterion expressing the developability of general quad meshes has been lacking. In this paper, we propose a new and efficient discrete developability criterion that is applied to quad meshes equipped with vertex weights, and which is motivated by a well-known characterization in differential geometry, namely a rank-deficient second fundamental form. We assign contact elements to the faces of meshes and ruling vectors to the edges, which in combination yield a developability condition per face. Using standard optimization procedures, we are able to perform interactive design and developable lofting. The meshes we employ are combinatorially regular quad meshes with isolated singularities but are otherwise not required to follow any special curves on a developable surface. They are thus easily embedded into a design workflow involving standard operations like remeshing, trimming, and merging. An important feature is that we can directly derive a watertight, rational bi-quadratic spline surface from our meshes. Remarkably, it occurs as the limit of weighted Doo-Sabin subdivision, which acts in an interpolatory manner on contact elements.
Discrete surfaces with spherical faces are interesting from a simplified manufacturing viewpoint when compared to other doubly curved face shapes. Furthermore, by the nature of their definition, they are also appealing from the theoretical side, leading to a Möbius-invariant discrete surface theory. We therefore systematically describe so-called sphere meshes, with spherical faces and circular arcs as edges, on which the Möbius transformation group acts on all elements. Driven by aspects important for manufacturing, we provide the means to cluster spherical panels by their radii. We investigate the generation of sphere meshes which allow for a geometric support structure and characterize all such meshes with triangular combinatorics in terms of non-Euclidean geometries. We generate sphere meshes with hexagonal combinatorics by intersecting tangential spheres of a reference surface and let them evolve - guided by the surface curvature - to visually convex hexagons, even in negatively curved areas. Furthermore, we extend meshes with circular faces of all combinatorics to sphere meshes by filling their circles with suitable spherical caps, and provide a remeshing scheme to obtain quadrilateral sphere meshes with support structure from given sphere congruences. By broadening polyhedral meshes to sphere meshes, we exploit the additional degrees of freedom to minimize intersection angles of neighboring spheres, enabling the use of spherical panels that provide a softer perception of the overall surface.
Mechanical metamaterials enable customizing the elastic properties of physical objects by altering their fine-scale structure. A broad gamut of effective material properties can be produced even from a single fabrication material by optimizing the geometry of a periodic microstructure tiling. Past work has extensively studied the capabilities of microstructures in the small-displacement regime, where periodic homogenization of linear elasticity yields computationally efficient optimal design algorithms. However, many applications involve flexible structures undergoing large deformations for which the accuracy of linear elasticity rapidly deteriorates due to geometric nonlinearities. Design of microstructures at finite strains involves a massive increase in computation and is much less explored; no computational tool yet exists to design metamaterials emulating target hyperelastic laws over finite regions of strain space.
We make an initial step in this direction, developing algorithms to accelerate homogenization and metamaterial design for nonlinear elasticity and building a complete framework for the optimal design of planar metamaterials. Our nonlinear homogenization method works by efficiently constructing an accurate interpolant of a microstructure's deformation over a finite space of macroscopic strains likely to be endured by the metamaterial. From this interpolant, the homogenized energy density, stress, and tangent elasticity tensor describing the microstructure's effective properties can be inexpensively computed at any strain. Our design tool then fits the effective material properties to a target constitutive law over a region of strain space using a parametric shape optimization approach, producing a directly manufacturable geometry. We systematically test our framework by designing a catalog of materials fitting isotropic Hooke's laws as closely as possible. We demonstrate significantly improved accuracy over traditional linear metamaterial design techniques by fabricating and testing physical prototypes.
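To make the homogenization idea above concrete, here is a rough sketch of querying a precomputed table of homogenized energy densities over macroscopic plane-strain components and recovering an effective stress by differentiation. The grid, the random stand-in values, and the finite-difference derivative are all illustrative simplifications; the paper constructs a far more accurate interpolant and evaluates stress and tangent elasticity from it.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical table of homogenized energy density W_hom sampled on a grid of
# macroscopic plane-strain components (E11, E22, E12); random stand-in values.
e11 = e22 = e12 = np.linspace(-0.2, 0.2, 21)
W_table = np.random.default_rng(0).random((21, 21, 21))

W_hom = RegularGridInterpolator((e11, e22, e12), W_table)

def effective_stress(E, h=1e-4):
    # Effective stress as the strain derivative of the homogenized energy,
    # approximated here by central differences of the interpolant.
    S = np.zeros(3)
    for k in range(3):
        dE = np.zeros(3)
        dE[k] = h
        S[k] = (W_hom(E + dE) - W_hom(E - dE)).item() / (2.0 * h)
    return S

print(effective_stress(np.array([0.05, -0.02, 0.01])))
```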
Nonlinear metamaterials with tailored mechanical properties have applications in engineering, medicine, robotics, and beyond. While modeling their macromechanical behavior is challenging in itself, finding structure parameters that best approximate high-level performance goals is harder still. In this work, we propose Neural Metamaterial Networks (NMN)---smooth neural representations that encode the nonlinear mechanics of entire metamaterial families. Given structure parameters as input, NMN return continuously differentiable strain energy density functions, thus guaranteeing conservative forces by construction. Though trained on simulation data, NMN do not inherit the discontinuities resulting from topological changes in finite element meshes. They instead provide a smooth map from parameter to performance space that is fully differentiable and thus well-suited for gradient-based optimization. On this basis, we formulate inverse material design as a nonlinear programming problem that leverages neural networks for both objective functions and constraints. We use this approach to automatically design materials with desired strain-stress curves, prescribed directional stiffness, and Poisson ratio profiles. We furthermore conduct ablation studies on network nonlinearities and show the advantages of our approach compared to native-scale optimization.
While 3D printing enables the customization and home fabrication of a wide range of shapes, fabricating freeform thin shells remains challenging. As printed layers misalign with the surface curvature, they incur structural deficiencies, while curved shells require large support structures that typically use more material than the part itself.
We present a computational framework for optimizing the internal structure of 3D printed plates such that they morph into a desired freeform shell when heated. This exploits the shrinkage effect of thermoplastics such as PLA, which store internal stresses along the deposition directions. These stresses get released when the material is heated again above its glass transition temperature, causing an anisotropic deformation that induces curvature.
Our inverse design method takes as input a freeform surface and finds an optimized set of deposition trajectories in each layer such that their anisotropic shrinkage deforms the plate into the prescribed surface geometry. We optimize for a continuous vector field that varies across the plate and within its thickness. The algorithm then extracts a set of deposition trajectories from the vector field in order to fabricate the flat plates on standard FFF printers. We validate our algorithm on freeform, doubly-curved surfaces.
Additive and subtractive hybrid manufacturing (ASHM) involves the alternating use of additive and subtractive manufacturing techniques, which provides unique advantages for fabricating complex geometries with otherwise inaccessible surfaces. However, a significant challenge lies in ensuring tool accessibility during both fabrication procedures, as the object shape may change dramatically and different parts of the shape are interdependent. In this study, we propose a computational framework to optimize the planning of additive and subtractive sequences while ensuring tool accessibility. Our goal is to minimize the switching between additive and subtractive processes to achieve efficient fabrication while maintaining product quality. We approach the problem by formulating it as a Volume-And-Surface-Co-decomposition (VASCO) problem. First, we slice volumes into slabs and build a dynamic directed graph to encode manufacturing constraints, with each node representing a slab and edge directions reflecting the operation order. We introduce a novel geometric property called hybrid-fabricability for a pair of additive and subtractive procedures. Then, we propose a beam-guided top-down block decomposition algorithm to solve the VASCO problem. We apply our solution to a 5-axis hybrid manufacturing platform and evaluate it on various 3D shapes. Finally, we assess the performance of our approach through both physical and simulated manufacturing evaluations.
Boundary layer flow plays a very important role in shaping the entire flow feature near and behind obstacles inside fluids. Thus, boundary treatment methods are crucial for a physically consistent fluid simulation, especially when turbulence occurs at a high Reynolds number, where accurately handling the thin boundary layer becomes quite challenging. Traditional Navier-Stokes solvers usually construct multi-resolution body-fitted meshes to achieve high accuracy, often together with near-wall and sub-grid turbulence modeling. However, this can be time-consuming and computationally intensive even with GPU acceleration. An alternative and much faster approach is to switch to a kinetic solver, such as the lattice Boltzmann model, but boundary treatment then has to be done in a cut-cell manner, sacrificing accuracy unless the grid resolution is greatly increased. In this paper, we focus on simulating boundary-dominated turbulent flow phenomena with an efficient kinetic solver. To significantly improve the cut-cell-based boundary treatment for higher accuracy without excessively increasing the simulation resolution, we propose a novel parametric boundary treatment model, including a semi-Lagrangian scheme at the wall for non-equilibrium distribution functions, together with a purely link-based near-wall analytical mesoscopic model, built by analogy with the macroscopic wall modeling approach yet simple to compute. The method is further extended to handle moving boundaries with increased accuracy. Comprehensive analyses are conducted with a variety of simulation results, which are qualitatively and quantitatively validated against experiments and real-life scenarios and compared to existing methods, indicating the superiority of our method. We highlight that our method provides not only a more accurate boundary treatment but also a valuable tool to control boundary-layer behavior. Neither capability has been demonstrated before in computer graphics, and we believe both will be very useful in practical engineering.
Kinetic solvers for incompressible fluid simulation were designed to run efficiently on massively parallel architectures such as GPUs. While these lattice Boltzmann solvers have recently proven much faster and more accurate than the macroscopic Navier-Stokes-based solvers traditionally used in graphics, this systematically comes at the price of a very large memory requirement: a mesoscopic discretization of statistical mechanics requires over an order of magnitude more variables per grid node than most fluid solvers in graphics. In order to open up kinetic simulation to gaming and simulation software packages on commodity hardware, we propose a High-Order Moment-Encoded Lattice-Boltzmann-Method solver, which we coin HOME-LBM, requiring only the storage of a few moments per grid node, with little to no loss of accuracy in the typical simulation scenarios encountered in graphics. We show that our lightweight and lightspeed fluid solver requires three times less memory and runs ten times faster than state-of-the-art kinetic solvers, for a nearly-identical visual output.
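As background for the memory argument above, the sketch below contrasts a classical D2Q9 lattice Boltzmann node, which stores nine distribution values, with moment encoding, which stores only density and velocity (three values) and reconstructs the textbook equilibrium populations from them. Only the standard equilibrium is shown; the higher-order moment closure that HOME-LBM uses for the non-equilibrium part is not reproduced here.

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their standard weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def f_eq(rho, u):
    # Reconstruct the 9 equilibrium distributions from 3 stored moments (rho, ux, uy).
    cu = c @ u                                            # (9,) dot products c_i . u
    return rho * w * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * (u @ u))

# Moment encoding stores ~3 floats per node instead of 9 populations.
print(f_eq(1.0, np.array([0.05, 0.0])))
```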
Multi-fluid simulation has attracted wide research interest in recent years. Despite the impressive successes of current works, simulating scenes that involve the mixing or unmixing of high-density-ratio phases with particle-based discretizations remains a challenging task. In this paper, we propose a peridynamic mixture-model theory that stably handles high-density-ratio multi-fluid simulations. With the assistance of novel scalar-valued volume flow states, we propose a particle-based discretization scheme that evaluates all the terms of the multi-phase Navier-Stokes equations in an integral form. We also design a novel mass updating strategy for enhancing phase mass conservation and reducing particle volume variations under high-density-ratio settings in multi-fluid simulations. As a result, we achieve significantly more stable mixture-model multi-fluid simulations involving the mixing and unmixing of high-density-ratio phases. Various experiments and comparisons demonstrate the effectiveness of our approach.
We estimate the three Herschel-Bulkley parameters (yield stress σ_Y, power-law index n, and consistency parameter η) of shear-dependent fluid-like materials, possibly with large-scale inclusions, for which rheometers may fail to provide a useful measurement. We perform dam-break (or column-collapse) experiments with the unknown material and capture video footage. We then use simulations to optimize for the material parameters. To better match the simple shear flow encountered in a rheometer, we modify the flow rule of the elasto-viscoplastic Herschel-Bulkley model. Analyzing the loss landscape of the optimization, we identify a similarity relation: parameter sets that lie far apart along this relation still produce matching simulations, making the parameters hard to distinguish. We find that the similarity relation depends on the experimental setup, and we exploit this dependency to improve the estimation by combining multiple setups, which we select by analyzing the Hessian of the similarity relation. We validate the efficacy of our method by comparing the estimations to measurements from a rheometer (for materials without large-scale inclusions) and show applications to materials with large-scale inclusions, including various salad and pasta sauces, and congee.
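For reference, the three estimated parameters enter the standard Herschel-Bulkley law for simple shear, sketched below with illustrative values; the paper's modified flow rule for the elasto-viscoplastic model is not reflected here.

```python
import numpy as np

def hb_shear_stress(gamma_dot, sigma_Y, n, eta):
    # Standard Herschel-Bulkley law for simple shear:
    #   tau = sigma_Y + eta * gamma_dot**n   for shear rates gamma_dot > 0.
    gamma_dot = np.asarray(gamma_dot, dtype=float)
    return np.where(gamma_dot > 0.0, sigma_Y + eta * gamma_dot**n, 0.0)

# Example: a shear-thinning paste (n < 1) with a 20 Pa yield stress (illustrative values).
print(hb_shear_stress([0.1, 1.0, 10.0], sigma_Y=20.0, n=0.5, eta=5.0))
```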
Soft robotics offers unique advantages in manipulating fragile or deformable objects, human-robot interaction, and exploring inaccessible terrain. However, designing soft robots that produce large, targeted deformations is challenging. In this paper, we propose a new methodology for designing soft robots that combines optimization-based design with a simple and cost-efficient manufacturing process. Our approach is centered around the concept of robotic skins---thin fabrics with 3D-printed reinforcement patterns that augment and control plain silicone actuators. By decoupling shape control and actuation, our approach enables a simpler and cost-efficient manufacturing process. Unlike previous methods that rely on empirical design heuristics for generating desired deformations, our approach automatically discovers complex reinforcement patterns without any need for domain knowledge or human intervention. This is achieved by casting reinforcement design as a nonlinear constrained optimization problem and using a novel, three-field topology optimization approach tailored to fabrics with 3D-printed reinforcements. We demonstrate the potential of our approach by designing soft robotic actuators capable of various motions such as bending, contraction, twist, and combinations thereof. We also demonstrate applications of our robotic skins to robotic grasping with a soft three-finger gripper and locomotion tasks for a soft quadrupedal robot.
The kinematic motion of a robotic character is defined by its mechanical joints and actuators that restrict the relative motion of its rigid components. Designing robots that perform a given target motion as closely as possible with a fixed number of actuated degrees of freedom is challenging, especially for robots that form kinematic loops. In this paper, we propose a technique that simultaneously solves for optimal design and control parameters for a robotic character whose design is parameterized with configurable joints. At the technical core of our technique is an efficient solution strategy that uses dynamic programming to solve for optimal state, control, and design parameters, together with a strategy to remove redundant constraints that commonly exist in general robot assemblies with kinematic loops. We demonstrate the efficacy of our approach by either editing the design of an existing robotic character, or by optimizing the design of a new character to perform a desired motion.
We introduce an adaptive level-of-detail technique for ray tracing triangle meshes that aims to reduce the memory bandwidth used during ray traversal, which can be the bottleneck for rendering time with large scenes and the primary consumer of energy. We propose a specific data structure for hierarchically representing triangle meshes, allowing localized decisions for the desired mesh resolution per ray. Starting with the lowest-resolution triangle mesh level, higher-resolution levels are generated by tessellating each triangle into four via splitting its edges with arbitrarily-placed vertices. We fit the resulting mesh hierarchy into a specialized acceleration structure to perform on-the-fly tessellation level selection during ray traversal. Our structure reduces both storage cost and data movement during rendering, which are the main consumers of energy. It also allows continuous transitions between detail levels, while locally adjusting the mesh resolution per ray and preserving watertightness. We present how this structure can be used with both primary and secondary rays for reflections and shadows, which can intersect with different tessellation levels, providing consistent results. We also propose specific hardware units to cover the cost of additional compute needed for level-of-detail operations. We evaluate our method using a cycle-accurate simulation of a custom ray tracing hardware architecture. Our results show that, as compared to traditional bounding volume hierarchies, our method can provide more than an order of magnitude reduction in energy use and render time, given sufficient computational resources.
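The hierarchy described above is built from 1-to-4 triangle splits. The following sketch shows only that refinement topology, using edge midpoints for simplicity, whereas the paper allows arbitrarily-placed edge vertices and stores the result in a specialized acceleration structure.

```python
import numpy as np

def split_1_to_4(tri, midpoint=lambda a, b: 0.5 * (a + b)):
    # Split one triangle into four by inserting a vertex on each edge.
    # Midpoints are used here only as the simplest placement choice.
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [np.array([a, ab, ca]),
            np.array([ab, b, bc]),
            np.array([ca, bc, c]),
            np.array([ab, bc, ca])]

tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
children = split_1_to_4(tri)   # next tessellation level for this triangle
```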
Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.
Retrieving out-of-reach objects is a crucial task in virtual reality (VR). One of the most commonly used approaches for this task is the gesture-based approach, which allows for bare-hand, eyes-free, and direct retrieval. However, previous work has primarily focused on assigned gesture design, neglecting the context. This can make it challenging to accurately retrieve an object from a large number of objects due to the one-to-one mapping metaphor, limitations of finger poses, and memory burdens. There is a general consensus that objects and contexts are related, which suggests that the object expected to be retrieved is related to the context, including the scene and the objects with which users interact. As such, we propose a commonsense knowledge-driven joint reasoning approach for object retrieval, where human grasping gestures and context are modeled using an And-Or graph (AOG). This approach enables users to accurately retrieve objects from a large number of candidate objects by using natural grasping gestures based on their experience of grasping physical objects. Experimental results demonstrate that our proposed approach improves retrieval accuracy. We also propose an object retrieval system based on the proposed approach. Two user studies show that our system enables efficient object retrieval in virtual environments (VEs).
Garment modeling is an essential task of the global apparel industry and a core part of digital human modeling. Realistic representation of garments with valid sewing patterns is key to their accurate digital simulation and eventual fabrication. However, few computational tools support bridging the gap between high-level construction goals and low-level editing of pattern geometry, e.g., combining or switching garment elements, semantic editing, or design exploration that maintains the validity of a sewing pattern. We propose GarmentCode, the first domain-specific language (DSL) for garment modeling, which applies principles of object-oriented programming to garment construction and allows designing sewing patterns in a hierarchical, component-oriented manner. The programming-based paradigm naturally provides unique advantages of component abstraction, algorithmic manipulation, and free-form design parametrization. We additionally support the construction process by automating typical low-level tasks like placing a dart at a desired location. In our prototype garment configurator, users can manipulate meaningful design parameters and body measurements, while the construction of pattern geometry is handled by garment programs implemented with GarmentCode. Our configurator enables the free exploration of rich design spaces and the creation of garments using interchangeable, parameterized components. We showcase our approach by producing a variety of garment designs and retargeting them to different body shapes using our configurator. The library and garment configurator are available at https://github.com/maria-korosteleva/GarmentCode.
A garment sewing pattern represents the intrinsic rest shape of a garment and is core to many applications such as fashion design, virtual try-on, and digital avatars. In this work, we explore the challenging problem of recovering garment sewing patterns from daily photos for augmenting these applications. To solve the problem, we first synthesize a versatile dataset, named SewFactory, which consists of around 1M images and ground-truth sewing patterns for model training and quantitative evaluation. SewFactory covers a wide range of human poses, body shapes, and sewing patterns, and possesses realistic appearances thanks to the proposed human texture synthesis network. Then, we propose a two-level Transformer network called Sewformer, which significantly improves sewing pattern prediction performance. Extensive experiments demonstrate that the proposed framework is effective in recovering sewing patterns and generalizes well to casually taken human photos. Code, dataset, and pre-trained models will be released.
This paper presents computational methods to aid the creation of LEGO® sketch models from simple input images. Beyond conventional LEGO® mosaics, we aim to improve the expressiveness of LEGO® models by utilizing LEGO® tiles with sloping and rounding edges, together with rectangular bricks, to reproduce smooth curves and sharp features in the input. This is a challenging task, as we have limited brick shapes to use and limited space to place bricks. Also, the search space is immense and combinatorial in nature. We approach the task by decoupling the LEGO® construction into two steps: first approximating the shape with a LEGO®-buildable contour, then filling the contour polygon with LEGO® bricks. We formulate this contour approximation as a graph optimization with our objective and constraints, and effectively solve for the contour polygon that best approximates the input shape. Further, we extend our optimization model to handle multi-color and multi-layer regions, and formulate a grid alignment process and various perceptual constraints to refine the results. We employ our method to create a large variety of LEGO® models and compare it with humans and baseline methods to demonstrate its compelling quality and speed.
The evolution of computer-generated holography (CGH) algorithms has prompted significant improvements in the performance of holographic displays. Nonetheless, these algorithms are starting to encounter limited degrees of freedom in CGH optimization and physical constraints stemming from the coherent nature of holograms. To surpass these physical limitations, we consider polarization as a new degree of freedom by utilizing a novel optical platform called a metasurface. Polarization-multiplexing metasurfaces enable incoherent-like behavior in holographic displays due to the mutual incoherence of orthogonal polarization states. We leverage this unique characteristic of a metasurface by integrating it into a holographic display and exploiting polarization diversity to bring an additional degree of freedom to CGH algorithms. To minimize the speckle noise while maximizing the image quality, we devise a fully differentiable optimization pipeline that takes into account the metasurface proxy model, thereby jointly optimizing spatial light modulator phase patterns and geometric parameters of metasurface nanostructures. We evaluate the metasurface-enabled depolarized holography through simulations and experiments, demonstrating its ability to reduce speckle noise and enhance image quality.
Holographic displays promise several benefits including high quality 3D imagery, accurate accommodation cues, and compact form-factors. However, holography relies on coherent illumination which can create undesirable speckle noise in the final image. Although smooth phase holograms can be speckle-free, their non-uniform eyebox makes them impractical, and speckle mitigation with partially coherent sources also reduces resolution. Averaging sequential frames for speckle reduction requires high speed modulators and consumes temporal bandwidth that may be needed elsewhere in the system.
In this work, we propose multisource holography, a novel architecture that uses an array of sources to suppress speckle in a single frame without sacrificing resolution. By using two spatial light modulators, arranged sequentially, each source in the array can be controlled almost independently to create a version of the target content with different speckle. Speckle is then suppressed when the contributions from the multiple sources are averaged at the image plane. We introduce an algorithm to calculate multisource holograms, analyze the design space, and demonstrate up to a 10 dB increase in peak signal-to-noise ratio compared to an equivalent single source system. Finally, we validate the concept with a benchtop experimental prototype by producing both 2D images and focal stacks with natural defocus cues.
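The core statistical effect behind this design, averaging mutually incoherent speckle patterns so that contrast falls roughly as one over the square root of the number of sources, can be illustrated numerically. The sketch below is a generic band-limited speckle toy, not the paper's two-SLM optical model; the resolution, band limit, and source count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_PIX = 256
fx = np.abs(np.fft.fftfreq(N_PIX))
lowpass = (fx[:, None] < 0.1) & (fx[None, :] < 0.1)   # band limit of the "optical system"

def speckle_frame(rng):
    # One coherent source: a random-phase field pushed through a band limit
    # yields a fully developed speckle pattern (normalized to unit mean intensity).
    field = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (N_PIX, N_PIX)))
    field = np.fft.ifft2(np.fft.fft2(field) * lowpass)
    intensity = np.abs(field) ** 2
    return intensity / intensity.mean()

def contrast(I):
    return I.std() / I.mean()   # ~1.0 for a single fully developed speckle pattern

single = speckle_frame(rng)
averaged = np.mean([speckle_frame(rng) for _ in range(8)], axis=0)
print(contrast(single), contrast(averaged))   # contrast drops roughly as 1/sqrt(#sources)
```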
Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the prefiltered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.
Immersive user experiences in live VR/AR performances require a fast and accurate free-view rendering of the performers. Existing methods are mainly based on Pixel-aligned Implicit Functions (PIFu) or Neural Radiance Fields (NeRF). However, while PIFu-based methods usually fail to produce photorealistic view-dependent textures, NeRF-based methods typically lack local geometry accuracy and are computationally heavy (e.g., dense sampling of 3D points, additional fine-tuning, or pose estimation). In this work, we propose a novel generalizable method, named SAILOR, to create high-quality human free-view videos from very sparse RGBD live streams. To produce view-dependent textures while preserving locally accurate geometry, we integrate PIFu and NeRF such that they work synergistically by conditioning the PIFu on depth and then rendering view-dependent textures through NeRF. Specifically, we propose a novel network, named SRONet, for this hybrid representation. SRONet can handle unseen performers without fine-tuning. Besides, a neural blending-based ray interpolation approach, a tree-based voxel-denoising scheme, and a parallel computing pipeline are incorporated to reconstruct and render live free-view videos at 10 fps on average. To evaluate the rendering performance, we construct a real-captured RGBD benchmark from 40 performers. Experimental results show that SAILOR outperforms existing human reconstruction and performance capture methods.
Neural fields are evolving towards a general-purpose continuous representation for visual computing. Yet, despite their numerous appealing properties, they are hardly amenable to signal processing. As a remedy, we present a method to perform general continuous convolutions with general continuous signals such as neural fields. Observing that piecewise polynomial kernels reduce to a sparse set of Dirac deltas after repeated differentiation, we leverage convolution identities and train a repeated integral field to efficiently execute large-scale convolutions. We demonstrate our approach on a variety of data modalities and spatially-varying kernels.
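A one-dimensional toy of the identity this approach builds on: differentiating a piecewise-polynomial kernel repeatedly leaves a sparse set of Dirac deltas, so the convolution reduces to a few taps on a repeatedly integrated signal. The sketch below uses a box kernel (one differentiation, one integral, two deltas) on a discrete placeholder signal; the paper generalizes this to trained repeated integral fields of continuous neural signals.

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.normal(size=1000)          # discrete stand-in for a continuous signal
w = 25                             # box-kernel width (piecewise-constant kernel)

# Direct dense convolution with a box kernel.
box = np.ones(w)
direct = np.convolve(f, box)[: len(f)]

# Repeated-integral trick: the box differentiates to two Dirac deltas, so the
# convolution becomes a difference of one running integral (prefix sum) of f.
F = np.cumsum(f)                   # first integral of the signal
fast = F - np.concatenate(([0.0] * w, F[:-w]))

print(np.allclose(direct, fast))   # True
```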
Neural Radiance Fields (NeRF) have shown impressive novel view synthesis results; nonetheless, even thorough recordings yield imperfections in reconstructions, for instance due to poorly observed areas or minor lighting changes. Our goal is to mitigate these imperfections from various sources with a joint solution: we take advantage of the ability of generative adversarial networks (GANs) to produce realistic images and use them to enhance realism in 3D scene reconstruction with NeRFs. To this end, we learn the patch distribution of a scene using an adversarial discriminator, which provides feedback to the radiance field reconstruction, thus improving realism in a 3D-consistent fashion. Thereby, rendering artifacts are repaired directly in the underlying 3D representation by imposing multi-view path rendering constraints. In addition, we condition a generator with multi-resolution NeRF renderings which is adversarially trained to further improve rendering quality. We demonstrate that our approach significantly improves rendering quality, e.g., nearly halving LPIPS scores compared to Nerfacto while at the same time improving PSNR by 1.4dB on the advanced indoor scenes of Tanks and Temples.
Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes. However, NeRFs require accurate camera parameters as input --- inaccurate camera parameters result in blurry renderings. Extrinsic and intrinsic camera parameters are usually estimated using Structure-from-Motion (SfM) methods as a pre-processing step to NeRF, but these techniques rarely yield perfect estimates. Thus, prior works have proposed jointly optimizing camera parameters alongside a NeRF, but these methods are prone to local minima in challenging settings. In this work, we analyze how different camera parameterizations affect this joint optimization problem, and observe that standard parameterizations exhibit large differences in magnitude with respect to small perturbations, which can lead to an ill-conditioned optimization problem. We propose using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and we propose to use this transform as a preconditioner for the camera parameters during joint optimization. Our preconditioned camera optimization significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF. Our approach is easy to implement, does not significantly increase runtime, can be applied to a wide variety of camera parameterizations, and can straightforwardly be incorporated into other NeRF-like models.
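The following is a generic numpy sketch of the kind of whitening preconditioner described above: given a proxy Jacobian that maps camera-parameter perturbations to changes in projected points, the inverse square root of its Gram matrix equalizes the parameters' effects. The Jacobian, parameter count, and scaling below are placeholders, not the paper's exact proxy problem.

```python
import numpy as np

def whitening_preconditioner(J, eps=1e-8):
    # J: proxy Jacobian mapping small camera-parameter perturbations to changes
    # in projected point positions. P = (J^T J + eps I)^(-1/2) whitens the
    # parameters so that each has a comparable effect on the projections.
    C = J.T @ J + eps * np.eye(J.shape[1])
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(3)
scales = np.array([1e3, 1e3, 1e3, 1.0, 1.0, 1.0, 1e-2, 1e-2, 1e-2])  # wildly mixed units
J = rng.normal(size=(200, 9)) * scales
P = whitening_preconditioner(J)

# Optimize whitened parameters q, with camera parameters p = P @ q; gradient
# steps on q are then well-scaled even though the raw parameters are not.
```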
We propose a unified λ-subdivision scheme with a continuous family of tuned subdivisions for quadrilateral meshes. The main subdivision stencil parameters of the unified scheme are represented as spline functions of the subdominant eigenvalue λ of the respective subdivision matrices, and λ can be selected within a wide range to produce the desired properties of refined meshes and limit surfaces, with optimal curvature performance in extraordinary regions. Spline representations of stencil parameters are constructed from discrete optimized stencil coefficients obtained by a general tuning framework that optimizes eigenvectors of subdivision matrices towards curvature continuity conditions. To further improve the quality of limit surfaces, a weighting function is devised to penalize sign changes of Gauss curvatures on the respective second-order characteristic maps. By selecting an appropriate λ, the resulting unified subdivision scheme produces the properties anticipated by different target applications, including the nice properties of several other existing tuned subdivision schemes. Comparison results also validate the advantage of the proposed scheme, which produces higher-quality surfaces for subdivision at lower λ values, a challenging task for other related tuned subdivision schemes.
K-surfaces are an interactive modeling technique for Bézier-spline surfaces. Inspired by k-curves [Yan et al. 2017], each patch provides a single control point that is interpolated at a local extremum of Gaussian curvature. The challenge is to solve the inverse problem of finding the center control point of a Bézier patch given the boundary control points and the handle. Unlike the situation in 2D, bi-quadratic Bézier patches may exhibit none, one, or several extrema, and finding them is non-trivial. We solve this difficult inverse problem, including the possible selection among several extrema, by learning the desired function from samples generated by computing the Gaussian curvature of random patches. This approximation provides a stable solution to the ill-defined inverse problem and is much more efficient than direct numerical optimization, facilitating the interactive modeling framework. The local solution is used in an iterative optimization incorporating continuity constraints across patches. We demonstrate that the surface varies smoothly with the handle location and that the resulting modeling system provides local and generally intuitive control. The idea of learning the inverse mapping from handles to patches may be applicable to other parametric surfaces.
The surface/surface intersection technique serves as one of the most fundamental functions in modern Computer Aided Design (CAD) systems. Despite the long research history and successful applications of surface intersection algorithms in various CAD industrial software, challenges remain in balancing computational efficiency, accuracy, and topological correctness. Specifically, most practical intersection algorithms fail to guarantee the correct topology of the intersection curve(s) when two surfaces are in near-critical positions, which brings instability to CAD systems. Even in ACIS, one of the most widely used commercial geometry engines, such complicated intersection topology can still be a tough nut to crack.
In this paper, we present a practical topology-guaranteed algorithm for computing the intersection loci of two B-spline surfaces. Our algorithm handles all types of common and complicated intersection topology with practical efficiency, including intersections with multiple branches or cross singularities, contacts at several isolated singular points or high-order contacts along a curve, as well as intersections along boundary curves. We present representative examples of these hard topology situations that challenge not only the open-source geometry engine OCCT but also the commercial engine ACIS. We compare our algorithm, in both efficiency and topological correctness, on a large set of common and complicated models against the open-source intersection packages in SISL and OCCT and the commercial engine ACIS.
Discontinuous visibility changes at object boundaries remain a persistent source of difficulty in the area of differentiable rendering. Left untreated, they bias computed gradients so severely that even basic optimization tasks fail.
Prior path-space methods addressed this bias by decoupling boundaries from the interior, allowing each part to be handled using specialized Monte Carlo sampling strategies. While conceptually powerful, this idea has yet to reach its full potential, since existing methods often fail to sample the boundary in proportion to its contribution.
This paper presents theoretical and algorithmic contributions. On the theoretical side, we transform the boundary derivative into a remarkably simple local integral that invites present and future developments.
Building on this result, we propose a new strategy that projects ordinary samples produced during forward rendering onto nearby boundaries. The resulting projections establish a variance-reducing guiding distribution that accelerates convergence of the subsequent differential phase.
We demonstrate the superior efficiency and versatility of our method across a variety of shape representations, including triangle meshes, implicitly defined surfaces, and cylindrical fibers based on Bézier curves.
Physics-based differentiable rendering is becoming increasingly crucial for tasks in inverse rendering and machine learning pipelines. To address discontinuities caused by geometric boundaries and occlusion, two classes of methods have been proposed: 1) the edge-sampling methods that directly sample light paths at the scene discontinuity boundaries, which require nontrivial data structures and precomputation to select the edges, and 2) the reparameterization methods that avoid discontinuity sampling but are currently limited to hemispherical integrals and unidirectional path tracing.
We introduce a new mathematical formulation that enjoys the benefits of both classes of methods. Unlike previous reparameterization work that focused on hemispherical integrals, we derive the reparameterization in path space. As a result, to estimate derivatives using our formulation, we can apply advanced Monte Carlo rendering methods, such as bidirectional path tracing, while avoiding explicit sampling of discontinuity boundaries. We show differentiable rendering and inverse rendering results to demonstrate the effectiveness of our method.
Recently, great progress has been made in physics-based differentiable rendering. Existing differentiable rendering techniques typically focus on static scenes, but during inverse rendering---a key application of differentiable rendering---the scene is updated dynamically at each gradient step. In this paper, we take a first step toward leveraging temporal data in the context of inverse direct illumination. By adopting reservoir-based spatiotemporal importance resampling (ReSTIR), we introduce new Monte Carlo estimators for both the interior and boundary components of differential direct illumination integrals. We also integrate ReSTIR with antithetic sampling to further improve its effectiveness. At equal frame time, our methods produce gradient estimates with up to 100× lower relative error than baseline methods. Additionally, we propose an inverse-rendering pipeline that incorporates these estimators and provides reconstructions with up to 20× lower error.
In theory, diffusion curves promise complex color gradations for infinite-resolution vector graphics. In practice, existing realizations suffer from poor scaling, discretization artifacts, or insufficient support for rich boundary conditions. Previous applications of the boundary element method to diffusion curves have relied on polygonal approximations, which either forfeit the high-order smoothness of Bézier curves, or, when the polygonal approximation is extremely detailed, result in large and costly systems of equations that must be solved. In this paper, we utilize the boundary integral equation method to accurately and efficiently solve the underlying partial differential equation. Given a desired resolution and viewport, we then interpolate this solution and use the boundary element method to render it. We couple this hybrid approach with the fast multipole method on a non-uniform quadtree for efficient computation. Furthermore, we introduce an adaptive strategy to enable truly scalable infinite-resolution diffusion curves.
In recent years, novel view synthesis from a single image has seen significant progress thanks to the rapid advancements in 3D scene representation and image inpainting techniques. While the current approaches are able to synthesize geometrically consistent novel views, they often do not handle the view-dependent effects properly. Specifically, the highlights in their synthesized images usually appear to be glued to the surfaces, making the novel views unrealistic. To address this major problem, we make a key observation that the process of synthesizing novel views requires changing the shading of the pixels based on the novel camera, and moving them to appropriate locations. Therefore, we propose to split the view synthesis process into two independent tasks of pixel reshading and relocation. During the reshading process, we take the single image as the input and adjust its shading based on the novel camera. This reshaded image is then used as the input to an existing view synthesis method to relocate the pixels and produce the final novel view image. We propose to use a neural network to perform reshading and generate a large set of synthetic input-reshaded pairs to train our network. We demonstrate that our approach produces plausible novel view images with realistic moving highlights on a variety of real world scenes.
Neural image representations offer the possibility of high fidelity, compact storage, and resolution-independent accuracy, providing an attractive alternative to traditional pixel- and grid-based representations. However, coordinate neural networks fail to capture discontinuities present in the image and tend to blur across them; we aim to address this challenge. In many cases, such as rendered images, vector graphics, diffusion curves, or solutions to partial differential equations, the locations of the discontinuities are known. We take those locations as input, represented as linear, quadratic, or cubic Bézier curves, and construct a feature field that is discontinuous across these locations and smooth everywhere else. Finally, we use a shallow multi-layer perceptron to decode the features into the signal value. To construct the feature field, we develop a new data structure based on a curved triangular mesh, with features stored on the vertices and on a subset of the edges that are marked as discontinuous. We show that our method can be used to compress a 100,000²-pixel rendered image into a 25 MB file; can be used as a new diffusion-curve solver by combining with Monte-Carlo-based methods or directly supervised by the diffusion-curve energy; or can be used for compressing 2D physics simulation data.
We explore the space of matrix-generated (0, m, 2)-nets and (0, 2)-sequences in base 2, also known as digital dyadic nets and sequences. In computer graphics, they are arguably leading the competition for use in rendering. We provide a complete characterization of the design space and count the possible number of constructions with and without considering possible reorderings of the point set. Based on this analysis, we then show that every digital dyadic net can be reordered into a sequence, together with a corresponding algorithm. Finally, we present ξ-sequences, a novel family of self-similar digital dyadic sequences that spans a subspace with fewer degrees of freedom. Those ξ-sequences are extremely efficient to sample and compute, and we demonstrate their advantages over the classic Sobol (0, 2)-sequence.
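For readers unfamiliar with the construction, the sketch below is a minimal NumPy illustration of a matrix-generated digital dyadic sequence, using the classic generator pair: the identity matrix yields the van der Corput dimension and the Pascal matrix mod 2 yields the second Sobol' dimension, which together form a digital (0, 2)-sequence in base 2. It is provided for context only and is not the ξ-sequence construction of the paper.

```python
import numpy as np

M = 16  # number of bits / matrix size; supports up to 2**M points

def pascal_mod2(m):
    """Upper-triangular Pascal matrix over GF(2): C[r, c] = binom(c, r) mod 2."""
    C = np.zeros((m, m), dtype=np.uint8)
    for c in range(m):
        C[0, c] = 1
        for r in range(1, c + 1):
            C[r, c] = (C[r, c - 1] + C[r - 1, c - 1]) % 2
    return C

def digital_point(index, C, m=M):
    """Apply a generator matrix C over GF(2) to the base-2 digits of `index`."""
    digits = (index >> np.arange(m)) & 1          # least-significant digit first
    out_digits = C.dot(digits) % 2                # matrix-vector product mod 2
    return np.sum(out_digits * 0.5 ** (np.arange(m) + 1))

C1 = np.eye(M, dtype=np.uint8)   # van der Corput (radical inverse)
C2 = pascal_mod2(M)              # Sobol' second dimension

points = np.array([[digital_point(i, C1), digital_point(i, C2)] for i in range(256)])
print(points[:4])
```

Every prefix of 2^m points of this sequence forms a dyadic net; the paper's analysis characterizes all generator-matrix pairs with this property and their reorderings.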
Artificial light sources make our daily life convenient, but cause a severe problem called light pollution. We propose a novel system for efficient visualization of light pollution in the night sky. Numerous methods have been proposed for rendering the sky, but most of these focus on rendering of the daytime or the sunset sky where the sun is the only, or dominant, light source. For the visualization of the light pollution, however, we must consider many city light sources on the ground, resulting in excessive computational cost. We address this problem by precomputing a set of intensity distributions for the sky illuminated by city light at various locations and with different atmospheric conditions. We apply a principal component analysis and fast Fourier transform to the precomputed distributions, allowing us to efficiently visualize the extent of the light pollution. Using this method, we can achieve one to two orders of magnitude faster computation compared to a naive approach that simply accumulates the scattered intensity for each viewing ray. Furthermore, the fast computation allows us to interactively solve the inverse problem that determines the city light intensity needed to reduce light pollution. Our system provides the user with both a forward and inverse investigation tool for the study and minimization of light pollution.
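The following is a generic NumPy sketch of the precompute-then-compress idea (PCA via SVD over a stack of precomputed intensity distributions); the array shapes are hypothetical placeholders, and the paper additionally applies an FFT stage and a tailored precomputation.

```python
import numpy as np

# Hypothetical precomputed data: each row is one sky intensity distribution,
# flattened over sky directions, for one (city-light, atmosphere) sample.
rng = np.random.default_rng(0)
distributions = rng.random((500, 4096))            # placeholder for precomputed tables

mean = distributions.mean(axis=0)
U, S, Vt = np.linalg.svd(distributions - mean, full_matrices=False)

k = 16                                             # number of principal components kept
basis = Vt[:k]                                     # (k, 4096) PCA basis
coeffs = (distributions - mean) @ basis.T          # per-sample coefficients stored instead of full tables

# At display time, a distribution is reconstructed from k coefficients only.
approx = coeffs[0] @ basis + mean
print("relative error:", np.linalg.norm(approx - distributions[0]) / np.linalg.norm(distributions[0]))
```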
Speed and consistency of target-shifting play a crucial role in human ability to perform complex tasks. Shifting our gaze between objects of interest quickly and consistently requires changes both in depth and direction. Gaze changes in depth are driven by slow, inconsistent vergence movements which rotate the eyes in opposite directions, while changes in direction are driven by ballistic, consistent movements called saccades, which rotate the eyes in the same direction. In the natural world, most of our eye movements are a combination of both types. While scientific consensus on the nature of saccades exists, vergence and combined movements remain less understood and agreed upon.
We sidestep the lack of scientific consensus by proposing an operationalized computational model that predicts the completion time of any type of gaze movement during target-shifting in 3D. To this end, we conduct a psychophysical study in a stereo VR environment to collect more than 12,000 gaze movement trials, analyze the temporal distribution of the observed gaze movements, and fit a probabilistic model to the data. We perform a series of objective measurements and user studies to validate the model. The results demonstrate its predictive accuracy, generalization, as well as applications for optimizing visual performance by altering content placement. Lastly, we leverage the model to measure differences in human target-changing time relative to the natural world, as well as suggest scene-aware projection depth. By incorporating the complexities and randomness of human oculomotor control, we hope this research will support new behavior-aware metrics for VR/AR display design, interface layout, and gaze-contingent rendering.
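As a purely illustrative sketch of the model-fitting step, the snippet below fits a generic parametric family (a Gamma distribution) to hypothetical completion-time data with SciPy; the paper's actual probabilistic model and measured data are not reproduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical completion times (seconds) for one target-shift condition;
# the paper fits its own probabilistic model to ~12,000 measured trials.
rng = np.random.default_rng(1)
times = rng.gamma(shape=4.0, scale=0.08, size=2000)

# Fit a candidate parametric family and query it.
shape, loc, scale = stats.gamma.fit(times, floc=0.0)
model = stats.gamma(shape, loc=loc, scale=scale)

print("median completion time:", model.median())
print("P(completion under 300 ms):", model.cdf(0.3))
```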
In everyday photography, physical limitations of camera sensors and lenses frequently lead to a variety of degradations in captured images such as saturation or defocus blur. A common approach to overcome these limitations is to resort to image stack fusion, which involves capturing multiple images with different focal distances or exposures. For instance, to obtain an all-in-focus image, a set of multi-focus images is captured. Similarly, capturing multiple exposures allows for the reconstruction of high dynamic range. In this paper, we present a novel approach that combines neural fields with an expressive camera model to achieve a unified reconstruction of an all-in-focus high-dynamic-range image from an image stack. Our approach is composed of a set of specialized implicit neural representations tailored to address specific sub-problems along our pipeline: We use neural implicits to predict flow to overcome misalignments arising from lens breathing, depth, and all-in-focus images to account for depth of field, as well as tonemapping to deal with sensor responses and saturation - all trained using a physically inspired supervision structure with a differentiable thin lens model at its core. An important benefit of our approach is its ability to handle these tasks simultaneously or independently, providing flexible post-editing capabilities such as refocusing and exposure adjustment. By sampling the three primary factors in photography within our framework (focal distance, aperture, and exposure time), we conduct a thorough exploration to gain valuable insights into their significance and impact on overall reconstruction quality. Through extensive validation, we demonstrate that our method outperforms existing approaches in both depth-from-defocus and all-in-focus image reconstruction tasks. Moreover, our approach exhibits promising results in each of these three dimensions, showcasing its potential to enhance captured image quality and provide greater control in post-processing.
Reproducing the appearance of arbitrary layered materials has long been a critical challenge in computer graphics, with regard to the demanding requirements of both physical accuracy and low computation cost. Recent studies have demonstrated promising results by learning-based representations that implicitly encode the appearance of complex (layered) materials by neural networks. However, existing generally-learned models often struggle between strong representation ability and high runtime performance, and also lack physical parameters for material editing. To address these concerns, we introduce MetaLayer, a new methodology leveraging meta-learning for modeling and rendering layered materials. MetaLayer contains two networks: a BSDFNet that compactly encodes layered materials into implicit neural representations, and a MetaNet that establishes the mapping between the physical parameters of each material and the weights of its corresponding implicit neural representation. A new positional encoding method and a well-designed training strategy are employed to improve the performance and quality of the neural model. As a new learning-based representation, the proposed MetaLayer model provides both fast responses to material editing and high-quality results for a wide range of layered materials, outperforming existing layered BSDF models.
Capturing material properties of real-world elastic solids is both challenging and highly relevant to many applications in computer graphics, robotics and related fields. We give a non-intrusive, in-situ and inexpensive approach to measure the nonlinear elastic energy density function of man-made materials and biological tissues. We poke the elastic object with 3D-printed rigid cylinders of known radii, and use a precision force meter to record the contact force as a function of the indentation depth, which we measure using a force meter stand, or a novel unconstrained laser setup. We model the 3D elastic solid using the Finite Element Method (FEM), and elastic energy using a compressible Valanis-Landel material that generalizes Neo-Hookean materials by permitting arbitrary tensile behavior under large deformations. We then use optimization to fit the nonlinear isotropic elastic energy so that the FEM contact forces and indentations match their measured real-world counterparts. Because we use carefully designed cubic splines, our materials are accurate in a large range of stretches and robust to inversions, and are therefore "animation-ready" for computer graphics applications. We demonstrate how to exploit radial symmetry to convert the 3D elastostatic contact problem to the mathematically equivalent 2D problem, which vastly accelerates optimization. We also greatly improve the theory and robustness of stretch-based elastic materials, by giving a simple and elegant formula to compute the tangent stiffness matrix, with rigorous proofs and singularity handling. We also contribute the observation that volume compressibility can be estimated by poking with rigid cylinders of different radii, which avoids optical cameras and greatly simplifies experiments. We validate our method by performing full 3D simulations using the optimized materials and confirming that they match real-world forces, indentations and real deformed 3D shapes. We also validate it using a "Shore 00" durometer, a standard device for measuring material hardness.
Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to generate high-fidelity long-term motions, or fail to enable a controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of a flexible number of dancers and long-term group dances, while ensuring coherence to the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, presenting the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity, while maintaining the overall quality of the generated group choreography.
Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates on commodity devices. Source codes and demos are available on our project page.
Capturing and editing full-head performances enables the creation of virtual characters with various applications such as extended reality and media production. The past few years witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs, and others. While these data modalities provide effective means of control, they mostly focus on editing the head movements such as the facial expressions, head pose, and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work to capture dynamic performances of human heads using Neural Radiance Field (NeRF) and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes representing different camera viewpoints and time stamps of a video performance into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space and then propagates these edits to the remaining time steps via a pre-trained deformation network. We evaluate our method visually and numerically via a user study, and results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent.
Coarse architectural models are often generated at scales ranging from individual buildings to scenes for downstream applications such as Digital Twin City, Metaverse, LODs, etc. Such piece-wise planar models can be abstracted as twins from 3D dense reconstructions. However, these models typically lack realistic texture relative to the real building or scene, making them unsuitable for vivid display or direct reference. In this paper, we present TwinTex, the first automatic texture mapping framework to generate a photorealistic texture for a piece-wise planar proxy. Our method addresses most challenges occurring in such twin texture generation. Specifically, for each primitive plane, we first select a small set of photos with greedy heuristics considering photometric quality, perspective quality and facade texture completeness. Then, different levels of line features (LoLs) are extracted from the set of selected photos to generate guidance for later steps. With LoLs, we employ optimization algorithms to align texture with geometry from local to global. Finally, we fine-tune a diffusion model with a multi-mask initialization component and a new dataset to inpaint the missing region. Experimental results on many buildings, indoor scenes and man-made objects of varying complexity demonstrate the generalization ability of our algorithm. Our approach surpasses state-of-the-art texture mapping methods in terms of high-fidelity quality and reaches a human-expert production level with much less effort.
This paper presents a new text-guided technique for generating 3D shapes. The technique leverages a hybrid 3D shape representation, namely EXIM, combining the strengths of explicit and implicit representations. Specifically, the explicit stage controls the topology of the generated 3D shapes and enables local modifications, whereas the implicit stage refines the shape and paints it with plausible colors. Also, the hybrid approach separates the shape and color and generates color conditioned on shape to ensure shape-color consistency. Unlike the existing state-of-the-art methods, we achieve high-fidelity shape generation from natural-language descriptions without the need for time-consuming per-shape optimization or reliance on human-annotated texts during training or test-time optimization. Further, we demonstrate the applicability of our approach to generate indoor scenes with consistent styles using text-induced 3D shapes. Through extensive experiments, we demonstrate the compelling quality of our results and the high coherency of our generated shapes with the input texts, surpassing the performance of existing methods by a significant margin. Codes and models are released at https://github.com/liuzhengzhe/EXIM.
Motion graphics videos are widely used in Web design, digital advertising, animated logos and film title sequences, to capture a viewer's attention. But editing such videos is challenging because the video provides a low-level sequence of pixels and frames rather than higher-level structure such as the objects in the video with their corresponding motions and occlusions. We present a motion vectorization pipeline for converting motion graphics video into an SVG motion program that provides such structure. The resulting SVG program can be rendered using any SVG renderer (e.g. most Web browsers) and edited using any SVG editor. We also introduce a program transformation API that facilitates editing of an SVG motion program to create variations that adjust the timing, motions and/or appearances of objects. We show how the API can be used to create a variety of effects including retiming object motion to match a music beat, adding motion textures to objects, and collision preserving appearance changes.
Scalable Vector Graphics (SVG) is a popular vector image format that offers good support for interactivity and animation. Despite its appealing characteristics, creating custom SVG content can be challenging for users due to the steep learning curve required to understand SVG grammars or get familiar with professional editing software. Recent advancements in text-to-image generation have inspired researchers to explore vector graphics synthesis using either image-based methods (i.e., text → raster image → vector graphics) combining text-to-image generation models with image vectorization, or language-based methods (i.e., text → vector graphics script) through pretrained large language models. Nevertheless, these methods suffer from limitations in terms of generation quality, diversity, and flexibility. In this paper, we introduce IconShop, a text-guided vector icon synthesis method using autoregressive transformers. The key to the success of our approach is to sequentialize and tokenize SVG paths (and textual descriptions as guidance) into a uniquely decodable token sequence. With that, we are able to exploit the sequence learning power of autoregressive transformers, while enabling both unconditional and text-conditioned icon synthesis. Through standard training to predict the next token on a large-scale vector icon dataset accompanied by textual descriptions, the proposed IconShop consistently exhibits better icon synthesis capability than existing image-based and language-based methods both quantitatively (using the FID and CLIP scores) and qualitatively (through formal subjective user studies). Meanwhile, we observe a dramatic improvement in generation diversity, which is validated by the objective Uniqueness and Novelty measures. More importantly, we demonstrate the flexibility of IconShop with multiple novel icon synthesis tasks, including icon editing, icon interpolation, icon semantic combination, and icon design auto-suggestion.
An occluder aims to use very few faces while preserving the occlusion properties of the original 3D model. In this paper, we present DR-Occluder, a novel coarse-to-fine framework for occluder generation that leverages differentiable rendering to optimize a triangle set into an occluder. Unlike prior work, which has not utilized differentiable rendering for this task, our approach provides the ability to optimize a 3D shape toward defined targets. Given a 3D model as input, our method first projects it to silhouette images, which are then processed by a convolutional network to output a group of vertex offsets. These offsets transform a group of distributed triangles into a preliminary occluder, which is further optimized by differentiable rendering. Finally, triangles whose area is smaller than a threshold are removed to obtain the final occluder. Our extensive experiments demonstrate that DR-Occluder significantly outperforms state-of-the-art methods in terms of occlusion quality. Furthermore, we compare the performance of our method with other approaches in a commercial engine, providing compelling evidence of its effectiveness.
We propose an efficient method for differentiable rendering of parametric surfaces and curves, which enables their use in inverse graphics problems. Our central observation is that a representative triangle mesh can be extracted from a continuous parametric object in a differentiable and efficient way. We derive differentiable meshing operators for surfaces and curves that provide varying levels of approximation granularity. With triangle mesh approximations, we can readily leverage existing machinery for differentiable mesh rendering to handle parametric geometry. Naively combining differentiable tessellation with inverse graphics settings lacks robustness and is prone to reaching undesirable local minima. To this end, we draw a connection between our setting and the optimization of triangle meshes in inverse graphics and present a set of optimization techniques, including regularizations and coarse-to-fine schemes. We show the viability and efficiency of our method in a set of image-based computer-aided design applications.
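A minimal PyTorch sketch of the core idea follows: a fixed-topology triangle mesh is extracted from a parametric patch in a differentiable way, so that gradients from any downstream mesh-based loss reach the control points. The patch type, resolution, and meshing operator here are illustrative stand-ins for the paper's operators.

```python
import torch

def bernstein3(t):
    """Cubic Bernstein basis evaluated at t, shape (..., 4)."""
    s = 1.0 - t
    return torch.stack([s**3, 3*s*s*t, 3*s*t*t, t**3], dim=-1)

def tessellate_bezier_patch(ctrl, n=16):
    """Differentiably sample a bicubic Bezier patch.

    ctrl: (4, 4, 3) control points (requires_grad for inverse problems).
    Returns vertices (n*n, 3) and triangle indices (2*(n-1)*(n-1), 3).
    """
    u = torch.linspace(0.0, 1.0, n)
    v = torch.linspace(0.0, 1.0, n)
    Bu = bernstein3(u)                                # (n, 4)
    Bv = bernstein3(v)                                # (n, 4)
    # verts[i, j] = sum_{a,b} Bu[i, a] * Bv[j, b] * ctrl[a, b]
    verts = torch.einsum("ia,jb,abc->ijc", Bu, Bv, ctrl).reshape(-1, 3)

    # Fixed-topology triangulation of the (u, v) grid.
    idx = torch.arange(n * n).reshape(n, n)
    q00, q10 = idx[:-1, :-1].reshape(-1), idx[1:, :-1].reshape(-1)
    q01, q11 = idx[:-1, 1:].reshape(-1), idx[1:, 1:].reshape(-1)
    faces = torch.cat([torch.stack([q00, q10, q11], dim=-1),
                       torch.stack([q00, q11, q01], dim=-1)], dim=0)
    return verts, faces

ctrl = torch.randn(4, 4, 3, requires_grad=True)
verts, faces = tessellate_bezier_patch(ctrl)
verts.sum().backward()          # gradients flow back to the control points
print(ctrl.grad.shape)
```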
Inverse rendering, the process of inferring scene properties from images, is a challenging inverse problem. The task is ill-posed, as many different scene configurations can give rise to the same image. Most existing solutions incorporate priors into the inverse-rendering pipeline to encourage plausible solutions, but they do not consider the inherent ambiguities and the multi-modal distribution of possible decompositions. In this work, we propose a novel scheme that integrates a denoising diffusion probabilistic model pre-trained on natural illumination maps into an optimization framework involving a differentiable path tracer. The proposed method allows sampling from combinations of illumination and spatially-varying surface materials that are, both, natural and explain the image observations. We further conduct an extensive comparative study of different priors on illumination used in previous work on inverse rendering. Our method excels in recovering materials and producing highly realistic and diverse environment map samples that faithfully explain the illumination of the input images.
Many parametrization and mapping-related problems in geometry processing can be viewed as metric optimization problems, i.e., computing a metric minimizing a functional and satisfying a set of constraints, such as flatness.
Penner coordinates are global coordinates on the space of metrics on meshes with a fixed vertex set and topology, but varying connectivity, making it homeomorphic to the Euclidean space of dimension equal to the number of edges in the mesh, without any additional constraints imposed. These coordinates play an important role in the theory of discrete conformal maps, enabling recent development of highly robust algorithms with convergence and solution existence guarantees for computing such maps.
We demonstrate how Penner coordinates can be used to solve a general class of optimization problems involving metrics, including optimization and interpolation, while retaining the key solution existence guarantees available for discrete conformal maps.
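As background, the standard definitions read as follows (stated with one common normalization; this is textbook material rather than a contribution of the paper): Penner coordinates are logarithmic edge lengths, discrete conformal changes act additively on them, and a change of connectivity by an edge flip updates them through the Ptolemy relation.

```latex
% Penner (logarithmic length) coordinates and discrete conformal equivalence:
\lambda_{ij} = 2\log \ell_{ij}, \qquad
\tilde{\lambda}_{ij} = \lambda_{ij} + u_i + u_j
\;\Longleftrightarrow\;
\tilde{\ell}_{ij} = e^{(u_i + u_j)/2}\,\ell_{ij}.
% Flipping the edge ij shared by triangles ijk and jil to the edge kl
% updates the coordinate via the Ptolemy relation:
\ell_{ij}\,\ell_{kl} = \ell_{ik}\,\ell_{jl} + \ell_{il}\,\ell_{jk}.
```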
We propose an efficient method to construct sparse cone singularities under distortion-bounded constraints for conformal parameterizations. Central to our algorithm is using the technique of shape derivatives to move cones for distortion reduction without changing the number of cones. In particular, the supernodal sparse Cholesky update significantly accelerates this movement process. To satisfy the distortion-bounded constraint, we alternately move cones and add cones. The capability and feasibility of our approach are demonstrated over a data set containing 3885 models. Compared with the state-of-the-art method, we achieve an average acceleration of 15 times and slightly fewer cones for the same amount of distortion.
We present GEGNN, a learning-based method for computing the approximate geodesic distance between two arbitrary points on discrete polyhedra surfaces with constant time complexity after fast precomputation. Previous relevant methods either focus on computing the geodesic distance between a single source and all destinations, which has at least linear complexity, or require a long precomputation time. Our key idea is to train a graph neural network to embed an input mesh into a high-dimensional embedding space and compute the geodesic distance between a pair of points using the corresponding embedding vectors and a lightweight decoding function. To facilitate the learning of the embedding, we propose novel graph convolution and graph pooling modules that incorporate local geodesic information and are verified to be much more effective than previous designs. After training, our method requires only one forward pass of the network per mesh as precomputation. Then, we can compute the geodesic distance between a pair of points using our decoding function, which requires only several matrix multiplications and can be massively parallelized on GPUs. We verify the efficiency and effectiveness of our method on ShapeNet and demonstrate that our method is faster than existing methods by orders of magnitude while achieving comparable or better accuracy. Additionally, our method exhibits robustness on noisy and incomplete meshes and strong generalization ability on out-of-distribution meshes. The code and pretrained model can be found at https://github.com/IntelligentGeometry/GeGnn.
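A hypothetical sketch of the constant-time query stage is shown below: once precomputation has produced per-vertex embeddings, each pairwise query is a handful of batched matrix multiplications. The embedding dimension and decoder architecture are placeholders, not the paper's exact design.

```python
import torch

class GeodesicDecoder(torch.nn.Module):
    """Lightweight symmetric decoder: d(p, q) ~ MLP([|e_p - e_q|, e_p * e_q])."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1), torch.nn.Softplus())  # distances are nonnegative

    def forward(self, e_p, e_q):
        feat = torch.cat([(e_p - e_q).abs(), e_p * e_q], dim=-1)  # symmetric in p, q
        return self.mlp(feat).squeeze(-1)

# Precomputation (one network forward pass per mesh) would produce `embeddings`.
embeddings = torch.randn(10_000, 64)          # hypothetical per-vertex embeddings
decoder = GeodesicDecoder()
p, q = torch.randint(0, 10_000, (2, 4096))    # 4096 point-pair queries at once
dist = decoder(embeddings[p], embeddings[q])  # constant time per pair, batched on GPU
print(dist.shape)
```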
While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.
Diffusion models have achieved promising results in image restoration tasks, yet they suffer from time-consuming inference, excessive computational resource consumption, and unstable restoration. To address these issues, we propose a robust and efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL. Specifically, we present a wavelet-based conditional diffusion model (WCDM) that leverages the generative power of diffusion models to produce results with satisfactory perceptual fidelity. Additionally, it also takes advantage of the strengths of wavelet transformation to greatly accelerate inference and reduce computational resource usage without sacrificing information. To avoid chaotic content and diversity, we perform both forward diffusion and denoising in the training phase of WCDM, enabling the model to achieve stable denoising and reduce randomness during inference. Moreover, we further design a high-frequency restoration module (HFRM) that utilizes the vertical and horizontal details of the image to complement the diagonal information for better fine-grained restoration. Extensive experiments on publicly available real-world benchmarks demonstrate that our method outperforms the existing state-of-the-art methods both quantitatively and visually, and it achieves remarkable improvements in efficiency compared to previous diffusion-based methods. In addition, we empirically show that the application for low-light face detection also reveals the latent practical values of our method. Code is available at https://github.com/JianghaiSCU/Diffusion-Low-Light.
We present a method for interactively authoring and simulating meandering river networks. Starting from a terrain with an initial low-resolution network encoded as a directed graph, we simulate the evolution of the path of the different river channels using a physically-based migration equation augmented with control terms. The curvature-based terms in the equation allow us to reproduce phenomena identified in geomorphology, such as downstream migration of bends. Control terms account for the influence of the landscape topography and user-defined river trajectory constraints. Our model implements abrupt events that shape meandering networks, such as cutoffs forming oxbow lakes and avulsions. We visually show the effectiveness of our method and compare the generated networks quantitatively to river data by analyzing sinuosity and wavelength metrics. Our vector-based model runs at interactive rates, allowing for efficient authoring of large-scale meandering networks.
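A toy NumPy step of curvature-driven centerline migration is sketched below: bends are pushed outward along their normals at a rate scaling with local curvature. The paper's migration equation additionally includes upstream-weighted and control terms, adaptive resampling, and abrupt events such as cutoffs and avulsions, none of which are modeled here.

```python
import numpy as np

def migrate_centerline(P, rate=0.05, steps=50):
    """Toy migration of a 2D centerline P (n, 2): push bends outward at a rate
    proportional to local curvature (no resampling, cutoffs, or control terms)."""
    for _ in range(steps):
        d1 = np.gradient(P, axis=0)                       # derivative w.r.t. vertex index
        d2 = np.gradient(d1, axis=0)
        speed = np.linalg.norm(d1, axis=1) + 1e-9
        t = d1 / speed[:, None]
        n = np.stack([-t[:, 1], t[:, 0]], axis=1)         # left-hand normal
        kappa = (d1[:, 0] * d2[:, 1] - d1[:, 1] * d2[:, 0]) / speed**3  # signed curvature
        disp = -rate * kappa[:, None] * n                 # outward migration amplifies bends
        disp[0] = disp[-1] = 0.0                          # pin the channel endpoints
        P = P + disp
    return P

s = np.linspace(0.0, 40.0, 400)
centerline = np.stack([s, 0.6 * np.sin(0.7 * s)], axis=1)  # gently sinuous initial channel
print(migrate_centerline(centerline)[:3])
```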
We present a novel hybrid Lagrangian/Eulerian method for simulating inelastic flows that generates high-quality particle distributions with adaptive volumes. At its core, our approach integrates an updated Lagrangian time discretization of continuum mechanics with the Power Particle-In-Cell geometric representation of deformable materials. As a result, we obtain material points described by optimized density kernels that precisely track the varying particle volumes both spatially and temporally. For efficient CFL-rate simulations, we also propose an implicit time integration for our system using a non-linear Gauss-Seidel solver inspired by X-PBD, viewing Eulerian nodal velocities as primal variables. We demonstrate the versatility of our method with simulations of mesoscale bubbles, sands, liquid, and foams.
A creative idea is often born from transforming, combining, and modifying ideas from existing visual examples capturing various concepts. However, one cannot simply copy the concept as a whole, and inspiration is achieved by examining certain aspects of the concept. Hence, it is often necessary to separate a concept into different aspects to provide new perspectives. In this paper, we propose a method to decompose a visual concept, represented as a set of images, into different visual aspects encoded in a hierarchical tree structure. We utilize large vision-language models and their rich latent space for concept decomposition and generation. Each node in the tree represents a sub-concept using a learned vector embedding injected into the latent space of a pretrained text-to-image model. We use a set of regularizations to guide the optimization of the embedding vectors encoded in the nodes to follow the hierarchical structure of the tree. Our method allows to explore and discover new concepts derived from the original one. The tree provides the possibility of endless visual sampling at each node, allowing the user to explore the hidden sub-concepts of the object of interest. The learned aspects in each node can be combined within and across trees to create new visual ideas, and can be used in natural language sentences to apply such aspects to new designs.
Project page: https://inspirationtree.github.io/inspirationtree/
We study how to optimize the latent space of neural shape generators that map latent codes to 3D deformable shapes. The key focus is to look at a deformable shape generator from a differential geometry perspective. We define a Riemannian metric based on as-rigid-as-possible and as-conformal-as-possible deformation energies. Under this metric, we study two desired properties of the latent space: 1) straight-line interpolations in latent codes follow geodesic curves; 2) latent codes disentangle pose and shape variations at different scales. Strictly enforcing the geodesic interpolation property, however, is only possible if the metric matrix is constant. We show how to achieve this property approximately by enforcing that geodesic interpolations are axis-aligned, i.e., interpolations along coordinate axes follow geodesic curves. In addition, we introduce a novel approach that decouples pose and shape variations via generalized eigendecomposition. We also study efficient regularization terms for learning deformable shape generators, e.g., terms that promote smooth interpolations. Experimental results on benchmark datasets show that our approach leads to interpretable latent codes, improves the generalizability of synthetic shapes, and enhances performance in geodesic interpolation and geodesic shooting.
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
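A hypothetical PyTorch sketch of such a space-time mapper is given below; the dimensions, timestep encoding, and architecture are illustrative assumptions rather than the paper's exact design.

```python
import torch

class SpaceTimeMapper(torch.nn.Module):
    """Hypothetical sketch: map (denoising timestep, U-Net layer index) to a token
    embedding plus a textual-bypass residual. Dimensions are placeholders."""
    def __init__(self, token_dim=768, hidden=128, n_layers=16, n_timesteps=1000):
        super().__init__()
        self.layer_emb = torch.nn.Embedding(n_layers, hidden)
        self.time_scale = 1.0 / n_timesteps
        self.net = torch.nn.Sequential(
            torch.nn.Linear(hidden + 1, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.SiLU())
        self.to_token = torch.nn.Linear(hidden, token_dim)    # lives in the text-encoder input space
        self.to_bypass = torch.nn.Linear(hidden, token_dim)   # residual added after the text encoder

    def forward(self, t, layer):
        h = torch.cat([self.layer_emb(layer), (t.float() * self.time_scale)[:, None]], dim=-1)
        h = self.net(h)
        return self.to_token(h), self.to_bypass(h)

mapper = SpaceTimeMapper()
t = torch.randint(0, 1000, (8,))
layer = torch.randint(0, 16, (8,))
token, bypass = mapper(t, layer)
print(token.shape, bypass.shape)   # the mapper's parameters are the entire learned concept
```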
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available at https://github.com/zyxElsa/ProSpect.
In this paper, we propose a new method, called DoubleCoverUDF, for extracting the zero level-set from unsigned distance fields (UDFs). DoubleCoverUDF takes a learned UDF and a user-specified parameter r (a small positive real number) as input and extracts an iso-surface with an iso-value r using the conventional marching cubes algorithm. We show that the computed iso-surface is the boundary of the r-offset volume of the target zero level-set S, which is an orientable manifold, regardless of the topology of S. Next, the algorithm computes a covering map to project the boundary mesh onto S, preserving the mesh's topology and avoiding folding. If S is an orientable manifold surface, our algorithm separates the double-layered mesh into a single layer using a robust minimum-cut post-processing step. Otherwise, it keeps the double-layered mesh as the output. We validate our algorithm by reconstructing 3D surfaces of open models and demonstrate its efficacy and effectiveness on synthetic models and benchmark datasets. Our experimental results confirm that our method is robust and produces meshes with better quality in terms of both visual evaluation and quantitative measures than existing UDF-based methods. The source code is available at https://github.com/jjjkkyz/DCUDF.
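A toy NumPy/scikit-image sketch of the first two steps is given below: extracting the r-offset boundary with standard marching cubes, then projecting it onto the zero set. The UDF here is analytic (distance to a sphere), its exact gradient is used for the projection, and the covering-map and minimum-cut stages of the paper are omitted.

```python
import numpy as np
from skimage.measure import marching_cubes

# Hypothetical UDF: unsigned distance to a sphere of radius 0.5 (an orientable S).
n = 96
xs = np.linspace(-1, 1, n)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
udf = np.abs(np.sqrt(X**2 + Y**2 + Z**2) - 0.5)

r = 0.03                                         # user-specified offset iso-value
verts, faces, normals, _ = marching_cubes(udf, level=r, spacing=(xs[1] - xs[0],) * 3)
verts += xs[0]                                   # grid coordinates -> world coordinates

# Project the double-layered offset boundary onto the zero level-set by walking
# a distance r against the UDF gradient (here the gradient of the analytic UDF).
norm = np.linalg.norm(verts, axis=1, keepdims=True)
grad = np.sign(norm - 0.5) * verts / norm
projected = verts - r * grad
print(faces.shape, np.abs(np.linalg.norm(projected, axis=1) - 0.5).max())
```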
We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation - multi-implicit-submap. Different from existing neural RGB-D reconstruction methods lacking either flexibility with a single neural map or scalability due to extra storage of feature grids, we propose a pure neural representation tackling both difficulties with a divide-and-conquer design. In our method, neural submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local neural bundle adjustments. The submaps can be refined individually in a back-end optimization and optimized jointly to realize submap-level loop closure. Meanwhile, we propose a hybrid tracking approach combining randomized and gradient-based pose optimizations. For the first time, randomized optimization is made possible in neural tracking with several key designs in the learning process, enabling efficient and robust tracking even under fast camera motions. The extensive evaluation demonstrates that our method attains higher reconstruction quality than the state of the art for large-scale scenes and under fast camera motions.
We present ClothCombo, a pipeline to drape arbitrary combinations of clothes on 3D human models with varying body shapes and poses. While existing learning-based approaches for draping clothes have shown promising results, multi-layered clothing remains challenging as it is non-trivial to model inter-cloth interaction. To this end, our method utilizes a GNN-based network to efficiently model the interaction between clothes in different layers, thus enabling multi-layered clothing. Specifically, we first create feature embedding for each cloth using a topology-agnostic network. Then, the draping network deforms all clothes to fit the target body shape and pose without considering inter-cloth interaction. Lastly, the untangling network predicts the per-vertex displacements in a way that resolves interpenetration between clothes. In experiments, the proposed model demonstrates strong performance in complex multi-layered scenarios. Being agnostic to cloth topology, our method can be readily used for layered virtual try-on of real clothes in diverse poses and combinations of clothes.
We introduce Neural Flow Maps, a novel simulation method bridging the emerging paradigm of implicit neural representations with fluid simulation based on the theory of flow maps, to achieve state-of-the-art simulation of inviscid fluid phenomena. We devise a novel hybrid neural field representation, Spatially Sparse Neural Fields (SSNF), which fuses small neural networks with a pyramid of overlapping, multi-resolution, and spatially sparse grids, to compactly represent long-term spatiotemporal velocity fields at high accuracy. With this neural velocity buffer in hand, we compute long-term, bidirectional flow maps and their Jacobians in a mechanistically symmetric manner, to facilitate drastic accuracy improvement over existing solutions. These long-range, bidirectional flow maps enable high advection accuracy with low dissipation, which in turn facilitates high-fidelity incompressible flow simulations that manifest intricate vortical structures. We demonstrate the efficacy of our neural fluid simulation in a variety of challenging simulation scenarios, including leapfrogging vortices, colliding vortices, vortex reconnections, as well as vortex generation from moving obstacles and density differences. Our examples show increased performance over existing methods in terms of energy conservation, visual complexity, adherence to experimental observations, and preservation of detailed vortical structures.
Today's commodity camera systems rely on compound optics to map light originating from the scene to positions on the sensor where it gets recorded as an image. To record images without optical aberrations, i.e., deviations from Gauss' linear model of optics, typical lens systems introduce increasingly complex stacks of optical elements which are responsible for the height of existing commodity cameras. In this work, we investigate flat nanophotonic computational cameras as an alternative that employs an array of skewed lenslets and a learned reconstruction approach. The optical array is embedded on a metasurface that, at 700 nm height, is flat and sits on the sensor cover glass at 2.5 mm focal distance from the sensor. To tackle the highly chromatic response of a metasurface and design the array over the entire sensor, we propose a differentiable optimization method that continuously samples over the visible spectrum and factorizes the optical modulation for different incident fields into individual lenses. We reconstruct a megapixel image from our flat imager with a learned probabilistic reconstruction method that employs a generative diffusion model to sample an implicit prior. To tackle scene-dependent aberrations in broadband, we propose a method for acquiring paired captured training data in varying illumination conditions. We assess the proposed flat camera design in simulation and with an experimental prototype, validating that the method is capable of recovering images from diverse scenes in broadband with a single nanophotonic layer.
We introduce an active 3D reconstruction method which integrates visual perception, robot-object interaction, and 3D scanning to recover both the exterior and interior, i.e., unexposed, geometries of a target 3D object. Unlike other works in active vision which focus on optimizing camera viewpoints to better investigate the environment, the primary feature of our reconstruction is an analysis of the interactability of various parts of the target object and the ensuing part manipulation by a robot to enable scanning of occluded regions. As a result, an understanding of part articulations of the target object is obtained on top of complete geometry acquisition. Our method operates fully automatically by a Fetch robot with built-in RGBD sensors. It iterates between interaction analysis and interaction-driven reconstruction, scanning and reconstructing detected moveable parts one at a time, where both the articulated part detection and mesh reconstruction are carried out by neural networks. In the final step, all the remaining, non-articulated parts, including all the interior structures that had been exposed by prior part manipulations and subsequently scanned, are reconstructed to complete the acquisition. We demonstrate the performance of our method via qualitative and quantitative evaluation, ablation studies, comparisons to alternatives, as well as experiments in a real environment.
Autonomous surface reconstruction of 3D scenes has been intensely studied in recent years; however, it is still difficult to accurately reconstruct all the surface details of complex scenes with complicated object relations and severe occlusions, which makes the reconstruction results unsuitable for direct use in applications such as gaming and virtual reality. Therefore, instead of reconstructing the detailed surfaces, we aim to recompose the scene with CAD models retrieved from a given dataset to faithfully reflect the object geometry and arrangement in the given scene. Moreover, unlike most of the previous works on scene CAD recomposition requiring an offline reconstructed scene or captured video as input, which leads to significant data redundancy, we propose a novel online scene CAD recomposition method with autonomous scanning, which efficiently recomposes the scene with the guidance of automatically optimized Next-Best-View (NBV) in a single online scanning pass. Based on the key observation that spatial relation in the scene can not only constrain the object pose and layout optimization but also guide the NBV generation, our system consists of two key modules: a relation-guided CAD recomposition module that uses relation-constrained global optimization to get accurate object pose and layout estimation, and a relation-aware NBV generation module that makes the exploration during the autonomous scanning tailored for our composition task. Extensive experiments have been conducted to show the superiority of our method over previous methods in scanning efficiency and retrieval accuracy as well as the importance of each key component of our method.
We address the problem of scene-aware activity program generation, which requires decomposing a given activity task into instructions that can be sequentially performed within a target scene to complete the activity. While existing methods have shown the ability to generate rational or executable programs, generating programs with both high rationality and executability still remains a challenge. Hence, we propose a novel method where the key idea is to explicitly combine the language rationality of a powerful language model with dynamic perception of the target scene where instructions are executed, to generate programs with high rationality and executability. Our method iteratively generates instructions for the activity program. Specifically, a two-branch feature encoder operates on a language-based and graph-based representation of the current generation progress to extract language features and scene graph features, respectively. These features are then used by a predictor to generate the next instruction in the program. Subsequently, another module performs the predicted action and updates the scene for perception in the next iteration. Extensive evaluations are conducted on the VirtualHome-Env dataset, showing the advantages of our method over previous work. Key algorithmic designs are validated through ablation studies, and results on other types of inputs are also presented to show the generalizability of our method.
Great progress has been made in estimating 3D human pose and shape from images and video by training neural networks to directly regress the parameters of parametric human models like SMPL. However, existing body models have simplified kinematic structures that do not correspond to the true joint locations and articulations in the human skeletal system, limiting their potential use in biomechanics. On the other hand, methods for estimating biomechanically accurate skeletal motion typically rely on complex motion capture systems and expensive optimization methods. What is needed is a parametric 3D human model with a biomechanically accurate skeletal structure that can be easily posed. To that end, we develop SKEL, which re-rigs the SMPL body model with a biomechanics skeleton. To enable this, we need training data of skeletons inside SMPL meshes in diverse poses. We build such a dataset by optimizing biomechanically accurate skeletons inside SMPL meshes from AMASS sequences. We then learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations. Finally, we re-parametrize the SMPL mesh with the new kinematic parameters. The resulting SKEL model is animatable like SMPL but with fewer, and biomechanically-realistic, degrees of freedom. We show that SKEL has more biomechanically accurate joint locations than SMPL, and the bones fit inside the body surface better than previous methods. By fitting SKEL to SMPL meshes we are able to "upgrade" existing human pose and shape datasets to include biomechanical parameters. SKEL provides a new tool to enable biomechanics in the wild, while also providing vision and graphics researchers with a better constrained and more realistic model of human articulation. The model, code, and data are available for research at https://skel.is.tue.mpg.de.
We present the first large-scale database of measured spatially-varying anisotropic reflectance, consisting of 1,000 high-quality near-planar SVBRDFs, spanning 9 material categories such as wood, fabric and metal. Each sample is captured in 15 minutes, and represented as a set of high-resolution texture maps that correspond to spatially-varying BRDF parameters and local frames. To build this database, we develop a novel integrated system for robust, high-quality, and high-efficiency reflectance acquisition and reconstruction. Our setup consists of 2 cameras and 16,384 LEDs. We train 64 lighting patterns for efficient acquisition, in conjunction with a network that predicts per-point reflectance in a neural representation from carefully aligned two-view measurements captured under the patterns. The intermediate results are further fine-tuned with respect to the photographs acquired under 63 effective linear lights, and finally fitted to a BRDF model. We report various statistics of the database, and demonstrate its value in the applications of material generation, classification, as well as sampling. All related data, including future additions to the database, can be downloaded from https://opensvbrdf.github.io/.
We propose a variational technique to optimize for generalized barycentric coordinates that offers additional control compared to existing models. Prior work represents barycentric coordinates using meshes or closed-form formulae, in practice limiting the choice of objective function. In contrast, we directly parameterize the continuous function that maps any coordinate in a polytope's interior to its barycentric coordinates using a neural field. This formulation is enabled by our theoretical characterization of barycentric coordinates, which allows us to construct neural fields that parameterize the entire function class of valid coordinates. We demonstrate the flexibility of our model using a variety of objective functions, including multiple smoothness and deformation-aware energies; as a side contribution, we also present mathematically-justified means of measuring and minimizing objectives like total variation on discontinuous neural fields. We offer a practical acceleration strategy, present a thorough validation of our algorithm, and demonstrate several applications.
Straight flat strips of inextensible material can be bent into curved strips aligned with arbitrary space curves. The large shape variety of these so-called rectifying strips makes them candidates for shape modeling, especially in applications such as architecture where simple elements are preferred for the fabrication of complex shapes. In this paper, we provide computational tools for the design of shapes from rectifying strips. They can form various patterns and fulfill constraints which are required for specific applications such as gridshells or shading systems. The methodology is based on discrete models of rectifying strips, a discrete level-set formulation and optimization-based constrained mesh design and editing. We also analyse the geometry at nodes and present remarkable quadrilateral arrangements of rectifying strips with torsion-free nodes.
Complex visual effects such as caustics are often produced by light paths containing multiple consecutive specular vertices (dubbed specular chains), which pose a challenge to unbiased estimation in Monte Carlo rendering. In this work, we study the light transport behavior within a sub-path that is comprised of a specular chain and two non-specular separators. We show that the specular manifolds formed by all the sub-paths could be exploited to provide coherence among sub-paths. By reconstructing continuous energy distributions from historical and coherent sub-paths, seed chains can be generated in the context of importance sampling and converge to admissible chains through manifold walks. We verify that importance sampling the seed chain in the continuous space reaches the goal of importance sampling the discrete admissible specular chain. Based on these observations and theoretical analyses, a progressive pipeline, manifold path guiding, is designed and implemented to importance sample challenging paths featuring long specular chains. To the best of our knowledge, this is the first general framework for importance sampling discrete specular chains in regular Monte Carlo rendering. Extensive experiments demonstrate that our method outperforms state-of-the-art unbiased solutions with up to 40× variance reduction, especially in typical scenes containing long specular chains and complex visibility.
We present a novel algorithm for implementing Owen scrambling, combining the generation and distribution of the scrambling bits in a single self-contained compact process. We employ a context-free grammar to build a binary tree of symbols, and equip each symbol with a scrambling code that affects all descendant nodes. We nominate the grammar of adaptive regular tiles (ART) derived from the repetition-avoiding Thue-Morse word, and we discuss its potential advantages and shortcomings. Our algorithm has many advantages, including random access to samples, fixed time complexity, GPU friendliness, and scalability to any memory budget. Further, it provides two unique features over known methods: it admits optimization, and it is invertible, enabling screen-space scrambling of the high-dimensional Sobol sampler.
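For context, the snippet below shows conventional hash-based ("nested uniform") Owen scrambling, in which each bit is flipped by a hash of the more significant bits; it is a baseline illustration, not the grammar-based construction proposed above, and the particular integer hash is an arbitrary choice.

```python
# Baseline hash-based Owen scrambling of the binary fraction of a [0,1) sample
# (not the ART grammar construction): each bit flip is determined by a hash of
# the prefix of more significant bits, i.e. by the node in the binary tree.
def _hash32(x: int) -> int:
    # a simple 32-bit integer mix; any good hash would do here (assumption)
    x = (x ^ (x >> 16)) * 0x7feb352d & 0xffffffff
    x = (x ^ (x >> 15)) * 0x846ca68b & 0xffffffff
    return (x ^ (x >> 16)) & 0xffffffff

def owen_scramble(bits: int, seed: int, nbits: int = 32) -> int:
    out = 0
    for i in range(nbits):                     # most significant bit first
        high = bits >> (nbits - i)             # the bits above level i (the tree node)
        flip = _hash32(high ^ _hash32(seed ^ i)) & 1
        bit = (bits >> (nbits - 1 - i)) & 1
        out |= (bit ^ flip) << (nbits - 1 - i)
    return out

# usage: scramble a sample value u in [0,1)
u = 0.3
scrambled = owen_scramble(int(u * 2**32), seed=12345) / 2**32
```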
We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions --- an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose the idea of synthesizing image twins from a single text prompt --- a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.
Neural radiance fields achieve unprecedented quality for novel view synthesis, but their volumetric formulation remains expensive, requiring a huge number of samples to render high-resolution images. Volumetric encodings are essential to represent fuzzy geometry such as foliage and hair, and they are well-suited for stochastic optimization. Yet, many scenes ultimately consist largely of solid surfaces which can be accurately rendered by a single sample per pixel. Based on this insight, we propose a neural radiance formulation that smoothly transitions between volumetric- and surface-based rendering, greatly accelerating rendering speed and even improving visual fidelity. Our method constructs an explicit mesh envelope which spatially bounds a neural volumetric representation. In solid regions, the envelope nearly converges to a surface and can often be rendered with a single sample. To this end, we generalize the NeuS [Wang et al. 2021] formulation with a learned spatially-varying kernel size which encodes the spread of the density, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions. We then extract an explicit mesh of a narrow band around the surface, with width determined by the kernel size, and fine-tune the radiance field within this band. At inference time, we cast rays against the mesh and evaluate the radiance field only within the enclosed region, greatly reducing the number of samples required. Experiments show that our approach enables efficient rendering at very high fidelity. We also demonstrate that the extracted envelope enables downstream applications such as animation and simulation.
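The kernel-size idea can be illustrated with the NeuS-style discrete opacity, here with a per-sample sharpness in place of a global one; this is a simplified sketch, and the SDF samples and sharpness values are synthetic stand-ins for the learned quantities.

```python
# Sketch of a NeuS-style opacity along a ray with a spatially-varying kernel size s(x)
# (the inverse standard deviation of the logistic kernel); for illustration only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ray_alphas(sdf_vals, s_vals):
    """sdf_vals: SDF at consecutive ray samples; s_vals: per-sample kernel sharpness."""
    cdf = sigmoid(s_vals * sdf_vals)
    return np.clip((cdf[:-1] - cdf[1:]) / np.maximum(cdf[:-1], 1e-6), 0.0, 1.0)

sdf = np.linspace(0.5, -0.5, 64)        # a ray crossing a surface
a_tight = ray_alphas(sdf, np.full(64, 200.0))   # tight kernel: opacity concentrates at the crossing
a_wide = ray_alphas(sdf, np.full(64, 5.0))      # wide kernel: opacity spreads over a band
```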
High-quality large-scale scene rendering requires a scalable representation and accurate camera poses. This research combines tile-based hybrid neural fields with parallel distributive optimization to improve bundle-adjusting neural radiance fields. The proposed method scales with a divide-and-conquer strategy. We partition scenes into tiles, each with a multi-resolution hash feature grid and shallow chained diffuse and specular multilayer perceptrons (MLPs). Tiles unify foreground and background via a spatial contraction function that allows representing both distant objects in outdoor scenes and planar reflections (as virtual images) outside the tile. Decomposing appearance with the specular MLP allows a specular-aware warping loss to provide a second optimization path for camera poses. We apply the alternating direction method of multipliers (ADMM) to achieve consensus among camera poses while maintaining parallel tile optimization. Experimental results show that our method outperforms state-of-the-art neural scene rendering methods by 5%--10% in PSNR, maintaining sharp distant objects and view-dependent reflections across six indoor and outdoor scenes.
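The consensus step can be pictured with a generic consensus-ADMM loop in which every tile keeps its own copy of shared camera parameters; the per-tile solver below is a quadratic toy stand-in for the actual bundle-adjustment subproblem.

```python
# Generic consensus ADMM: per-tile copies x_k of shared camera parameters are driven
# toward a common consensus z. local_fits[k] stands in for the per-tile solver that
# minimizes f_k(x) + rho/2 * ||x - (z - u_k)||^2 (an assumption for illustration).
import numpy as np

def admm_consensus(local_fits, x0, rho=1.0, iters=50):
    K, d = len(local_fits), x0.size
    x = np.tile(x0, (K, 1))          # per-tile copies
    u = np.zeros((K, d))             # scaled dual variables
    z = x0.copy()                    # consensus variable
    for _ in range(iters):
        for k in range(K):
            x[k] = local_fits[k](z - u[k], rho)
        z = (x + u).mean(axis=0)     # averaging (consensus) step
        u += x - z                   # dual update
    return z

# toy example: each tile prefers a slightly different pose vector (quadratic objectives)
targets = [np.array([0.1, 0.0]), np.array([-0.05, 0.02]), np.array([0.0, -0.03])]
fits = [lambda v, rho, t=t: (t + rho * v) / (1.0 + rho) for t in targets]
pose = admm_consensus(fits, np.zeros(2))
```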
Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects (e.g., two hands or humans interacting with rigid environments). Modelling dense non-rigid object deformations in this setting (e.g., when hands are interacting with a face) has remained largely unaddressed so far, although such effects can improve the realism of downstream applications such as AR/VR, 3D virtual avatar communications, and character animations. This is due to the severe ill-posedness of the monocular view setting and the associated challenges (e.g., in acquiring a dataset for training and evaluation, or in obtaining a reasonable non-uniform stiffness of the deformable object). While it is possible to naïvely track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural intra-object collisions, and missing or implausible deformations.
Hence, this paper introduces the first method that addresses the fundamental challenges outlined above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible than those of several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf
DSLR cameras can achieve multiple zoom levels by shifting lens distances or swapping lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems crop and digitally upsample images from W, leading to significant detail loss. In this paper, we propose an efficient system for hybrid zoom super-resolution on mobile devices, which captures a synchronous pair of W and T shots and leverages machine learning models to align and transfer details from T to W. We further develop an adaptive blending method that accounts for depth-of-field mismatches, scene occlusion, flow uncertainty, and alignment errors. To minimize the domain gap, we design a dual-phone camera rig to capture real-world inputs and ground truths for supervised training. Our method generates a 12-megapixel image in 500 ms on a mobile platform and compares favorably against state-of-the-art methods in extensive evaluation on real-world scenarios.
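The adaptive blending can be sketched as a per-pixel convex combination of the upsampled Wide frame and the warped Telephoto frame, down-weighted by occlusion and alignment-error cues; the weighting form and parameters below are hypothetical stand-ins, not the paper's learned blending.

```python
# Illustrative per-pixel blend: trust the warped Telephoto detail less where occlusion
# or alignment error is detected. The weighting terms are hypothetical placeholders.
import numpy as np

def adaptive_blend(wide_up, tele_warp, occlusion_mask, align_err, sigma=2.0):
    """wide_up, tele_warp: HxWx3 images; occlusion_mask: 1 where T is occluded; align_err: HxW."""
    w = np.exp(-(align_err / sigma) ** 2)   # down-weight T where alignment error is large
    w = w * (1.0 - occlusion_mask)          # never use T in occluded regions
    return w[..., None] * tele_warp + (1.0 - w[..., None]) * wide_up
```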
We introduce SLANG.D, an extension to the Slang shading language that incorporates first-class automatic differentiation support. The new shading language allows us to transform a Direct3D-based path tracer to be fully differentiable with minor modifications to existing code. SLANG.D enables a shared ecosystem between machine learning frameworks and pre-existing graphics hardware API-based rendering systems, promoting the interchange of components and ideas across these two domains.
Our contributions include a differentiable type system designed to ensure type safety and semantic clarity in codebases that blend differentiable and non-differentiable code, language primitives that automatically generate both forward and reverse gradient propagation methods, and a compiler architecture that generates efficient derivative propagation shader code for graphics pipelines. Our compiler supports differentiating code that involves arbitrary control-flow, dynamic dispatch, generics and higher-order differentiation, while providing developers flexible control of checkpointing and gradient aggregation strategies for best performance. Our system allows us to differentiate an existing real-time path tracer, Falcor, with minimal change to its shader code. We show that the compiler-generated derivative kernels perform as efficiently as handwritten ones. In several benchmarks, the SLANG.D code achieves significant speedup when compared to prior automatic differentiation systems.
The use of version control is pervasive in collaborative software projects. Version control systems are based on two primary operations: diffing two versions to compute the change between them, and merging two versions edited concurrently. Recent works provide solutions to diff and merge graphics assets such as images, meshes and scenes. In this work, we present a practical algorithm to diff and merge procedural programs written as node graphs. To obtain more precise diffs, we version the graphs directly rather than their textual representations. Diffing graphs is equivalent to computing the graph edit distance, which is NP-hard and computationally infeasible in general. Following prior work, we propose an approximate algorithm tailored to our problem domain. We validate the proposed algorithm by applying it both to manual edits and to a large set of randomized modifications of procedural shapes and materials. We compare our method with existing state-of-the-art algorithms, showing that our approach is the only one that reliably detects user edits.
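To convey the flavor of an approximate graph diff, the toy below greedily matches nodes of equal type by parameter agreement and reports added, removed, and changed nodes; it is not the paper's tailored approximation of graph edit distance, and the node format is an assumption.

```python
# Toy greedy graph diff (illustration only). Node format assumed: {id: (type, params_dict)}.
def diff_graphs(a, b):
    matched, changed = {}, []
    for ida, (ta, pa) in a.items():
        # greedily match to an unmatched node of the same type with the most equal parameters
        best, best_score = None, -1
        for idb, (tb, pb) in b.items():
            if idb in matched.values() or tb != ta:
                continue
            score = sum(pa.get(k) == pb.get(k) for k in pa)
            if score > best_score:
                best, best_score = idb, score
        if best is not None:
            matched[ida] = best
            if a[ida][1] != b[best][1]:
                changed.append((ida, best))
    removed = [i for i in a if i not in matched]
    added = [j for j in b if j not in matched.values()]
    return {"changed": changed, "removed": removed, "added": added}

g1 = {"n1": ("noise", {"scale": 2}), "n2": ("blend", {"mode": "add"})}
g2 = {"n1": ("noise", {"scale": 3}), "n3": ("blur", {"radius": 1})}
print(diff_graphs(g1, g2))   # n1 changed, n2 removed, n3 added
```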
2D irregular shape packing is a necessary step to arrange UV patches of a 3D model within a texture atlas for memory-efficient appearance rendering in computer graphics. As a joint combinatorial decision-making problem involving all patch positions and orientations, it is well known to be NP-hard. Prior solutions either assume a heuristic packing order or modify the upstream mesh cut and UV mapping to simplify the problem, which either limits the packing ratio or incurs robustness or generality issues. Instead, we introduce a learning-assisted 2D irregular shape packing method that achieves high packing quality with minimal requirements on the input. Our method iteratively selects and groups subsets of UV patches into near-rectangular super patches, essentially reducing the problem to bin packing, based on which a joint optimization is employed to further improve the packing ratio. In order to efficiently deal with large problem instances with hundreds of patches, we train deep neural policies to predict nearly rectangular patch subsets and determine their relative poses, leading to linear time scaling with the number of patches. We demonstrate the effectiveness of our method on three datasets for UV packing, where our method achieves a higher packing ratio than several widely used baselines with competitive computational speed.
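Once patches have been grouped into near-rectangular super patches, the residual problem resembles rectangle packing; the following shelf-packing sketch illustrates that reduction (it is not the joint optimization or the learned policy described above).

```python
# Minimal shelf packing of axis-aligned rectangles into a square atlas of side `atlas`,
# to illustrate the bin-packing subproblem; not the paper's method.
def shelf_pack(rects, atlas):
    """rects: list of (w, h); returns (x, y) origins in decreasing-height order, None if no fit."""
    placements, x, y, shelf_h = [], 0.0, 0.0, 0.0
    for w, h in sorted(rects, key=lambda r: -r[1]):   # tallest first
        if x + w > atlas:                             # current shelf full: open a new one
            x, y, shelf_h = 0.0, y + shelf_h, 0.0
        if w > atlas or y + h > atlas:                # rectangle cannot fit in the atlas
            placements.append(None)
            continue
        placements.append((x, y))
        x += w
        shelf_h = max(shelf_h, h)
    return placements

print(shelf_pack([(0.5, 0.4), (0.6, 0.3), (0.3, 0.3)], atlas=1.0))
```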
We present a novel learning framework to solve the transport-and-packing (TAP) problem in 3D. It constitutes a full solution pipeline that proceeds from partial observations of input objects, obtained via RGBD sensing and recognition, through robotic motion planning, to final box placement that arrives at a compact packing in a target container. The technical core of our method is a neural network for TAP, trained via reinforcement learning (RL), to solve the NP-hard combinatorial optimization problem. Our network simultaneously selects an object to pack and determines the final packing location, based on a judicious encoding of the continuously evolving states of partially observed source objects and available spaces in the target container, using separate encoders, both equipped with attention mechanisms. The encoded feature vectors are employed to compute the matching scores and feasibility masks of different pairings of box selection and available space configuration for packing strategy optimization. Extensive experiments, including ablation studies and physical packing execution by a real robot (Universal Robot UR5e), are conducted to evaluate our method in terms of its design choices, scalability, generalizability, and comparisons to baselines, including the most recent RL-based TAP solution. We also contribute the first benchmark for TAP, which covers a variety of input settings and difficulty levels.
We propose a method for reconstructing 3D man-made shapes from bitmap sketches by separating an input image into individual patches and jointly optimizing their geometry. We rely on two main observations: (1) human observers interpret sketches of man-made shapes as a collection of simple geometric primitives, and (2) sketch strokes often indicate occlusion contours or sharp ridges between those primitives. Using these observations, we design a system that takes a single bitmap image of a shape, estimates image depth and segmentation into primitives with neural networks, then fits primitives to the predicted depth while determining occlusion contours and aligning intersections with the input drawing via optimization. Unlike previous work, our approach does not require additional input, annotation, or templates, and does not require retraining for a new category of man-made shapes. Our method produces triangular meshes that display sharp geometric features and are suitable for downstream applications such as editing, rendering, and shading.
Eyebrows play a critical role in facial expression and appearance. Although the 3D digitization of faces is well explored, far less attention has been paid to 3D eyebrow modeling. In this work, we propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction. Following methods for scalp hair reconstruction, we represent the eyebrow as a set of fiber curves and cast reconstruction as a fiber-growing problem. Three modules are then carefully designed: RootFinder first localizes the fiber root positions, which indicate where to grow; OriPredictor predicts an orientation field in 3D space to guide the growth of fibers; FiberEnder determines when to stop the growth of each fiber. Our OriPredictor directly borrows the method used in hair reconstruction; considering the differences between hair and eyebrows, RootFinder and FiberEnder are newly proposed. Specifically, to cope with the challenge that root locations are severely occluded, we formulate root localization as a density map estimation task. Given the predicted density map, a density-based clustering method is used to find the roots. For each fiber, growth starts from the root point and proceeds step by step until termination, where each step is an oriented line segment of constant length following the predicted orientation field. To determine when to end, a pixel-aligned RNN architecture forms a binary classifier that outputs a stop-or-continue decision at each growing step. To support the training of all proposed networks, we build the first 3D synthetic eyebrow dataset, containing 400 high-quality eyebrow models manually created by artists. Extensive experiments demonstrate the effectiveness of the proposed EMS pipeline on a variety of eyebrow styles and lengths, ranging from short and sparse to long and bushy eyebrows.
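The growth loop itself is simple to sketch: starting at a root, each fiber advances by a constant-length segment along the orientation field until a stop decision fires. The field and stop rule below are toy placeholders for OriPredictor and FiberEnder.

```python
# Sketch of step-wise fiber growth along an orientation field; the field and the stop
# classifier are placeholder functions, not the paper's trained networks.
import numpy as np

def grow_fiber(root, orientation_field, should_stop, step=0.5, max_steps=200):
    pts = [np.asarray(root, dtype=float)]
    for _ in range(max_steps):
        d = orientation_field(pts[-1])
        d = d / (np.linalg.norm(d) + 1e-8)      # unit growth direction at the current tip
        pts.append(pts[-1] + step * d)          # advance by a fixed-length segment
        if should_stop(pts):
            break
    return np.stack(pts)

# toy placeholders: a constant field and a length-based stop rule
fiber = grow_fiber([0, 0, 0],
                   orientation_field=lambda p: np.array([1.0, 0.2, 0.0]),
                   should_stop=lambda pts: len(pts) > 20)
```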
Despite recent successes in hair acquisition that fit a high-dimensional hair model to a specific input subject, generative hair models, which establish general embedding spaces for encoding, editing, and sampling diverse hairstyles, remain far less explored. In this paper, we present GroomGen, the first generative model designed for hair geometry composed of highly detailed dense strands. Our approach is motivated by two key ideas. First, we construct hair latent spaces covering both individual strands and hairstyles. The latent spaces are compact, expressive, and well-constrained for high-quality and diverse sampling. Second, we adopt a hierarchical hair representation that parameterizes a complete hair model at three levels: single strands, sparse guide hairs, and complete dense hairs. This representation is critical to the compactness of the latent spaces, the robustness of training, and the efficiency of inference. Based on this hierarchical latent representation, our pipeline consists of a strand-VAE and a hairstyle-VAE that encode an individual strand and a set of guide hairs into their respective latent spaces, and a hybrid densification step that populates sparse guide hairs into a dense hair model. GroomGen not only enables novel hairstyle sampling and plausible hairstyle interpolation, but also supports interactive editing of complex hairstyles and can serve as a strong data-driven prior for hairstyle reconstruction from images. We demonstrate the superiority of our approach with qualitative examples of diverse sampled hairstyles and quantitative evaluation of generation quality for each individual component and the entire pipeline.
We introduce Doppler time-of-flight (D-ToF) rendering, an extension of ToF rendering for dynamic scenes, with applications in simulating D-ToF cameras. D-ToF cameras use high-frequency modulation of illumination and exposure, and measure the Doppler frequency shift to compute the radial velocity of dynamic objects. The time-varying scene geometry and high-frequency modulation functions used in such cameras make it challenging to accurately and efficiently simulate their measurements with existing ToF rendering algorithms. We overcome these challenges in a twofold manner: To achieve accuracy, we derive path integral expressions for D-ToF measurements under global illumination and form unbiased Monte Carlo estimates of these integrals. To achieve efficiency, we develop a tailored time-path sampling technique that combines antithetic time sampling with correlated path sampling. We show experimentally that our sampling technique achieves up to two orders of magnitude lower variance compared to naive time-path sampling. We provide an open-source simulator that serves as a digital twin for D-ToF imaging systems, allowing imaging researchers, for the first time, to investigate the impact of modulation functions, material properties, and global illumination on D-ToF imaging performance.
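The antithetic ingredient can be illustrated in isolation: time samples over the exposure window are paired with their mirrored counterparts so that variance from high-frequency modulation partially cancels. The integrand below is a toy stand-in for the full time-path contribution, not the paper's estimator.

```python
# Antithetic time sampling for an exposure-time integral: each sample t is averaged with
# its mirrored partner T - t. g(t) is a toy integrand standing in for the time-path term.
import numpy as np

def estimate(g, T, n, antithetic=True, seed=0):
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, T, n)
    vals = 0.5 * (g(t) + g(T - t)) if antithetic else g(t)
    return T * vals.mean()

g = lambda t: 1.0 + np.cos(200.0 * t)   # a modulated toy integrand
print(estimate(g, T=1e-2, n=1024), estimate(g, T=1e-2, n=1024, antithetic=False))
```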
Artists often need to reshape 3D models of human-made objects by changing the relative proportions or scales of different model parts or elements while preserving the look and structure of the inputs. Manually reshaping inputs to satisfy these criteria is highly time-consuming; the edit in our teaser took an artist 5 hours to complete. However, existing methods for 3D shape editing are largely designed for other tasks and produce undesirable outputs when repurposed for reshaping. Prior work on 2D curve network reshaping suggests that in 2D settings the user-expected outcome is achieved when the reshaping edit keeps the orientations of the different model elements and when these elements scale as-locally-uniformly-as-possible (ALUP). However, our observations suggest that in 3D, viewers tolerate non-uniform tangential scaling when this scaling preserves slippage and reduces changes in element size, or scale, relative to the input. Slippage preservation requires surfaces that are locally slippable with respect to a given rigid motion to retain this property post-reshaping (a motion is slippable if, when applied to the surface, it slides the surface along itself without gaps). We build on these observations by first extending the 2D ALUP framework to 3D and then modifying it to allow non-uniform scaling while promoting slippage and scale preservation. Our 3D ALUP extension produces reshaped outputs better aligned with viewer expectations than prior alternatives; our slippage-aware method further improves the outcome, producing results on par with manually reshaped ones. Our method does not require any user input beyond specifying control handles and their target locations. We validate our method by applying it to over one hundred diverse inputs and by comparing our results to those generated by alternative approaches and manually. Comparative study participants preferred our outputs over those of the best-performing traditional deformation method by a 65% margin and over those of our 3D ALUP extension by a 61% margin; they judged our outputs as at least on par with manually produced ones.
This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras. The difficulty arises from noisy or false 2D keypoint detections due to inter-person occlusion, heavy ambiguity in associating keypoints to individuals due to the close interactions, and the scarcity of training data, as collecting and annotating motion data in crowded scenes is resource-intensive. We introduce a novel system to address these challenges. Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies. The pose estimation component takes multi-view 2D keypoint heatmaps as input and reconstructs the pose of each individual using a 3D conditional volumetric network. Because the network does not require images as input, we can leverage known camera parameters from test scenes and a large quantity of existing motion capture data to synthesize massive training data that mimics the real data distribution in the test scenes. Extensive experiments demonstrate that our approach significantly surpasses previous approaches in terms of pose accuracy and is generalizable across various camera setups and population sizes. The code is available on our project page: https://github.com/zju3dv/CloseMoCap.
Neural implicit representation is a promising approach for reconstructing surfaces from point clouds. Existing methods combine various regularization terms, such as the Eikonal and Laplacian energy terms, to enforce the learned neural function to possess the properties of a signed distance function (SDF). However, inferring the actual topology and geometry of the underlying surface from poor-quality unoriented point clouds remains challenging. Differential geometry tells us that the Hessian of the SDF is singular for points within the differential thin-shell space surrounding the surface. Our approach therefore enforces the Hessian of the neural implicit function to have a zero determinant for points near the surface. This technique aligns the gradients of a near-surface point and its on-surface projection point, producing a rough but faithful shape within just a few iterations. By annealing the weight of the singular-Hessian term, our approach ultimately produces a high-fidelity reconstruction result. Extensive experimental results demonstrate that our approach effectively suppresses ghost geometry and recovers details from unoriented point clouds with better expressiveness than existing fitting-based methods.
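The singular-Hessian term can be sketched with automatic differentiation: penalize |det H(f)| of a neural implicit function at near-surface samples. The tiny MLP and the sampling below are placeholders for the full reconstruction pipeline.

```python
# Sketch of a singular-Hessian penalty: compute the Hessian of a neural implicit f at
# near-surface points via autograd and penalize the absolute determinant. The MLP and
# the sample distribution are stand-ins for the actual training setup.
import torch

f = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(beta=100),
                        torch.nn.Linear(64, 1))

def singular_hessian_loss(x):
    x = x.requires_grad_(True)
    y = f(x).sum()
    g = torch.autograd.grad(y, x, create_graph=True)[0]                 # per-point gradient (N, 3)
    rows = [torch.autograd.grad(g[:, i].sum(), x, create_graph=True)[0]
            for i in range(3)]                                          # per-point Hessian rows
    H = torch.stack(rows, dim=1)                                        # (N, 3, 3)
    return torch.det(H).abs().mean()

near_surface = torch.randn(128, 3) * 0.05   # stand-in near-surface samples
loss = singular_hessian_loss(near_surface)
loss.backward()
```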