SIGGRAPH Conference Papers '24: Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers '24


SESSION: Vector Graphics

QT-Font: High-efficiency Font Synthesis via Quadtree-based Diffusion Models

Few-shot font generation (FFG) aims to streamline the manual aspects of the font design process. Existing models are capable of generating glyph images in the same style as a few input reference glyphs. However, mainly due to their inefficient glyph representations, these existing FFG methods are limited to generating low-resolution glyph images. To address this problem, we introduce QT-Font, an efficient quadtree-based diffusion model specifically designed for FFG. More specifically, we design a sparse quadtree-based glyph representation that reduces the complexity of the representation space while exhibiting linear complexity and uniqueness. Concurrently, to reduce computational complexity, we propose a U-net model based on the dual quadtree graph network and the discrete diffusion model. Furthermore, a content-aware pooling module is also adopted to lessen the computational demands of the diffusion process. Qualitative and quantitative experiments demonstrate that our QT-Font, compared to existing approaches, can generate high-resolution glyph images with superior quality and more visually pleasing details, while significantly reducing both parameter sizes and computational costs.
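
To illustrate what a sparse quadtree over a glyph image looks like in the simplest possible form, the toy sketch below subdivides only non-uniform quadrants of a binary image; it is a schematic illustration of the data structure only, not QT-Font's actual representation or network.

```python
# Toy sketch: build a sparse quadtree over a binary glyph image by splitting
# only non-uniform quadrants. Illustrative only; not QT-Font's representation.
import numpy as np

def build_quadtree(img, x=0, y=0, size=None):
    """Return a nested dict: leaves store a constant value, inner nodes store children."""
    if size is None:
        size = img.shape[0]          # assume a square, power-of-two image
    block = img[y:y + size, x:x + size]
    if block.min() == block.max() or size == 1:
        return {"leaf": True, "value": int(block[0, 0]), "x": x, "y": y, "size": size}
    half = size // 2
    return {"leaf": False, "children": [
        build_quadtree(img, x,        y,        half),
        build_quadtree(img, x + half, y,        half),
        build_quadtree(img, x,        y + half, half),
        build_quadtree(img, x + half, y + half, half),
    ]}

def count_leaves(node):
    if node["leaf"]:
        return 1
    return sum(count_leaves(c) for c in node["children"])

glyph = np.zeros((64, 64), dtype=np.uint8)
glyph[16:48, 28:36] = 1                   # a crude vertical stroke
tree = build_quadtree(glyph)
print(count_leaves(tree), "leaves vs", 64 * 64, "pixels")
```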

Minkowski Penalties: Robust Differentiable Constraint Enforcement for Vector Graphics

This paper describes an optimization-based framework for finding arrangements of 2D shapes subject to pairwise constraints. Such arrangements naturally arise in tasks such as vector illustration and diagram generation, but enforcing these criteria robustly is surprisingly challenging. We approach this problem through the minimization of novel energetic penalties, derived from the signed distance function of the Minkowski difference between interacting shapes. This formulation provides useful gradients even when initialized from a wildly infeasible state, and, unlike many common collision penalties, can handle open curves that do not have a well-defined inside and outside. Moreover, it supports rich features beyond the basic no-overlap condition, such as tangency, containment, and precise padding, which are especially valuable in the vector illustration context. We develop closed-form expressions and efficient approximations of our penalty for standard vector graphics primitives, yielding efficient evaluation and easy implementation within existing automatic differentiation pipelines. The method has already been “battle-tested” as a component of public-facing open source software; we demonstrate the utility of the framework via examples from illustration, data visualization, diagram generation, and geometry processing.
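
For intuition, the penalty has a particularly simple closed form for two disks, since the Minkowski difference of two disks is again a disk; the sketch below works out that special case with an analytic gradient (an illustration of the idea under this simplification, not the paper's general implementation):

```python
# Sketch: no-overlap penalty for two disks from the SDF of their Minkowski
# difference, evaluated at the origin. Illustrative simplification only.
import numpy as np

def disk_overlap_penalty(ca, ra, cb, rb, padding=0.0):
    """Quadratic penalty that vanishes once the disks are separated by `padding`."""
    d = np.asarray(ca, float) - np.asarray(cb, float)
    dist = np.linalg.norm(d)
    sdf = dist - (ra + rb + padding)      # signed distance of the Minkowski difference at the origin
    if sdf >= 0.0:
        return 0.0, np.zeros(2)           # feasible: no penalty, zero gradient
    # penalty = sdf^2 for sdf < 0; gradient w.r.t. ca (gradient w.r.t. cb is the negative)
    grad = 2.0 * sdf * (d / max(dist, 1e-12))
    return sdf * sdf, grad

# Usage: push two heavily overlapping disks apart by gradient descent on ca.
ca, cb = np.array([0.0, 0.0]), np.array([0.5, 0.0])
for _ in range(200):
    e, g = disk_overlap_penalty(ca, 1.0, cb, 1.0, padding=0.1)
    ca -= 0.05 * g
print("final penalty:", disk_overlap_penalty(ca, 1.0, cb, 1.0, padding=0.1)[0])
```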

Ciallo: GPU-Accelerated Rendering of Vector Brush Strokes

This paper introduces novel GPU-based rendering techniques for digital painting and animation that bridge the gap between raster and vector stroke representations. We propose efficient rendering methods for vanilla, stamp, and airbrush strokes that integrate the expressiveness of raster-based textures with the ease of real-time editing. Based on our stroke representation, we implement an open-source prototype drawing system with a vector fill feature, demonstrating that our techniques can enhance the expressiveness, efficiency, and editability of digital drawing. Our work can serve as a foundation for future research on vector-based and GPU-accelerated rendering techniques in industrial-level brush engines.

SESSION: Material Texture Generation and Painting

MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

This paper aims to generate materials for 3D meshes from text descriptions. Unlike existing methods that synthesize texture maps, we propose to generate segment-wise procedural material graphs as the appearance representation, which supports high-quality rendering and provides substantial flexibility in editing. Instead of relying on extensive paired data, i.e., 3D meshes with material graphs and corresponding text descriptions, to train a material graph generative model, we propose to leverage the pre-trained 2D diffusion model as a bridge to connect the text and material graphs. Specifically, our approach decomposes a shape into a set of segments and designs a segment-controlled diffusion model to synthesize 2D images that are aligned with mesh parts. Based on generated images, we initialize parameters of material graphs and fine-tune them through the differentiable rendering module to produce materials in accordance with the textual description. Extensive experiments demonstrate the superior performance of our framework in photorealism, resolution, and editability over existing methods.

Diffusion Texture Painting

We present a technique that leverages 2D generative diffusion models (DMs) for interactive texture painting on the surface of 3D meshes. Unlike existing texture painting systems, our method allows artists to paint with any complex image texture, and in contrast with traditional texture synthesis, our brush not only generates seamless strokes in real-time, but can inpaint realistic transitions between different textures. To enable this application, we present a stamp-based method that applies an adapted pre-trained DM to inpaint patches in local render space, which is then projected into the texture image, allowing artists control over brush stroke shape and texture orientation. We further present a way to adapt the inference of a pre-trained DM to ensure stable texture brush identity, while allowing the DM to hallucinate infinite variations of the source texture. Our method is the first to use DMs for interactive texture painting, and we hope it will inspire work on applying generative models to highly interactive artist-driven workflows. Code and data for this paper are at github.com/nv-tlabs/DiffusionTexturePainting.

TexPainter: Generative Mesh Texturing with Multi-view Consistency

The recent success of pre-trained diffusion models unlocks the possibility of the automatic generation of textures for arbitrary 3D meshes in the wild. However, these models are trained in the screen space, and converting their outputs into a multi-view consistent texture image poses a major obstacle to output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that the latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use an optimization-based color fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find that our simple approach improves the consistency and overall quality of the generated textures compared to competing state-of-the-art methods. Our implementation is available at: https://github.com/Quantuman134/TexPainter

TexSliders: Diffusion-Based Texture Editing in CLIP Space

Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., “aged wood” to “new wood”) and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.

SESSION: Monte Carlo for PDEs

Velocity-Based Monte Carlo Fluids

We present a velocity-based Monte Carlo fluid solver that overcomes the limitations of its existing vorticity-based counterpart. Because the velocity-based formulation is more commonly used in graphics, our Monte Carlo solver can be readily extended with various techniques from the fluid simulation literature. We derive our method by solving the Navier-Stokes equations via operator splitting and designing a pointwise Monte Carlo estimator for each substep. We reformulate the projection and diffusion steps as integration problems based on the recently introduced walk-on-boundary technique [Sugimoto et al. 2023]. We transform the volume integral arising from the source term of the pressure Poisson equation into a form more amenable to practical numerical evaluation. Our resulting velocity-based formulation allows for the proper simulation of scenes that the prior vorticity-based Monte Carlo method [Rioux-Lavoie et al. 2022] either simulates incorrectly or cannot support. We demonstrate that our method can easily incorporate advancements drawn from conventional non-Monte Carlo methods by showing how one can straightforwardly add buoyancy effects, divergence control capabilities, and numerical dissipation reduction methods, such as advection-reflection and PIC/FLIP methods.

Neural Monte Carlo Fluid Simulation

The idea of using a neural network to represent continuous vector fields (i.e., neural fields) has become popular for solving PDEs arising from physics simulations. Here, the classical spatial discretization (e.g., finite differences) of PDE solvers is replaced with a neural network that models a differentiable function, so the spatial gradients of the PDEs can be readily computed via autodifferentiation. When used in fluid simulation, however, neural fields fail to capture many important phenomena, such as the vortex shedding seen in the von Kármán vortex street experiment. We present a novel neural network representation for fluid simulation that augments neural fields with explicitly enforced boundary conditions as well as a Monte Carlo pressure solver to eliminate all weakly enforced boundary conditions. Our method, the Neural Monte Carlo method (NMC), is completely mesh-free, i.e., it does not depend on any grid-based discretization. While NMC does not achieve the state-of-the-art accuracy of well-established grid-based methods, it significantly outperforms previous mesh-free neural fluid methods on fluid flows involving intricate boundaries and turbulent regimes.

Neural Control Variates with Automatic Integration

This paper presents a method to leverage arbitrary neural network architectures for control variates. Control variates are crucial in reducing the variance of Monte Carlo integration, but they hinge on finding a function that both correlates with the integrand and has a known analytical integral. Traditional approaches rely on heuristics to choose this function, which might not be expressive enough to correlate well with the integrand. Recent research alleviates this issue by modeling the integrand with a learnable parametric model, such as a neural network. However, the challenge remains in creating an expressive parametric model with a known analytical integral. This paper proposes a novel approach to construct learnable parametric control variates from arbitrary neural network architectures. Instead of using a network to approximate the integrand directly, we employ the network to approximate the antiderivative of the integrand. Automatic differentiation then yields a control variate whose integral is known exactly from the antiderivative network. We apply our method to solve partial differential equations using the walk-on-spheres algorithm [Sawhney and Crane 2020]. Our results indicate that this approach remains unbiased across various network architectures and achieves lower variance than other control variate methods.
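
The core construction, differentiating a learned antiderivative so that the resulting control variate has an exact integral by construction, can be illustrated in one dimension. The following PyTorch sketch uses a toy integrand and a plain least-squares fit; it shows the mechanism only, not the paper's walk-on-spheres application or training scheme.

```python
# 1D sketch: network G approximates an antiderivative; g = dG/dx is the control
# variate, and its integral over [a, b] is exactly G(b) - G(a) by construction.
import torch

G = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def f(x):                                   # toy integrand
    return torch.sin(3.0 * x) ** 2

def g(x):                                   # control variate = dG/dx via autodiff
    x = x.requires_grad_(True)
    return torch.autograd.grad(G(x).sum(), x, create_graph=True)[0]

a, b = 0.0, 2.0
opt = torch.optim.Adam(G.parameters(), lr=1e-2)
for _ in range(500):                        # fit g to f so the residual has low variance
    x = a + (b - a) * torch.rand(256, 1)
    loss = ((g(x) - f(x)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    exact_part = (G(torch.tensor([[b]])) - G(torch.tensor([[a]]))).item()
x = a + (b - a) * torch.rand(4096, 1)
residual = ((f(x) - g(x)) * (b - a)).mean().item()   # unbiased MC estimate of the residual integral
print("CV estimate of the integral of f:", exact_part + residual)
```

The estimate stays unbiased regardless of how well g fits f, because the integral of g is exact by construction; the fit only controls the variance of the residual term.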

A Differential Monte Carlo Solver For the Poisson Equation

The Poisson equation is an important partial differential equation (PDE) with numerous applications in physics, engineering, and computer graphics. Conventional solutions to the Poisson equation require discretizing the domain or its boundary, which can be very expensive for domains with detailed geometries. To overcome this challenge, a family of grid-free Monte Carlo solutions has recently been developed. By utilizing walk-on-spheres (WoS) processes, these techniques are capable of efficiently solving the Poisson equation over complex domains.

In this paper, we introduce a general technique that differentiates solutions to the Poisson equation with Dirichlet boundary conditions. Specifically, we devise a new boundary-integral formulation for the derivatives with respect to arbitrary parameters including shapes of the domain. Further, we develop an efficient walk-on-spheres technique based on our new formulation—including a new approach to estimate normal derivatives of the solution field. We demonstrate the effectiveness of our technique over baseline methods using several synthetic examples.
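
For readers unfamiliar with the underlying estimator, the classical (non-differential) walk-on-spheres method for a Laplace problem with Dirichlet data is only a few lines; the sketch below solves such a problem on the unit disk as background. The paper's differential estimators and source-term handling are not shown.

```python
# Classical walk-on-spheres for the Laplace equation with Dirichlet data on the
# unit disk (background sketch; not the paper's differential estimator).
import numpy as np

def dist_to_boundary(p):                # unit disk centered at the origin
    return 1.0 - np.linalg.norm(p)

def boundary_value(p):                  # Dirichlet data g on the circle
    return p[0] * p[0] - p[1] * p[1]    # harmonic, so u(x) = g extended inside

def wos(p, eps=1e-4, max_steps=1000):
    p = np.array(p, float)
    for _ in range(max_steps):
        r = dist_to_boundary(p)
        if r < eps:
            break
        theta = 2.0 * np.pi * np.random.rand()
        p = p + r * np.array([np.cos(theta), np.sin(theta)])   # jump to the largest empty sphere
    return boundary_value(p / np.linalg.norm(p))                # project to the boundary

x = np.array([0.3, 0.2])
estimate = np.mean([wos(x) for _ in range(5000)])
print("WoS estimate:", estimate, " exact:", x[0]**2 - x[1]**2)
```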

SESSION: Generative 3D Geometry

Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning

As humans, we aspire to create media content that is both freely conceived and readily controlled. Thanks to the prominent development of generative techniques, we can now easily use 2D diffusion methods to synthesize images controlled by a raw sketch or designated human poses, and even progressively edit or regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present Coin3D, a novel controllable and interactive 3D asset modeling framework. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previews within a few seconds. To this end, we develop several techniques, including a 3D adapter that applies volumetric coarse shape control to the diffusion model, a proxy-bounded editing strategy for precise part editing, a progressive volume cache to support responsive previews, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments on interactive generation and editing with diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D asset generation task. Code and data are available on the project webpage: https://zju3dv.github.io/coin3d/.

DreamFont3D: Personalized Text-to-3D Artistic Font Generation

Text-to-3D artistic font generation aims to assist users in innovative and customized 3D font design by exploring novel concepts and styles. Despite the advances in text-to-3D tasks for general objects or scenes, the additional challenge of 3D font generation is to preserve the geometric structures of strokes to an appropriate extent, which determines the generation quality in terms of the recognizability and the local effect control of the 3D fonts. This paper presents a novel approach for text-to-3D artistic font generation, named DreamFont3D, which utilizes multi-view font masks and layout conditions to constrain the 3D font structure and local font effects. Specifically, to enhance the recognizability of 3D fonts, we propose a multi-view mask constraint (MC) to optimize the differentiable 3D representation while preserving the font structure. We also present a progressive mask weighting (MW) module to ensure a trade-off between the text-guided stylization of font effects and the mask-guided preservation of font structure. For precise control over local font effects, we design a multi-view attention modulation (AM) that guides visual concepts to appear in specific regions according to the provided layout conditions. Compared with existing text-to-3D methods, DreamFont3D demonstrates clear superiority in the consistency between font effects and text prompts, the recognizability of the generated fonts, and the localization of font effects. Code and data at https://moonlight03.github.io/DreamFont3D/.

ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars

Real-world applications often require a large gallery of 3D assets that share a consistent theme. While remarkable advances have been made in general 3D content creation from text or images, synthesizing customized 3D assets following the shared theme of input 3D exemplars remains an open and challenging problem. In this work, we present ThemeStation, a novel approach for theme-aware 3D-to-3D generation. ThemeStation synthesizes customized 3D assets based on a few given exemplars with two goals: 1) unity, for generating 3D assets that thematically align with the given exemplars, and 2) diversity, for generating 3D assets with a high degree of variation. To this end, we design a two-stage framework that first draws a concept image, followed by a reference-informed 3D modeling stage. We propose a novel dual score distillation (DSD) loss to jointly leverage priors from both the input exemplars and the synthesized concept image. Extensive experiments and a user study confirm that ThemeStation surpasses prior works in producing diverse theme-aware 3D models with impressive quality. ThemeStation also enables various applications such as controllable 3D-to-3D generation.

Spice·E: Structural Priors in 3D Diffusion using Cross-Entity Attention

We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-to-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice · E – a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows for multiple entities—in particular, paired input and guidance 3D shapes—to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly-expressive shapes. Extensive experiments demonstrate that Spice · E achieves SOTA performance over these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach for any specific task. We will release our code and trained models.

SESSION: 3D Face Generator and Animation

HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation

We present HeadArtist for 3D head generation following human-language descriptions. With a landmark-guided ControlNet serving as a generative prior, we develop an efficient pipeline that optimizes a parameterized 3D head model under the supervision of the prior distillation itself. We call this process self score distillation (SSD). In detail, given a sampled camera pose, we first render an image and its corresponding landmarks from the head model, and add a particular level of noise to the image. The noisy image, landmarks, and text condition are then fed into a frozen ControlNet twice for noise prediction. We conduct two predictions via the same ControlNet structure but with different classifier-free guidance (CFG) weights. The difference between these two predictions directs how the rendered image can better match the text of interest. Experimental results show that our approach produces high-quality 3D head sculptures with rich geometry and photo-realistic appearance, significantly outperforming state-of-the-art methods. We also show that our pipeline supports editing operations on the generated heads, including both geometry deformation and appearance change. Project page: https://kumapowerliu.github.io/HeadArtist.
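
Schematically, the self score distillation step amounts to taking the difference of two classifier-free-guidance noise predictions and using it as a gradient on the rendered image. In the sketch below, `controlnet_predict_noise`, the guidance weights, and the timestep range are illustrative placeholders rather than the authors' actual pipeline.

```python
# Schematic self score distillation (SSD) step. `controlnet_predict_noise`,
# the CFG weights, and the schedule below are illustrative placeholders.
import torch

def ssd_step(render, landmarks, text_emb, alphas_cumprod, controlnet_predict_noise,
             w_high=7.5, w_low=1.0):
    t = torch.randint(50, 950, (1,))                       # sampled diffusion timestep
    a = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(render)
    noisy = a.sqrt() * render + (1.0 - a).sqrt() * noise   # forward-noise the rendering

    with torch.no_grad():
        e_uncond = controlnet_predict_noise(noisy, t, landmarks, None)
        e_cond = controlnet_predict_noise(noisy, t, landmarks, text_emb)
        e_high = e_uncond + w_high * (e_cond - e_uncond)   # strong CFG prediction
        e_low = e_uncond + w_low * (e_cond - e_uncond)     # weak CFG prediction
        grad = e_high - e_low                              # SSD direction on the image

    # Scalar whose gradient w.r.t. `render` equals `grad` (the usual distillation trick).
    return (grad.detach() * render).sum()
```

Calling `.backward()` on the returned scalar deposits the SSD direction onto the rendered image and, through the differentiable renderer, onto the head model's parameters.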

Toonify3D: StyleGAN-based 3D Stylized Face Generator

Recent advances in generative models enable high-quality facial image stylization. Toonify is a popular StyleGAN-based framework that has been widely used for facial image stylization. Our goal is to create expressive 3D faces by turning Toonify into a 3D stylized face generator. Toonify is fine-tuned with a few gradient descent steps from a StyleGAN trained on standard faces, so its features carry semantic and visual information aligned with the features of the original StyleGAN model. Based on this observation, we design a versatile 3D-lifting method for StyleGAN, StyleNormal, that regresses a surface normal map of a StyleGAN-generated face using StyleGAN features. Due to the feature alignment between Toonify and StyleGAN, StyleNormal, although trained on regular faces, can be applied to various stylized faces without additional fine-tuning. To learn the local geometry of faces under various illuminations, we introduce a novel regularization term, the normal consistency loss, based on lighting manipulation in the GAN latent space. Finally, we present Toonify3D, a fully automated framework based on StyleNormal, that can generate full-head 3D stylized avatars and support GAN-based 3D facial expression editing.

Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge in three steps. We first introduce the Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels. Finally, we propose Media2Face, a diffusion model in the GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and images. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.

SESSION: Differentiable Rendering

Path-Space Differentiable Rendering of Implicit Surfaces

Physics-based differentiable rendering is a key ingredient for integrating forward rendering into probabilistic inference and machine learning pipelines. As a state-of-the-art formulation for differentiable rendering, differential path integrals have enabled the development of efficient Monte Carlo estimators for both interior and boundary integrals. Unfortunately, this formulation has been designed mostly for explicit geometries like polygonal meshes.

In this paper, we generalize the theory of differential path integrals to support implicit geometries like level sets and signed-distance functions (SDFs). In addition, we introduce new Monte Carlo estimators for efficiently sampling discontinuity boundaries that are also implicitly specified. We demonstrate the effectiveness of our theory and algorithms using several differentiable-rendering and inverse-rendering examples.

Woven Fabric Capture with a Reflection-Transmission Photo Pair

Digitizing woven fabrics would be valuable for many applications, from digital humans to interior design. Previous work introduces a lightweight woven fabric acquisition approach by capturing a single reflection image and estimating the fabric parameters with a differentiable geometric and shading model. The renderings of the estimated fabric parameters can closely match the photo; however, the captured reflection image is insufficient to fully characterize the fabric sample reflectance. For instance, fabrics with different thicknesses might have similar reflection images but lead to significantly different transmission. We propose to recover the woven fabric parameters from two captured images: reflection and transmission. At the core of our method is a differentiable bidirectional scattering distribution function (BSDF) model, handling reflection and transmission, including single and multiple scattering. We propose a two-layer model, where the single scattering uses an SGGX phase function as in previous work, and multiple scattering uses a new azimuthally-invariant microflake definition, which we term ASGGX. This new fabric BSDF model closely matches real woven fabrics in both reflection and transmission. We use a simple setup for capturing reflection and transmission photos with a cell phone camera and two point lights, and estimate the fabric parameters via a lightweight network, together with a differentiable optimization. We also model the out-of-focus effects explicitly with a simple solution to match the thin-lens camera better. As a result, the renderings of the estimated parameters can agree with the input images on both reflection and transmission for the first time. The code for this paper is at https://github.com/lxtyin/FabricBTDF-Recovery.

SESSION: Geometry: Reconstruction

MVD^2: Efficient Multiview 3D Reconstruction for Multiview Diffusion

Multiview diffusion (MVD) has emerged as a prominent 3D generation technique, acclaimed for its generalizability, quality, and efficiency. MVD models finetune image diffusion models with 3D data to generate multiple views of a 3D object from an image or text prompt, followed by a multiview 3D reconstruction process. However, the sparsity of views and inconsistent details in the generated multiview images pose challenges for 3D reconstruction. We present MVD2, an efficient 3D reconstruction method tailored for MVD images. MVD2 integrates multiview image features into a 3D feature volume, then transforms this volume into a textureless 3D mesh, onto which the MVD images are mapped as textures. It employs a simple-yet-efficient view-dependent training scheme to mitigate discrepancies between MVD images and ground-truth views of 3D shapes, effectively improving 3D generation quality and robustness. MVD2 is trained with 3D collections and MVD images, and the trained MVD2 efficiently reconstructs 3D meshes from multiview images within one second and exhibits great model generalizability in dealing with images generated by various MVD methods. Our code and the pretrained model are available at: https://zhengxinyang.github.io/projects/MVD_square.html.

High-quality Surface Reconstruction using Gaussian Surfels

We propose a novel point-based representation, Gaussian surfels, to combine the advantages of the flexible optimization procedure of 3D Gaussian points and the surface alignment property of surfels. This is achieved by directly setting the z-scale of 3D Gaussian points to 0, effectively flattening the original 3D ellipsoid into a 2D ellipse. Such a design provides clear guidance to the optimizer. By treating the local z-axis as the normal direction, it greatly improves optimization stability and surface alignment. Since the derivatives with respect to the local z-axis computed from the covariance matrix are zero in this setting, we design a self-supervised normal-depth consistency loss to remedy this issue. Monocular normal priors and foreground masks are incorporated to enhance the quality of the reconstruction, mitigating issues related to highlights and background. We propose a volumetric cutting method to aggregate the information of Gaussian surfels so as to remove erroneous points in depth maps generated by alpha blending. Finally, we apply the screened Poisson reconstruction method to the fused depth maps to extract the surface mesh. Experimental results show that our method demonstrates superior performance in surface reconstruction compared to state-of-the-art neural volume rendering and point-based rendering methods.
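
The flattening step has a direct algebraic reading: each surfel's covariance is built from only two non-zero scales, and its normal is the rotated local z-axis. The minimal numpy sketch below shows just that construction; the optimization, losses, and volumetric cutting are not shown.

```python
# Minimal sketch: a Gaussian surfel as a 3D Gaussian whose local z-scale is zero,
# so its covariance is a flat ellipse and the local z-axis serves as the normal.
import numpy as np

def surfel_covariance_and_normal(quat, sx, sy):
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([                                 # rotation matrix from a unit quaternion
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag([sx, sy, 0.0])                     # z-scale forced to zero -> flat ellipse
    cov = R @ S @ S @ R.T                          # rank-2 covariance of the surfel
    normal = R[:, 2]                               # local z-axis = surfel normal
    return cov, normal

cov, n = surfel_covariance_and_normal(np.array([0.9, 0.1, 0.3, 0.2]), 0.05, 0.02)
print("rank:", np.linalg.matrix_rank(cov), " normal:", n)
```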

Reach for the Arcs: Reconstructing Surfaces from SDFs via Tangent Points

We introduce an algorithm to reconstruct a mesh from discrete samples of a shape’s Signed Distance Function (SDF). A simple geometric reinterpretation of the SDF lets us formulate the problem through a point cloud, from which a surface can be extracted with existing techniques. We extract all possible information from the SDF data, outperforming commonly used algorithms and imposing no topological or geometric restrictions.

Part123: Part-aware 3D Reconstruction from a Single-view Image

Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure of the reconstructed shape, which is crucial for many downstream applications. Moreover, the generated meshes usually suffer from severe noise, unsmooth surfaces, and blurry textures, making it challenging to obtain satisfactory part segments using 3D segmentation techniques. In this paper, we present Part123, a novel framework for part-aware 3D reconstruction from a single-view image. We first use diffusion models to generate multiview-consistent images from a given image, and then leverage the Segment Anything Model (SAM), which demonstrates powerful generalization ability on arbitrary objects, to generate multiview segmentation masks. To effectively incorporate 2D part-based information into 3D reconstruction and handle inconsistency, we introduce contrastive learning into a neural rendering framework to learn a part-aware feature space based on the multiview segmentation masks. A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models. Experiments show that our method can generate 3D models with high-quality segmented parts on various objects. Compared to existing unstructured reconstruction methods, the part-aware 3D models from our method benefit several important applications, including feature-preserving reconstruction, primitive fitting, and 3D shape editing.

SESSION: Consistent Text-to-Image

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain, test-time fine-tuning-free personalized image generation has proceeded rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or two-subject images in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks, and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support two-subject generation. Extensive qualitative and quantitative results demonstrate that our method has clear advantages over other frameworks in single-subject, multi-subject, and human-customized image generation.

The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

We present a method for generating Streetscapes—long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data—posed imagery from Google Street View, along with contextual map data—which allows users to generate city views conditioned on any desired city layout, with controllable camera pose.

Blue noise for diffusion models

Most existing diffusion models use Gaussian noise for training and sampling across all time steps, which may not optimally account for the frequency contents reconstructed by the denoising network. Despite the diverse applications of correlated noise in computer graphics, its potential for improving the training process has been underexplored. In this paper, we introduce a novel and general class of diffusion models that takes correlated noise within and across images into account. More specifically, we propose a time-varying noise model to incorporate correlated noise into the training process, as well as a method for fast generation of correlated noise masks. Our model is built upon deterministic diffusion models and utilizes blue noise to help improve the generation quality compared to using Gaussian white (random) noise only. Further, our framework allows introducing correlation across images within a single mini-batch to improve gradient flow. We perform both qualitative and quantitative evaluations on a variety of datasets using our method, achieving improvements on different tasks over existing deterministic diffusion models in terms of the FID metric. Code will be available at https://github.com/xchhuang/bndm.
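
As a generic illustration of correlated noise masks with standard-normal marginals and a time-varying blend, one can high-pass filter white noise in the Fourier domain, renormalize it, and mix it with fresh white noise using weights whose squares sum to one. The sketch below is only that generic construction with an invented schedule, not the paper's noise generator or training scheme.

```python
# Generic sketch: a "blue-noise-like" correlated Gaussian mask built by high-pass
# filtering white noise, blended with white noise by a time-varying weight gamma_t.
# Illustrative only; not the paper's correlated-noise model.
import numpy as np

def blue_noise_mask(h, w, cutoff=0.1):
    white = np.random.standard_normal((h, w))
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = np.sqrt(fx**2 + fy**2) > cutoff          # suppress low frequencies
    filtered = np.fft.ifft2(np.fft.fft2(white) * keep).real
    return filtered / filtered.std()                # renormalize to unit variance

def correlated_noise(t, T, shape):
    gamma = 1.0 - t / T                             # invented schedule: more correlation early on
    blue = blue_noise_mask(*shape)
    white = np.random.standard_normal(shape)
    return np.sqrt(gamma) * blue + np.sqrt(1.0 - gamma) * white   # keeps unit marginal variance

n = correlated_noise(t=100, T=1000, shape=(64, 64))
print(round(n.mean(), 3), round(n.std(), 3))
```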

SESSION: Geometry: Mappings and Fields

Neural Geometry Fields For Meshes

Recent work on using neural fields to represent surfaces has resulted in significant improvements in representational capability and computational efficiency. However, to our knowledge, most existing work has focused on implicit representations such as signed distance fields or volumes, and little work has explored their application to discrete surface geometry, i.e., 3D meshes, limiting the applicability of neural surface representations.

We present Neural Geometry Fields, a neural representation for discrete surface geometry represented by triangle meshes. Our idea is to represent the target surface using a coarse set of quadrangular patches, and add surface details using coordinate neural networks by displacing the patches. We then extract a traditional triangular mesh from a neural geometry field instance by sampling the displacement. We show that our representation excels in mesh compression, where it significantly reduces the memory footprint of meshes without compromising on surface details.
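
The patch-plus-displacement structure can be sketched as a bilinear patch evaluation followed by a learned offset. The toy PyTorch module below shows that structure; the feature layout, network size, and training procedure are placeholders, not the paper's architecture.

```python
# Toy sketch: a coarse quad patch evaluated bilinearly, displaced by a small
# coordinate network. Architecture details are illustrative placeholders.
import torch

class NeuralPatchSurface(torch.nn.Module):
    def __init__(self, num_patches, feat_dim=16):
        super().__init__()
        self.corners = torch.nn.Parameter(torch.randn(num_patches, 4, 3) * 0.1)
        self.features = torch.nn.Parameter(torch.randn(num_patches, feat_dim) * 0.1)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(feat_dim + 2, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 3))                     # per-sample 3D displacement

    def forward(self, patch_id, uv):
        c = self.corners[patch_id]                      # (N, 4, 3) patch corners
        u, v = uv[:, :1], uv[:, 1:]
        base = ((1 - u) * (1 - v) * c[:, 0] + u * (1 - v) * c[:, 1]
                + (1 - u) * v * c[:, 2] + u * v * c[:, 3])   # bilinear patch point
        disp = self.mlp(torch.cat([self.features[patch_id], uv], dim=-1))
        return base + disp                              # displaced surface sample

surf = NeuralPatchSurface(num_patches=32)
pid = torch.randint(0, 32, (128,))
uv = torch.rand(128, 2)
print(surf(pid, uv).shape)                              # -> torch.Size([128, 3])
```

A mesh could then be extracted by sampling a regular (u, v) grid per patch, as the abstract describes.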

SESSION: Fast Radiance Fields

RTG-SLAM: Real-time 3D Reconstruction at Scale using Gaussian Splatting

We present Real-time Gaussian SLAM (RTG-SLAM), a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting. The system features a compact Gaussian representation and a highly efficient on-the-fly Gaussian optimization scheme. We force each Gaussian to be either opaque or nearly transparent, with the opaque ones fitting the surface and dominant colors, and the transparent ones fitting residual colors. By rendering depth differently from color, we let a single opaque Gaussian fit a local surface region well without the need for multiple overlapping Gaussians, hence largely reducing the memory and computation cost. For on-the-fly Gaussian optimization, we explicitly add Gaussians for three types of pixels per frame: newly observed, with large color errors, and with large depth errors. We also categorize all Gaussians as either stable or unstable, where stable Gaussians are expected to fit previously observed RGBD images well, and the rest are marked unstable. We only optimize the unstable Gaussians and only render the pixels occupied by unstable Gaussians. In this way, both the number of Gaussians to be optimized and the number of pixels to be rendered are largely reduced, and the optimization can be done in real time. We show real-time reconstructions of a variety of large scenes. Compared with state-of-the-art NeRF-based RGBD SLAM, our system achieves comparably high-quality reconstruction but with around twice the speed and half the memory cost, and shows superior performance in the realism of novel view synthesis and camera tracking accuracy.

BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes

While Neural Radiance Fields (NeRFs) have demonstrated exceptional quality, their protracted training duration remains a limitation. Generalizable and MVS-based NeRFs, although capable of mitigating training time, often incur tradeoffs in quality. This paper presents a novel approach called BoostMVSNeRFs to enhance the rendering quality of MVS-based NeRFs in large-scale scenes. We first identify limitations in MVS-based NeRF methods, such as restricted viewport coverage and artifacts due to limited input views. Then, we address these limitations by proposing a new method that selects and combines multiple cost volumes during volume rendering. Our method does not require training and can adapt to any MVS-based NeRF method in a feed-forward fashion to improve rendering quality. Furthermore, our approach is also end-to-end trainable, allowing fine-tuning on specific scenes. We demonstrate the effectiveness of our method through experiments on large-scale datasets, showing significant rendering quality improvements in large-scale scenes and unbounded outdoor scenarios. We release the source code of BoostMVSNeRFs at https://su-terry.github.io/BoostMVSNeRFs.

2D Gaussian Splatting for Geometrically Accurate Radiance Fields

3D Gaussian Splatting (3DGS) has recently revolutionized radiance field reconstruction, achieving high quality novel view synthesis and fast rendering speed. However, 3DGS fails to accurately represent surfaces due to the multi-view inconsistent nature of 3D Gaussians. We present 2D Gaussian Splatting (2DGS), a novel approach to model and reconstruct geometrically accurate radiance fields from multi-view images. Our key idea is to collapse the 3D volume into a set of 2D oriented planar Gaussian disks. Unlike 3D Gaussians, 2D Gaussians provide view-consistent geometry while modeling surfaces intrinsically. To accurately recover thin surfaces and achieve stable optimization, we introduce a perspective-accurate 2D splatting process utilizing ray-splat intersection and rasterization. Additionally, we incorporate depth distortion and normal consistency terms to further enhance the quality of the reconstructions. We demonstrate that our differentiable renderer allows for noise-free and detailed geometry reconstruction while maintaining competitive appearance quality, fast training speed, and real-time rendering. Project page at https://surfsplatting.github.io.
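
The ray-splat intersection at the heart of the rasterizer reduces, per splat, to hitting the splat's plane and evaluating the Gaussian in the splat's local tangent frame. The sketch below isolates that geometric step (assuming orthogonal tangent axes); sorting, rasterization, and alpha blending are not shown.

```python
# Geometric sketch of ray-splat intersection for a 2D Gaussian disk:
# hit the splat plane, express the hit point in the splat's scaled tangent
# frame, and evaluate the Gaussian there. Rasterization/blending omitted.
import numpy as np

def ray_splat_weight(o, d, center, t_u, t_v):
    """o, d: ray origin/direction; t_u, t_v: scaled, orthogonal tangent axes of the splat."""
    n = np.cross(t_u, t_v)
    denom = np.dot(d, n)
    if abs(denom) < 1e-8:
        return 0.0                                  # ray parallel to the splat plane
    t = np.dot(center - o, n) / denom
    if t < 0:
        return 0.0                                  # plane behind the ray origin
    x = o + t * d - center                          # hit point relative to the splat center
    u = np.dot(x, t_u) / np.dot(t_u, t_u)           # local coordinates in units of the axes
    v = np.dot(x, t_v) / np.dot(t_v, t_v)
    return float(np.exp(-0.5 * (u * u + v * v)))    # Gaussian weight at the intersection

w = ray_splat_weight(o=np.array([0.0, 0.0, -2.0]), d=np.array([0.0, 0.0, 1.0]),
                     center=np.zeros(3), t_u=np.array([0.3, 0, 0]), t_v=np.array([0, 0.2, 0]))
print(w)   # 1.0: the ray pierces the splat center
```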

SESSION: VR, Eye Tracking, Perception

Saccade-Contingent Rendering

Battery-constrained power consumption, compute limitations, and high frame rate requirements in head-mounted displays pose unique challenges in the drive to present increasingly immersive and comfortable imagery in virtual reality. However, humans are not equally sensitive to all regions of the visual field, and perceptually-optimized rendering techniques are increasingly utilized to address these bottlenecks. Many of these techniques are gaze-contingent and often render reduced detail away from a user’s fixation. Such techniques are dependent on spatio-temporally-accurate gaze tracking and can result in obvious visual artifacts when eye tracking is inaccurate. In this work we present a gaze-contingent rendering technique which only requires saccade detection, bypassing the need for highly-accurate eye tracking. In our first experiment, we show that visual acuity is reduced for several hundred milliseconds after a saccade. In our second experiment, we use these results to reduce the rendered image resolution after saccades in a controlled psychophysical setup, and find that observers cannot discriminate between saccade-contingent, reduced-resolution rendering and full-resolution rendering under certain conditions identified in the first experiment. Finally, in our third experiment, we introduce a 90-pixels-per-degree headset and validate our saccade-contingent rendering method under typical VR viewing conditions.
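
Because the technique only needs to know that a saccade occurred, a simple velocity-threshold detector suffices for illustration. The sketch below flags samples whose angular speed exceeds a threshold; the sampling rate and threshold are illustrative values, not those used in the paper.

```python
# Illustrative velocity-threshold saccade detector over a stream of gaze angles.
# The 120 Hz rate and the 100 deg/s threshold are example values only.
import numpy as np

def detect_saccades(gaze_deg, rate_hz=120.0, threshold_deg_per_s=100.0):
    """gaze_deg: (N, 2) horizontal/vertical gaze angles in degrees.
    Returns a boolean array marking samples that belong to a saccade."""
    velocity = np.linalg.norm(np.diff(gaze_deg, axis=0), axis=1) * rate_hz
    return np.concatenate([[False], velocity > threshold_deg_per_s])

# Synthetic trace: fixation, a quick 10-degree jump, then fixation again.
gaze = np.zeros((60, 2))
gaze[30:33, 0] = [3.0, 7.0, 10.0]
gaze[33:, 0] = 10.0
flags = detect_saccades(gaze)
print("saccade samples:", np.flatnonzero(flags))
```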

Perceptual Evaluation of Steered Retinal Projection

Steered retinal projection (SRP) is an emerging display technology that combines retinal projection and pupil steering to achieve exceptional light efficiency and a consistent viewing experience. Retinal projection enables most photons from a display projector to reach the retina, and pupil steering dynamically aligns the narrow viewing window of the retinal projection with the eye. While SRP holds considerable promise, its development has been stagnant due to a lack of understanding of how human vision reacts to the dynamic steering movement of the viewing window. To delve into these areas, this study introduces the first SRP system testbed specifically designed for perceptual studies on the viewing experience of pupil steering. The testbed replicates the SRP viewing experience and offers flexibility in adjusting several parameters, including steering resolution, accuracy, and latency. We conducted two perceptual studies utilizing the testbed. The first study investigates the impact of saccadic suppression, a phenomenon that reduces visual sensitivity during rapid eye movements, on the SRP viewing experience. The second study explores the trade space between eye-tracking and pupil-steering performance, providing insights into the optimal balance between these factors. Additionally, we introduce a numerical model to predict the detection probability of SRP artifacts considering the temporal characteristics of global luminance and the human visual system. This model enables a more comprehensive interpretation of the user studies and provides preliminary hardware requirements for SRP systems. The findings from this study offer invaluable research directions that may help determine component-level development milestones for SRP, paving the way for the practical implementation of this promising technology.

SESSION: Simulating Nature

Real-time Wing Deformation Simulations for Flying Insects

Realistic simulation of the intricate wing deformations seen in flying insects not only deepens our comprehension of insect flight mechanics but also opens up numerous applications in fields such as computer animation and virtual reality. Despite its importance, this research area has been relatively underexplored due to the complex and diverse wing structures and the intricate patterns of deformation. This paper presents an efficient skeleton-driven model specifically designed to simulate realistic wing deformations in real time across a wide range of flying insects. Our approach begins with the construction of a virtual skeleton that accurately reflects the distinct morphological characteristics of individual insect species. This skeleton serves as the foundation for simulating the intricate deformation wave propagation often observed in wing deformations. To faithfully reproduce the bending effect seen in these deformations, we introduce both internal and external forces that act on the wing joints, drawing on periodic wing-beat motion and a simplified aerodynamic model. Additionally, we utilize mass-spring algorithms to simulate the inherent elasticity of the wings, helping to prevent excessive twisting. Through various simulation experiments, comparisons, and user studies, we demonstrate the effectiveness, robustness, and adaptability of our model.

Modelling a Feather as a Strongly Anisotropic Elastic Shell

Feathers exhibit a highly anisotropic behaviour, governed by their complex hierarchical microstructure composed of individual hairs (barbs) clamped onto a spine (rachis) and attached to each other through tiny hooks (barbules). Previous methods in computer graphics have approximated feathers as strips of cloth, thus failing to capture the particular macroscopic nonlinear behaviour of the feather surface (vane). To investigate the anisotropic properties of a feather vane, we design precise measurement protocols on real feather samples. Our experimental results suggest a linear strain-stress relationship of the feather membrane with orientation-dependent coefficients, as well as an extreme ratio of stiffnesses in the barb and barbule direction, on the order of 10^4. From these findings we build a simple continuum model for the feather vane, where the vane is represented as a three-parameter anisotropic elastic shell. However, implementing the model numerically reveals severe locking and ill-conditioning issues, due to the extreme stiffness ratio between the barb and the barbule directions. To resolve these issues, we align the mesh along the barb directions and replace the stiffest modes with an inextensibility constraint. We extensively validate our membrane model against real-world laboratory measurements, by using an intermediary microscale model that allows us to limit the number of required lab experiments. Finally, we enrich our membrane model with anisotropic bending, and show its practicality in graphics-like scenarios like a full feather and a larger-scale bird. Code and data for this paper are available at https://gitlab.inria.fr/elan-public-code/feather-shell/.

SESSION: Clothing Geometry

LayGA: Layered Gaussian Avatars for Animatable Clothing Transfer

Animatable clothing transfer, aiming at dressing and animating garments across characters, is a challenging problem. Most human avatar works entangle the representations of the human body and clothing together, which leads to difficulties for virtual try-on across identities. What’s worse, the entangled representations usually fail to exactly track the sliding motion of garments. To overcome these limitations, we present Layered Gaussian Avatars (LayGA), a new representation that formulates body and clothing as two separate layers for photorealistic animatable clothing transfer from multi-view videos. Our representation is built upon the Gaussian map-based avatar for its excellent representation power of garment details. However, the Gaussian map produces unstructured 3D Gaussians distributed around the actual surface. The absence of a smooth explicit surface raises challenges in accurate garment tracking and collision handling between body and garments. Therefore, we propose two-stage training involving single-layer reconstruction and multi-layer fitting. In the single-layer reconstruction stage, we propose a series of geometric constraints to reconstruct smooth surfaces and simultaneously obtain the segmentation between body and clothing. Next, in the multi-layer fitting stage, we train two separate models to represent body and clothing and utilize the reconstructed clothing geometries as 3D supervision for more accurate garment tracking. Furthermore, we propose geometry and rendering layers for both high-quality geometric reconstruction and high-fidelity rendering. Overall, the proposed LayGA realizes photorealistic animations and virtual try-on, and outperforms other baseline methods. Our project page is https://jsnln.github.io/layga/index.html.

Singular Foliations for Knit Graph Design

We build upon the stripes-based knit planning framework of [Mitra et al. 2023], and view the resultant stripe pattern through the lens of singular foliations. This perspective views the stripes, and thus the candidate course rows or wale columns, as integral curves of a vector field specified by the spinning form of [Knöppel et al. 2015]. We show how to tightly control the topological structure of this vector field with linear level set constraints, preventing helicing of any integral curve. Practically speaking, this obviates the stripe placement constraints of [Mitra et al. 2023] and allows for shifting and variation of the stripe frequency without introducing additional helices. En route, we make the first explicit algebraic characterization of spinning form level set structure within singular triangles, and replace the standard interpolant with an “effective” one that improves the robustness of knit graph generation. We also extend the model of [Mitra et al. 2023] to surfaces with genus, via a Morse-based cylindrical decomposition, and implement automatic singularity pairing on the resulting components.

SESSION: NeRFs and Lighting

NeRF as a Non-Distant Environment Emitter in Physics-based Inverse Rendering

Physics-based inverse rendering enables joint optimization of shape, material, and lighting based on captured 2D images. To ensure accurate reconstruction, using a light model that closely resembles the captured environment is essential. Although the widely adopted distant environmental lighting model is adequate in many cases, we demonstrate that its inability to capture spatially varying illumination can lead to inaccurate reconstructions in many real-world inverse rendering scenarios. To address this limitation, we incorporate NeRF as a non-distant environment emitter into the inverse rendering pipeline. Additionally, we introduce an emitter importance sampling technique for NeRF to reduce the rendering variance. Through comparisons on both real and synthetic datasets, our results demonstrate that our NeRF-based emitter offers a more precise representation of scene lighting, thereby improving the accuracy of inverse rendering.

3D Gaussian Splatting with Deferred Reflection

Neural and Gaussian-based radiance field methods have achieved great success in the field of novel view synthesis. However, specular reflection remains non-trivial, as the high-frequency radiance field is notoriously difficult to fit stably and accurately. We present a deferred shading method to effectively render specular reflection with Gaussian splatting. The key challenge comes from the environment map reflection model, which requires accurate surface normals while simultaneously bottlenecking normal estimation with discontinuous gradients. We leverage the per-pixel reflection gradients generated by deferred shading to bridge the optimization process of neighboring Gaussians, allowing nearly correct normal estimations to gradually propagate and eventually spread over all reflective objects. Our method significantly outperforms state-of-the-art techniques and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent improvement in peak signal-to-noise ratio (PSNR) for both synthetic and real-world scenes, while running at a frame rate almost identical to vanilla Gaussian splatting.
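
Per pixel, the deferred shading pass described above comes down to reflecting the view direction about the rendered normal and looking up an environment map. The sketch below shows that per-pixel step with an equirectangular lookup for illustration; the Gaussian rendering and the normal-propagation optimization are not shown.

```python
# Per-pixel deferred reflection sketch: reflect the view direction about the
# G-buffer normal and sample an equirectangular environment map. Illustrative only.
import numpy as np

def reflect(d, n):
    return d - 2.0 * np.dot(d, n) * n               # mirror reflection of direction d about normal n

def sample_envmap(env, direction):
    """env: (H, W, 3) equirectangular map; direction: unit vector."""
    x, y, z = direction
    u = np.arctan2(x, -z) / (2.0 * np.pi) + 0.5     # azimuth -> [0, 1]
    v = np.arccos(np.clip(y, -1.0, 1.0)) / np.pi    # polar angle -> [0, 1]
    h, w, _ = env.shape
    return env[int(v * (h - 1)), int(u * (w - 1))]

env = np.random.rand(64, 128, 3)                    # stand-in environment map
view_dir = np.array([0.0, 0.0, -1.0])               # camera ray direction for this pixel
normal = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)   # G-buffer normal for this pixel
r = reflect(view_dir, normal)
print("reflected dir:", r, " env sample:", sample_envmap(env, r))
```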

Lite2Relight: 3D-aware Single Image Portrait Relighting

Achieving photorealistic 3D view synthesis and relighting of human portraits is pivotal for advancing AR/VR applications. Existing methodologies in portrait relighting demonstrate substantial limitations in terms of generalization and 3D consistency, coupled with inaccuracies in physically realistic lighting and identity preservation. Furthermore, personalization from a single view is difficult to achieve and often requires multiview images during the testing phase or involves slow optimization processes. This paper introduces Lite2Relight, a novel technique that can predict 3D consistent head poses of portraits while performing physically plausible light editing at interactive speed. Our method uniquely extends the generative capabilities and efficient volumetric representation of EG3D, leveraging a lightstage dataset to implicitly disentangle face reflectance and perform relighting under target HDRI environment maps. By utilizing a pre-trained geometry-aware encoder and a feature alignment module, we map input images into a relightable 3D space, enhancing them with a strong face geometry and reflectance prior. Through extensive quantitative and qualitative evaluations, we show that our method outperforms the state-of-the-art methods in terms of efficacy, photorealism, and practical application. This includes producing 3D-consistent results of the full head, including hair, eyes, and expressions. Lite2Relight paves the way for large-scale adoption of photorealistic portrait editing in various domains, offering a robust, interactive solution to a previously constrained problem.

EyeIR: Single Eye Image Inverse Rendering In the Wild

We propose a method to decompose a single eye region image in the wild into albedo, shading, specular, normal and illumination. This inverse rendering problem is particularly challenging due to inherent ambiguities and complex properties of the natural eye region. To address this problem, first we construct a synthetic eye region dataset with rich diversity. Then we propose a synthetic to real adaptation framework to leverage the supervision signals from synthetic data to guide the direction of self-supervised learning. We design region-aware self-supervised losses based on image formation and eye region intrinsic properties, which can refine each predicted component by mutual learning and reduce the artifacts caused by ambiguities of natural eye images. Particularly, we address the demanding problem of specularity removal in the eye region. We show high-quality inverse rendering results of our method and demonstrate its use for a number of applications.

SESSION: Fluids

Lagrangian Covector Fluid with Free Surface

This paper introduces a novel Lagrangian fluid solver based on covector flow maps. We aim to address the challenges of establishing a robust flow-map solver for incompressible fluids under complex boundary conditions. Our key idea is to use particle trajectories to establish precise flow maps and tailor path integrals of physical quantities along these trajectories to reformulate the Poisson problem during the projection step. We devise a decoupling mechanism based on path-integral identities from flow-map theory. This mechanism integrates long-range flow maps for the main fluid body into a short-range projection framework, ensuring a robust treatment of free boundaries. We show that our method can effectively transform a long-range projection problem with integral boundaries into a Poisson problem with standard boundary conditions — specifically, zero Dirichlet on the free surface and zero Neumann on solid boundaries. This transformation significantly enhances robustness and accuracy, extending the applicability of flow-map methods to complex free-surface problems.

Fluid Control with Laplacian Eigenfunctions

Physics-based fluid control has long been a challenging problem in balancing efficiency and accuracy. We introduce a novel physics-based fluid control pipeline using Laplacian Eigenfluids. Utilizing the adjoint method with our provided analytical gradient expressions, the derivative computation of the control problem is efficient and easy to formulate. We demonstrate that our method is fast enough to support real-time fluid simulation, editing, control, and optimal animation generation. Our pipeline naturally supports multi-resolution and frequency control of fluid simulations. The effectiveness and efficiency of our fluid control pipeline are validated through a variety of 2D examples and comparisons.
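
As a point of reference for the reduced representation such a pipeline operates on, the sketch below shows the classic divergence-free Laplacian eigenfunction velocity basis on a square domain, and a velocity field reconstructed as a coefficient-weighted sum of modes. It is only a minimal illustration of the Laplacian Eigenfluids representation; the paper's adjoint gradients and control pipeline are not reproduced, and the function names are ours.

```python
import numpy as np

def eigenfluid_basis(k1, k2, x, y):
    """Divergence-free Laplacian eigenfunction velocity basis on the square
    domain [0, pi] x [0, pi]; k1, k2 are positive integer wave numbers."""
    scale = 1.0 / (k1 * k1 + k2 * k2)                     # eigenvalue-based normalization
    u = scale * k2 * np.sin(k1 * x) * np.cos(k2 * y)      # x-velocity component
    v = -scale * k1 * np.cos(k1 * x) * np.sin(k2 * y)     # y-velocity component
    return u, v

def reconstruct_velocity(coeffs, modes, x, y):
    """Velocity field as a linear combination of basis modes: u(x) = sum_i w_i * Phi_i(x).
    'coeffs' is the reduced state a control/optimization loop would act on."""
    u = np.zeros_like(x)
    v = np.zeros_like(y)
    for w, (k1, k2) in zip(coeffs, modes):
        bu, bv = eigenfluid_basis(k1, k2, x, y)
        u += w * bu
        v += w * bv
    return u, v

# Example: evaluate a two-mode field on a small grid.
xs, ys = np.meshgrid(np.linspace(0, np.pi, 64), np.linspace(0, np.pi, 64))
u, v = reconstruct_velocity([1.0, -0.5], [(1, 1), (2, 1)], xs, ys)
```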

SESSION: 3D People and their Habitats

Physics-based Scene Layout Generation from Human Motion

Creating scenes for captured motions that achieve realistic human-scene interaction is crucial for 3D animation in movies or video games. As character motion is often captured in a blue-screened studio without real furniture or objects in place, there may be a discrepancy between the planned motion and the captured one. This gives rise to the need for automatic scene layout generation to relieve the burdens of selecting and positioning furniture and objects. Previous approaches cannot avoid artifacts like penetration and floating due to the lack of physical constraints. Furthermore, some heavily rely on specific data to learn the contact affordances, restricting the generalization ability to different motions. In this work, we present a physics-based approach that simultaneously optimizes a scene layout generator and simulates a moving human in a physics simulator. To attain plausible and realistic interaction motions, our method explicitly introduces physical constraints. To automatically recover and generate the scene layout, we minimize the motion tracking errors to identify the objects that can afford interaction. We use reinforcement learning to perform a dual-optimization of both the character motion imitation controller and the scene layout generator. To facilitate the optimization, we reshape the tracking rewards and devise pose prior guidance obtained from our estimated pseudo-contact labels. We evaluate our method using motions from SAMP and PROX, and demonstrate physically plausible scene layout reconstruction compared with the previous kinematics-based method.

VRMM: A Volumetric Relightable Morphable Head Model

In this paper, we introduce the Volumetric Relightable Morphable Model (VRMM), a novel volumetric and parametric facial prior for 3D face modeling. While recent volumetric prior models offer improvements over traditional methods like 3D Morphable Models (3DMMs), they face challenges in model learning and personalized reconstructions. Our VRMM overcomes these by employing a novel training framework that efficiently disentangles and encodes latent spaces of identity, expression, and lighting into low-dimensional representations. This framework, designed with self-supervised learning, significantly reduces the constraints for training data, making it more feasible in practice. The learned VRMM offers relighting capabilities and encompasses a comprehensive range of expressions. We demonstrate the versatility and effectiveness of VRMM through various applications like avatar generation, facial reconstruction, and animation. Additionally, we address the common issue of overfitting in generative volumetric models with a novel prior-preserving personalization framework based on VRMM. Such an approach enables high-quality 3D face reconstruction from even a single portrait input. Our experiments showcase the potential of VRMM to significantly enhance the field of 3D face modeling.

Compressed Skinning for Facial Blendshapes

We present a new method to bake classical facial animation blendshapes into a fast linear blend skinning representation. Previous work explored skinning decomposition methods that approximate general animated meshes using a dense set of bone transformations; these optimizers typically alternate between optimizing for the bone transformations and the skinning weights. We depart from this alternating scheme and propose a new approach based on proximal algorithms, which effectively means adding a projection step to the popular Adam optimizer. This approach is very flexible and allows us to quickly experiment with various additional constraints and/or loss functions. Specifically, we depart from the classical skinning paradigms and restrict the transformation coefficients to be roughly 90% zeros, while achieving similar accuracy and visual quality as the state-of-the-art. The sparse storage enables our method to deliver significant savings in terms of both memory and run-time speed. We include a compact implementation of our new skinning decomposition method in PyTorch, which is easy to experiment with and adapt to related problems.
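
A minimal sketch of the proximal idea described above, assuming a toy quadratic approximation loss: take an ordinary Adam step, then project the coefficients back toward the constraint set. The hard-thresholding projection, the tensor shapes, and the loss are illustrative stand-ins, not the paper's actual constraints or implementation.

```python
import torch

def project_sparsity(weights, keep_fraction=0.1):
    """Hypothetical projection: keep only the largest-magnitude entries so that
    roughly (1 - keep_fraction) of the coefficients become exactly zero."""
    flat = weights.abs().flatten()
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return torch.where(weights.abs() >= threshold, weights, torch.zeros_like(weights))

# Illustrative shapes: 'weights' stands in for trainable skinning coefficients,
# and 'target' for the blendshape data they should approximate.
weights = torch.randn(100, 40, requires_grad=True)
target = torch.randn(100, 40)
optimizer = torch.optim.Adam([weights], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    loss = ((weights - target) ** 2).mean()       # placeholder approximation loss
    loss.backward()
    optimizer.step()                              # ordinary Adam update ...
    with torch.no_grad():
        weights.copy_(project_sparsity(weights))  # ... followed by a projection step
```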

SESSION: 3D Fabrication

Modal Folding: Discovering Smooth Folding Patterns for Sheet Materials using Strain-Space Modes

Folding can transform mundane objects such as napkins into stunning works of art. However, finding new folding transformations for sheet materials is a challenging problem that requires expertise and real-world experimentation. In this paper, we present Modal Folding—an automated approach for discovering energetically optimal folding transformations, i.e., large deformations that require little mechanical work. For small deformations, minimizing internal energy for fixed displacement magnitudes leads to the well-known elastic eigenmodes. While linear modes provide promising directions for bending, they cannot capture the rotational motion required for folding. To overcome this limitation, we introduce strain-space modes—nonlinear analogues of elastic eigenmodes that operate on per-element curvatures instead of vertices. Using strain-space modes to determine target curvatures for bending elements, we can generate complex nonlinear folding motions by simply minimizing the sheet’s internal energy. Our modal folding approach offers a systematic and automated way to create complex designs. We demonstrate the effectiveness of our method with simulation results for a range of shapes and materials, and validate our designs with physical prototypes.
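
For context, the linear elastic eigenmodes that the strain-space modes generalize are the low-eigenvalue solutions of the generalized eigenproblem K φ = λ M φ. Below is a minimal SciPy sketch with stand-in matrices; the strain-space construction itself, which operates on per-element curvatures, is not shown.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def elastic_eigenmodes(K, M, num_modes=6):
    """Smallest-eigenvalue modes of K phi = lambda M phi, where K is the
    stiffness matrix (Hessian of internal energy) and M the mass matrix.
    sigma=0 uses shift-invert to target the low-energy end of the spectrum."""
    eigenvalues, eigenvectors = spla.eigsh(K, k=num_modes, M=M, sigma=0)
    return eigenvalues, eigenvectors

# Tiny illustrative system (stand-ins for assembled FEM matrices).
n = 50
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sp.identity(n, format="csc")
lam, phi = elastic_eigenmodes(K, M)
```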

SESSION: Motion Capture

Towards Unstructured Unlabeled Optical Mocap: A Video Helps!

Optical motion capture (mocap) requires accurately reconstructing the human body from retroreflective markers, including pose and shape. In a typical mocap setting, marker labeling is an important but tedious and error-prone step. Previous work has shown that marker labeling can be automated by using a structured template defining specific marker placements, but this places additional recording constraints. We propose to relax these constraints and solve for Unstructured Unlabeled Optical (UUO) mocap. Compared to the typical mocap setting that either labels markers or places them w.r.t. a structured layout, markers in UUO mocap can be placed anywhere on the body and even on one specific limb (e.g., right leg for biomechanics research), hence it is of more practical significance. It is also more challenging. To solve UUO mocap, we exploit a monocular video captured by a single RGB camera, which does not require camera calibration. On this video, we run an off-the-shelf method to reconstruct and track a human individual, giving strong visual priors of human body pose and shape. With both the video and UUO markers, we propose an optimization pipeline towards marker identification, marker labeling, human pose estimation, and human body reconstruction. Our technical novelties include multiple hypothesis testing to optimize global orientation, and marker localization and marker-part matching to better optimize for body surface. We conduct extensive experiments to quantitatively compare our method against state-of-the-art approaches, including marker-only mocap and video-only human body/shape reconstruction. Experiments demonstrate that our method resoundingly outperforms existing methods on three established benchmark datasets for both full-body and partial-body reconstruction.

Physical Non-inertial Poser (PNP): Modeling Non-inertial Effects in Sparse-inertial Human Motion Capture

Existing inertial motion capture techniques use the human root coordinate frame to estimate local poses and treat it as an inertial frame by default. We argue that when the root has linear acceleration or rotation, the root frame should theoretically be considered non-inertial. In this paper, we model the fictitious forces that are non-negligible in a non-inertial frame by an auto-regressive estimator delicately designed following physics. With the fictitious forces, the force-related IMU measurements (accelerations) can be correctly compensated in the non-inertial frame and thus Newton's laws of motion are satisfied. In this case, the relationship between the accelerations and body motions is deterministic and learnable, and we train a neural network to model it for better motion capture. Furthermore, to train the neural network with synthetic data, we develop an IMU-synthesis-by-simulation strategy to better capture the noise characteristics of IMU hardware and allow parameter tuning to fit different hardware. This strategy not only enables network training with synthetic data but also supports calibration-error modeling to handle poor motion capture calibration, increasing the robustness of the system. Code is available at https://xinyu-yi.github.io/PNP/.
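
The fictitious terms mentioned above (Euler, centrifugal, and Coriolis accelerations plus the frame's own acceleration) are classical mechanics; a minimal NumPy sketch of evaluating them is given below. The paper's learned auto-regressive estimator of these quantities, and the exact sign convention it uses for compensation, are not reproduced here.

```python
import numpy as np

def fictitious_acceleration(omega, alpha, r, v_rel, a_frame):
    """Classical fictitious acceleration seen in a frame that rotates with angular
    velocity 'omega' (angular acceleration 'alpha') and accelerates with 'a_frame':
    Euler + centrifugal + Coriolis + frame-acceleration terms."""
    return (-np.cross(alpha, r)
            - np.cross(omega, np.cross(omega, r))
            - 2.0 * np.cross(omega, v_rel)
            - a_frame)

# Example: a sensor 0.3 m from the root while the root spins about z and
# accelerates along x. Whether this term is added or subtracted during
# compensation depends on the chosen convention; here it is only evaluated.
a_fict = fictitious_acceleration(
    omega=np.array([0.0, 0.0, 1.5]), alpha=np.array([0.0, 0.0, 0.2]),
    r=np.array([0.3, 0.0, 0.0]), v_rel=np.zeros(3),
    a_frame=np.array([0.5, 0.0, 0.0]))
```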

Ultra Inertial Poser: Scalable Motion Capture and Tracking from Sparse Inertial Sensors and Ultra-Wideband Ranging

While camera-based capture systems remain the gold standard for recording human motion, learning-based tracking systems based on sparse wearable sensors are gaining popularity. Most commonly, they use inertial sensors, whose propensity for drift and jitter has so far limited tracking accuracy. In this paper, we propose Ultra Inertial Poser, a novel 3D full body pose estimation method that constrains drift and jitter in inertial tracking via inter-sensor distances. We estimate these distances across sparse sensor setups using a lightweight embedded tracker that augments inexpensive off-the-shelf 6D inertial measurement units with ultra-wideband radio-based ranging—dynamically and without the need for stationary reference anchors. Our method then fuses these inter-sensor distances with the 3D states estimated from each sensor. Our graph-based machine learning model processes the 3D states and distances to estimate a person’s 3D full body pose and translation. To train our model, we synthesize inertial measurements and distance estimates from the motion capture database AMASS. For evaluation, we contribute a novel motion dataset of 10 participants who performed 25 motion types, captured by 6 wearable IMU+UWB trackers and an optical motion capture system, totaling 200 minutes of synchronized sensor data (UIP-DB). Our extensive experiments show state-of-the-art performance for our method over PIP and TIP, reducing position error from 13.62 to 10.65 cm (22% better) and lowering jitter from 1.56 to 0.055 km/s³ (a reduction of 97%).

UIP code, UIP-DB dataset, and hardware design: https://github.com/eth-siplab/UltraInertialPoser

Hand-Object Interaction Controller (HOIC): Deep Reinforcement Learning for Reconstructing Interactions with Physics

Hand-object manipulation is an important interaction in our daily activities. We faithfully reconstruct this motion from a single RGBD camera using a novel deep reinforcement learning method that leverages physics. First, we propose object compensation control, which establishes direct object control to make network training more stable. Meanwhile, by leveraging the compensation force and torque, we seamlessly upgrade the simple point contact model to a more physically plausible surface contact model, further improving reconstruction accuracy and physical correctness. Experiments indicate that, without involving any heuristic physical rules, this work successfully incorporates physics into the reconstruction of hand-object interactions, which are complex motions that are hard to imitate with deep reinforcement learning. Our code and data are available at https://github.com/hu-hy17/HOIC.

Physics-Informed Learning of Characteristic Trajectories for Smoke Reconstruction

We delve into the physics-informed neural reconstruction of smoke and obstacles from sparse-view RGB videos, tackling challenges arising from limited observation of complex dynamics. Existing physics-informed neural networks often emphasize short-term physics constraints, leaving the proper preservation of long-term conservation less explored. We introduce Neural Characteristic Trajectory Fields, a novel representation utilizing Eulerian neural fields to implicitly model Lagrangian fluid trajectories. This topology-free, auto-differentiable representation facilitates efficient flow map calculations between arbitrary frames as well as efficient velocity extraction via auto-differentiation. Consequently, it enables end-to-end supervision covering long-term conservation and short-term physics priors. Building on the representation, we propose physics-informed trajectory learning and integration into NeRF-based scene reconstruction. We enable advanced obstacle handling through self-supervised scene decomposition and seamlessly integrated boundary constraints. Our results showcase the ability to overcome challenges like occlusion uncertainty, density-color ambiguity, and static-dynamic entanglements. Code and sample tests are at https://github.com/19reborn/PICT_smoke.

SESSION: Shape Analysis

Consistent Point Orientation for Manifold Surfaces via Boundary Integration

This paper introduces a new approach for generating globally consistent normals for point clouds sampled from manifold surfaces. Given that the generalized winding number (GWN) field generated by a point cloud with globally consistent normals is a solution to a PDE with jump boundary conditions and possesses harmonic properties, and the Dirichlet energy of the GWN field can be defined as an integral over the boundary surface, we formulate a boundary energy derived from the Dirichlet energy of the GWN. Taking as input a point cloud with randomly oriented normals, we optimize this energy to restore the global harmonicity of the GWN field, thereby recovering the globally consistent normals. Experiments show that our method outperforms state-of-the-art approaches, exhibiting enhanced robustness to noise, outliers, complex topologies, and thin structures. Our code can be found at https://github.com/liuweizhou319/BIM.

A Linear Method to Consistently Orient Normals of a 3D Point Cloud

Correctly and consistently orienting a set of normal vectors associated with a point cloud sampled from a surface in 3D is a difficult procedure necessary for further downstream processing of sampled 3D geometry, such as surface reconstruction and registration. It is difficult because correct orientation cannot be achieved without global considerations of the entire point cloud. We present an algorithm to orient a given set of normals of a 3D point cloud of size N, whose main computational component is the least-squares solution of an O(N) linear system, mostly sparse, derived from the classical Stokes’ theorem. We show experimentally that our method can successfully orient sets of normals computed locally from point clouds containing a moderate amount of noise, representing also 3D surfaces with non-smooth features (such as corners and edges), in a fraction of the time required by state-of-the-art methods.

Navigation-Driven Approximate Convex Decomposition

Approximate convex decomposition – approximating a shape by a set of convex hulls – is a popular approach to creating efficient collision representations for games and simulations. Existing algorithms to construct such decompositions are typically driven by general surface- or volume-based error metrics that can neither ignore unreachable internal surfaces nor provide local control over the results. We introduce the problem of navigable approximate convex decomposition: first, define a navigable space for the input shape which other objects in the game or simulation must be able to move through, then find a decomposition that does not overlap that space. We show how to automatically find such navigable space, how to customize it, and we introduce an approximate convex decomposition algorithm that protects it. Our results demonstrate that this approach can generate decompositions that meet application requirements faster and with fewer convex hulls than previous methods, while providing a new level of flexibility in defining what those requirements are.

Into the Portal: Directable Fractal Self-Similarity

We present a novel, directable method for introducing fractal self-similarity into arbitrary shapes. Our method allows a user to directly specify the locations of self-similarities in a Julia set, and is general enough to reproduce other well-known fractals such as the Koch snowflake. Ours is the first algorithm to enable this level of general artistic control while also maintaining the character of the original fractal shape. We introduce the notion of placing “portals” in the iteration space of a dynamical system, bridging the aesthetics of iterated maps with the fine-grained control of iterated function systems (IFS). Our method is effective in both 2D and 3D.
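
The dynamical system underlying a Julia set is the classic quadratic iteration z ← z² + c. As background, the escape-time sketch below renders that baseline set; the paper's portal mechanism, which remaps regions of the iteration space to introduce self-similarity, is not reproduced here.

```python
import numpy as np

def julia_escape_time(c, width=400, height=400, extent=1.8, max_iter=200):
    """Escape-time rendering of the filled Julia set of f(z) = z^2 + c.
    Pixels still bounded after max_iter iterations are treated as inside."""
    xs = np.linspace(-extent, extent, width)
    ys = np.linspace(-extent, extent, height)
    z = xs[None, :] + 1j * ys[:, None]
    iterations = np.zeros(z.shape, dtype=np.int32)
    alive = np.ones(z.shape, dtype=bool)
    for i in range(max_iter):
        z[alive] = z[alive] ** 2 + c                    # the iterated map
        newly_escaped = alive & (np.abs(z) > 2.0)       # bailout radius for z^2 + c
        iterations[newly_escaped] = i
        alive &= ~newly_escaped
    return iterations, alive                            # 'alive' approximates the set

img, inside = julia_escape_time(c=-0.8 + 0.156j)
```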

SESSION: 3D Head Avatars From Data

MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar

The ability to animate photo-realistic head avatars reconstructed from monocular portrait video sequences represents a crucial step in bridging the gap between the virtual and real worlds. Recent advancements in head avatar techniques, including explicit 3D morphable meshes (3DMM), point clouds, and neural implicit representations, have been exploited in this ongoing research. However, 3DMM-based methods are constrained by their fixed topologies, point-based approaches suffer from a heavy training burden due to the extensive quantity of points involved, and neural implicit representations suffer from limitations in deformation flexibility and rendering efficiency. In response to these challenges, we propose MonoGaussianAvatar (Monocular Gaussian Point-based Head Avatar), a novel approach that harnesses 3D Gaussian point representation coupled with a Gaussian deformation field to learn explicit head avatars from monocular portrait videos. We define our head avatars with Gaussian points characterized by adaptable shapes, enabling flexible topology. These points exhibit movement with a Gaussian deformation field in alignment with the target pose and expression of a person, facilitating efficient deformation. Additionally, the Gaussian points have controllable shape, size, color, and opacity combined with Gaussian splatting, allowing for efficient training and rendering. Experiments demonstrate the superior performance of our method, which achieves state-of-the-art results among previous methods. Code and data can be found at https://github.com/aipixel/MonoGaussianAvatar.

InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, Incremental 3D GAN Inversion, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks. The code will be available at https://github.com/XChenZ/invertAvatar.

3D Gaussian Blendshapes for Head Avatar Animation

We introduce 3D Gaussian blendshapes for modeling photorealistic head avatars. Taking a monocular video as input, we learn a base head model of neutral expression, along with a group of expression blendshapes, each of which corresponds to a basis expression in classical parametric face models. Both the neutral model and expression blendshapes are represented as 3D Gaussians, which contain a few properties to depict the avatar appearance. The avatar model of an arbitrary expression can be effectively generated by combining the neutral model and expression blendshapes through linear blending of Gaussians with the expression coefficients. High-fidelity head avatar animations can be synthesized in real time using Gaussian splatting. Compared to state-of-the-art methods, our Gaussian blendshape representation better captures high-frequency details exhibited in input video, and achieves superior rendering performance.
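
A minimal sketch of the linear blending described above: every per-Gaussian attribute of the animated avatar is the neutral value plus an expression-coefficient-weighted sum of blendshape deltas. The attribute names, shapes, and data are illustrative placeholders rather than the paper's actual model.

```python
import numpy as np

def blend_gaussian_avatar(neutral, blendshapes, expr_coeffs):
    """Linearly blend per-Gaussian attributes: result = neutral + sum_i w_i * delta_i.
    'neutral' and each entry of 'blendshapes' are dicts of arrays with one row per
    Gaussian (e.g. positions, rotations, scales, opacities, SH colors)."""
    blended = {key: value.copy() for key, value in neutral.items()}
    for w, delta in zip(expr_coeffs, blendshapes):
        for key in blended:
            blended[key] += w * delta[key]
    return blended

# Tiny illustrative model: 1000 Gaussians, 2 expression blendshapes.
n = 1000
neutral = {"position": np.random.randn(n, 3), "opacity": np.full((n, 1), 0.8)}
blendshapes = [
    {"position": 0.01 * np.random.randn(n, 3), "opacity": np.zeros((n, 1))}
    for _ in range(2)
]
avatar = blend_gaussian_avatar(neutral, blendshapes, expr_coeffs=[0.6, 0.3])
```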

SESSION: Real-Time Rendering: Hair, Fabrics, and Super-Resolution

Real-Time Hair Rendering with Hair Meshes

Hair meshes are known to be effective for modeling and animating hair in computer graphics. We present how the hair mesh structure can be used for efficiently rendering strand-based hair models on the GPU with on-the-fly geometry generation that provides orders of magnitude reduction in storage and memory bandwidth. We use mesh shaders to carefully distribute the computation and a custom texture layout for offloading a part of the computation to the hardware texture units. We also present a set of procedural styling operations to achieve hair strand variations for a wide range of hairstyles and a consistent coordinate-frame generation approach to attach these variations to an animating/deforming hair mesh. Finally, we describe level-of-detail techniques for improving the performance of rendering distant hair models. Our results show an unprecedented level of performance with strand-based hair rendering, achieving hundreds of full hair models animated and rendered at real-time frame rates on a consumer GPU.

Modeling Hair Strands with Roving Capsules

Hair strands can be modeled by sweeping spheres with varying radii along Bézier curves. We ray-trace such shapes by finding intersections of a given ray with a set of capsules dynamically defined at runtime. A substantial performance boost is achieved by systematically eliminating parts of the shape that are guaranteed not to intersect with the given ray. The new intersector is more than twice as fast as the previously leading phantom algorithm [Reshetov and Luebke 2018]. This improvement results in a 30% overall performance increase, which includes traversal, shading, and the rendering system overhead.

In addition, we derive a parametric form of the swept sphere shapes. This provides a deeper understanding of the properties of such objects compared to the offset surfaces obtained by sweeping circles orthogonal to a given curve.

The complete WebGL implementation of our algorithm is available at https://www.shadertoy.com/view/4ffXWs.
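
As a correctness baseline for the swept-sphere shape described above, one can densely sample the cubic Bézier, treat each sample as a sphere with an interpolated radius, and test the ray against every sphere. The sketch below does exactly that; it is far slower than the capsule-based intersector of the paper and is only meant to illustrate the geometry.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t (expanded Bernstein form)."""
    s = 1.0 - t
    return (s**3) * p0 + 3 * (s**2) * t * p1 + 3 * s * (t**2) * p2 + (t**3) * p3

def ray_sphere(origin, direction, center, radius):
    """Smallest non-negative ray parameter hitting the sphere, or None.
    Assumes |direction| == 1 and the ray origin lies outside the sphere."""
    oc = origin - center
    b = np.dot(oc, direction)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - np.sqrt(disc)
    return t if t >= 0.0 else None

def ray_swept_sphere(origin, direction, ctrl_pts, radii, samples=64):
    """Brute-force reference: march spheres along the curve, keep the nearest hit."""
    best = None
    for t in np.linspace(0.0, 1.0, samples):
        center = cubic_bezier(*ctrl_pts, t)
        radius = (1.0 - t) * radii[0] + t * radii[1]   # linearly varying radius
        hit = ray_sphere(origin, direction, center, radius)
        if hit is not None and (best is None or hit < best):
            best = hit
    return best

hit_t = ray_swept_sphere(
    origin=np.array([0.0, 0.225, -1.0]), direction=np.array([0.0, 0.0, 1.0]),
    ctrl_pts=[np.array([-0.5, 0.0, 0.0]), np.array([-0.2, 0.3, 0.0]),
              np.array([0.2, 0.3, 0.0]), np.array([0.5, 0.0, 0.0])],
    radii=(0.05, 0.02))
```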

Real-time Neural Woven Fabric Rendering

Woven fabrics are widely used in applications of realistic rendering, where real-time capability is also essential. However, rendering realistic woven fabrics in real time is challenging due to their complex structure and optical appearance, which cause aliasing and noise without many samples. The key to addressing this issue is a multi-scale representation of the fabric shading model that allows for fast range queries. Some previous neural methods deal with the issue at the cost of training on each material, which limits their practicality. In this paper, we propose a lightweight neural network to represent different types of woven fabrics at different scales. Thanks to the regularity and repetitiveness of woven fabric patterns, our network can encode fabric patterns and parameters as a small latent vector, which is later interpreted by a small decoder, enabling the representation of different types of fabrics. By applying the pixel’s footprint as input, our network achieves multi-scale representation. Moreover, our network is fast and occupies little storage because of its lightweight structure. As a result, our method achieves rendering and editing woven fabrics at nearly 60 frames per second on an RTX 3090, showing a quality close to the ground truth and being free from visible aliasing and noise.

Mob-FGSR: Frame Generation and Super Resolution for Mobile Real-Time Rendering

Recent advances in supersampling for frame generation and super-resolution improve real-time rendering performance significantly. However, because these methods rely heavily on the most recent features of high-end GPUs, they are impractical for mobile platforms, which are limited by lower GPU capabilities and a lack of dedicated optical flow estimation hardware. We propose Mob-FGSR, a novel lightweight supersampling framework tailored for mobile devices that integrates frame generation with super resolution to effectively improve real-time rendering performance. Our method introduces a splat-based motion vector reconstruction method, which allows for accurate pixel-level motion estimation for both interpolation and extrapolation at desired times without the need for high-end GPUs or rendering data from generated frames. Subsequently, fast image generation models are designed to construct interpolated or extrapolated frames and improve resolution, providing users with a plethora of options. Our runtime models operate without the use of neural networks, ensuring their applicability to mobile devices. Extensive testing shows that our framework outperforms other lightweight solutions and rivals the performance of algorithms designed specifically for high-end GPUs. Our model’s minimal runtime is confirmed by on-device testing, demonstrating its potential to benefit a wide range of mobile real-time rendering applications. More information and an Android demo can be found at: https://mob-fgsr.github.io/.

Deep Fourier-based Arbitrary-scale Super-resolution for Real-time Rendering

As a prevailing tool for effectively reducing rendering costs in many graphical applications, frame super-resolution has seen important progress in recent years. However, most prior works designed for rendered content share a common limitation: once a model is trained, it supports only a single fixed scale. In this paper, we attempt to eliminate this limitation by supporting arbitrary-scale super-resolution for a trained neural model. The key is a Fourier-based implicit neural representation which continuously maps arbitrary coordinates in the high-resolution spatial domain to valid pixel values. By observing that high-resolution G-buffers possess spectra similar to those of high-resolution rendered frames, we design a High-Frequency Fourier Mapping (HFFM) module to recover fine details from low-resolution inputs, without introducing noticeable artifacts. A Low-Frequency Residual Learning (LFRL) strategy is adopted to preserve low-frequency structures and limit the bias introduced by network inference. Moreover, different rendering contents are well separated by our spatial-temporal masks derived from G-buffers and motion vectors. Several lightweight design choices in the neural network guarantee real-time performance on a wide range of scenes.
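
The Fourier-based implicit representation builds on the standard idea of mapping continuous coordinates through fixed sinusoidal features before a small MLP. The sketch below shows that generic mapping; the frequency scale, the network sizes, and the omission of the paper's HFFM/LFRL modules are assumptions on our part.

```python
import math
import torch

class FourierFeatures(torch.nn.Module):
    """Map 2D pixel coordinates to [sin(2*pi*Bx), cos(2*pi*Bx)] features."""
    def __init__(self, in_dim=2, num_freqs=64, scale=10.0):
        super().__init__()
        # Fixed random frequency matrix B (not trained), as in Fourier feature networks.
        self.register_buffer("B", scale * torch.randn(in_dim, num_freqs))

    def forward(self, coords):                        # coords: (N, in_dim) in [0, 1]
        proj = 2.0 * math.pi * coords @ self.B         # (N, num_freqs)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# A tiny coordinate network predicting RGB at arbitrary (continuous) positions,
# which is what makes the output resolution scale-free.
encoder = FourierFeatures()
mlp = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 3), torch.nn.Sigmoid(),
)
coords = torch.rand(4096, 2)                           # query positions at any scale
rgb = mlp(encoder(coords))
```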

SESSION: Character Animation from Data

LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model

In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM utilizes a diffusion-based architecture and aims to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Traditional methods often struggle with semantic discrepancies, particularly in aligning specific motions to the correct body parts. To address this issue, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment. Finally, an attention-based full-body optimizer refines the motion generation results and guarantees overall coherence. Our experiments demonstrate that LGTM achieves significant improvements in generating locally accurate, semantically aligned human motion, marking a notable advancement in text-to-motion applications. Code and data for this paper are available at https://github.com/L-Sun/LGTM.

Taming Diffusion Probabilistic Models for Character Control

We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character’s historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first model that enables real-time generation of high-quality, diverse character animations based on interactive user control, and it supports animating the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating the merits of our method over existing character controllers. The code and model are available at https://aiganimation.github.io/CAMDM/.
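
One of the listed ingredients, classifier-free guidance, combines a conditional and an unconditional denoiser prediction at sampling time. The sketch below shows the standard formulation with a toy placeholder denoiser; CAMDM's actual architecture, its condition tokenization, and the fact that it guides on past motion rather than a text condition are not reproduced here.

```python
import torch

def classifier_free_guidance(denoiser, x_t, t, condition, guidance_weight=2.0):
    """Standard classifier-free guidance: push the conditional prediction away
    from the unconditional one by 'guidance_weight'. 'denoiser(x_t, t, condition)'
    is a placeholder for a trained model; condition=None stands in for the null
    condition that is randomly dropped in during training."""
    eps_cond = denoiser(x_t, t, condition)
    eps_uncond = denoiser(x_t, t, None)
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

# Toy denoiser so the sketch runs end to end (it ignores t and the condition content).
def toy_denoiser(x_t, t, condition):
    shift = 0.0 if condition is None else 0.1
    return 0.5 * x_t + shift

x_t = torch.randn(1, 60, 75)          # e.g. a window of 60 noisy pose frames
eps = classifier_free_guidance(toy_denoiser, x_t, t=10, condition="coarse control")
```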

TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis

The gradual nature of a diffusion process that synthesizes samples in small increments constitutes a key ingredient of Denoising Diffusion Probabilistic Models (DDPM), which have presented unprecedented quality in image synthesis and been recently explored in the motion domain. In this work, we propose to adapt the gradual diffusion concept (operating along a diffusion time-axis) into the temporal-axis of the motion sequence. Our key idea is to extend the DDPM framework to support temporally varying denoising, thereby entangling the two axes. Using our special formulation, we iteratively denoise a motion buffer that contains a set of increasingly-noised poses, which auto-regressively produces an arbitrarily long stream of frames. With a stationary diffusion time-axis, in each diffusion step we increment only the temporal-axis of the motion such that the framework produces a new, clean frame which is removed from the beginning of the buffer, followed by a newly drawn noise vector that is appended to it. This new mechanism paves the way towards a new framework for long-term motion synthesis with applications to character animation and other domains.
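
A minimal sketch of the buffer mechanic as we read it: the buffer holds poses at increasing noise levels, each step moves every entry one level closer to clean, the now-clean front pose is emitted, and a freshly drawn noise vector is appended. The denoiser here is a trivial placeholder, not the trained TEDi model.

```python
import torch

def tedi_stream(denoise_one_level, buffer, num_output_frames):
    """Auto-regressive long-horizon synthesis with a temporally entangled buffer.
    buffer: tensor (T, D); entry i holds a pose at noise level i (0 = clean,
    T-1 = pure noise). denoise_one_level(buffer) must return the buffer with every
    entry moved one noise level closer to clean."""
    frames = []
    for _ in range(num_output_frames):
        buffer = denoise_one_level(buffer)           # every pose gets one step cleaner
        frames.append(buffer[0].clone())             # front of the buffer is now clean
        new_noise = torch.randn(1, buffer.shape[1])  # fresh fully-noised pose
        buffer = torch.cat([buffer[1:], new_noise])  # pop the clean frame, push noise
    return torch.stack(frames)

# Placeholder "denoiser" so the sketch runs: simply damp every entry a little.
damp = lambda buf: 0.9 * buf
stream = tedi_stream(damp, buffer=torch.randn(16, 69), num_output_frames=120)
```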

Flexible Motion In-betweening with Diffusion Models

Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI), which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.

WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds

We present a new approach for understanding the periodicity structure and semantics of motion datasets, independently of the morphology and skeletal structure of characters. Unlike existing methods using an overly sparse high-dimensional latent, we propose a phase manifold consisting of multiple closed curves, each corresponding to a latent amplitude. With our proposed vector quantized periodic autoencoder, we learn a shared phase manifold for multiple characters, such as a human and a dog, without any supervision. This is achieved by exploiting the discrete structure and a shallow network as bottlenecks, such that semantically similar motions are clustered into the same curve of the manifold, and the motions within the same component are aligned temporally by the phase variable. In combination with an improved motion matching framework, we demonstrate the manifold’s capability of timing and semantics alignment in several applications, including motion retrieval, transfer and stylization. Code and pre-trained models for this paper are available at peizhuoli.github.io/walkthedog.

Iterative Motion Editing with Natural Language

Text-to-motion diffusion models can generate realistic animations from text prompts, but do not support fine-grained motion editing controls. In this paper, we present a method for using natural language to iteratively specify local edits to existing character animations, a task that is common in most computer animation workflows. Our key idea is to represent a space of motion edits using a set of kinematic motion editing operators (MEOs) whose effects on the source motion are well aligned with user expectations. We provide an algorithm that leverages pre-existing language models to translate textual descriptions of motion edits into source code for programs that define and execute sequences of MEOs on a source animation. We execute MEOs by first translating them into keyframe constraints, and then use diffusion-based motion models to generate output motions that respect these constraints. Through a user study and quantitative evaluation, we demonstrate that our system can perform motion edits that respect the animator’s editing intent, remain faithful to the original animation (it edits the original animation, but does not dramatically change it), and yield realistic character animation results.

SESSION: Rendering, Sampling and Tracing

Quad-Optimized Low-Discrepancy Sequences

The convergence of Monte Carlo integration is governed by the uniformity of the samples as well as the regularity of the integrand. Despite much effort dedicated to producing excellent, extremely uniform sampling patterns, the Sobol’ sampler remains unchallenged in production rendering systems. This is not only due to its reasonable quality, but also because it allows for integration in (almost) arbitrary dimension, with arbitrary sample count, while actually producing sequences, thus allowing for progressive rendering, with fast sample generation and a small memory footprint. We improve over Sobol’ sequences in terms of sample uniformity in consecutive 2D and 4D projections, while providing similar practical benefits: sequences, high dimensionality, speed, and compactness. We base our contribution on a base-3 Sobol’ construction, involving a search over irreducible polynomials and generator matrices, that produces (1, 4)-sequences or (2, 4)-sequences in all consecutive quadruplets of dimensions, and (0, 2)-sequences in all consecutive pairs of dimensions. We provide these polynomials and matrices, which may be used as a replacement for Joe & Kuo’s widely used ones, at some computational overhead, for moderate-dimensional problems.
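
The optimized generator matrices plug into the standard digital-net point construction, where a coordinate is obtained by applying a per-dimension generator matrix to the digits of the sample index. The sketch below shows that mechanism in base 2 with the trivial identity matrix, which reproduces the van der Corput sequence; the paper's construction instead operates in base 3 and uses its searched matrices, which are not reproduced here.

```python
def digital_net_coordinate(index, columns, bits=32):
    """Base-2 digital net: XOR together the matrix columns selected by the set
    bits of 'index', then scale to [0, 1). 'columns[k]' is the k-th column of a
    binary generator matrix packed into an integer (MSB = first row)."""
    acc = 0
    k = 0
    while index:
        if index & 1:
            acc ^= columns[k]
        index >>= 1
        k += 1
    return acc / float(1 << bits)

# Identity generator matrix (column k has a single 1 in row k), which reproduces
# the van der Corput sequence in base 2 -- the first dimension of Sobol'.
bits = 32
identity_columns = [1 << (bits - 1 - k) for k in range(bits)]
points = [digital_net_coordinate(i, identity_columns, bits) for i in range(8)]
# points == [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```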

SESSION: Lighting and Matting with Image Generation

DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the necessary expressional power to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three-stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing it to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting controlled diffusion model on a variety of text prompts and lighting conditions.

IntrinsicDiffusion: Joint Intrinsic Layers from Latent Diffusion Models

Reasoning about the intrinsic properties of an image, such as albedo, illumination, and surface geometry, is a long-standing problem with many applications in image editing and compositing. Existing solutions to this ill-posed problem either heavily rely on manually designed priors or learn priors from limited datasets that lack diversity. Hence, they fall short in generalizing to in-the-wild test scenarios. In this paper, we show that a large-scale text-to-image generation model trained on a massive amount of visual data can implicitly learn intrinsic image priors. In particular, we introduce a novel conditioning mechanism built on top of a pre-trained foundational image generation model to jointly predict multiple intrinsic modalities from an input image. We demonstrate that predicting different modalities in a collaborative manner improves the overall quality. This design also enables mixing datasets with annotations of only a subset of the modalities during training, contributing to the generalizability of our approach. Our method achieves state-of-the-art performance in intrinsic image decomposition, both qualitatively and quantitatively. We also demonstrate downstream image editing applications, such as relighting and retexturing.

RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models

The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB → X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X → RGB, can also be addressed in a diffusion framework. Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB → X, which also estimates lighting, as well as the first diffusion X → RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X → RGB model explores a middle ground between traditional rendering and generative models: We can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest. This flexibility allows using a mix of heterogeneous training datasets that differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.

Matting by Generation

This paper introduces an innovative approach for image matting that redefines the traditional regression-based task as a generative modeling challenge. Our method harnesses the capabilities of latent diffusion models, enriched with extensive pre-trained knowledge, to regularize the matting process. We present novel architectural innovations that empower our model to produce mattes with superior resolution and detail. The proposed method is versatile and can perform both guidance-free and guidance-based image matting, accommodating a variety of additional cues. Our comprehensive evaluation across three benchmark datasets demonstrates the superior performance of our approach, both quantitatively and qualitatively. The results not only reflect our method’s robust effectiveness but also highlight its ability to generate visually compelling mattes that approach photorealistic quality. The code for this paper is available at https://github.com/lightChaserX/alphaLDM.

SESSION: Virtual Interaction and Real Devices

Dragon's Path: Synthesizing User-Centered Flying Creature Animation Paths for Outdoor Augmented Reality Experiences

Advances in augmented reality promise to deliver highly immersive storytelling experiences by animating virtual characters naturally in the real world. However, creating such realistic animated content for viewing in augmented reality is non-trivial and challenging. In this paper, we present a novel approach to automatically generate user-centered flying creature animation paths for outdoor augmented reality experiences. Given a sequence of storyline actions, our approach finds suitable locations for the character to perform its actions via a location compatibility predictor trained with user preferences, synthesizing a corresponding animation path optimized with respect to the user’s perspective. We applied our approach to synthesize user-centered augmented reality experiences based on different storyline actions and environments. We also conducted user study experiments to validate the efficacy of our approach for synthesizing desirable augmented reality experiences.

VR-GS: A Physical Dynamics-Aware Interactive Gaussian Splatting System in Virtual Reality

As 3D content becomes increasingly prevalent, there is a growing focus on developing engaging ways to interact with 3D virtual content. Unfortunately, traditional techniques for creating, editing, and interacting with this content are fraught with difficulties. They tend to be not only engineering-intensive but also require extensive expertise, which adds to the frustration and inefficiency of virtual object manipulation. Our proposed VR-GS system represents a leap forward in human-centered 3D content interaction, offering a seamless and intuitive user experience. By developing a physical dynamics-aware interactive Gaussian Splatting (GS) in a Virtual Reality (VR) setting, and constructing a highly efficient two-level embedding strategy alongside deformable body simulations, VR-GS ensures real-time execution with highly realistic dynamic responses. The components of our system are designed for high efficiency and effectiveness, starting from detailed scene reconstruction and object segmentation, advancing through multi-view image in-painting, and extending to interactive physics-based editing. The system also incorporates real-time deformation embedding and dynamic shadow casting, ensuring a comprehensive and engaging virtual experience.

Temporal acoustic point holography

Holographic acoustic levitation using phased arrays of transducers has facilitated innovative mid-air particle displays. However, current time-invariant methods for computing acoustic holograms often induce high phase changes and thus transducer amplitude fluctuations, severely limiting possible dynamic displays. In this work, we develop a wide range of temporal phase retrieval algorithms that suppress phase change and trade off computational time and solution optimality. We base our hardware implementation on Gerchberg-Saxton (GS), which we identify as gradient descent for maximizing focal amplitudes and thus acoustic trap quality. Following this, we adapt GS to additionally constrain transducers’ (and points’) phase changes from the previous time frame, and enable multiparticle animations without prior knowledge of particles’ paths and interactions. We experimentally showcase a series of levitated character animations depicting human motion and believe our work paves the way for natural dynamic multi-particle displays and unencumbered 3D delivery of other modalities such as haptics and audio.
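
For reference, the plain (time-invariant) Gerchberg-Saxton loop that the hardware implementation starts from alternates between enforcing the target amplitudes at the focal points and unit amplitudes at the transducers. The NumPy sketch below uses a random stand-in propagation matrix and omits the paper's constraint on phase change between consecutive frames.

```python
import numpy as np

def gerchberg_saxton(H, target_amplitudes, num_iters=100):
    """H: complex propagation matrix (num_points x num_transducers); H[i, j] is the
    field that focal point i receives from transducer j at unit drive.
    Returns transducer phases that approximately realize the target focal amplitudes."""
    num_points, num_transducers = H.shape
    drive = np.exp(1j * np.random.uniform(0, 2 * np.pi, num_transducers))
    for _ in range(num_iters):
        field = H @ drive                                          # forward propagation
        field = target_amplitudes * np.exp(1j * np.angle(field))   # keep phase, fix amplitude
        back = H.conj().T @ field                                  # back-propagation
        drive = np.exp(1j * np.angle(back))                        # transducers: unit amplitude
    return np.angle(drive)

# Toy example: 2 focal points, 16 transducers, random (stand-in) propagation matrix.
rng = np.random.default_rng(0)
H = rng.standard_normal((2, 16)) + 1j * rng.standard_normal((2, 16))
phases = gerchberg_saxton(H, target_amplitudes=np.array([1.0, 1.0]))
```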

SESSION: Cloth Simulation

Neural-Assisted Homogenization of Yarn-Level Cloth

Real-world fabrics, composed of threads and yarns, often display complex stress-strain relationships, making their homogenization a challenging task for fast simulation by continuum-based models. Consequently, existing homogenized yarn-level models frequently struggle with numerical stability without line search at large time steps, forcing a trade-off between model accuracy and stability. In this paper, we propose a neural-assisted homogenized constitutive model for simulating yarn-level cloth. Unlike analytic models, a neural model is advantageous in adapting to complex dynamic behaviors, and its inherent smoothness naturally mitigates stability issues. We also introduce a sector-based warm-start strategy to accelerate the data collection process in homogenization. This model is trained using collected strain energy datasets and its accuracy is validated through both qualitative and quantitative experiments. Thanks to our model’s stability, our simulator can now achieve two-orders-of-magnitude speedups with large time steps compared to previous models.

ContourCraft: Learning to Resolve Intersections in Neural Multi-Garment Simulations

Learning-based approaches to cloth simulation have started to show their potential in recent years. However, handling collisions and intersections in neural simulations remains a largely unsolved problem. In this work, we present ContourCraft, a learning-based solution for handling intersections in neural cloth simulations. Unlike conventional approaches that critically rely on intersection-free inputs, ContourCraft robustly recovers from intersections introduced through missed collisions, self-penetrating bodies, or errors in manually designed multi-layer outfits. The technical core of ContourCraft is a novel intersection contour loss that penalizes interpenetrations and encourages rapid resolution thereof. We integrate our intersection loss with a collision-avoiding repulsion objective into a neural cloth simulation method based on graph neural networks (GNNs). We demonstrate our method’s ability across a challenging set of diverse multi-layer outfits under dynamic human motions. Our extensive analysis indicates that ContourCraft significantly improves collision handling for learned simulation and produces visually compelling results.

SESSION: 3D Shape Analysis

DAE-Net: Deforming Auto-Encoder for fine-grained shape co-segmentation

We present an unsupervised 3D shape co-segmentation method which learns a set of deformable part templates from a shape collection. To accommodate structural variations in the collection, our network composes each shape by a selected subset of template parts which are affine-transformed. To maximize the expressive power of the part templates, we introduce a per-part deformation network to enable the modeling of diverse parts with substantial geometry variations, while imposing constraints on the deformation capacity to ensure fidelity to the originally represented parts. We also propose a training scheme to effectively overcome local minima. Architecturally, our network is a branched autoencoder, with a CNN encoder taking a voxel shape as input and producing per-part transformation matrices, latent codes, and part existence scores, and the decoder outputting point occupancies to define the reconstruction loss. Our network, coined DAE-Net for Deforming Auto-Encoder, can achieve unsupervised 3D shape co-segmentation that yields fine-grained, compact, and meaningful parts that are consistent across diverse shapes. We conduct extensive experiments on the ShapeNet Part dataset, DFAUST, and an animal subset of Objaverse to show superior performance over prior methods. Code and data are available at https://github.com/czq142857/DAE-Net.

SESSION: Dynamic Radiance Fields

ST-4DGS: Spatial-Temporally Consistent 4D Gaussian Splatting for Efficient Dynamic Scene Rendering

Dynamic scene rendering at any novel view continues to be a difficult but important task, especially for high-fidelity rendering quality at an efficient rendering speed. The recent 3D Gaussian Splatting, i.e., 3DGS, shows great success for static scene rendering with impressive quality at a very efficient speed. However, the extension of 3DGS from static scenes to dynamic 4DGS is still challenging, even for scenes with modest amounts of foreground object movement (such as a human moving an object). This paper proposes a novel spatial-temporally consistent 4D Gaussian Splatting, i.e., ST-4DGS, which aims at spatial-temporally consistent dynamic rendering quality while maintaining real-time rendering efficiency. The key ideas of ST-4DGS are two novel mechanisms: (1) a spatial-temporal 4D Gaussian Splatting with a motion-aware shape regularization, and (2) a spatial-temporal joint density control mechanism. The proposed mechanisms efficiently prevent the compactness degeneration of the 4D Gaussian representation during dynamic scene learning, thus leading to spatial-temporally consistent dynamic rendering quality. With extensive evaluation on public datasets, our ST-4DGS achieves much better dynamic rendering quality than previous approaches, such as 4DGS, HexPlane, K-Plane, and 4K4D, at a more efficient rendering speed for persistent dynamic rendering. To the best of our knowledge, ST-4DGS is a new state-of-the-art 4D Gaussian Splatting method for high-fidelity dynamic rendering, especially in ensuring spatial-temporally consistent rendering quality in scenes with modest movement. The code is available at https://github.com/wanglids/ST-4DGS.

GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a capability yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments. Code is available on the project webpage: https://zju3dv.github.io/gaussian-prediction.

Factorized Motion Fields for Fast Sparse Input Dynamic View Synthesis

Designing a 3D representation of a dynamic scene for fast optimization and rendering is a challenging task. While recent explicit representations enable fast learning and rendering of dynamic radiance fields, they require a dense set of input viewpoints. In this work, we focus on learning a fast representation for dynamic radiance fields with sparse input viewpoints. However, the optimization with sparse input is under-constrained and necessitates the use of motion priors to constrain the learning. Existing fast dynamic scene models do not explicitly model the motion, making them difficult to be constrained with motion priors. We design an explicit motion model as a factorized 4D representation that is fast and can exploit the spatio-temporal correlation of the motion field. We then introduce reliable flow priors including a combination of sparse flow priors across cameras and dense flow priors within cameras to regularize our motion model. Our model is fast, compact and achieves very good performance on popular multi-view dynamic scene datasets with sparse input viewpoints. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2024/RF-DeRF.html.

Modeling Ambient Scene Dynamics for Free-view Synthesis

We introduce a novel method for dynamic free-view synthesis of ambient scenes from a monocular capture, bringing an immersive quality to the viewing experience. Our method builds upon recent advancements in 3D Gaussian Splatting (3DGS) that can faithfully reconstruct complex static scenes. Previous attempts to extend 3DGS to represent dynamics have been confined to bounded scenes or require multi-camera captures, and often fail to generalize to unseen motions, limiting their practical application. Our approach overcomes these constraints by leveraging the periodicity of ambient motions to learn the motion trajectory model, coupled with careful regularization. We also propose important practical strategies to improve the visual quality of the baseline 3DGS static reconstructions and to improve memory efficiency, which is critical for GPU-memory-intensive learning. We demonstrate high-quality photorealistic novel view synthesis of several ambient natural scenes with intricate textures and fine structural elements. We show that our method significantly outperforms prior methods both qualitatively and quantitatively. Project page: https://ambientgaussian.github.io/

4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes

We consider the problem of novel-view synthesis (NVS) for dynamic scenes. Recent neural approaches have accomplished exceptional NVS results for static 3D scenes, but extensions to 4D time-varying scenes remain non-trivial. Prior efforts often encode dynamics by learning a canonical space plus implicit or explicit deformation fields, which struggle in challenging scenarios like sudden movements or generating high-fidelity renderings. In this paper, we introduce 4D Gaussian Splatting (4DRotorGS), a novel method that represents dynamic scenes with anisotropic 4D XYZT Gaussians, inspired by the success of 3D Gaussian Splatting in static scenes [Kerbl et al. 2023]. We model dynamics at each timestamp by temporally slicing the 4D Gaussians, which naturally compose dynamic 3D Gaussians and can be seamlessly projected into images. As an explicit spatial-temporal representation, 4DRotorGS demonstrates powerful capabilities for modeling complicated dynamics and fine details—especially for scenes with abrupt motions. We further implement our temporal slicing and splatting techniques in a highly optimized CUDA acceleration framework, achieving real-time inference rendering speeds of up to 277 FPS on an RTX 3090 GPU and 583 FPS on an RTX 4090 GPU. Rigorous evaluations on scenes with diverse motions showcase the superior efficiency and effectiveness of 4DRotorGS, which consistently outperforms existing methods both quantitatively and qualitatively.
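
One common way to slice a 4D (x, y, z, t) Gaussian at a query time is to condition the joint Gaussian on t, which yields a 3D Gaussian whose center moves linearly in time and whose contribution is attenuated away from its temporal mean. The sketch below shows that conditioning as a general illustration; it does not reproduce the rotor-based parameterization or the CUDA implementation of 4DRotorGS.

```python
import numpy as np

def slice_4d_gaussian(mu4, cov4, t):
    """Condition a 4D Gaussian N(mu4, cov4) over (x, y, z, t) on time t.
    Returns the 3D mean, 3D covariance, and a [0, 1] temporal weight that could
    modulate the Gaussian's opacity during splatting."""
    mu_x, mu_t = mu4[:3], mu4[3]
    cov_xx = cov4[:3, :3]
    cov_xt = cov4[:3, 3]
    cov_tt = cov4[3, 3]
    mean3 = mu_x + cov_xt * (t - mu_t) / cov_tt           # time-dependent center
    cov3 = cov_xx - np.outer(cov_xt, cov_xt) / cov_tt     # conditional covariance
    weight = np.exp(-0.5 * (t - mu_t) ** 2 / cov_tt)      # marginal falloff in time
    return mean3, cov3, weight

# Illustrative 4D Gaussian whose spatial x-coordinate is coupled to time.
mu4 = np.array([0.0, 0.1, -0.2, 0.5])
cov4 = np.diag([0.01, 0.01, 0.01, 0.04])
cov4[0, 3] = cov4[3, 0] = 0.005
mean3, cov3, w = slice_4d_gaussian(mu4, cov4, t=0.7)
```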

Controllable Neural Style Transfer for Dynamic Meshes

In recent years, animated films have been shifting from realistic representations to more stylized depictions that support unique design languages. To support this shift, recent works have implemented a Neural Style Transfer (NST) pipeline that supports the stylization of 3D assets by 2D images. In this paper we propose a novel mesh stylization technique that improves previous NST works in several ways. First, we replace the standard Gram-matrix style loss with a Neural Neighbor formulation that enables sharper and artifact-free results. To support large mesh deformations, we reparametrize the optimized mesh positions through an implicit formulation based on the Laplace-Beltrami operator that better captures silhouette gradients, which are common in inverse differentiable rendering setups. This reparametrization is coupled with a coarse-to-fine stylization setup, which enables deformations that can change large structures of the mesh. We provide artistic control through a novel method that enables directional and temporal control over synthesized styles by a guiding vector field. Lastly, we improve on previous time-coherency schemes and develop an efficient regularization that controls volume changes during the stylization process. These improvements enable high-quality mesh stylizations that can create unique looks for both simulations and 3D assets.

SESSION: Appearance Models

A Realistic Multi-scale Surface-based Cloth Appearance Model

Surface-based cloth appearance models have been rapidly advancing, shifting from detail-less BRDFs to modern per-point shading models with accurate spatially-varying reflection, transmission, and so on. However, the increased complexity has brought about realism-performance trade-offs: from close up, rendered cloth can be highly inaccurate due to missing, unaffordable parallax effects; from far away, a significant amount of noise shows up, since every point inside a pixel’s footprint can be shaded differently. In this paper, we aim at eliminating this trade-off with a realistic multi-scale surface-based cloth appearance model. We propose a comprehensive micro-scale model focusing on correct parallax effects, and a practical meso-scale integration scheme emphasizing efficiency while losslessly preserving accurate highlights and self-shadowing. We further improve its performance using our novel Clustered Control Variates (CCV) and Summed-Area Table (SAT) integration scheme, and its practicality using an efficient Clustered Principal Component Analysis (C-PCA) compression method. As a result, our multi-scale model achieves a 30× acceleration compared to the state of the art, is able to represent a variety of realistic cloth appearances, and can potentially be applied in real-time applications.
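
Of the components named above, the Summed-Area Table is a generic building block that is easy to illustrate: once a table of prefix sums is built, the integral over any axis-aligned footprint costs four lookups. The sketch below shows that standard construction and query in NumPy; it is not the paper's CCV/SAT integration scheme, and build_sat/box_sum are hypothetical names.

```python
import numpy as np

def build_sat(img):
    # Zero-pad so border queries need no special cases.
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_sum(sat, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] (half-open ranges) with four table lookups."""
    return sat[r1, c1] - sat[r0, c1] - sat[r1, c0] + sat[r0, c0]

img = np.random.rand(64, 64)
sat = build_sat(img)
# Average over a 16x32 footprint in constant time, independent of footprint size.
avg = box_sum(sat, 8, 8, 24, 40) / (16 * 32)
assert np.isclose(avg, img[8:24, 8:40].mean())
```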

SESSION: Simulating Deformation

Stabler Neo-Hookean Simulation: Absolute Eigenvalue Filtering for Projected Newton

Volume-preserving hyperelastic materials are widely used to model near-incompressible materials such as rubber and soft tissues. However, the numerical simulation of volume-preserving hyperelastic materials is notoriously challenging within this regime due to the non-convexity of the energy function. In this work, we identify the pitfalls of the popular eigenvalue clamping strategy for projecting Hessian matrices to positive semi-definiteness during Newton’s method. We introduce a novel eigenvalue filtering strategy for projected Newton’s method to stabilize the optimization of Neo-Hookean energy and other volume-preserving variants under high Poisson’s ratio (near 0.5) and large initial volume change. Our method only requires a single line of code change in the existing projected Newton framework, while achieving significant improvement in both stability and convergence speed. We demonstrate the effectiveness and efficiency of our eigenvalue projection scheme on a variety of challenging examples and over different deformations on a large dataset.
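
As a rough illustration of the "single line of code change" described above, the sketch below contrasts the classic eigenvalue-clamping projection with an absolute-value filter on a symmetric Hessian block. It is a schematic reading of the abstract (NumPy, dense eigendecomposition), not the authors' implementation, which works with the eigensystem of the elastic energy Hessian inside a projected Newton solver.

```python
import numpy as np

def project_hessian(H, mode="abs"):
    """Eigenvalue filtering of a symmetric Hessian block for projected Newton.
    mode="clamp": classic projection, negative eigenvalues pushed to (near) zero.
    mode="abs":   absolute-value filtering, as described in the abstract."""
    w, V = np.linalg.eigh(H)
    w = np.maximum(w, 1e-10) if mode == "clamp" else np.abs(w)  # the one-line change
    return (V * w) @ V.T   # reassemble V diag(w) V^T

H = np.array([[1.0, 2.0], [2.0, -3.0]])   # indefinite toy Hessian
H_clamp, H_abs = project_hessian(H, "clamp"), project_hessian(H, "abs")
```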

Efficient Position-Based Deformable Colon Modeling for Endoscopic Procedures Simulation

Current endoscopy simulators oversimplify navigation and interaction within tubular anatomical structures to maintain interactive frame rates, neglecting the intricate dynamics of permanent contact between the organ and the medical tool. Traditional algorithms fail to represent the complexities of long, slender, deformable tools like endoscopes and hollow organs, such as the human colon, and their interaction.

In this paper, we address longstanding challenges hindering the realism of surgery simulators, explicitly focusing on these structures. One of the main components we introduce is a new model for the overall shape of the organ, which is challenging to retain due to the complex surroundings inside the abdomen. Our approach uses eXtended Position-Based Dynamics (XPBD) with a Cosserat rod constraint combined with a mesh of tetrahedrons to retain the colon’s shape. We also introduce a novel contact detection algorithm for tubular structures, allowing for real-time performance. This comprehensive representation captures global deformations and local features, significantly enhancing simulation fidelity compared to previous works.
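
For readers unfamiliar with the solver family named above, the sketch below shows a single XPBD projection step for a simple distance constraint with compliance; the paper's Cosserat-rod constraint and tetrahedral coupling are not reproduced here, and the function name and arguments are illustrative only.

```python
import numpy as np

def xpbd_distance_project(p0, p1, w0, w1, rest, compliance, lam, dt):
    """One XPBD projection of a distance constraint C = |p1 - p0| - rest.
    Returns updated positions and the accumulated Lagrange multiplier."""
    d = p1 - p0
    length = np.linalg.norm(d)
    n = d / max(length, 1e-9)
    C = length - rest
    alpha = compliance / (dt * dt)                       # time-step-scaled compliance
    dlam = (-C - alpha * lam) / (w0 + w1 + alpha)        # XPBD multiplier update
    p0 = p0 - w0 * dlam * n                              # inverse-mass-weighted moves
    p1 = p1 + w1 * dlam * n
    return p0, p1, lam + dlam

p0, p1, lam = xpbd_distance_project(np.zeros(3), np.array([2.0, 0.0, 0.0]),
                                    1.0, 1.0, rest=1.0, compliance=1e-6,
                                    lam=0.0, dt=1.0 / 60.0)
```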

Results showcase that navigating the endoscope through our simulated colon seemingly mirrors real-world operations. Additionally, we use real-patient data to generate the colon model, resulting in a highly realistic virtual colonoscopy simulation. Integrating efficient simulation techniques with practical medical applications arguably advances surgery simulation realism.

SESSION: Generative 3D Geometry and Editing

GEM3D: GEnerative Medial Abstractions for 3D Shape Synthesis

We introduce GEM3D – a new deep, topology-aware generative model of 3D shapes. The key ingredient of our method is a neural skeleton-based representation encoding information on both shape topology and geometry. Through a denoising diffusion probabilistic model, our method first generates skeleton-based representations following the Medial Axis Transform (MAT), then generates surfaces through a skeleton-driven neural implicit formulation. The neural implicit takes into account the topological and geometric information stored in the generated skeleton representations to yield surfaces that are more topologically and geometrically accurate compared to previous neural field formulations. We discuss applications of our method in shape synthesis and point cloud reconstruction tasks, and evaluate our method both qualitatively and quantitatively. We demonstrate significantly more faithful surface reconstruction and diverse shape generation results compared to the state-of-the-art, also involving challenging scenarios of reconstructing and synthesizing structurally complex, high-genus shape surfaces from Thingi10K and ShapeNet.

SESSION: Rendering, Denoising & Path Guiding

Practical Error Estimation for Denoised Monte Carlo Image Synthesis

We present a practical global error estimation technique for Monte Carlo ray tracing combined with deep-learning-based denoising. Our method uses aggregated estimates of bias and variance to determine the squared error distribution of the pixels. Unlike unbiased estimates for classical Monte Carlo ray tracing, this distribution follows a noncentral chi-squared distribution, under reasonable assumptions. Based on this, we develop a stopping criterion for denoised Monte Carlo image synthesis that terminates rendering once a user-specified error threshold has been achieved. Our results demonstrate that our error estimate and stopping criterion work well on a variety of scenes, and that we are able to achieve a given error threshold without the user specifying the number of samples needed.
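
A stopping criterion of this kind plugs naturally into a progressive rendering loop. The sketch below shows such a loop in schematic form; render_pass and estimate_mse are hypothetical callables standing in for the renderer-plus-denoiser and the paper's aggregated bias/variance error estimate, respectively.

```python
def render_to_threshold(render_pass, estimate_mse, target_mse, max_log2_spp=12):
    """Double the sample count until the estimated error of the denoised image
    falls below the user-specified threshold. `render_pass` and `estimate_mse`
    are placeholders, not part of the paper's implementation."""
    for spp in (2 ** k for k in range(2, max_log2_spp + 1)):   # 4, 8, ..., 4096
        image = render_pass(spp)                               # accumulate and denoise
        if estimate_mse(image, spp) <= target_mse:
            break                                              # error target reached
    return image, spp
```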

SESSION: Perception, Image, Video

Versatile Vision Foundation Model for Image and Video Colorization

Image and video colorization are among the most common problems in image restoration. This is an ill-posed problem, and a wide variety of methods have been proposed, ranging from more traditional computer vision strategies to the most recent developments with transformer-based or generative neural network models. In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be finetuned for image colorization and provide a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results; user-guided colorization through color hints, text prompts, or a reference image; and finally video colorization. Some works have already investigated using diffusion models for colorization; however, the proposed solutions are often more complex and require training a side model to guide the denoising process (à la ControlNet). Not only does this approach increase the number of parameters and compute time, it also results in suboptimal colorization, as we show. Our evaluation demonstrates that our model is the only approach that offers wide flexibility while either matching or outperforming existing methods specialized in each sub-task, by proposing a group of universal, architecture-agnostic mechanisms which could be applied to any pre-trained diffusion model.

SESSION: Simulation with Contact

Primal-Dual Non-Smooth Friction for Rigid Body Animation

Current numerical algorithms for simulating friction fall into one of two camps: smooth solvers sacrifice the stable treatment of static friction in exchange for fast convergence, and non-smooth solvers accurately compute friction at convergence rates that are often prohibitive for large graphics applications. We introduce a novel bridge between these two ideas that computes static and dynamic friction stably and efficiently. Our key idea is to convert the highly constrained non-smooth problem into an unconstrained smooth problem using logarithmic barriers; the smoothed problem converges to the exact solution as accuracy increases. We phrase the problem as an interior-point primal-dual problem that can be solved efficiently with Newton iteration. We observe quadratic convergence despite the non-smooth nature of the original problem, and our method is well-suited for large systems of tightly packed objects with many contact points. We demonstrate the efficacy of our method with stable piles of grains and stacks of objects, complex granular flows, and robust interlocking assemblies of rigid bodies.
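
The logarithmic-barrier idea referenced above is the textbook interior-point construction: inequality constraints c(x) ≥ 0 are replaced by a smooth penalty that blows up at the feasibility boundary and fades as the barrier parameter grows. The snippet below is that textbook form only, not the paper's primal-dual friction formulation.

```python
import numpy as np

def log_barrier(c, t):
    """Textbook logarithmic barrier for constraints c(x) >= 0: smooth inside the
    feasible region, infinite outside; as t grows, the smoothed problem
    approaches the original constrained one."""
    c = np.asarray(c, dtype=float)
    if np.any(c <= 0.0):
        return np.inf                       # iterate left the feasible interior
    return -np.sum(np.log(c)) / t

print(log_barrier([0.5, 2.0], t=10.0), log_barrier([0.5, -0.1], t=10.0))
```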

Preconditioned Nonlinear Conjugate Gradient Method for Real-time Interior-point Hyperelasticity

The linear conjugate gradient method is widely used in physical simulation, particularly for solving large-scale linear systems derived from Newton’s method. The nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization and is extensively utilized in solving practical large-scale unconstrained optimization problems. However, it is rarely discussed in physical simulation due to the requirement of multiple vector-vector dot products. Fortunately, with the advancement of GPU-parallel acceleration techniques, this requirement is no longer a bottleneck. In this paper, we propose a Jacobi-preconditioned nonlinear conjugate gradient method for elastic deformation using interior-point methods. Our method is straightforward, GPU-parallelizable, and exhibits fast convergence and robustness against large time steps. The employment of the barrier function in interior-point methods necessitates continuous collision detection per iteration to obtain a penetration-free step size, which is computationally expensive and challenging to parallelize on GPUs. To address this issue, we introduce a line search strategy that deduces an appropriate step size in a single pass, eliminating the need for additional collision detection. Furthermore, we simplify and accelerate the computations of the Jacobi preconditioner and Hessian-vector products for hyperelasticity and the barrier function. Our method can accurately simulate objects comprising over 100,000 tetrahedra in complex self-collision scenarios at real-time speeds.
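
To make the overall solver shape concrete, the sketch below is a generic Jacobi-preconditioned nonlinear conjugate gradient loop in NumPy. It uses Armijo backtracking where the paper uses a single-pass, collision-aware step size, and a preconditioned Polak-Ribiere update; all names and the toy quadratic are assumptions for illustration.

```python
import numpy as np

def pncg(f, grad, hdiag, x0, iters=200, tol=1e-8):
    """Jacobi-preconditioned nonlinear conjugate gradient (generic sketch).
    f, grad, hdiag: objective, gradient, and Hessian diagonal (the preconditioner)."""
    x = x0.copy()
    g = grad(x)
    z = g / np.maximum(hdiag(x), 1e-12)        # preconditioned gradient
    d = -z
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        alpha, fx, slope = 1.0, f(x), g @ d
        for _ in range(30):                    # simple Armijo backtracking
            if f(x + alpha * d) <= fx + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        z_new = g_new / np.maximum(hdiag(x), 1e-12)
        beta = max(0.0, g_new @ (z_new - z) / (g @ z))   # preconditioned PR+
        d = -z_new + beta * d
        g, z = g_new, z_new
    return x

# Toy usage: minimize 0.5 x^T A x - b^T x with an ill-conditioned diagonal A.
A, b = np.diag([1.0, 10.0, 100.0]), np.ones(3)
x_opt = pncg(lambda x: 0.5 * x @ A @ x - b @ x,
             lambda x: A @ x - b,
             lambda x: np.diag(A),
             np.zeros(3))
```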

A Dynamic Duo of Finite Elements and Material Points

This paper presents a novel method to couple Finite Element Methods (FEM), typically employed for modeling Lagrangian solids such as flesh, cloth, hair, and rigid bodies, with Material Point Methods (MPM), which are well-suited for simulating materials undergoing substantial deformation and topology change, including Newtonian/non-Newtonian fluid, granular materials, and fracturing materials. The challenge of coupling these diverse methods arises from their contrasting computational needs: implicit FEM integration is often favored for its stability and large timesteps, while explicit MPM integration benefits from efficient GPU optimization and the flexibility to apply different plasticity models, but only allows for moderate timesteps. To bridge this gap, a mixed implicit-explicit time integration (IMEX) approach is proposed, utilizing principles from time splitting for partial differential equations and optimization-based time integrators. This method adopts incremental potential contact (IPC) to define a variational frictional contact model between the two materials, serving as the primary coupling mechanism. Our method enables implicit FEM and explicit MPM to coexist with significantly different timestep sizes while preserving two-way coupling. Experimental results demonstrate the potential of our method as a strong foundation for future exploration and enhancement in the field of multi-material simulation.

SESSION: Spatial Data Structures

Neural Bounding

Bounding volumes are an established concept in computer graphics and vision tasks but have seen little change since their early inception. In this work, we study the use of neural networks as bounding volumes. Our key observation is that bounding, which so far has primarily been considered a problem of computational geometry, can be redefined as a problem of learning to classify space into free or occupied. This learning-based approach is particularly advantageous in high-dimensional spaces, such as animated scenes with complex queries, where neural networks are known to excel. However, unlocking neural bounding requires a twist: allowing – but also limiting – false positives, while ensuring that the number of false negatives is strictly zero. We enable such tight and conservative results using a dynamically-weighted asymmetric loss function. Our results show that our neural bounding produces up to an order of magnitude fewer false positives than traditional methods. In addition, we propose an extension of our bounding method using early exits that accelerates query speeds by 25%. We also demonstrate that our approach is applicable to non-deep learning models that train within seconds. Our project page is at https://wenxin-liu.github.io/neural_bounding/.
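
The asymmetry described above, zero false negatives required while false positives are merely discouraged, can be sketched as a heavily weighted binary cross-entropy over sampled query points. The paper's dynamic weighting schedule is not reproduced; the fixed weight below is a placeholder.

```python
import numpy as np

def asymmetric_bounding_loss(pred, inside, w_fn=100.0):
    """Weighted binary cross-entropy over sampled query points: false negatives
    (occupied space classified as free) cost far more than false positives, so
    the learned bound errs on the conservative side. w_fn is a fixed stand-in
    for the paper's dynamic weighting."""
    eps = 1e-7
    p = np.clip(pred, eps, 1.0 - eps)
    fn_term = -inside * np.log(p)                 # misses of occupied space
    fp_term = -(1.0 - inside) * np.log(1.0 - p)   # over-coverage of free space
    return float(np.mean(w_fn * fn_term + fp_term))

loss = asymmetric_bounding_loss(np.array([0.2, 0.9]), np.array([1.0, 0.0]))
```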

N-BVH: Neural ray queries with bounding volume hierarchies

Neural representations have shown spectacular ability to compress complex signals in a fraction of the raw data size. In 3D computer graphics, the bulk of a scene’s memory usage is due to polygons and textures, making them ideal candidates for neural compression. Here, the main challenge lies in finding good trade-offs between efficient compression and cheap inference while minimizing training time. In the context of rendering, we adopt a ray-centric approach to this problem and devise N-BVH, a neural compression architecture designed to answer arbitrary ray queries in 3D. Our compact model is learned from the input geometry and substituted for it whenever a ray intersection is queried by a path-tracing engine. While prior neural compression methods have focused on point queries, ours proposes neural ray queries that integrate seamlessly into standard ray-tracing pipelines. At the core of our method, we employ an adaptive BVH-driven probing scheme to optimize the parameters of a multi-resolution hash grid, focusing its neural capacity on the sparse 3D occupancy swept by the original surfaces. As a result, our N-BVH can serve accurate ray queries from a representation that is more than an order of magnitude more compact, providing faithful approximations of visibility, depth, and appearance attributes. The flexibility of our method allows us to combine and overlap neural and non-neural entities within the same 3D scene and extends to appearance level of detail.

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation

The common trade-offs of state-of-the-art methods for multi-shape representation (a single model "packing" multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.

SESSION: Controllable Image Generation and Completion

Filter-Guided Diffusion for Controllable Image Generation

Recent advances in diffusion-based generative models have shown incredible promise for zero-shot image-to-image translation and editing. Most of these approaches work by combining or replacing network-specific features used in the generation of new images with those taken from the inversion of some guide image. Methods of this type are considered the current state-of-the-art in training-free approaches, but have some notable limitations: they tend to be costly in runtime and memory, and often depend on deterministic sampling that limits variation in generated results. We propose Filter-Guided Diffusion (FGD), an alternative approach that leverages fast filtering operations during the diffusion process to support finer control over the strength and frequencies of guidance and can work with non-deterministic samplers to produce greater variety. With its efficiency, FGD can be sampled over multiple seeds and hyperparameters in less time than a single run of other SOTA methods to produce superior results based on structural and semantic metrics. We conduct extensive quantitative and qualitative experiments to evaluate the performance of FGD in translation tasks and also demonstrate its potential in localized editing when used with masks.

LOOSECONTROL: Lifting ControlNet for Generalized Depth Conditioning

We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables new content creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying the target objects’ layout locations rather than the objects’ exact shape and appearance. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the image style. This yields minimal changes apart from changes induced by the edited boxes. (E2) Attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels. The project page can be found at https://shariqfarooq123.github.io/loose-control/.

Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models

Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for multi-object generation. In this work, we first show the fundamental reasons for such misalignment by identifying issues related to low attention activation and mask overlaps. Then we propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, is able to directly perform inference given an arbitrary multi-object prompt, which enhances the scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model with a diverse range of concepts enables it to generalize effectively to novel concepts, exhibiting enhanced performance compared to models trained on individual concept pairs.

Object-level Scene Deocclusion

Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from the partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotations of the amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining the deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the deocclusion applicability of PACO in single-view 3D scene reconstruction and object recomposition. Project page: https://liuzhengzhe.github.io/Deocclude-Any-Object.github.io/.

SESSION: Character Animation: 2D, 3D, Robot

Text-Guided Synthesis of Crowd Animation

Creating vivid crowd animations is core to immersive virtual environments in digital games. This work focuses on tackling the challenges of the crowd behavior generation problem. Existing approaches are labor-intensive, relying on practitioners to manually craft the complex behavior systems. We propose a machine learning approach to synthesize diversified dynamic crowd animation scenarios for a given environment based on a text description input. We first train two conditional diffusion models that generate text-guided agent distribution fields and velocity fields. Assisted by local navigation algorithms, the fields are then used to control multiple groups of agents. We further employ a Large Language Model (LLM) to canonicalize the general script into a structured sentence for more stable training and better scalability. To train our diffusion models, we devise a constructive method to generate random environments and crowd animations. We show that our trained diffusion models can generate crowd animations for both unseen environments and novel scenario descriptions. Our method paves the way towards the automatic generation of crowd behaviors for virtual environments. Code and data for this paper are available at: https://github.com/MLZG/Text-Crowd.git.

Soft Pneumatic Actuator Design using Differentiable Simulation

We propose a computational design pipeline for pneumatically-actuated soft robots interacting with their environment through contact. We optimize the shape of the robot with a shape optimization approach, using a physically-accurate high-order finite element model for the forward simulation. Our approach enables fine-grained control over both deformation and contact forces by optimizing the shape of internal cavities, which we exploit to design pneumatically-actuated robots that can assume user-prescribed poses, or apply user-controlled forces. We demonstrate the efficacy of our method on two artistic and two functional examples.

SMEAR: Stylized Motion Exaggeration with ARt-direction

Smear frames are routinely used by artists for the expressive depiction of motion in animations. In this paper, we present an automatic, yet art-directable method for the generation of smear frames in 3D, with a focus on elongated in-betweens where an object is stretched along its trajectory. It takes as input a key-framed animation of a 3D mesh, and outputs a deformed version of this mesh for each frame of the animation, while providing for artistic refinement at the end of the animation process and prior to rendering.

Our approach works in two steps. We first compute spatially and temporally coherent motion offsets that describe the extent to which parts of the input mesh should lead in front or trail behind. We then describe a framework to stylize these motion offsets in order to produce elongated in-betweens at interactive rates, which we extend to the other two common smear frame effects: multiple in-betweens and motion lines. Novice users may rely on preset stylization functions for fast and easy prototyping, while more complex custom-made stylization functions may be designed by experienced artists through our geometry node implementation in Blender.

SESSION: Geometry: Editing and Deformation

Semantic Shape Editing with Parametric Implicit Templates

We propose a semantic shape editing method to edit 3D triangle meshes using parametric implicit surface templates, benefiting from the many advantages offered by analytical implicit representations, such as infinite resolution and boolean or blending operations. We first propose a template fitting method that optimizes the template’s parameters to best capture the input mesh. For subsequent template edits, our novel mesh deformation method allows tracking the template’s 0-set even when the edit features anisotropic stretch and/or local volume change. We make few assumptions on the template implicit fields and only strictly require continuity. We demonstrate applications to interactive semantic shape editing and semantic mesh retargeting.

A Unified Differentiable Boolean Operator with Fuzzy Logic

This paper presents a unified differentiable boolean operator for implicit solid shape modeling using Constructive Solid Geometry (CSG). Traditional CSG relies on min and max operators to perform boolean operations on implicit shapes. Because these operators are discontinuous and the choice of operation is discrete, optimization over the CSG representation is challenging. Drawing inspiration from fuzzy logic, we present a unified boolean operator that outputs a continuous function and is differentiable with respect to operator types. This enables optimization of both the primitives and the boolean operations employed in CSG with continuous optimization techniques, such as gradient descent. We further demonstrate that such a continuous boolean operator allows the modeling of both sharp mechanical objects and smooth organic shapes with the same framework. Our proposed boolean operator opens up new possibilities for future research toward fully continuous CSG optimization.
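
A minimal sketch of the fuzzy-logic idea, under the assumption that shapes are represented as soft occupancies in [0, 1]: each boolean operation becomes a smooth algebraic expression, and a softmax over operator weights makes the choice of operation itself continuous. This illustrates the general idea, not the paper's exact operator.

```python
import numpy as np

def soft_bool(a, b, op_logits):
    """Differentiable boolean over two soft occupancy fields a, b in [0, 1].
    A softmax over op_logits blends fuzzy union, intersection, and difference,
    so the operator choice itself is continuous."""
    ops = np.stack([
        a + b - a * b,    # fuzzy union (probabilistic sum)
        a * b,            # fuzzy intersection (product t-norm)
        a * (1.0 - b),    # fuzzy difference a \ b
    ])
    w = np.exp(op_logits - np.max(op_logits))
    w = w / w.sum()                              # softmax over operator types
    return np.tensordot(w, ops, axes=1)

# Nearly a union: the first logit dominates the softmax.
out = soft_bool(np.array([0.9, 0.2]), np.array([0.1, 0.8]), np.array([5.0, 0.0, 0.0]))
```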

CNS-Edit: 3D Shape Editing via Coupled Neural Shape Optimization

This paper introduces a new approach based on a coupled representation and a neural volume optimization to implicitly perform 3D shape editing in latent space. This work has three innovations. First, we design the coupled neural shape (CNS) representation for supporting 3D shape editing. This representation includes a latent code, which captures high-level global semantics of the shape, and a 3D neural feature volume, which provides a spatial context to associate with the local shape changes given by the editing. Second, we formulate the coupled neural shape optimization procedure to co-optimize the two coupled components in the representation subject to the editing operation. Last, we offer various 3D shape editing operators, i.e., copy, resize, delete, and drag, and derive each into an objective for guiding the CNS optimization, such that we can iteratively co-optimize the latent code and neural feature volume to match the editing target. With our approach, we can achieve a rich variety of editing results that are not only aware of the shape semantics but also hard to achieve with existing approaches. Both quantitative and qualitative evaluations demonstrate the strong capabilities of our approach over the state-of-the-art solutions.

SESSION: Video Generation

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

We introduce Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image’s pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image features to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V allows users to precisely control motion trajectories and motion regions with sparse trajectory and region inputs. This offers more controllability of the I2V process than solely relying on textual instructions. Additionally, Motion-I2V’s second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models by either concatenating the image with noised video frames channel-wise before being fed into the model or injecting the image embedding produced by pretrained image encoders in cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, thus restricting the model’s compatibility within the open-source communities and disrupting the model’s prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter only introduces a few trainable parameters, significantly alleviating the training cost and ensuring compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack the focus on separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera’s pan and zoom movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model’s inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page and code are available at https://direct-a-video.github.io/.

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are carefully devised, taking into account the inherent properties of camera motion, object motion, and imperfect training data. Compared to previous methods, MotionCtrl offers three main advantages: 1) It effectively and independently controls camera motion and object motion, enabling more fine-grained motion control and facilitating flexible and diverse combinations of both types of motion. 2) Its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally impact the appearance or shape of objects in generated videos. 3) It is a relatively generalizable model that can adapt to a wide array of camera poses and trajectories once trained. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of MotionCtrl over existing methods. Project page: https://wzhouxiff.github.io/projects/MotionCtrl/ .

X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhances the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.

SESSION: Computational Cameras and Displays

Tele-Aloha: A Telepresence System with Low-budget and High-authenticity Using Sparse RGB Cameras

In this paper, we present a low-budget and high-authenticity bidirectional telepresence system, Tele-Aloha, targeting peer-to-peer communication scenarios. Compared to previous systems, Tele-Aloha utilizes only four sparse RGB cameras, one consumer-grade GPU, and one autostereoscopic screen to achieve high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150 ms), and robust distant communication. As the core of Tele-Aloha, we propose an efficient novel view synthesis algorithm for the upper body. Firstly, we design a cascaded disparity estimator for obtaining a robust geometry cue. Additionally, a neural rasterizer based on Gaussian Splatting is introduced to project latent features onto the target view and decode them at a reduced resolution. Further, given the high-quality captured data, we leverage a weighted blending mechanism to refine the decoded image to the final 2K resolution. Exploiting a world-leading autostereoscopic display and low-latency iris tracking, users are able to experience a strong three-dimensional sense even without any wearable head-mounted display device. Altogether, our telepresence system demonstrates the sense of co-presence in real-life experiments, inspiring the next generation of communication.

Aperture-Aware Lens Design

Optics designers use simulation tools to assist them in designing lenses for various applications. Commercial tools rely on finite differencing and sampling methods to perform gradient-based optimization of lens design objectives. Recently, differentiable rendering techniques have enabled more efficient gradient calculation of these objectives. However, these techniques are unable to optimize for light throughput, often an important metric for many applications.

We develop a method for calculating the gradients of optical systems with respect to both focus and light throughput. We formulate lens performance as an integral loss over a dynamic domain, which allows for the use of differentiable rendering techniques to calculate the required gradients. We also develop a ray tracer specifically designed for refractive lenses and derive formulas for calculating gradients that simultaneously optimize for focus and light throughput. Explicitly optimizing for light throughput produces lenses that outperform traditional optimized lenses that tend to prioritize for only focus. To evaluate our lens designs, we simulate various applications where our lenses:

Scale-Invariant Monocular Depth Estimation via SSI Depth

Existing methods for scale-invariant monocular depth estimation (SI MDE) often struggle due to the complexity of the task, and limited and non-diverse datasets, hindering generalizability in real-world scenarios. Meanwhile, shift-and-scale-invariant (SSI) depth estimation, which simplifies the task and enables training with abundant stereo datasets, achieves high performance. We present a novel approach that leverages SSI inputs to enhance SI depth estimation, streamlining the network’s role and facilitating in-the-wild generalization for SI depth estimation while only using a synthetic dataset for training. Emphasizing the generation of high-resolution details, we introduce a novel sparse ordinal loss that substantially improves detail generation in SSI MDE, addressing critical limitations in existing approaches. Through in-the-wild qualitative examples and zero-shot evaluation, we substantiate the practical utility of our approach in computational photography applications, showcasing its ability to generate highly detailed SI depth maps and achieve generalization in diverse scenarios.
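
The sparse ordinal loss named above is not specified in the abstract; the sketch below shows one common form of a sparse ordinal (ranking) loss over sampled pixel pairs, as an assumption-laden stand-in for the paper's formulation.

```python
import numpy as np

def sparse_ordinal_loss(pred, gt, pairs, margin=0.0):
    """Ranking loss over a sparse set of pixel pairs (i, j): penalize predictions
    whose depth ordering contradicts the ground-truth ordering, and penalize any
    difference where the ground truth says the pair has equal depth."""
    i, j = pairs[:, 0], pairs[:, 1]
    sign = np.sign(gt[i] - gt[j])                    # +1, -1, or 0 (equal depth)
    diff = pred[i] - pred[j]
    hinge = np.maximum(0.0, margin - sign * diff)    # wrong ordering -> positive loss
    return float(np.mean(np.where(sign == 0, np.abs(diff), hinge)))

loss = sparse_ordinal_loss(np.array([0.2, 0.5, 0.4]), np.array([0.1, 0.9, 0.9]),
                           np.array([[0, 1], [1, 2]]))
```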

Deep Hybrid Camera Deblurring for Smartphone Cameras

Mobile cameras, despite their significant advancements, still have difficulty in low-light imaging due to compact sensors and lenses, leading to longer exposures and motion blur. Traditional blind deconvolution methods and learning-based deblurring methods can be potential solutions to remove blur. However, achieving practical performance still remains a challenge. To address this, we propose a learning-based deblurring framework for smartphones, utilizing wide and ultra-wide cameras as a hybrid camera system. We simultaneously capture a long-exposure wide image and short-exposure burst ultra-wide images, and utilize the burst images to deblur the wide image. To fully exploit burst ultra-wide images, we present HCDeblur, a practical deblurring framework that includes novel deblurring networks, HC-DNet and HC-FNet. HC-DNet utilizes motion information extracted from burst images to deblur a wide image, and HC-FNet leverages burst images as reference images to further enhance a deblurred output. For training and evaluating the proposed method, we introduce the HCBlur dataset, which consists of synthetic and real-world datasets. Our experiments demonstrate that HCDeblur achieves state-of-the-art deblurring quality. Codes and datasets are available at https://cg.postech.ac.kr/research/HCDeblur.

Self-Supervised Video Defocus Deblurring with Atlas Learning

Misfocus is ubiquitous for almost all video producers, degrading video quality and often causing expensive delays and reshoots. Current autofocus (AF) systems are vulnerable to sudden disturbances such as subject movement or lighting changes commonly present in real-world and on-set conditions. Single image defocus deblurring methods are temporally unstable when applied to videos and cannot recover details obscured by temporally varying defocus blur. In this paper, we present an end-to-end solution that allows users to correct misfocus during post-processing. Our method generates and parameterizes defocused videos into sharp layered neural atlases and propagates consistent focus tracking back to the video frames. We introduce a novel differentiable disk blur layer for more accurate point spread function (PSF) simulation, coupled with a circle of confusion (COC) map estimation module with knowledge transferred from the current single image defocus deblurring (SIDD) networks. Our pipeline offers consistent, sharp video reconstruction and effective subject-focus correction and tracking directly on the generated atlases. Furthermore, by adopting our approach, we achieve comparable results to the state-of-the-art optical flow estimation approach from defocus videos.

SESSION: Character Control

SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation

Physically-simulated models for human motion can generate high-quality responsive character animations, often in real-time. Natural language serves as a flexible interface for controlling these models, allowing expert and non-expert users to quickly create and edit their animations. Many recent physics-based animation methods, including those that use text interfaces, train control policies using reinforcement learning (RL). However, scaling these methods beyond several hundred motions has remained challenging. Meanwhile, kinematic animation models are able to successfully learn from thousands of diverse motions by leveraging supervised learning methods. Inspired by these successes, in this work we introduce SuperPADL, a scalable framework for physics-based text-to-motion that leverages both RL and supervised learning to train controllers on thousands of diverse motion clips. SuperPADL is trained in stages using progressive distillation, starting with a large number of specialized experts using RL. These experts are then iteratively distilled into larger, more robust policies using a combination of reinforcement learning and supervised learning. Our final SuperPADL controller is trained on a dataset containing over 5000 skills and runs in real time on a consumer GPU. Moreover, our policy can naturally transition between skills, allowing for users to interactively craft multi-stage animations. We experimentally demonstrate that SuperPADL significantly outperforms RL-based baselines at this large data scale.

Strategy and Skill Learning for Physics-based Table Tennis Animation

Recent advancements in physics-based character animation leverage deep learning to generate agile and natural motion, enabling characters to execute movements such as backflips, boxing, and tennis. However, reproducing the selection and use of diverse motor skills in dynamic environments to solve complex tasks, as humans do, still remains a challenge. We present a strategy and skill learning approach for physics-based table tennis animation. Our method addresses the issue of mode collapse, where characters do not fully utilize the motor skills they need to execute complex tasks. More specifically, we demonstrate a hierarchical control system for diversified skill learning and a strategy learning framework for effective decision-making. We showcase the efficacy of our method through comparative analysis with state-of-the-art methods, demonstrating its capabilities in executing various skills for table tennis. Our strategy learning framework is validated through both agent-agent interaction and human-agent interaction in Virtual Reality, handling both competitive and cooperative tasks.

SESSION: Procedural Geometry

Recompose Grammars for Procedural Architecture

We present the novel grammar language Recomp for the procedural modeling of architecture. In grammar-based approaches, the procedural refinement process is based on shape subdivisions. This process of decomposition results in disconnected subparts, which not only restricts the geometric expressiveness but also limits the control over an appropriate shape granularity needed to coordinate design decisions. Recomp overcomes these limitations by extending grammar languages with the recomposition ability. Fundamental is the concept of rule inlining, allowing for the topological recomposition of edited subparts by collapsing a shape subtree into one single shape on which derivation can continue. This is complemented by a versatile geometry tagging system, allowing authors to compile and transport context information at any level of detail and gain full control over the geometry independent of the structure of the shape tree. Through various examples, we demonstrate the power of Recomp in procedural layout and mass modeling, as well as its capabilities in facilitating context-sensitive design.

SESSION: Radiance Field Processing

A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose

Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset.
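
The construction step, "using monocular depth and projecting pixels back into the 3D world," rests on standard pinhole back-projection. The helper below shows that generic step only (known intrinsics K assumed); the paper's progressive registration, correspondence detection, and expected-surface machinery are not shown.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to a point cloud in the camera frame with a pinhole model
    (intrinsics K assumed known). depth: (H, W); returns (H*W, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    rays = np.linalg.inv(K) @ pix                                       # 3 x N
    return (rays * depth.reshape(1, -1)).T                              # N x 3

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points = backproject(np.full((480, 640), 2.0), K)
```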

Rip-NeRF: Anti-aliasing Radiance Fields with Ripmap-Encoded Platonic Solids

Despite significant advancements in Neural Radiance Fields (NeRFs), the renderings may still suffer from aliasing and blurring artifacts, since it remains a fundamental challenge to effectively and efficiently characterize anisotropic areas induced by the cone-casting procedure. This paper introduces a Ripmap-Encoded Platonic Solid representation to precisely and efficiently featurize 3D anisotropic areas, achieving high-fidelity anti-aliased renderings. Central to our approach are two key components: Platonic Solid Projection and Ripmap encoding. The Platonic Solid Projection factorizes the 3D space onto the unparalleled faces of a certain Platonic solid, such that the anisotropic 3D areas can be projected onto planes with distinguishable characterization. Meanwhile, each face of the Platonic solid is encoded by the Ripmap encoding, which is constructed by anisotropically pre-filtering a learnable feature grid, to enable featurizing the projected anisotropic areas both precisely and efficiently via anisotropic area-sampling. Extensive experiments on both well-established synthetic datasets and a newly captured real-world dataset demonstrate that our Rip-NeRF attains state-of-the-art rendering quality, particularly excelling in the fine details of repetitive structures and textures, while maintaining relatively swift training times, as shown in Fig. 1. The source code and data for this paper are at https://github.com/JunchenLiu77/Rip-NeRF.

N-Dimensional Gaussians for Fitting of High Dimensional Functions

In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination, or time. In this paper, we tackle these challenges for an explicit representation based on Gaussian mixture models. With our solutions, we arrive at efficient fitting of compact N-dimensional Gaussian mixtures and enable efficient evaluation at render time: For fast fitting and evaluation, we introduce a high-dimensional culling scheme that efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For adaptive refinement yet compact representation, we introduce a loss-adaptive density control scheme that incrementally guides the use of additional capacity towards missing details. With these tools we can for the first time represent complex appearance that depends on many input dimensions beyond position or viewing angle within a compact, explicit representation optimized in minutes and rendered in milliseconds.
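
As a reference point for what the culling scheme accelerates, the sketch below evaluates a small N-dimensional Gaussian mixture by brute force at a set of query points. It is a baseline illustration only; the paper's LSH-inspired culling and loss-adaptive density control are not reproduced.

```python
import numpy as np

def eval_gaussian_mixture(x, means, covs, weights):
    """Brute-force evaluation of an N-D Gaussian mixture at query points.
    x: (Q, N), means: (K, N), covs: (K, N, N), weights: (K,)."""
    Q, N = x.shape
    out = np.zeros(Q)
    for mu, cov, w in zip(means, covs, weights):
        d = x - mu
        sol = np.linalg.solve(cov, d.T).T                     # cov^{-1} (x - mu)
        maha = np.einsum('qn,qn->q', d, sol)                  # Mahalanobis distances
        norm = np.sqrt((2.0 * np.pi) ** N * np.linalg.det(cov))
        out += w * np.exp(-0.5 * maha) / norm
    return out

x = np.random.rand(8, 6)                                       # 6-D queries
vals = eval_gaussian_mixture(x, np.zeros((2, 6)),
                             np.stack([np.eye(6), 0.5 * np.eye(6)]),
                             np.array([0.7, 0.3]))
```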

SESSION: Sound, Light, Radiofrequency

AONeuS: A Neural Rendering Framework for Acoustic-Optical Sensor Fusion

Underwater perception and 3D surface reconstruction are challenging problems with broad applications in construction, security, marine archaeology, and environmental monitoring. Treacherous operating conditions, fragile surroundings, and limited navigation control often dictate that submersibles restrict their range of motion and, thus, the baseline over which they can capture measurements. In the context of 3D scene reconstruction, it is well-known that smaller baselines make reconstruction more challenging. Our work develops a physics-based multimodal acoustic-optical neural surface reconstruction framework (AONeuS) capable of effectively integrating high-resolution RGB measurements with low-resolution depth-resolved imaging sonar measurements. By fusing these complementary modalities, our framework can reconstruct accurate high-resolution 3D surfaces from measurements captured over heavily-restricted baselines. Through extensive simulations and in-lab experiments, we demonstrate that AONeuS dramatically outperforms recent RGB-only and sonar-only inverse-differentiable-rendering–based surface reconstruction methods.

DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks

Accurately estimating and simulating the physical properties of objects from real-world sound recordings is of great practical importance in the fields of vision, graphics, and robotics. However, the progress in these directions has been limited—prior differentiable rigid or soft body simulation techniques cannot be directly applied to modal sound synthesis due to the high sampling rate of audio, while previous audio synthesizers often do not fully model the accurate physical properties of the sounding objects. We propose DiffSound, a differentiable sound rendering framework for physics-based modal sound synthesis, which is based on an implicit shape representation, a new high-order finite element analysis module, and a differentiable audio synthesizer. Our framework can solve a wide range of inverse problems thanks to the differentiability of the entire pipeline, including physical parameter estimation, geometric shape reasoning, and impact position prediction. Experimental results demonstrate the effectiveness of our approach, highlighting its ability to accurately reproduce the target sound in a physics-based manner. DiffSound serves as a valuable tool for various sound synthesis and analysis applications.

Accelerating Saccadic Response through Spatial and Temporal Cross-Modal Misalignments

Human senses and perception are our mechanisms to explore the external world. In this context, visual saccades –rapid and coordinated eye movements– serve as a primary tool for awareness of our surroundings. Typically, our perception is not limited to visual stimuli alone but is enriched by cross-modal interactions, such as the combination of sight and hearing. In this work, we investigate the temporal and spatial relationship of these interactions, focusing on how auditory cues that precede visual stimuli influence saccadic latency –the time that it takes for the eyes to react and start moving towards a visual target. Our research, conducted within a virtual reality environment, reveals that auditory cues preceding visual information can significantly accelerate saccadic responses, but this effect plateaus beyond certain temporal thresholds. Additionally, while the spatial positioning of visual stimuli influences the speed of these eye movements, as reported in previous research, we find that the location of auditory cues with respect to their corresponding visual stimulus does not have a comparable effect. To validate our findings, we implement two practical applications: first, a basketball training task set in a more realistic environment with complex audiovisual signals, and second, an interactive farm game that explores previously untested values of our key factors. Lastly, we discuss various potential applications where our model could be beneficial.

Radar Fields: Frequency-Space Neural Scene Representations for FMCW Radar

Neural fields have been broadly investigated as scene representations for the reproduction and novel generation of diverse outdoor scenes, including those autonomous vehicles and robots must handle. While successful approaches for RGB and LiDAR data exist, neural reconstruction methods for radar as a sensing modality have been largely unexplored. Operating at millimeter wavelengths, radar sensors are robust to scattering in fog and rain, and, as such, offer a complementary modality to active and passive optical sensing techniques. Moreover, existing radar sensors are highly cost-effective and deployed broadly in robots and vehicles that operate outdoors. We introduce Radar Fields, a neural scene reconstruction method designed for active radar imagers. Our approach unites an explicit, physics-informed sensor model with an implicit neural geometry and reflectance model to directly synthesize raw radar measurements and extract scene occupancy. The proposed method does not rely on volume rendering. Instead, we learn fields in Fourier frequency space, supervised with raw radar data. We validate our method’s effectiveness across diverse outdoor scenarios, including urban scenes with dense vehicles and infrastructure, and harsh weather scenarios, where mm-wavelength sensing is favorable.
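
As background on why frequency space is the natural measurement domain for FMCW radar (a standard relationship, not the paper's model): after dechirping, a reflector at range R appears as a beat frequency f_b = 2·S·R/c for chirp slope S, so an FFT of the received samples directly yields a range profile.

```python
# Standard FMCW background sketch with hypothetical sensor constants.
import numpy as np

C = 3e8      # speed of light [m/s]
S = 30e12    # assumed chirp slope [Hz/s]
FS = 20e6    # assumed ADC sampling rate [Hz]
N = 1024     # samples per chirp

def range_profile(dechirped_samples):
    """FFT of one dechirped chirp; bin k maps to range k * FS/N * C / (2*S)."""
    spectrum = np.fft.rfft(dechirped_samples * np.hanning(N))
    bin_ranges = np.arange(spectrum.size) * FS / N * C / (2.0 * S)
    return bin_ranges, np.abs(spectrum)

# Simulated single reflector at 15 m: beat frequency 2*S*R/C = 3 MHz.
t = np.arange(N) / FS
beat = np.cos(2.0 * np.pi * (2.0 * S * 15.0 / C) * t)
ranges, magnitude = range_profile(beat)
```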

SESSION: Art, Illusion, Fabrication

Diffusion Illusions: Hiding Images in Plain Sight

We explore the problem of computationally generating special images, which we call ‘prime’ images, that produce multi-arrangement optical illusions when physically arranged and viewed in certain ways. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these multi-arrangement illusions. Specifically, we both adapt the existing ‘score distillation loss’ and propose a new ‘dream target loss’ to optimize a group of differentiably parametrized prime images, using a frozen text-to-image diffusion model. We study three types of illusions, in each of which the prime images are arranged in different ways and optimized using the aforementioned losses such that images derived from them align with user-chosen text prompts or images. We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method qualitatively and quantitatively. Additionally, we showcase the successful physical fabrication of our illusions — as they are all designed to work in the real world. Code and examples are publicly available at our interactive project website: https://diffusionillusion.github.io/
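
A minimal sketch of what one score-distillation update on an arranged prime image might look like (assumptions: `arrange`, `unet`, `text_emb`, and `alphas_cumprod` are generic stand-ins, and the paper's adapted score distillation and dream target losses are not reproduced here):

```python
# One score-distillation step on a differentiably parametrized arrangement.
import torch

def sds_step(prime_images, arrange, unet, text_emb, alphas_cumprod, optimizer):
    x = arrange(prime_images)                        # differentiable arrangement of prime images
    t = torch.randint(20, 980, (1,), device=x.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * noise  # forward-diffuse the arranged image
    with torch.no_grad():
        eps_pred = unet(x_t, t, text_emb)            # frozen text-to-image denoiser
    # Score distillation: use (eps_pred - noise) as the gradient w.r.t. x,
    # skipping the U-Net Jacobian; (1 - a_t) is one common choice of weighting.
    grad = (1 - a_t) * (eps_pred - noise)
    loss = (grad.detach() * x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```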

Cross-Image Attention for Zero-Shot Appearance Transfer

Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images — one depicting the target structure and the other specifying the desired appearance — our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model’s internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
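
A minimal sketch of the query/key swap described above (an illustrative re-implementation with hypothetical projection matrices, not the authors' code):

```python
# Cross-image attention: queries from the structure image's features,
# keys and values from the appearance image's features.
import torch
import torch.nn.functional as F

def cross_image_attention(feat_struct, feat_appear, W_q, W_k, W_v):
    """feat_*: (tokens, dim) latent features of the two images at one attention layer."""
    q = feat_struct @ W_q            # queries: where each structure token looks
    k = feat_appear @ W_k            # keys:   appearance tokens to match against
    v = feat_appear @ W_v            # values: appearance to pull in
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                  # structure tokens re-rendered with transferred appearance
```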

Generative Escher Meshes

This paper proposes a fully automatic, text-guided generative method for producing perfectly repeating, periodic, tileable 2D imagery, such as those seen on floors, mosaics, ceramics, and the work of M.C. Escher. In contrast to square texture images that are seamless when tiled, our method generates non-square tilings that consist solely of repeating copies of the same object. It achieves this by optimizing both geometry and texture of a 2D mesh, yielding a non-square tile in the shape and appearance of the desired object, with close to no additional background details, that can tile the plane without gaps or overlaps. We enable optimization of the tile’s shape by an unconstrained, differentiable parameterization of the space of all valid tileable meshes for given boundary conditions stemming from a symmetry group. Namely, we construct a differentiable family of linear systems derived from a 2D mesh-mapping technique, Orbifold Tutte Embedding, by considering the mesh’s Laplacian matrix as differentiable parameters. We prove that the solution space of these linear systems is exactly all possible valid tiling configurations, thereby providing an end-to-end differentiable representation for the entire space of valid tiles. We render the textured mesh via a differentiable renderer, and leverage a pre-trained image diffusion model to induce a loss on the resulting image, updating the mesh’s parameters so as to make its appearance match the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns.
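
A heavily simplified sketch of the "linear system with learnable Laplacian weights" idea (assuming a plain Tutte-style layout with fixed boundary vertices rather than the full orbifold construction; differentiating through the linear solve lets an image loss update the tile's geometry):

```python
# Simplified Tutte-style layout with learnable edge weights; not the paper's
# orbifold system, just the differentiable-linear-solve pattern it relies on.
import torch

def tutte_layout(edge_weights, edges, n_verts, fixed_idx, fixed_pos):
    """Solve the Laplacian system with fixed vertices; edge_weights are learnable.

    edges: list of (i, j) vertex index pairs; fixed_pos: (len(fixed_idx), 2) float tensor.
    """
    L = torch.zeros(n_verts, n_verts)
    for (i, j), w in zip(edges, edge_weights):       # assemble the weighted graph Laplacian
        L[i, j] -= w; L[j, i] -= w
        L[i, i] += w; L[j, j] += w
    free = [v for v in range(n_verts) if v not in set(fixed_idx)]
    A = L[free][:, free]
    b = -L[free][:, fixed_idx] @ fixed_pos
    x_free = torch.linalg.solve(A, b)                # differentiable w.r.t. edge_weights
    pos = torch.zeros(n_verts, 2)
    pos[fixed_idx] = fixed_pos
    pos[free] = x_free
    return pos                                       # 2D vertex positions of the tile
```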

Fabricable 3D Wire Art

This paper presents a computational method for automatically creating fabricable 3D wire sculptures from various input modalities, including 3D models, images, and even text. There are several challenges to wire art creation. For example, artists must express the desired visual as a sparse wire representation. It is also difficult to manually bend wires in the air without guidance to fabricate the designed 3D curves. Our workflow solves these challenges by using two core techniques. First, we present an algorithm that automatically generates a fabricable 3D curve representation of the target based on a loss function that measures the semantic distance between the rendered curve and the target. The loss function can be defined using different pre-trained vision-language neural networks to generate wire art from different input types. The loss function is then optimized using differentiable rendering specifically targeting 3D parametric curves. Our method can incorporate various fabrication constraints on the wire as additional regularization terms in the optimization process. Second, we present an algorithm to generate a 3D printable jig structure that can be used to fabricate the generated wire path. The major challenge in the jig generation stems from the design of an intersection-free surface mesh for 3D printing, which we address with our inflation algorithm. The experimental results indicate that our method can handle a wider range of input types and can produce physically fabricable wire shapes compared to previous wire generation methods. Various wire arts have been fabricated using our 3D-printed jig to demonstrate its effectiveness in 3D wire bending.
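
A rough sketch of the curve-optimization loop described above (hedged: `render_curve`, `clip_image_embed`, `clip_text_embed`, and `curvature_penalty` are hypothetical stand-ins for a differentiable curve rasterizer, a pre-trained vision-language encoder, and a fabrication regularizer; they are not the paper's actual components):

```python
# Optimize the control points of a 3D wire so its rendering matches a text prompt,
# with a fabrication constraint added as a regularization term.
import torch

def optimize_wire(ctrl_pts, target_text, render_curve, clip_image_embed,
                  clip_text_embed, curvature_penalty, steps=500, lr=1e-2):
    ctrl_pts = ctrl_pts.clone().requires_grad_(True)     # 3D control points of the wire path
    text_feat = clip_text_embed(target_text).detach()    # frozen text embedding
    opt = torch.optim.Adam([ctrl_pts], lr=lr)
    for _ in range(steps):
        img = render_curve(ctrl_pts)                     # differentiable rendering of the 3D curve
        img_feat = clip_image_embed(img)
        semantic = 1.0 - torch.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        fab = curvature_penalty(ctrl_pts)                # e.g. penalize bends too tight to fabricate
        loss = semantic + 0.1 * fab
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ctrl_pts.detach()
```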