CVMP '24: Proceedings of the 21st ACM SIGGRAPH European Conference on Visual Media Production


SESSION: Session 1

High-Quality Facial Geometry from Sparse Heterogeneous Devices under Active Illumination

High-resolution facial geometry is essential for realistic digital avatars. Traditional reconstruction methods, such as multi-view stereo, often struggle with materials like skin, which exhibit complex light reflection, absorption, and scattering properties. Neural reconstruction methods have shown greater robustness to these view-dependent effects. However, positional-encoding-based implementations are typically slow, while faster hash-encoded methods may falter under sparse camera views. We present a geometry reconstruction method tailored for an active-illumination facial capture setup featuring sparse cameras with varying characteristics. Our technique builds upon hash-encoded neural surface reconstruction, which we enhance with additional active-illumination-based supervision and loss functions, allowing us to maintain high reconstruction speed and geometric fidelity even with reduced camera coverage. We validate our approach through qualitative evaluations across diverse subjects and a quantitative evaluation using a synthetic dataset rendered with a virtual reproduction of our capture setup. Our results demonstrate that our method significantly outperforms previous neural reconstruction techniques on datasets with sparse camera configurations.
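The abstract does not spell out the added loss terms. The minimal PyTorch sketch below illustrates how active-illumination supervision might augment a standard neural-SDF objective; the photometric-normal term and all weights are assumptions, not the paper's actual losses.

```python
import torch
import torch.nn.functional as F

def sdf_reconstruction_loss(pred_rgb, gt_rgb, sdf_gradients,
                            pred_normals=None, photometric_normals=None,
                            lambda_eik=0.1, lambda_normal=0.05):
    """Illustrative composite loss for hash-encoded neural surface reconstruction.

    pred_rgb / gt_rgb: (N, 3) rendered vs. captured pixel colors.
    sdf_gradients:     (M, 3) SDF gradients at sampled points (eikonal term).
    pred_normals / photometric_normals: optional (K, 3) normals rendered from
        the SDF vs. normals estimated from active-illumination frames
        (a hypothetical supervision signal; the paper's losses may differ).
    """
    # Standard photometric term used by NeuS-style volume rendering.
    loss = F.l1_loss(pred_rgb, gt_rgb)

    # Eikonal regularization keeps the field a valid signed distance function.
    loss = loss + lambda_eik * ((sdf_gradients.norm(dim=-1) - 1.0) ** 2).mean()

    # Extra supervision from active illumination: penalize disagreement between
    # rendered normals and photometrically estimated normals.
    if pred_normals is not None and photometric_normals is not None:
        cos = F.cosine_similarity(pred_normals, photometric_normals, dim=-1)
        loss = loss + lambda_normal * (1.0 - cos).mean()

    return loss
```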

Enhanced Illumination Adjustment in 3D Outdoor Reconstructions via Shadow Removal through Color Transfer

The introduction of 3D reconstruction technology has revolutionized the digitization of real-world objects, from small artifacts to large-scale structures like buildings. This technology enables the rapid creation of virtual representations of the real world with minimal design expertise, offering a novel way to experience reality. However, it presents challenges such as baked-in illumination, which complicates subsequent relighting and integration into digital environments. This paper introduces a shadow removal algorithm, SRCT, that uses simulated lighting and color transfer techniques to reduce the visible effects of self- and cast shadows in the texture maps of 3D models resulting from the 3D reconstruction process. The effectiveness of this approach is validated through a comparison with existing shadow removal techniques. This validation utilizes a newly introduced dataset, EDEN, which comprises 3D reconstructions of buildings derived from drone imagery for qualitative evaluation, along with an additional dataset, Sunlit3D, featuring 3D reconstructions of buildings under various simulated lighting conditions for quantitative analysis.
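The core operation, transferring color statistics from lit texels onto shadowed texels, could be sketched as a Reinhard-style transfer in Lab space. The mask source and the statistics-matching form below are assumptions rather than the exact SRCT algorithm.

```python
import cv2
import numpy as np

def shadow_color_transfer(texture_bgr, shadow_mask):
    """Illustrative Reinhard-style color transfer from lit to shadowed texels.

    texture_bgr: HxWx3 uint8 texture map from the 3D reconstruction.
    shadow_mask: HxW bool array, True where a texel is considered shadowed
                 (assumed given; SRCT derives it from simulated lighting on
                 the reconstructed geometry).
    """
    lab = cv2.cvtColor(texture_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    lit, shadowed = lab[~shadow_mask], lab[shadow_mask]

    # Match per-channel mean and standard deviation of shadowed texels
    # to those of lit texels.
    mu_s, std_s = shadowed.mean(axis=0), shadowed.std(axis=0) + 1e-6
    mu_l, std_l = lit.mean(axis=0), lit.std(axis=0)
    lab[shadow_mask] = (shadowed - mu_s) / std_s * std_l + mu_l

    lab = np.clip(lab, 0, 255).astype(np.uint8)
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```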

RegSegField: Mask-Regularization and Hierarchical Segmentation for Novel View Synthesis from Sparse Inputs

Radiance Field (RF) representations and their latest variant, 3D-Gaussian Splatting (3D-GS), have revolutionized the field of 3D vision. Novel View Synthesis (NVS) from RFs typically requires dense inputs, and for 3D-GS in particular, a high-quality point cloud from a multi-view stereo model is usually necessary. Sparse-input RFs are commonly regularized by various priors, such as smoothness, depth, and appearance. Meanwhile, 3D scene segmentation has also achieved significant results with the aid of RFs, and combining radiance fields with semantic and physical attributes has become a trend. To further tackle NVS and 3D segmentation problems under sparse-input conditions, we introduce RegSegField, a novel pipeline that utilizes 2D segmentations to aid the reconstruction of objects and parts. This method introduces a novel mask-visibility loss by matching 2D segments across different views, thus defining the 3D regions for different objects. To further optimize the correspondence of 2D segments, we introduce a hierarchical feature field supervised by a contrastive learning method, allowing iterative updates of matched mask areas. To resolve inconsistent segmentation across different views and refine the mask matching with the help of RF geometry, we also employ a multi-level hierarchy loss. With the help of the hierarchy loss, our method facilitates scene segmentation at discrete granularity levels, whereas other methods require sampling at different scales or determining similarity thresholds. Our experiments show that our regularization approach outperforms various depth-guided NeRF methods and even enables sparse reconstruction of 3D-GS with random initialization.
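As a minimal sketch of how a feature field could be supervised contrastively by per-view 2D segments, the code below assumes an InfoNCE-style objective over rendered pixel features; the actual RegSegField losses, cross-view matching, and hierarchy levels are not reproduced.

```python
import torch
import torch.nn.functional as F

def mask_contrastive_loss(pixel_features, mask_ids, temperature=0.1):
    """Illustrative contrastive supervision of a feature field by 2D segments.

    pixel_features: (N, D) features rendered from the field at sampled pixels.
    mask_ids:       (N,) integer id of the 2D segment each pixel falls in
                    (ids are per-view; the mask-visibility and multi-level
                    hierarchy losses from the paper are not shown).
    """
    feats = F.normalize(pixel_features, dim=-1)
    sim = feats @ feats.t() / temperature            # (N, N) cosine similarities
    same = mask_ids.unsqueeze(0) == mask_ids.unsqueeze(1)
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)

    # For each pixel, treat pixels from the same 2D segment as positives
    # and all other pixels as negatives (InfoNCE-style objective).
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')),
                                     dim=1, keepdim=True)
    pos = same & ~eye
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```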

SESSION: Session 2

Optimal OLAT Alignment for Image Based Relighting with Color-Multiplexed OLAT Sequence

We present two color-multiplexed illumination sequences for optimally aligned one-light-at-a-time (OLAT) captures. We leverage color-multiplexing strategies to embed tracking frames within the OLAT photographs to correct for subject motion. Our method allows better motion estimation via optical flow than traditional methods, which interleave tracking frames between OLATs. Comparisons of rendered results and a user study on subject comfort both demonstrate that color-multiplexed sequences give better-aligned OLATs and are more comfortable for the subject during data capture. Our proposed sequences can replace traditional OLAT sequences for better data acquisition, benefiting both light-stage rendering results and any state-of-the-art relighting methods trained on OLAT-generated data.
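A minimal OpenCV sketch of the flow-based alignment step follows, assuming tracking frames bracketing each OLAT are available as separate images (with color multiplexing these would instead be recovered from the OLAT photographs themselves).

```python
import cv2
import numpy as np

def align_olat(olat, track_before, track_after, t=0.5):
    """Illustrative flow-based alignment of a single OLAT frame.

    olat:                      HxWx3 OLAT photograph to align.
    track_before, track_after: HxW grayscale tracking frames bracketing it.
    t: relative capture time of the OLAT between the two tracking frames.
    """
    # Dense optical flow between the two tracking frames.
    flow = cv2.calcOpticalFlowFarneback(track_before, track_after, None,
                                        0.5, 3, 31, 3, 5, 1.2, 0)
    h, w = track_before.shape
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))

    # Scale the flow to the OLAT's capture time and warp the OLAT back to the
    # reference (track_before) pose.
    map_x = (xx + t * flow[..., 0]).astype(np.float32)
    map_y = (yy + t * flow[..., 1]).astype(np.float32)
    return cv2.remap(olat, map_x, map_y, cv2.INTER_LINEAR)
```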

Image-Based Material Editing Using Perceptual Attributes or Ground-Truth Parameters

Image-based material editing neural networks have been trained on perceptual attributes because such attributes are human-friendly, whereas training such networks on non-perceptual material parameters has been comparatively neglected. Collecting perceptual experiment data has so far been accepted as a necessary additional effort, yet it would be much easier to generate a dataset with ground-truth material parameter attributes instead. Ground-truth parameters also avoid the noise that is inherent in perceptual experiment data. We show that existing neural networks can be trained on datasets with ground-truth material parameters, and that they generate material edits of similar quality that stay as close to the valid gamut of the trained material model as neural networks trained on perceptual material attributes. We expect these results to encourage further study of the qualitative and quantitative differences between image-based material editing networks trained on material parameters and those trained on perceptual attributes.
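Generating ground-truth parameter labels amounts to sampling the material model directly instead of running a perceptual experiment. A short sketch, assuming a simple metallic-roughness parameterization that may differ from the material model used in the paper:

```python
import numpy as np

def sample_material_parameters(n, rng=None):
    """Sample ground-truth parameters of a simple metallic-roughness material.

    The parameter set (base color, roughness, metallic, specular) is an
    assumption; the paper's material model may expose different parameters.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    return {
        "base_color": rng.uniform(0.0, 1.0, size=(n, 3)),
        "roughness":  rng.uniform(0.05, 1.0, size=(n, 1)),
        "metallic":   rng.uniform(0.0, 1.0, size=(n, 1)),
        "specular":   rng.uniform(0.0, 1.0, size=(n, 1)),
    }

# A training pair is then (render(params), params): the rendered image is the
# network input and the sampled parameters are noise-free regression targets,
# with no perceptual experiment in the loop.  `render` stands in for any
# physically based renderer and is not provided here.
```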

Low-light Video Enhancement with Conditional Diffusion Models and Wavelet Interscale Attentions

Videos captured in low-light conditions often suffer from various distortions, such as noise, low contrast, color imbalance, and blur. Consequently, a post-processing workflow is necessary but typically time-consuming. Developing AI-based tools for videos also requires significantly more computational resources compared to those for images. This paper introduces a novel framework aimed at reducing memory usage and computational time by enhancing videos in the wavelet domain. The framework utilizes conditional diffusion models to enhance brightness and adjust colors in the low-pass subbands while employing interscale-attention mechanisms to enhance sharpness in the high-pass subbands. To ensure temporal consistency, we integrate feature alignment and fusion into the denoiser of the diffusion models. Additionally, we introduce adaptive brightness adjustment as a preprocessing module to reduce the workload of the learnable networks. Experimental results demonstrate that our proposed methods outperform existing low-light video enhancement techniques with competitive inference times compared to image-based methods.
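A minimal sketch of the wavelet-domain split using PyWavelets, with placeholder callables standing in for the conditional diffusion model (low-pass subband) and the interscale-attention module (high-pass subbands); temporal alignment and the adaptive brightness preprocessing from the paper are not shown.

```python
import pywt

def enhance_frame(frame, enhance_lowpass, enhance_highpass, wavelet="haar"):
    """Illustrative wavelet-domain enhancement of one low-light frame.

    frame:            HxW float array (one channel of a video frame).
    enhance_lowpass:  callable standing in for the conditional diffusion model
                      that brightens / color-corrects the low-pass subband.
    enhance_highpass: callable standing in for the interscale-attention module
                      that sharpens the high-pass subbands.
    Both callables are placeholders, not the paper's networks.
    """
    # One level of a 2D discrete wavelet transform: LL carries brightness and
    # color, (LH, HL, HH) carry edge / detail information.
    ll, (lh, hl, hh) = pywt.dwt2(frame, wavelet)

    ll = enhance_lowpass(ll)
    lh, hl, hh = (enhance_highpass(b) for b in (lh, hl, hh))

    return pywt.idwt2((ll, (lh, hl, hh)), wavelet)
```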

SESSION: Session 3

Multi-Resolution Generative Modeling of Human Motion from Limited Data

We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model’s ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
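The multi-resolution setup implies a temporal pyramid of pose sequences. A short PyTorch sketch of building such a pyramid from SMPL pose parameters follows; the pooling scheme and the per-level generator/discriminator pairing are assumptions about the architecture, not details given in the abstract.

```python
import torch

def build_temporal_pyramid(poses, levels=3):
    """Build multi-resolution versions of an SMPL pose-parameter sequence.

    poses: (T, D) tensor of per-frame SMPL pose parameters.
    Each level halves the frame rate; one generator/discriminator pair
    (an assumed pairing) would operate at each level, conditioned on the
    coarser level below it.  Naively averaging rotation parameters is a
    simplification kept only for illustration.
    """
    pyramid = [poses]
    for _ in range(levels - 1):
        # Average-pool along time to halve the frame rate.
        coarse = torch.nn.functional.avg_pool1d(
            pyramid[-1].t().unsqueeze(0), kernel_size=2, stride=2
        ).squeeze(0).t()
        pyramid.append(coarse)
    return pyramid  # finest to coarsest
```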

PDFed: Privacy-Preserving and Decentralized Asynchronous Federated Learning for Diffusion Models

We present PDFed, a decentralized, aggregator-free, and asynchronous federated learning protocol for training image diffusion models using a public blockchain. In general, diffusion models are prone to memorization of training data, raising privacy and ethical concerns (e.g., regurgitation of private training data in generated images). Federated learning (FL) offers a partial solution via collaborative model training across distributed nodes that safeguard local data privacy. PDFed proposes a novel sample-based score that measures the novelty and quality of generated samples, incorporating these into a blockchain-based federated learning protocol that we show reduces private data memorization in the collaboratively trained model. In addition, PDFed enables asynchronous collaboration among participants with varying hardware capabilities, facilitating broader participation. The protocol records the provenance of AI models, improving transparency and auditability, while also considering automated incentive and reward mechanisms for participants. PDFed aims to empower artists and creators by protecting the privacy of creative works and enabling decentralized, peer-to-peer collaboration. The protocol positively impacts the creative economy by opening up novel revenue streams and fostering innovative ways for artists to benefit from their contributions to the AI space.
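PDFed's sample-based score is not specified in the abstract; one illustrative way to combine novelty and quality, assuming an embedding model and a per-sample quality estimate, is sketched below.

```python
import torch
import torch.nn.functional as F

def sample_score(gen_embeddings, train_embeddings, quality, alpha=0.5):
    """Illustrative sample-based score combining novelty and quality.

    gen_embeddings:   (N, D) feature embeddings of generated samples.
    train_embeddings: (M, D) embeddings of the node's local training data.
    quality:          (N,) per-sample quality estimate (e.g. from a
                      no-reference IQA model; the metric is an assumption).
    The actual scoring function used by PDFed may differ; this sketch simply
    rewards samples that are far from any training example (low memorization)
    and of high quality.
    """
    gen = F.normalize(gen_embeddings, dim=-1)
    train = F.normalize(train_embeddings, dim=-1)

    # Novelty: one minus the cosine similarity to the nearest training sample.
    novelty = 1.0 - (gen @ train.t()).max(dim=1).values

    return alpha * novelty + (1.0 - alpha) * quality
```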

Interacting from Afar: A Study of the Relationship Between Usability and Presence in Object Selection at a Distance

Virtual Reality (VR) permits people to bend the rules of physics that exist in the real world, allowing for unique “magic” interactions. For example, if a user wants to interact with a faraway object, they can summon it to themselves in the Virtual Environment (VE) rather than physically walking to it as they would outside of a headset. There is no established framework for how to facilitate this type of user interaction for the optimal user experience. We wanted to understand how people intuitively interact with objects at a distance, hypothesizing that the more usable someone found a system, the higher the level of presence they would experience. We present a study that explores the relationship between presence and usability through questionnaires. We identify a positive correlation between presence and usability and propose a system that allows free movement to encourage presence in VEs.