VRCAI '24: The 19th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry


SESSION: Session 1

PM4Furniture: A Scriptable Parametric Modeling Interface for Conceptual Furniture Design Using PM4VR

In the field of furniture design, Virtual Reality (VR) has shown potential in enabling immersive prototyping and visualization. However, current VR design tools are often limited by a lack of parametrization, making it challenging for designers to iterate on complex furniture forms quickly. This paper presents PM4Furniture, a scriptable parametric modeling interface tailored for VR-based conceptual furniture design. The proposed system leverages the scripting interface of the PM4VR framework, which has a low learning curve and allows designers to adjust the parameters of 3D furniture models in real time. By integrating a VR environment, PM4Furniture enhances user interaction and supports intuitive adjustment of design parameters with immediate visual feedback. We evaluate this novel interface through a preliminary user study in which designers worked on furniture designs in an immersive VR environment; the results indicate that PM4Furniture improves design efficiency and supports creativity in VR furniture prototyping.
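
As a rough illustration of what a scriptable parametric model looks like in practice, the sketch below generates a simple table from a handful of parameters; the function and parameter names are hypothetical and do not reflect PM4VR’s actual scripting API.

```python
# Minimal sketch of the parametric-modeling idea; the API is hypothetical,
# not PM4VR's. Parameters drive the geometry, so one change regenerates the model.
from dataclasses import dataclass

@dataclass
class TableParams:
    top_width: float = 1.2      # metres
    top_depth: float = 0.6
    top_thickness: float = 0.04
    leg_height: float = 0.72
    leg_thickness: float = 0.05

def table_boxes(p: TableParams):
    """Return axis-aligned boxes (center, size) describing a simple table."""
    boxes = []
    # Table top sits on the legs.
    top_center = (0.0, p.leg_height + p.top_thickness / 2, 0.0)
    boxes.append((top_center, (p.top_width, p.top_thickness, p.top_depth)))
    # Four legs at the corners, inset by half a leg thickness.
    dx = p.top_width / 2 - p.leg_thickness / 2
    dz = p.top_depth / 2 - p.leg_thickness / 2
    for sx in (-1, 1):
        for sz in (-1, 1):
            center = (sx * dx, p.leg_height / 2, sz * dz)
            boxes.append((center, (p.leg_thickness, p.leg_height, p.leg_thickness)))
    return boxes

# Changing a single parameter regenerates the whole model, which is the kind of
# real-time iteration a parametric VR interface targets.
print(len(table_boxes(TableParams(leg_height=0.9))))  # 5 boxes: top + 4 legs
```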

VasNetMR: Interactive Exploration of Vascular Network Blood Flow Distribution with Mixed Reality

The vascular system is a complex network inside the human body, and local vascular disease may affect other parts of the body through the circulation. Because the three-dimensional structure of the vascular network is complex, it is challenging to visualize the entire system intuitively. Furthermore, showing only the geometry of the vessels is not an effective way to convey the function of the vascular network. We propose a mixed reality system, “VasNetMR”, that presents the arterial network for educational purposes. Users can interactively explore both the geometry and the blood flow distribution of the vessel network. The key challenge is to update the flow rate and blood pressure throughout the whole body quickly and accurately. We adapt a one-dimensional model to the network with proper boundary conditions and study how the simplified model performs on a large arterial network. Finally, we develop VasNetMR on top of this computational tool and conduct experiments and user studies. We compare the results of the interactive simulation with those of expensive simulations. At the sample locations, the average errors of blood pressure and flow rate are below 5.7 mmHg and 1.9 cm³/s, and the computational time is under 3 seconds. With a fast simulation model and a mixed reality interface, we enable interactive presentation of blood flow distribution in a whole-body vascular network. These features may benefit applications in medical education and communication.
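
The abstract does not detail the one-dimensional model; as a hedged illustration of how pressures and flows across a small network can be updated quickly, the sketch below solves a steady-state Poiseuille resistance-network analogue (segment radii, lengths, and boundary pressures are placeholder values, not the authors’ model or data).

```python
# Hedged sketch: a steady-state Poiseuille resistance-network analogue of
# vascular flow, NOT the authors' one-dimensional model. It only illustrates
# how pressures and flow rates in a small arterial tree can be updated by
# solving a linear system.
import numpy as np

# Vessel segments as (node_from, node_to, radius_m, length_m); values are illustrative.
segments = [(0, 1, 4e-3, 0.10), (1, 2, 3e-3, 0.08), (1, 3, 3e-3, 0.08)]
mu = 3.5e-3  # blood viscosity, Pa*s

def conductance(radius, length):
    # Poiseuille's law: Q = (pi * r^4 / (8 * mu * L)) * dP
    return np.pi * radius**4 / (8.0 * mu * length)

n_nodes = 4
A = np.zeros((n_nodes, n_nodes))
b = np.zeros(n_nodes)
for i, j, r, L in segments:          # assemble flow conservation at each node
    g = conductance(r, L)
    A[i, i] += g; A[j, j] += g
    A[i, j] -= g; A[j, i] -= g

# Boundary conditions: fixed inlet pressure at node 0, fixed outlet pressures at leaves.
fixed = {0: 12_000.0, 2: 9_000.0, 3: 9_500.0}  # Pa, illustrative
for node, p in fixed.items():
    A[node, :] = 0.0
    A[node, node] = 1.0
    b[node] = p

pressures = np.linalg.solve(A, b)                      # Pa at each node
flows = {(i, j): conductance(r, L) * (pressures[i] - pressures[j])
         for i, j, r, L in segments}                   # m^3/s per segment
print(pressures, flows)
```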

DKGV: A Dynamic Knowledge Graph Visualization Method Based on Force-Directed Layout

The dynamic knowledge graph is a data structure that adds temporal information to the nodes and edges of a traditional knowledge graph. It describes the changing processes of entities and relationships over time, thereby enriching and updating the expression of knowledge. The dynamic knowledge graph possesses characteristics such as time and labels; it is not only a knowledge graph but also a type of dynamic graph containing heterogeneous information.

To address the issue that traditional dynamic graph visualization methods do not account for heterogeneous information in the layout, which leads to a relatively random distribution of node labels in the visualization results, this paper proposes DKGV, a dynamic knowledge graph visualization method based on a force-directed layout. DKGV uses an improved GCN to generate the initial layout of the dynamic knowledge graph data, refines it with iterative static force-directed layout, and then performs dynamic force-directed layout calculations on the dynamic knowledge graph. Finally, it employs a force-directed layout with boundaries to support three-dimensional temporal display after view transformation.

Experimental results show that DKGV can maintain the stability of the node layout during the evolution of the dynamic knowledge graph while making the overall layout more regular to better display the relationships between entities.
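
As a minimal sketch of the force-directed refinement step underlying this kind of layout, the code below implements one Fruchterman–Reingold-style iteration in NumPy; DKGV’s GCN initialization, boundary handling, and temporal terms are not reproduced here.

```python
# One generic force-directed layout iteration (repulsion between all nodes,
# attraction along edges). The initial positions could come from a learned
# embedding instead of random values.
import numpy as np

def force_directed_step(pos, edges, k=1.0, step=0.05):
    """pos: (n, 2) node positions; edges: list of (i, j) index pairs."""
    n = pos.shape[0]
    disp = np.zeros_like(pos)
    # Repulsion between every pair of nodes.
    for i in range(n):
        delta = pos[i] - pos                              # (n, 2)
        dist = np.linalg.norm(delta, axis=1) + 1e-9
        disp[i] += (delta / dist[:, None] * (k**2 / dist)[:, None]).sum(axis=0)
    # Attraction along edges.
    for i, j in edges:
        delta = pos[i] - pos[j]
        dist = np.linalg.norm(delta) + 1e-9
        pull = delta / dist * (dist**2 / k)
        disp[i] -= pull
        disp[j] += pull
    # Move each node a small step along its net force.
    return pos + step * disp / (np.linalg.norm(disp, axis=1, keepdims=True) + 1e-9)

rng = np.random.default_rng(0)
pos = rng.random((6, 2))          # in DKGV this would be the GCN-based initial layout
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
for _ in range(200):
    pos = force_directed_step(pos, edges)
```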

AURORA: Automated Unleash of 3D Room Outlines for VR Applications

Creating realistic VR experiences is challenging due to the labor-intensive process of accurately replicating real-world details into virtual scenes, highlighting the need for automated methods that maintain spatial accuracy and provide design flexibility. In this paper, we propose AURORA, a novel method that leverages RGB-D images to automatically generate both purely virtual reality (VR) scenes and VR scenes combined with real-world elements. This approach can benefit designers by streamlining the process of converting real-world details into virtual scenes. AURORA integrates advanced techniques in image processing, segmentation, and 3D reconstruction to efficiently create realistic and detailed interior designs from real-world environments. The design of this integration ensures optimal performance and precision, addressing key challenges in automated indoor design generation by uniquely combining and leveraging the strengths of foundation models. We demonstrate the effectiveness of our approach through experiments on both self-captured data and public datasets, showcasing its potential to enhance virtual reality (VR) applications by providing interior designs that conform to real-world positioning.
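
The abstract does not expose AURORA’s internals; the sketch below only illustrates the geometric core shared by RGB-D reconstruction pipelines, back-projecting a depth image into a camera-space point cloud with a pinhole model (the camera intrinsics are placeholder values).

```python
# Back-project a depth image into a 3D point cloud using a pinhole camera model.
# This is a generic building block of RGB-D reconstruction, not AURORA itself.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """depth: (H, W) array in metres -> (N, 3) points in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # drop invalid (zero-depth) pixels

depth = np.full((480, 640), 2.0)                     # synthetic flat wall 2 m away
points = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)
```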

Exploring the Impact of Bidirectional Interactions Between VR and SAR on Cultural Exhibition

Virtual Reality (VR) technology has transformed cultural exhibitions by enabling immersive experiences. To create a shareable VR experience for cultural exhibitions, some research has integrated VR with other devices. However, existing approaches may suffer from unidirectional interactions, making it difficult for users to modify or influence VR content, which negatively impacts the user experience. To address this issue, this study proposes a VR collaborative exhibition system based on Spatial Augmented Reality (SAR). Using Liangzhu culture as a case study, the system allows users to immersively explore cultural heritage through VR while projecting VR content into public spaces via SAR for synchronous viewing from a global perspective. The system supports bidirectional interactions to enhance participation and improve the user experience in VR exhibitions. Furthermore, a comparative study involving 16 participants was conducted to evaluate the impact of the proposed system on collaboration efficiency, cognitive load, and user experience. The results indicated that the proposed system achieved significantly higher collaboration efficiency than a screen-sharing-based VR exhibition system, and participants reported lower cognitive load and a better user experience. Interviews further confirmed participants’ preference for the proposed system.

Augmented Reality Spatial Guidance Cues for Flexible Multi-Objective Task Environments

Improving the efficiency of Augmented Reality (AR) task guidance is actively explored in education and industry. Current AR techniques often require users to follow a fixed task sequence, limiting flexibility. This research explores the benefits and limitations of allowing participants to choose the task order in AR task guidance. In large spaces, providing an overview of task locations is essential for effectively navigating to and completing tasks dispersed across various locations. In a user study, we compared a baseline condition of spatially anchored tags with two spatial guidance techniques, line guidance and radar guidance, that help users navigate a task environment with multiple distributed task locations. The results showed that line guidance outperformed radar guidance in task completion efficiency and recall. These methods significantly influenced route-planning strategies, and the results highlight line guidance as a promising option for multi-objective AR task guidance interfaces.

Virtual Reality Training System for Nurses: Evaluating Risks in Patients' Home Environments

In this paper, we introduce a Virtual Reality (VR) nurse training simulation system designed to enhance risk management skills within patient home environments. Our VR training system focuses on preparing nurses to identify, assess, and respond to various hazards and risk factors commonly encountered in home care settings. By simulating realistic home environments and potential patient safety risks, the system aims to improve nurses' situational awareness, decision-making abilities, and readiness to address complex scenarios in non-hospital environments. This approach supports the development of critical competencies essential for providing safe and effective care in patients' homes. The VR training system enhances student engagement through immersive, interactive experiences and offers immediate feedback to support effective learning. Our VR nurse training system consists of several essential components: realistic and relevant scenarios, alignment with the nursing curriculum, active instructor involvement, robust assessment tools, accessibility for all learners, and a commitment to continual improvement. Although challenges exist, such as the high cost of VR technology, potential technical issues, and the need for specialized instructor training, these can be mitigated through thoughtful planning and institutional support. Overall, VR holds the potential to transform nurse training by delivering hands-on, practical experiences that go beyond what traditional teaching methods can offer, preparing students with essential skills in a safe and controlled environment.

SESSION: Session 2

FP-KDNet: Facial Perception and Knowledge Distillation Network for Emotion Recognition in Conversation

Emotion recognition in conversation (ERC) is anchored in the burgeoning field of artificial intelligence, aiming to equip machines with the ability to discern and respond to human emotions in nuanced ways. However, recent studies have primarily focused on textual modalities, often neglecting the significant potential of non-verbal cues found in audio and video, which are critical for accurately capturing emotions. Furthermore, when researchers integrate these non-verbal cues into multimodal emotion recognition systems, they encounter challenges related to data heterogeneity across different modalities. This paper introduces the Facial Perception and Knowledge Distillation Network (FP-KDNet) to address these challenges. Specifically, a novel Facial Perceptual Attention (FPA) module is designed to capture non-verbal cues from videos, significantly enhancing the model’s ability to process visual information. Additionally, a knowledge distillation (KD) strategy is proposed to improve emotion representation in the non-verbal modality by leveraging data from the text modality, facilitating effective cross-modal information exchange. A multi-head attention mechanism further optimizes the integration of features across modalities, dynamically adjusting attention allocation to enhance conversational emotion recognition. Experimental results demonstrate that FP-KDNet achieves excellent performance on the MELD and IEMOCAP datasets, and ablation studies confirm the effectiveness of the multimodal fusion approach.
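
A hedged sketch of the general knowledge-distillation mechanism referenced here, framed as distilling text-modality logits (teacher) into non-verbal-modality logits (student); FP-KDNet’s exact objective and the FPA module are not specified in the abstract, so the temperature and weighting below are illustrative.

```python
# Standard temperature-scaled knowledge distillation, used here only to
# illustrate the cross-modal KD idea; not FP-KDNet's actual loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: usual cross-entropy on emotion labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 7, requires_grad=True)   # e.g. 7 emotion classes
teacher = torch.randn(8, 7)                       # from the text-modality branch
labels = torch.randint(0, 7, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```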

Segmentation and Immersive Visualization of Brain Lesions Using Deep Learning and Virtual Reality

Magnetic resonance imaging (MRI) is commonly used for diagnosing potential neurological disorders; however, preparation and interpretation of MRI scans require professional oversight. Additionally, MRIs are typically viewed as single cross-sections of the affected regions, which does not always capture the full picture of brain lesions and can be difficult to understand due to 2D’s inherent abstraction of our 3D world. To address these challenges, we propose an immersive visualization pipeline that combines deep learning image segmentation, using a VGG-16 model trained on MRI fluid-attenuated inversion recovery (FLAIR) scans, with virtual reality (VR) immersive analytics. Our visualization pipeline begins with the VGG-16 model predicting which regions of the brain are potentially affected by disease. This output, along with the original scan, is then volumetrically rendered. These renders can then be viewed in VR using a head-mounted display (HMD). Within the HMD, users can move through the volumetric renderings to view the affected regions and use planes to view cross-sections of the MRI scans. Our work provides a potential pipeline and tool for diagnosis and care.
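
As a small sketch of the hand-off between 2D segmentation and volumetric viewing, the code below stacks per-slice lesion masks into a 3D volume suitable for a volume renderer or VR viewer; the VGG-16 predictions are mocked with random masks.

```python
# Stack per-slice segmentation masks into a 3D volume for volumetric rendering.
# The "predicted" masks here are random placeholders standing in for model output.
import numpy as np

def stack_slices(masks):
    """masks: list of (H, W) binary arrays, one per axial slice -> (D, H, W) volume."""
    return np.stack(masks, axis=0).astype(np.uint8)

rng = np.random.default_rng(0)
masks = [(rng.random((128, 128)) > 0.98) for _ in range(32)]   # mock lesion masks
volume = stack_slices(masks)
print(volume.shape, volume.sum())   # this volume is what gets volumetrically rendered
```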

A Frequency Reinforce and Information Mutual Embedding Network for Occluded Person Re-Identification

Occluded person re-identification (ReID) is vital for enhancing immersion and realism in virtual reality (VR) applications, particularly in fields such as security, surveillance, and interactive environments. Existing methods primarily concentrate on predicting occlusions or enhancing features, but they often neglect the issue of data imbalance, which restricts their effectiveness in complex VR scenarios. To address these challenges, we propose a novel network, the Frequency Reinforcement and Information Mutual Embedding Network (FRIME-Net), specifically designed to enhance the accuracy of occluded person retrieval in VR environments. FRIME-Net is a dual-branch network that combines ViT and CNN to capture a diverse range of person features. FRIME-Net features two innovative components: a high-low frequency reinforcement module that captures both coarse-grained and fine-grained features, and a full-part mutual embedding module that integrates global and part-based information, thereby improving feature robustness. Experimental results on three widely-used datasets, including both occluded and non-occluded settings, demonstrate the effectiveness of FRIME-Net, underscoring its potential to improve VR applications that require robust person re-identification capabilities.
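
The abstract does not describe the internals of the high-low frequency reinforcement module; the hedged sketch below shows one common way to split a feature map into low- and high-frequency components with a 2D FFT, which conveys the underlying idea without claiming to be FRIME-Net’s design.

```python
# Generic high/low frequency split of a feature map via the 2D FFT.
# The cutoff and masking scheme are illustrative, not FRIME-Net's.
import torch

def split_frequencies(feat, cutoff=0.25):
    """feat: (B, C, H, W). Returns (low_freq, high_freq) feature maps."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff / 2), int(W * cutoff / 2)
    mask = torch.zeros(H, W, device=feat.device)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0          # keep the central (low) band
    low_spec = spec * mask
    high_spec = spec * (1.0 - mask)
    low = torch.fft.ifft2(torch.fft.ifftshift(low_spec, dim=(-2, -1))).real
    high = torch.fft.ifft2(torch.fft.ifftshift(high_spec, dim=(-2, -1))).real
    return low, high

low, high = split_frequencies(torch.randn(2, 64, 32, 32))
```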

GBC: Gaussian-Based Colorization and Super-Resolution for 3D Reconstruction

In this paper, we introduce GBC, an advanced framework for transforming low-resolution, monochrome video sequences into high-resolution, colorized, and geometrically accurate 3D models. GBC combines Bidirectional Optical Flow Super-Resolution (BOF-SR) for temporal super-resolution with a novel colorization approach, Temporal Optical Flow-based Colorization (TOF-CO), designed to enhance frame-to-frame temporal consistency. The integration of these modules with COLMAP-based 3D Gaussian splatting further extends GBC’s capability to reconstruct high-fidelity 3D scenes. Additionally, we created a custom dataset tailored to the challenges of low-light, low-quality historical footage, enabling robust evaluation alongside public datasets. This solution offers a new approach to video restoration, advancing temporal coherence, color accuracy, and 3D scene fidelity. The source code is publicly available at https://github.com/ffftuanxxx/GBC.
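
As a hedged illustration of the basic operation underlying flow-guided super-resolution and color propagation, the sketch below backward-warps a neighbouring frame with a dense optical-flow field via grid_sample; GBC’s BOF-SR and TOF-CO modules are not reproduced.

```python
# Backward-warp a frame with a per-pixel optical-flow field. This is a generic
# building block of flow-based video processing, not GBC's pipeline.
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels -> warped frame."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).to(frame)   # (1, 2, H, W)
    coords = base + flow
    # Normalise to [-1, 1] for grid_sample (x first, then y).
    coords[:, 0] = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)                                    # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

warped = warp_with_flow(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```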

Typeface Generation through Style Descriptions With Generative Models

Typeface design plays a vital role in graphic and communication design. Different typefaces are suitable for different contexts and can convey different emotions and messages. Typeface design still relies on skilled designers to create unique styles for specific needs. Recently, generative adversarial networks (GANs) have been applied to typeface generation, but these methods face challenges due to the high annotation requirements of typeface generation datasets, which are difficult to obtain. Furthermore, machine-generated typefaces often fail to meet designers’ specific requirements, as dataset annotations limit the diversity of the generated typefaces. In response to these limitations in current typeface generation models, we propose an alternative approach to the task. Instead of relying on dataset-provided annotations to define the typeface style vector, we introduce a transformer-based language model to learn the mapping between a typeface style description and the corresponding style vector. We evaluated the proposed model using both existing and newly created style descriptions. Results indicate that the model can generate high-quality, patent-free typefaces based on the input style descriptions provided by designers. The code is available at: https://github.com/tqxg2018/Description2Typeface
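
The paper’s language model is described only as transformer-based; the sketch below shows one plausible shape of the mapping from a tokenized style description to a fixed-size style vector, with placeholder vocabulary, dimensions, and pooling rather than the authors’ actual model.

```python
# Toy transformer encoder mapping a token sequence (a style description) to a
# fixed-size style vector that a typeface generator could consume.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, style_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_style = nn.Linear(d_model, style_dim)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))     # (B, L, d_model)
        pooled = h.mean(dim=1)                      # simple mean pooling
        return self.to_style(pooled)                # (B, style_dim) style vector

tokens = torch.randint(0, 10000, (2, 12))           # two toy tokenized "descriptions"
style_vectors = StyleEncoder()(tokens)              # fed to the typeface generator
```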

Efficient Classroom Behavior Detection through End-to-End Multiscale Feature Fusion

Classroom behavior detection is essential for smart education: it can enhance teaching effectiveness, promote personalized learning, and support holistic student development. Existing methods that rely on single-feature input are susceptible to noise interference, which can lead to misjudgments in classroom behavior detection. This paper proposes an efficient classroom behavior detection method based on end-to-end multiscale feature fusion. First, the original YOLOv7 input module is replaced with a multi-feature information input module to enhance spatial feature utilization efficiency. Then, a feature weight attention mechanism is incorporated to allow the main network to weigh feature information at various scales comprehensively; this improves recognition accuracy while reducing the number of network parameters. Finally, a focal loss function is introduced to suppress background interference and further improve recognition accuracy. Experimental results demonstrate that the improved algorithm attains 94.1% accuracy on the RizeHand dataset and effectively recognizes student behaviors in real-world scenarios.
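
The focal loss mentioned above has a standard binary form; the sketch below implements it in PyTorch, with illustrative gamma and alpha values since the paper’s exact configuration and its wiring into YOLOv7 are not given in the abstract.

```python
# Standard binary focal loss, which down-weights easy examples so that
# background clutter contributes less to the objective.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits, targets: same shape; targets in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()     # easy examples get small weight

loss = focal_loss(torch.randn(16, 5), torch.randint(0, 2, (16, 5)).float())
```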

Multiscale Structure Prompted Diffusion Models for Scene Text Image Super-Resolution

Scene Text Image Super-Resolution (STISR), which aims to enhance the resolution and legibility of text in low-resolution images, is crucial for accurate text recognition in real-world scenarios, especially in virtual reality (VR) environments where clarity is essential for immersive experiences. However, existing STISR methods often fall short of capturing detailed structural information, limiting the quality of reconstructed images. While diffusion models offer robust generative abilities, they can still result in structural inconsistencies, particularly when handling highly blurred or distorted text. To address these challenges, we introduce MSPDiff, a Multiscale Structure Prompted Diffusion Model designed specifically for STISR. MSPDiff integrates two main components: the Multiscale Structure Generation Model (MSGM) and the Prompted Diffusion Model (PDM). The MSGM captures multiscale structural features from text data, generating structural priors that guide the PDM. In this setup, the PDM leverages both the low-resolution images and the multiscale structures from the MSGM to enhance text clarity. Training both components simultaneously presents optimization challenges, as the MSGM is constrained by text data limitations while the PDM is optimized using noise. To address this, we implement a two-stage training approach: the MSGM is trained initially, after which the PDM is trained with the MSGM frozen. Experiments on the widely-used TextZoom benchmark demonstrate that MSPDiff outperforms state-of-the-art methods in both text recognition accuracy and image quality, establishing it as a promising approach for enhancing text clarity in virtual reality applications.
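
The two-stage schedule described above can be sketched as a training skeleton: train the structure model first, then freeze it and condition the diffusion model on its outputs. The model objects and their structure_loss / denoising_loss interfaces below are assumed placeholders, not MSPDiff’s actual components.

```python
# Two-stage training skeleton: stage 1 fits the structure model, stage 2 trains
# the diffusion model with the structure model frozen. The msgm/pdm objects are
# assumed to be nn.Modules exposing the (hypothetical) loss methods used below.
import torch

def train_two_stage(msgm, pdm, loader, epochs_stage1=10, epochs_stage2=50):
    opt1 = torch.optim.Adam(msgm.parameters(), lr=1e-4)
    for _ in range(epochs_stage1):                        # Stage 1: structure prior
        for lr_img, hr_img in loader:
            loss = msgm.structure_loss(msgm(lr_img), hr_img)
            opt1.zero_grad(); loss.backward(); opt1.step()

    for p in msgm.parameters():                           # Freeze the structure model
        p.requires_grad_(False)
    msgm.eval()

    opt2 = torch.optim.Adam(pdm.parameters(), lr=1e-4)
    for _ in range(epochs_stage2):                        # Stage 2: diffusion model
        for lr_img, hr_img in loader:
            with torch.no_grad():
                structures = msgm(lr_img)                 # multiscale structural prior
            loss = pdm.denoising_loss(hr_img, cond=(lr_img, structures))
            opt2.zero_grad(); loss.backward(); opt2.step()
```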

Structured Teaching Prompt Articulation for Generative-AI Role Embodiment with Augmented Mirror Video Displays

We present a classroom enhanced with an augmented reality video display in which students adopt snapshots of their corresponding virtual personas according to their teacher’s live spoken educational theme: linearly, through roles such as historical figures, famous scientists, and cultural icons, and laterally, through archetypal categories such as world dance styles. We define a structure of generative AI prompt guidance to assist teachers with focused, specified visual stylization of role embodiment. By leveraging role-based immersive embodiment, our proposed approach enriches pedagogical practices that prioritize experiential learning.
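
As a hypothetical illustration of what a structured prompt for role embodiment might look like, the sketch below fills a fixed template with a role, a category, and styling fields; the field names and wording are invented for illustration and are not the authors’ schema.

```python
# Hypothetical structured prompt template for teacher-guided role embodiment.
PROMPT_TEMPLATE = (
    "Restyle the student's mirrored silhouette as {role}, "
    "a {category} figure from {era_or_region}, "
    "keeping the pose and face position unchanged; "
    "costume and background should reflect {visual_style}."
)

def build_prompt(role, category, era_or_region, visual_style):
    return PROMPT_TEMPLATE.format(
        role=role, category=category,
        era_or_region=era_or_region, visual_style=visual_style,
    )

# Linear progression through an educational theme (e.g. famous scientists) ...
print(build_prompt("Marie Curie", "famous scientist", "early 20th-century Europe",
                   "period laboratory attire"))
# ... or a lateral, archetypal category such as world dance styles.
print(build_prompt("a flamenco dancer", "world dance style", "Andalusia, Spain",
                   "traditional flamenco dress"))
```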

DBD-Diff: Defocus Blur Detection Using Semantic and Texture Correlation Guided Diffusion Model

Defocus blur detection (DBD) is essential in computer vision, as it facilitates the precise identification and detection of in-focus regions within images. However, recent methods may suffer accuracy degradation when the foreground and background objects are highly similar or differ only slightly in color. Diffusion models have shown robust performance in various computer vision tasks, as they can efficiently reconstruct images from noise features and naturally incorporate multiple object features through each iterative generation process. However, diffusion models have not yet been explored for DBD tasks, which may hinder advancements in the field. In this work, we propose a novel diffusion framework specifically tailored for DBD; to the best of our knowledge, this is the first diffusion model specifically designed for the task. We further propose a Cross-Domain Correlation Extractor and a Cross-Domain Tokenized KAN that extract two groups of object features and generate the semantic correlation feature and the texture correlation feature between the foreground and background objects to guide the diffusion model. Finally, we conduct extensive experiments on several datasets, and the results show that our method achieves state-of-the-art segmentation performance.
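
The sketch below shows a generic conditional denoising-diffusion training step, noising a ground-truth focus mask and predicting the noise given image-derived guidance; DBD-Diff’s correlation extractor and Tokenized KAN are abstracted into a single placeholder conditioning tensor, and the denoiser is a toy stand-in.

```python
# Generic conditional DDPM training step for mask prediction. The condition
# tensor stands in for whatever guidance features a real model would produce.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mask, condition, alphas_cumprod):
    """mask: (B, 1, H, W) in [-1, 1]; condition: (B, C, H, W) guidance features."""
    B = mask.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(mask)
    noisy_mask = a_bar.sqrt() * mask + (1 - a_bar).sqrt() * noise   # forward process
    pred_noise = denoiser(torch.cat([noisy_mask, condition], dim=1), t)
    return F.mse_loss(pred_noise, noise)

# Toy denoiser: a real model would be a UNet conditioned on the timestep t.
denoiser = lambda x, t: torch.nn.Conv2d(x.shape[1], 1, 3, padding=1)(x)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
loss = diffusion_training_step(denoiser, torch.rand(2, 1, 64, 64) * 2 - 1,
                               torch.randn(2, 8, 64, 64), alphas_cumprod)
```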

DTDMat: A Comprehensive SVBRDF Dataset with Detailed Text Descriptions

In this paper, we design an automatic annotation tool that generates full text descriptions, addressing the lack of essential text information in existing material datasets. The tool extracts six aspect tags from BRDF maps: intrinsic type, texture, color, roughness, lightness, and other relevant attributes. We applied this tool to both open-source material datasets and our own dataset to create DTDMat, which consists of 14,919 high-resolution Physically Based Rendering materials, each accompanied by a detailed text description. DTDMat covers 20 intrinsic material types and 22 texture structures. It stands out as the most diverse dataset in this domain and represents the largest texture dataset with associated text, offering a wide range of categories and diverse descriptions. We then trained a text-to-material generation framework on DTDMat, which yields BRDF maps that satisfy the input text.
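
As a hedged illustration of rule-based tag extraction from SVBRDF maps, the sketch below buckets mean roughness and lightness into coarse tags of the kind listed above; the thresholds and tag names are illustrative and not DTDMat’s actual annotation rules.

```python
# Toy tag extraction from material maps: bucket mean roughness and lightness
# into coarse labels that could feed a text description.
import numpy as np

def roughness_tag(roughness_map):
    m = float(np.mean(roughness_map))
    return "rough" if m > 0.66 else "semi-rough" if m > 0.33 else "smooth"

def lightness_tag(albedo_map):
    # Luma-style weighting of an RGB albedo map in [0, 1].
    luma = 0.299 * albedo_map[..., 0] + 0.587 * albedo_map[..., 1] + 0.114 * albedo_map[..., 2]
    m = float(np.mean(luma))
    return "light" if m > 0.6 else "medium" if m > 0.3 else "dark"

rng = np.random.default_rng(0)
albedo = rng.random((256, 256, 3))
roughness = rng.random((256, 256))
tags = {"roughness": roughness_tag(roughness), "lightness": lightness_tag(albedo)}
print(tags)   # such tags would feed into the full text description for each material
```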

Evaluating Visuohaptic Integration on Memory Retention of Morphological Tomographic Images

Scientific visualization and tomographic imaging techniques have created unprecedented possibilities for non-destructive analyses of digital specimens in morphology. However, practitioners encounter difficulties retaining critical information from complex tomographic volumes in their workflows. In light of this challenge, we investigated the effectiveness of visuohaptic integration in enhancing memory retention of morphological data. In a within-subjects user study (N=18), participants completed a delayed match-to-sample task, where we compared error rates and response times across visual and visuohaptic sensory modality conditions. Our results indicate that visuohaptic encoding improves the retention of tomographic images, producing significantly reduced error rates and faster response times than its unimodal visual counterpart. Our findings suggest that integrating haptics into scientific visualization interfaces may support professionals in fields such as morphology, where accurate retention of complex spatial data is essential for efficient analysis and decision-making within virtual environments.