About the Volumetric Working Groups
VFA’s working groups collaborate to integrate volumetric video’s tech verticals
How We Work
VFA’s working groups collaborate to develop specifications that allow interoperable integration of the currently disparate technologies and products required to create true three-dimensional video.
Volumetric technology depends on several distinct technologies (we call them verticals) communicating with one another. Our members participate in several volumetric working groups to ensure the interoperability of Capture, Process, Compress, and Playback.
Challenges of Volumetric Video
The biggest challenge of volumetric video is the lack of interoperability: there are four different ways to capture, four different ways to process, three different ways to compress, and four different ways to render.
Currently the technology is fragmented. All members of the VFA are committed to developing interoperability specifications and best practices that are defined by real-world uses.
Capture: An array of cameras captures a real scene.
Process: Camera input is converted to 3D models.
Compress: 3D models are prepared to stream over a network.
Playback: Volumetric captures are played back on devices.
Capture & Acquisition Working Group
The capture and acquisition working group focuses on the first vertical, known as capture.
In this step, a person or scene is surrounded by over a dozen volumetric cameras.
There are four ways to perform volumetric capture: time of flight, structured light, photogrammetry and multiview depth, and stereo disparity. After the scene is captured, it is processed in the reconstruction step.
Time of flight
This technique generates a color image and a depth image. The camera emits infrared light and measures how long the light takes to return, which gives the distance from the camera to the person being captured. Microsoft’s Azure Kinect camera uses time of flight.
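The relationship between travel time and distance can be illustrated with a minimal sketch. It assumes an idealized sensor that times a single light pulse directly; real cameras may measure the phase shift of modulated light instead. The function name `tof_distance_m` is hypothetical and not part of any camera SDK.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to a surface, given the round-trip time of a light pulse.

    Hypothetical helper for illustration only.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The light travels to the subject and back, so halve the path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# A pulse that returns after 10 nanoseconds corresponds to roughly 1.5 m.
print(tof_distance_m(10e-9))
```

The tiny time scales involved (nanoseconds per meter) are one reason this measurement is done in dedicated sensor hardware.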
Photogrammetry & Multiview Depth
This technique uses several color cameras to generate a depth image or a point cloud of the person being captured. The cameras generate both a depth image and a color image.
Structured Light
This technique uses two monochrome cameras that can detect infrared, a color camera, and a laser that projects infrared dots onto the scene. It generates a depth image and a color image. Intel’s RealSense camera uses structured light.
Stereo Disparity
This technique uses two color cameras to simulate the left and right eye. It generates a depth image from the pair and a single color image from one of the color cameras.
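For two cameras, the standard relation between disparity and depth can be sketched as follows. This assumes an idealized, rectified pair of identical pinhole cameras; the function and parameter names are hypothetical.

```python
from typing import Optional


def depth_from_disparity(
    disparity_px: float, focal_length_px: float, baseline_m: float
) -> Optional[float]:
    """Depth in meters of a feature seen by both cameras of a rectified pair.

    disparity_px:    horizontal shift of the feature between left and right images
    focal_length_px: focal length expressed in pixels (assumed equal for both cameras)
    baseline_m:      distance between the two camera centers
    """
    if disparity_px <= 0:
        # Zero disparity means the feature is effectively at infinity;
        # negative values indicate a bad match.
        return None
    return focal_length_px * baseline_m / disparity_px


# Nearby features shift more between the two images than distant ones.
near = depth_from_disparity(70.0, focal_length_px=700.0, baseline_m=0.1)
far = depth_from_disparity(7.0, focal_length_px=700.0, baseline_m=0.1)
print(near, far)
```

Because depth is inversely proportional to disparity, depth precision falls off as the subject moves away from the cameras.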
Reconstruction & Encoding Working Group
The reconstruction and encoding working group works on process.
During Process, a 3D world is reconstructed from the captured scene. This step involves processing the 2D videos into 3D models. The four methods for processing are depth map, voxel, point cloud, and mesh generation. The 3D models are then sent to compression so they can be streamed over the network.
Depth Map Generation
A depth map is an image that represents the location of every pixel (or subpixel) in 3D space. This is similar to how a terrain map works, but far more precise. Depth maps can be generated by the volumetric cameras themselves, such as Intel RealSense cameras or Microsoft’s Azure Kinect cameras.
Voxel Generation
This technique usually takes a point cloud and turns each point into a 3D cube (a voxel). This fills in any holes in the point cloud. Each voxel carries more than a single color: a small color image is wrapped around the voxel to represent the subject being captured.
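The step of turning points into cubes can be sketched as snapping each point to a regular grid. This is a simplified illustration, not the method of any particular product: it stores one averaged color per voxel rather than a small color image, and it does not fill holes. The `voxelize` function and its point layout are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]
Color = Tuple[int, int, int]
VoxelIndex = Tuple[int, int, int]


def voxelize(
    points: List[Tuple[Position, Color]], voxel_size: float
) -> Dict[VoxelIndex, Color]:
    """Group colored points into cubes of edge length voxel_size.

    Simplification: each voxel gets the average color of its points.
    """
    buckets: Dict[VoxelIndex, List[Color]] = defaultdict(list)
    for (x, y, z), color in points:
        # floor() keeps negative coordinates in the correct cell.
        index = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[index].append(color)

    voxels: Dict[VoxelIndex, Color] = {}
    for index, colors in buckets.items():
        count = len(colors)
        voxels[index] = (
            sum(c[0] for c in colors) // count,
            sum(c[1] for c in colors) // count,
            sum(c[2] for c in colors) // count,
        )
    return voxels


cloud = [
    ((0.01, 0.02, 0.03), (255, 0, 0)),
    ((0.04, 0.01, 0.02), (0, 0, 255)),
    ((0.26, 0.00, 0.00), (0, 255, 0)),
]
print(voxelize(cloud, voxel_size=0.1))
```

The voxel size trades detail against data volume: larger cubes merge more points into each cell.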
Point Cloud Generation
This technique can use SLAM or a depth map to generate points in 3D space to represent the subject being captured. These points also include a color value that matches the color of the subject being captured.
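When the starting point is a depth map, generating points amounts to back-projecting each pixel through a camera model. The sketch below assumes a simple pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a color image already aligned to the depth image; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]
Color = Tuple[int, int, int]


def depth_map_to_point_cloud(
    depth_m: Sequence[Sequence[float]],
    color: Sequence[Sequence[Color]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[Position, Color]]:
    """Back-project every valid depth pixel into a colored 3D point.

    depth_m[v][u] is the depth in meters at pixel column u, row v;
    a value <= 0 is treated as "no measurement" and skipped.
    """
    points: List[Tuple[Position, Color]] = []
    for v, depth_row in enumerate(depth_m):
        for u, z in enumerate(depth_row):
            if z <= 0:
                continue
            # Pinhole model: pixel offset from the principal point, scaled by depth.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append(((x, y, z), color[v][u]))
    return points


depth = [[1.0, 0.0], [2.0, 2.0]]
rgb = [[(255, 0, 0), (0, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
print(depth_map_to_point_cloud(depth, rgb, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

With several cameras, each camera's points would additionally need to be transformed into a shared coordinate system, which is not shown here.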
Mesh Generation
This technique usually takes a point cloud and turns every three points into a triangle that represents the surface of the subject being captured. This results in hundreds to thousands of triangles that are connected to each other as a single mesh. The color of the mesh is usually stitched together from the color cameras and then applied as a UV map.
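As a rough illustration of connecting points into triangles, the sketch below handles the easiest case: points that are already arranged on a regular grid, as they would be when they come from a depth map. Unorganized point clouds need a surface reconstruction algorithm, which is not shown. The `grid_triangles` helper is hypothetical.

```python
from typing import List, Tuple

Triangle = Tuple[int, int, int]


def grid_triangles(width: int, height: int) -> List[Triangle]:
    """Triangle indices for a width x height grid of points stored row by row.

    Each grid cell is split into two triangles. The indices refer to
    positions in the flat, row-major list of points.
    """
    triangles: List[Triangle] = []
    for v in range(height - 1):
        for u in range(width - 1):
            top_left = v * width + u
            top_right = top_left + 1
            bottom_left = top_left + width
            bottom_right = bottom_left + 1
            triangles.append((top_left, bottom_left, top_right))
            triangles.append((top_right, bottom_left, bottom_right))
    return triangles


# A 3 x 3 grid of points has 4 cells, so it yields 8 triangles.
print(len(grid_triangles(3, 3)))
```

Neighboring triangles share vertices, which is what makes the result a single connected mesh rather than a set of loose triangles.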
Decode & Render Working Group
The decode and render working group collaborates to address compression techniques.
The 3D models are compressed so they can be streamed over a network. There are three different methods of compression: mesh, point cloud, and depth and UV map compression.
Mesh Compression
This technique compresses a mesh’s data over a sequence of frames so it can be streamed over a network. The device needs to decompress the mesh but does not have to generate the mesh from scratch.
Depth & UV Map Compression
This technique compresses the depth maps and UV maps over a sequence of frames so they can be streamed over a network. The device then has to generate points, a mesh, or voxels in order to render the subject.
Point Cloud Compression
This technique compresses the points over a sequence of frames so they can be streamed over a network. The device will either render the decompressed points as-is, generate a mesh and render the mesh, or turn the points into voxels and render the voxels.
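To give a feel for what compressing "over a sequence of frames" can mean, here is a toy sketch: point coordinates are quantized, each frame is stored as the difference from the previous frame, and the result is passed through a general-purpose compressor. It assumes every frame has the same number of points in the same order, which real captures generally do not guarantee, and it is not the scheme used by any actual volumetric codec. All function names are hypothetical.

```python
import struct
import zlib
from typing import List, Tuple

Point = Tuple[float, float, float]


def compress_sequence(frames: List[List[Point]], step: float = 0.001) -> bytes:
    """Quantize coordinates to multiples of step, delta-encode across frames, deflate."""
    payload = bytearray()
    previous: List[int] = []
    for frame in frames:
        quantized = [round(c / step) for point in frame for c in point]
        if not previous:
            previous = [0] * len(quantized)
        # Points that barely move between frames produce small, repetitive deltas.
        deltas = [q - p for q, p in zip(quantized, previous)]
        payload += struct.pack(f"<{len(deltas)}i", *deltas)
        previous = quantized
    return zlib.compress(bytes(payload))


def decompress_sequence(
    blob: bytes, frame_count: int, points_per_frame: int, step: float = 0.001
) -> List[List[Point]]:
    """Invert compress_sequence; coordinates are restored to within step / 2."""
    per_frame = points_per_frame * 3
    values = struct.unpack(f"<{frame_count * per_frame}i", zlib.decompress(blob))
    frames: List[List[Point]] = []
    previous = [0] * per_frame
    for f in range(frame_count):
        deltas = values[f * per_frame:(f + 1) * per_frame]
        quantized = [p + d for p, d in zip(previous, deltas)]
        frames.append(
            [
                (quantized[i] * step, quantized[i + 1] * step, quantized[i + 2] * step)
                for i in range(0, per_frame, 3)
            ]
        )
        previous = quantized
    return frames


sequence = [
    [(0.0, 0.0, 1.0), (0.5, 0.25, 1.5)],
    [(0.001, 0.0, 1.0), (0.5, 0.251, 1.5)],
]
blob = compress_sequence(sequence)
restored = decompress_sequence(blob, frame_count=2, points_per_frame=2)
print(len(blob), restored[1])
```

The quantization step makes this lossy: a larger `step` gives smaller output at the cost of positional accuracy.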
Out-of-Mux Interactivity Working Group
The Out-of-Mux Interactivity Working Group focuses on playback of the 3D model on devices.
The four modes of playback are traditional 2D video, 3D rendering on a big-screen TV, 3D rendering on an XR device, and 3D rendering on a smartphone.
Traditional 2D Video
This technique can combine the volumetrically captured model with special effects from a movie or TV show and output ordinary 2D video, so the capture becomes part of the movie or TV show. Another example is a live sports event, in which a virtual camera can be used to generate a highlight of something that happened in the game, and that highlight can be displayed on the stadium’s big-screen TVs.
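The idea of a virtual camera can be sketched as projecting the 3D points of the captured model onto a 2D image from a chosen viewpoint. The sketch assumes a pinhole camera that can only be moved and turned left or right (yaw); a real renderer would also handle full rotation, occlusion, and shading. The function and its parameters are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(
    points: List[Point],
    camera_position: Point,
    yaw_rad: float,
    focal_px: float,
    width: int,
    height: int,
) -> List[Pixel]:
    """Project world-space points into the image of a virtual pinhole camera.

    With yaw_rad = 0 the camera looks along +z; a positive yaw turns it toward +x.
    Points behind the camera or outside the image are dropped.
    """
    cx, cy = width / 2.0, height / 2.0
    cos_y, sin_y = math.cos(yaw_rad), math.sin(yaw_rad)
    pixels: List[Pixel] = []
    for x, y, z in points:
        # Move into the camera's frame: translate, then undo the camera's yaw.
        dx = x - camera_position[0]
        dy = y - camera_position[1]
        dz = z - camera_position[2]
        xc = cos_y * dx - sin_y * dz
        yc = dy
        zc = sin_y * dx + cos_y * dz
        if zc <= 0:
            continue  # behind the camera
        u = focal_px * xc / zc + cx
        v = focal_px * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels


model = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
print(project_points(model, (0.0, 0.0, 0.0), 0.0, 500.0, 640, 480))
```

Changing `camera_position` and `yaw_rad` from frame to frame is what lets one capture be replayed from viewpoints where no physical camera stood.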
3D Rendering on a Big-Screen TV
This technique sends the 3D model to a game console, a set-top box or streaming stick, or a smart TV. The viewer can use a remote control to select the perspective they would like from a menu of options. The viewer can also use a game console controller to control the virtual camera that is rendering the 3D model.
3D Rendering on a Smartphone
This technique allows the user to use the smartphone’s touch screen to move around the 3D model, or to project the 3D model into the real world using the AR capabilities built into the smartphone.