Kinect Hacking

The Kinect is an accessory for Microsoft's Xbox 360 game console. It contains an array of microphones, an active-sensing depth camera using structured light, and a color camera. The Kinect is intended to serve as a controller-free game input device, tracking the bodies of one or more players in its field of view.

The motivation for this project was to convert the Kinect into a 3D camera by combining the depth and color image streams received from the device, and projecting them back out into 3D space in such a way that real 3D objects inside the cameras' field of view are recreated virtually, at their proper sizes.

Kinect Sensors

The Kinect contains a regular color camera, sending images of 640*480 pixels 30 times a second. It also contains an active-sensing depth camera using a structured light approach (using what appears to be an infrared laser diode and a micromirror array), which also sends (depth) images of 640*480 pixels 30 times a second (although it appears that not every pixel is sampled on every frame).
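
The raw values in the depth stream are not metric distances; they are 11-bit values that have to be converted to physical depth per pixel. As a rough sketch only (the constants below are a commonly cited approximation from the OpenKinect community, not per-device calibration values from this project), the conversion can be modeled like this:

    // Convert a raw 11-bit Kinect depth value into an approximate distance
    // in meters. The constants are a community-derived approximation; any
    // serious application should calibrate each individual Kinect instead.
    float rawDepthToMeters(unsigned short raw)
    {
        if(raw<2047) // the value 2047 marks pixels without a valid measurement
            return 1.0f/(float(raw)*-0.0030711016f+3.3309495161f);
        return 0.0f; // no depth reading at this pixel
    }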

What Makes The Kinect Special?

It is important to understand the difference between 3D cameras like the Kinect on one hand, regular (2D) cameras on the other hand, and so-called "3D cameras" -- actually, stereoscopic 2D cameras -- on the third hand (ouch).

Kinect vs Regular 2D Camera

Any camera, 2D or otherwise, works by projecting 3D objects (or people...), which you can think of as collections of 3D points in 3D space, onto a 2D imaging plane (the picture) along straight lines going through the camera's optical center point (the lens). Normally, once 3D objects are projected to a 2D plane that way, it is impossible to go back and reconstruct the original 3D objects. While each pixel in a 2D image defines a line from that pixel through the lens back out into 3D space, and while the original 3D point that generated the pixel must lie somewhere on that line, the distance that 3D point "traveled" along its line is lost in projection. There are approaches to estimate that distance for many pixels in an image by using multiple images or good old guesswork, but they have their limitations.
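
To make the projection step concrete, here is a minimal sketch of the ideal pinhole model, where f is the focal length in pixels and (cx, cy) is the principal point; these parameter names are chosen for illustration and are not taken from the Kinect software:

    // Ideal pinhole projection: a 3D point (x,y,z) in camera space maps onto
    // the image plane along the straight line through the optical center.
    // Note that the depth z is divided out and lost in the result.
    struct Pixel
    {
        float u, v; // pixel coordinates on the image plane
    };

    Pixel project(float x, float y, float z, float f, float cx, float cy)
    {
        return Pixel{f*x/z+cx, f*y/z+cy};
    }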

A 3D camera like a Kinect provides the missing bit of information necessary for 3D reconstruction. For each 2D pixel on the image plane, it not only records that pixel's color, i.e., the color of the original 3D point, but also that 3D point's distance along its projection line. There are multiple technologies to sense this depth information, but the details are not really relevant. The important part is that now, by knowing a 2D pixel's projection line and a distance along that projection line, it is possible to project each pixel back out into 3D space, which effectively reconstructs the originally captured 3D object(s). This reconstruction, which can only contain one side of an object (the one facing the camera), creates a so-called facade. By combining facades from multiple calibrated 3D cameras, one can even generate more complete 3D reconstructions.
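
With per-pixel depth available, that projection can be inverted. A minimal sketch, using the same illustrative intrinsics f, cx, cy as above (the actual software uses calibrated projection matrices rather than this simplified model):

    // Reproject a pixel (u,v) with measured depth z back into 3D camera
    // space. Applying this to every depth pixel reconstructs the facade of
    // the captured objects facing the camera.
    struct Point3
    {
        float x, y, z;
    };

    Point3 unproject(float u, float v, float z, float f, float cx, float cy)
    {
        return Point3{(u-cx)*z/f, (v-cy)*z/f, z};
    }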

Kinect vs So-Called "3D Camera"

There exist stereoscopic cameras on the market, which are usually advertised as "3D cameras." This is somewhat misleading. A stereoscopic camera, which can typically be recognized by having two lenses next to each other, does not capture 3D images, but rather two 2D images from slightly different viewpoints. If these two images are shown to a viewer, where the viewer's left eye sees the image captured through the left lens, and the right eye the other one, the viewer's brain will merge the so-called stereo pair into a full 3D image. The main difference is that the actual 3D reconstruction does not happen in the camera, but in the viewer's brain. As a result, images captured from these cameras are "fixed." Since they are not really 3D, they can only be viewed from the exact viewpoint from which they were originally taken. Real 3D pictures, on the other hand, can be viewed from any viewpoint, since that simply involves rendering the reconstructed 3D objects using a different perspective.

While it is possible to convert stereo pairs into true 3D images using computer vision approaches (so-called depth-from-stereo methods), those do not work very well in practice: they rely on finding matching pixels between the two images, and fail on untextured or repetitive surfaces.
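
The underlying geometry explains why: in a rectified stereo pair, depth is inversely proportional to the disparity (horizontal offset) between matching pixels, so small matching errors on distant points cause large depth errors. A sketch, with focal length f (in pixels) and lens baseline b as illustrative parameters:

    // Rectified stereo geometry: a pixel match with disparity d lies at
    // depth z = f*b/d. As d approaches zero (distant points), the result
    // becomes extremely sensitive to matching errors.
    float depthFromDisparity(float d, float f, float b)
    {
        return d>0.0f ? f*b/d : 0.0f; // d==0: no valid match / point at infinity
    }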

Project Goals

The goal of this project was to develop the software necessary to connect an unmodified, off-the-shelf Kinect device to a regular computer, and use it as a 3D camera for a variety of 3D graphics and virtual reality applications. The software is implemented as a set of applications based on the Vrui VR toolkit, and additionally as a Vrui vislet to facilitate using the 3D video stream received from a Kinect with all existing Vrui VR applications.

Project Details

The software is composed of several classes wrapping aspects of the underlying libusb library into an exception-safe C++ framework, classes encapsulating control of the Kinect's tilt motor and color and depth cameras, and a class encapsulating the operations necessary to reproject a combined depth and color video stream into 3D space. It also contains several utility applications, including a simple calibration utility.
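
As an illustration of the exception-safe wrapping approach (a hypothetical sketch; the class and method names below are invented for this example and are not the project's actual API):

    // Hypothetical RAII wrapper around a libusb device handle: the handle
    // is acquired in the constructor and released in the destructor, so it
    // is closed even when an exception unwinds the stack.
    #include <cstdint>
    #include <stdexcept>
    #include <libusb-1.0/libusb.h>

    class UsbDevice
    {
        libusb_device_handle* handle;
    public:
        UsbDevice(libusb_context* ctx, std::uint16_t vendorId, std::uint16_t productId)
            : handle(libusb_open_device_with_vid_pid(ctx, vendorId, productId))
        {
            if(handle == nullptr)
                throw std::runtime_error("UsbDevice: unable to open device");
        }
        ~UsbDevice()
        {
            libusb_close(handle);
        }
        UsbDevice(const UsbDevice&) = delete; // exactly one owner per handle
        UsbDevice& operator=(const UsbDevice&) = delete;
        libusb_device_handle* get() const { return handle; }
    };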

This software is based on the reverse engineering work of Hector Martin (marcan42 on twitter and YouTube). I didn't use any of his code, only the "magic incantations" that need to be sent to the Kinect to enable the cameras and start streaming. Those incantations were essential, because I don't own an Xbox myself, so I couldn't snoop its USB protocol. Thanks Hector!

The Kinect driver code and the 3D reconstruction code are entirely written from scratch in C++, using my own Vrui VR toolkit for 3D rendering management and interaction.

Pages In This Section

Movies
Movies showing 3D video streams from the Kinect, and how they can be integrated into other 3D graphics or VR software.
Download
Download page for the current and several older releases of the Kinect 3D Video Capture project, released under the GNU General Public License.

Translations

This page has been translated into other languages by volunteers: