Correcting 360 Degree Stereo Video Capture (Part 2 of 3)

We continue the series on our work with stereoscopic cameras and their contribution to stereoscopic 360 degree video capture, picking up from Part 1. In this part of the series we cover Steps 1 and 2 of 6:

  • Depth From Disparity Using Matched Witness and Center Cameras and Lenses
  • Depth From Disparity Using MISMATCHED Witness and Center Cameras and Lenses


Step 1: Depth From Disparity Using Matched Witness and Center Cameras and Lenses

We used 3 Red One MX cameras on a custom-machined mounting plate that allowed us complete freedom in aligning the cameras. There was quite a bit of jitter of the cameras relative to each other, due to the loudness of the concert and the animated nature of the crowd around us. We took an immediate SD card test to the broadcast compound, where we verified a) that we had a good take and b) that our system could do good disparity measurement in spite of the camera misalignment, noise, underexposure, and changing focal length of the center camera.

Since the camera rig was set up with 50mm primes, our first AC (assistant camera) set up focus for the stage about 70 ft away. However, before the event started, we left focus as-is and did a couple of panning shots of the crowd to the immediate left of the camera position. We feed all information to our system in one packed frame: for all 3 cameras, we feed a synced side-by-side by top-and-bottom metaframe. The center cam is on the left, and the left and right witness cameras are stacked TaB (top-and-bottom) on the right.
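
As a rough illustration of that packing (not our pipeline's actual code), here is a minimal Python sketch that splits such a metaframe back into its three views. It assumes the center view fills the left half at full height, that each witness view occupies half the height of the right half, and that the left witness sits on top. The function name `unpack_metaframe` is hypothetical.

```python
import numpy as np

def unpack_metaframe(meta):
    """Split a packed SbS/TaB metaframe into (center, left_witness, right_witness).

    Assumed layout (an illustration, not a confirmed spec):
      left half, full height  -> center camera
      right half, top half    -> left witness camera
      right half, bottom half -> right witness camera
    """
    height, width = meta.shape[:2]
    half_w = width // 2
    half_h = height // 2
    center = meta[:, :half_w]
    left_witness = meta[:half_h, half_w:]
    right_witness = meta[half_h:, half_w:]
    return center, left_witness, right_witness

if __name__ == "__main__":
    # Dummy frame standing in for one decoded metaframe.
    frame = np.zeros((1080, 3840, 3), dtype=np.uint8)
    center, left_w, right_w = unpack_metaframe(frame)
    print(center.shape, left_w.shape, right_w.shape)
```

The slices are views into the original array, so nothing is copied until a later stage needs its own buffer.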


The first things to note are the witness cameras on the right.  There is a HUGE amount of horizontal and vertical disparity between them – the witness cameras and lenses are quite large.  Of course, the central camera is in between them as well.

Since our baseline interaxial distance was variable (having to slide cameras around to plug and unplug things), and the camera lenses were set to focus about 70 feet away, this resulted in a good acid test:

  1. The footage is variably blurred; nothing is completely in focus except for a brief moment when the very far crowd comes into frame. Not good for disparity estimation.
  2. The lighting is just ambient venue lighting, so there is a good deal of shot noise given the ISO setting. Again, not good.
  3. Because the lenses were set for a distant subject and the crowd was in close proximity, the wide interaxial distance resulted in pretty massive (hundreds of pel) disparities. Combined with the blur in item 1, this usually kills disparity estimation.
  4. The rig as constructed is very front-heavy, resulting in widely divergent vertical disparities from camera to camera.
  5. Because of item 4, vertical disparities oscillate from frame to frame whenever the camera is moved, the immediately surrounding crowd moves about, or the SPL (sound pressure level) shakes the rig.
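
To see why a close subject and a wide interaxial produce such large disparities, recall the standard relation for an idealized rectified stereo pair: disparity = focal length (in pixels) × baseline / depth. The sketch below only illustrates that textbook relation. It assumes parallel, rectified pinhole cameras (which this rig decidedly was not), and the function names are hypothetical.

```python
def focal_length_px(focal_mm, pixel_pitch_mm):
    """Convert a focal length in millimetres to pixels for a given pixel pitch."""
    if pixel_pitch_mm <= 0:
        raise ValueError("pixel pitch must be positive")
    return focal_mm / pixel_pitch_mm

def disparity_px(focal_px, baseline_m, depth_m):
    """Horizontal disparity (pixels) of a point at depth_m for a rectified pair."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m

def depth_from_disparity(focal_px, baseline_m, disparity):
    """Invert the relation: depth (metres) from a measured disparity (pixels)."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity

if __name__ == "__main__":
    # Illustrative values only, not measurements from the shoot.
    f_px = 1000.0
    baseline = 0.5
    for depth in (2.0, 5.0, 20.0):
        print(depth, disparity_px(f_px, baseline, depth))
```

Since disparity is inversely proportional to depth, halving the subject distance doubles the disparity, and doubling the interaxial does the same.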

Any one of these is usually enough to guarantee failure for any stereo disparity analysis system, let alone in combination.  With that said, here’s what we fed our system, and what we got.

This is our depth pass, automatically adjusted to match the central camera:


This has its flaws, but it is solid enough to generate a very good stereoscopic render pass.

Next, here is some of the actual concert footage.  Here’s one of the Side-by-Side/Top-and-Bottom (SbS/TaB) input frames from early in the concert:


So what’s wrong here?  We have:

  1. The crowd – comprising a significant amount of the shot. To say they are not well-lit would be an understatement. Given the lighting and exposure, with the noise almost as “loud” as the image itself, this is at the limit of our system’s ability to gather disparity information. The HVS (Human Visual System) also expects the crowd at the bottom of the frame to have the most negative parallax because of context cues, so a missed depth cue here will stand out plainly.
  2. Recall I talked about how disparity estimation struggles with transparency, mist, smoke, and similar environmental effects. At least 50% of the frame consists of translucent mist machine output, clouds of cigarette smoke from the band, and highly directional lighting that accentuates both.
  3. Flashbulbs – it’s a dark arena, and people are taking flash pictures. Flashbulbs tend to flummox optical flow systems. You might get strobing WITHIN the scanlines of a frame, or strobing from frame to frame, or strobing from left to right, all of which make computing optical flow an absolute nightmare. However, our system seemed to manage it well.
  4. High dynamic range in a dark environment – we processed this from QT proxies at 8-bit depth, not 16 or even 10, so the differential lighting within the frame (and the fact that we’re shooting the lights themselves) causes severe noise and banding sensitivity. This is easily solved by using 16-bit DPX sequences in the final pass, however.
  5. The baseline vertical disparities, the variable vertical disparity with each beat of the drum and footfall of a drunk concertgoer, and the large horizontal baseline disparities make this even more difficult, in spite of the subject being 70 feet away from the camera.

Here is the depthmap for the preceding frame:


We have a depth inpainting algorithm that can help solve instances such as the crowd at the bottom:


In the next outing (Dallas), we tested a variety of lens combinations for the witness cameras and the center camera.

Step 2: Depth From Disparity Using MISMATCHED Witness and Center Cameras and Lenses

For this shoot, we again used 3 Red One MX cameras, but instead of primes we used a combination of zoom lenses along with a new feature of our system: an automatic zoom tracker that independently matches the focal lengths and offsets of the two outboard witness cameras to the center camera. This was a much more ambitious shoot, and included about three hours of footage.
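
To give a feel for what matching focal length and offset involves, here is a minimal Python sketch. It is not our zoom tracker. It assumes you already have matched feature points between a witness view and the center view (how those are found is out of scope here), and it fits a single uniform scale plus a 2D translation by least squares. The name `fit_zoom_and_offset` is hypothetical.

```python
import numpy as np

def fit_zoom_and_offset(witness_pts, center_pts):
    """Least-squares fit of center ~= scale * witness + (tx, ty).

    witness_pts, center_pts: arrays of shape (N, 2) holding matched (x, y)
    pixel coordinates. Returns (scale, (tx, ty)).
    """
    witness_pts = np.asarray(witness_pts, dtype=float)
    center_pts = np.asarray(center_pts, dtype=float)
    if witness_pts.shape != center_pts.shape or witness_pts.shape[0] < 2:
        raise ValueError("need at least two matched point pairs")
    n = witness_pts.shape[0]
    # Two equations per match: one for x, one for y.
    a = np.zeros((2 * n, 3))
    a[0::2, 0] = witness_pts[:, 0]   # scale * x
    a[0::2, 1] = 1.0                 # + tx
    a[1::2, 0] = witness_pts[:, 1]   # scale * y
    a[1::2, 2] = 1.0                 # + ty
    b = center_pts.reshape(-1)       # x0, y0, x1, y1, ...
    solution, *_ = np.linalg.lstsq(a, b, rcond=None)
    scale, tx, ty = solution
    return float(scale), (float(tx), float(ty))

if __name__ == "__main__":
    w = np.array([[0, 0], [100, 0], [0, 100], [100, 100]], dtype=float)
    c = 1.1 * w + np.array([5.0, -3.0])
    print(fit_zoom_and_offset(w, c))
```

In a real stereo rig the horizontal offset recovered this way also absorbs the average scene disparity, and outlier matches would need a robust estimator, so treat this purely as an illustration of the scale-plus-offset model.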

In Dallas, we put together a much more ambitious rig, and had no shortage of opportunity to put it through its paces.  In essence, we had a three camera rig with a really wide interaxial distance, using a variety of lenses including some that would make most stereographers wake up at night in cold sweats.  While our endzone camera position wasn’t ideal for a whole football game, there was plenty for us to shoot; it was well lit, and the venue itself was a perfect playground for this sort of thing (Cowboys Stadium is 1/4 mile from corner to corner).

Compared to the Seattle AIC footage, this was a great deal more complicated because we had, in most cases, 3 zoom lenses on the rig, and these were BIG Angenieux zoom lenses with cinematic, narrow depth-of-field. We were also shooting long aspect from the endzone. In retrospect, we should’ve had a dedicated focus puller (or maybe three), but this gave us a really good acid test. Another confounding factor is that the Angenieux center zoom lens seemed to get significantly slower at longer focal lengths, an eye-opening and unexpected revelation for which we had to improvise a solution.


We put together a melt of about 15 shots, totaling about 6 minutes.  Represented are a variety of lens combinations, focal length all the way to 250mm, fast motion, soft focus — in essence, when you view this, just imagine the take with a native stereo rig.  In fact, if you want the raw witness cameras as a stereo feed, we can put that together as well — if you’re masochistic.

This document shows the zoom tracking and depth maps obtained; the final stereo renders will depend on the way you want to view them stereoscopically. The renders are also a tad on the large side; we would need to sneakernet them in via a portable RAID.


Here was our workflow.

  • EDLs were created by editing timelines with the 720 proxy QuickTime reference files in FCP.
  • A conform with no color correction was created in FCP from the Red RAW 1080p proxies, via QuickTime reference files, to raw YUV 10-bit MOV.
  • We concatenated the center and left/right camera feeds in SbS/TaB format for ingest to our pipeline, same as with the Seattle AIC footage, although this time we used 3840×1080.
  • Focal length and baseline offset matching was performed as an initial pass. Our current system can perform this at 3840×1080 at approximately 3 FPS, GPU limited.  We expect that this can be optimized to realtime (24 fps) with a bit of effort.
  • Our depth estimation and depth-image-based rendering were performed as an intermediate pass, at near-real-time (20 fps; we were limited by disk I/O, not memory or arithmetic). Stereo floating window and rule-based primary color correction were performed by our pipeline in this pass as well, generating a full-resolution, color-corrected conform left/right stereoscopic pair at 1920×1080.
  • DCP wrap was performed as the final pass for stereoscopic theatrical review at Dolby Labs in Burbank and OffHollywood in NYC.
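
For a sense of scale, the pass rates above translate into wall-clock time as follows. This is back-of-the-envelope arithmetic only: it assumes the roughly 6 minute melt was exactly 360 seconds at 24 fps, and takes the 3 fps and 20 fps figures at face value. The helper name is hypothetical.

```python
def pass_duration_seconds(footage_seconds, capture_fps, processing_fps):
    """Wall-clock seconds needed to process footage at a given processing rate."""
    if processing_fps <= 0:
        raise ValueError("processing rate must be positive")
    frames = footage_seconds * capture_fps
    return frames / processing_fps

if __name__ == "__main__":
    melt_seconds = 6 * 60  # assumed: the melt is exactly 6 minutes
    matching = pass_duration_seconds(melt_seconds, 24, 3)   # zoom/offset matching pass
    depth = pass_duration_seconds(melt_seconds, 24, 20)     # depth and rendering pass
    print(f"matching pass: {matching / 60:.1f} min")
    print(f"depth pass:    {depth / 60:.1f} min")
```

Under those assumptions the matching pass takes about 48 minutes and the depth pass about 7.2 minutes, which is why the matching pass is the obvious target for optimization.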

Please note that no per-shot configurations were done (with the exception of editing, of course). All of the passes performed above were unattended. We had no metadata whatsoever. We were shooting a worst-case scenario, and this was intentional.

What follows are sample frames from about 3 hours of shooting:

Shot 1: Kitna 20 yard pass away from camera position.  Very narrow DOF center lens, zoom lenses on witnesses, fixed.  Nothing pathological. Good depth separation of players, field positions.  Background depth cues are solid as well.

Frame 0: zoom tracking

Frame 0: RGB+Z, pre color corrected

Shot 2: Long focal-length shot of cheerleaders along the near endzone mezzanine. Lots of focus pulls, panning, lots of zooming. All at long FL relative to the witness cameras; the zoomed witness cameras are 25% off vertical alignment in some frames. This is Worst-Case-Scenario. We’re at the limit of our aperture of analysis. Occasionally, a frame falls down, but our temporal model corrects it before it becomes a major eye rip.

Frame 277: zoom tracking

Frame 277: RGB+Z, pre color corrected

Shot 4: Local crowd pan. Focal length variations, some wonky focus. The resulting depth and disparity cues are consistent: bottom left is near, top right is far. Good object separation, good muting of eye rips due to panning.

Frame 1086: zoom tracking

Frame 1086: RGB+Z, pre color corrected

Shot 5: Long pass toward camera position. BIG focal length changes, focus changes. Mismatched focus, speed, and focal length. Still comes off as a well-done stereo shot. There are a few frames where the zoom mis-tracks, but the rest of the system makes up for it when we correlate to the center.

Frame 1335: zoom tracking (note that right camera track mis-zoomed by 10%)

Frame 1335: RGB+Z, pre color corrected

Shot 13: NOMNOM girl. Self-explanatory.

Frame 3600: zoom tracking

Frame 3600: RGB+Z, pre color corrected

With these results, we decided to take what we had learned, correct the issues, and take on something even more ambitious.

We will continue the series on our work with stereoscopic cameras and their contribution to stereoscopic 360 degree video capture in Part 3. In the last part of the series we cover Steps 3 through 6 and conclude:

  • Depth From Disparity Using HANDHELD Witness and Center Cameras and Lenses, multiple camera positions
  • Baseline Camera Interop Tests
  • Witness Cameras at Slower Framerates Than The Primary Camera
  • Final step: Mixed Models Including Motion Depth Plus Stereo Depth
  • Interesting Observations and Highlights

~ by opticalflow on May 11, 2016.
