I sent an email to Ross a while back asking for his thoughts, but I haven't gotten a response. So I guess I'll just do a public post here.
The short version is: Have we considered using voxels?
If anyone here isn't familiar with them, voxels are basically just pixels in 3D space. You've probably seen them in some retro-esque 3D games that are going for a visual style similar to something like Minecraft. Cube World is probably the most obvious example I can think of.
However, voxels have much, MUCH more versatility than just blocky retro-throwback stuff. Computers now can actually render really detailed models using voxel geometry instead of polygons. And for a while, some lesser-known titles were using voxel-based engines as opposed to polygonal ones to try and get more detailed environments. Nowadays, because polygons have become the standard and additional levels of detail aren't really that big of a deal anymore, voxels have kind of fallen into a niche. Usually they're used for simulation games where updating polygonal 3D models on the fly isn't really an option. I'm sure a lot of you are familiar with Teardown, and in my search I also found this:
However, both of these games seem to be low-balling what voxels are capable of.
Here's a tech demo I found of a tank, and this is from over five years ago:
And here's an example of voxels being used to create realistic terrain on the Nintendo DS:
I bring all this up because I feel like this is the easiest compromise for reading a 3D world off of a video file. For one, the AI wouldn't have to carefully construct polygonal models for everything it finds. It just needs to replicate what it sees: pixels. And two, we don't have the special engines with advanced optimization tricks needed to generate those huge, detailed open world games... but voxels are capable of having a very crude level of detail system. As an object gets further away, you just merge its voxels into fewer, larger ones.
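To make the "merge the voxels" idea a bit more concrete, here's a rough sketch in Python. Everything in it is my own assumption for illustration: the names `merge_voxels` and `pick_lod`, storing voxels as a dict of integer coordinates to RGB colors, averaging the colors when merging, and the distance thresholds. A real engine would almost certainly use something like an octree instead, so treat this as a toy version of the idea, not how any actual game does it.

```python
from collections import defaultdict


def merge_voxels(voxels, factor=2):
    """Merge a sparse voxel grid into cells that are `factor` times larger.

    `voxels` maps integer (x, y, z) coordinates to (r, g, b) colors.
    Each coarse cell gets the average color of the filled voxels inside it
    (averaging is an assumption; other blending rules would work too).
    """
    buckets = defaultdict(list)
    for (x, y, z), color in voxels.items():
        # Floor division groups neighbouring voxels into the same coarse cell.
        buckets[(x // factor, y // factor, z // factor)].append(color)

    coarse = {}
    for cell, colors in buckets.items():
        n = len(colors)
        coarse[cell] = tuple(sum(c[i] for c in colors) // n for i in range(3))
    return coarse


def pick_lod(distance, base_distance=16.0, max_level=4):
    """Pick how many times to merge, doubling the threshold per level.

    The threshold values are made up for this example.
    """
    level = 0
    threshold = base_distance
    while distance >= threshold and level < max_level:
        level += 1
        threshold *= 2
    return level


def voxels_at_distance(voxels, distance):
    """Apply `merge_voxels` once per level of detail chosen by `pick_lod`."""
    for _ in range(pick_lod(distance)):
        voxels = merge_voxels(voxels)
    return voxels
```

So an object right in front of you keeps every voxel, and one that's far away gets merged a few times over. Each merge can cut the voxel count by up to 8x, which is where the savings would come from.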
The main thing I'm worried about would be Ross's obsession with anti-aliasing, because (and correct me if I'm wrong) I think most of the proper anti-aliasing systems in games these days rely on the model and texture data to get the best result. FXAA could probably be a decent general substitute, but if you're in VR and you mash your face into something, it's just gonna look weird.