Netflix VOID: Open-Source AI That Removes Objects From Video and Rewrites Physics (2026)
Table of Contents
- What VOID Does That Others Cannot
- How VOID Works Under the Hood
- The Mask Generation Pipeline
- Training Data: Synthetic Physics Simulations
- Performance: VOID vs. the Competition
- What You Need to Run It
- Why Netflix Built This
- What This Means for Video Editors and Creators
- The Bigger Picture: Streaming Companies Building AI Tools
- Should You Try VOID?
- Getting Started
- The Bottom Line
On April 3, 2026, Netflix published its first open-weight AI model on Hugging Face. VOID (Video Object and Interaction Deletion) does something no other public model has done well: it removes objects from video footage and then simulates how the rest of the scene would behave without them.
Remove a person holding a guitar, and the guitar falls. Erase someone jumping into a pool, and the splash vanishes, the water settles, the poolside stays dry. This goes far beyond pixel-filling. VOID models the causal chain between objects in a scene, then generates new frames showing physics-plausible outcomes.
The model is free. The weights sit on Hugging Face. The code lives on GitHub. And within 16 hours of the announcement, the Reddit post on r/LocalLLaMA had pulled in over 800 upvotes and 125 comments.
What VOID Does That Others Cannot
Tools like Runway, ProPainter, and DiffuEraser have handled video object removal for years. They mask out objects and fill in the gap. They handle shadows and reflections with reasonable accuracy.
But they all share a blind spot: physical interactions.
When you remove a person carrying a mug using traditional inpainting, the mug vanishes too. Or worse, it floats in mid-air where the person's hand used to be. The scene loses physical coherence because the tool treats every pixel as independent.
VOID treats pixels as part of a physics system. Its two-stage pipeline first identifies what the removed object was doing to its surroundings, then generates replacement frames where those interactions resolve according to physical laws.
The researchers call this "counterfactual reasoning." Given a scene where Event A (person holding guitar) leads to State B (guitar stays upright), VOID computes State C (guitar with no support, so it tips over and falls). The model learned this behavior from paired training data: videos with objects present alongside videos of the same scene with objects absent.
How VOID Works Under the Hood
The architecture builds on CogVideoX-Fun-V1.5-5b-InP, a 5-billion-parameter 3D transformer from Alibaba. Netflix fine-tuned it for interaction-aware video inpainting using a system they call "quadmask conditioning."
A quadmask encodes four regions per pixel:
- Value 0: The primary object to remove
- Value 63: Overlap between the object and affected areas
- Value 127: Affected regions (objects that would move, fall, or shift after removal)
- Value 255: Background to keep unchanged
This four-value mask gives the model explicit information about which parts of the scene depend on the removed object. Standard inpainting masks use binary values (remove or keep). The quadmask adds nuance that drives physics-aware generation.
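The four-value encoding above is simple to construct from two binary masks. The helper below is a hypothetical illustration (not VOID's actual code) of how an object mask and an affected-region mask for one frame could be composited into a quadmask:

```python
import numpy as np

# Illustrative quadmask composition, assuming boolean per-frame masks:
# object_mask is True where the object to remove sits, affected_mask is
# True where surrounding pixels would change after removal.
def make_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    quad = np.full(object_mask.shape, 255, dtype=np.uint8)  # background: keep
    quad[affected_mask] = 127                               # affected regions
    quad[object_mask] = 0                                   # object to remove
    quad[object_mask & affected_mask] = 63                  # overlap
    return quad

# Toy 4x4 frame: the object fills the top-left 2x2; the affected region
# overlaps its bottom row and extends one row further down.
obj = np.zeros((4, 4), dtype=bool); obj[:2, :2] = True
aff = np.zeros((4, 4), dtype=bool); aff[1:3, :2] = True
print(make_quadmask(obj, aff))
```

Assignment order matters: the overlap value (63) is written last so it wins over the plain object (0) and affected (127) values wherever both masks are True.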
Inference runs in two passes:
Pass 1 handles the core inpainting. The model takes the masked video, reads a text prompt describing the desired background, and generates replacement frames. Pass 2 refines temporal consistency. It uses optical flow from the Pass 1 output to create warped-noise latents, then runs a second generation pass initialized from those latents. Pass 2 is optional but improves smoothness on longer clips.
The Mask Generation Pipeline
Creating quadmasks by hand would be tedious, so VOID ships with a VLM-powered mask pipeline. It uses Meta's SAM2 for object segmentation and Google's Gemini for reasoning about affected regions.
The workflow: point at the object you want removed, SAM2 segments it across frames, Gemini analyzes which surrounding regions would change after removal, and the pipeline composites everything into a quadmask video. Four stages, automated end-to-end once you select the target.
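Those four stages can be sketched as a small orchestration skeleton. Everything below is illustrative: the helper names and signatures are stand-ins, not VOID's real API, and both stages are stubbed (SAM2's video predictor and the Gemini call would replace them) so the control flow runs end to end:

```python
import numpy as np

# Stub for stage 1-2: SAM2 would segment and track the clicked object
# across frames. Here we mark only the clicked pixel.
def segment_object(frames, click_point):
    h, w = frames[0].shape[:2]
    masks = []
    for _ in frames:
        m = np.zeros((h, w), dtype=bool)
        m[click_point[0], click_point[1]] = True
        masks.append(m)
    return masks

# Stub for stage 3: Gemini would reason about which surrounding regions
# change after removal. Here we return empty "affected" masks.
def reason_affected(frames, object_masks):
    return [np.zeros_like(m) for m in object_masks]

def build_quadmask_video(frames, click_point):
    obj = segment_object(frames, click_point)      # stages 1-2
    aff = reason_affected(frames, obj)             # stage 3
    quads = []
    for o, a in zip(obj, aff):                     # stage 4: composite
        q = np.full(o.shape, 255, dtype=np.uint8)  # background
        q[a] = 127; q[o] = 0; q[o & a] = 63        # quadmask values
        quads.append(q)
    return quads

frames = [np.zeros((8, 8, 3), dtype=np.uint8)] * 4
quads = build_quadmask_video(frames, click_point=(3, 3))
print(len(quads))  # one quadmask frame per input frame
```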
Training Data: Synthetic Physics Simulations
The training data came from two synthetic sources:
HUMOTO generated human-object interaction videos in Blender with physics simulation. Render a person picking up a cup, then render the same scene without the person. The cup falls. Now you have a training pair.
Kubric (Google's synthetic data pipeline) produced object-only interaction videos using 3D scanned objects. Push a ball off a table, record both versions. Same principle, no humans required.
This paired approach gave VOID ground-truth counterfactual data. The model learned what "should" happen when objects lose support, contact, or containment, because it trained on thousands of physics-simulated examples showing that outcome.
Training ran on 8x NVIDIA A100 GPUs (80GB each) using DeepSpeed ZeRO Stage 2.
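The DeepSpeed side of that setup can be sketched as a minimal configuration. The values below are illustrative guesses, not VOID's published training config; only the ZeRO stage and BF16 settings come from the article:

```python
# Minimal DeepSpeed config sketch for ZeRO Stage 2 with BF16 training.
# Batch size and the tuning flags are assumptions for illustration.
ds_config = {
    "train_batch_size": 8,          # e.g. 1 per GPU across 8x A100
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer state and gradients
        "overlap_comm": True,       # overlap comms with backward pass
        "contiguous_gradients": True,
    },
}
# Typically passed as the config argument to deepspeed.initialize(...)
print(ds_config["zero_optimization"]["stage"])
```

ZeRO Stage 2 shards optimizer state and gradients across the 8 GPUs while keeping a full copy of the parameters on each, a common middle ground for models in the 5B range.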
Performance: VOID vs. the Competition
Netflix ran a human preference study with 25 evaluators across multiple real-world video scenarios. Evaluators watched object-removal results from VOID and four competitors, then picked which output looked most convincing.
The results, from the arXiv paper (as of April 2026):
- VOID: Chosen 64.8% of the time
- Runway (Aleph): 18.4%
- Generative Omnimatte: Single digits
- ProPainter: Single digits
- DiffuEraser: Single digits
A 3.5x preference gap over Runway, the closest commercial competitor. The gap widened on scenes with complex physical interactions, like removing a person from a collision or erasing an object mid-fall.
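The headline multiple follows directly from the two preference shares quoted above:

```python
# Sanity check of the 3.5x figure from the study numbers.
void_share, runway_share = 64.8, 18.4
print(round(void_share / runway_share, 1))  # → 3.5
```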
What You Need to Run It
VOID is free to download and run, but the hardware bar is high.
Minimum requirements:
- GPU with 40GB+ VRAM (NVIDIA A100 or equivalent)
- Python environment with PyTorch
- SAM2 installed for mask generation
- Gemini API key for the VLM mask pipeline
Default inference settings:
- Resolution: 384x672
- Maximum frames: 197
- Scheduler: DDIM with 50 denoising steps
- Precision: BF16, with FP8 quantization for memory efficiency
For most independent creators, running VOID on local hardware means an RTX 4090 at minimum (24GB VRAM with aggressive quantization) or cloud GPU rental. An A100 instance on major cloud providers runs $1 to $3 per hour, as of April 2026. Netflix provides a Google Colab notebook that handles setup and runs inference on sample videos.
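The 5-billion-parameter base model makes the VRAM floor easy to sketch. This back-of-envelope estimate covers weights only; activations, latents, and the VAE add substantially more, which is why 24GB with aggressive quantization is the practical minimum:

```python
# Weight memory for a 5B-parameter model at the precisions listed above.
params = 5e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter
print(f"BF16 weights: {bf16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```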
Why Netflix Built This
Netflix does not sell video editing tools. So why release a state-of-the-art video inpainting model?
The strongest reason is production cost. Netflix produces hundreds of original films and series per year, with post-production budgets running into millions per title. A tool that automates object removal with physics accuracy could save significant VFX hours across their slate.
Recruiting talent plays a role too. Open-sourcing a model that beats Runway at its own game attracts researchers. The paper credits six authors spanning Netflix and Sofia University, and publishing on Hugging Face with full training code puts that work in front of the global ML community. That visibility also boosts Netflix Research's credibility in computer vision, a field where they have published less than in recommendation systems or streaming optimization.
The model ships under open weights (available on Hugging Face), though users should check the specific license terms on the model card before commercial deployment.
What This Means for Video Editors and Creators
The practical impact splits along a hardware line.
If you have GPU access (cloud or local A100-class hardware), VOID gives you production-quality object removal for free. No subscription, no per-minute credits, no watermarks. For studios and freelancers who process high volumes of footage, the economics change when the tool costs compute time instead of SaaS fees. Runway charges $12/month on its Standard plan with annual billing (as of April 2026), with credit-based pricing on top for intensive operations. Commercial alternatives from Pika and Kling AI follow similar models. VOID eliminates the software cost, shifting expense to infrastructure.
If you lack GPU access, the 40GB VRAM requirement puts VOID out of reach for casual use. The Colab notebook helps for experimentation, but sustained production work demands dedicated hardware. For occasional object removal, Runway or Descript remain more practical options with lower entry barriers.
The Bigger Picture: Streaming Companies Building AI Tools
Netflix joining the open-source AI model space signals a shift. Content companies are transitioning from AI consumers to AI producers.
Netflix follows Disney, which invested in production AI through Industrial Light & Magic, and Amazon, which built AI capabilities across Alexa and AWS. But Netflix went further by publishing a research-grade model that competes head-to-head with funded AI startups.
The pattern matters because these companies control the production pipelines where AI video tools get tested at scale. Netflix processes petabytes of video content. If VOID works on their internal footage, the quality bar reflects real production scenarios, not demo reels.
For the open-source AI video community, VOID fills a gap. Sora from OpenAI handles generation, and HeyGen dominates avatar video, but no open model tackled physics-aware object removal until now. Luma Dream Machine covers style transfer on the creative side. VOID owns the removal-and-reconstruction niche with a model anyone can download and deploy.
Should You Try VOID?
Yes, if:
- You work in post-production and need object removal on a regular basis
- You have access to A100-class GPUs (cloud or local)
- You want to experiment with interaction-aware inpainting for VFX pipelines
- You build video editing tools and need an open model to integrate
Skip it, if:
- You need quick, occasional object removal (use Runway or Descript instead)
- Your GPU has less than 24GB VRAM
- You want a polished UI rather than a CLI pipeline
VOID ships as a research model with a CLI interface, Colab notebook, and Python scripts. No drag-and-drop UI exists yet. Someone will build one. Until then, VOID serves technical users and teams with ML infrastructure.
Getting Started
- Clone the GitHub repo
- Download checkpoints from Hugging Face (voidpass1.safetensors required, voidpass2.safetensors optional)
- Install requirements and the base CogVideoX model
- Set up SAM2 and a Gemini API key for mask generation
- Run the included sample videos to verify your setup
- Process your own footage using the mask generation pipeline
The Colab notebook at notebook.ipynb handles steps 1 through 5 in a single click.
The Bottom Line
Netflix shipped a video object removal model that beat Runway in human preference tests by a factor of 3.5x, then gave it away for free. VOID handles the hard problem that other tools avoid: modeling what happens to a scene after you delete something from it.
The 40GB VRAM requirement limits who can run it today. But quantized forks tend to appear within weeks of a major release, and hosted demos often follow within a month. Within months, VOID's physics-aware inpainting could become accessible to any creator with a mid-range GPU.
For now, VOID represents the most capable open-weight video inpainting model available. If you edit video for a living, download it and run the sample notebook this weekend.