Netflix VOID: Open-Source AI That Removes Objects From Video and Rewrites Physics (2026)
Table of Contents
- What VOID Does That Others Cannot
- How VOID Works Under the Hood
- The Mask Generation Pipeline
- Training Data: Synthetic Physics Simulations
- Performance: VOID vs. the Competition
- What You Need to Run It
- Why Netflix Built This
- What This Means for Video Editors and Creators
- The Bigger Picture: Streaming Companies Building AI Tools
- Should You Try VOID?
- Getting Started
- The Bottom Line
On April 3, 2026, Netflix published its first open-weight AI model on Hugging Face. VOID (Video Object and Interaction Deletion) does something no other public model has done well: it removes objects from video footage and then simulates how the rest of the scene would behave without them.
Remove a person holding a guitar, and the guitar falls. Erase someone jumping into a pool, and the splash vanishes, the water settles, the poolside stays dry. This goes far beyond pixel-filling. VOID models the causal chain between objects in a scene, then generates new frames showing physics-plausible outcomes.
The model is free. The weights sit on Hugging Face. The code lives on GitHub. And within 16 hours of the announcement, the Reddit post on r/LocalLLaMA had pulled in over 800 upvotes and 125 comments.
What VOID Does That Others Cannot
Tools like Runway, ProPainter, and DiffuEraser have handled video object removal for years. They mask out objects and fill in the gap. They handle shadows and reflections with reasonable accuracy.
But they all share a blind spot: physical interactions.
When you remove a person carrying a mug using traditional inpainting, the mug vanishes too. Or worse, it floats in mid-air where the person's hand used to be. The scene loses physical coherence because the tool treats every pixel as independent.
VOID treats pixels as part of a physics system. Its two-stage pipeline first identifies what the removed object was doing to its surroundings, then generates replacement frames where those interactions resolve according to physical laws.
The researchers call this "counterfactual reasoning." Given a scene where Event A (person holding guitar) leads to State B (guitar stays upright), VOID computes State C (guitar with no support, so it tips over and falls). The model learned this behavior from paired training data: videos with objects present alongside videos of the same scene with objects absent.
How VOID Works Under the Hood
The architecture builds on CogVideoX-Fun-V1.5-5b-InP, a 5-billion-parameter 3D transformer from Alibaba. Netflix fine-tuned it for interaction-aware video inpainting using a system they call "quadmask conditioning."
A quadmask encodes four regions per pixel:
- Value 0: The primary object to remove
- Value 63: Overlap between the object and affected areas
- Value 127: Affected regions (objects that would move, fall, or shift after removal)
- Value 255: Background to keep unchanged
This four-value mask gives the model explicit information about which parts of the scene depend on the removed object. Standard inpainting masks use binary values (remove or keep). The quadmask adds nuance that drives physics-aware generation.
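The four-value encoding above is simple to construct from two binary masks. The helper below is a hypothetical illustration (not VOID's actual code) of how an object mask and an affected-region mask for one frame could be composited into a quadmask:

```python
import numpy as np

# Illustrative quadmask composition, assuming boolean per-frame masks:
# object_mask is True where the object to remove sits, affected_mask is
# True where surrounding pixels would change after removal.
def make_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    quad = np.full(object_mask.shape, 255, dtype=np.uint8)  # background: keep
    quad[affected_mask] = 127                               # affected regions
    quad[object_mask] = 0                                   # object to remove
    quad[object_mask & affected_mask] = 63                  # overlap
    return quad

# Toy 4x4 frame: the object fills the top-left 2x2; the affected region
# overlaps its bottom row and extends one row further down.
obj = np.zeros((4, 4), dtype=bool); obj[:2, :2] = True
aff = np.zeros((4, 4), dtype=bool); aff[1:3, :2] = True
print(make_quadmask(obj, aff))
```

Assignment order matters: the overlap value (63) is written last so it wins over the plain object (0) and affected (127) values wherever both masks are True.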
Inference runs in two passes:
Pass 1 handles the core inpainting. The model takes the masked video, reads a text prompt describing the desired background, and generates replacement frames. Pass 2 refines temporal consistency. It uses optical flow from the Pass 1 output to create warped-noise latents, then runs a second generation pass initialized from those latents. Pass 2 is optional but improves smoothness on longer clips.
The Mask Generation Pipeline
Creating quadmasks by hand would be tedious, so VOID ships with a VLM-powered mask pipeline. It uses Meta's SAM2 for object segmentation and Google's Gemini for reasoning about affected regions.
The workflow: point at the object you want removed, SAM2 segments it across frames, Gemini analyzes which surrounding regions would change after removal, and the pipeline composites everything into a quadmask video. Four stages, automated end-to-end once you select the target.
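Those four stages can be sketched as a small orchestration skeleton. Everything below is illustrative: the helper names and signatures are stand-ins, not VOID's real API, and both stages are stubbed (SAM2's video predictor and the Gemini call would replace them) so the control flow runs end to end:

```python
import numpy as np

# Stub for stage 1-2: SAM2 would segment and track the clicked object
# across frames. Here we mark only the clicked pixel.
def segment_object(frames, click_point):
    h, w = frames[0].shape[:2]
    masks = []
    for _ in frames:
        m = np.zeros((h, w), dtype=bool)
        m[click_point[0], click_point[1]] = True
        masks.append(m)
    return masks

# Stub for stage 3: Gemini would reason about which surrounding regions
# change after removal. Here we return empty "affected" masks.
def reason_affected(frames, object_masks):
    return [np.zeros_like(m) for m in object_masks]

def build_quadmask_video(frames, click_point):
    obj = segment_object(frames, click_point)      # stages 1-2
    aff = reason_affected(frames, obj)             # stage 3
    quads = []
    for o, a in zip(obj, aff):                     # stage 4: composite
        q = np.full(o.shape, 255, dtype=np.uint8)  # background
        q[a] = 127; q[o] = 0; q[o & a] = 63        # quadmask values
        quads.append(q)
    return quads

frames = [np.zeros((8, 8, 3), dtype=np.uint8)] * 4
quads = build_quadmask_video(frames, click_point=(3, 3))
print(len(quads))  # one quadmask frame per input frame
```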
Training Data: Synthetic Physics Simulations
The training data came from two synthetic sources:
HUMOTO generated human-object interaction videos in Blender with physics simulation. Render a person picking up a cup, then render the same scene without the person. The cup falls. Now you have a training pair.
Kubric (Google's synthetic data pipeline) produced object-only interaction videos using 3D scanned objects. Push a ball off a table, record both versions. Same principle, no humans required.
This paired approach gave VOID ground-truth counterfactual data. The model learned what "should" happen when objects lose support, contact, or containment, because it trained on thousands of physics-simulated examples showing that outcome.
Training ran on 8x NVIDIA A100 GPUs (80GB each) using DeepSpeed ZeRO Stage 2.
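The DeepSpeed side of that setup can be sketched as a minimal configuration. The values below are illustrative guesses, not VOID's published training config; only the ZeRO stage and BF16 settings come from the article:

```python
# Minimal DeepSpeed config sketch for ZeRO Stage 2 with BF16 training.
# Batch size and the tuning flags are assumptions for illustration.
ds_config = {
    "train_batch_size": 8,          # e.g. 1 per GPU across 8x A100
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer state and gradients
        "overlap_comm": True,       # overlap comms with backward pass
        "contiguous_gradients": True,
    },
}
# Typically passed as the config argument to deepspeed.initialize(...)
print(ds_config["zero_optimization"]["stage"])
```

ZeRO Stage 2 shards optimizer state and gradients across the 8 GPUs while keeping a full copy of the parameters on each, a common middle ground for models in the 5B range.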
Performance: VOID vs. the Competition
Netflix ran a human preference study with 25 evaluators across multiple real-world video scenarios. Evaluators watched object-removal results from VOID and four competitors, then picked which output looked most convincing.
The results, from the arXiv paper (as of April 2026):
- VOID: Chosen 64.8% of the time
- Runway (Aleph): 18.4%
- Generative Omnimatte: Single digits
- ProPainter: Single digits
- DiffuEraser: Single digits
A 3.5x preference gap over Runway, the closest commercial competitor. The gap widened on scenes with complex physical interactions, like removing a person from a collision or erasing an object mid-fall.
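The headline multiple follows directly from the two preference shares quoted above:

```python
# Sanity check of the 3.5x figure from the study numbers.
void_share, runway_share = 64.8, 18.4
print(round(void_share / runway_share, 1))  # → 3.5
```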
What You Need to Run It
VOID is free to download and run, but the hardware bar is high.
Minimum requirements:
- GPU with 40GB+ VRAM (NVIDIA A100 or equivalent)
- Python environment with PyTorch
- SAM2 installed for mask generation
- Gemini API key for the VLM mask pipeline
Default inference settings:
- Resolution: 384x672
- Maximum frames: 197
- Scheduler: DDIM with 50 denoising steps
- Precision: BF16, with FP8 quantization for memory efficiency
For most independent creators, running VOID on local hardware means an RTX 4090 at minimum (24GB VRAM with aggressive quantization) or cloud GPU rental. An A100 instance on major cloud providers runs $1 to $3 per hour, as of April 2026. Netflix provides a Google Colab notebook that handles setup and runs inference on sample videos.
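The 5-billion-parameter base model makes the VRAM floor easy to sketch. This back-of-envelope estimate covers weights only; activations, latents, and the VAE add substantially more, which is why 24GB with aggressive quantization is the practical minimum:

```python
# Weight memory for a 5B-parameter model at the precisions listed above.
params = 5e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter
print(f"BF16 weights: {bf16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```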
Why Netflix Built This
Netflix does not sell video editing tools. So why release a state-of-the-art video inpainting model?
The strongest reason is production cost. Netflix produces hundreds of original films and series per year, with post-production budgets running into millions per title. A tool that automates object removal with physics accuracy could save significant VFX hours across their slate.
Recruiting talent plays a role too. Open-sourcing a model that beats Runway at its own game attracts researchers. The paper credits six authors spanning Netflix and Sofia University, and publishing on Hugging Face with full training code puts that work in front of the global ML community. That visibility also boosts Netflix Research's credibility in computer vision, a field where they have published less than in recommendation systems or streaming optimization.
The model ships under open weights (available on Hugging Face), though users should check the specific license terms on the model card before commercial deployment.
What This Means for Video Editors and Creators
The practical impact splits along a hardware line.
If you have GPU access (cloud or local A100-class hardware), VOID gives you production-quality object removal for free. No subscription, no per-minute credits, no watermarks. For studios and freelancers who process high volumes of footage, the economics change when the tool costs compute time instead of SaaS fees. Runway charges $12/month on its Standard plan with annual billing (as of April 2026), with credit-based pricing on top for intensive operations. Commercial alternatives from Pika and Kling AI follow similar models. VOID eliminates the software cost, shifting expense to infrastructure.
If you lack GPU access, the 40GB VRAM requirement puts VOID out of reach for casual use. The Colab notebook helps for experimentation, but sustained production work demands dedicated hardware. For occasional object removal, Runway or Descript remain more practical options with lower entry barriers.
The Bigger Picture: Streaming Companies Building AI Tools
Netflix joining the open-source AI model space signals a shift. Content companies are transitioning from AI consumers to AI producers.
Netflix follows Disney, which invested in production AI through Industrial Light & Magic, and Amazon, which built AI capabilities across Alexa and AWS. But Netflix went further by publishing a research-grade model that competes head-to-head with funded AI startups.
The pattern matters because these companies control the production pipelines where AI video tools get tested at scale. Netflix processes petabytes of video content. If VOID works on their internal footage, the quality bar reflects real production scenarios, not demo reels.
For the open-source AI video community, VOID fills a gap. Sora from OpenAI handles generation, and HeyGen dominates avatar video, but no open model tackled physics-aware object removal until now. Luma Dream Machine covers style transfer on the creative side. VOID owns the removal-and-reconstruction niche with a model anyone can download and deploy.
Should You Try VOID?
Yes, if:
- You work in post-production and need object removal on a regular basis
- You have access to A100-class GPUs (cloud or local)
- You want to experiment with interaction-aware inpainting for VFX pipelines
- You build video editing tools and need an open model to integrate
Skip it, if:
- You need quick, occasional object removal (use Runway or Descript instead)
- Your GPU has less than 24GB VRAM
- You want a polished UI rather than a CLI pipeline
VOID ships as a research model with a CLI interface, Colab notebook, and Python scripts. No drag-and-drop UI exists yet. Someone will build one. Until then, VOID serves technical users and teams with ML infrastructure.
Getting Started
- Clone the GitHub repo
- Download checkpoints from Hugging Face (voidpass1.safetensors required, voidpass2.safetensors optional)
- Install requirements and the base CogVideoX model
- Set up SAM2 and a Gemini API key for mask generation
- Run the included sample videos to verify your setup
- Process your own footage using the mask generation pipeline
The Colab notebook at notebook.ipynb handles steps 1 through 5 in a single click.
The Bottom Line
Netflix shipped a video object removal model that beat Runway in human preference tests by a factor of 3.5x, then gave it away for free. VOID handles the hard problem that other tools avoid: modeling what happens to a scene after you delete something from it.
The 40GB VRAM requirement limits who can run it today. But quantized forks tend to appear within weeks of a major release, and hosted demos often follow within a month. Within months, VOID's physics-aware inpainting could become accessible to any creator with a mid-range GPU.
For now, VOID represents the most capable open-weight video inpainting model available. If you edit video for a living, download it and run the sample notebook this weekend.