Comprehensive analysis of FDM-1's strengths and weaknesses based on real user feedback and expert evaluation.
First computer-use foundation model trained on internet-scale video (11M hours); the largest open computer-use dataset, by comparison, contains under 20 hours of 30 FPS video
Native 30 FPS video processing enables continuous control like smooth mouse movement and CAD operations rather than discrete screenshot-by-screenshot reasoning
Highly efficient video encoder compresses nearly 2 hours of footage into just 1M tokens, unlocking minute-scale context windows
Unsupervised training via the inverse dynamics model removes the bottleneck of expensive contractor-labeled screenshots
Test-time compute via OS checkpoints / forking VMs lets the model retry from validated intermediate states on long-horizon tasks
Demonstrably general: the same model performs CAD modeling, website fuzzing, and real-world driving without task-specific RL environments
6 major strengths make FDM-1 stand out in the automation category.
No public API, pricing page, or self-serve access; the model is gated to enterprise and research partners
Capabilities are demonstrated through curated video clips rather than peer-reviewed benchmarks against established computer-use leaderboards
Released February 23, 2026, so production track record, reliability, and safety guardrails are unproven at scale
Inference at 30 FPS over minute-long video contexts implies significant GPU cost, and serving costs have not been disclosed publicly
No documentation of supported operating systems, integrations, or developer tooling beyond the research blog post
5 areas for improvement that potential users should consider.
FDM-1 has potential but comes with notable limitations. With no free tier or trial available, evaluate it through an enterprise or research partnership where possible, and compare closely with alternatives in the automation space.
FDM-1 is a foundation model for general computer use built by Standard Intelligence (Standard Intelligence PBC), announced February 23, 2026. Unlike prior computer-use agents that fine-tune a vision-language model on screenshots, FDM-1 trains and infers directly on video at 30 FPS. It was trained on a portion of an 11-million-hour screen recording dataset labeled by a custom inverse dynamics model. The team positions it as the first fully general computer action model.
Traditional computer-use agents fine-tune a VLM on contractor-annotated screenshots, which limits them to a few seconds of context, low framerates, and short-horizon tasks. FDM-1 instead trains directly on 30 FPS video and uses a video encoder that compresses ~2 hours into 1M tokens, giving it minute-scale context. It also avoids per-task reinforcement learning environments, learning unsupervised from the open internet's video corpus. Based on our analysis of 870+ AI tools, this is the only Automation entry that trains a custom video foundation model end-to-end for computer use.
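To put the compression claim in perspective, here is a quick back-of-envelope calculation. The 30 FPS, ~2-hour, and 1M-token figures come from the launch post; the per-frame breakdown and the dataset-scale ratio are our own arithmetic, not disclosed specs.

```python
# Back-of-envelope token budget for FDM-1's video encoder.
# The input figures are quoted from the launch post; everything
# derived below is our own arithmetic, not a published spec.

FPS = 30
CONTEXT_HOURS = 2            # "~2 hours of footage"
CONTEXT_TOKENS = 1_000_000   # "1M tokens"

frames = FPS * CONTEXT_HOURS * 3600
tokens_per_frame = CONTEXT_TOKENS / frames

print(f"frames in context: {frames:,}")              # 216,000
print(f"tokens per frame:  {tokens_per_frame:.2f}")  # ~4.63

# Scale of the training corpus vs. the largest open dataset:
print(f"data ratio: {11_000_000 / 20:,.0f}x")        # 550,000x
```

Under five tokens per frame is far below the hundreds of tokens per image typical of screenshot-based VLM agents, which is what makes minute-scale context at 30 FPS feasible at all.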
Standard Intelligence demonstrated FDM-1 performing multi-action CAD sequences in Blender (including extruding faces on an n-gon to make a gear), exploring and fuzzing complex websites, and driving a car in the real world, all at 30 FPS. The CAD demo uses OS checkpoints created at successful operations (extrude, select, etc.) to enable test-time compute via a forking VM. The blog post emphasizes that capabilities consistently improve with scale, and the team frames the current model as the first step toward CAD, finance, engineering, and ML-research coworker agents.
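Standard Intelligence has not published the checkpointing interface, so the following is a minimal sketch of the retry-from-validated-state loop the demo describes. Every name here (snapshot, fork, agent_step, validate) is a hypothetical placeholder, not a real FDM-1 API.

```python
# Hypothetical sketch of test-time compute via OS checkpoints and a
# forking VM. None of these names are a real FDM-1 API; they stand in
# for the snapshot/fork/validate behavior described in the launch post.

def agent_step(vm, step):
    """Placeholder for the model acting inside the forked VM at 30 FPS."""
    ...

def validate(vm, step):
    """Placeholder for checking that the operation succeeded
    (e.g. the extrude actually changed the mesh)."""
    ...

def run_with_checkpoints(vm, steps, max_retries=3):
    """Execute a multi-step task, retrying each step from the last
    validated snapshot instead of restarting from scratch."""
    checkpoint = vm.snapshot()                # initial known-good state
    for step in steps:                        # e.g. ["select face", "extrude", ...]
        for _ in range(max_retries):
            fork = checkpoint.fork()          # copy-on-write VM fork
            agent_step(fork, step)
            if validate(fork, step):
                checkpoint = fork.snapshot()  # promote the validated state
                break
            fork.discard()                    # throw away the failed branch
        else:
            raise RuntimeError(f"step failed after {max_retries} retries: {step!r}")
    return checkpoint
```

The design point is that failures cost only one step's worth of work: a bad extrude rolls back to the last validated snapshot rather than forcing a restart of the whole CAD sequence.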
FDM-1 has no published pricing or self-serve access as of the February 23, 2026 announcement. Standard Intelligence describes it as a research milestone in a blog post at si.inc/posts/fdm1/, and access appears to be limited to enterprise or research partnerships. Compared to other Automation tools in our directory that publish $20-$200/month tiers, FDM-1 sits firmly in the enterprise / contact-sales segment with no free or developer tier announced.
The training recipe has three core components, all described in the launch post. First, a video encoder that compresses approximately 2 hours of 30 FPS video into 1 million tokens, enabling long-context training. Second, an inverse dynamics model that labels raw screen recordings with the actions that produced them, removing the need for contractor annotation. Third, a forward dynamics model that predicts future frames conditioned on actions, which is the component used to drive the agent at inference time.
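No architecture or code has been released, so the sketch below only shows how the three components could compose during data labeling and training, as described in the launch post. All function names (encode_video, idm, fdm) are illustrative placeholders, not real components.

```python
# Schematic of the three-component training recipe from the launch
# post. All functions here are illustrative placeholders; Standard
# Intelligence has not published architectures or code.

def label_recording(frames, encode_video, idm):
    """Inverse dynamics: infer the action between consecutive token
    chunks, turning raw screen recordings into (state, action) pairs
    with no contractor annotation."""
    tokens = encode_video(frames)   # ~2 h of 30 FPS video -> ~1M tokens
    return [
        (tokens[t], idm(tokens[t], tokens[t + 1]))  # action taking t -> t+1
        for t in range(len(tokens) - 1)
    ]

def train_step(labeled, fdm, loss_fn):
    """Forward dynamics: predict the next frame tokens conditioned on
    the current state and the IDM-inferred action."""
    total = 0.0
    for t in range(len(labeled) - 1):
        (state, action), (next_state, _) = labeled[t], labeled[t + 1]
        total += loss_fn(fdm(state, action), next_state)
    return total / (len(labeled) - 1)
```

Read this way, the IDM is a data-labeling tool and the forward dynamics model is the product: the launch post says the forward model is what drives the agent at inference time, which presumably is where the FDM-1 name comes from.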
Consider FDM-1 carefully or explore alternatives. With no free tier or self-serve access, hands-on evaluation currently requires an enterprise or research partnership.
Pros and cons analysis updated March 2026