Milestone Launches Vision Language Model (VLM)

Milestone Systems released an advanced vision language model (VLM) specialising in traffic understanding and powered by NVIDIA Cosmos Reason. The VLM powers two new products: a Video Summarisation tool for XProtect Video Management Software and a VLM as a Service for third-party integrations.

Video Summarisation for XProtect allows users to search summaries from visual data and automates reporting.

Today’s video systems capture vast amounts of data, and reviewing footage remains time-consuming and largely manual. With Milestone Systems’ new Video Summarisation tool – a generative AI-powered plug-in for the XProtect Smart Client – users and operators can now rely on a specialised product that automates operator workflows, saves valuable time, and reduces false alarm fatigue significantly. Early reports show video summarisation could reduce operator false alarm fatigue by up to 30%.

The Video Summarisation tool analyses camera footage and describes what’s happening. Users simply send a snippet of video and a prompt describing their request, and the model will generate a text summary in seconds.

Key capabilities:

Convert video segments into structured text summaries inside XProtect Smart Client
Search summaries based on video content, rather than timestamps or manual tagging
Bookmark and filter summaries to streamline review workflows
Integrate seamlessly with existing XProtect event and rule logic to trigger automated summaries based on specific alarms or alerts
Focus attention on valid events by filtering out irrelevant motion or noise
Access customised, sovereign VLMs per region, starting with the US and EU. More regions to follow.

The Video Summarisation tool is free to download and takes only a few minutes to install directly in the XProtect Smart Client. Users pay only when they send a prompt to the VLM.

VLM as a Service for developers: Add production-ready video intelligence to any application

With Milestone’s Hafnia VLM as a Service (VLMaaS), developers, integrators and partners get API access to production-ready video intelligence built on NVIDIA’s latest technology and fine-tuned on responsibly sourced data.

The VLMaaS helps developers create AI-powered solutions quickly without needing to set up, fine-tune or manage their own AI systems – it enhances any existing solutions with generative AI, regardless of the level of analytics currently in place. This makes it fast and simple to add advanced video intelligence features to applications, whether it’s testing a minimum viable product (MVP) or scaling a platform.

With VLMaaS, AI and analytics development can be accelerated significantly, taking up to 70 times less effort than fine-tuning a VLM to do the same work.

Key capabilities:

Access a high-accuracy vision language model, fine-tuned on traffic-optimised data and built on NVIDIA Cosmos Reason
Follow prompt-based instructions for traffic-related operations
API-first delivery – simple integration via HTTPS
Fine-tuned models for the US and EU markets, with more regions to follow
Designed to build standalone solutions or integrate with the Milestone product portfolio
100% responsibly sourced training data with auditable data lineage, GDPR- and EU AI Act-compliant, used for the fine-tuning of the model

Pricing for the VLMaaS is pay-per-use (based on API calls) – no large upfront investments or custom training costs.
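
To illustrate what "API-first delivery via HTTPS" could look like in practice, here is a minimal Python sketch that packages a video snippet and a prompt into an HTTPS POST request. The endpoint URL, JSON field names, region values and bearer-token header are all assumptions made for illustration; none of them come from Milestone's documentation, so consult the actual API reference before integrating.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: the real VLMaaS URL is not given in this article.
API_URL = "https://vlm.example.com/v1/summaries"


def build_summary_request(video_bytes, prompt, api_key, region="eu"):
    """Build (but do not send) an HTTPS POST carrying a clip and a prompt.

    All field names below are illustrative assumptions, not the real schema.
    """
    body = {
        "region": region,  # assumed selector for the US or EU model
        "prompt": prompt,  # natural-language instruction for the model
        # Binary video is base64-encoded so it can travel inside JSON.
        "video_base64": base64.b64encode(video_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder bytes stand in for a real video snippet.
req = build_summary_request(
    b"\x00\x01fake-clip",
    "Describe any traffic incidents in this clip.",
    "demo-key",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the call. Under the pay-per-use model described above, each such call would presumably count as one billable request.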

Andrew Burnett, Acting Chief Technology Officer, Milestone Systems, said:
“With the Vision Language Model as a Service and Video Summarisation for XProtect, we’re tackling some of the most challenging bottlenecks: video overload and time-consuming manual work. Operators get immediate insight directly within XProtect; builders get API‑first access to production‑ready intelligence without bespoke training or heavy infrastructure.

Because this model is specialised for real-world traffic video and fine-tuned on responsibly sourced data, customers can trust the results, deploy with confidence, and enhance all existing solutions in place. It’s the fastest, most advanced and impactful path to turning video into actionable outcomes.”

XProtect customers like the cities of Genoa, Italy, and Dubuque, Iowa, US, are excited to use these new capabilities, leading the way in adopting advanced video intelligence solutions to enhance traffic management.

Built on Responsible AI, Powered by Real-World Data

The two new offerings are powered by Milestone’s Hafnia VLM, which has been fine-tuned on 75,000 hours of responsibly sourced, real-world video data from either Europe or the US, using NVIDIA Cosmos Curator for data preparation and running either on cloud infrastructure or regional data centres. Leveraging NVIDIA Cosmos Reason VLM and Milestone’s data for fine-tuning makes it one of the most advanced video AI platforms in the industry.
