Building a streaming service in a day with Bitmovin's AI-native stack
A working Netflix-style OTT site at https://bitflix.slederer.com/: 10 titles, per-title HLS+DASH on CloudFront, hover previews, chapter markers, ad-marker overlays, viewer analytics, a 10-minute on-demand live channel, and a 20-country playback QA report. End to end on AWS, built in a day, mostly by an LLM agent following a written plan.
This post walks through what the AI layer buys you when you wire up the full Bitmovin product surface, and where the friction still is.
The pipeline
    download.blender.org / archive.org
            │
            ▼  ingest CLI (parallel, resumable)
    S3 bitflix-slederer-input
            │
            ▼  Bitmovin Encoding API (per-title H.264 + AAC)
            │  Encoding Templates YAML, one template, ten titles
            │
            ├──▶ AI Scene Analysis → scenes.json + ad-markers.json
            │      (named characters, objects, dialogue, scene cuts)
            │
            ├──▶ AISA-driven preview clips
            │      (TimeBasedTrimmingInputStream, no ffmpeg)
            │
            ├──▶ Trick-play sprite + WebVTT
            │
            └──▶ Stream Lab device matrix QA
                   (31 device targets across 20 countries)
            │
            ▼
    S3 bitflix-slederer-output → CloudFront → bitflix.slederer.com
            │
            ▼
    Bitmovin Player + Bitmovin Analytics
      (chapters, ad cues, trick-play, impressions)
Bitmovin ships its AI features as ordinary API surface. You don't build a separate ML pipeline. The same encoding job that produces your manifests also produces structured scene data.
What the AI does
Scene analysis writes the metadata layer
Every encoded asset gets an AI Scene Analysis run. The output for Big Buck Bunny:
    {
      "scenes": [
        {
          "title": "Enchanted Animated Landscape and Singing Bird",
          "startInSeconds": 0.0,
          "endInSeconds": 25.41,
          "content": {
            "characters": [
              {
                "name": "Unknown",
                "appearance": "A plump, gray bird with a yellow beak and wide, expressive eyes.",
                "description": "The bird sings dramatically, spreading its wings…"
              }
            ],
            "objects": [
              { "description": "Rolling green hills, lush trees, and a pastel sky." }
            ]
          }
        },
        …
      ]
    }
10 named scenes per title. Each scene comes with character descriptions, object summaries, and frame-accurate boundaries. We use that data in three places on the site:
- Chapter markers on the player seekbar. Every startInSeconds becomes a tick on the timeline. Every scene title becomes a tooltip.
- A second ad-markers.json with SCTE candidates derived from those scene boundaries: pre-roll, three mid-rolls picked at the largest gaps between scenes, post-roll. Rendered as red ticks on the seekbar.
- A blurb under each catalog tile that's actually about the asset.
We didn't run a separate model. The encoding produced all of it.
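The mid-roll selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the site: the function name `derive_ad_markers`, the output keys `type` and `time`, and the choice to place each mid-roll at the start of the scene that follows a gap are all assumptions. Only `startInSeconds` and `endInSeconds` come from the AISA output shown earlier.

```python
def derive_ad_markers(scenes, runtime_s, mid_rolls=3):
    """Return pre-roll, mid-roll and post-roll cue candidates from scene data.

    Hypothetical helper: the output keys "type" and "time" are illustrative.
    """
    # Gap between the end of one scene and the start of the next one.
    gaps = []
    for prev, nxt in zip(scenes, scenes[1:]):
        gap = nxt["startInSeconds"] - prev["endInSeconds"]
        gaps.append((gap, nxt["startInSeconds"]))

    # Keep the largest gaps, then put the chosen cue times back in timeline order.
    largest = sorted(gaps, key=lambda g: g[0], reverse=True)[:mid_rolls]
    mid_times = sorted(t for _, t in largest)

    markers = [{"type": "pre-roll", "time": 0.0}]
    # Assumption: a mid-roll sits at the start of the scene that follows the gap.
    markers += [{"type": "mid-roll", "time": t} for t in mid_times]
    markers.append({"type": "post-roll", "time": float(runtime_s)})
    return markers
```

Placing cues on scene boundaries keeps an ad break from cutting into a shot, which is presumably why the boundaries are used as candidates in the first place.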
Preview clips without ffmpeg
The hover preview on each catalog tile is a 10-second clip from a representative scene of the asset, picked by AISA and extracted by Bitmovin. There's no local video tooling involved:
    from bitmovin_api_sdk import IngestInputStream, TimeBasedTrimmingInputStream

    # api, enc, s3_in, slug, scenes_doc and title are defined outside this excerpt.

    # 1. Pick a scene from the AISA output (5-30 seconds, in the middle 60% of runtime)
    start_s, duration_s = _pick_scene(scenes_doc, title.runtime_seconds)

    # 2. Submit a Bitmovin imperative encoding that trims the input
    ingest = api.encoding.encodings.input_streams.ingest.create(
        encoding_id=enc.id,
        ingest_input_stream=IngestInputStream(input_id=s3_in.id, input_path=f"{slug}.mp4"),
    )
    trim = api.encoding.encodings.input_streams.trimming.time_based.create(
        encoding_id=enc.id,
        time_based_trimming_input_stream=TimeBasedTrimmingInputStream(
            input_stream_id=ingest.id, offset=start_s, duration=duration_s,
        ),
    )

    # 3. H.264 + AAC streams reading from the trim, then MP4 + HLS muxings
For ten titles that's ten short encodings, each producing one clip from the most representative scene the model found. The Python codebase has zero ffmpeg subprocesses in it. Every video operation is an API call.
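The scene-picking step is not shown in the excerpt, so here is one way it might look. This is a sketch under stated assumptions, not the project's `_pick_scene`: the name `pick_scene`, the "longest qualifying scene wins" rule, and the centred fallback are all hypothetical. The 5-30 second window, the middle 60% of runtime, and the 10-second clip length come from the text above.

```python
def pick_scene(scenes_doc, runtime_s, clip_s=10.0):
    """Return (start_s, duration_s) for a hover-preview clip (hypothetical helper)."""
    lo, hi = 0.2 * runtime_s, 0.8 * runtime_s  # middle 60% of the runtime
    candidates = []
    for scene in scenes_doc.get("scenes", []):
        start, end = scene["startInSeconds"], scene["endInSeconds"]
        length = end - start
        # Keep scenes that are 5-30 s long and sit fully inside the window.
        if 5.0 <= length <= 30.0 and start >= lo and end <= hi:
            candidates.append((start, length))

    if not candidates:
        # Fallback (assumption): centre a clip on the midpoint of the title.
        start = max(0.0, runtime_s / 2 - clip_s / 2)
        return start, min(clip_s, runtime_s - start)

    # Assumption: prefer the longest qualifying scene, earliest on ties.
    start, length = max(candidates, key=lambda c: (c[1], -c[0]))
    # Never ask for more than the clip length or the scene itself.
    return start, min(clip_s, length)
```

The returned pair maps directly onto the `offset` and `duration` arguments of `TimeBasedTrimmingInputStream` in the excerpt above.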
Per-title ladder optimization
The encoding template sets mode: PER_TITLE_TEMPLATE. The encoder analyzes each asset's complexity and produces a custom ABR ladder per title. NASA's ISS Earth Time-Lapse (slow-moving, low complexity) gets fewer rungs at lower bitrates than Sintel (high motion, fine detail). One template, ten different ladders, no manual tuning.
Wiring it into the player
The player config is one object. The AI metadata wires straight in:
    const player = new bitmovin.player.Player(container, {
      key: PLAYER_KEY,
      analytics: { key: ANALYTICS_KEY, videoId, title, customData1: "bitflix" },
    });

    await player.load({
      hls: title.hls,
      dash: title.dash,
      thumbnailTrack: { url: title.thumbnails_vtt }, // trick-play scrub previews
      metadata: {
        markers: scenes.map(s => ({                  // chapter ticks
          time: s.startInSeconds,
          title: s.title,
        })),
      },
    });
You get scrub-thumbnail previews, chapter markers, ABR rung selection, viewer impressions in the Bitmovin Analytics dashboard, and the default UI. The ad-marker overlays are a small custom paint on the seekbar; the data being painted is just AISA output.
Stream Lab: device matrix in one HTTP call
After encoding, every HLS and DASH manifest goes to Bitmovin Stream Lab, which runs 31 device targets across 20 country VPN locations. The job runs unattended for about five minutes and returns a per-device pass/fail report.
The geo summary for Bitflix across all 20 countries:
| Metric | Min | Avg | Max |
|---|---|---|---|
| Manifest fetch (ms) | 3 | ~5 | 10 |
| First segment (ms) | 80 | 130 | 220 |
| Startup (ms) | 2195 | 2248 | 2333 |
| Seek (ms) | ~2300 | ~2350 | ~2400 |
Startup stays under 2.4 seconds in every country we tested, with no tuning beyond what the per-title encoder and CloudFront produce by default.
A separate test rig drives a real Chromium against the live page from 50 countries via residential proxies, checking that playback actually starts and that Bitmovin Analytics receives the impression. Stream Lab gives you a predictive matrix; the real-browser run gives you empirical coverage. They're complementary.
What was rough
Building this with an LLM as the engineer makes the friction measurable. Each workaround the agent had to invent is a place where the product surface cost more than it should have. The full log is at https://bitflix.slederer.com/web/feedback.html (45 items, with severity, repro, and fix proposals). Headline categories:
- Per-title encoding has surprising defaults around encoding regions and stream selection that aren't surfaced anywhere. The agent had to reverse-engineer them by reading SDK source.
- AI Scene Analysis runs implicitly per account (via assetDescription), which is a nice default. But the run/poll/discover lifecycle isn't documented, and the artefact location and shape are different from any heuristic fallback.
- Stream Lab API has an undocumented terminal status (notified) that hangs naive pollers, and the docs split confusingly between two subdomains.
- Player Web X 10 beta has a working v8-compat shim and a non-functional native bundle. The native bundle accepts sources.add(...) but never attaches the source to its own <video>. Single-segment HLS playlists silently fail to play (target-duration scheduling bug). Both filed as P0/P1.
- The docs MCP at agentic.bitmovin.com will fabricate schema fields that don't exist if asked broadly. Pinning it to known-good SDK examples works fine; free-form "what fields does this take?" doesn't.
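The poller hang in the Stream Lab item is easy to guard against on the client side. The sketch below is a generic pattern, not Stream Lab client code: `fetch_status` is a hypothetical callable you supply that returns the job's current status string, and of the terminal names only `notified` comes from the friction log; `finished` and `error` are placeholders to replace with whatever the API actually returns.

```python
import time

# Assumption: only "notified" is taken from the friction log; the other two
# names are placeholders for the API's real terminal statuses.
TERMINAL_STATUSES = {"finished", "error", "notified"}


def wait_for_terminal(fetch_status, timeout_s=600.0, interval_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal status appears or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = str(fetch_status()).lower()
        if status in TERMINAL_STATUSES:
            return status
        # A hard deadline means an unknown terminal status cannot hang the caller.
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

The explicit deadline matters more than the status list: even if another undocumented status shows up later, the poller fails loudly instead of spinning forever.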
Bitmovin shipping an LLM-powered docs MCP and an Agent Toolkit is real progress, ahead of the peer set. Closing the gaps the agent surfaced is the next step.
Run it yourself
- Live demo: https://bitflix.slederer.com/
- Source: https://github.com/slederer/OS-streaming-v2
- One-shot summary: https://bitflix.slederer.com/web/summary.html
- Friction log (45 items): https://bitflix.slederer.com/web/feedback.html
A streaming service used to take a team several months. With Bitmovin's AI-native encoder, the SDK in its current shape, and a tolerance for the rough edges in beta features, a small team can stand one up in a day.