smolvlm2-captioner

Applies the SmolVLM2-2.2B-Instruct multimodal model to video frames selected by input TimeFrame annotations for prompt-driven captioning and scene description. Each invocation runs a single prompt against the TimeFrames selected by tfLabels; to apply different prompts to different label subsets (e.g. one prompt for slates, another for chyrons), run the app once per (prompt, tfLabels) combination. Captioning is composite at the TimeFrame level: all frames sampled from a TimeFrame are passed to the model together in a single prompt, yielding one caption per TimeFrame. This app ships only the 2.2B-Instruct variant, the largest and most general-purpose model in the SmolVLM2 family. The smaller SmolVLM2 releases (256M and 500M) are post-trained specifically for video-QA tasks, and given their size we do not expect them to generalize well.
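
The "one run per (prompt, tfLabels) combination" pattern can be sketched as below. This is only an illustration under stated assumptions: that the app is reachable over HTTP and takes its runtime parameters as query-string values, that the prompt parameter is named `prompt` (only `tfLabels` is named in the description above), and that `BASE_URL`, the prompt texts, and the helper `build_run_urls` are hypothetical. The sketch only builds the request URLs; it does not contact the app.

```python
from urllib.parse import urlencode

# Hypothetical address of a locally running app instance (an assumption).
BASE_URL = "http://localhost:5000"

# One entry per (prompt, tfLabels) combination; prompt texts are illustrative.
RUNS = [
    ("Transcribe the text on this slate.", ["slate"]),
    ("Read the name and title shown in the chyron.", ["chyron"]),
]


def build_run_urls(base_url, runs):
    """Return one request URL per (prompt, labels) pair."""
    urls = []
    for prompt, labels in runs:
        # `prompt` is an assumed parameter name; `tfLabels` comes from the
        # description above. A multi-valued parameter is repeated per value.
        params = [("prompt", prompt)] + [("tfLabels", label) for label in labels]
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in build_run_urls(BASE_URL, RUNS):
        print(url)
```

Each resulting URL corresponds to a separate invocation, so the output of the first run would be passed as input to the second if captions for both label subsets should end up in the same file.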

v1.1 (@keighrim, 2026-07-09, source, image)
v1.0 (@keighrim, 2026-06-01, source, image)
v0.5 (@kelleyl, 2026-02-27, source, image)
v0.4 (@kelleyl, 2026-02-26, source, image)
v0.3 (@kelleyl, 2026-01-28, source, image)
v0.2 (@kelleyl, 2026-01-28, source, image)
v0.1 (@kelleyl, 2025-11-20, source, image)