
Gemini Multimodal - Video Understanding

Author: Venkata Sudhakar

Gemini can analyse video files directly, not just images. You can upload a recorded product demo, a customer interview, a training video, or a meeting recording, and ask Gemini to extract insights, summarise key points, identify timestamps, transcribe speech, or answer questions about what happens in specific scenes. This opens a new class of business applications: automated quality review of sales call recordings, instant summarisation of long training sessions, content moderation of user-generated video, and intelligent search across video libraries.

Videos are uploaded using the Gemini Files API, which handles files up to 2GB. After upload, the file is processed and becomes available for multimodal queries. You reference the uploaded file by its URI in the contents of a generate_content call. Gemini supports MP4, MOV, AVI, and other common formats. For long videos, Gemini samples frames at regular intervals to understand the content; you can ask about scenes at specific timestamps and it will reference the correct part of the video.
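The upload-then-poll flow can be sketched as follows, assuming the google-generativeai SDK; the file name, API key placeholder, and timeout values are illustrative only:

```python
import time


def wait_until_active(file, get_file, poll_seconds=5, timeout_seconds=600):
    """Poll the Files API until an uploaded video leaves the PROCESSING state.

    `get_file` is the lookup callable (e.g. genai.get_file), injected here so
    the loop itself can be exercised without the SDK.
    """
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{file.name} still processing after {timeout_seconds}s")
        time.sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"upload failed: state={file.state.name}")
    return file


if __name__ == "__main__":
    import google.generativeai as genai  # pip install google-generativeai

    genai.configure(api_key="YOUR_API_KEY")        # placeholder key
    video = genai.upload_file("product_demo.mp4")  # hypothetical file
    video = wait_until_active(video, genai.get_file)
    print("Ready for queries at:", video.uri)
```

Once the file reports ACTIVE, its URI can be passed alongside a text prompt in any generate_content call.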

The example below shows a retail company analysing recorded product demo videos to extract the key features mentioned, the customer questions asked, and a structured summary, automating what previously required a human reviewer watching every recording.


Uploading and analysing a product demo recording,
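A minimal sketch of such a script, assuming the google-generativeai SDK; the model name (gemini-1.5-flash), the API key placeholder, and the prompt wording are illustrative assumptions rather than the exact original code:

```python
import time
from pathlib import Path

# Prompt asking for the structured sections shown in the output below.
ANALYSIS_PROMPT = (
    "Analyse this product demo video and return:\n"
    "SUMMARY: a short overview of the demo\n"
    "FEATURES: each feature mentioned, with its timestamp\n"
    "QUESTIONS: customer questions asked, with timestamps\n"
    "TIMESTAMPS: a chapter outline of the video\n"
    "SENTIMENT: overall tone of the presenter and audience"
)


def report_header(video_path: str) -> str:
    """Build the '=== VIDEO ANALYSIS: ... ===' banner from the file name."""
    return f"=== VIDEO ANALYSIS: {Path(video_path).stem} ==="


if __name__ == "__main__":
    import google.generativeai as genai  # pip install google-generativeai

    genai.configure(api_key="YOUR_API_KEY")  # placeholder key
    path = "shopmax_tv_demo_april.mp4"

    print(f"Uploading video: {path}")
    video = genai.upload_file(path)
    print("Processing video...")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)
    print("Video ready:", video.uri)

    model = genai.GenerativeModel("gemini-1.5-flash")
    print(report_header(path))
    print(model.generate_content([video, ANALYSIS_PROMPT]).text)

    # Follow-up question about a specific scene in the same video
    followup = model.generate_content(
        [video, "When does the remote control appear, and what does "
                "the presenter highlight about it?"]
    )
    print("Remote control scene:", followup.text)

    # Clean up storage once analysis is done
    genai.delete_file(video.name)
    print("Video file deleted from Gemini Files API")
```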


It gives the following output with structured video analysis,

Uploading video: shopmax_tv_demo_april.mp4
Processing video...
Video ready: https://generativelanguage.googleapis.com/v1beta/files/abc123

=== VIDEO ANALYSIS: shopmax_tv_demo_april ===
SUMMARY: This 8-minute demo showcases the Samsung 65-inch QLED TV with
focus on picture quality, smart features, and gaming capabilities. The
presenter walks through setup, content streaming, and the gaming mode.

FEATURES:
- QLED panel with Quantum HDR 32X (shown at 01:20)
- 4K upscaling for HD content (demonstrated at 02:45)
- Samsung SmartHub with OTT apps (03:10)
- Auto Low Latency Mode for gaming (05:30)
- Multi-View split screen feature (06:15)

QUESTIONS:
- "Does it support Dolby Atmos?" asked at 04:22
- "What is the input lag for gaming?" asked at 05:45

TIMESTAMPS:
00:00 - Introduction and unboxing
01:00 - Picture quality demonstration
03:00 - Smart TV features walkthrough
05:30 - Gaming mode setup
07:00 - Final comparison and pricing

SENTIMENT: Positive - presenter is enthusiastic and the audience's questions show engagement

Remote control scene: The remote appears at 06:45. The presenter highlights
the voice control button, direct OTT app shortcuts, and the solar charging
panel on the back of the remote.

Video file deleted from Gemini Files API

Video analysis production patterns: keep uploaded videos short (under 10 minutes) for fastest analysis; Gemini works best with focused clips rather than multi-hour recordings. For long recordings, split into chapters and analyse each chapter separately. Use the Files API delete endpoint immediately after analysis to avoid storage accumulation. For high-volume use cases like moderating thousands of user-submitted videos daily, combine the Files API upload with the Gemini Batch API (Tutorial 300) to process many videos concurrently at 50 percent reduced cost. Store the analysis results in BigQuery for searchable video metadata across your entire content library.
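The chaptering pattern above can be sketched with a small helper that computes clip boundaries and hands the actual cutting to ffmpeg; the file names, the 10-minute chapter length, and the ffmpeg dependency are assumptions for illustration:

```python
import subprocess


def chapter_ranges(duration_s: int, chapter_s: int = 600):
    """(start, end) boundaries in seconds for <=10-minute analysis chunks."""
    return [(start, min(start + chapter_s, duration_s))
            for start in range(0, duration_s, chapter_s)]


def split_chapter(src: str, start: int, end: int, dst: str):
    """Cut one chapter out of a long recording with ffmpeg (assumed installed).

    -c copy avoids re-encoding, so the cut lands on the nearest keyframe.
    """
    subprocess.run(
        ["ffmpeg", "-ss", str(start), "-i", src,
         "-t", str(end - start), "-c", "copy", dst],
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical 90-minute meeting recording split into 9 chapters
    for i, (start, end) in enumerate(chapter_ranges(90 * 60)):
        split_chapter("all_hands_recording.mp4", start, end, f"chapter_{i:02d}.mp4")
        # ...then upload each chapter, analyse it, and delete it as above
```

Each chapter then goes through the same upload, analyse, delete cycle, keeping every individual request well under the size that slows analysis down.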
