Build a Multimodal Agentic RAG App with Gemini Embedding 2 and Google ADK
(100% open source)
If you have built a RAG app before, you know how quickly the "just retrieve the right chunk" problem fragments the moment your sources stop being plain text. Product PDFs, UI screenshots, recorded calls, demo videos, and support notes all carry the answer your user is asking for, but each lives in its own embedding silo.
Stitching them together usually means three pipelines, two vector stores, and a glue layer you'll soon regret.
In this tutorial, you'll build a fully working multimodal agentic RAG app where text, URLs, PDFs, images, audio, and video all share a single 768-dimensional embedding space, and a small Google Agent Development Kit (ADK) coordinator turns the retrieved evidence into a grounded, cited answer.
The two pieces doing the heavy lifting are Gemini Embedding 2, which embeds every modality into the same vector space, and Google ADK, which wraps the retrieval call in an agent that inspects the workspace, calls the retrieval tool, and writes the answer.
You'll see exactly how those two pieces compose without any extra orchestration framework.
What We’re Building
A multimodal Agentic RAG demo where you can drop in any file or URL and ask questions across the whole index. The same retrieval packet powers both the answer and the citation panel, so the UI never disagrees with the model.
Key features:
Truly multimodal index — text, URLs, PDFs, images, audio, and video all live in one cosine-similarity space.
Gemini Embedding 2 with task prefixes — separate prefixes for documents and queries to improve retrieval quality.
Google ADK agent — coordinates the inspect_embedding_space and retrieve_relevant_context tools, then synthesizes a grounded answer.
Single retrieval, two consumers — /ask retrieves once, then passes the same packet to the agent and to the UI.
3D PCA embedding view — every source is one point; ask a question and the query and cited sources light up in the same projection.
SSRF-safe URL ingestion — private and loopback IPs blocked unless you opt in.
How It Works
End-to-end, one question flows like this:
1. You add sources. Each source is chunked (text/URL) or uploaded once (PDF/image/audio/video). Every chunk gets a Gemini Embedding 2 vector with the task: retrieval document prefix. Files get a media vector blended with a text annotation vector, so titles still help retrieval.
2. You ask a question. /ask embeds the query with the task: question answering | query prefix, scores every chunk by cosine similarity, keeps the best chunk per source, takes the top k, and projects everything into 3D using power-iteration PCA.
3. The agent runs. _run_adk_agent builds a per-request agent whose retrieve_relevant_context tool is a closure over the already-computed retrieval packet. The agent calls inspect_embedding_space, then "calls" the retrieval tool, then writes a grounded answer with no inline citation IDs.
4. The UI renders. The frontend shows the answer text, the citation panel (built from the same matches), the agent trace, and the updated 3D view with the query point and highlighted sources.
The architectural insight is the single-retrieval contract: one query embedding, one ranked list of matches, two consumers (the agent and the UI). That's what keeps citations honest.
Prerequisites
Before we begin, make sure you have the following:
Python installed on your machine (version 3.12 is recommended)
Your Gemini API key for using Gemini Embedding 2
A code editor of your choice
Basic Python and FastAPI familiarity
Code Walkthrough
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the multimodal_agentic_rag backend folder:
cd rag_tutorials/multimodal_agentic_rag/backend
Install the required dependencies:
pip install -r requirements.txt
Grab your Gemini API key from Google AI Studio and set it in your current session:
export GOOGLE_API_KEY="your-google-ai-studio-key"
Creating the App
Project structure:
rag_tutorials/multimodal_agentic_rag/
|-- README.md
|-- assets/
| `-- multimodal-agentic-rag-architecture.png
|-- backend/
| |-- app_state.py
| |-- rag_store.py
| |-- requirements.txt
| |-- server.py
| `-- agentic_rag_agent/
| |-- __init__.py
| `-- agent.py
`-- frontend/
|-- index.html
|-- package.json
|-- src/
| |-- App.tsx
| |-- main.tsx
| `-- styles.css
|-- tsconfig.json
`-- vite.config.ts

We’ll skip the frontend code walkthrough and focus on the backend architecture.
1. The Shared Store (app_state.py)
A single line keeps the in-memory index addressable from both FastAPI and the ADK tools:
from rag_store import MultimodalRagStore
RAG_STORE = MultimodalRagStore()

Both server.py and the ADK tool functions import RAG_STORE from here, so the agent always sees the same sources you uploaded through the UI.
2. The Multimodal Store (rag_store.py)
This is where most of the interesting code lives. A few constants set the contract:
EMBED_MODEL = "gemini-embedding-2"
DEFAULT_DIMENSIONS = 768
CHUNK_WORDS = 170
CHUNK_OVERLAP = 35
INLINE_MEDIA_LIMIT_BYTES = 18 * 1024 * 1024

We chunk text into roughly 170-word windows with 35-word overlap and embed each chunk separately. Anything bigger than ~18 MB or any audio/video file goes through the Gemini File API instead of inline bytes.
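The chunker itself isn't printed in the walkthrough; a minimal sketch consistent with those constants (the helper name _chunk_text is hypothetical) could look like this:

def _chunk_text(text: str, size: int = 170, overlap: int = 35) -> list[str]:
    """Split text into ~size-word windows overlapping by `overlap` words."""
    words = text.split()
    if len(words) <= size:
        return [text] if words else []
    step = size - overlap  # advance 135 words per window with the defaults
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final window already covers the end of the text
    return chunks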
Embedding text with task prefixes
Gemini Embedding 2 supports task prefixes — small instructions like "task: retrieval document" or "task: question answering | query" that tell the model how this content will be used. Documents and queries get different prefixes, which measurably improves retrieval:
def _embed_text(self, text: str, task_prefix: str) -> list[float]:
content = f"{task_prefix}: {text}"
client = self._require_client()
result = client.models.embed_content(
model=EMBED_MODEL,
contents=[content],
config=types.EmbedContentConfig(output_dimensionality=self.dimensions),
)
return result.embeddings[0].values

The interesting bit is output_dimensionality=768: Gemini Embedding 2 supports truncating to smaller, latency-friendlier vectors right at the API call, so you don't have to pay for storage or cosine math on the full embedding width.
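To make the asymmetry concrete, here is a hypothetical call pattern against the store; only _embed_text and the two prefixes come from the code above, the sample strings are invented:

store = MultimodalRagStore()  # no-arg constructor, as in app_state.py

# Documents and queries get different task prefixes at embed time.
doc_vec = store._embed_text("Q3 revenue grew 12%.", "task: retrieval document")
query_vec = store._embed_text("How fast did revenue grow?", "task: question answering | query")

print(len(doc_vec), len(query_vec))  # both 768, per output_dimensionality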
Embedding files (PDFs, images, audio, video)
Multimodal is where Gemini Embedding 2 earns its keep. Small images and PDFs go inline; large files and all media go through the File API:
def _embed_file(self, data, mime_type, title, notes):
client = self._require_client()
use_file_api = (
len(data) > INLINE_MEDIA_LIMIT_BYTES
or mime_type.startswith("video/")
or mime_type.startswith("audio/")
)
if use_file_api:
return self._embed_uploaded_file(data, mime_type, title), "gemini-file-api"
part = types.Part.from_bytes(data=data, mime_type=mime_type)
result = client.models.embed_content(
model=EMBED_MODEL,
contents=[part],
config=types.EmbedContentConfig(output_dimensionality=self.dimensions),
)
return result.embeddings[0].values, "gemini-inline"

The File API path uploads the file, polls until its state is ACTIVE/SUCCEEDED, embeds via Part.from_uri, and then deletes the uploaded file in a finally block — important so you don't leak storage on every upload.
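_embed_uploaded_file itself isn't printed here. A sketch of what that path plausibly looks like with the google-genai SDK (the polling loop and config fields are assumptions, not the repo's exact code):

import io
import time
from google.genai import types

def _embed_uploaded_file(self, data: bytes, mime_type: str, title: str) -> list[float]:
    client = self._require_client()
    uploaded = client.files.upload(
        file=io.BytesIO(data),
        config=types.UploadFileConfig(mime_type=mime_type, display_name=title),
    )
    try:
        # Poll until Gemini finishes processing the upload.
        while uploaded.state and uploaded.state.name == "PROCESSING":
            time.sleep(1)
            uploaded = client.files.get(name=uploaded.name)
        part = types.Part.from_uri(file_uri=uploaded.uri, mime_type=mime_type)
        result = client.models.embed_content(
            model=EMBED_MODEL,
            contents=[part],
            config=types.EmbedContentConfig(output_dimensionality=self.dimensions),
        )
        return result.embeddings[0].values
    finally:
        # Delete the uploaded file so storage doesn't leak on every upload.
        client.files.delete(name=uploaded.name)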
To make a PDF or image still findable by its title (e.g., "the launch deck"), we blend the media vector with a text vector of the title plus user-provided notes:
media_vector, embedding_path = self._embed_file(...)
annotation_vector = self._embed_text(f"{title}. {notes}", "task: retrieval document")
vector = _blend_vectors(media_vector, annotation_vector)  # 68% media / 32% text

This is a small but very effective trick: native multimodal embeddings are great at semantic content, but humans often search by the label they gave the file.
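_blend_vectors isn't shown either; a minimal sketch matching the 68/32 comment (the unit-length renormalization is an assumption, but it keeps cosine scores comparable between blended and pure-text sources):

import math

def _blend_vectors(media: list[float], text: list[float], media_weight: float = 0.68) -> list[float]:
    """Weighted sum of two equal-length vectors, renormalized to unit length."""
    text_weight = 1.0 - media_weight
    blended = [media_weight * m + text_weight * t for m, t in zip(media, text)]
    norm = math.sqrt(sum(v * v for v in blended)) or 1.0
    return [v / norm for v in blended]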
Search: cosine similarity per chunk, deduplicated per source
def search(self, query: str, top_k: int = 6) -> dict[str, Any]:
query_vector = self._embed_text(query, "task: question answering | query")
source_vectors = self._source_vectors()
projections = self._pca_projection({**source_vectors, query_id: query_vector})
...
for chunk in self.chunks:
score = round(_cosine(query_vector, chunk.vector), 4)
current = source_matches.get(chunk.source_id)
if not current or score > current["score"]:
source_matches[chunk.source_id] = { ... }
matches = sorted(source_matches.values(), key=lambda m: m["score"], reverse=True)[:top_k]

Three subtle decisions here: we score every chunk but keep only the best chunk per source, we project source vectors and the query vector together so the 3D view shares the same basis, and we return a fully-formed space snapshot so the frontend never has to ask twice.
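_cosine is referenced but never printed; a pure-Python version in the same no-NumPy spirit (the repo's exact implementation may differ) is:

import math

def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)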
PCA projection in pure Python
The _pca_projection method runs power iteration to find the top three principal components and projects every vector into 3D — no NumPy, no scikit-learn. That keeps the dependency list short and the projection deterministic per request.
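For intuition, here is a compact power-iteration PCA in that same pure-Python style. This is a sketch of the technique, not the repo's _pca_projection; the seeding, iteration count, and deflation details are assumptions:

import math

def _pca_3d(vectors: dict[str, list[float]], iters: int = 60) -> dict[str, list[float]]:
    """Project vectors onto their top 3 principal components via power iteration."""
    ids = list(vectors)
    dim = len(vectors[ids[0]])
    # Center the data so components capture variance, not the mean offset.
    mean = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
    rows = [[vectors[i][d] - mean[d] for d in range(dim)] for i in ids]

    def cov_times(v):  # computes (X^T X) v without building the covariance matrix
        scores = [sum(r[d] * v[d] for d in range(dim)) for r in rows]
        return [sum(s * r[d] for s, r in zip(scores, rows)) for d in range(dim)]

    components: list[list[float]] = []
    for c in range(3):
        v = [math.sin(c + d + 1.0) for d in range(dim)]  # deterministic seed
        for _ in range(iters):
            v = cov_times(v)
            # Re-orthogonalize against earlier components (deflation).
            for comp in components:
                proj = sum(a * b for a, b in zip(v, comp))
                v = [a - proj * b for a, b in zip(v, comp)]
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        components.append(v)

    # Each point's 3D coordinate is its dot product with each component.
    return {
        i: [sum(r[d] * comp[d] for d in range(dim)) for comp in components]
        for i, r in zip(ids, rows)
    }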
The retrieval payload
The agent doesn't see raw chunks; it sees a clean, model-friendly payload:
def retrieval_payload(self, results):
return {
"provider": self.embedding_provider,
"matches": [
{
"citation": m["id"],
"source": m["title"],
"modality": m["modality"],
"similarity": m["score"],
"evidence": m["text"],
}
for m in results["matches"]
],
}

This is the exact same packet that /ask returns to the frontend, which is how we guarantee the answer and the citation panel never drift.
3. The ADK Agent (agentic_rag_agent/agent.py)
A short, sharp ADK agent with two tools and a focused instruction:
def retrieve_relevant_context(query: str, top_k: int = 5) -> dict:
"""Retrieve the most relevant multimodal source evidence for a user question."""
return RAG_STORE.retrieval_tool(query=query, top_k=top_k)
def inspect_embedding_space() -> dict:
"""Inspect current sources, modalities, dimensions, and embedding provider."""
return RAG_STORE.space_tool()
def build_agent(retrieval_tool=retrieve_relevant_context) -> Agent:
return Agent(
name="multimodal_agentic_rag_agent",
model="gemini-3-flash-preview",
description="Agentic RAG coordinator for a multimodal Gemini Embedding 2 workspace.",
instruction="""
You are the Google ADK coordinator for a multimodal agentic RAG workspace.
For every user question:
1. Use inspect_embedding_space to understand the current workspace.
2. Use retrieve_relevant_context with the user's question before answering.
3. Ground the answer in the retrieved evidence. Do not invent facts...
4. Do not include raw citation ids, source ids, bracket citations...
5. Start with a clear direct answer in 2-3 sentences.
6. If helpful, add a short "Key points:" section with simple hyphen bullets.
""",
tools=[inspect_embedding_space, retrieval_tool],
generate_content_config=genai_types.GenerateContentConfig(
temperature=0.25,
max_output_tokens=900,
),
)

build_agent accepts an injectable retrieval_tool. That's how server.py swaps in a closure that returns the already-computed retrieval packet, instead of letting the agent embed the query a second time.
4. The FastAPI Server (server.py)
The endpoint surface is small and predictable:
| Method | Endpoint | What it does |
|---|---|---|
| GET | /health | Liveness, ADK availability, dimensions, source counts |
| GET | … | Current sources, points, events, projection metadata |
| POST | … | Add a text source |
| POST | … | Fetch and index a public URL (SSRF-protected) |
| POST | … | Upload PDF, image, audio, or video |
| DELETE | … | Remove a source and its chunks |
| POST | /ask | Retrieve once, run ADK answer flow, return citations |
The key piece is /ask. It retrieves once, builds a clean payload, and injects a closure into the agent so it can't redo the embedding:
@app.post("/ask")
async def ask(req: AskRequest):
retrieval = await run_in_threadpool(RAG_STORE.search, req.question, req.top_k)
retrieval_payload = RAG_STORE.retrieval_payload(retrieval)
answer = await _run_adk_agent(req.question, retrieval_payload)
trace = [
{"agent": "space_inspector", "status": "complete", "detail": ...},
{"agent": "retrieval_tool", "status": "complete", "detail": ...},
{"agent": "answer_synthesizer","status": "complete", "detail": ...},
]
return {
"answer": answer,
"matches": retrieval["matches"],
"query_point": retrieval["query_point"],
"trace": trace,
"space": retrieval["space"],
}

And the closure injection inside _run_adk_agent:
async def _run_adk_agent(question: str, retrieval: dict[str, Any]) -> str:
def retrieve_relevant_context(query: str, top_k: int = 6) -> dict:
"""Return the exact retrieval packet already embedded for this request."""
return retrieval
request_agent = build_agent(retrieve_relevant_context)
request_runner = Runner(agent=request_agent, app_name=APP_NAME, session_service=session_service)
session = await session_service.create_session(app_name=APP_NAME, user_id=USER_ID)
content = genai_types.Content(
role="user",
parts=[genai_types.Part(text=f"Question: {question}\nUse the retrieval tool result for this exact question.")],
)
final_text = ""
async for event in request_runner.run_async(user_id=USER_ID, session_id=session.id, new_message=content):
text = _event_text(event)
if text:
final_text = text
return final_text

The agent thinks it's calling a real retrieval tool. It is — the tool just returns a cached result. This is a clean way to keep agent semantics while skipping a redundant embedding round-trip.
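_event_text is a small helper the snippet assumes. A plausible version using ADK's event API (the real helper may handle more cases):

def _event_text(event) -> str:
    """Pull the text of a final model response out of an ADK event, if any."""
    if not event.is_final_response():
        return ""
    if not event.content or not event.content.parts:
        return ""
    return "".join(part.text for part in event.content.parts if part.text)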
A few safety details worth highlighting:
SSRF protection: _validate_fetch_url rejects non-HTTP schemes and resolves the hostname; if any returned IP is private, loopback, link-local, or reserved, ingestion fails. Set ALLOW_PRIVATE_URLS=true only when you really need it (a minimal sketch of this check follows the list).
Threadpool offloading: every blocking call (text chunking, file reads, search, PCA) runs in run_in_threadpool so the FastAPI event loop stays responsive.
Configurable CORS: ALLOWED_ORIGINS is read from the env, defaulting to the Vite dev server.
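Here's that SSRF sketch, using only the standard library; the helper name comes from the prose above, everything else is an assumption about how such a check is typically written:

import ipaddress
import socket
from urllib.parse import urlparse

def _validate_fetch_url(url: str, allow_private: bool = False) -> None:
    """Raise ValueError unless the URL is plain HTTP(S) resolving to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # A hostname can resolve to several records; every one must be public.
    for info in socket.getaddrinfo(parsed.hostname, parsed.port or 80):
        ip = ipaddress.ip_address(info[4][0])
        if not allow_private and (
            ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
        ):
            raise ValueError(f"Blocked non-public address: {ip}")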
5. The Frontend (very brief)
The frontend is a single React/Vite app (frontend/src/App.tsx) that wraps three panels: a source manager for adding text/URLs/files, a Q&A panel that calls /ask and renders the answer plus a separate citations list, and a 3D embedding view built on Three.js that uses the projection coordinates returned by the backend. Every source is one colored point (color encodes modality), and after a question the query point and the cited sources are highlighted in the same PCA basis.
Running the App
With our code in place, it's time to launch the app.
Start the backend
python server.py

The backend listens on http://localhost:8897.
Start the frontend in a second terminal:
cd multimodal_agentic_rag/frontend
npm install
npm run dev -- --port 5177

If your backend lives on a different port, point the frontend at it:
VITE_API_URL=http://localhost:8897 npm run dev -- --port 5177

Add a few sources — try a paragraph of text, a public URL, a PDF, and an image.
Watch them appear as colored points in the embedding view.
Ask a question in the Q&A panel.
Inspect the answer, the cited sources, and the agent trace.
Notice the orange query point land near the sources the agent cites.
A quick health check from the terminal:
curl http://localhost:8897/health

Expected response shape on a fresh start (the store begins empty):
{
"status": "ok",
"adk": true,
"setup_error": "",
"sources": 0,
"chunks": 0,
"dimensions": 768,
"provider": "gemini-embedding-2",
"modalities": {},
"chunk_modalities": {},
"projection": "pca_3d"Working Application Demo
Conclusion
You've now built a multimodal agentic RAG app that puts text, URLs, PDFs, images, audio, and video into a single Gemini Embedding 2 space, retrieves with cosine similarity over chunked vectors, and uses a tightly scoped Google ADK agent to write grounded, citation-friendly answers, all without a separate vector database and in a few hundred lines of Python.
A few directions worth exploring from here:
Swap the in-memory store for a managed vector DB (pgvector, Qdrant, Vertex AI Vector Search) and persist the chunk metadata.
Add re-ranking with a cross-encoder or a Gemini reranker between cosine retrieval and the agent.
Background ingestion with a queue (Celery, RQ, or a simple async worker) so large videos don't block the API.
Evals: wire a small eval set with question/answer pairs and track citation precision and answer faithfulness over changes.
Auth + multi-tenancy so different users see different workspaces.
Observability: log the retrieval packet alongside the final answer; the single-retrieval contract makes faithfulness audits straightforward.
Keep experimenting with different configurations and features to build more sophisticated AI applications.
We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.