Indexers keep a room’s vector database fresh so other agents can run retrieval-augmented generation (RAG) queries. They watch for new content, chunk and embed it, and store both the raw text and the embeddings in a table that RagToolkit-powered agents can search.

When to Use

  • Automate RAG over room storage without manually rebuilding embeddings
  • Keep crawled websites synchronized with a room before a chat session
  • Pair with ChatBot or VoiceBot so they can answer questions using indexed documents
  • Pre-process content for long-running workflows (document review, data labeling, etc.)

How Indexers Work

  1. Detect content: listen for storage events or take a crawl input.
  2. Transform: convert the source into plain text (customizable via read_file).
  3. Chunk: split text with a Chunker (defaults use chonkie.SemanticChunker).
  4. Embed: call the configured Embedder (defaults use OpenAI embedding models).
  5. Persist: upsert rows into a room database table and build full-text / vector indexes.
Once data is in the table, add RagToolkit(table="…") to any conversational agent and the indexed content becomes searchable.
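The five steps above can be sketched as a toy pipeline. Everything here (the function names, fixed-size chunking, and the character-frequency "embedding") is an illustrative stand-in, not the MeshAgent API; real indexers delegate these steps to Chunker and Embedder implementations and persist rows to a room database table.

```python
# Illustrative sketch of the transform -> chunk -> embed -> persist pipeline.

def chunk(text: str, size: int = 40) -> list[str]:
    # Naive fixed-size chunking; the real default is semantic chunking.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[float]:
    # Toy embedding: a 26-dim letter-frequency vector. Real indexers call
    # an embedding model such as text-embedding-3-large.
    vec = [0.0] * 26
    for ch in chunk_text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def index_document(path: str, text: str) -> list[dict]:
    # Persist step: one row per chunk, raw text stored next to its embedding.
    return [
        {"path": path, "chunk": c, "embedding": embed(c)}
        for c in chunk(text)
    ]

rows = index_document("notes.txt", "Indexers keep a room's vector database fresh.")
```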

Built-in Indexers

StorageIndexer

StorageIndexer extends SingleRoomAgent and watches room.storage events. Whenever a file is uploaded, updated, or deleted it:
  • calls read_file(path=…) to obtain text (you override this to run MarkitDown, OCR, etc.)
  • chunks and embeds the content
  • writes rows into the configured database table (defaults to storage_index)
  • maintains vector and full-text indexes so downstream searches stay fast
Constructor Parameters
  • name (str | None): Deprecated. Agent identity comes from the participant token; if provided, it is only used to default title.
  • title / description (str | None): Optional metadata displayed in Studio. If title is omitted and name is set, title defaults to name.
  • requires (list[Requirement] | None): Toolkits or schemas to install before indexing (e.g., MarkitDown).
  • labels (list[str] | None): Tags for discovery/filtering.
  • chunker (Chunker | None): Splits text into chunks; defaults to ChonkieChunker.
  • embedder (Embedder | None): Produces embeddings; defaults to OpenAIEmbedding3Large.
  • table (str): Database table used to store rows and embeddings.
Tip: Implement async read_file(self, *, path: str) -> str | None to pull text from storage. Returning None skips indexing that file.

SiteIndexer

SiteIndexer is a TaskRunner that orchestrates crawls with the FireCrawl toolkit. Invoke it from MeshAgent Studio (“Toolkits…”) or via room.agents.invoke_tool to populate a specified table:
  1. Sends a crawl job to the FireCrawl queue.
  2. Streams page results back through the room queue.
  3. Chunks, embeds, and writes rows into the provided table (creating vector / FTS indexes automatically).
You can select the queue, destination table, and starting URL when you call the task. Pair the resulting table with RagToolkit the same way you would with StorageIndexer.
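For illustration, invoking the crawl task boils down to passing the starting URL, destination table, and queue as tool arguments. The argument names below are hypothetical stand-ins; check the tool's schema in Studio for the exact parameters before calling room.agents.invoke_tool.

```python
# Hypothetical shape of a SiteIndexer invocation payload.
def build_crawl_arguments(url: str, table: str, queue: str = "firecrawl") -> dict:
    return {"url": url, "table": table, "queue": queue}

args = build_crawl_arguments("https://example.com", "site-index")
# A connected client would then pass these arguments via room.agents.invoke_tool.
```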

RagToolkit

Add RagToolkit(table="your_table") to a ChatBot, VoiceBot, or TaskRunner to surface the rows produced by either indexer. The toolkit exposes a rag_search tool that optionally accepts a custom embedder, letting you reuse the same model for both indexing and querying.
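Conceptually, a rag_search amounts to embedding the query with the same model used at indexing time, then ranking stored rows by vector similarity. A toy in-memory sketch of that idea (not the RagToolkit implementation; the row shape and cosine scoring are assumptions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_rag_search(query_vec: list[float], rows: list[dict], top_k: int = 2) -> list[dict]:
    # Rank indexed rows by similarity to the query embedding.
    return sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)[:top_k]

rows = [
    {"chunk": "cats purr", "embedding": [1.0, 0.0]},
    {"chunk": "stock prices", "embedding": [0.0, 1.0]},
]
best = toy_rag_search([0.9, 0.1], rows, top_k=1)
```

Because the toolkit accepts a custom embedder, you can guarantee the query vector and the stored vectors come from the same model, which is essential for meaningful similarity scores.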

Example

You typically host a StorageIndexer alongside a chatbot that uses RagToolkit. The example below shows a MarkitDown-powered indexer paired with a RAG-enabled ChatBot.

Prerequisites and Dependencies

  • An embedding provider (the default is an OpenAI embedder, but you can supply your own).
  • chonkie for semantic chunking (installed with the meshagent agents package).
  • Optional toolkits such as meshagent.markitdown or meshagent.firecrawl depending on which indexer you deploy.

Step 1: Create a RAG-based ChatBot and RAG tool

import asyncio
import logging
from meshagent.api import RequiredToolkit, RequiredSchema
from meshagent.agents.schemas.document import document_schema
from meshagent.tools.document_tools import (
    DocumentAuthoringToolkit,
    DocumentTypeAuthoringToolkit,
)
from meshagent.agents.chat import ChatBot
from meshagent.openai import OpenAIResponsesAdapter
from meshagent.openai.proxy import get_client
from meshagent.agents.indexer import RagToolkit, StorageIndexer, OpenAIEmbedder
from meshagent.api.services import ServiceHost
from meshagent.markitdown.tools import MarkItDownToolkit
from meshagent.tools import ToolContext
from meshagent.otel import otel_config

otel_config(service_name="storage-rag") 
log = logging.getLogger("storage-rag")

service = ServiceHost()

@service.path(path="/agent", identity="meshagent.chatbot.storage_rag")
class RagChatBot(ChatBot):
    def __init__(self):
        self._rag_toolkit = None
        super().__init__(
            title="Storage RAG chatbot",
        description="a simple chatbot that performs RAG; pair with an indexer",
            llm_adapter=OpenAIResponsesAdapter(),
            rules=[
                "after performing a rag search, do not include citations",
                "output document names MUST have the extension .document, automatically add the extension if it is not provided",
                "after opening a document, display it, before writing to it",
            ],
            requires=[
                RequiredToolkit(
                    name="ui", tools=["ask_user", "display_document", "show_toast"]
                ),
            ],
            toolkits=[
                MarkItDownToolkit(),
                DocumentAuthoringToolkit(),
                DocumentTypeAuthoringToolkit(
                    schema=document_schema, document_type="document"
                ),
            ],
            annotations=["chatbot", "rag"],
        )
    async def start(self, *, room):
        rag_toolkit = RagToolkit(
            table="rag-index",
            embedder=OpenAIEmbedder(
                size=3072,
                max_length=8191,
                model="text-embedding-3-large",
                openai=get_client(room=room),
            ),
        )
        self._rag_toolkit = rag_toolkit
        self._toolkits.append(rag_toolkit)
        await super().start(room=room)


@service.path(path="/indexer", identity="storage_indexer")
class MarkitDownFileIndexer(StorageIndexer):
    def __init__(
        self,
        *,
        title="storage indexer",
        description="watch storage and index any uploaded pdfs or office documents",
        annotations=["watchers", "rag"],
        chunker=None,
        embedder=None,
        table="rag-index",
    ):
        self._markitdown = MarkItDownToolkit()

        super().__init__(
            title=title,
            description=description,
            annotations=annotations,
            chunker=chunker,
            embedder=embedder,
            table=table,
        )

    async def read_file(self, *, path: str):
        context = ToolContext(
            room=self.room,
            caller=self.room.local_participant,
        )
        response = await self._markitdown.execute(
            context=context,
            name="markitdown_from_file",
            arguments={"path": path},
        )
        return getattr(response, "text", None)


asyncio.run(service.run())

Step 2: Run the service and test locally

Now that we’ve defined our services, let’s start the room and connect the chatbot and indexer.
meshagent setup # authenticate if not already
meshagent service run "main.py" --room=rag
Now we can interact with our agent in MeshAgent Studio
  1. Open the studio
  2. Go to the room rag
  3. Upload a document to the room for processing. The indexer will chunk and embed the document. You will see logs in your terminal related to the document processing.
  4. Ask the agent about the document once it’s been processed!

Step 3: Package and deploy the service

To deploy your RAG Indexer and RAG agent permanently, you’ll package your code with a meshagent.yaml file that defines the service configuration and a container image that MeshAgent can run. For full details on the service spec and deployment flow, see Packaging Services and Deploying Services. MeshAgent supports two deployment patterns for containers:
  1. Runtime image + code mount (recommended): Use a pre-built MeshAgent runtime image (like python-sdk-slim) that contains Python and all MeshAgent dependencies. Mount your lightweight code-only image on top. This keeps your code image tiny (~KB), eliminates dependency installation time, and allows your service to start quickly.
  2. Single Image: Bundle your code and all dependencies into one image. This is good when you need to install additional libraries, but can result in larger images and slower pulls. If you build your own images we recommend optimizing them with eStargz.
This example uses the runtime image + code mount pattern with the public python-docs-examples code image so you can run the documentation sample without building your own image. If you build your own code image, follow the steps below and update the storage.images entry in meshagent.yaml.

Prepare your project structure

This example organizes the agent code and configuration in the same folder, making each agent self-contained:
your-project/
├── Dockerfile                    # Shared by all samples
├── storage_rag/
│   ├── indexer.py
│   └── meshagent.yaml           # Config specific to this sample
└── another_sample/              # Other samples follow same pattern
    ├── another_sample.py
    └── meshagent.yaml
Note: If you’re building a single agent, you only need the storage_rag/ folder. The structure shown supports multiple samples sharing one Dockerfile.
Step 3a: Build a Docker container

If you want a code-only image, create a scratch Dockerfile and copy the files you want to run. This creates a minimal image that pairs with the runtime image + code mount pattern.
FROM scratch

COPY . /
Build and push the image with docker buildx:
docker buildx build . \
  -t "<REGISTRY>/<NAMESPACE>/<IMAGE_NAME>:<TAG>" \
  --platform linux/amd64 \
  --push
Note: Building from the project root copies your entire project structure into the image. For a single agent, this is fine: your image will just contain one folder. For multi-agent projects, all agents will be in one image, but each can deploy independently using its own meshagent.yaml.
Step 3b: Package the service

Define the service configuration in a meshagent.yaml file. This will be used to deploy the indexer and RAG ChatBot.
kind: Service
version: v1
metadata:
  name: storage-rag
  description: "A storage RAG chatbot with a paired storage indexer"
  annotations:
    meshagent.service.id: "storage-rag"
agents:
  - name: meshagent.chatbot.storage_rag
    description: "A chatbot that uses RAG over room storage"
    annotations:
      meshagent.agent.type: "ChatBot"
  - name: storage_indexer
    description: "Watches storage and indexes documents for RAG"
    annotations:
      meshagent.agent.type: "Indexer"
ports:
- num: "*"
  endpoints:
  - path: /agent
    meshagent:
      identity: meshagent.chatbot.storage_rag
  - path: /indexer
    meshagent:
      identity: storage_indexer
container:
  image: "us-central1-docker.pkg.dev/meshagent-public/images/python-sdk:{SERVER_VERSION}-esgz"
  command: python /src/storage_rag/indexer.py
  storage:
    images:
      # Replace this image tag with your own code-only image if you build one.
      - image: "us-central1-docker.pkg.dev/meshagent-public/images/python-docs-examples:{SERVER_VERSION}"
        path: /src
        read_only: true

How the paths work:
  • Your code image contains storage_rag/indexer.py
  • It’s mounted at /src in the runtime container
  • The command runs python /src/storage_rag/indexer.py
Note: The default YAML in the docs uses us-central1-docker.pkg.dev/meshagent-public/images/python-docs-examples so you can test this example immediately without building your own image first. Replace this with your actual image tag when deploying your own code.
Step 3c: Deploy the service

Next, from the CLI, in the directory containing your meshagent.yaml file, run:
meshagent service create --file "meshagent.yaml" --room=quickstart
Now the agent and indexer will always be available for us to use!

Next Steps

  • Understand other MeshAgent agents
  • Learn more about deploying agents with MeshAgent