
Building a Production-Ready RAG Knowledge Base with n8n, Qdrant, and Ollama

Rohit Vakhariya

Introduction

Retrieval-Augmented Generation (RAG) has become the most reliable way to build AI systems that answer questions accurately using your own documents. Instead of relying only on an LLM’s training data, RAG retrieves relevant knowledge from a vector database and injects it into the prompt at runtime.

In this article, we’ll walk through a production-ready RAG architecture built using:

  • n8n for orchestration
  • Qdrant for vector storage and semantic search
  • Ollama for embeddings and LLM inference

This setup handles PDFs, Word documents, and website content, and can be deployed fully on-premise or in Docker.

What Problem Does RAG Solve?

Large Language Models (LLMs) are powerful, but they have limitations:

  • They don’t know your private documents
  • They may hallucinate answers
  • They can’t stay up-to-date with changing data

RAG solves this by:

  1. Storing your documents as vector embeddings
  2. Searching them semantically at query time
  3. Feeding only the most relevant chunks into the LLM

High-Level RAG Architecture

The complete flow looks like this:

  1. User asks a question
  2. Question is converted into an embedding
  3. Similar document chunks are retrieved from Qdrant
  4. Retrieved chunks are injected into the LLM prompt
  5. LLM answers strictly based on provided context

This keeps answers grounded in your documents, traceable to their sources, and fully under your control.

Why n8n + Qdrant + Ollama?

n8n (Workflow Orchestration)

  • Visual, low-code pipeline
  • Easy HTTP integrations
  • Perfect for RAG chaining

Qdrant (Vector Database)

  • Fast cosine similarity search
  • Rich metadata filtering
  • Ideal for multi-tenant knowledge bases

Ollama (Local AI Runtime)

  • Run LLMs and embedding models locally
  • No external API dependency
  • Works great with models like llama3.1 and nomic-embed-text

Step-by-Step Workflow Breakdown

1. Webhook: Accept the User Question

The workflow starts with an HTTP webhook that receives:

{
  "message": "What is the invoice approval workflow?"
}

The webhook is configured with:

Response Mode: Last Node

This means the final node automatically returns the response, keeping the workflow clean and warning-free.
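
For reference, a client call to this webhook could look like the sketch below. The URL path is illustrative; use whatever path your n8n Webhook node generates (n8n listens on port 5678 by default).

import requests

# Hypothetical webhook URL -- replace with the production URL shown in your n8n Webhook node
WEBHOOK_URL = "http://localhost:5678/webhook/rag-chat"

resp = requests.post(WEBHOOK_URL, json={"message": "What is the invoice approval workflow?"})
print(resp.json())  # the answer produced by the last node in the workflow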

2. Input Normalization

The user input is normalized into a clean field:

  • question

This avoids confusion later when passing data between nodes.
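
Conceptually, this step does nothing more than the following (a sketch of the mapping, not actual n8n node code):

# The webhook body is reduced to a single clean field named "question"
body = {"message": "What is the invoice approval workflow?"}
question = (body.get("message") or "").strip()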

3. Generate Embeddings with Ollama

The question is sent to Ollama’s embeddings API using a single, consistent model:

{
  "model": "nomic-embed-text",
  "prompt": "<user question>"
}

⚠️ Using the same embedding model for both ingestion and search is mandatory for accurate results.
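
Outside of n8n, the equivalent HTTP call is sketched below, assuming Ollama is running locally on its default port (11434) with the nomic-embed-text model pulled:

import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port; adjust if running in Docker

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]}
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

question_vector = embed("What is the invoice approval workflow?")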

4. Semantic Search in Qdrant

The generated embedding is used to search Qdrant:

{
  "vector": [embedding],
  "limit": 5,
  "with_payload": true
}

Each matched point contains:

  • The vector similarity score
  • A rich payload with document metadata
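
In n8n this is an HTTP Request node; the equivalent raw call against Qdrant's REST API is sketched below. The collection name is illustrative, and Qdrant is assumed to be on its default REST port (6333):

import requests

QDRANT_URL = "http://localhost:6333"  # default Qdrant REST port
COLLECTION = "knowledge_base"         # hypothetical collection name

def search(vector: list[float], limit: int = 5) -> list[dict]:
    # Each hit in "result" carries a similarity score and the stored payload
    resp = requests.post(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
        json={"vector": vector, "limit": limit, "with_payload": True},
    )
    resp.raise_for_status()
    return resp.json()["result"]

hits = search(question_vector)  # question_vector from the embedding step above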

5. Payload Design (The Most Important Part)

Each vector stored in Qdrant should include actual chunk text:

{
  "payload": {
    "text": "Approval requests must be reviewed by the finance team...",
    "title": "Invoice Approval Workflow",
    "section": "Approval Rules",
    "source_type": "pdf",
    "source_name": "invoice-guide.pdf",
    "document_id": "doc_101",
    "chunk_id": 4
  }
}

✅ payload.text is what gets sent to the LLM
❌ Storing only titles or summaries breaks RAG accuracy
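
At ingestion time, each chunk is upserted into Qdrant with this payload attached. A minimal sketch of that upsert call is shown below (the point ID and collection name are illustrative; embed() is the helper from the embedding step, using the same model as at query time):

import requests

QDRANT_URL = "http://localhost:6333"
COLLECTION = "knowledge_base"  # hypothetical collection name

chunk_text = "Approval requests must be reviewed by the finance team..."
chunk_embedding = embed(chunk_text)  # same nomic-embed-text model used for queries

point = {
    "id": 104,  # illustrative point ID
    "vector": chunk_embedding,
    "payload": {
        "text": chunk_text,
        "title": "Invoice Approval Workflow",
        "section": "Approval Rules",
        "source_type": "pdf",
        "source_name": "invoice-guide.pdf",
        "document_id": "doc_101",
        "chunk_id": 4,
    },
}

resp = requests.put(f"{QDRANT_URL}/collections/{COLLECTION}/points", json={"points": [point]})
resp.raise_for_status()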

6. Build Structured RAG Context

Instead of dumping raw text, the workflow builds structured context:

Source: invoice-guide.pdf
Title: Invoice Approval Workflow
Content:
Approval requests must be reviewed by the finance team...

Chunks are separated using delimiters to preserve meaning and readability.
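
A sketch of how the retrieved hits can be assembled into that structure (field names match the payload design above; the delimiter is a simple choice, not a requirement):

def build_context(hits: list[dict]) -> str:
    blocks = []
    for hit in hits:
        p = hit["payload"]
        blocks.append(
            f"Source: {p['source_name']}\n"
            f"Title: {p['title']}\n"
            f"Content:\n{p['text']}"
        )
    # A visible delimiter keeps chunk boundaries clear for the LLM
    return "\n\n---\n\n".join(blocks)

context = build_context(hits)  # hits from the Qdrant search step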

7. Generate the Final Answer with Ollama

The final prompt enforces strict grounding:

Use ONLY the context below to answer the question.
If the answer is not present, say:
"I don't know based on the provided information."

This sharply reduces hallucinations and produces answers you can audit against the source documents.
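
The equivalent raw call to Ollama's generate endpoint, with the grounding instruction baked into the prompt, is sketched below (the model name and prompt wording can be adapted to your setup; "context" is the structured context built in the previous step):

import requests

OLLAMA_URL = "http://localhost:11434"

question = "What is the invoice approval workflow?"
prompt = f"""Use ONLY the context below to answer the question.
If the answer is not present, say:
"I don't know based on the provided information."

Context:
{context}

Question: {question}
Answer:"""

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama3.1", "prompt": prompt, "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])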

Why This Workflow Is Production-Ready

✔ No unused nodes
✔ Grounded answers instead of hallucinations
✔ Fully local & private
✔ Metadata-aware retrieval
✔ Easy to scale (multi-tenant / multi-collection)
✔ Works for PDFs, DOCX, and websites

Common Mistakes This Design Avoids

  • ❌ Embedding entire documents as one vector
  • ❌ Not storing chunk text in Qdrant
  • ❌ Mixing embedding models
  • ❌ Sending raw Qdrant JSON to the LLM
  • ❌ Returning responses without grounding

Recommended Enhancements

Once this base is live, you can easily add:

  • Source citations in answers
  • Document-level or tenant-level filters
  • Hybrid search (vector + keyword)
  • Semantic chunking at ingestion time
  • Retry and fallback logic

Conclusion

By combining n8n, Qdrant, and Ollama, you get a powerful, cost-effective, and private RAG system that works reliably in real production environments.

This architecture is ideal for:

  • Internal knowledge bases
  • SaaS help centers
  • Legal / financial document search
  • AI chatbots grounded in real data

If you’re building a serious AI knowledge system, this stack gives you full control without sacrificing accuracy.

