
Building a Production-Ready RAG Knowledge Base with n8n, Qdrant, and Ollama

Rohit Vakhariya

Introduction

Retrieval-Augmented Generation (RAG) has become the most reliable way to build AI systems that answer questions accurately using your own documents. Instead of relying only on an LLM’s training data, RAG retrieves relevant knowledge from a vector database and injects it into the prompt at runtime.

In this article, we’ll walk through a production-ready RAG architecture built using:

  • n8n for orchestration
  • Qdrant for vector storage and semantic search
  • Ollama for embeddings and LLM inference

This setup handles PDFs, Word documents, and website content, and can be deployed fully on-premise or in Docker.

What Problem Does RAG Solve?

Large Language Models (LLMs) are powerful, but they have limitations:

  • They don’t know your private documents
  • They may hallucinate answers
  • They can’t stay up-to-date with changing data

RAG solves this by:

  1. Storing your documents as vector embeddings
  2. Searching them semantically at query time
  3. Feeding only the most relevant chunks into the LLM

High-Level RAG Architecture

The complete flow looks like this:

  1. User asks a question
  2. Question is converted into an embedding
  3. Similar document chunks are retrieved from Qdrant
  4. Retrieved chunks are injected into the LLM prompt
  5. LLM answers strictly based on provided context

This keeps answers grounded in your documents, traceable to their sources, and fully under your control.

Why n8n + Qdrant + Ollama?

n8n (Workflow Orchestration)

  • Visual, low-code pipeline
  • Easy HTTP integrations
  • Perfect for RAG chaining

Qdrant (Vector Database)

  • Fast cosine similarity search
  • Rich metadata filtering
  • Ideal for multi-tenant knowledge bases

Ollama (Local AI Runtime)

  • Run LLMs and embedding models locally
  • No external API dependency
  • Works great with models like llama3.1 and nomic-embed-text

Step-by-Step Workflow Breakdown

1. Webhook: Accept the User Question

The workflow starts with an HTTP webhook that receives:

{
  "message": "What is the invoice approval workflow?"
}

The webhook is configured with:

Response Mode: Last Node

This means the final node automatically returns the response, keeping the workflow clean and warning-free.
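
For reference, a client call to this webhook could look like the sketch below. The URL path is illustrative; use whatever path your n8n Webhook node generates (n8n listens on port 5678 by default).

import requests

# Hypothetical webhook URL -- replace with the production URL shown in your n8n Webhook node
WEBHOOK_URL = "http://localhost:5678/webhook/rag-chat"

resp = requests.post(WEBHOOK_URL, json={"message": "What is the invoice approval workflow?"})
print(resp.json())  # the answer produced by the last node in the workflow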

2. Input Normalization

The user input is normalized into a clean field:

  • question

This avoids confusion later when passing data between nodes.
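
Conceptually, this step does nothing more than the following (a sketch of the mapping, not actual n8n node code):

# The webhook body is reduced to a single clean field named "question"
body = {"message": "What is the invoice approval workflow?"}
question = (body.get("message") or "").strip()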

3. Generate Embeddings with Ollama

The question is sent to Ollama’s embeddings API using a single, consistent model:

{
  "model": "nomic-embed-text",
  "prompt": "<user question>"
}

⚠️ Using the same embedding model for both ingestion and search is mandatory for accurate results.
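
Outside of n8n, the equivalent HTTP call is sketched below, assuming Ollama is running locally on its default port (11434) with the nomic-embed-text model pulled:

import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port; adjust if running in Docker

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]}
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

question_vector = embed("What is the invoice approval workflow?")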

4. Semantic Search in Qdrant

The generated embedding is used to search Qdrant:

{
  "vector": [embedding],
  "limit": 5,
  "with_payload": true
}

Each matched point contains:

  • The vector similarity score
  • A rich payload with document metadata
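
In n8n this is an HTTP Request node; the equivalent raw call against Qdrant's REST API is sketched below. The collection name is illustrative, and Qdrant is assumed to be on its default REST port (6333):

import requests

QDRANT_URL = "http://localhost:6333"  # default Qdrant REST port
COLLECTION = "knowledge_base"         # hypothetical collection name

def search(vector: list[float], limit: int = 5) -> list[dict]:
    # Each hit in "result" carries a similarity score and the stored payload
    resp = requests.post(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
        json={"vector": vector, "limit": limit, "with_payload": True},
    )
    resp.raise_for_status()
    return resp.json()["result"]

hits = search(question_vector)  # question_vector from the embedding step above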

5. Payload Design (The Most Important Part)

Each vector stored in Qdrant should include actual chunk text:

{
  "payload": {
    "text": "Approval requests must be reviewed by the finance team...",
    "title": "Invoice Approval Workflow",
    "section": "Approval Rules",
    "source_type": "pdf",
    "source_name": "invoice-guide.pdf",
    "document_id": "doc_101",
    "chunk_id": 4
  }
}

✅ payload.text is what gets sent to the LLM
❌ Storing only titles or summaries breaks RAG accuracy
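
At ingestion time, each chunk is upserted into Qdrant with this payload attached. A minimal sketch of that upsert call is shown below (the point ID and collection name are illustrative; embed() is the helper from the embedding step, using the same model as at query time):

import requests

QDRANT_URL = "http://localhost:6333"
COLLECTION = "knowledge_base"  # hypothetical collection name

chunk_text = "Approval requests must be reviewed by the finance team..."
chunk_embedding = embed(chunk_text)  # same nomic-embed-text model used for queries

point = {
    "id": 104,  # illustrative point ID
    "vector": chunk_embedding,
    "payload": {
        "text": chunk_text,
        "title": "Invoice Approval Workflow",
        "section": "Approval Rules",
        "source_type": "pdf",
        "source_name": "invoice-guide.pdf",
        "document_id": "doc_101",
        "chunk_id": 4,
    },
}

resp = requests.put(f"{QDRANT_URL}/collections/{COLLECTION}/points", json={"points": [point]})
resp.raise_for_status()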

6. Build Structured RAG Context

Instead of dumping raw text, the workflow builds structured context:

Source: invoice-guide.pdf
Title: Invoice Approval Workflow
Content:
Approval requests must be reviewed by the finance team...

Chunks are separated using delimiters to preserve meaning and readability.
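
A sketch of how the retrieved hits can be assembled into that structure (field names match the payload design above; the delimiter is a simple choice, not a requirement):

def build_context(hits: list[dict]) -> str:
    blocks = []
    for hit in hits:
        p = hit["payload"]
        blocks.append(
            f"Source: {p['source_name']}\n"
            f"Title: {p['title']}\n"
            f"Content:\n{p['text']}"
        )
    # A visible delimiter keeps chunk boundaries clear for the LLM
    return "\n\n---\n\n".join(blocks)

context = build_context(hits)  # hits from the Qdrant search step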

7. Generate the Final Answer with Ollama

The final prompt enforces strict grounding:

Use ONLY the context below to answer the question.
If the answer is not present, say:
"I don't know based on the provided information."

This sharply reduces hallucinations and produces answers you can audit against the source documents.
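
The equivalent raw call to Ollama's generate endpoint, with the grounding instruction baked into the prompt, is sketched below (the model name and prompt wording can be adapted to your setup; "context" is the structured context built in the previous step):

import requests

OLLAMA_URL = "http://localhost:11434"

question = "What is the invoice approval workflow?"
prompt = f"""Use ONLY the context below to answer the question.
If the answer is not present, say:
"I don't know based on the provided information."

Context:
{context}

Question: {question}
Answer:"""

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama3.1", "prompt": prompt, "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])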

Why This Workflow Is Production-Ready

✔ No unused nodes
✔ Grounded answers instead of hallucinations
✔ Fully local & private
✔ Metadata-aware retrieval
✔ Easy to scale (multi-tenant / multi-collection)
✔ Works for PDFs, DOCX, and websites

Common Mistakes This Design Avoids

  • ❌ Embedding entire documents as one vector
  • ❌ Not storing chunk text in Qdrant
  • ❌ Mixing embedding models
  • ❌ Sending raw Qdrant JSON to the LLM
  • ❌ Returning responses without grounding

Recommended Enhancements

Once this base is live, you can easily add:

  • Source citations in answers
  • Document-level or tenant-level filters
  • Hybrid search (vector + keyword)
  • Semantic chunking at ingestion time
  • Retry and fallback logic

Conclusion

By combining n8n, Qdrant, and Ollama, you get a powerful, cost-effective, and private RAG system that works reliably in real production environments.

This architecture is ideal for:

  • Internal knowledge bases
  • SaaS help centers
  • Legal / financial document search
  • AI chatbots grounded in real data

If you’re building a serious AI knowledge system, this stack gives you full control without sacrificing accuracy.

