How I combined Neo4j, pgvector, and Gemma4:31b to solve the local document management nightmare — and the technical dead-ends I hit along the way.
Introduction
Over the past few years, I have followed the evolution of Generative AI closely, experimenting extensively with tools like Copilot, Gemini, and Claude. While academic publications provided a solid theoretical foundation, my studies in Applied Machine Learning at EPFL taught me one fundamental truth: true insight into AI’s possibilities and limitations only materializes through hands-on implementation.
I set out to find a project that would force me to navigate the entire Generative AI pipeline, exploring Prompt Engineering, Retrieval-Augmented Generation (RAG), and Knowledge Graphs through a practical, unforgiving lens.
Simultaneously, I was losing a very personal battle: keeping track of my local documents. Amidst a constant influx of PDFs, managing insurance policies, notice periods, and medical reports felt impossible. Quickly finding a specific clause from a two-year-old contract was a nightmare.
Generative AI was the obvious solution, but it posed a massive dilemma. Was I genuinely willing to hand over a lifetime of sensitive, unredacted personal data to “Over-the-Top” players like OpenAI or Google? The answer was no. Thus, Project Citadel was born.
In this article, I will walk you through the architecture of my completely sovereign, entirely local Hybrid GraphRAG document management system. I will detail the transition from simple semantic search to a dual-engine vector/graph pipeline, explain the sequential prompting logic that eliminated hallucinations, and share the engineering dead-ends I encountered.
The Hardware Strategy: The Compute vs. Memory Wall
In local AI communities, discussions almost exclusively revolve around inference speed (tokens per second). For Project Citadel, I discarded that metric and prioritized Model Capacity.
While a standard 32GB setup executes small models swiftly, it hits a hard “memory wall” when workflows require high-level reasoning, complex JSON extraction, and strict ontology adherence. The foundational decision of this project was securing a massive memory ceiling. By upgrading an AMD Mini-PC (GEEKOM A9 Max, AMD Ryzen™ AI 9 HX 370), I established a 128GB memory foundation for under $2,500. This allowed me to host nuanced, 31B+ parameter models (like gemma4:31b) and keep a complex orchestration pipeline (Neo4j, PostgreSQL, n8n) loaded in memory without disk-swapping crashes.
Solving the Latency Bottleneck: Initially, while background ingestion was stable, answer speeds in the chat frontend could take up to 5 minutes. To tackle this, I introduced a hybrid compute model. I connected an eGPU dock via USB4 and slotted in a used Nvidia RTX 3090. This single 24GB VRAM upgrade completely transformed the pipeline, handling heavy visual PDF parsing and reducing inference times from minutes to seconds. Meanwhile, the 128GB system RAM remains the bedrock for keeping the databases highly stable.
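The capacity argument can be made concrete with a back-of-envelope calculation. The sketch below estimates weight storage only (parameters times bits per weight). The quantization levels are illustrative assumptions, not the settings used in Project Citadel, and the KV cache and runtime overhead come on top of these figures.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Illustrative precisions only; the article does not state the quantization used.
    for bits in (16, 8, 4):
        print(f"31B parameters at {bits}-bit: ~{weight_memory_gb(31, bits):.1f} GB")
```

Under these assumptions, a 31B model needs roughly 62 GB at 16-bit, 31 GB at 8-bit, and 15.5 GB at 4-bit for weights alone, which is why a large RAM ceiling and a 24GB card complement each other rather than compete.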
Architectural Overview: The Pillars of the Citadel
I designed the architecture around distinct functional modules, orchestrated entirely by n8n.
Phase 1: The Gatekeeper (Ingestion & Classification)
The Gatekeeper intercepts all incoming PDFs, ensuring only high-quality, structured data enters the databases.
- Deduplication: The system calculates a cryptographic hash of each PDF and skips files whose hash is already known, preventing duplicate entries.
- Semantic Extraction: I employ IBM Docling for text extraction. This was a critical pivot: traditional OCR dumps unstructured strings, while Vision LLMs are slow and hallucinate. Docling utilizes the RTX 3090 to extract the semantic structure, translating tables and paragraphs into clean Markdown.
- Tamper-Proof Backup: The .md files sync directly to a local Obsidian vault, establishing a human-readable backup independent of database logic.
- Metadata Extraction: gemma4:31b classifies the document. To ensure absolute consistency, I mandate a strict JSON response format via prompt engineering, mapping documents to strict dictionary categories.
JSON
{
"date": "YYYY-MM-DD",
"subject": "String",
"sender": "String",
"document_type": "String",
"suggested_folder": "String",
"suggested_filename": "String",
"summary": "String"
}
Before any data reaches the databases, it passes through a simple Budibase frontend where I validate the extracted metadata, ensuring only verified content is pushed forward.
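A prompt alone cannot guarantee that the model returns the schema, so a cheap programmatic check before the manual review step can catch malformed responses early. The following is a minimal sketch, not the workflow's actual n8n logic: the key set mirrors the JSON above, while `ALLOWED_TYPES` is a hypothetical category dictionary invented for illustration.

```python
import json
from datetime import date

REQUIRED_KEYS = {
    "date", "subject", "sender", "document_type",
    "suggested_folder", "suggested_filename", "summary",
}
# Hypothetical category dictionary; the real list is not given in the article.
ALLOWED_TYPES = {"insurance", "contract", "medical_report", "invoice", "other"}


def validate_metadata(raw: str) -> dict:
    """Parse an LLM response and reject anything that deviates from the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    extra = data.keys() - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"missing keys: {sorted(missing)}, unexpected keys: {sorted(extra)}")
    for key in REQUIRED_KEYS:
        if not isinstance(data[key], str) or not data[key].strip():
            raise ValueError(f"'{key}' must be a non-empty string")
    date.fromisoformat(data["date"])  # raises ValueError on a malformed date
    if data["document_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown document_type: {data['document_type']}")
    return data
```

A response that fails this check could be sent back to the model for a retry, so that only structurally valid records reach the human validation screen.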
Key Learnings from Phase 1:
- Structure is king: File naming conventions and a rigidly defined folder structure are critical for consistency.
- For this workload, semantic text extraction (Docling) proved more reliable than both classic OCR and vision models.
- A Unique ID is essential to reference documents across the Knowledge Graph, Vector Database, and physical file location.
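One way to get both deduplication and a stable cross-system identifier is to derive the ID from the file contents. The sketch below assumes SHA-256 and reuses the digest as the `doc_id`. Both choices are assumptions, since the article only mentions "a cryptographic hash" and a unique ID, and the in-memory `DocumentRegistry` is a hypothetical stand-in for a database table.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large PDFs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


class DocumentRegistry:
    """Hypothetical in-memory registry; a real pipeline would persist this."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # doc_id -> first path seen

    def register(self, path: str) -> tuple[str, bool]:
        """Return (doc_id, is_new). Identical bytes always map to the same doc_id."""
        doc_id = file_sha256(path)
        is_new = doc_id not in self._seen
        if is_new:
            self._seen[doc_id] = path
        return doc_id, is_new
```

Because the ID depends only on the bytes, the same value can be written to the graph node, the chunk metadata, and the file index, and a renamed copy of an existing PDF is still recognized as a duplicate.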
Phase 2: The Librarian (Dual-Engine Storage)
Relying on a single database limits retrieval capabilities. Therefore, I engineered a dual-engine architecture: pgvector for semantic prose search, and Neo4j for structured entity relationships.
1. The High-Precision Vector Engine (pgvector)
This operates on three core principles:
- The “Anti-Orphan” Strategy: In standard RAG, chunk #450 loses the context mentioned on page one. I explicitly stamp every text chunk with its relational identity (Metadata Repetition). Consequently, queries about “cholesterol” never conflate my results with my wife’s.
- Semantic Context Preservation: Rigidly slicing characters destroys meaning. I utilize a Recursive Character Text Splitter that splits at natural boundaries (paragraphs, newlines), preserving the semantic unit for accurate embeddings.
- Hybrid Meta-Filtering: Because chunks carry hard metadata, the system can filter via SQL first:
SQL
SELECT chunk_text, embedding <=> '[...]' AS distance
FROM vectors
WHERE metadata->>'subject' = 'Denis'
AND (metadata->>'date')::date > '2024-01-01'
ORDER BY distance ASC LIMIT 5;
This merges the precision of relational databases with semantic intelligence.
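The first two principles, splitting at natural boundaries and stamping each chunk with its identity, can be sketched in plain Python. This is a simplified stand-in for a recursive character text splitter, not the library implementation used in the project. The separator order, the 500-character limit, the header format, and the `chunk_index` field are all assumptions made for illustration.

```python
def recursive_split(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    """Split at the coarsest separator first; only oversized parts fall through
    to finer separators. The separator is dropped at chunk boundaries."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:  # no natural boundary left: hard cut as a last resort
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part.strip():
            continue
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) > max_chars:
            chunks.extend(recursive_split(part, max_chars, finer))
        else:
            current = part
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


def stamp_chunks(chunks, metadata):
    """Repeat the document identity in every chunk's text and metadata."""
    header = " | ".join(f"{key}: {value}" for key, value in metadata.items())
    return [
        {"chunk_text": f"[{header}]\n{chunk}",
         "metadata": dict(metadata, chunk_index=index)}
        for index, chunk in enumerate(chunks)
    ]
```

Putting the header into the embedded text steers the embedding itself, while the duplicate copy in the metadata is what makes SQL pre-filtering such as the query above possible.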
2. The Knowledge Graph (Neo4j) & The 3-Step Reflection Pipeline
While vectors handle prose, Neo4j serves as the “Record of Truth” for hard facts. Building this involved significant friction. Initially, asking a model to directly output JSON nodes resulted in a “Graph Hairball” of hallucinated edges. Furthermore, I hit the “Model Laziness” limitation: confronted with dense medical reports, the model would extract the first few entities perfectly and then stop.
To rectify both syntax errors and omissions, I engineered a strict 3-Step Agentic Reflection Pipeline:
Step A: Initial Extraction. The LLM parses the document and outputs a freeform Markdown list of entities. This allows the model to utilize its full cognitive capacity without battling JSON syntax. I strictly separate attributes (identity-defining data) from observation nodes (variable measurements).
Step B: The QA Auditor. This is the cure for LLM laziness. A second agent acts as an Auditor, comparing the original text with the extraction to find the delta (missing entities or relationships).
Step C: Schema Alignment. Finally, a deterministic JSON parser merges the initial extraction with the supplemental QA data, forcing it into a strict JSON ontology ready for Cypher injection.
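The control flow of the three steps can be sketched as follows. The two LLM calls are passed in as plain callables so the example runs without a model, and the `- Label: name` line format is a hypothetical convention chosen for illustration, since the article does not show the actual Markdown layout or prompts.

```python
import json
import re
from typing import Callable

# Hypothetical line format for the freeform list: "- Label: entity name"
ENTITY_LINE = re.compile(r"^\s*[-*]\s*(?P<label>[A-Za-z_]+)\s*:\s*(?P<name>.+?)\s*$")


def parse_entity_list(markdown: str) -> list[dict]:
    """Deterministically turn Markdown list lines into entity dicts."""
    entities = []
    for line in markdown.splitlines():
        match = ENTITY_LINE.match(line)
        if match:
            entities.append({"label": match["label"], "name": match["name"]})
    return entities


def reflect_and_extract(
    document: str,
    extract: Callable[[str], str],
    audit: Callable[[str, str], str],
) -> str:
    first_pass = extract(document)       # Step A: freeform Markdown extraction
    delta = audit(document, first_pass)  # Step B: auditor lists what was missed
    merged, seen = [], set()
    # Step C: no LLM involved; merge, de-duplicate, and emit strict JSON
    for entity in parse_entity_list(first_pass) + parse_entity_list(delta):
        key = (entity["label"].lower(), entity["name"].lower())
        if key not in seen:
            seen.add(key)
            merged.append(entity)
    return json.dumps({"nodes": merged}, indent=2)
```

Keeping Step C free of model calls means a syntax error can only originate in the parser, which is far easier to test than a prompt.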
Why a Strict Ontology is the Backbone of GraphRAG: Limiting labels normalizes data at ingestion and prevents schema drift. A highly granular ontology forces the LLM to hold too many rules, leading to decision fatigue and failed downstream Cypher queries. By capturing high-level categories as nodes and granular nuances as properties, you keep the graph simple without discarding detail.
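A small label whitelist is easy to enforce in code before anything is written to Neo4j. In the sketch below, the label set and the alias table are hypothetical examples of "categories as nodes, nuances as properties", not the project's real ontology. Cypher cannot take a label as a query parameter, which is why the label is checked against the whitelist before being interpolated, while all values travel as parameters.

```python
# Hypothetical ontology: a handful of coarse labels.
ALLOWED_LABELS = {"Person", "Organization", "Document", "Observation"}
# Fine-grained labels the LLM may invent, folded into (coarse label, kind property).
LABEL_ALIASES = {
    "BloodTest": ("Observation", "blood_test"),
    "Insurer": ("Organization", "insurer"),
}


def normalize(node: dict) -> dict:
    """Map a raw extracted node onto the strict ontology or reject it."""
    label = node["label"]
    props = dict(node.get("properties", {}))
    if label in LABEL_ALIASES:
        label, kind = LABEL_ALIASES[label]
        props["kind"] = kind  # the nuance survives as a property
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label not in ontology: {label}")
    return {"label": label, "name": node["name"], "properties": props}


def to_cypher(node: dict) -> tuple[str, dict]:
    """Build a parameterized MERGE statement for a normalized node."""
    n = normalize(node)
    # Safe to interpolate: the label has just been checked against the whitelist.
    query = f"MERGE (n:{n['label']} {{name: $name}}) SET n += $props"
    return query, {"name": n["name"], "props": n["properties"]}
```

The returned pair can be handed to any Neo4j client that accepts a query string plus a parameter map.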
Phase 3: The Navigator (Taming Agentic Routing)
Orchestrating how a 31B-parameter LLM interacts with a dual-engine memory system proved to be the most complex challenge. In my early attempts, giving the LLM autonomous access to the databases and a web browser led to chaos: the model would crash the Graph with vague queries or suffer from “Attention Dilution” when flooded with massive vector results.
Instead of autonomous guessing, I engineered a highly disciplined AI Agent using strict Prompt Engineering and structured Tool Calling.
- PATH A (Web Search Mode): Triggered explicitly by the user, the Agent bypasses the internal Citadel and uses Brave_Search to fetch live data, keeping internal databases secure from irrelevant noise.
- PATH B (Internal Data Mode): The Agent is forced into a multi-step framework.
- Step 1 (Graph Anchor): The agent queries Neo4j first. The Graph acts as the gatekeeper of truth, returning exact mathematical nodes and the specific Document UID (doc_id).
System Prompt:
- Step 2 (Bifurcated Vector Retrieval): Armed with context, the Agent dynamically queries pgvector. It chooses between broad_search (a traditional Top-K search for general queries) and deep_dive. In deep_dive, it uses the doc_id as a hard SQL filter to retrieve the entirety of a specific document, allowing the 31B model to read it front-to-back without noise.
- Step 3 (Synthesis & Linking): The agent synthesizes the structured skeleton with the semantic text and appends clickable Obsidian URIs to every extracted fact.
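The routing steps above can be sketched as three small functions. In the real system the agent makes these choices through tool calling. The deterministic rule used here (exactly one document from the graph means deep_dive) is a simplifying assumption, as are the `chunk_index` metadata field and the vault name. The table and column names follow the SQL example earlier in the article, and the queries are only built, not executed.

```python
from urllib.parse import quote


def plan_retrieval(web_mode: bool, graph_hits: list[dict]) -> str:
    """PATH A if the user asked for web search, otherwise pick a vector mode."""
    if web_mode:
        return "web_search"
    doc_ids = {hit["doc_id"] for hit in graph_hits}
    # Assumed rule: a single anchored document justifies reading it in full.
    return "deep_dive" if len(doc_ids) == 1 else "broad_search"


def build_vector_query(mode, *, query_embedding=None, doc_id=None, top_k=5):
    """Return (sql, params) for a DB-API style driver using %s placeholders."""
    if mode == "deep_dive":
        if doc_id is None:
            raise ValueError("deep_dive requires a doc_id")
        sql = ("SELECT chunk_text FROM vectors "
               "WHERE metadata->>'doc_id' = %s "
               "ORDER BY (metadata->>'chunk_index')::int")
        return sql, (doc_id,)
    if mode == "broad_search":
        if query_embedding is None:
            raise ValueError("broad_search requires a query embedding")
        vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
        sql = ("SELECT chunk_text, embedding <=> %s::vector AS distance "
               "FROM vectors ORDER BY distance ASC LIMIT %s")
        return sql, (vector_literal, top_k)
    raise ValueError(f"unknown mode: {mode}")


def obsidian_uri(vault: str, note_path: str) -> str:
    """Build an obsidian://open link with URL-encoded vault and file path."""
    return f"obsidian://open?vault={quote(vault, safe='')}&file={quote(note_path, safe='')}"
```

For example, `obsidian_uri("Citadel", "Insurance/policy 2024.md")` yields `obsidian://open?vault=Citadel&file=Insurance%2Fpolicy%202024.md`, which can be appended to a synthesized fact as its source link.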
Conclusion: Sovereignty is No Longer a Luxury
Project Citadel started as a desperate attempt to organize my personal life, but it evolved into a definitive proof of concept: absolute data sovereignty is no longer restricted by consumer hardware limitations. We no longer have to trade our most sensitive financial, medical, and legal histories to Big Tech ecosystems in exchange for utility.
By strategically bifurcating the workload — leveraging 128GB of system RAM for complex orchestration, while offloading vision parsing and heavy LLM inference to a dedicated 24GB eGPU — local AI ceases to be a novelty and becomes a reliable daily driver.
More importantly, the software architecture dictates the success of local models. A 31B-parameter model is phenomenally capable, but it will still hallucinate if left to roam free. By constraining the LLM through a dual-engine architecture (Neo4j for truth, pgvector for semantics), auditing its extractions during ingestion, and enforcing a strict agentic routing protocol, we can largely curb “model laziness” and avoid the infamous Graph Hairball.
The era of “chunk and dump” RAG is over. The future of personal knowledge management is local, hybrid, and agentic.
I am continuing to refine Project Citadel, exploring automated graph-pruning techniques and expanding the Navigator’s toolset. If you are building local GraphRAG pipelines or battling the transition from vector-only to hybrid retrieval, I’d love to hear about your approaches. What has been your biggest bottleneck in keeping your AI workflows entirely local? Let’s connect and discuss in the comments.
