AI & RAG · 6 min read

I Ran a Private LLM on Oracle ERP Data — Here's What Actually Happened

Testing Llama 3.3 70B + RAG against 8,596 Oracle Fusion Cloud documents on a dedicated GPU. The results changed how we think about enterprise AI.

By Matt · February 10, 2026

Everyone's talking about AI for ERP. Most demos use ChatGPT with zero context about your actual Oracle setup. I wanted to see what happens when you give an open-source LLM your real Oracle documentation.

The idea is simple: instead of sending your Oracle data to OpenAI or Azure, you run your own model on your own hardware with your own documentation. No data leaves your network. No per-query costs. No third-party provider logging, retaining, or training on your prompts.

But does it actually work? That's what I spent the last few weeks finding out.

The Setup

The stack is straightforward: a DigitalOcean GPU droplet running an NVIDIA L40S with 48GB VRAM, Llama 3.3 70B served via Ollama, PostgreSQL with pgvector for semantic search, and OpenAI's text-embedding-3-small for generating embeddings.

The search layer uses hybrid retrieval — 70% semantic similarity + 30% keyword matching with PostgreSQL full-text search. We indexed 8,596 Oracle Fusion Cloud documents across REST API specs, table schemas, script patterns, and FBDI templates.
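The 70/30 blend is easy to picture in code. Here's a minimal sketch of the scoring logic — in the real system the semantic side is a pgvector cosine distance and the keyword side is a PostgreSQL `ts_rank` score computed inside the database; the in-memory dictionaries below are illustrative stand-ins, each normalized to 0..1:

```python
def hybrid_rank(semantic: dict[str, float], keyword: dict[str, float],
                w_sem: float = 0.7, w_kw: float = 0.3) -> list[tuple[str, float]]:
    """Blend per-document scores (each already normalized to 0..1)."""
    docs = set(semantic) | set(keyword)
    scored = {d: w_sem * semantic.get(d, 0.0) + w_kw * keyword.get(d, 0.0)
              for d in docs}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# A document that scores well on both channels outranks one that
# only matches keywords:
ranked = hybrid_rank(
    semantic={"pjf_tables.md": 0.92, "okc_headers.md": 0.40},
    keyword={"pjf_tables.md": 0.55, "okc_headers.md": 0.80},
)
```

The semantic weight dominates on purpose: keyword matching mostly serves as a tiebreaker that rescues exact table-name hits the embedding model undervalues.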

Infrastructure cost: around $1,168/month for a dedicated GPU running a 70B model. No per-query fees, no data leaving your network, unlimited queries. Compare that to API pricing where 50,000 queries per month can run $500-5,000 — and your data passes through someone else's servers.
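The break-even point falls out of those numbers directly. This back-of-envelope sketch uses only the figures quoted above; the break-even framing is mine, not a quote from any provider's pricing page:

```python
# $1,168/month dedicated GPU vs a $500-5,000 API bill at 50k queries/month.
GPU_MONTHLY = 1_168.00

def breakeven_queries(api_monthly: float, queries: int = 50_000) -> float:
    """Queries/month above which the fixed-cost GPU beats API pricing."""
    per_query = api_monthly / queries
    return GPU_MONTHLY / per_query

low = breakeven_queries(500)      # cheap API tier: ~116,800 queries/mo
high = breakeven_queries(5_000)   # expensive tier: ~11,680 queries/mo
```

In other words: on the expensive end of API pricing, the dedicated GPU pays for itself well under the 50,000-query mark; on the cheap end, you need heavier usage before the fixed cost wins on dollars alone — though the data-residency argument holds at any volume.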

The Test — 0% to 80%

Here's the prompt I used: "Generate a SQL query joining Oracle Projects with Contracts." Simple enough. Let's see what each approach produces.

Without RAG — 0% Accuracy

The model confidently generated SQL using PA_PROJECTS_ALL and PA_CONTRACT_AGREEMENTS_ALL — tables from Oracle EBS R12. These tables don't exist in Fusion Cloud. Wrong table names, wrong column names, completely unusable.

Every public LLM does this. They were trained on EBS documentation from the 2000s, not Fusion Cloud. Without context about YOUR Oracle instance, they default to outdated schemas.

With RAG — 80% Accuracy

Same model, same prompt, but with Oracle Fusion documentation injected via RAG. Now it uses PJF_PROJECTS_ALL_B, OKC_K_HEADERS_ALL_B, and correctly joins through PJB_CNTRCT_PROJ_LINKS. Correct Fusion Cloud tables, proper join conditions, usable SQL.

It missed a couple of production filters — VERSION_TYPE = 'PUBLISHED' and dynamic language handling — but the structure, table names, and joins are all correct. A developer can review and polish the output in minutes instead of writing everything from scratch.
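For concreteness, here's an illustrative reconstruction of that output with the missed filter added back in. The table names are the ones the model produced; the column names and join keys are my assumptions about a typical Fusion schema, so verify them against your own instance before running anything:

```python
# Table names from the RAG-grounded output; column names (SEGMENT1,
# MAJOR_VERSION, etc.) are illustrative assumptions, not verified DDL.
FUSION_PROJECT_CONTRACT_SQL = """
SELECT p.segment1   AS project_number,
       h.contract_number
FROM   pjf_projects_all_b    p
JOIN   pjb_cntrct_proj_links l ON l.project_id = p.project_id
JOIN   okc_k_headers_all_b   h ON h.id = l.contract_id
                              AND h.major_version = l.major_version
WHERE  h.version_type = 'PUBLISHED'  -- the production filter the model omitted
"""
```

The contrast with the no-RAG attempt is stark: same join intent, but PA_PROJECTS_ALL simply has no counterpart here to be wrong about.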

That 0% to 80% jump isn't about the model being smarter. It's the same model. The only difference is having the right Oracle documentation in context.

Model Comparison — 4 Models Tested

I tested four open-source models on the same Oracle SQL generation tasks:

Llama 3.3 70B — Winner. Best code quality, correct Fusion Cloud tables, proper JOINs, good structure. Runs at roughly 5 tokens/second on the L40S. Released under Meta's Llama 3.3 Community License (free for most commercial use, though not a classic open-source license like Apache 2.0).

DeepSeek R1 (671B distilled) — Good. Strong reasoning capabilities but verbose. Spends many tokens on chain-of-thought before producing code. Accurate but 3-4x slower than Llama, and less token-efficient.

Qwen3 32B — Decent. Smaller and faster, but less accurate on complex multi-table Oracle joins. Could work as a "fast model" for simple single-table queries alongside Llama for complex ones.
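That fast-model/strong-model split could be as simple as a routing function. This is a sketch of the idea, not something from the test setup — the complexity heuristic (count distinct table-style identifiers in the question) and the Ollama model tags are my assumptions:

```python
import re

def pick_model(question: str) -> str:
    """Route multi-table questions to the strong model, the rest to the fast one.

    Heuristic and model tags are illustrative assumptions."""
    tables = set(re.findall(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b", question))
    return "llama3.3:70b" if len(tables) > 1 else "qwen3:32b"

multi = pick_model("Join PJF_PROJECTS_ALL_B to OKC_K_HEADERS_ALL_B on project")
single = pick_model("Describe PJF_PROJECTS_ALL_B")
```

A heuristic this crude will misroute some queries, but routing errors are cheap: the fast model's output is reviewed either way.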

GPT-oss 120B — Failed. Used all 2,000 output tokens on reasoning and planning. Produced zero actual code. Not suitable for code generation tasks.

The biggest surprise? Model size doesn't determine quality. Llama at 70B parameters outperformed the 671B distilled model on Oracle-specific tasks.

The RAG Quality Journey

Here's the part nobody talks about: getting the RAG database right was harder than choosing the model. The model is the easy part. Data quality is the hard part.

Phase 1 — The Messy Start

We started with 12,273 documents. Sounds impressive until you realize 65% were duplicates. We had 47 different module name variants for the same 7 modules. REST API docs averaged just 329 characters — basically just endpoint URLs with no field definitions. Zero FBDI template content. Zero error code references.

RAG accuracy: roughly 60%.

Phase 2 — Removing the Noise

Deduplicated by SHA-256 content hash, removing 6,822 exact copies. Normalized 47 module name variants down to 7 standard codes. Re-chunked 273 oversized documents with sentence-boundary splitting. Removed 46 tiny fragments under 100 characters.

Down to 7,984 documents but much cleaner. RAG accuracy: roughly 70%.
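The exact-duplicate pass is the simplest of those steps and worth showing, because it delivered the biggest single win. A minimal sketch — the document structure here is a stand-in for whatever your index stores:

```python
import hashlib

def dedupe_exact(docs: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each exact content duplicate."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    {"id": 1, "content": "PJF_PROJECTS_ALL_B stores project definitions."},
    {"id": 2, "content": "PJF_PROJECTS_ALL_B stores project definitions."},  # exact copy
    {"id": 3, "content": "OKC_K_HEADERS_ALL_B stores contract headers."},
]
result = dedupe_exact(docs)
```

Content hashing only catches byte-identical copies — near-duplicates (same doc, different whitespace or boilerplate) need normalization before hashing, which is part of why the module-name cleanup mattered too.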

Phase 3 — Adding the Signal

Enriched 21 REST APIs with full field definitions (1,308 attributes total). Added 10 FBDI template specs with column-level detail. Added 35 real automation script patterns from production environments. Indexed over 200 Oracle error codes with resolution steps. Added context headers to 3,128 continuation chunks to improve embedding quality.

Final count: 8,596 documents. RAG accuracy: roughly 80%.
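The context-header trick deserves a closer look because it's cheap and effective: a continuation chunk like "MAJOR_VERSION must match across tables" embeds poorly on its own, but prefixing it with its document title and section restores the topical signal. A minimal sketch, with illustrative field names:

```python
def add_context_header(chunk_text: str, doc_title: str, section: str,
                       part: int, total: int) -> str:
    """Prefix a continuation chunk so its embedding retains document context."""
    header = f"[{doc_title} | {section} | part {part}/{total}]\n"
    return header + chunk_text

enriched = add_context_header(
    "MAJOR_VERSION must match across header and line tables.",
    doc_title="OKC_K_HEADERS_ALL_B",
    section="Versioning",
    part=3, total=5,
)
```

The header is embedded along with the chunk but can be stripped before the text is injected into the prompt, so it costs nothing at generation time.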

The lesson: removing 6,822 duplicates wasn't just housekeeping. When 65% of your index is noise, every query returns noise.

What Surprised Me

RAG quality matters more than model size. Llama 70B with good RAG beat DeepSeek 671B with bad RAG. Every time. The model can only work with what you give it, and clean, well-chunked documentation beats a bigger brain every time.

Deduplication alone improved results. After cleanup, the same queries returned more diverse, relevant results because the signal-to-noise ratio improved dramatically.

The cost math actually works. A dedicated GPU running a 70B model costs around $1,168/month. At API pricing, 50,000 queries per month would cost $500-5,000 depending on model and context length — and your data passes through their servers. At scale, private inference is cheaper AND more secure.

Open-source models are genuinely good now. Llama 3.3 70B produces Oracle SQL that I would review and approve for production. Two years ago, open-source models couldn't write coherent SQL. The gap has closed faster than anyone expected.

Key Takeaways

Private LLMs are production-viable for enterprise ERP. Not in two years. Now.

RAG transforms generic AI into domain-expert AI. The same model went from 0% to 80% accuracy by adding Oracle documentation.

Data quality matters more than model size. Clean your index before you upgrade your GPU.

Open-source models plus your data equals competitive advantage. You're not paying for the model. You're paying for the context.

The gap between generic AI and your AI is your documentation. That's the moat. Not the model, not the infrastructure — your proprietary knowledge base.

What's Next

This is where it gets interesting. Now that the foundation works, there's a clear roadmap:

  • **More Oracle modules.** We currently cover 7 modules (FIN, HCM, SCM, PPM, PRC, CX, CMN). There are roughly 25 more to index.
  • **More models to benchmark.** Mistral Large, Command R+, and the new Llama 4 variants are all candidates.
  • **Automated quality benchmarks.** Building a proper eval suite with scored test cases.
  • **Customer-specific customization.** Taking a customer's specific Oracle configuration — their custom fields, flex segments, and setups — and tailoring the RAG knowledge base to it.

If you're running Oracle Fusion Cloud and want AI that actually knows your system — not generic suggestions from a model that thinks you're still on EBS R12 — we can help you get there. Reach out and let's talk about what a private LLM setup looks like for your environment.

Tags: Private LLM, RAG, Oracle Fusion Cloud, Llama 3.3, Enterprise AI, pgvector, AI for ERP
