How I built a RAG system that actually remembers what you like—using reinforcement learning instead of fine-tuning
What kept me up at night
Look, I'm just going to say it: building a production RAG system in early 2025 felt like being forced to choose between getting robbed or getting punched in the face.
Option A: GPT-4 or Claude Opus. Yeah, they're incredible. Beautiful responses. They actually listen to your instructions. But holy hell, the bills. And the latency. And that nagging feeling that you're building your entire product on someone else's API that could change pricing tomorrow.
Option B: Local models like Gemma 3:1b or Phi-3. On paper? Perfect. Fast, cheap, runs on your hardware, complete control. The dream.
I chose Option B. Spoiler alert: I wanted to throw my laptop out the window within a week.
When "What is AI?" became my personal challenge
I asked Gemma 3:1b the simplest question I could think of: "What is AI?"
What I got back was... 2-3 pages. TWO TO THREE PAGES. For a definition question.
It was like talking to someone who doesn't understand the concept of brevity. Every query was a complete lottery. Sometimes short. Sometimes an essay. No rhyme or reason. No pattern I could exploit.
Then I had an even worse realization
Okay, so the model can't follow basic instructions. That's bad enough.
But then it hit me: this isn't just about me. Different people want completely different things from AI responses.
I started testing different query styles:
When I'm in "executive mode" (mornings, busy): I want ONE sentence. Just tell me what I need to know so I can move on.
When I'm in "learning mode" (afternoons, curious): I want depth. Examples. Context. Don't dumb it down.
When I'm in "implementation mode" (late night, coding): I want step-by-step procedures. Not theory. Steps.
And Gemma was giving me the same unpredictable output regardless of what I actually needed. Sometimes I'd get a novel when I needed a bullet point. Sometimes I'd get a one-liner when I needed to understand something deeply.
The model couldn't adapt. It was just... doing its thing, completely oblivious to what I actually wanted.
And I realised: if I can't get this to work for myself with different needs at different times, how would this work for an actual product with different types of users?
"Just fine-tune it!" everyone said.
Oh sure, let me just:
- Spend weeks collecting training data
- Pay thousands of dollars for compute
- Retrain the entire model
- Do this again every time my preferences change
- Oh, and do this separately for every user type
Yeah, no. There had to be a better way.
The 3 AM realization that changed everything
I was lying in bed, frustrated, when it hit me.
I was solving the wrong problem.
I kept trying to make Gemma "smarter." But what if Gemma doesn't need to be smarter? What if Gemma just needs a better boss?
What if, instead of training the language model, I trained a personal AI brain that learns how to manage the language model?
A small neural network—just for you—that figures out:
- "Oh, this user always wants one-sentence answers for definitions"
- "This query type? They prefer detailed explanations"
- "When they ask 'how-to,' they want step-by-step procedures"
Here's the beautiful part: This isn't a language problem. It's a decision-making problem.
And decision-making under uncertainty? That's literally what reinforcement learning was invented for.
I didn't need to teach Gemma to write better. I needed to teach a strategy selector to ask Gemma better questions.
Presenting at Oaisys Conf. 2025
This whole journey culminated in something I never expected: presenting this research at the Oaisys Conf. 2025 AI Conference. Standing in front of an audience of AI researchers and engineers, explaining how I accidentally solved the small model problem by not touching the model at all.

The questions from the audience were incredible. People immediately got why this mattered—we're all tired of choosing between expensive hosted models and unpredictable local ones. Someone asked, "Wait, so each user gets their own neural network?"
Yes. Exactly. That's the whole point.
The energy in that room was electric. Researchers wanted to know about the Q-value explosion problem. Engineers wanted to know about cold-start performance. Everyone wanted to know: "Does it actually work?"
Spoiler: Yes. But let me show you.
What I built: your AI gets its own personal trainer
Here's the architecture I came up with, and honestly, it's kind of elegant:
I don't touch Gemma at all. Not one parameter. Instead, every single user gets their own tiny neural network that learns their preferences.
Think of it like this:
- Brain 1 (Gemma 3:1b): The writer. Frozen. Never changes. Just does what it's told.
- Brain 2 (Your Personal DQN): The strategist. Learns. Adapts. Figures YOU out.
Brain 2's entire job is to solve one problem: "Given this query from this specific user, which response strategy will make them happy?"
I gave it 5 strategies to choose from:
- Concise: One-sentence answers (for the "just tell me" crowd)
- Detailed: Comprehensive 4-6 sentence explanations (for deep-divers)
- Structured: Numbered lists and bullet points (for the organised folks)
- Example-driven: Explanations with concrete examples (for visual learners)
- Analytical: Deep analysis with 6+ sources (for the "I need ALL the context" people)
Every time you click thumbs up or thumbs down, your neural network updates. It's literally learning from your feedback, in real-time, while you use it.
No datasets. No fine-tuning. Just you, teaching your AI how you like to work.
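If you want to picture what that loop looks like in code, here's a stripped-down sketch in PyTorch. The names and layer sizes are illustrative; the real pipeline adds replay buffers, a target network, and the rest of the machinery covered in the deep-dive.

```python
# Illustrative sketch of the per-user feedback loop (not the production code).
import random
import torch
import torch.nn as nn

STRATEGIES = ["concise", "detailed", "structured", "example_driven", "analytical"]

class PersonalDQN(nn.Module):
    """Tiny per-user network: query features in, one Q-value per strategy out."""
    def __init__(self, feature_dim: int, n_actions: int = len(STRATEGIES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def select_strategy(model, features, epsilon):
    """Epsilon-greedy: try a random strategy, or exploit the best known one."""
    if random.random() < epsilon:
        return random.randrange(len(STRATEGIES))
    with torch.no_grad():
        return int(model(features).argmax())

def update_from_feedback(model, optimizer, features, action, reward):
    """One gradient step: nudge the chosen strategy's Q-value toward the reward."""
    q_pred = model(features)[action]
    loss = nn.functional.smooth_l1_loss(q_pred, torch.tensor(float(reward)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch the reward is just the raw thumbs signal (+1 or -1); the full article covers why that raw binary reward needed shaping before it actually worked.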
The transformation was wild
After about 50-75 interactions with my personal model, I asked the same question that started this whole mess:
"What is AI?"
Before training: Buckle up, here comes a novel...
"Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using it), reasoning (using rules to reach approximate or definite conclusions), and self-correction. AI can be categorised into narrow or weak AI, which is designed to perform a narrow task (like facial recognition or internet searches), and general or strong AI, which exhibits human-like intelligence across a wide range of tasks. The field of AI research was founded in 1956 and has gone through several cycles of optimism followed by disappointment and loss of funding..."
[continues for 2-3 pages]
After training: "AI is the simulation of human intelligence by machines, enabling them to perform tasks like learning, reasoning, and decision-making."
ONE. CLEAN. SENTENCE.
That's it. That's what I wanted. And the system learned that's what I want.
Meanwhile, my researcher colleague asks the same question, and his system gives him three paragraphs with examples and historical context. Because that's what he wants.
Same question. Same model. Different strategies. Personalised responses.
This is what I'm talking about.
The problems that nearly broke me
Okay, let me be real with you: building this was not smooth sailing. I ran into problems that made me question my approach multiple times.
Problem 1: The "Why is this thing so random?" Phase
Your first 10-15 queries are basically random. The neural network is just guessing. It's in learning mode.
I had to add a big, friendly "Learning Your Preferences: 5/15 queries" indicator so users would understand the system is in its training phase and not give up before the magic happened.
Problem 2: Not all questions are created equal
"What is X?" is fundamentally different from "How does X work?", which is different from "Compare X vs Y."
I couldn't just treat everything the same. I had to classify every query across three dimensions—intent, depth, and scope—creating 108 different query clusters. Because apparently, I love making my life complicated.
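For illustration, here's how three small taxonomies multiply out to 108 clusters. The 12 x 3 x 3 split and the category names below are simplified stand-ins, not my actual labels, but the indexing trick is the same.

```python
# Toy example: 12 intents x 3 depths x 3 scopes = 108 query clusters.
INTENTS = ["definition", "how_to", "comparison", "troubleshooting", "opinion",
           "factual", "procedural", "conceptual", "exploratory", "analytical",
           "summarization", "other"]        # 12 intent categories (illustrative)
DEPTHS = ["surface", "moderate", "deep"]     # 3 depth levels
SCOPES = ["narrow", "standard", "broad"]     # 3 scope levels

def cluster_id(intent: str, depth: str, scope: str) -> int:
    """Map (intent, depth, scope) to a unique cluster index in [0, 107]."""
    i = INTENTS.index(intent)
    d = DEPTHS.index(depth)
    s = SCOPES.index(scope)
    return (i * len(DEPTHS) + d) * len(SCOPES) + s

print(cluster_id("definition", "surface", "narrow"))  # 0
print(cluster_id("other", "deep", "broad"))           # 107
```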
Problem 3: The Q-Value explosion incident
Oh man, this one. Early in training, my Q-values just... exploded. Like, shooting to ±100+. The neural network was having a complete meltdown.
Turns out I was handling terminal states wrong and bootstrapping incorrectly. Three days of debugging later, I had to implement proper terminal state handling, gradient clipping, and a Double DQN architecture.
Fun times.
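If you're curious what the fix looks like, here's a simplified PyTorch sketch of the Double DQN target with terminal-state masking and gradient clipping. Function names and constants are illustrative, and the surrounding training loop is stripped out.

```python
# Sketch of the two fixes that stopped the Q-value explosion.
import torch
import torch.nn as nn

def double_dqn_loss(online_net, target_net, batch, gamma=0.95):
    states, actions, rewards, next_states, dones = batch  # tensors from the replay buffer

    # Q(s, a) from the online network for the actions actually taken.
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online net chooses the next action...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...but the slowly-updated target net evaluates it.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Proper terminal handling: never bootstrap past the end of an episode.
        targets = rewards + gamma * next_q * (1.0 - dones)

    return nn.functional.smooth_l1_loss(q_sa, targets)

def training_step(online_net, target_net, optimizer, batch, max_grad_norm=1.0):
    loss = double_dqn_loss(online_net, target_net, batch)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping keeps one bad batch from blowing the Q-values up to ±100.
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```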
Problem 4: People change their minds
Here's something nobody tells you about personalisation: your preferences drift over time.
Maybe you wanted concise answers two weeks ago. Now you want detailed explanations. Your neural network needs to detect this "concept drift" and adapt.
I implemented statistical monitoring (z-scores!) to detect when your reward patterns change, then boost exploration to re-learn your new preferences. Usually takes 10-15 queries.
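A simplified version of that drift check looks like this; the window sizes and threshold here are illustrative, not the exact values in my system.

```python
# Sketch of drift detection: compare recent rewards against the long-run baseline.
from collections import deque
import statistics

class DriftDetector:
    """Flags a preference shift when recent rewards deviate from the baseline."""
    def __init__(self, window=15, z_threshold=2.0):
        self.recent = deque(maxlen=window)   # rolling window of recent rewards
        self.history = []                    # long-run rewards (baseline)
        self.z_threshold = z_threshold

    def update(self, reward: float) -> bool:
        self.history.append(reward)
        self.recent.append(reward)
        if len(self.history) < 30 or len(self.recent) < self.recent.maxlen:
            return False                     # not enough data to judge yet
        mu = statistics.mean(self.history)
        sigma = statistics.stdev(self.history) or 1e-6
        recent_mean = statistics.mean(self.recent)
        # Standard error of the window mean, so the z-score accounts for window size.
        z = (recent_mean - mu) / (sigma / len(self.recent) ** 0.5)
        return abs(z) > self.z_threshold     # True -> boost exploration and re-learn
```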
Problem 5: The exploration-exploitation challenge
Classic RL problem: Should the system try new strategies (explore) or stick with what works (exploit)?
Too much exploration: "Why does it keep trying things I don't like?"
Too much exploitation: "It's stuck using the same strategy even when my needs changed."
I ended up with adaptive epsilon-greedy that starts at 95% exploration (query 1) and gradually drops to 10% (query 150+). The system learns when to be curious and when to be confident.
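Here's a toy linear version of that schedule, just to show the shape; the constants match the 95% and 10% figures above.

```python
# Exploration rate as a function of how many queries this user has made.
def epsilon_for_query(n_queries: int,
                      eps_start: float = 0.95,
                      eps_end: float = 0.10,
                      decay_queries: int = 150) -> float:
    """Linearly decay exploration from 95% at query 1 to 10% by query 150+."""
    frac = min(max(n_queries - 1, 0) / decay_queries, 1.0)
    return eps_start + (eps_end - eps_start) * frac

print(epsilon_for_query(1))    # 0.95  -> almost always explore
print(epsilon_for_query(75))   # ~0.53 -> still mixing it up
print(epsilon_for_query(200))  # 0.10  -> mostly exploit what it has learned
```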
What you'll find in the full deep-dive
If you're thinking "okay, this sounds cool, but how does it actually work?"—the full technical article breaks down everything:
- The Architecture: How I turned your query into a 424-dimensional feature vector (yes, 424. I have reasons.)
- The Neural Network: Why Dueling DQN crushes standard Q-learning for this problem (quick sketch below)
- The Learning Magic: Prioritised experience replay, or "why the system focuses on its biggest mistakes"
- Reward Shaping: Binary rewards (±1) don't work. Here's what does.
- The Hybrid Memory System: Fast cluster-based lookup + slow neural network learning working together
- Strategy-Specific Everything: Each strategy doesn't just change the prompt—it changes retrieval, context size, and the entire pipeline
- A Complete Interaction Walkthrough: I literally walk you through one single query from start to finish, showing every decision point
- The Real Challenges: All the bugs, explosions, and "oh god why isn't this working" moments (and how I fixed them)
It's technical. It's detailed. It's got code. It's got math. It's got my tears.
If you're into this stuff, you'll love it.
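In the meantime, if you just want a taste of the Dueling DQN piece, here's a minimal PyTorch sketch of a dueling head. The layer sizes are illustrative, not the exact architecture from the full article.

```python
# Dueling head: separate value and advantage streams, recombined into Q-values.
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim: int, n_strategies: int = 5, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                  # V(s): how good is this query state overall
        self.advantage = nn.Linear(hidden, n_strategies)   # A(s, a): how much better each strategy is

    def forward(self, x):
        h = self.shared(x)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), the standard dueling combination.
        return v + a - a.mean(dim=-1, keepdim=True)
```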
The numbers (because everyone asks)
After full implementation, here's what actually happens:
- Cold start: Most users hit 70% satisfaction within 15 queries
(Translation: Your AI stops being annoying pretty fast)
- Convergence: 85-90% satisfaction by queries 75-100
(Translation: Your AI actually "gets" you)
- Adaptation speed: Preferences change? System adapts in 10-15 queries
(Translation: It keeps learning. Always.)
- Latency: End-to-end response in under 500ms
(Translation: Fast enough that you won't notice)
- Cost: Basically zero after setup
(Translation: Runs on your laptop. No API bills.)
But honestly? The most important metric isn't in that list.
Every user gets AI that actually learns their style. Not "personas." Not "user segments." Not "demographic targeting." YOUR style. Your preferences. Your way of working.
And it keeps getting better every time you use it.
Why this actually matters
Look, I'm not just showing off a cool engineering trick here. This is a fundamentally different approach to making AI work for real people.
The standard playbook is broken:
- Fine-tuning: Expensive, slow, rigid. Need to retrain for every change.
- RLHF: Requires thousands of examples. Good luck getting that from real users.
- Prompt engineering: One-size-fits-all. Ignores that different people want different things.
What we get instead:
- Personal neural networks: One per user. Your AI, not "an AI."
- Learn from YOUR interactions: 50-100 of your queries. Not millions of synthetic examples.
- Real-time adaptation: Your preferences change? Your AI changes. Within 10-15 queries.
- Complete transparency: You can see exactly why it chose each strategy.
- Privacy by design: Everything runs locally. Your data never leaves your machine.
This isn't about making AI "smarter." It's about making AI yours.
Your AI should work the way you think. Respond the way you prefer. Adapt as you grow.
Not the other way around.
Ready to see how it works?
The full technical article goes deep. Really deep.
I walk through:
- Every architectural decision (and why I made it)
- The complete mathematical foundations (Dueling DQN, prioritized replay, reward shaping)
- Real code implementations (not pseudocode, actual running code)
- Every challenge I hit (and the 3 AM solutions that actually worked)
- Performance benchmarks (with real numbers)
- A complete step-by-step example of one interaction (all 11 steps)
If you're into RL, RAG systems, personalization, or just want to see how to make small models actually useful—this is for you.
The future of AI isn't about bigger models. It's about smarter systems.
Systems that learn from you. Adapt to you. Work the way you want them to.
Let's build it.
Full code available here.
Presented at Oaisys Conf. 2025 AI Conference
Questions? Thoughts? Found a bug? Hit me up. I actually want to hear from you.