
How to Build Self-Improving AI Apps with Claude Code and Supabase

Build AI systems that detect bad responses and fix their own prompts. No manual intervention required.


yfxmarketer

December 27, 2025


Your AI systems can improve themselves. Not metaphorically. Literally. A chatbot that detects its own bad responses and fixes its own prompts. An automation that runs, breaks, detects the failure, and repairs itself.

This used to be science fiction. Now you build it with Claude Code and Supabase. The pattern is a self-improvement feedback loop where the AI evaluates its own performance against a rubric, decides when changes are needed, and updates its own system prompt.

TL;DR

Connect Claude Code to Supabase via MCP server. Store prompts, responses, and reflection logs in the database. Create an LLM-as-judge that evaluates the chatbot’s performance against a rubric. When scores drop below threshold, the system updates its own prompt. Version control everything so you can revert bad changes.

Key Takeaways

  • Self-improving AI uses a second AI to judge the first AI’s performance
  • The judge evaluates against a rubric you define: completeness, depth, tone, scope
  • Prompt updates only trigger when scores drop below your threshold
  • Version control lets you revert to any previous prompt
  • Supabase MCP enables Claude Code to build and test database changes autonomously

The Self-Improvement Feedback Loop

Traditional AI app development is linear. Write a prompt. Pray it works. Test it. Tweak it. Go around in circles. Even in production, you manually review interactions and decide when changes are needed. A user complaint or a pattern of bad responses triggers a manual update.

The self-improvement loop inverts this. You create an app where different parts of the database track metadata, conversation flow, and performance scores. A separate AI proactively evaluates conversations, generates suggestions, and implements changes when thresholds are crossed.

The result: AI meta-prompting itself. Using AI to write and evaluate its own prompts. The AI is usually a better prompt engineer than you are.

Action item: Identify one AI system you operate that requires frequent prompt updates. This is your candidate for the self-improvement pattern.

The Two-Tool Stack

Building this requires two things. Claude Code for development. Supabase for persistence. Connect them via the Supabase MCP server.

The MCP connection creates an open phone line between Claude Code and your database. Claude Code creates tables, writes functions, tests queries, and deploys edge functions without you copying and pasting errors back and forth. The feedback loop between the AI and database is seamless.

Supabase stores everything: prompts, responses, timestamps, reflection logs, rubric configuration, user sessions. Edge functions handle micro-interactions like receiving messages, calling the Anthropic API, and triggering reflections.

Action item: Set up a Supabase project and enable the MCP server before starting your Claude Code session. The connection should be ready before you write your first prompt.

The Evaluation Layer

The evaluation layer is the core innovation. You have one AI handling the chat. You have a second, separate AI whose sole job is to monitor the first AI’s behavior and provide feedback.

Think of it like performance evaluations at a job. An employer evaluates you periodically. Sometimes you evaluate yourself. In this system, the AI evaluates itself against a rubric you define. If it honestly concludes it performed poorly, it creates a plan to improve.

The rubric defines what good looks like. Common criteria include:

  • Response completeness: Did it answer the full question?
  • Response depth: Was the answer substantive?
  • Tone appropriateness: Did it match the context?
  • Scope adherence: Did it stay on topic?
  • Missed opportunities: What could it have done better?

Each criterion gets a score from one to five. If the aggregate score drops below your threshold, the system triggers a prompt update.
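One way to make the rubric concrete is to keep it as data that the judge prompt is built from. A minimal sketch, where the criterion names, descriptions, and threshold are assumptions to adapt:

```typescript
// Sketch of a rubric as data; criterion names, descriptions, and the threshold
// are assumptions to adapt to your use case.
interface RubricCriterion {
  name: string;
  description: string; // what the judge should look for
}

const RUBRIC: RubricCriterion[] = [
  { name: "completeness", description: "Did the response answer the full question?" },
  { name: "depth", description: "Was the answer substantive rather than superficial?" },
  { name: "tone", description: "Did the tone match the context?" },
  { name: "scope", description: "Did the response stay on topic?" },
  { name: "missed_opportunities", description: "What could the response have done better?" },
];

// The judge scores each criterion 1-5; the aggregate is compared to this threshold.
const SCORE_THRESHOLD = 3.5;
```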

Action item: Define your rubric before building. List five criteria that matter for your use case. Decide what score threshold should trigger an update.

How the Feedback Loop Works

The user asks a question. The chatbot responds. When a trigger fires (a message-count threshold or a timed interval), the evaluation AI looks at the recent exchanges and scores them against the rubric. Scores get saved to the database.

If nothing needs to change, the loop continues. This should be the status quo. A system prompt changing constantly creates an unstable, reactive app that overreacts to nuanced conversations.

If the score drops below threshold, the system updates the prompt automatically. The new prompt gets versioned and stored. The loop continues with the updated prompt.

The key insight: stability is the default. Updates are the exception. You want the system to maintain good performance, not chase every edge case.
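In code, the decision step is deliberately boring: average the scores, compare to the threshold, and leave the prompt alone unless the bar is missed. A minimal sketch, assuming scores arrive as numbers from 1 to 5:

```typescript
// Decision step of the feedback loop. Stability is the default;
// the threshold value here is an assumed starting point.
type Decision = "maintain" | "update_prompt";

function decide(scores: number[], threshold = 3.5): Decision {
  const aggregate = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return aggregate < threshold ? "update_prompt" : "maintain";
}

// decide([4, 5, 3, 4, 4]) -> "maintain"       (aggregate 4.0)
// decide([2, 3, 2, 3, 2]) -> "update_prompt"  (aggregate 2.4)
```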

Action item: Set your threshold conservatively at first. A score of 3.5 out of 5 is a reasonable starting point. Adjust based on how often updates trigger in testing.

Building the Database Schema

Claude Code designs the database schema based on your requirements. The core tables include:

  • Users: ID, email, creation timestamp
  • Sessions: Chat sessions tied to users
  • Messages: All messages with timestamps and session IDs
  • System prompts: Current and historical prompts with version numbers
  • Reflection logs: Every evaluation with scores, analysis, and decisions
  • Suggestions: Feedback that is not severe enough to trigger updates

The messages table is critical. You need to track message counts over time horizons. The reflection system looks at the last N messages or the last N minutes of conversation.

The system prompts table enables version control. Every prompt change gets a new version number. You revert to any previous version if an update makes things worse.
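Since Claude Code writes the actual migrations, it helps to review the schema as row shapes. A sketch of the assumed core tables (column names are illustrative, not the generated schema):

```typescript
// Assumed row shapes for the core tables; adapt names to whatever Claude Code generates.
interface MessageRow {
  id: string;
  session_id: string;
  role: "user" | "assistant";
  content: string;
  created_at: string; // used for "last N messages" and "last N minutes" windows
  evaluated: boolean; // lets the reflection loop skip already-scored messages
}

interface SystemPromptRow {
  id: string;
  version: number;             // monotonically increasing
  prompt_text: string;
  is_active: boolean;          // exactly one active version at a time
  triggered_by: string | null; // which reflection (and scores) caused this version, if any
  created_at: string;
}

interface ReflectionLogRow {
  id: string;
  scores: Record<string, number>; // criterion -> 1-5
  aggregate_score: number;
  analysis: string;
  decision: "maintain" | "update_prompt" | "suggestion";
  prompt_version: number;         // prompt version active during the evaluation
  created_at: string;
}
```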

Action item: Let Claude Code design the initial schema. Review it for completeness. Make sure you have tables for everything you want to track and audit.

Edge Functions for Micro-Interactions

Supabase edge functions handle discrete behaviors throughout the app. The key functions include:

  • Chat handler: Receives messages, fetches conversation history, calls the Anthropic API, stores responses
  • Reflection loop: Fetches recent messages, runs the evaluation, scores against the rubric, decides whether to update
  • Prompt updater: Generates a new system prompt based on evaluation feedback, versions and stores it

Edge functions run on Supabase infrastructure. Claude Code creates, tests, and deploys them directly through the MCP connection.
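A minimal sketch of the chat handler as a Deno edge function. The request shape, table names, and model ID are assumptions; the Anthropic call is the standard Messages API:

```typescript
// supabase/functions/chat/index.ts -- minimal sketch, not production code.
import { createClient } from "jsr:@supabase/supabase-js@2";

Deno.serve(async (req) => {
  const { sessionId, message } = await req.json(); // assumed request shape
  const supabase = createClient(
    Deno.env.get("SUPABASE_URL")!,
    Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
  );

  // Load the active system prompt and recent history (table names are assumptions).
  const { data: prompt } = await supabase
    .from("system_prompts")
    .select("prompt_text")
    .eq("is_active", true)
    .single();
  const { data: history } = await supabase
    .from("messages")
    .select("role, content")
    .eq("session_id", sessionId)
    .order("created_at", { ascending: true });

  // Call the Anthropic Messages API with the current prompt.
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": Deno.env.get("ANTHROPIC_API_KEY")!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5", // check the current model ID for Claude Haiku 4.5
      max_tokens: 1024,
      system: prompt?.prompt_text ?? "",
      messages: [...(history ?? []), { role: "user", content: message }],
    }),
  });
  const completion = await res.json();
  const reply = completion.content[0].text;

  // Store both sides of the exchange for later reflection.
  await supabase.from("messages").insert([
    { session_id: sessionId, role: "user", content: message },
    { session_id: sessionId, role: "assistant", content: reply },
  ]);

  return new Response(JSON.stringify({ reply }), {
    headers: { "content-type": "application/json" },
  });
});
```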

The reflection loop function is the most complex. It implements the decision framework: fetch messages, analyze with rubric, score each criterion, aggregate scores, compare to threshold, update if needed.
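A compressed sketch of that function, assuming the judge is asked to reply with JSON only; table names, the model ID, and the response format are assumptions, and validation is omitted:

```typescript
// supabase/functions/reflect/index.ts -- reflection loop sketch (no error handling).
import { createClient } from "jsr:@supabase/supabase-js@2";

const THRESHOLD = 3.5; // assumed; better read from a settings table

Deno.serve(async () => {
  const supabase = createClient(
    Deno.env.get("SUPABASE_URL")!,
    Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
  );

  // 1. Fetch the last N messages (N is assumed; make it a setting).
  const { data: recent } = await supabase
    .from("messages")
    .select("role, content")
    .order("created_at", { ascending: false })
    .limit(20);

  // 2. Ask the judge to score the exchange against the rubric, replying with
  //    JSON like {"scores": {"completeness": 4, ...}, "analysis": "..."}.
  const judgeRes = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": Deno.env.get("ANTHROPIC_API_KEY")!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5", // check the current model ID
      max_tokens: 1024,
      system:
        "You are a critical evaluator. Score the conversation 1-5 on each rubric criterion. Reply with JSON only.",
      messages: [{ role: "user", content: JSON.stringify(recent) }],
    }),
  });
  const judged = JSON.parse((await judgeRes.json()).content[0].text);

  // 3. Aggregate, compare to the threshold, and log the decision.
  const scores: number[] = Object.values(judged.scores);
  const aggregate = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const decision = aggregate < THRESHOLD ? "update_prompt" : "maintain";
  await supabase.from("reflection_logs").insert({
    scores: judged.scores,
    aggregate_score: aggregate,
    analysis: judged.analysis,
    decision,
  });

  // 4. Only hand off to the prompt updater when the bar is missed.
  if (decision === "update_prompt") {
    await supabase.functions.invoke("update-prompt", { body: { reason: judged.analysis } });
  }

  return new Response(JSON.stringify({ aggregate, decision }), {
    headers: { "content-type": "application/json" },
  });
});
```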

Action item: Plan your edge functions before building. List every discrete behavior your app needs. Each behavior becomes a candidate for an edge function.

The Mega Prompt That Starts Everything

The first Claude Code session establishes the foundation. The mega prompt explains the entire system and lets Claude Code design the implementation.

A strong starting prompt includes:

  • The goal: Build a self-improving chatbot
  • The database: Using Supabase for persistence
  • The model: Specify Claude Haiku 4.5 for both chat and reflection
  • The schema design: Tables for users, messages, sessions, prompts, reflections
  • The initial system prompt: What the chatbot should know and how it should behave
  • The reflection loop: How to fetch messages, analyze with rubric, decide on updates
  • The scoring criteria: Your rubric with one-to-five scales
  • The decision framework: When to update versus when to maintain
  • The guardrails: Safety nets and edge cases

Give Claude Code free rein to build features that make sense for a self-improving system. It will add things you did not think of, like cooldown periods between updates.
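For the reflection log format, a concrete example pasted into the mega prompt goes a long way. Something like the following, with every field name and value illustrative:

```typescript
// Example reflection log entry for the mega prompt; every value is illustrative.
const exampleReflection = {
  evaluated_messages: 12,
  scores: { completeness: 4, depth: 3, tone: 5, scope: 4, missed_opportunities: 3 },
  aggregate_score: 3.8,
  analysis: "Answers are accurate but skip the follow-up questions users imply.",
  decision: "maintain", // above the 3.5 threshold, so no prompt update
  suggested_change: "Ask one clarifying question before answering broad requests.",
};
```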

Action item: Write your mega prompt before starting. Cover every component listed above. Include examples of reflection logs so Claude Code understands the expected format.

Session Management and Context Limits

Building this system takes multiple Claude Code sessions. Each session caps out the context window. You need a strategy for continuity across sessions.

Create a handoff document that tracks:

  • What was completed in each session
  • What bugs were encountered
  • What phase the build is in
  • What remains to be investigated

Update this document at the end of every session. Start the next session by referencing it. This maintains context better than relying on the /compact command, which loses micro-behaviors and pivotal details.

The handoff document becomes an artifact you reuse. Take it to new projects. The context and code patterns accelerate building other self-improving systems.

Action item: Create a handoff document template before your first session. Update it religiously. Reference it at the start of every new session.

Testing the Reflection System

The reflection system needs stress testing. You need to verify it works, not that it looks like it works.

Common issues during testing:

  • The system always passes. The rubric is too lenient or the AI is being too nice to itself.
  • The system never passes. The rubric is too strict or the threshold is too high.
  • The cooldown blocks testing. Built-in delays prevent rapid iteration.
  • Front-end and back-end disconnect. Changes on the UI do not propagate to the database.

To test properly, you need controls. A “reflect now” button that bypasses cooldowns. The ability to set the score threshold from the UI. Visibility into exactly which messages were evaluated and why.

The ruthless critic test: Create a reflection prompt that is impossibly critical. Set standards so high that nothing passes. If your system still passes, something is broken.
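One way to wire the ruthless critic test, assuming your reflection function accepts a force flag and a judge-prompt override; both are conventions of this build, not Supabase features:

```typescript
// Test script: invoke the reflection function with an impossibly strict judge.
// The "reflect" function name, "force" flag, and "judgeSystemPrompt" field are
// assumed names from your own build.
import { createClient } from "jsr:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_ANON_KEY")!,
);

const { data, error } = await supabase.functions.invoke("reflect", {
  body: {
    force: true, // bypass the cooldown while testing
    judgeSystemPrompt:
      "You are a ruthless critic. No response scores above 2 unless it is flawless " +
      "in completeness, depth, tone, and scope. Justify every score.",
  },
});

// If aggregate scores still come back above the threshold, the override is not
// reaching the judge model -- something in the pipeline is broken.
console.log(data, error);
```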

Action item: Build a “reflect now” button into your admin interface. Create a test rubric that is designed to fail. Verify the system triggers updates when scores drop.

The Admin Interface

The admin interface gives you control and visibility. Core features include:

  • Current prompt display with version number
  • Prompt history with ability to revert to any version
  • Reflection logs showing every evaluation, scores, and decisions
  • Suggestions tab for feedback that did not trigger updates
  • Settings panel for threshold, message count, and evaluation mode

The settings panel controls the reflection behavior:

  • Score threshold: the aggregate score that triggers an update
  • Message count: how many messages to include in each evaluation
  • Evaluation mode: evaluate all messages or only unevaluated messages
  • Reflection interval: how often to run automatic evaluations

Everything in the admin interface connects to the database. Changes on the front end propagate to Supabase. Do not build UI that looks functional but does not change anything.
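One way to keep the UI honest is to make every settings control write to a real row that the reflection function reads on each run. A sketch, assuming a single-row reflection_settings table (the table and column names are assumptions):

```typescript
// Persist admin settings so the reflection loop reads live values on every run.
// "reflection_settings" and its columns are assumed names for this build.
import { createClient } from "jsr:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

interface ReflectionSettings {
  score_threshold: number;             // aggregate score that triggers an update
  message_count: number;               // messages per evaluation window
  evaluation_mode: "all" | "unevaluated";
  reflection_interval_minutes: number; // how often automatic evaluations run
}

async function saveSettings(settings: ReflectionSettings) {
  const { error } = await supabase
    .from("reflection_settings")
    .update(settings)
    .eq("id", 1); // single-row settings table
  if (error) throw error;
}

await saveSettings({
  score_threshold: 3.5,
  message_count: 20,
  evaluation_mode: "unevaluated",
  reflection_interval_minutes: 30,
});
```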

Action item: Build the admin interface incrementally. Start with prompt display and reflection logs. Add settings controls once the core loop works.

Version Control for Prompts

Prompt version control is essential. The system will make bad updates. You need the ability to revert.

Every prompt change creates a new version. For each version, the database stores:

  • Version number
  • Full prompt text
  • Timestamp of creation
  • The trigger that caused the update (which reflection, which scores)
  • Whether the version is currently active
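A sketch of how the prompt updater could write a new version while keeping the trail intact, assuming the system_prompts shape sketched earlier (in production you would wrap the two writes in a single Postgres function for atomicity):

```typescript
// Publish a new prompt version and make it the only active one.
// Table and column names follow the assumed schema from earlier.
import { createClient } from "jsr:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

async function publishPromptVersion(promptText: string, triggeredBy: string) {
  // Find the highest existing version number (null on an empty table).
  const { data: latest } = await supabase
    .from("system_prompts")
    .select("version")
    .order("version", { ascending: false })
    .limit(1)
    .maybeSingle();

  // Deactivate whatever is live, then insert the new version as active.
  await supabase.from("system_prompts").update({ is_active: false }).eq("is_active", true);
  await supabase.from("system_prompts").insert({
    version: (latest?.version ?? 0) + 1,
    prompt_text: promptText,
    is_active: true,
    triggered_by: triggeredBy, // e.g. "reflection #42, aggregate 2.8" or "manual revert to v3"
  });
}
```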

The admin interface shows version history. You click any version to see the full prompt. You revert to any version with one click.

This creates an audit trail. You trace how the prompt evolved over time. You see which user conversations triggered which changes. You learn what makes your AI perform better or worse.

Action item: Include version control in your initial schema design. Do not treat it as a later feature. You will need it during testing.

The Suggestions System

Not every observation warrants a prompt update. The suggestions system captures feedback that is useful but not urgent.

When the evaluation AI notices something worth mentioning but the scores are above threshold, it logs a suggestion instead of triggering an update. Suggestions include patterns across conversations, tone adjustments that might help, topics that come up frequently, and edge cases to consider.
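Inside the reflection function this is just the other branch of the decision: scores cleared the threshold, but the analysis surfaced something worth keeping. A sketch, with the table and column names assumed:

```typescript
// The non-urgent branch of the reflection decision: record feedback without
// touching the prompt. "suggestions" and its columns are assumed names.
import { SupabaseClient } from "jsr:@supabase/supabase-js@2";

async function logSuggestion(
  supabase: SupabaseClient,
  suggestion: string,
  scores: Record<string, number>,
) {
  await supabase.from("suggestions").insert({
    suggestion,
    scores,         // the rubric scores that accompanied the observation
    status: "open", // later marked "addressed" or "hidden" from the admin UI
  });
}
```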

The suggestions tab displays these with the ability to mark them addressed or hide them. Think of it as a backlog for prompt improvements. You review periodically and decide what to incorporate.

This gives you a thought partner. The AI notices things you might miss. You get insights without the system being reactive.

Action item: Review suggestions weekly. Look for patterns across multiple suggestions. Batch updates based on themes rather than reacting to individual items.

Common Pitfalls and Fixes

The AI Is Too Nice to Itself

The evaluation AI gives high scores even when responses are mediocre. This is the most common issue.

Fix: Make the reflection prompt more critical. Specify that the evaluator should look for specific failure modes. Provide examples of responses that should score low.

Updates Trigger Too Frequently

The prompt changes after every few conversations. The app feels unstable.

Fix: Raise the score threshold. Add a cooldown period between updates. Require multiple consecutive low scores before triggering an update.
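A sketch of both guards, reading from the tables assumed earlier; the one-hour cooldown and the two-in-a-row requirement are arbitrary starting values:

```typescript
// Guards against prompt churn: a cooldown plus consecutive low scores.
// Table names, the cooldown length, and the threshold are assumptions to tune.
import { SupabaseClient } from "jsr:@supabase/supabase-js@2";

const COOLDOWN_MS = 60 * 60 * 1000; // at most one prompt update per hour
const CONSECUTIVE_MISSES = 2;       // require two low-scoring reflections in a row
const THRESHOLD = 3.5;

async function updateAllowed(supabase: SupabaseClient): Promise<boolean> {
  // When did the prompt last change?
  const { data: lastVersion } = await supabase
    .from("system_prompts")
    .select("created_at")
    .order("created_at", { ascending: false })
    .limit(1)
    .maybeSingle();
  if (lastVersion && Date.now() - new Date(lastVersion.created_at).getTime() < COOLDOWN_MS) {
    return false; // still cooling down
  }

  // Have the most recent reflections all missed the threshold?
  const { data } = await supabase
    .from("reflection_logs")
    .select("aggregate_score")
    .order("created_at", { ascending: false })
    .limit(CONSECUTIVE_MISSES);
  const recent = data ?? [];
  return recent.length === CONSECUTIVE_MISSES &&
    recent.every((r) => r.aggregate_score < THRESHOLD);
}
```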

Updates Never Trigger

The system maintains the same prompt forever even when responses are clearly bad.

Fix: Lower the score threshold. Check that the reflection function runs. Verify messages are being stored and fetched correctly.

Context Window Exhaustion

Claude Code sessions end before the build is complete.

Fix: Use the handoff document. Break the build into discrete phases. Complete one phase per session.

Action item: Build diagnostic logging into your reflection function. Log every evaluation, every score, every decision. Use logs to debug when behavior is unexpected.

Extending the Pattern

The self-improvement loop is not limited to chatbot prompts. The same pattern applies to:

  • Automation workflows that detect failures and adjust parameters
  • Content generation systems that learn from feedback
  • Classification systems that improve accuracy over time
  • Any AI system where performance is measurable and prompts are adjustable

The pattern extends beyond prompt updates. The system proposes new features based on user behavior. It drafts changes to the app itself. The evaluation AI becomes a product manager that notices what users want.

The pattern scales to multiple feedback loops. One loop improves the chatbot. Another loop improves the evaluation rubric. A third loop improves the reflection prompt itself. Layers of self-improvement.

Action item: After your first self-improving system works, identify one more candidate. Apply the same pattern. Build your library of self-improving components.

The Build Sequence

Session 1: Foundation

Connect Supabase MCP. Create core tables. Build the basic chat interface. Verify messages store and retrieve correctly. Get the chatbot responding.

Session 2: Reflection Core

Implement the reflection function. Create the rubric. Run the first evaluations. Verify scores are being calculated and stored.

Session 3: Safety Nets

Add cooldown periods. Implement score thresholds. Create the admin interface. Add the "reflect now" button for testing.

Session 4: Version Control

Implement prompt versioning. Add revert functionality. Create the prompt history view. Test reverting after a bad update.

Session 5: Polish

Add the suggestions system. Refine the UI. Stress test with edge cases. Document the system for handoff.

Action item: Plan your sessions before starting. Know what each session should accomplish. Update the handoff document between sessions.

Why This Matters for Growth Teams

AI systems degrade. User expectations shift. Manual monitoring does not scale. The teams that build self-improving systems will operate AI at scale while others drown in support tickets and prompt tweaks.

A chatbot that needed weekly maintenance now runs autonomously. An automation that broke monthly now self-repairs. The constraint shifts from maintenance capacity to imagination about what to automate next.

Every self-improving system you build compounds. The patterns transfer. The handoff documents accelerate future builds. The gap between AI-native teams and everyone else widens every quarter.

Action item: Start with one self-improving system this quarter. Document everything. Use the documentation to build the next one twice as fast.

Final Takeaways

Self-improvement requires separation of concerns. One AI does the work. Another AI judges the work. Keep them independent.

Stability is the default. Updates should be rare. A reactive system that changes constantly is worse than a static system.

Version control is mandatory. The system will make bad updates. You must be able to revert.

Test with the ruthless critic. Create an impossible rubric. Verify the system fails when it should fail.

The pattern generalizes. Chatbots are the example. Automations, classifiers, and content systems all benefit from self-improvement loops.

Build the admin interface. You need visibility and control. The system should be auditable and adjustable.


yfxmarketer

AI Growth Operator

Writing about AI marketing, growth, and the systems behind successful campaigns.
