Open source · 2026-05-25
We open-sourced our production AI stack.
Five Apache 2.0 repositories. The actual infrastructure we ship inside every mid-market AI engagement. Here's what each does, why we built it, and how to use it on your own workflows.
The short version
Today we're releasing five repositories on GitHub under Apache 2.0. They are not toys. They are the actual infrastructure we run inside every mid-market AI engagement we ship: the eval harness, the MCP server, the audit log schema, the multi-model router, and the prompt library.
We open-sourced them because the artifact quality is the product. If you can read our SOW template and our audit log schema before you pay a deposit, you already know whether we're the right partner. The engagement itself is the integration, customization, and operating discipline you pay for, not the abstract IP.
Why now
We have shipped sixteen production AI workflows over the last fifteen months. Every one of them ran on a stack we kept rebuilding from scratch each time — mostly because the open-source ecosystem either didn't exist or didn't meet the bar regulated mid-market customers need.
By month nine, the same five components kept showing up. Same eval harness shape. Same audit log fields. Same prompt patterns. Same cost-routing logic. We were copying code from one client's repo to another and renaming variables. That's when an internal toolset starts to want to be an open-source library.
So we extracted them. We hardened them for the public. We licensed them Apache 2.0. And we're publishing them today — the same week we're also announcing our team hiring roadmap (/about) and our SOC 2 Type II readiness timeline (/security).
What you can do with the five repos
1. eval-harness-template
Every production AI workflow needs an eval suite. Not the prompt-engineer-tests-by-hand kind — the labelled-test-set-runs-in-CI-on-every-change kind. The harness defines the test cases, the scoring rubric, the gating thresholds, and the regression detection that catches when your prompt change broke yesterday's working examples.
Our template ships with TypeScript + Python bindings, a sample evaluation flow against the OpenAI / Anthropic SDKs, and a GitHub Actions configuration that runs the suite on every PR. It is a starter, not a framework — fork it, replace the rubric with yours, plug it into your CI.
github.com/victorwhale/eval-harness-template
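To make the idea concrete, here is a minimal sketch of the four pieces named above (test cases, scoring rubric, gating threshold, regression detection). It is an illustration only, not code from the repository: `Case`, `exact_match`, `run_suite`, `gate`, and the stub model are hypothetical names, and a real rubric would usually be richer than exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One labelled example: a prompt and the answer we expect."""
    case_id: str
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Toy rubric (assumption): 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_suite(cases: list[Case], model: Callable[[str], str],
              rubric: Callable[[str, str], float] = exact_match) -> dict[str, float]:
    """Score every labelled case with the given model callable."""
    return {c.case_id: rubric(model(c.prompt), c.expected) for c in cases}


def gate(scores: dict[str, float], baseline: dict[str, float],
         threshold: float = 0.9) -> dict:
    """Fail when the mean score is below the threshold or any case got worse."""
    if not scores:
        raise ValueError("empty eval suite")
    mean = sum(scores.values()) / len(scores)
    # A regression is a case that scores lower than it did in the stored baseline.
    regressions = sorted(cid for cid, s in scores.items() if s < baseline.get(cid, 0.0))
    return {"mean": mean, "regressions": regressions,
            "passed": mean >= threshold and not regressions}


# Usage with a stub in place of a real SDK call.
cases = [Case("t1", "Classify: refund request", "billing"),
         Case("t2", "Classify: password reset", "account")]


def stub_model(prompt: str) -> str:
    return "billing" if "refund" in prompt else "general"


scores = run_suite(cases, stub_model)
report = gate(scores, baseline={"t1": 1.0, "t2": 1.0})
print(report)  # {'mean': 0.5, 'regressions': ['t2'], 'passed': False}
```

In CI, a non-passing report would translate into a non-zero exit code so the PR check fails.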
2. mcp-server-template
Model Context Protocol is becoming the 2026 standard for AI tool registries. Anthropic shipped the protocol, but production-grade server implementations are scarce — especially the ones that handle auth, rate limiting, and audit logging the way regulated mid-market customers need.
This template spins up an MCP server in ten minutes with the security defaults you actually need: API key auth, structured rate limiting, full audit logging on every tool call, schema validation, and the standard MCP capabilities pattern. Add your tools, deploy, done.
github.com/victorwhale/mcp-server-template
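The security defaults listed above can be sketched independently of any MCP SDK. The example below is a plain-Python illustration under stated assumptions, not the template's code and not the MCP wire protocol: `GuardedTools`, `TokenBucket`, the outcome strings, and the type-map "schema" are all hypothetical. It shows the order of checks (auth, schema validation, rate limit) and that every call, accepted or rejected, leaves an audit entry.

```python
import hashlib
import hmac
import time
from typing import Any, Callable


class TokenBucket:
    """Allows `capacity` calls in a burst, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float, clock: Callable[[], float]):
        self.capacity = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class GuardedTools:
    """Hypothetical wrapper: auth, validation, rate limiting, audit logging."""

    def __init__(self, api_keys, capacity=2, refill_per_sec=1.0, clock=time.monotonic):
        self.api_keys = list(api_keys)
        self.capacity, self.refill_per_sec, self.clock = capacity, refill_per_sec, clock
        self.buckets: dict[str, TokenBucket] = {}
        self.tools: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {}
        self.audit_log: list[dict] = []

    def register(self, name: str, fn: Callable[..., Any], schema: dict[str, type]) -> None:
        self.tools[name] = (fn, schema)

    def _outcome(self, api_key: str, name: str, args: dict) -> str:
        # Constant-time comparison avoids leaking key contents through timing.
        if not any(hmac.compare_digest(api_key.encode(), k.encode()) for k in self.api_keys):
            return "unauthorized"
        if name not in self.tools:
            return "unknown_tool"
        schema = self.tools[name][1]
        if set(args) != set(schema) or not all(isinstance(args[k], t) for k, t in schema.items()):
            return "invalid_args"
        bucket = self.buckets.setdefault(
            api_key, TokenBucket(self.capacity, self.refill_per_sec, self.clock))
        return "ok" if bucket.allow() else "rate_limited"

    def call(self, api_key: str, name: str, args: dict) -> dict:
        outcome, result = self._outcome(api_key, name, args), None
        if outcome == "ok":
            try:
                result = self.tools[name][0](**args)
            except Exception:
                outcome = "tool_error"
        # Log a key fingerprint rather than the key itself.
        self.audit_log.append({"ts": time.time(), "tool": name, "outcome": outcome,
                               "key_id": hashlib.sha256(api_key.encode()).hexdigest()[:12]})
        return {"outcome": outcome, "result": result}
```

A real deployment would keep the audit entries in durable storage rather than an in-memory list, and would load keys from a secret store.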
3. audit-log-spec
There is no industry-standard schema for AI inference audit logs. So every engagement we touch starts with the same conversation: what fields do we log, in what format, with what retention, in what storage. We wrote it down once. Then we published it.
The spec defines the exact fields every AI inference must log to be defensible to an auditor — input context fingerprint, retrieval bundle hash, model version, prompt hash, output, downstream action, reviewer disposition, timestamp, cost-per-call. It maps cleanly to HIPAA, FINRA, GDPR, and EU AI Act retention requirements. WORM-compatible storage assumed.
github.com/victorwhale/audit-log-spec
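As a rough picture of what one such record could look like, the sketch below maps the fields listed above onto a Python dataclass. The field names, the choice of SHA-256, and the JSON-lines serialization are assumptions made for illustration; the published spec is the authority on naming and format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class InferenceAuditRecord:
    """One record per inference. Field names are illustrative assumptions."""
    input_context_fingerprint: str
    retrieval_bundle_hash: str
    model_version: str
    prompt_hash: str
    output: str
    downstream_action: str
    reviewer_disposition: str
    timestamp: str
    cost_per_call_usd: float


def build_record(*, input_context: str, retrieved_chunks: list[str], model_version: str,
                 prompt: str, output: str, downstream_action: str,
                 reviewer_disposition: str, cost_per_call_usd: float,
                 timestamp: Optional[str] = None) -> InferenceAuditRecord:
    """Hash the bulky inputs and keep the fields an auditor needs to see."""
    return InferenceAuditRecord(
        input_context_fingerprint=sha256_hex(input_context),
        # Chunk order is preserved because the model saw the chunks in this order.
        retrieval_bundle_hash=sha256_hex(json.dumps(retrieved_chunks, ensure_ascii=False)),
        model_version=model_version,
        prompt_hash=sha256_hex(prompt),
        output=output,
        downstream_action=downstream_action,
        reviewer_disposition=reviewer_disposition,
        timestamp=timestamp or datetime.now(timezone.utc).isoformat(),
        cost_per_call_usd=cost_per_call_usd,
    )


def to_json_line(record: InferenceAuditRecord) -> str:
    """Serialize as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = build_record(
    input_context="ticket #123 body",            # hypothetical example values
    retrieved_chunks=["policy section 4", "refund table"],
    model_version="example-model-v1",            # placeholder, not a real model ID
    prompt="Classify the ticket.",
    output="billing",
    downstream_action="routed_to_billing_queue",
    reviewer_disposition="approved",
    cost_per_call_usd=0.0021,
)
print(to_json_line(record))
```

Append-only JSON lines fit the WORM assumption: records are written once and never updated in place.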
4. claude-multi-model-router
Every engagement we ship uses Claude. Most of them use more than one Claude model — Haiku for routine classification, Sonnet for the bulk of the work, Opus for the hard reasoning steps. Picking the right model per request is the difference between a $5K/month inference bill and a $35K/month inference bill at mid-market scale.
The router wraps the Anthropic SDK with cost-optimization defaults: complexity scoring, latency budgets, fallback chains, deterministic caching, model-version pinning. It cut one client's monthly Anthropic invoice from $42K to $14K in the first thirty days without dropping their eval scores. Same routing logic now ships as a public package.
github.com/victorwhale/claude-multi-model-router
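The routing pattern can be sketched without touching any SDK. In the example below, which is a simplified illustration and not the package's implementation, the complexity heuristic, the thresholds, the fallback order, and the model identifiers are all hypothetical placeholders; the model call is injected as a plain callable so the logic can be exercised offline. Latency budgets are left out for brevity.

```python
import hashlib
from typing import Callable

# Placeholder IDs (assumption): pin real, dated model versions in production.
PINNED_MODELS = {"haiku": "haiku-pinned-id", "sonnet": "sonnet-pinned-id",
                 "opus": "opus-pinned-id"}
# Hypothetical fallback order, tried left to right for each starting tier.
FALLBACKS = {"haiku": ["haiku", "sonnet"], "sonnet": ["sonnet", "opus"],
             "opus": ["opus", "sonnet"]}
REASONING_HINTS = ("step by step", "analyze", "reconcile", "prove")


def complexity_score(prompt: str) -> float:
    """Toy heuristic in [0, 1]: longer prompts and reasoning cues score higher."""
    length_part = min(len(prompt.split()) / 400, 1.0)
    hint_part = 0.5 if any(h in prompt.lower() for h in REASONING_HINTS) else 0.0
    return min(length_part + hint_part, 1.0)


def pick_tier(score: float) -> str:
    if score < 0.2:
        return "haiku"
    if score < 0.6:
        return "sonnet"
    return "opus"


class Router:
    def __init__(self, call_model: Callable[[str, str], str]):
        self.call_model = call_model          # (model_id, prompt) -> completion text
        self.cache: dict[str, str] = {}

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (model_id, output), walking the fallback chain on errors."""
        tier = pick_tier(complexity_score(prompt))
        last_error = None
        for candidate in FALLBACKS[tier]:
            model_id = PINNED_MODELS[candidate]
            # Cache key covers the pinned model and the exact prompt text.
            key = hashlib.sha256(f"{model_id}\x00{prompt}".encode("utf-8")).hexdigest()
            if key in self.cache:
                return model_id, self.cache[key]
            try:
                output = self.call_model(model_id, prompt)
            except Exception as exc:          # fall through to the next model
                last_error = exc
                continue
            self.cache[key] = output
            return model_id, output
        raise RuntimeError("all models in the fallback chain failed") from last_error
```

Caching like this only makes sense when the underlying calls are configured to be repeatable for the same input; otherwise a cached answer hides variation the eval suite should see.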
5. prompt-library
Most prompt libraries are personal collections without evaluation data. Yet production prompts are an empirical artifact — the only honest way to know whether a prompt works is to score it against a labelled test set. So we published ours with the evals attached.
The library currently ships battle-tested prompts for the workflows we see most often at mid-market: customer service triage, document extraction, compliance review, lead qualification, executive reporting. Each prompt comes with its labelled test set and benchmark scores against Claude Sonnet 4.6, GPT-4o, and Gemini 2.5. Fork it, run it on your data, adopt only what works.
github.com/victorwhale/prompt-library
How this fits into our agency model
We are an AI-native agency. Projects start at $15k (/pricing). Every engagement ships against this stack, pre-hardened for production from day one.
We do not sell licenses or hosted versions of these repos. We do not gate features behind a paid tier. We do not ask for attribution beyond what Apache 2.0 itself requires (keeping the license and notice files intact when you redistribute). The engagement is the integration work: picking which pieces apply to your workflow, customizing them to your data and constraints, deploying them into your infrastructure, and operating them with weekly KPI reporting and quarterly attestations.
If you want to skip the engagement and do it yourself, the repos are everything you need to start. Open an issue, send a PR, fork freely — we'll review contributions like any other maintained Apache 2.0 project.
What ships next
Five repos is the first cohort. Three more are queued for Q3 2026:
- retrieval-bundle-fingerprinter — content-addressed hashing for RAG retrieval bundles so audit logs replay exactly what the model saw.
- reviewer-queue-scaffold — production queue + SLA dashboard for human-in-the-loop AI workflows.
- compliance-attestation-pack — the quarterly attestation template we ship to regulated mid-market clients, with field mappings for HIPAA, FINRA, GDPR, and EU AI Act.
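Since the first of these repos is not yet published, the following is only a guess at the general idea of content-addressed hashing for retrieval bundles, not a preview of its code. All names (`chunk_digest`, `bundle_fingerprint`, `ChunkStore`) and the use of SHA-256 are assumptions. Each chunk is stored under the hash of its own content, and the bundle fingerprint is a hash over the ordered chunk digests, so an audit log that records the digests can later fetch exactly the text the model saw.

```python
import hashlib


def chunk_digest(text: str) -> str:
    """Content address of one retrieved chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def bundle_fingerprint(chunks: list[str]) -> str:
    """Hash over the ordered chunk digests; reordering changes the fingerprint."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(bytes.fromhex(chunk_digest(chunk)))
    return h.hexdigest()


class ChunkStore:
    """In-memory stand-in for a content-addressed store (illustrative only)."""

    def __init__(self) -> None:
        self._by_digest: dict[str, str] = {}

    def put(self, text: str) -> str:
        digest = chunk_digest(text)
        self._by_digest[digest] = text
        return digest

    def replay(self, digests: list[str]) -> list[str]:
        """Recover the exact chunk texts, in order, from logged digests."""
        return [self._by_digest[d] for d in digests]


store = ChunkStore()
chunks = ["policy section 4", "refund table"]
digests = [store.put(c) for c in chunks]
assert store.replay(digests) == chunks
assert bundle_fingerprint(store.replay(digests)) == bundle_fingerprint(chunks)
```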
The honest framing
We are a small agency. The founder reviews every artifact before delivery, the specialist network reinforces each engagement, and the team is being grown carefully through 2026-2027 (public roadmap at /about).
Open-sourcing our stack does two things for us. First, it raises the floor on the quality bar — once code is public, contributions and bug reports keep it sharper than any private repo we'd maintain alone. Second, it makes our commercial promise testable before you pay a deposit: read the eval harness, read the audit log spec, decide whether the engagement is worth it. We'd rather lose a prospect because the artifact quality doesn't match what they need than waste anyone's time.
If you ship AI in production, fork them, break them, send PRs. If you want us to integrate this stack into your workflows, scope your project.
Quick links
- eval-harness-template — The eval suite scaffold we run on every workflow before it ships to production. github.com/victorwhale/eval-harness-template
- mcp-server-template — Production-ready Model Context Protocol server with auth, rate limiting, audit logging out of the box. github.com/victorwhale/mcp-server-template
- audit-log-spec — JSON Schema for AI inference audit logs. WORM-compatible. HIPAA/FINRA/GDPR/EU AI Act aligned. github.com/victorwhale/audit-log-spec
- claude-multi-model-router — Cost router for Claude Opus/Sonnet/Haiku. Cuts inference bills 40-70% without dropping quality. github.com/victorwhale/claude-multi-model-router
- prompt-library — Versioned, evaluated prompt patterns for mid-market workflows. Each ships with the eval set attached. github.com/victorwhale/prompt-library
Want this stack inside your workflow?
Start an AI Project
Every engagement ships against this stack, pre-hardened for production from day one. No reinventing the wheel. Discovery in 2-3 weeks, production by week 7, or 50% of the deposit refunded.