In 2019, I joined a project that had been in development for fourteen months. The team was talented — senior engineers, a sharp product manager, a dedicated architect. They had a service mesh. They had a CDC pipeline feeding a data lake. They had a custom feature flag system built in-house because the commercial options "didn't quite fit." They had an API gateway with a bespoke rate limiter that could handle ten thousand requests per second.
They had never shipped to a single user.
Fourteen months of building, and the product didn't exist yet. The roadmap was a list of capabilities — authentication, authorization, multi-tenancy, event sourcing, audit logging — each one delivered with engineering precision, each one waiting for the others before the whole thing could face a customer. The team called it "building the platform." What they were actually doing was deferring the riskiest question any product can face: does anyone want this?
I asked the architect what the minimum set of features was that could go to a pilot customer. He showed me a diagram with forty-two components. I asked which ones a single customer doing a single workflow would actually touch. He stared at the diagram for a long time and said, "Maybe seven."
We shipped those seven components in six weeks. The pilot customer found a workflow problem in the first hour of use that would have survived another year of platform building because no one had tested it against real human behavior. The CDC pipeline, the data lake, the custom rate limiter — none of them mattered yet. What mattered was that a real person tried to do a real thing and the system didn't support it.
That's the gap this article is about. Parts 1 and 2 of this series covered how to build systems that scale and how to keep them alive when things break. This part is about a different discipline — the discipline of shipping. Of choosing what to build first. Of automating what matters and ignoring what doesn't. And of designing for a world where AI has fundamentally changed the build-versus-buy calculus.
The fourth and fifth forces from our framework — evolution and operational cost — drive every decision in this article. A system that can't evolve to meet what you learn from users is a system that's already behind. And every component you build is a component you operate, forever.
The MVP That Isn't Minimal Enough
Most teams understand the concept of a minimum viable product. Few teams are honest about what "minimum" actually means.
The word does real work in that phrase. Minimum means you shipped less than you wanted to. It means the product manager is uncomfortable. It means the designer is wincing at screens that don't match the vision. It means the engineer knows there's a shortcut in the data layer that will need to be replaced in three months.
That discomfort is the signal that you've actually found the minimum. If everyone's happy with what shipped, you almost certainly built too much.
The goal of an MVP isn't to impress users. It's to learn something you can't learn any other way — by watching a real person use real software to do a real task. Every feature you add before that moment is a bet placed without evidence.
Here's the filter I use when deciding what belongs in a first release:
| Include | Exclude |
|---|---|
| The core workflow the user came for | Admin dashboards |
| The data model that supports that workflow | Reporting and analytics |
| Authentication (you need to know who's using it) | Fine-grained authorization (roles, permissions) |
| Basic error handling that doesn't lose data | Comprehensive error recovery |
| One integration that proves the system connects to the real world | Every integration on the roadmap |
| Logging sufficient to debug problems | Full observability stack |
The right side of that table isn't unimportant. It's unimportant right now. The discipline is sequencing — building what teaches you something before building what completes the vision.
I've seen this go wrong in both directions. Teams that ship too little — a prototype that can't survive contact with real data — learn nothing because users can't get past the rough edges to reach the actual workflow. Teams that ship too much — a polished product with every edge case handled — learn nothing because they spent nine months building before asking the question.
The sweet spot is a product that's rough but functional. Users can complete the core task. When they hit an edge case, they get a clear error instead of silent corruption. The gaps are visible, documented, and deliberately chosen — not accidents.
Ship the smallest thing that teaches you the biggest lesson. Everything else is inventory.
Automate What You'll Do Twice
There's a rule of thumb in engineering: if you'll do something once, do it manually. If you'll do it twice, automate it. The reasoning is simple — automation has a cost, and that cost only pays off with repetition.
For deployment, you'll do it more than twice. You'll do it hundreds of times. Automate it immediately.
A deployment pipeline isn't a luxury for mature teams. It's a prerequisite for learning. If deploying takes thirty minutes of manual steps — SSH into the server, pull the latest code, run migrations, restart the service, check the logs — you deploy less often. When you deploy less often, each deployment carries more changes. When each deployment carries more changes, failures are harder to diagnose. When failures are harder to diagnose, you deploy even less often. It's a vicious cycle that ends with weekly releases and a team that's terrified of Fridays.
The minimum viable pipeline has a trigger and four stages:
Commit triggers the pipeline. No manual steps to "start a build."
Test runs the automated tests. Start with whatever you have — even a handful of integration tests that cover the critical path. The test suite doesn't have to be comprehensive on day one. It has to exist. A pipeline with a three-minute test suite that catches 60 percent of regressions is infinitely more valuable than no pipeline and a plan to add tests later.
Build produces a deployable artifact — a container image, a serverless package, a compiled binary. The artifact is immutable. What you tested is what you deploy. No "build on the server" steps that introduce drift between what was tested and what runs in production.
Deploy pushes the artifact to the target environment. For most teams starting out, this is a single environment. Blue-green and canary come later, when the deployment frequency justifies the complexity.
Verify runs a smoke test against the deployed environment. Can the application respond to a health check? Can it complete the core workflow? If verification fails, roll back automatically. This is the safety net that makes frequent deployment sustainable.
Start with GitHub Actions or GitLab CI and a single workflow file. A fifteen-line YAML file that runs tests and deploys on merge to main is worth more than a week spent evaluating CI/CD platforms. You can always migrate the pipeline. You can't recover the weeks you spent deploying manually while deciding which tool to use.
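A workflow of that shape might look like the following sketch. This assumes GitHub Actions; `make test` and `scripts/deploy.sh` are placeholders for whatever your project actually runs.

```yaml
# Hypothetical minimal pipeline: test, build, deploy on every merge to main.
name: ci
on:
  push:
    branches: [main]
jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test                                  # placeholder test command
      - name: Build immutable artifact
        run: docker build -t app:${{ github.sha }} .
      - name: Deploy
        run: ./scripts/deploy.sh app:${{ github.sha }}  # placeholder deploy script
```

That's the whole file. Everything else — caching, matrix builds, approval gates — gets added later, when the pain justifies it.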
What you don't need on day one: parallel test execution across multiple runners, matrix builds for multiple environments, artifact caching strategies, deployment approval gates, Slack notifications for every build status. All of these are valuable. None of them are prerequisites.
The pattern I see repeatedly — in teams of five and teams of fifty — is that the pipeline grows organically once it exists. The first deployment pipeline is always embarrassingly simple. Six months later, the team has added linting, security scanning, performance benchmarks, and automated changelog generation, each one added because someone on the team felt the pain of not having it. That organic growth is exactly how it should work. The pipeline evolves in response to real problems, not hypothetical ones.
Build, Buy, or Prompt
For twenty years, the architectural decision for any new capability was binary: build it yourself or buy a product that does it. Custom code versus commercial software. The tradeoff was straightforward — build gives you control and fit, buy gives you speed and maintenance transfer. Teams evaluated both options against cost, timeline, and how central the capability was to their competitive advantage.
In 2026, there's a third option that has genuinely disrupted this calculus: prompt an AI model.
Need to classify support tickets by urgency? You could build a classification model — weeks of data labeling, training, and deployment infrastructure. You could buy a support platform with built-in classification — license negotiation, integration work, vendor lock-in. Or you could send the ticket text to an LLM with a well-crafted prompt and get a classification back in two seconds, with no training data and no procurement cycle.
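The "prompt" path above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `call_llm` is a stand-in for whichever provider SDK you use, and the interesting part is constraining the output so downstream code can parse it safely.

```python
# Sketch of classifying ticket urgency via a prompted LLM.
# `call_llm` is a placeholder for a real provider SDK call.

VALID_URGENCIES = {"low", "normal", "high", "critical"}

PROMPT_TEMPLATE = """Classify the urgency of this support ticket.
Respond with exactly one word: low, normal, high, or critical.

Ticket:
{ticket}"""

def classify_ticket(ticket_text: str, call_llm) -> str:
    """Return an urgency label, defaulting to 'normal' on unparseable output."""
    raw = call_llm(PROMPT_TEMPLATE.format(ticket=ticket_text))
    label = raw.strip().lower().rstrip(".")
    # Models occasionally add prose; never let free text reach downstream code.
    return label if label in VALID_URGENCIES else "normal"
```

No training data, no procurement cycle — the engineering effort shifts from building a model to validating its output.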
This isn't hypothetical. I've watched teams replace months of custom development with a single API call to a language model. Document summarization, data extraction from unstructured text, code review assistance, test generation, customer intent detection — all of these were build-or-buy decisions two years ago. Today they're build-or-buy-or-prompt decisions, and "prompt" wins more often than most architects expect.
| Capability | Build | Buy | Prompt |
|---|---|---|---|
| Text classification | Weeks (ML pipeline, training data, model serving) | Days (SaaS integration) + vendor lock-in | Hours (API call + prompt engineering) |
| Document summarization | Months (NLP pipeline, domain tuning) | Weeks (enterprise tool integration) | Hours (API call, works out of the box) |
| Data extraction from PDFs | Weeks (OCR + custom parsing rules) | Days (document AI platform) | Hours (multimodal model, handles layout natively) |
| Conversational interface | Months (dialog management, NLU, intent mapping) | Weeks (chatbot platform + customization) | Days (LLM + system prompt + guardrails) |
| Code review automation | Months (AST analysis, rule engine, false positive tuning) | Weeks (SaaS tool integration) | Days (model reads diff, returns structured feedback) |
The "prompt" column is seductive, so it's important to understand where it breaks down. Prompting an LLM works brilliantly when:
- The task is language-native (classification, summarization, extraction, generation)
- Precision requirements are "good enough" (85-95 percent accuracy is acceptable)
- Volume is moderate (thousands of calls per day, not millions)
- The cost of a wrong answer is low (a misclassified support ticket, not a misclassified medical diagnosis)
Prompting breaks down when:
- You need deterministic, reproducible outputs (the same input must always produce the exact same output)
- Latency requirements are under 100 milliseconds (model inference takes seconds)
- Volume demands would make API costs prohibitive
- The task requires real-time access to your proprietary data that can't fit in a context window
- Regulatory requirements demand explainable, auditable decision-making
The biggest risk with the "prompt" path isn't accuracy — it's dependency. If your core workflow depends on a model provider's API, you've introduced a dependency that you don't control, can't cache effectively, and can't run locally if the provider has an outage. Architect the AI capability as an enhancement with a degraded-but-functional fallback, not as the only path through the workflow. The circuit breaker patterns from Part 2 apply directly here.
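One way to keep the model as an enhancement rather than the only path: on any provider failure, fall back to a crude but deterministic heuristic so the workflow still completes. All names below are illustrative.

```python
# Degraded-but-functional fallback around an LLM dependency.
# `call_llm` is a placeholder for a real provider SDK call.

VALID = {"low", "normal", "high", "critical"}
URGENT_KEYWORDS = ("down", "outage", "data loss", "security breach")

def heuristic_urgency(text: str) -> str:
    """Fallback classifier: never smart, but always available and deterministic."""
    lowered = text.lower()
    return "high" if any(k in lowered for k in URGENT_KEYWORDS) else "normal"

def classify(text: str, call_llm) -> str:
    """Prefer the model; degrade to the heuristic on any provider error."""
    try:
        label = call_llm(text).strip().lower()
        if label in VALID:
            return label
    except Exception:
        pass  # timeout, rate limit, outage — degrade instead of failing
    return heuristic_urgency(text)
```

The fallback is worse at the task, but the workflow never blocks on a third party's uptime — the same shape as the circuit breaker, applied to an AI dependency.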
The decision framework I use:
Build when the capability is your competitive advantage, when you need full control over the behavior, or when the performance requirements exceed what a third-party can deliver. Build the thing that makes you you.
Buy when the capability is commodity infrastructure — authentication, payment processing, email delivery, monitoring. Someone else has already solved these problems better than you will, and your engineering time is better spent elsewhere.
Prompt when the capability involves language understanding, generation, or reasoning, when "good enough" accuracy is acceptable, and when time-to-value matters more than long-term cost optimization. Prompt first, and only build or buy when you've learned enough from the prompted version to know exactly what you need.
That last point is where "prompt" changes the architectural strategy most profoundly. You can prototype with a prompted solution in hours, learn from real usage, and then decide whether to invest in a built or bought replacement. The prompted version becomes your MVP — the fastest path to learning whether the capability matters at all.
Designing for Extension
The systems that ship well are the systems that make the next feature cheap. Not through prediction — you can't know what the next feature will be — but through structural decisions that keep options open.
Three patterns that consistently pay off:
Configuration over code. When a behavior might change — pricing tiers, notification rules, workflow steps, feature availability — express it as configuration rather than code. A pricing rule in a JSON document can be changed by a product manager in minutes. A pricing rule embedded in application logic requires a developer, a code review, a deployment, and the quiet fear that changing it will break something else.
This doesn't mean building a generic rules engine. It means identifying the specific behaviors that change frequently and moving them outside the code path. Feature flags are the most common example — a boolean in a configuration service that enables or disables a capability without a deployment. But the pattern extends to any decision that the business changes faster than the engineering team can deploy.
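As a concrete sketch of configuration over code: the pricing tiers below live in a JSON document that a product manager can edit, and the code only interprets it. The tier names and numbers are invented for illustration.

```python
# Pricing rules as data: changing a tier is a config edit, not a deployment.
import json

PRICING_CONFIG = json.loads("""
{
  "tiers": [
    {"name": "free",  "max_seats": 3,    "price_per_seat": 0},
    {"name": "team",  "max_seats": 25,   "price_per_seat": 12},
    {"name": "scale", "max_seats": null, "price_per_seat": 9}
  ]
}
""")

def tier_for(seats: int) -> dict:
    """Pick the first tier whose seat limit accommodates the request."""
    for tier in PRICING_CONFIG["tiers"]:
        if tier["max_seats"] is None or seats <= tier["max_seats"]:
            return tier
    raise ValueError("no tier fits")  # unreachable with an open-ended last tier
```

The code knows how to select a tier; the data decides what the tiers are. Those two things now change at different speeds.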
Events over direct calls. When service A needs to tell service B that something happened, publish an event rather than making a direct call. "Order placed" is an event. "Hey service B, update the inventory and send an email and log an audit entry" is a command that couples A to B's implementation.
The event approach means service A doesn't know or care what happens downstream. Today, one service processes the event. Next month, three services process it. The change is adding subscribers, not modifying the publisher. That's extension without modification — the architecture supports new behavior without touching existing code.
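The shape is easy to see in a minimal in-process sketch (a real system would use a message broker, but the coupling argument is identical): the publisher emits "order placed" and knows nothing about who listens.

```python
# Minimal publish/subscribe: adding a subscriber never touches the publisher.
from collections import defaultdict

_subscribers = defaultdict(list)

def subscribe(event_type, handler):
    _subscribers[event_type].append(handler)

def publish(event_type, payload):
    for handler in _subscribers[event_type]:
        handler(payload)

# Today: one subscriber. Next month: three — the publisher is unchanged.
inventory_updates = []
subscribe("order.placed", lambda e: inventory_updates.append(e["order_id"]))

publish("order.placed", {"order_id": "ord-123", "total": 49.0})
print(inventory_updates)  # → ['ord-123']
```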
Contracts at boundaries. Every integration point — between services, between your system and external APIs, between your backend and your frontend — should have an explicit contract. An API schema. A message format. A versioning strategy. The contract is the stable surface that both sides agree on, and it's what allows either side to change independently.
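A contract can be as lightweight as a versioned message schema that both sides validate against. The sketch below is illustrative — the field names are invented — but the pattern is the stable surface: a version number, explicit fields, and a parser that rejects what it doesn't understand.

```python
# An explicit, versioned contract at a service boundary.
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderPlacedV1:
    """v1 of the 'order placed' event. Breaking changes become v2 — never edits."""
    schema_version: int
    order_id: str
    total_cents: int

def parse_order_placed(msg: dict) -> OrderPlacedV1:
    """Validate an incoming message against the v1 contract."""
    if msg.get("schema_version") != 1:
        raise ValueError(f"unsupported schema version: {msg.get('schema_version')}")
    return OrderPlacedV1(
        schema_version=1,
        order_id=msg["order_id"],
        total_cents=msg["total_cents"],
    )
```

Either side can now change its internals freely; only the contract requires coordination.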
The common thread across all three patterns is separation of decisions. The code decides how to process. The configuration decides what to process. The events decide who else cares. The contracts decide how components talk. Each decision can change independently, which means each decision can evolve at its own pace — and that's what makes the system extensible without being over-engineered.
AI as a First-Class Architectural Component
In Parts 1 and 2, we discussed AI in context — AI-assisted security scanning, AI-driven canary analysis, AI observability copilots. Those were AI capabilities integrated into existing architectural concerns. This section is about something different: designing systems where AI is a primary component, not an enhancement.
An AI-native system treats the language model the way a traditional system treats the database — as a core dependency with its own performance characteristics, failure modes, and scaling requirements. The architectural patterns are still emerging, but several have stabilized enough to be useful.
The retrieval-augmented generation (RAG) pattern is the most common and the most immediately practical. Instead of fine-tuning a model on your data — expensive, slow, and stale the moment your data changes — you retrieve relevant documents at query time and include them in the model's context. The architecture is a straightforward pipeline: embed the incoming query, retrieve the nearest documents from a vector store, and assemble them into the model's prompt.
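A toy sketch of that pipeline, assuming stand-ins for the provider pieces: `embed` and `call_llm` represent your SDK calls, and the in-memory list stands in for a real vector store.

```python
# Minimal RAG shape: embed query -> retrieve nearest docs -> prompt the model.
# `embed`, `call_llm`, and the in-memory store are illustrative stand-ins.

def retrieve(query_vec, store, k=3):
    """Return the k documents whose vectors score highest against the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = sorted(
        ((dot(query_vec, vec), doc) for vec, doc in store),
        key=lambda pair: -pair[0],  # highest similarity first
    )
    return [doc for _, doc in scored[:k]]

def answer(question, embed, store, call_llm):
    """Assemble retrieved context into the prompt and generate an answer."""
    context = "\n---\n".join(retrieve(embed(question), store))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```

A production version swaps the list for a vector database and adds caching, but the data flow — and the latency budget it implies — is exactly this.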
The architectural decisions here are familiar if you've designed traditional systems: how to partition the vector store (by tenant, by document type, by recency), how to cache frequent queries, how to handle the latency budget when retrieval + generation can take 2-5 seconds, and how to degrade gracefully when the model provider is down.
The agent orchestration pattern is more complex and still evolving. An AI agent is a model that can take actions — query a database, call an API, execute code — in a loop until it achieves a goal. The architecture needs to handle:
- Tool definition and permission boundaries — what the agent can and cannot do
- State management — tracking the agent's progress through a multi-step workflow
- Cost control — a runaway agent loop can burn through API credits in minutes
- Observability — logging every step the agent takes, every tool it calls, every decision it makes
The permission model is the most critical architectural decision. An agent that can read your database and write to external APIs needs least privilege by default — explicit allow-lists for every capability, not broad access with after-the-fact auditing. The five forces framework applies directly: the constraint is trust (how much autonomy can you grant?), the tradeoff is capability versus safety, the failure mode is an agent taking an irreversible action with bad judgment, the evolution path is gradually expanding permissions as you build confidence, and the operational cost is the monitoring infrastructure needed to keep the agent accountable.
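The loop and its guardrails can be sketched generically. This is a simplified illustration, not any particular framework's API: `plan_next_step` stands in for the model deciding what to do, the `tools` dict is the explicit allow-list, and the step cap is the crudest possible cost control.

```python
# Agent loop with least-privilege tools, an audit trail, and a hard budget.
# `plan_next_step` is a stand-in for the model's decision-making.

class BudgetExceeded(Exception):
    pass

def run_agent(goal, plan_next_step, tools, max_steps=10):
    """Run the agent until it finishes, hits the budget, or oversteps."""
    history = []
    for _ in range(max_steps):
        step = plan_next_step(goal, history)        # model picks the next action
        if step["action"] == "finish":
            return step["result"], history
        if step["action"] not in tools:             # explicit allow-list
            raise PermissionError(f"tool not allowed: {step['action']}")
        observation = tools[step["action"]](step["args"])
        history.append((step["action"], observation))  # the audit trail
    raise BudgetExceeded(f"agent did not finish within {max_steps} steps")
```

Everything the framework debate is about — state, permissions, cost, observability — shows up even in this toy version.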
The build-vs-buy-vs-prompt framework applies recursively here. You can build your own agent orchestration framework, buy one (LangChain, CrewAI, Claude Agent SDK), or use a model's native tool-use capability directly. For most teams in 2026, starting with a commercial SDK and migrating to custom orchestration only when you hit its limits is the right call. The abstractions are still shifting too fast to invest in building your own.
The evaluation problem is the one most teams underestimate. Traditional software has deterministic tests — given input X, the output should be Y. AI components are probabilistic. The same input can produce different outputs, and "correct" is often a judgment call rather than an equality check.
The practical approach: build an evaluation suite of representative inputs with human-judged expected outputs. Run it on every model change, every prompt change, every retrieval pipeline change. Track accuracy, relevance, and safety scores over time. Treat a regression in evaluation scores the same way you'd treat a failing test — it blocks the deployment until someone investigates.
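In its simplest form, that evaluation suite is just labeled cases and a score threshold. The cases below are invented examples; the point is the shape — score every change, gate on regression.

```python
# A tiny evaluation suite treated like a test suite: representative inputs
# with human-judged labels, gated against a recorded baseline.

EVAL_CASES = [
    {"input": "Site is completely down",      "expected": "critical"},
    {"input": "How do I export a CSV?",       "expected": "low"},
    {"input": "Checkout fails for some users", "expected": "high"},
]

def score(classify) -> float:
    """Fraction of eval cases the classifier gets right."""
    hits = sum(1 for case in EVAL_CASES
               if classify(case["input"]) == case["expected"])
    return hits / len(EVAL_CASES)

def gate(classify, baseline: float) -> bool:
    """Block deployment when accuracy regresses below the baseline."""
    return score(classify) >= baseline
```

Run `gate` in the pipeline on every prompt, model, or retrieval change — a regression blocks the deploy exactly like a failing unit test would.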
This is new architectural territory, and the patterns will look different in two years. What won't change is the underlying principle: AI components need the same architectural rigor as any other component — defined contracts, failure modes, scaling strategies, and observability. The model is powerful, but it's also a dependency. Treat it like one.
The Shipping Discipline
Everything in this article — MVP scope, deployment automation, the build-buy-prompt decision, extensibility, AI-native design — serves a single outcome: getting the system in front of users and learning from what happens.
The five forces converge here. Your constraints determine how much you can build before the first release — and the answer is always less than you think. Your tradeoffs determine what you defer — and deferring the right things is the hardest skill in architecture. Your failure modes now include a new one that doesn't appear in traditional resilience thinking: the failure of building the wrong thing, which no amount of redundancy or observability can fix. Your evolution strategy depends on shipping early enough to have time to evolve. Your operational cost is lowest when you're operating the smallest system that delivers value — every additional component is a tax on the team's capacity to learn and adapt.
The system I described at the beginning of this article — fourteen months, forty-two components, zero users — eventually shipped. The seven components we extracted worked. The pilot customer's feedback reshaped the product direction in ways the team hadn't anticipated. Three of the forty-two components turned out to be unnecessary. The custom rate limiter was replaced with a cloud provider's native offering. The CDC pipeline was deferred by a year and implemented in a fraction of the original scope because the team now understood what data actually needed to flow where.
The architecture that ships is the architecture that starts small, learns fast, and grows in response to evidence. The rest is inventory.
The most expensive architecture decision is building something nobody needs. No amount of technical excellence can fix a relevance problem.
In Part 4: The Numbers Behind the Architecture, we trace the full scaling journey — from a single server handling fifty requests per second to a distributed system serving half a million. Real queries, real configs, real incidents. Every stage of that journey involves the same forces we've been discussing, applied to increasingly complex constraints. The patterns get more sophisticated. The discipline stays the same.