I Test In Production: The Philosophy Behind the Meme

This post makes the case that testing in production is not a shortcut. When done deliberately, with observability, feature flags, and canary releases, it is the most honest testing strategy available.

"I test in prod" started as a developer meme. Something you posted after an incident postmortem, captioned with the kind of exhausted self-awareness that only comes from having been on-call for 36 hours. It was a confession wrapped in a shrug.

It turned out to be a philosophy.

In 2024, Testlio found that 60% of engineering organizations already use some form of production testing, whether they call it that or not. The other 40% maintain a different story: that staging is representative enough, that their test suite is comprehensive enough, that what ships to production has been properly verified. For some of them, that story is true. For most, it holds until the moment it doesn't.

Everyone Tests in Production. Most Just Don't Say It.

In 2024, Testlio's State of Testing report found that 60% of engineering organizations use some form of production testing. That includes canary releases, feature flags, A/B tests, synthetic monitoring, chaos engineering, and on-call-driven iteration. Most teams do at least one of these without naming it. The other 40% believe their staging environment is representative enough to skip production testing entirely. One of those groups is making a better bet.

What changed the conversation was not a framework or a tool. It was Charity Majors. In 2018, as CTO of Honeycomb.io, she articulated on the CoRecursive podcast what engineers had been practicing informally for years: you cannot understand a distributed system by reading about it. You have to run it. Production is the only place where the system is actually running.

The cultural shift that followed was not engineers getting more reckless. It was engineers getting more honest. Teams that admit they test in production can be deliberate about it. They can build the observability that makes it safe. Teams that deny they test in production are doing it anyway, just without the instrumentation to read what they're seeing. The confession is not the problem. The lack of tools is.

In 2024, Testlio found that 60% of engineering organizations already use some form of production testing. Teams that acknowledge this can instrument it deliberately: feature flags, canary releases, real-time alerting. Teams that deny it are doing it anyway, without the tools to detect or recover from failures.

Figure 1. Six in ten engineering organizations already use some form of production testing. The other four in ten maintain a staging-only posture. Data table follows.

Figure 1, as data: engineering teams and production testing adoption
Approach	Share of organizations
Use some form of production testing	60%
Do not use production testing	40%

Source: Testlio, State of Testing, October 2024.

Why Staging Lies

Staging environments fail to replicate production for a predictable set of reasons. Schema drift is the most common: production and staging databases accumulate divergent migration histories over months. Sequence ID collisions occur when staging data is seeded from production snapshots and new records start from overlapping IDs. Real traffic patterns (spike shape, geographic distribution, session concurrency) cannot be reproduced in a synthetic environment. These are not edge cases. They are structural features of how staging environments age.
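Schema drift of this kind can be checked for mechanically. The sketch below is a minimal illustration, assuming you can export the ordered list of applied migration IDs from each environment. The function name, the report fields, and the migration IDs are all hypothetical and not tied to any particular migration tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DriftReport:
    only_in_production: List[str]
    only_in_staging: List[str]
    first_divergence: Optional[int]  # index where the two histories stop matching


def compare_migrations(production: List[str], staging: List[str]) -> DriftReport:
    """Compare two ordered lists of applied migration IDs (hypothetical export)."""
    prod_set, stage_set = set(production), set(staging)
    first_divergence = None
    for i, (p, s) in enumerate(zip(production, staging)):
        if p != s:
            first_divergence = i
            break
    else:
        # The shared prefix matches; the histories differ only if one is longer.
        if len(production) != len(staging):
            first_divergence = min(len(production), len(staging))
    return DriftReport(
        only_in_production=[m for m in production if m not in stage_set],
        only_in_staging=[m for m in staging if m not in prod_set],
        first_divergence=first_divergence,
    )


if __name__ == "__main__":
    # Illustrative IDs only: a hotfix applied to production and never backfilled.
    prod = ["001_init", "002_users", "003_hotfix", "004_orders"]
    stage = ["001_init", "002_users", "004_orders"]
    print(compare_migrations(prod, stage))
```

Run on a schedule, a check like this turns silent drift into a visible diff. It does not replace observing the real system.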

The result is a category of failure with no good name: the staging-exclusive pass. The code runs clean in staging. The CI pipeline goes green. Confidence accumulates. Production fails anyway, at the exact moment where the divergence between the model and the real system was hiding.

In 2021, I shipped a feature that had passed staging review for six months on two separate projects. The staging database had not run the same migration sequence as production since 2019, not because anyone made a mistake, but because a hotfix had been applied directly to production two years earlier and never backfilled. The new feature queried a column that existed in staging and did not exist in production. It failed silently in production for 14 hours before anyone noticed, because the error path returned a cached value rather than an exception. Staging agreed with us the entire time. Production was running a different schema.

That specific failure would not have been possible with a feature flag and a 1% canary release. The column would have been missing on the first request, the error would have triggered the first alert, and the rollback would have been a flag flip rather than an emergency migration.

The Categories of What Staging Gets Wrong

Data fidelity. Staging data is a snapshot of production from some point in the past, with modifications. It is not a live system. Any behavior that depends on data volume, row counts, index efficiency, or data shape will not reproduce accurately.

Integration behavior. Third-party services behave differently with staging credentials: rate limits are lower, sandbox modes return different error codes, and some services don't offer staging environments at all. Your staging integration test is testing a different service.

Load shape. Your load test is a model of real traffic, not real traffic. Real traffic has spikes, crawlers, session storms, malformed requests, and retry behavior that no synthetic test reproduces reliably. The long tail of user behavior is invisible until users are producing it.

Staging environments fail in documented, predictable ways: schema drift accumulates across divergent migration histories, sequence ID collisions arise when seed data overlaps production records, and synthetic load tests cannot replicate the actual shape of user traffic. The staging environment is a model of production, and models are wrong in proportion to how much they differ from the real system.

Charity Majors and the Philosophy That Followed

In 2018, Charity Majors articulated on the CoRecursive podcast, episode 39, what the observability community had been working toward: production is the only environment where a distributed system is actually running. Everything else is a simulation. The simulation is useful, but it is not the system.

Her argument was not that staging was useless. It was that staging confidence is bounded. At some point, the only way to know how the system behaves is to run it and observe it. The shift she was describing was not from testing to not testing. It was from prevention-oriented testing to observability-oriented testing. Instead of trying to prevent all failures before they reach production, you build the capability to detect them within seconds and recover within minutes.

In 2019, she expanded the argument in Increment, Issue 8, in a piece titled "Testing in Production, the Safe Way." The enabling layer for all of it is observability. Not logging more. Not monitoring more. Specifically: the ability to ask arbitrary questions about production state in real time, without knowing those questions in advance.
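One way to picture "arbitrary questions" is a stream of wide, structured events that can be sliced by any field after the fact. The sketch below is an in-memory toy, not any vendor's API. The event fields (`status`, `build`, `region`) and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def error_rate_by(events: Iterable[Dict[str, Any]], dimension: str) -> Dict[Any, float]:
    """Group events by any field and return the share of 5xx responses per group."""
    totals: Dict[Any, int] = defaultdict(int)
    errors: Dict[Any, int] = defaultdict(int)
    for event in events:
        key = event.get(dimension, "unknown")
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Hypothetical events; a real system would emit one per request.
    events = [
        {"status": 200, "build": "a1", "region": "eu"},
        {"status": 500, "build": "b2", "region": "eu"},
        {"status": 200, "build": "b2", "region": "us"},
        {"status": 200, "build": "a1", "region": "us"},
    ]
    # The same data answers questions nobody planned for in advance.
    print(error_rate_by(events, "build"))
    print(error_rate_by(events, "region"))
```

The point of the design is that `dimension` is chosen at query time. A dashboard built around predefined metrics can only answer the questions someone thought to ask beforehand.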

Which is why the observability gap matters. In 2024, Logz.io's State of Observability report, surveying 500 DevOps professionals, found that only 10% of engineering teams have achieved full observability across their systems (Logz.io State of Observability, 2024). Eighty-two percent of teams reported a mean incident response time exceeding one hour. Those teams are testing in production without the instrumentation to read what production is telling them. That is not a philosophy. That is guessing with consequences.

In 2024, Logz.io found that only 10% of engineering teams have achieved full observability across their systems, with 82% reporting incident response times exceeding one hour. Charity Majors argued in 2018 that without observability, every deploy is a production test you cannot read. The instrumentation is what separates deliberate production testing from accidental production testing.

Feature Flags, Canary Deploys, and Dark Launches Are All Production Tests

Feature flags, canary deployments, and dark launches are not alternatives to testing. They are the formalized version of "I test in prod." They shift the question from "did we test in production?" (you did) to "what percentage of production traffic did we expose, and what did we observe?"

A feature flag ships the code to production and exposes it to 1% of users. You watch the metrics. If error rates climb, latency spikes, or conversion drops, you flip the flag. No rollback procedure required. No hotfix branch. No 3 AM deployment. You find out whether the code works in production because you ran it in production, under real load, with real data, with a kill switch.
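A percentage rollout with a kill switch can be sketched in a few lines. This is a minimal illustration using deterministic hashing, assuming string user IDs. It is not how any specific flag service works, and the flag name `new_checkout` and both function names are hypothetical.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag_name: str, user_id: str, rollout_percent: float,
               kill_switch: bool = False) -> bool:
    """Return True if this user falls inside the rollout and the flag is not killed."""
    if kill_switch:
        return False  # the off switch wins over any rollout percentage
    return bucket(flag_name, user_id) < rollout_percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    exposed = sum(is_enabled("new_checkout", u, 1.0) for u in users)
    print(f"{exposed} of {len(users)} users see the new code path")
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests. It also means that raising the percentage only adds users and never reshuffles the ones already exposed.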

A canary release routes a percentage of traffic to the new version, compares it against the baseline in real time, and promotes or rolls back based on observed behavior rather than predicted behavior. A dark launch ships a new service alongside the existing one, sends the same requests to both, discards the new response, and measures the new service's behavior under real production load before any user sees it. These are not workarounds for the lack of a good staging environment. They are a better test than staging can provide.
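The canary decision described above comes down to a comparison between two observation windows. The sketch below covers only the canary case, not the dark launch, and assumes error rate is the sole signal. The thresholds (`min_requests`, `max_absolute_increase`) are illustrative defaults, not recommendations, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request and error counts observed for one version over the same period."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Window, canary: Window,
                   min_requests: int = 500,
                   max_absolute_increase: float = 0.01) -> str:
    """Return 'wait', 'rollback', or 'promote' based on observed behavior."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_absolute_increase:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = Window(requests=10_000, errors=50)
    print(canary_verdict(baseline, Window(requests=1_000, errors=40)))
    print(canary_verdict(baseline, Window(requests=1_000, errors=6)))
```

A production version would usually compare latency and business metrics as well, and would apply a statistical test rather than a fixed threshold. The shape of the decision stays the same.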

According to a Wakefield survey of 600 software professionals commissioned by LaunchDarkly, teams using feature flags recover from incidents within one day at an 86% rate, versus 59% for teams without them. The vendor-commissioned caveat is worth stating explicitly: the finding is directionally consistent with DORA's data on small batch sizes and recovery time, but the source has a commercial interest in the conclusion. Still, the directional argument holds: if you can turn a feature off in production without a deploy, your blast radius shrinks sharply.

Figure 2. Teams with feature flags recover from incidents within one day at a 27-point higher rate than teams without. The flag is the kill switch, not just a rollout tool. Data table follows.

Figure 2, as data: incident recovery within one day, feature flags vs. no feature flags
Team type	Recovery within one day
Teams using feature flags	86%
Teams without feature flags	59%

Source: LaunchDarkly / Wakefield Research, State of Feature Management 2024. Vendor-commissioned survey of 600 software professionals.

The Spectrum from Informal to Deliberate

Informal: No flags. Deploy the code and watch Slack. If someone reports a bug, roll it back manually. Most teams start here. It is still production testing. It is just production testing without instrumentation.

Semi-deliberate: Traffic splits to a canary, manual rollback procedure documented somewhere, some alerting configured. Most mid-stage teams. Better, but rollback is still a multi-step process.

Fully deliberate: Feature flags with automated kill switches, A/B test instrumentation tied to business metrics, real-time SLO alerting that triggers the flag flip before anyone wakes up. This is what "I test in prod" looks like when it's not a confession. It is a system.
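The "fully deliberate" end of the spectrum ties an SLO check directly to the flag. The sketch below is a rough illustration, assuming an in-memory flag store and a success-ratio SLO. The class, the function, and the 99.9% target are hypothetical stand-ins for whatever flag service and alerting pipeline a team actually runs.

```python
from typing import Dict


class InMemoryFlags:
    """Stand-in for a real flag service; holds on/off state per flag name."""

    def __init__(self) -> None:
        self._enabled: Dict[str, bool] = {}

    def enable(self, name: str) -> None:
        self._enabled[name] = True

    def disable(self, name: str) -> None:
        self._enabled[name] = False

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)


def enforce_slo(flags: InMemoryFlags, flag_name: str,
                good_requests: int, total_requests: int,
                slo_target: float = 0.999) -> bool:
    """Flip the flag off if the success ratio drops below the target.

    Returns True only when this call disabled the flag.
    """
    if total_requests == 0 or not flags.is_enabled(flag_name):
        return False
    if good_requests / total_requests < slo_target:
        flags.disable(flag_name)
        return True
    return False


if __name__ == "__main__":
    flags = InMemoryFlags()
    flags.enable("new_checkout")
    flipped = enforce_slo(flags, "new_checkout", good_requests=9_900, total_requests=10_000)
    print("kill switch fired:", flipped)
```

Running a check like this on every alerting interval is what lets the flag flip happen before a human is paged.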

The difference between informal and fully deliberate is not intelligence or intent. It is investment in the infrastructure that makes production safe to test in. The meme and the mature practice look identical from the outside: same shirt, different setup.

Feature flags, canary releases, and dark launches are formalized production testing. According to a Wakefield survey of 600 software professionals commissioned by LaunchDarkly, feature flag users recover from incidents within one day at an 86% rate versus 59% for non-users. The tool is not the test. It is the off switch that bounds the blast radius of the test.

What DORA's Elite Teams Are Actually Doing

The 2024 DORA State of DevOps Report separates engineering teams into performance tiers on four metrics: deployment frequency, lead time for changes, change failure rate, and mean time to restore service. The tier gap is not incremental. It is structural.

In 2024, DORA elite performers deploy 182 times more frequently than low performers. They restore service 2,293 times faster. And their change failure rate is 5% versus 64% for low performers. Those deployment frequency numbers require production exposure. You cannot run 182 times more deployments through a staging gate without making staging the bottleneck.

The counterintuitive finding is the change failure rate. Elite teams deploy far more frequently and break things far less often. The reason is not that they are more careful. It is that their deploys are smaller. A one-line change is a production test you can read. A two-week sprint of accumulated changes is a production test that requires a postmortem to understand. The frequency is not the risk factor. The batch size is.
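The batch-size argument can be made concrete with a toy model. The sketch below assumes every change fails independently with the same probability, which is a simplification of ours and not something DORA measures or claims. The function name and the 1% per-change figure are purely illustrative.

```python
def deploy_failure_probability(changes: int, per_change_failure: float) -> float:
    """Chance that a deploy breaks if each change fails independently (toy assumption)."""
    return 1.0 - (1.0 - per_change_failure) ** changes


if __name__ == "__main__":
    p = 0.01  # illustrative per-change failure probability, not a measured value
    for batch in (1, 5, 20):
        risk = deploy_failure_probability(batch, p)
        print(f"{batch:>2} changes per deploy -> {risk:.1%} chance of failure, "
              f"{batch} suspect(s) to investigate")
```

Under these assumptions a single-change deploy fails about 1% of the time and has exactly one suspect, while a 20-change deploy fails roughly 18% of the time and leaves twenty candidates to bisect. The total risk shipped is similar either way. What changes is how readable each failure is.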

The teams with the safest production records are not the ones deploying the least. They are the ones deploying so frequently that each deploy is small enough to be observable and recoverable. The "never deploy" instinct protects teams from the blast radius of large batches. The actual protection comes from reducing the batch size until the blast radius shrinks below the threshold of consequence. The calendar is the wrong variable.

If you want to understand why Friday deploys are dangerous for most teams but routine for elite teams, the answer is in DORA's data, not in the day of the week.

Figure 3. Elite DORA performers have a 5% change failure rate versus 64% for low performers, while deploying 182 times more frequently. Safety comes from smaller batches, not from deploying less. Data table follows.

Figure 3, as data: DORA 2024 elite versus low performers
Metric	Elite performers	Low performers
Change failure rate	5%	64%
Deployment frequency (relative)	182× more frequent	Baseline
Mean time to restore service	2,293× faster	Baseline

Source: DORA, State of DevOps Report 2024; analysis via Octopus Deploy.

In 2024, DORA elite performers deployed 182 times more frequently than low performers and restored service 2,293 times faster, while maintaining a 5% change failure rate versus 64% for low performers. They achieve better safety not despite frequent production exposure, but because small-batch deploys are observable and recoverable in ways that large-batch, low-frequency releases are not.

Conclusion

Sixty percent of engineering teams already test in production. The question has never been whether they do it, only whether they do it deliberately or accidentally.

Staging fails in documented, predictable ways. Schema drift accumulates. Traffic patterns diverge. Integration credentials behave differently. The staging environment that agrees with you for six months is running a different system than the one your users are on. That is not a flaw in staging. It is a structural property of how models age relative to the thing they model.

The teams with the safest production records are not the ones who avoid production testing. They are the ones who invest in the observability, feature flags, and deployment automation that make production the honest test environment. Small batches. Fast feedback. Flip the flag, not the pager.

"I test in prod" is not a confession. It is an epistemological position: production is the truth; everything else is a model.

The engineers who test in production deliberately have not lowered their standards. They have raised their observability until those standards can be enforced in the environment that actually matters. Browse the full git commit and mayhem t-shirt collection for the rest of the team.

The Shirts You Wear When You Have Stopped Pretending

At some point, every engineer stops insisting they have a complete staging environment and starts building the observability that makes production testing safe. The shirt is what you wear after that moment.

These are not shirts celebrating recklessness. They are recognition of a real position: that production is where the system lives, that staging is a useful model with documented limits, and that the engineers who say "I test in prod" are often the ones who have built the infrastructure to do it safely.

Techmerch makes three shirts for different points on this spectrum, all in the Commits & Mayhem collection.

I Test In Prod

The original, sarcastic version, in Navy, Brown, and Black. For the engineer who has been doing this for years and is done apologizing for it.

The broader range of git t-shirts covers the full deployment experience, from the push that started everything to the postmortem that ended it. The complete coding t-shirts catalog has something for the rest of the team.

All of these ship from the devops shirt collection.

DevOps & Cloud collection

Shirts for the engineers who run the system, from the feature flag that saved the on-call to the postmortem that followed.

I Test In Prod

I Test In Production

Always Test In Prod

Frequently Asked Questions

Is it safe to test in production?

Yes, when you have the tools that make it safe: feature flags with automated rollback, active monitoring with on-call coverage, canary traffic splits, and small-batch deploys that keep blast radius manageable. Without those tools it is dangerous, regardless of the day or the environment you deploy to. The tooling is the answer, not the environment.

Why do engineers say they test in production?

Because staging is a model of production, and models are wrong in proportion to how much they differ from the real system. Schema drift, integration credential divergence, and traffic pattern gaps mean staging passes code that production fails. Engineers who say they test in production are acknowledging that production is the authoritative environment for distributed system behavior. That is not a confession of laziness but a statement about epistemology.

What is the difference between staging and production testing?

Staging tests a model of the system: approximate data, synthetic traffic, staging credentials for third-party integrations. Production tests the system itself. The gap is widest in three areas: data fidelity (staging data is a snapshot, not live), integration behavior (staging credentials behave differently), and load shape (synthetic traffic cannot replicate the long tail of real user behavior). These failure modes are documented and predictable, not random.

What are feature flags and how do they make production testing safer?

A feature flag ships code to production but exposes it to a controlled percentage of users, starting at 1% and increasing based on observed metrics. If something goes wrong, the flag flips off without a deployment. According to a Wakefield survey of 600 software professionals commissioned by LaunchDarkly, feature flag users recover from incidents within one day at an 86% rate versus 59% for teams without flags. The flag is the kill switch that bounds the blast radius.

Is there an "I test in prod" t-shirt?

Yes. Techmerch makes three: I Test In Prod in Navy, Brown, and Black; I Test In Production in Navy, Black, Asphalt, and Brown; and Always Test In Prod in Black, Navy, Brown, and Maroon. Find them in the devops shirt collection.
