
When the Same Mess Keeps Appearing, the Problem Isn't Technical

It's not uncommon to inherit a mess at a large enterprise, but inheriting the same mess multiple times teaches you something different.

I've seen this pattern play out across different brands and industries: a portfolio of tightly coupled, bespoke websites where every brand has its own rules, every exception becomes a dependency, and shipping anything takes days. One brand needs a custom checkout flow. Another requires special product categorization. A third has unique pricing logic. Each customization made sense in isolation, but together they created a system where changing one thing could break three others in ways nobody could predict.

In my recent work, I've seen the same pattern showing up with products, platforms, and services. The surface manifestation changes—sometimes it's e-commerce sites, sometimes it's internal tools, sometimes it's customer-facing applications—but the underlying structure is identical. Teams become power users of workarounds, which we politely call "undocumented capabilities."

Everyone knows the shortcuts. Don't deploy on Fridays. Test in production because staging is too different to matter. Manually verify these three things after every release because the automation can't be trusted. Call Sarah in ops before touching anything payment-related because she's the only one who knows how that integration actually works.

The operational reality

The operational reality is untenable. If engineering is fixing a bug, nothing else can ship. If something does ship, things break. Integrations fail, user experience degrades, incorrect information goes live.

I've watched release cycles stretch from hours to days because the deployment process requires coordinating across seven teams, each with their own approval gates and testing requirements. I've seen critical security patches sit for weeks because the regression risk was too high. I've witnessed product launches delayed by months because nobody could confidently predict what would break.

The teams working in these systems aren't incompetent—they're heroic. They've developed elaborate mental maps of the dependencies. They know which changes are "safe" and which might cascade. They've built tooling to detect issues faster, communication channels to coordinate around problems, and a culture of mutual support when things inevitably go wrong.

But heroism is not a sustainable operational model.

The obvious solution

Each time, the path forward is obvious: build a platform. The analysis is always compelling. We could achieve 9× throughput, 99.999% uptime, and roughly 80% fewer bugs by decoupling the right things.

I've written this business case more times than I can count. The numbers are straightforward: quantify current operational costs, estimate platform development investment, project future efficiency gains. The ROI is typically measured in months, not years. Sometimes the analysis shows the platform would pay for itself in the first quarter just from reduced incident response time.
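To make the shape of that arithmetic concrete, here is a minimal payback sketch in Python. Every figure is an invented placeholder, not a number from any real engagement; the point is only the structure the business case follows: current operational costs become projected monthly savings, and the build cost divided by those savings gives the payback horizon.

```python
# Hypothetical payback-period sketch for a platform business case.
# All figures are illustrative placeholders.

def payback_months(platform_investment: float, monthly_savings: float) -> float:
    """Months until cumulative operational savings cover the platform build."""
    if monthly_savings <= 0:
        raise ValueError("Savings must be positive for the investment to pay back.")
    return platform_investment / monthly_savings

# Illustrative inputs: operational drag today vs. one-time platform cost (assumed).
incident_response_cost = 120_000   # per month spent firefighting
coordination_overhead = 80_000     # per month lost to release coordination
platform_build_cost = 500_000      # one-time platform investment

monthly_savings = incident_response_cost + coordination_overhead
print(f"Payback in {payback_months(platform_build_cost, monthly_savings):.1f} months")
# With these placeholder figures the build pays for itself in about 2.5 months,
# i.e. within the first quarter, which is the shape of result described above.
```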

Funding gets approved. Everyone agrees the existing approach won't scale and is steadily bleeding money. Engineering leadership wants the platform—they're tired of the constant firefighting. Product leadership wants it—they're frustrated by how long capabilities take to ship. Finance wants it—they can see the operational costs compounding.

The decision should be trivial.

The paralysis pattern

And yet the decision sits unresolved for months, consistently deferred in favor of the next emergency.

This is where it gets interesting, because the pattern of deferral is remarkably consistent:

A critical customer escalation requires immediate attention, so the platform kickoff meeting gets postponed. Then a competitive threat emerges that demands a rapid product response, so platform work gets deprioritized. Then Q4 planning consumes all leadership bandwidth. Then there's a hiring freeze and we can't staff the platform team properly. Then it's the holidays. Then there's a reorganization. Then the customer escalation starts again.

Each time, the short-term rational choice wins out over the long-term goal.

And here's what makes this so insidious: each individual decision is rational. The customer escalation really is urgent. The competitive threat really does matter. The Q4 planning really can't wait. Nobody is being foolish or short-sighted in the moment.

But the pattern—the consistent prioritization of urgent over important, tactical over strategic, visible over systemic—that pattern reveals something deeper.

Why smart people can't decide

It's not because people don't understand the problem. I've never walked into one of these situations and had to convince anyone that the current state is unsustainable. People understand the problem.

They're paralyzed because the decision system itself has broken down.

Let me be specific about what a broken decision system looks like:

Ownership is unclear. Who actually has the authority to say "we're doing this"? Engineering thinks product should prioritize it. Product thinks it's an engineering decision about technical debt. The CTO thinks the CEO needs to weigh in because it affects roadmap. The CEO thinks this is exactly the kind of technical decision that should be delegated. How many times have you tried to collaboratively create a RACI everyone agrees to? It devolves into debate about what "consulted" versus "informed" really means while the actual accountability question stays unanswered.

Different leaders optimize for different risks. The VP of Engineering sees existential technical risk—this system will collapse under scale. The VP of Product sees market risk—competitors are moving faster and we can't afford to pause feature development. The CFO sees financial risk—platform investment shows up as cost center spend with delayed revenue impact. Nobody's wrong, but the organization has no mechanism to weigh these risks against each other.

Assumptions conflict but stay implicit. Engineering assumes platform work will take six months and require full team focus. Product assumes it can happen in parallel with feature development using spare capacity. Marketing assumes the rebrand can launch on the new platform in Q2. Finance assumes we'll see operational savings immediately. These assumptions are incompatible, but they never surface explicitly until someone starts building and everything comes crashing down.

Certain data carries outsized weight, not because it matters most, but because of who presents it. The customer escalation gets priority not because it shows larger revenue risk than platform failure, but because the account executive presenting it has the CEO's ear. The competitive capability gets resourced not because the market research supports it, but because the exec who commissioned the research has more political capital than the platform advocates.

The logjam-breaker role

Eventually, someone brings me in to break the logjam.

What I've learned is that my value isn't in having better analysis—the organization usually already has excellent analysis. It's not in technical expertise—they have talented engineers who understand the solution. It's in being the external party with enough credibility to force the implicit to become explicit.

I surface the conflicting assumptions. I make people articulate their risk tolerance explicitly rather than defaulting to "we can't afford any risk." I create frameworks that force real trade-offs instead of letting everyone pretend we can have everything. I assign decision rights clearly enough that someone can actually be held accountable.

Sometimes this looks like facilitation—running workshops that extract the mental models and make conflicts visible. Sometimes it looks like analysis—building decision frameworks that weight risks systematically. Sometimes it looks like politics—creating coalitions among stakeholders who previously saw themselves as opposed.

But the core pattern is always the same: restore the decision system's ability to make decisions.

The outcome

And when I'm done, the platform gets built. Not because I wrote better code or created better architecture—the internal team does that—but because the organizational blockers that prevented the decision have been cleared.

The platform is built to scale, supporting multiple products, enabling white-label flexibility, and adapting to whatever's next on the roadmap. Work that once required five people is handled by one. Faster, safer, and with better outcomes.

The technical transformation is often remarkable. Deployment frequency increases by 10×. Incident rates drop by 80%. Team velocity doubles. New capabilities that used to take months ship in weeks.

But the organizational transformation is more profound. Teams that were working around the system start working with it. Knowledge that was trapped in individuals becomes encoded in platform capabilities. Decisions that required executive escalation become routine team-level calls.

What I actually learned

Here's what I learned: The bottleneck isn't the analysis piece. It's whether an organization has a functioning decision system—one that can surface assumptions, force real trade-offs, and assign accountability before drift sets in.

Most leaders I work with know what needs to happen. They just can't get their organization to actually do it.

This realization completely changed how I think about consulting. Early in my career, I believed my job was to provide better technical solutions. Then I evolved to thinking it was about better analysis and recommendations.

Now I understand: my job is to diagnose and repair decision systems that have become dysfunctional. The technical and analytical work matters, but it's in service of something more fundamental—helping organizations regain the ability to commit to a direction and execute on it.

Why I work with startups now

That's why I now work with small businesses and startups. They don't have the luxury of six months of analysis paralysis. They need to make the call, build the thing, and move.

The constraints are clarifying. When you have 18 months of runway and three competitors breathing down your neck, the decision system can't afford to be broken. There's no room for implicit assumptions or unclear ownership. The stakes force clarity.

The decision systems I help them build are lean, fast, and designed to get them from "we should do this" to "it's done" without the enterprise theater.

This doesn't mean startups make better decisions—they often make worse ones, optimizing for speed over thoroughness in ways that create technical debt. But they make decisions. They commit. They learn from outcomes rather than debating hypotheticals indefinitely.

And here's the interesting part: the decision frameworks that work for startups often scale better than the heavyweight processes enterprises build. Because they're designed around the fundamental question—"what decision needs to be made, by whom, based on what criteria?"—rather than around risk mitigation and political consensus.
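As one illustration of that shape, here is a minimal sketch in Python of the kind of lightweight decision record I mean. The fields and the example values are hypothetical, not a prescribed format; the point is that the record answers the framework's three questions and nothing more.

```python
# A deliberately small decision record: the decision, the owner, the criteria,
# and a deadline. Fields and example values are hypothetical.
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DecisionRecord:
    decision: str                      # what decision needs to be made
    owner: str                         # the single person accountable for making it
    criteria: list[str] = field(default_factory=list)  # what evidence settles it
    decide_by: date | None = None      # the date after which drift becomes the default

# Hypothetical example of the platform call discussed earlier.
platform_call = DecisionRecord(
    decision="Build the shared platform or keep extending bespoke sites",
    owner="VP Engineering",
    criteria=[
        "Projected payback under 12 months",
        "Roadmap impact during the build is acceptable to Product",
        "Team can be staffed without pausing incident response",
    ],
    decide_by=date(2025, 3, 31),
)
print(platform_call.decision, "->", platform_call.owner)
```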

Turns out, just being able to ship is a pretty solid advantage.

The broader implication

This pattern has made me reconsider what "engineering leadership" actually means. We talk a lot about technical vision, architectural decisions, team building. All of that matters.

But the difference between organizations that execute and organizations that stall often isn't technical. It's whether the decision system can absorb complexity, surface trade-offs, and produce commitment.

I've seen brilliant CTOs fail because they couldn't navigate the organizational dynamics that blocked their technical vision. I've seen mediocre technical leaders succeed because they excelled at building decision systems that moved fast.

This is why my consulting practice focuses on decision science for engineering leaders. The technical problems are often straightforward. The organizational dynamics—that's where execution lives or dies.

The question I'm left with

What's the longest you've seen a decision sit in limbo?

For me, the record is 18 months. Eighteen months of analysis, debate, pilot projects, revised proposals, and committee meetings. The decision that eventually got made was essentially identical to the recommendation from month two. But it took 16 additional months of organizational process to build enough consensus and clarity to actually commit.

The cost wasn't just the delayed value. It was the opportunity cost of what could have been built instead, the organizational cynicism from teams who watched leadership unable to decide, and the talented people who left because they were tired of the paralysis.

That's the pattern I'm trying to break—not just in individual engagements, but by helping leaders build decision systems that operate independently of external consultants.

Because the technical problems will keep evolving. But the meta-problem—how organizations make hard choices under uncertainty—that's universal.