10. Infrastructure and cloud due diligence: scalability, cost, and risk
A diligence framework to test whether the hosting stack can carry the deal plan, where cloud economics are hiding, and when infrastructure risk should change price, terms, or timing.
The target called itself “cloud-native.” The banker deck said infrastructure would scale easily for the growth plan, and the model included margin improvement from “cloud optimization” starting in quarter three.
Then the buyer asked for three simple cuts: the top 20 workloads by monthly spend, the split between production and non-production, and the last disaster recovery test.
The target could not produce them. Spend was sitting in shared accounts with weak tagging. Production and development workloads were mixed. Backups existed, but failover had not been tested in more than a year. Nothing was on fire. But the business was not as changeable, recoverable, or cost-transparent as the model assumed.
That is usually how infrastructure and cloud issues show up in deals: not as a dramatic outage during diligence, but as a set of hidden constraints that delay integration, force mandatory one-time cash, and turn a “scalable platform” into the speed limit.
The primary decision is this:
Can the current infrastructure and cloud estate support the deal thesis on the required clock, or do you need to fund remediation, slow the plan, or protect the downside through deal structure?
Why infrastructure due diligence changes deal math
Infrastructure findings rarely matter because a server is old or a cloud account looks messy. They matter because they hit one of three levers:
- Clock: connectivity, migration, TSA exit, and new product or geography launches take longer than planned
- One-time cash: identity cleanup, network redesign, backup and recovery work, tenant separation, and re-platforming become mandatory before the value plan can start
- Run-rate: cloud waste, duplicated tooling, unmanaged data egress, and support-heavy environments keep IT cost above the model
If the thesis assumes fast integration or near-term EBITDA expansion, infrastructure is often the gate before applications become the issue.
The common mistake: mistaking “modern” for “under control”
Deal teams see AWS, Azure, Kubernetes, Terraform, and SaaS-heavy tooling and assume the hard part is done.
That is the wrong test. Modern components can still be badly governed. The real questions are more basic:
- Can the estate absorb growth or integration change without breaking?
- Can management explain where the money goes and which costs move with scale?
- Can the business recover if a critical platform fails, a region goes down, or a control issue forces a change?
If the answer is unclear, the buyer is not underwriting technology. The buyer is underwriting hope.
A practical lens: scale, spend, survivability
You do not need a six-week infrastructure program to get signal. In most deals, you can reach a useful answer by testing three things.
1) Scale: what happens if demand or change volume jumps?
What to check:
- top revenue-supporting workloads and where they run
- current utilization patterns and auto-scaling behavior
- network, identity, and environment design for adding users, entities, or regions
- deployment frequency and whether infrastructure changes are repeatable
What usually breaks:
- workloads can handle day-to-day traffic but not post-close integration load, reporting spikes, or a new country rollout
- production and non-production are mixed, so urgent testing or migration work adds operational risk
- environments were built for a single business unit, not for the buyer’s control model or combined-company access patterns
The issue is not “does the system run today?” The issue is whether it can take on deal-driven change without a stability penalty.
2) Spend: can anyone tie cloud cost to business activity?
What to check:
- monthly cloud spend by workload, environment, and owner
- reserved commitment exposure, minimum commits, and unused capacity
- cost allocation quality: tags, account structure, chargeback logic
- third-party managed services and embedded infrastructure costs sitting outside the cloud invoice
What usually breaks:
- the target quotes a single cloud total but cannot separate production from experimentation, project spend, or stranded capacity
- “optimization upside” is really a one-time cleanup effort that competes with integration work
- cloud costs appear low because observability, security tooling, or managed support are booked elsewhere
If cost cannot be tied to workloads, you cannot underwrite margin improvement with confidence.
3) Survivability: how does the business respond when something fails?
What to check:
- backup and recovery design for crown-jewel systems
- last DR test date, scope, and actual recovery performance
- privileged access model, logging coverage, and environment segregation
- operational ownership: who runs incidents, who approves change, who can recover core services
What usually breaks:
- backups exist but have never been restored at the scale the business needs
- a cloud account or tenant design makes separation, access changes, or forensic tracing hard
- control improvements required by the buyer expose how little of the environment is standardized
This is where “cloud-first” stories fail. The problem is not the platform. It is the operating discipline around it.
What to ask for in diligence: five artifact pulls that create real signal
The fastest way to cut through confident narratives is to ask for evidence that operating teams actually use.
1) A workload inventory for the top 20 revenue, reporting, and customer-critical services
Ask for:
- workload name and owner
- hosting location and cloud account or subscription
- monthly run cost
- environment split: production, non-production, disaster recovery
- dependency on shared services
Why it matters:
This shows whether management knows what it is running, where it runs, and what it costs. If they cannot produce it quickly, governance is weaker than the architecture diagram suggests.
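A quick way to test the inventory once it arrives is to check it for blanks. The sketch below is illustrative only: the field names mirror the ask list above, but the `Workload` record and the `inventory_gaps` helper are hypothetical names, not a standard format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Workload:
    """One row of the top-20 inventory. None means the target could not supply it."""
    name: str
    owner: Optional[str]
    hosting_location: Optional[str]
    cloud_account: Optional[str]
    monthly_run_cost: Optional[float]
    environment: Optional[str]          # e.g. "production", "non-production", "dr"
    shared_dependencies: Optional[list]  # an empty list means "none", not "unknown"


def missing_fields(workload):
    """List the fields the target left blank for a single workload."""
    return [f.name for f in fields(workload) if getattr(workload, f.name) is None]


def inventory_gaps(workloads):
    """Map each incomplete workload to the fields that are still missing."""
    gaps = {}
    for w in workloads:
        missing = missing_fields(w)
        if missing:
            gaps[w.name] = missing
    return gaps
```

A long gap list on revenue-critical workloads is itself a finding: it shows where ownership and cost visibility stop.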
2) Twelve months of cloud billing exports and the commitment position
Ask for:
- monthly spend by service and account
- reserved instances, savings plans, committed-use discounts, or similar commitments
- credits, promotional offsets, and one-off anomalies
- top cost spikes with explanation
Why it matters:
Spend patterns tell you whether cost is elastic, seasonal, or simply unmanaged. Credits and expiring commitments can make current run-rate look better than the post-close reality.
3) The disaster recovery record, not the DR policy
Ask for:
- last DR and backup-restore test results
- actual recovery time achieved versus target
- systems excluded from testing
- unresolved findings and owners
Why it matters:
Deals add change and scrutiny. If recovery has not been tested on the workloads that matter to revenue and close reporting, you are buying continuity risk.
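The DR record can be screened mechanically before anyone debates it. The following is a minimal sketch, assuming each record carries a system name, a last test date, and target and actual recovery hours; the record keys and the `dr_findings` function are hypothetical, and the 365-day default simply encodes a 12-month staleness rule.

```python
from datetime import date


def dr_findings(records, as_of, max_age_days=365):
    """Flag systems with no test, a stale test, or a missed recovery target.

    Each record is a dict with keys: system, last_test (date or None),
    target_hours, actual_hours. These keys are illustrative assumptions.
    """
    findings = {}
    for r in records:
        issues = []
        if r["last_test"] is None:
            issues.append("never tested")
        else:
            if (as_of - r["last_test"]).days > max_age_days:
                issues.append("test older than 12 months")
            if r["actual_hours"] > r["target_hours"]:
                issues.append("missed recovery target")
        if issues:
            findings[r["system"]] = issues
    return findings
```

Systems that never appear in the record at all are a separate question; the ask list above covers them under exclusions from testing.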
4) The infrastructure change record
Ask for:
- the last 6-12 months of major incidents tied to hosting, networking, identity, or platform services
- release/change windows for infrastructure
- rollback history and the common failure modes
Why it matters:
This is your change clock. A team with high change-failure rates or slow rollback cannot absorb aggressive Day-1 and Day-100 plans, no matter how good the target state looks on paper.
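Two numbers summarize the change record well enough for a diligence discussion: the share of infrastructure changes that failed, and how long rollback typically took. The sketch below assumes a simple change log where each entry has a `failed` flag and, for failed changes, a `rollback_minutes` value; both field names are hypothetical.

```python
from statistics import median


def change_failure_rate(changes):
    """Share of changes that failed, or None if the log is empty."""
    if not changes:
        return None
    failed = sum(1 for c in changes if c["failed"])
    return failed / len(changes)


def median_rollback_minutes(changes):
    """Median rollback time across failed changes, or None if none failed."""
    durations = [c["rollback_minutes"] for c in changes if c["failed"]]
    if not durations:
        return None
    return median(durations)
```

What counts as a high rate or a slow rollback depends on the business; the point is to compare the observed numbers against the pace the Day-1 and Day-100 plans assume.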
5) The control map for identity, logging, and environment boundaries
Ask for:
- privileged access model
- logging coverage for critical accounts and workloads
- segmentation between production and non-production
- infrastructure-as-code coverage for critical environments
Why it matters:
These controls determine whether the buyer can connect environments safely, investigate incidents, and make repeatable changes during integration or separation.
Decision triggers that should change price, terms, or timing
Not every infrastructure weakness deserves a red box in the diligence report. Some do. The useful question is which findings force an explicit decision.
Trigger 1: Cloud spend is not allocable enough to support the value case
If management cannot explain where roughly 80-90% of cloud spend goes by workload or environment, treat any near-term cloud synergy or margin-improvement assumption as weak.
What it changes:
- remove or delay the savings line from the first-year plan
- fund a cost-baseline and tagging cleanup as mandatory one-time cash
- stress-test whether IT run-rate is understated
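The allocability test behind this trigger is simple arithmetic on the billing export. As a rough sketch, assume each billing line item is a dict with a `cost` and optional `workload` and `environment` tags (hypothetical field names), and that a line counts as allocable if either tag is present. The 0.8 default reflects the lower end of the 80-90% range above.

```python
def allocable_share(line_items):
    """Fraction of total spend that carries a workload or environment tag."""
    total = sum(item["cost"] for item in line_items)
    if total == 0:
        return 0.0
    allocated = sum(
        item["cost"]
        for item in line_items
        if item.get("workload") or item.get("environment")
    )
    return allocated / total


def supports_savings_case(line_items, threshold=0.8):
    """True if enough spend is allocable to underwrite a cloud savings line."""
    return allocable_share(line_items) >= threshold
```

If the share lands below the threshold, the savings line is resting on spend nobody can explain.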
Trigger 2: A critical workload has no tested recovery path
If the systems behind order-to-cash, customer service, or month-end close have not been through a meaningful restore or failover test in the last 12 months, continuity risk is real.
What it changes:
- fund recovery testing and remediation before aggressive migration work
- do not rely on early consolidation or tenant moves for those workloads
- consider protection through deal structure if the business is highly outage-sensitive
Trigger 3: Environment design blocks integration or separation
If production systems sit in shared tenants, shared accounts, or flat network designs that the buyer cannot govern cleanly, the estate may be stable but still unfit for the deal.
What it changes:
- delay connectivity and identity integration
- increase one-time cash for tenant/account restructuring
- reset TSA exit timing where shared infrastructure is involved
Trigger 4: The savings story depends on optimization capacity the team does not have
If the same small platform team is expected to stabilize operations, support Day 1, separate or integrate environments, and also remove cloud waste, the savings case is on the wrong clock.
What it changes:
- separate stabilization work from optimization work
- add outside capacity or remove early savings from the model
- sequence value capture after control and ownership are in place
Trigger 5: The estate is multi-cloud or hybrid without standard control patterns
Multiple clouds are not a red flag by themselves. But if the target is running a mix of cloud providers, data centers, and managed hosting with different identity models, logging patterns, and deployment methods, complexity rises fast.
What it changes:
- expand the integration timeline
- avoid assuming rapid tooling consolidation
- fund standardization first, then rationalization
What the best teams do before signing
Strong teams do not try to redesign infrastructure during diligence. They force three outputs that help the deal team choose the right posture.
1) A workload heatmap that shows what really sets the clock
The heatmap should rank the top workloads on two axes:
- business criticality
- change friction
This quickly separates what is safe to touch early from what needs stabilization first.
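One way to build the heatmap is to score each workload on both axes and sort. The sketch below assumes a 1-to-5 score per axis and a friction cutoff of 4 for the "stabilize first" label; the scale, the cutoff, and the `heatmap` function name are all illustrative assumptions rather than a fixed method.

```python
def heatmap(workloads, friction_threshold=4):
    """Rank workloads by criticality, then friction, and label each one.

    Each workload is a dict with keys: name, criticality, friction,
    where both scores are on an assumed 1 (low) to 5 (high) scale.
    """
    ranked = sorted(
        workloads,
        key=lambda w: (w["criticality"], w["friction"]),
        reverse=True,
    )
    return [
        (
            w["name"],
            "stabilize first" if w["friction"] >= friction_threshold else "safe to touch early",
        )
        for w in ranked
    ]
```

The workloads at the top of the list with a "stabilize first" label are the ones that set the clock.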
2) A cloud cost bridge from reported spend to true run-rate
This bridge should show:
- recurring run cost
- temporary project or migration cost
- credits and offsets that disappear
- stranded commitments or support cost
Without that bridge, the deal model is using accounting convenience instead of operating truth.
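The bridge itself is a short calculation. The sketch below is one possible layout, under stated assumptions: reported spend is net of credits, temporary project cost is sitting inside it, support cost is booked outside the cloud invoice, and stranded commitments are left inside reported spend because they keep billing until they expire. The function and parameter names are hypothetical.

```python
def cloud_cost_bridge(reported_monthly, temporary_project, expiring_credits, off_invoice_support):
    """Walk from reported monthly cloud spend to an estimated true run-rate.

    Returns the bridge steps as (label, amount) pairs plus the resulting
    run-rate. All amounts are monthly and in the same currency.
    """
    steps = [
        ("reported cloud spend", reported_monthly),
        ("less: temporary project or migration cost", -temporary_project),
        ("add: credits and offsets that disappear", expiring_credits),
        ("add: support cost booked outside the cloud invoice", off_invoice_support),
    ]
    run_rate = sum(amount for _, amount in steps)
    return steps, run_rate
```

With illustrative inputs of 100,000 reported, 15,000 temporary, 10,000 in expiring credits, and 8,000 in off-invoice support, the run-rate comes out at 103,000 a month, above the reported figure.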
3) A funded Day-100 infrastructure posture
The posture should answer four questions:
- what must be stabilized before Day 1 or immediately after close
- what can be connected safely in the first 100 days
- what must wait until control gaps are closed
- what one-time cash is mandatory to get there
That is enough to change the investment committee discussion from “cloud looks fine” to “here is the real clock and the real cash.”
Where teams get trapped after close
Three patterns show up repeatedly.
First, they inherit a cost problem disguised as a tooling problem. The instinct is to launch optimization sprints. The harder truth is that cost visibility, ownership, and environment hygiene were never built. Savings arrive only after governance exists.
Second, they discover that shared infrastructure makes separation or integration slower than application work. The application plan may be ready, but identity, network boundaries, logging, and account structure are not. The result is delay without visible progress.
Third, they ask the platform team to do too much at once. Keep the business stable. Support finance close. Enable new controls. Migrate environments. Reduce spend. Improve resilience. Good teams make a sequence choice. Weak teams call all of it priority one and then miss the quarter.
What to do in the next two weeks (owners included)
If you want infrastructure due diligence to change the outcome, force it into the deal pack now.
- Build the top-20 workload inventory (tech DD lead + target platform lead). Include owner, monthly cost, hosting location, and recovery posture.
- Create the cloud cost bridge (tech DD lead + finance + cloud operations lead). Separate recurring run cost from credits, project spikes, and stranded commitments.
- Pressure-test recoverability (security/infrastructure lead + target operations lead). Review the last DR test and identify which critical services have not been restored or failed over in the last 12 months.
- Map the control gaps that affect Day 1 (integration or separation lead + identity/network lead). Focus on privileged access, logging, tenant/account boundaries, and production segregation.
- Make one explicit timetable call (deal lead + IC sponsor). Decide which infrastructure-dependent value levers stay in year one, which slip to later years, and what mandatory cash must be funded before close or immediately after.
Infrastructure due diligence is not a tour of the hosting stack. It is a test of whether the platform can carry the deal agenda without breaking the clock, the cost case, or the business.