10. Infrastructure and cloud due diligence: scalability, cost, and risk

The target called itself “cloud-native.” The banker deck said infrastructure would scale easily for the growth plan, and the model included margin improvement from “cloud optimization” starting in quarter three.

Then the buyer asked for three simple cuts: the top 20 workloads by monthly spend, the split between production and non-production, and the last disaster recovery test.

The target could not produce them. Spend was sitting in shared accounts with weak tagging. Production and development workloads were mixed. Backups existed, but failover had not been tested in more than a year. Nothing was on fire. But the business was not as changeable, recoverable, or cost-transparent as the model assumed.

That is usually how infrastructure and cloud issues show up in deals. Not as a dramatic outage in diligence. As a set of hidden constraints that delay integration, create mandatory one-time cash needs, and turn a “scalable platform” into the speed limit.

The primary decision is this:

Can the current infrastructure and cloud estate support the deal thesis on the required clock, or do you need to fund remediation, slow the plan, or protect the downside through deal structure?

Why infrastructure due diligence changes deal math

Infrastructure findings rarely matter because a server is old or a cloud account looks messy. They matter because they hit one of three levers:

Clock: connectivity, migration, TSA exit, and new product or geography launches take longer than planned
One-time cash: identity cleanup, network redesign, backup and recovery work, tenant separation, and re-platforming become mandatory before the value plan can start
Run-rate: cloud waste, duplicated tooling, unmanaged data egress, and support-heavy environments keep IT cost above the model

If the thesis assumes fast integration or near-term EBITDA expansion, infrastructure is often the gate before applications become the issue.

The common mistake: mistaking “modern” for “under control”

Deal teams see AWS, Azure, Kubernetes, Terraform, SaaS-heavy tooling, and assume the hard part is done.

That is the wrong test. Modern components can still be badly governed. The real questions are more basic:

Can the estate absorb growth or integration change without breaking?
Can management explain where the money goes and which costs move with scale?
Can the business recover if a critical platform fails, a region goes down, or a control issue forces a change?

If the answer is unclear, the buyer is not underwriting technology. The buyer is underwriting hope.

A practical lens: scale, spend, survivability

You do not need a six-week infrastructure program to get signal. In most deals, you can reach a useful answer by testing three things.

1) Scale: what happens if demand or change volume jumps?

What to check:

top revenue-supporting workloads and where they run
current utilization patterns and auto-scaling behavior
network, identity, and environment design for adding users, entities, or regions
deployment frequency and whether infrastructure changes are repeatable

What usually breaks:

workloads can handle day-to-day traffic but not post-close integration load, reporting spikes, or a new country rollout
production and non-production are mixed, so urgent testing or migration work adds operational risk
environments were built for a single business unit, not for the buyer’s control model or combined-company access patterns

The issue is not “does the system run today?” The issue is whether it can take on deal-driven change without a stability penalty.

2) Spend: can anyone tie cloud cost to business activity?

What to check:

monthly cloud spend by workload, environment, and owner
reserved commitment exposure, minimum commits, and unused capacity
cost allocation quality: tags, account structure, chargeback logic
third-party managed services and embedded infrastructure costs sitting outside the cloud invoice

What usually breaks:

the target quotes a single cloud total but cannot separate production from experimentation, project spend, or stranded capacity
“optimization upside” is really a one-time cleanup effort that competes with integration work
cloud costs appear low because observability, security tooling, or managed support are booked elsewhere

If cost cannot be tied to workloads, you cannot underwrite margin improvement with confidence.

3) Survivability: how does the business respond when something fails?

What to check:

backup and recovery design for crown-jewel systems
last DR test date, scope, and actual recovery performance
privileged access model, logging coverage, and environment segregation
operational ownership: who runs incidents, who approves change, who can recover core services

What usually breaks:

backups exist but have never been restored at the scale the business needs
a cloud account or tenant design makes separation, access changes, or forensic tracing hard
control improvements required by the buyer expose how little of the environment is standardized

This is where “cloud-first” stories fail. The problem is not the platform. It is the operating discipline around it.

What to ask for in diligence: five artifact pulls that create real signal

The fastest way to cut through confident narratives is to ask for evidence that operating teams actually use.

1) A workload inventory for the top 20 revenue, reporting, and customer-critical services

Ask for:

workload name and owner
hosting location and cloud account or subscription
monthly run cost
environment split: production, non-production, disaster recovery
dependency on shared services

Why it matters:

This shows whether management knows what it is running, where it runs, and what it costs. If they cannot produce it quickly, governance is weaker than the architecture diagram suggests.
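
One way to make the inventory request testable is to treat each workload as a record and count what the target left blank. The sketch below is illustrative only: the field names mirror the "Ask for" list above, and `WorkloadRecord`, `missing_fields`, and `inventory_gaps` are hypothetical names, not part of any standard tool.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class WorkloadRecord:
    # Hypothetical field names that mirror the "Ask for" list.
    # None means the target did not supply the value.
    name: Optional[str] = None
    owner: Optional[str] = None
    hosting_location: Optional[str] = None
    cloud_account: Optional[str] = None
    monthly_run_cost: Optional[float] = None
    environment: Optional[str] = None  # production, non-production, disaster recovery
    shared_service_dependencies: Optional[List[str]] = None  # [] means "none", not "unknown"


def missing_fields(record: WorkloadRecord) -> List[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def inventory_gaps(records: List[WorkloadRecord]) -> Dict[str, List[str]]:
    """Map each incomplete workload to the fields it is missing."""
    gaps = {}
    for index, record in enumerate(records):
        blanks = missing_fields(record)
        if blanks:
            gaps[record.name or f"row {index}"] = blanks
    return gaps
```

A long gap list on the top 20 workloads is itself a finding: it says the inventory was assembled for diligence rather than pulled from something the operating team maintains.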

2) Twelve months of cloud billing exports and the commitment position

Ask for:

monthly spend by service and account
reserved instances, savings plans, committed-use discounts, or similar commitments
credits, promotional offsets, and one-off anomalies
top cost spikes with explanation

Why it matters:

Spend patterns tell you whether cost is elastic, seasonal, or simply unmanaged. Credits and expiring commitments can make current run-rate look better than the post-close reality.
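
A rough way to build the "top cost spikes" list from twelve months of billing totals is to flag month-over-month jumps above a threshold and then ask management to explain each one. This is a minimal sketch: the 25% threshold and the function name `find_cost_spikes` are assumptions for illustration, not a standard.

```python
def find_cost_spikes(monthly_spend, threshold=0.25):
    """Flag months where spend rose more than `threshold` versus the prior month.

    monthly_spend: list of (month_label, amount) tuples in chronological order.
    Returns a list of (month_label, previous_amount, current_amount, pct_change).
    The 25% default is an assumed cutoff; tune it to the target's volatility.
    """
    spikes = []
    for (_, previous), (month, current) in zip(monthly_spend, monthly_spend[1:]):
        if previous > 0:
            change = (current - previous) / previous
            if change > threshold:
                spikes.append((month, previous, current, round(change, 3)))
    return spikes
```

Run it per account or per service as well as on the total. A spike that is invisible at the top line but large in one account is often a project or experiment that nobody owns.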

3) The disaster recovery record, not the DR policy

Ask for:

last DR and backup-restore test results
actual recovery time achieved versus target
systems excluded from testing
unresolved findings and owners

Why it matters:

Deals add change and scrutiny. If recovery has not been tested on the workloads that matter to revenue and close reporting, you are buying continuity risk.
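
The DR record can be reduced to a simple comparison of actual recovery time against target, with excluded systems called out separately. The sketch below assumes recovery times are recorded in hours; `RecoveryTest` and `recovery_findings` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecoveryTest:
    system: str
    target_hours: float            # stated recovery time target
    actual_hours: Optional[float]  # None means the system was excluded or never tested


def recovery_findings(tests: List[RecoveryTest]) -> Dict[str, str]:
    """Classify each system as untested, missed target, or met target."""
    findings = {}
    for test in tests:
        if test.actual_hours is None:
            findings[test.system] = "untested"
        elif test.actual_hours > test.target_hours:
            findings[test.system] = "missed target"
        else:
            findings[test.system] = "met target"
    return findings
```

The "untested" bucket deserves as much attention as the misses. A system excluded from testing has no recovery evidence at all, whatever the policy says.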

4) The infrastructure change record

Ask for:

the last 6-12 months of major incidents tied to hosting, networking, identity, or platform services
release/change windows for infrastructure
rollback history and the common failure modes

Why it matters:

This is your change clock. A team with high change-failure rates or slow rollback cannot absorb aggressive Day-1 and Day-100 plans, no matter how good the target state looks on paper.
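
If the target can export its change log, the two numbers mentioned above can be computed directly. This is a sketch under assumptions: each change is a dict with a boolean `failed` key (the change caused an incident or needed rollback) and an optional `rollback_minutes` key. Those key names are hypothetical and will differ by ticketing tool.

```python
from statistics import median


def change_failure_rate(changes):
    """Share of infrastructure changes marked as failed, or None if no data."""
    if not changes:
        return None
    failed = sum(1 for change in changes if change["failed"])
    return failed / len(changes)


def median_rollback_minutes(changes):
    """Median rollback duration across changes that recorded one, or None."""
    durations = [
        change["rollback_minutes"]
        for change in changes
        if change.get("rollback_minutes") is not None
    ]
    return median(durations) if durations else None
```

Neither number means much in isolation. Read them against the number of infrastructure changes the Day-1 and Day-100 plans will require.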

5) The control map for identity, logging, and environment boundaries

Ask for:

privileged access model
logging coverage for critical accounts and workloads
segmentation between production and non-production
infrastructure-as-code coverage for critical environments

Why it matters:

These controls determine whether the buyer can connect environments safely, investigate incidents, and make repeatable changes during integration or separation.

Decision triggers that should change price, terms, or timing

Not every infrastructure weakness deserves a red box in the diligence report. Some do. The useful question is which findings force an explicit decision.

Trigger 1: Cloud spend is not allocable enough to support the value case

If management cannot explain where roughly 80-90% of cloud spend goes by workload or environment, treat any near-term cloud synergy or margin-improvement assumption as weak.

What it changes:

remove or delay the savings line from the first-year plan
fund a cost-baseline and tagging cleanup as mandatory one-time cash
stress-test whether IT run-rate is understated
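
The allocability test in this trigger can be run against a billing export in a few lines. The sketch below assumes each line item is a dict with a `cost` and a `tags` dict, and counts an item as allocable if it carries a non-empty workload or environment tag. The tag keys, the function names, and the 0.80 cutoff (the low end of the 80-90% range above) are illustrative assumptions.

```python
def allocable_share(line_items, tag_keys=("workload", "environment")):
    """Fraction of total cost carrying at least one non-empty tag from tag_keys."""
    total = sum(item["cost"] for item in line_items)
    if total == 0:
        return 0.0
    allocated = sum(
        item["cost"]
        for item in line_items
        if any(item.get("tags", {}).get(key) for key in tag_keys)
    )
    return allocated / total


def savings_case_is_weak(line_items, minimum_share=0.80):
    """True when too little spend is allocable to underwrite near-term savings."""
    return allocable_share(line_items) < minimum_share
```

Tag presence is a generous test. A tag that says `workload=misc` passes here and still tells you nothing, so sample the largest tagged items by hand.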

Trigger 2: A critical workload has no tested recovery path

If the systems behind order-to-cash, customer service, or month-end close have not been through a meaningful restore or failover test in the last 12 months, continuity risk is real.

What it changes:

fund recovery testing and remediation before aggressive migration work
do not rely on early consolidation or tenant moves for those workloads
consider protection through deal structure if the business is highly outage-sensitive
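
The 12-month test in this trigger is easy to automate once the last test dates are collected. A minimal sketch, assuming a dict that maps each critical workload to the date of its last restore or failover test (or `None` if never tested); `stale_recovery_tests` is a hypothetical name, and 365 days stands in for "12 months".

```python
from datetime import date


def stale_recovery_tests(last_tested, as_of, max_age_days=365):
    """Return workloads never tested, or last tested more than max_age_days before as_of."""
    stale = []
    for workload, tested_on in last_tested.items():
        if tested_on is None or (as_of - tested_on).days > max_age_days:
            stale.append(workload)
    return sorted(stale)


# Example with made-up dates.
example = {
    "order-to-cash": date(2024, 3, 1),
    "customer-service": date(2025, 1, 15),
    "month-end-close": None,
}
print(stale_recovery_tests(example, as_of=date(2025, 6, 30)))
```

Any workload on the resulting list should stay out of early consolidation or tenant moves until it has a tested recovery path.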

Trigger 3: Environment design blocks integration or separation

If production systems sit in shared tenants, shared accounts, or flat network designs that the buyer cannot govern cleanly, the estate may be stable but still unfit for the deal.

What it changes:

delay connectivity and identity integration
increase one-time cash for tenant/account restructuring
reset TSA exit timing where shared infrastructure is involved

Trigger 4: The savings story depends on optimization capacity the team does not have

If the same small platform team is expected to stabilize operations, support Day 1, separate or integrate environments, and also remove cloud waste, the savings case is on the wrong clock.

What it changes:

separate stabilization work from optimization work
add outside capacity or remove early savings from the model
sequence value capture after control and ownership are in place

Trigger 5: The estate is multi-cloud or hybrid without standard control patterns

Multiple clouds are not a red flag by themselves. But if the target is running a mix of cloud providers, data centers, and managed hosting with different identity models, logging patterns, and deployment methods, complexity rises fast.

What it changes:

expand the integration timeline
avoid assuming rapid tooling consolidation
fund standardization first, then rationalization

What the best teams do before signing

Strong teams do not try to redesign infrastructure during diligence. They force three outputs that help the deal team choose the right posture.

1) A workload heatmap that shows what really sets the clock

The heatmap should rank the top workloads on two axes:

business criticality
change friction

This quickly separates what is safe to touch early from what needs stabilization first.
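
A heatmap like this can be prototyped before anyone opens a slide tool. The sketch below assumes each workload is scored 1 to 5 on both axes. The cutoff of 3 and the four quadrant labels are illustrative assumptions, not a standard taxonomy.

```python
def heatmap_quadrant(criticality, friction, cutoff=3):
    """Place a workload in a quadrant from two 1-5 scores (assumed scale)."""
    high_criticality = criticality >= cutoff
    high_friction = friction >= cutoff
    if high_criticality and high_friction:
        return "stabilize first"
    if high_criticality:
        return "touch early, with care"
    if high_friction:
        return "defer"
    return "safe to touch early"


def rank_workloads(scores):
    """Sort workloads so the most critical, hardest-to-change ones come first.

    scores: dict mapping workload name to (criticality, friction).
    """
    return sorted(
        scores,
        key=lambda name: (scores[name][0] * scores[name][1], scores[name][0]),
        reverse=True,
    )
```

The workloads at the top of the ranking are the ones that set the clock. The integration plan should be sequenced around them, not around whatever is easiest to move.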

2) A cloud cost bridge from reported spend to true run-rate

This bridge should show:

recurring run cost
temporary project or migration cost
credits and offsets that disappear
stranded commitments or support cost

Without that bridge, the deal model is using accounting convenience instead of operating truth.
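
The bridge is simple arithmetic once the components are separated. The sketch below assumes that reported spend is net of credits, that temporary project cost is included in the reported figure, and that support cost is booked outside the cloud invoice. Stranded commitments are left out because their treatment depends on contract terms. All names and figures are illustrative.

```python
def cloud_cost_bridge(reported_spend, temporary_project_cost,
                      expiring_credits, off_invoice_support_cost):
    """Walk from reported monthly cloud spend to an estimated true run-rate.

    Assumptions (label them in the deal pack):
      - reported_spend is net of credits and includes temporary project cost
      - expiring_credits will not recur after close
      - off_invoice_support_cost is booked outside the cloud invoice
    """
    steps = [
        ("reported spend", reported_spend),
        ("less: temporary project or migration cost", -temporary_project_cost),
        ("add: credits and offsets that disappear", expiring_credits),
        ("add: support cost booked elsewhere", off_invoice_support_cost),
    ]
    true_run_rate = sum(amount for _, amount in steps)
    return steps, true_run_rate


# Example with made-up monthly figures.
steps, run_rate = cloud_cost_bridge(100_000, 15_000, 8_000, 12_000)
for label, amount in steps:
    print(f"{label:45s} {amount:>10,.0f}")
print(f"{'estimated true run-rate':45s} {run_rate:>10,.0f}")
```

In this made-up example the true run-rate lands above reported spend even after project cost is removed. That is the pattern to look for when the model assumes savings from the reported number.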

3) A funded Day-100 infrastructure posture

The posture should answer four questions:

what must be stabilized before Day 1 or immediately after close
what can be connected safely in the first 100 days
what must wait until control gaps are closed
what one-time cash is mandatory to get there

That is enough to change the investment committee discussion from “cloud looks fine” to “here is the real clock and the real cash.”

Where teams get trapped after close

Three patterns show up repeatedly.

First, they inherit a cost problem disguised as a tooling problem. The instinct is to launch optimization sprints. The harder truth is that cost visibility, ownership, and environment hygiene were never built. Savings arrive only after governance exists.

Second, they discover that shared infrastructure makes separation or integration slower than application work. The application plan may be ready, but identity, network boundaries, logging, and account structure are not. The result is delay without visible progress.

Third, they ask the platform team to do too much at once. Keep the business stable. Support finance close. Enable new controls. Migrate environments. Reduce spend. Improve resilience. Good teams make a sequence choice. Weak teams call all of it priority one and then miss the quarter.

What to do in the next two weeks (owners included)

If you want infrastructure due diligence to change the outcome, force it into the deal pack now.

Build the top-20 workload inventory (tech DD lead + target platform lead). Include owner, monthly cost, hosting location, and recovery posture.
Create the cloud cost bridge (tech DD lead + finance + cloud operations lead). Separate recurring run cost from credits, project spikes, and stranded commitments.
Pressure-test recoverability (security/infrastructure lead + target operations lead). Review the last DR test and identify which critical services have not been restored or failed over in the last 12 months.
Map the control gaps that affect Day 1 (integration or separation lead + identity/network lead). Focus on privileged access, logging, tenant/account boundaries, and production segregation.
Make one explicit timetable call (deal lead + IC sponsor). Decide which infrastructure-dependent value levers stay in year one, which move right, and what mandatory cash must be funded before close or immediately after.

Infrastructure due diligence is not a tour of the hosting stack. It is a test of whether the platform can carry the deal agenda without breaking the clock, the cost case, or the business.
