
Art Deco Street
How to decide, de‑risk, and deliver your own warehouse‑scale capability
TL;DR
Treat the whole site as one large computer —a warehouse‑scale computer (WSC)— run with heavy automation and clear reliability targets. If your top priorities are data control, provable security, and predictable costs, owning a WSC is usually the right call. Start with a small pilot pod, prove it with evidence, then scale.
Who this is for
CIOs, CTOs, CISOs, finance leaders, and heads of operations who want a plain‑English view of when to own a datacenter and how to do it without surprises.
The big idea (in simple terms)
Run your modern datacenter as one integrated system: compute, storage, network, power, cooling, security, and operations working together with automation. That approach (the WSC model) is how you get:
- Tighter control of sensitive data (including where it lives and which keys protect it).
- Security you can prove, not just promise (attestation, key custody, tamper‑evident evidence).
- Cost stability at scale by steering the levers that matter: energy, capacity, utilization, your Total Cost of Ownership (TCO) becomes predictable.
Glossary
- WSC (Warehouse‑Scale Computer): operating the entire site as one big computer.
- TCO (Total Cost of Ownership): all‑in cost to build, run, and refresh the platform.
- SLO (Service Level Objective): a reliability target users should feel.
- P95/P99/P999: the 95th/99th/99.9th percentile of response time, what slower users experience.
- PUE (Power Usage Effectiveness): building‑level power efficiency (lower is better).
- UPS (Uninterruptible Power Supply): backup power that rides through short outages.
- HSM (Hardware Security Module): a locked‑down safe for encryption keys.
- Pilot Pod: production‑like slice of future datacenter.
- Stranded power" refers to energy that is generated but goes unused because it cannot be transmitted to where it is needed.
A simple decision guide: Own, Colocate, or Hybrid?
(Assumes disciplined automation, meeting utilization targets, and a solid operating model.) Legend:
- Strong = best fit
- Moderate = workable with trade‑offs
- Limited = generally not the best route
| Decision Factor | Own (Your WSC) | Colocation (Provider Facility) | Hybrid (Own Core + Colo Tactically) |
|---|---|---|---|
| Data control & key custody | Strong – your policies, your keys | Moderate – depends on provider/model | Strong – keep “crown jewels” in core; use colo for edge/burst |
| Security you can prove | Strong – end‑to‑end evidence is yours | Moderate – shared controls; varied evidence | Strong – prove the core; rely on provider attestations for the rest |
| Speed to capacity | Limited/Moderate – build/gear lead times | Strong – fastest path to power & space | Moderate – quick in colo while core ramps |
| Cost predictability (3–5 yrs) | Strong – when utilization and oversubscription are governed | Moderate – opex predictable; margin applies | Strong – stable core, flexible edge for bursts |
| Availability & fault domains | Strong – you design zones/regions | Moderate – depends on topology/SLA | Strong – own regions + metro diversity via colo |
| Utilization leverage | Strong – you set the rules & scheduling | Moderate – provider constraints | Strong – optimize core; flex elsewhere |
| Energy & sustainability oversight | Strong – set and enforce targets (PUE, rack kW) | Moderate – provider reporting | Strong – own targets at core; measure at edge |
| Talent & ops load | Moderate/Limited – needs mature ops/SRE | Strong – facility ops offloaded | Moderate – SRE for core; outsource facility work for non‑critical |
Rule of thumb: If your top three priorities are data control, verifiable security, and predictable cost (and you will operate efficiently) owning your WSC is usually right. Use colocation where speed or geography matters (edge presence, quick capacity, metro diversity).
Five moves that change outcomes
1) Place work by business need, not by vendor
Put systems into two buckets:
- Latency‑sensitive (user‑facing, time‑critical): keep in your most reliable zones with strict safeguards.
- Throughput/batch (analytics, AI training, back‑office): run where you can safely push higher utilization with scheduling and power caps.
Write the placement rules. Keep exceptions rare and time‑bound.
2) Design energy and density first, it drives both cost and uptime
Decide early on power distribution, UPS strategy, and cooling, especially for high‑power AI racks. Set per‑rack power limits and enforce them. If you plan oversubscription (allowing more IT power than the absolute max on the floor), pair it with power capping and smart job scheduling so big jobs don’t trip breakers.
Helpful metrics
- PUE for building efficiency.
- IT‑side efficiency (how much power becomes useful work vs. heat).
- kW per rack and thermal headroom by row.
3) Run reliability like a process, not a hope
Define SLOs per user‑visible path. Don’t just track averages; watch *P95/P99/P999. Ship changes via canaries (small, safe rollout) and auto‑rollback if metrics degrade. Keep a drill book (power hiccup, cooling fault, network partition, brownout (controlled degrade)) and practice quarterly.
4) Make security provable
- Verified boot & attestation at boot and on schedule (prove each machine runs approved software).
- HSM for key custody with defined key ceremonies.
- Tamper‑evident logging so incident reconstruction stands up to scrutiny.
- Data‑integrity sweeps (periodic “health scans” to detect and fix silent corruption).
5) Pilot like you’re betting the company (because you are)
Build a small, production‑like pod first. Do not scale until you have evidence:
- Attestation and key procedures pass.
- SLOs hold under realistic load and failure drills.
- AI training spikes don’t overwhelm power/cooling (capping works).
- Actuals track the TCO model.
What to ask, so you don’t pay twice later
Power & Cooling
- “Walk me through the single‑line electrical diagram and the thermal model. Where are the weak spots?”
- “What’s our plan for high‑power AI racks (cooling, maintenance, safety)? What’s the per‑rack power limit, and how is it enforced?”
- “What oversubscription and power‑capping rules will we use, and what happens when several big jobs start at once?”
Reliability & Operations
- “What are our SLOs, and what’s our policy when we burn error budget (use more downtime than planned)?”
- “Which changes ship behind canaries? When do we auto‑rollback?”
- “Which failure drills are mandatory before go‑live and every quarter after?”
Security & Evidence
- “How much of the fleet is covered by verified boot and periodic attestation at go‑live? How do we detect drift?”
- “Where do our keys live, who runs key ceremonies, and how do we audit custody?”
- “Can we fully replay an incident with tamper‑evident logs?”
Economics & Capacity
- “What utilization are we targeting in year 1 and year 2, and how do we avoid stranded power?”
- “What drives TCO variance (utilization, energy, schedule, hardware refresh), and what thresholds force a re‑plan?”
- “How do we phase construction to match demand, and when do we use colocation as a relief valve?”
Your Starter plan
Guardrails & visibility
- Ratify non‑negotiables (data residency, SLO tiers, PUE targets, rack kW limits).
- Stand up program bodies (architecture, security, operations).
- Baseline telemetry: P95/P99, power, thermal, utilization.
Pilot pod
- Build the first racks with production‑like power/cooling/network.
- Enable verified boot, attestation, and tamper‑evident logs.
- Run the failure‑drill rehearsal and measure tail latency.
Prove and decide
- Validate AI step‑load behavior under power capping.
- Close gaps found in drills; publish the go/no‑go brief to scale the next envelope.
The monthly dashboard you should see
- Data & keys: residency compliance; key‑custody health; attestation coverage.
- Reliability: SLO results; P95/P99/P999; MTTR (Mean Time to Repair); drill pass rate.
- Energy & density: PUE, IT‑side efficiency, average kW/rack, thermal headroom.
- Capacity & utilization: server/accelerator utilization; stranded power; network hot‑spots.
- Integrity & security: integrity sweeps on time; incident‑replay tests passed.
- Economics: actuals vs TCO plan; top variance drivers.
Common traps (and the simple antidotes)
- Designing for averages: watch tails (P95/P99/P999) and design for fan‑out.
- Under‑engineering power/cooling: fix envelopes first; costs follow.
- Security by promise:: require attestation and tamper‑evident evidence.
- Pilot as a demo: it’s an acceptance test, not a showcase.
Bottom line
Owning your datacenter works when you run it like a single, well‑managed product: automate, keep it busy with the right work in the right place, measure what users actually feel, prove security with evidence, and rehearse failures until they’re boring. Do a serious pilot, insist on data, then scale. That’s how you get data control, provable security, and predictable costs, on your terms.