So you want to build your own data center

(blog.railway.com)

Comments

motoboi 18 January 2025
In my experience, and based on writeups like this: Google hates having customers.

Someone decided they have to have a public cloud, so they did it, but they want to keep clients away with a 3 meter pole.

My AWS account manager is someone I am 100% certain would roll in the mud with me if necessary. Would sleep on the floor with us if we asked in a crisis.

Our Google Cloud representatives make me sad, because I can see that they are even less loved and supported by Google than we are. It’s sad seeing someone trying to convince their own company to sell and to actually do a good job of providing service. It’s like they are set up to fail.

The Microsoft guys are just bulletproof: they excel at selling, providing a good service, and squeezing all the money out of your pockets while you stay mortally convinced it’s for your own good. They also have a very strange cloud… thing.

As for the Railway company going metal, well, I have some 15 years of experience with it. I’ll never, NEVER, EVER return to it. It’s just not worth it. But I guess you’ll have to discover it for yourselves. This is the way.

You'll soon discover what in the freaking world Google is having so much trouble with. Just make sure you really, really love and really, really want to sell service to people, instead of building borgs and artificial brains, and you'll do 100x better.

toddmorey 18 January 2025
Reminds me of the old Rackspace days! Boy we had some war stories:

   - Some EMC guys came to install a storage device for us to test... and tripped over each other and knocked out an entire rack of servers like a comedy skit. (They uh... didn't win the contract.)
   - Some poor guy driving a truck had a heart attack and the crash took our DFW datacenter offline. (There were bollards to prevent this sort of scenario, but the cement hadn't been poured in them yet.)
   - At one point we temporarily laser-beamed bandwidth across the street to another building
   - There was one day we knocked out windows and purchased box fans because servers were literally catching on fire.
Data center science has... well, improved since the earlier days. We worked with Facebook on the Open Compute Project, which had some very forward-looking infra concepts at the time.
ChuckMcM 18 January 2025
From the post: "...but also despite multi-million dollar annual spend, we get about as much support from them as you would spending $100." -- Ouch! That is a pretty huge problem for Google.

I really enjoyed this post, mostly because we had similar adventures when setting up the infrastructure for Blekko. For Blekko, a company that had a lot of "east-west" network traffic (that is, traffic that goes between racks rather than to/from the Internet at large), having physically colocated services without competing with other servers for bandwidth was both essential and much more cost-effective than paying for this special case at SoftLayer (IBM's captive cloud).

There are some really cool companies that will build an enclosure for your cold aisle; basically it ensures all the air coming out of the floor goes into your servers' intakes and not anywhere else. It also keeps warm air from being entrained from the sides into your servers.

The calculations for HVAC 'CRAC' capacity in a data center are interesting too. In the first colo facility we had a 'ROFOR' (right of first refusal) on expanding into the floor area next to our cage, but when it came time to expand, the facility had no more cooling capacity left, so it was meaningless.

Once you've done this exercise, looking at the 0xide solution will make a lot more sense to you.

chatmasta 18 January 2025
This is how you build a dominant company. Good for you ignoring the whiny conventional wisdom that keeps people stuck in the hyperscalers.

You’re an infrastructure company. You gotta own the metal that you sell or you’re just a middleman for the cloud, and always at risk of being undercut by a competitor on bare metal with $0 egress fees.

Colocation and peering for $0 egress is why Cloudflare has a free tier, and why new entrants could never compete with them by reselling cloud services.

In fact, for hyperscalers, bandwidth price gouging isn’t just a profit center; it’s a moat. It ensures you can’t build the next AWS on AWS, and creates an entirely new (and strategically weaker) market segment of “PaaS” on top of “IaaS.”

jdoss 17 January 2025
This is a pretty decent write up. One thing that comes to mind is why would you write your own internal tooling for managing a rack when Netbox exists? Netbox is fantastic and I wish I had this back in the mid 2000s when I was managing 50+ racks.

https://github.com/netbox-community/netbox
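
For anyone who hasn't used it: a rough sketch of what pulling a rack's contents out of the NetBox REST API looks like. The instance URL, token, and rack ID are placeholders, and the field/filter names are from memory, so check them against your NetBox version:

    import requests

    NETBOX = "https://netbox.example.com"        # placeholder instance
    HEADERS = {"Authorization": "Token YOUR_API_TOKEN"}

    # List every device in a given rack, with its position and role
    resp = requests.get(
        f"{NETBOX}/api/dcim/devices/",
        headers=HEADERS,
        params={"rack_id": 42, "limit": 100},    # example rack ID
    )
    resp.raise_for_status()
    for device in resp.json()["results"]:
        print(device["name"], device["position"], device["device_role"]["name"])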

ch33zer 18 January 2025
I used to work on machine repair automation at a big tech company. IMO repairs are one of the overlooked and harder things to deal with. When you run on AWS you don't really think about broken hardware; it mostly just repairs itself. When you do it yourself you don't have that luxury. You need spare parts, technicians to do repairs, a process for draining/undraining jobs off hosts, testing suites, hardware monitoring tools, and 1001 more things to get this right. At smaller scales you can cut corners on some of these things, but they will eventually bite you. And this is just machines! Networking gear has its own fun set of problems, and when it fails it can take down your whole rack. How much do you trust your colos not to lose power during peak load? I hope you run disaster recovery drills to prep for these situations!

Wishing all the best to this team, seems like fun!

jpleger 18 January 2025
Makes me remember some of the days I had in my career. There were a couple really interesting datacenter things I learned by having to deploy tens of thousands of servers in the 2003-2010 timeframe.

Cable management and standardization were extremely important (like you couldn't get by with shitty practices). At one place where we were deploying hundreds of servers per week, we had a menu of what ops people could choose if the server was different than one of the major clusters. We essentially had 2 chassis options: big-disk servers, which were 2U, or 1U pizza boxes. You could then select 9/36/146GB SCSI drives. Everything was dual-processor with the same processors, and we basically had the bottom of the rack with about 10x 2U boxes and then the rest filled with 20 or more 1U boxes.

If I remember correctly we had gotten such an awesome deal on the price of power because we used facility racks in the cage or something, since I think they threw in the first 2x 30 amp (240V) circuits for free when you used their racks. IIRC we had a 10-year deal on that and there was no metering on them, so we just packed each rack as much as we could. We would put 2x 30s on one side and 2x 20s on the other side. I have to think that the DC was barely breaking even because of how much heat we put out and how much power we consumed. Maybe they were making up for it in connection / peering fees.

I can't remember the details, will have to check with one of my friends that worked there around that time.

maxclark 18 January 2025
There are places where it makes sense to be in the cloud, and places where it doesn't. The two best examples I can give are high-bandwidth or heavy disk-intensive applications.

Take Netflix. While almost everything is in the cloud, the actual delivery of video is via their own hardware. Even at their size I doubt this business would be economically feasible if they were paying someone else for this.

Something I've seen often (some numbers changed because...)

20 PB Egress at $0.02/GB = $400,000/month

20 PB is roughly 67 Gbps 95th Percentile

It's not hard to find 100 Gbps flat rate for $5,000/month

Yes this is overly simplistic, and yes there's a ton more that goes into it than this. But the delta is significant.

For some companies $4,680,000/year doesn't move the needle, for others this could mean survival.
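
If you want to sanity-check those numbers yourself, the back-of-envelope math is roughly this (a sketch: decimal units, a 30-day month, and it glosses over how your provider actually bills 95th percentile, so the annual delta lands in the same ballpark rather than to the dollar):

    # 20 PB/month of egress: cloud-style per-GB pricing vs. a flat-rate port
    egress_bytes = 20e15                       # 20 PB, decimal
    cloud_monthly = egress_bytes / 1e9 * 0.02  # $0.02/GB -> $400,000/month

    avg_gbps = egress_bytes * 8 / (30 * 24 * 3600) / 1e9   # ~62 Gbps average
    # 95th percentile sits somewhat above the average for most traffic
    # patterns, which is where a figure like ~67 Gbps comes from.

    flat_rate_monthly = 5_000                  # 100 Gbps flat-rate commit
    annual_delta = (cloud_monthly - flat_rate_monthly) * 12
    print(round(avg_gbps), cloud_monthly, annual_delta)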

sitkack 17 January 2025
It would be nice to have a lot more detail. The WTF sections are the best part. Sounds like your gear needs a "this side towards enemy" sign and/or the right affordances so it only goes in one way.

Did you standardize on layout at the rack level? What poka-yoke processes did you put in place to prevent mistakes?

What does your metal->boot stack look like?

Having worked for two different cloud providers and built my own internal clouds with PXE booted hosts, I too find this stuff fascinating.

Also, take utmost advantage of a new DC when you are booting it to try out all the failure scenarios you can think of, and the ones you can't, through randomized fault injection.

Bluecobra 18 January 2025
Good writeup! Google really screws you when you are looking for 100G speeds; it's almost insulting. For example, redundant 100G dedicated interconnects are about $35K per month, and that doesn't include VLAN attachments, colo x-connect fees, transit, etc. Not only that, they max out at 50G for VLAN attachments.

To put this cost into perspective, you can buy two brand new 32 port 100G switches from Arista for the same amount of money. In North America, you can get 100G WAN circuits (managed Wavelength) for less than $5K/month. If it's a local metro you can also get dark fiber for less and run whatever speed you want.

random_savv 18 January 2025
I guess there's another step in between buying your own hardware (even when merely "leasing individual racks") and EC2 instances: dedicated bare-metal providers like Hetzner.

This lets you get closer to the metal (e.g. all your data is on your specific disks rather than on abstracted block storage such as EBS, it's not shared with other users, it's cheaper, etc.) without having to worry about the staff that installs the hardware or where/how it fits in a rack.

For us, this was a way to get 6x the performance for 1/6 of the cost. (Excluding, of course, our time, but we enjoyed it!)

winash83 18 January 2025
We went down this path over the last year. Lots of our devs need local and dev/test environments, and AWS was costing us a bomb. With about 7 bare metals (colocation) we are running 200+ VMs and could double that number with some capacity to spare. For management, we built a simple wrapper over libvirt. I am setting up another rack in the US, and it will end up costing around $75K per year for a similar capacity.

Our prod is on AWS, but we plan to move everything else, and it's expected to save at least a quarter of a million dollars per year.
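
For what it's worth, the wrapper doesn't need to be much more than thin calls into python-libvirt. A rough sketch of the kind of thing it does (not our actual tooling, and the domain XML file is a placeholder):

    import libvirt

    conn = libvirt.open("qemu:///system")

    # Inventory: list every VM on this host, whether it's running, and vCPUs
    for dom in conn.listAllDomains():
        state, _maxmem, _mem, vcpus, _cputime = dom.info()
        print(dom.name(), "running" if dom.isActive() else "stopped", vcpus)

    # Creating a VM is just: define the domain XML, then start it
    with open("dev-vm.xml") as f:             # placeholder domain definition
        dom = conn.defineXML(f.read())
    dom.create()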

dban 17 January 2025
This is our first post about building out data centers. If you have any questions, we're happy to answer them here :)
blmt 18 January 2025
I am really thankful for this article, as I finally get where my coworkers get "wrong" notions about three-phase power use in DCs:

>The calculations aren’t as simple as summing watts though, especially with 3-phase feeds — Cloudflare has a great blogpost covering this topic.

What's written in the Cloudflare blogpost linked in the article holds true only if you can use a Delta config (as done in the US to obtain 208V), as opposed to the Wye config used in Europe. The latter does not give a substantial advantage: there is no sqrt(3) boost to power distribution efficiency, and you end up just adding up the Watts of three independent single-phase circuits (cf. https://en.m.wikipedia.org/wiki/Three-phase_electric_power).
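
Rough numbers to make the point concrete (a sketch that ignores power factor and the usual 80% continuous-load derating):

    import math

    # US-style: loads connected line-to-line at 208V on a 3-phase 30A feed
    us_3phase_kw = math.sqrt(3) * 208 * 30 / 1000   # ~10.8 kW
    # ...vs a single-phase 208V/30A circuit -> the sqrt(3) "boost"
    us_1phase_kw = 208 * 30 / 1000                  # ~6.2 kW

    # EU-style 400/230V Wye, loads wired line-to-neutral at 230V, 16A/phase:
    # the total is just three single-phase circuits added together
    eu_3phase_kw = 3 * 230 * 16 / 1000              # ~11.0 kW

    print(us_3phase_kw, us_1phase_kw, eu_3phase_kw)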

linsomniac 18 January 2025
Was really hoping this was actually about building your own data center. Our town doesn't have a data center; we need to go an hour south or an hour north. The building that a past failed data center was in (which doesn't bode well for a data center in town, eh?) is up for lease, and I'm tempted.

But I'd need to start off small: probably per-cabinet UPSes and transfer switches, and smaller generators. I've built up cabinets and cages before, but never built up the exterior infrastructure.

Agingcoder 19 January 2025
They’re not building their own data center - they’re doing what lots of companies have been doing for years (including where I work, and I specialize in HPC, so this is all fairly standard), which is buying space and power in a DC and installing boxes in there. Yes, it’s possible to get it wrong. It is, however, not the same as building a DC…
nyrikki 18 January 2025
> This will likely involve installing some overhead infrastructure and trays that let you route fiber cables from the edge of your cage to each of your racks, and to route cables between racks

Perhaps I am reading this wrong, as you appear to be fiber-heavy and do have space on the ladder rack for copper, but if you are commingling the two, be careful. A possible future iteration would consider a smaller Panduit FiberRunner setup + a wire rack.

Commingling copper and fiber, especially through the large spill-overs, works until it doesn't.

Depending on how adaptive you need to be with technology changes, you may run into this in a few years.

4x6 encourages a lot of people to put extra cable up in those runners, and sharing a spout with Cat6, cx-#, PDU serial, etc. will almost always end badly for some chunk of fiber. After those outages it also encourages people to 'upgrade in place'. When you are walking to your cage, look at the older cages: notice the loops sticking out of the tops of the trays, and the switches that look like porcupines because someone caused an outage and the old cables were left in place.

Congrats on your new cage.

renewiltord 17 January 2025
More to learn from the failures than the blog, haha. It tells you what the risks are with a colocation facility. There really isn't any text on how to do this stuff. The last time I wanted to build out a rack, there weren't even any instructions on how to do cable management well. It's sort of learned by apprenticeship and practice.
ksec 18 January 2025
I am just fascinated by the need for datacenters. The scale is beyond comprehension. 10 years ago, before the word "hyperscaler" was even invented or popularised, I would have thought the DC market would be on the decline or have levelled off by around this time. One reason being the hyperscalers - AWS, Google, Microsoft, Meta, Apple, Tencent, Alibaba, down to smaller ones like Oracle and IBM. They would all have their own DCs, taking on much of the compute for themselves and others, while leftover space would be occupied by third parties. Another reason being that compute, memory and storage density continue to increase, which means that for the same amount of floor space we are offering 5-20x the previous CPU / RAM / storage.

Turns out we are building like mad and we are still not building enough.

dylan604 18 January 2025
My first colo box came courtesy of a friend of a friend that worked for one of the companies that did that (leaving out names to protect the innocent). It was a true frankenputer built out of whatever spare parts he had lying around. He let me come visit it, and it was an art project as much as a webserver. The mainboard was hung on the wall with some zip ties, the PSU was on the desk, and the hard drive was suspended as well. Eventually, the system was upgraded to newer hardware, put in an actual case, and then racked with an upgraded 100Base-T connection. We were screaming in 1999.
pixelesque 18 January 2025
The date and time durations given seem a bit confusing to me...

"we kicked off a Railway Metal project last year. Nine months later we were live with the first site in California".

seems inconsistent with:

"From kicking off the Railway Metal project in October last-year, it took us five long months to get the first servers plugged in"

The article was posted today (Jan 2025); was it maybe originally written last year, with the project having been going on for more than a year, meaning the Railway Metal project actually started in 2023?

scarab92 18 January 2025
Interesting that they call out the extortionate egress fees from the majors as a motivation, but are nevertheless also charging customers $0.10 per GB themselves.
esher 19 January 2025
I can relate.

We provide a small PaaS-like hosting service, kinda similar to Railway (but more niche). We recently re-evaluated our choice of AWS (because of $$$) as infra provider, but will now stick with it [1].

We started with colocation 20 years ago. For a tiny provider it was quite a hassle (but also an experience). We just had too many single points of failure, and we found ourselves dealing with physical servers way too often. We also struggled to phase out and replace hardware.

Without reading all the comments thoroughly: for me, being on infra that runs on green energy is important. I think it's also a trend with customers; there are even services for this [2]. I don't see it mentioned here.

[1] https://blog.fortrabbit.com/infra-research-2024 [2] https://www.thegreenwebfoundation.org/

j-b 17 January 2025
Love these kinds of posts. Tried railway for the first time a few days ago. It was a delightful experience. Great work!
hintymad 18 January 2025
Per my experience with cloud, the most powerful Infra abstraction that AWS offers is actually EC2. The simplicity of getting a cluster of machines up and running with all the metadata readily available via APIs is just liberating. And it just works: the network is easy to configure, the ASGs are flexible enough to customize, and the autoscaling offers strong primitives for advanced scaling.

Amazingly, few companies who run their own DCs could build anything comparable to EC2, even at a smaller scale. When I worked at those companies, I sorely missed EC2. I was wondering if there are any robust enough open-source alternatives to EC2's control-plane software that can manage bare metal and offer VMs on top of it. That would be awesome for companies that build their own DCs.
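
To make the "it just works" point concrete, getting a small cluster up and reading its metadata back is roughly this much boto3 (a sketch; the AMI/subnet IDs and region are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch three identical workers (IDs below are placeholders)
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m5.xlarge",
        MinCount=3,
        MaxCount=3,
        SubnetId="subnet-0123456789abcdef0",
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "worker"}],
        }],
    )
    ids = [i["InstanceId"] for i in resp["Instances"]]

    # Wait for them, then read back the metadata via the same API
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
    for r in ec2.describe_instances(InstanceIds=ids)["Reservations"]:
        for inst in r["Instances"]:
            print(inst["InstanceId"], inst["PrivateIpAddress"],
                  inst["Placement"]["AvailabilityZone"])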

matt-p 18 January 2025
If you’re using 7280-SR3 switches, they’re certainly a fine choice. However, have you considered the 7280-CR3(K) range? They're much better $/Gbps and have more relevant edge interfaces.

At this scale, why did you opt for a spine-and-leaf design with 25G switches and a dedicated 32×100G spine? Did you explore just collapsing it and using 1-2 32×100G switches per rack, then employing 100G>4×25G AOC breakout cables and direct 100G links for inter-switch connections and storage servers?

Have you also thought about creating a record on PeeringDB? https://www.peeringdb.com/net/400940

By the way, I’m not convinced I’d recommend a UniFi Pro for anything, even for out-of-band management.

coolkil 17 January 2025
Awesome!! Hope to see more companies go this route. I had the pleasure of doing something similar for a company (a lot smaller scale though).

It was my first job out of university. I will never forget the awesome experience of walking into the datacenter and starting to plug in cables and stuff.

ThinkBeat 18 January 2025
1. I get the impression they decided to use a non-datacenter location to put their datacenter in. If so, that is not a good idea.

2. Geographically distanced backups, in case the primary fails. Without this you are already in trouble. What happens if the building burns down?

3. Hooking up with "local" ISPs: that seems OK, as long as an ISP failing is easily and automatically dealt with.

4. I am a bit confused about what happens at the edge. On the one hand it seems like you have 1 datacenter and ISPs doing routing; in other places I get the impression you have compute close to the edge. Which is it?

sometalk 17 January 2025
I remember talking to Jake a couple of years ago when they were looking for someone with a storage background. Cool dude, and cool set of people. Really chuffed to see them doing what they believe in.
cyberax 18 January 2025
It looked interesting, until I got to the egress cost. Ouch. $100 per TB is way too much if you're using bandwidth-intensive apps.

Meta-comment: it's getting really hard to find hosting services that provide true unlimited bandwidth. I want to do video upload/download in our app, and I'm struggling to find providers of managed servers that would be willing to give me a fixed price for 10/100 Gbps ports.

solarkraft 18 January 2025
Cool post and cool to see Railway talked about more on here.

I've used their Postgres offering for a small project (crucially, it was accessible from the outside) and not only was setting it up a breeze, the cost was also minimal (I believe it stayed within the free tier). I haven't used the rest of the platform, but my interaction with them would suggest it would probably be pretty nice.

physhster 18 January 2025
Having done data center builds for years, mostly on the network side but realistically with all the trades, this is a really cool article.
a1o 18 January 2025
Excellent write-up! This is not the first blog post I've seen recently about going in the direction of owning infrastructure, but it is certainly well written, and I liked the use of Excel in it - a good use, although visually daunting!
yread 18 January 2025
Useful article. I was almost planning to rent a rack somewhere, but it seems there's just too much work and too many things that can go wrong, and it's better to rent cheap dedicated servers and make it somebody else's problem.
__fst__ 17 January 2025
Can anyone recommend some engineering reading for building and running DC infrastructure?
aetherspawn 17 January 2025
What brand of servers was used?
robertclaus 17 January 2025
I would be super interested to know how this stuff scales physically - how much hardware ended up in that cage (maybe in Cloud-equivalent terms), and how much does it cost to run now that it's set up?
whalesalad 18 January 2025
Cliffhanger! Was mostly excited about the networking/hypervisor setup. Curious to see the next post about the software-defined networking. Had not heard of FRR or SONiC previously.
teleforce 18 January 2025
>despite multi-million dollar annual spend, we get about as much support from them as you would spending $100

Is it a good or a bad thing to have the same customer support across the board?

kolanos 18 January 2025
As someone who lost his shirt building a data center in the early 2000s, Railway is absolutely going about this the right way with colocation.
throwaway2037 18 January 2025
I promise my comment is not intended to troll. Why didn't you use Oxide pre-built racks? Just the power efficiency seems like a huge win.
nextworddev 17 January 2025
First time checking out the Railway product - it seems like a "low-code" and visual way to define and operate infrastructure?

Like, if Terraform had a nice UI?

ramon156 17 January 2025
Weird to think my final internship was running on one of these things. Thanks for all the free minutes! It was a nice experience.
lifeinthevoid 18 January 2025
Man, I get an anxiety attack just thinking about making this stuff work. Kudos to all the people doing this.
praveen9920 18 January 2025
Reliability stats aside, would have loved to see cost differences between on-prem and cloud.
Over2Chars 18 January 2025
I guess we can always try to re-hire all those "Sys Admins" we thought we could live without.

LOL?

Melatonic 18 January 2025
We're back to the cycle of Mainframe/Terminal --> Personal Computer
superq 18 January 2025
"So you want to build OUT your own data center" is a better title.
enahs-sf 18 January 2025
Curious why California, when the kWh price is so high here vs Oregon or Washington.
Havoc 18 January 2025
Surprised to see PXE. Didn't realise that was in common use in racks.
concerndc1tizen 18 January 2025
@railway

What would you say are your biggest threats?

Power seems to be the big one, especially when AI and electric-vehicle demand drive up kWh prices.

Networking seems like another one. I'm out of the loop, but it seems to me like the internet is still stuck at 2010 network-capacity concepts like "10Gb". If networking had progressed the way compute has (e.g. NVMe disks can provide 25GB/s), wouldn't 100Gb be the default server interface, and wouldn't ISP uplinks be measured in terabits?

How is the diversity in datacenter providers? In my area, several datacenters were acquired, and my instinct would be that the "move to cloud" has cost smaller providers a lot of customers, and that industry consolidation has given suppliers more power over both the offering and the pricing. Is it a free market with plenty of competitive pricing, or is it edging towards enshittification?

exabrial 17 January 2025
I'm surprised you guys are building new!

Tons of colocation is available nearly everywhere in the US, and in the KCMO area there are even a few dark datacenters available for sale!

Cool project nonetheless. Bit jealous actually :P

mirshko 17 January 2025
y’all really need to open-source that rack modeling tool - that would save sooooo many people so much time.
technick 18 January 2025
I've spent more time than I care to admit working in data centers and can tell you that your job req is asking for one person to perform 3 different roles, maybe 4. I guarantee you're going to find a "jack of all trades" and a master of none unless you break them out into these jobs:

Application Developer

DevOps Engineer

Site Reliability Engineer

Storage Engineer

Good luck, hope you pay them well.

jonatron 17 January 2025
Why would you call colocation "building your own data center"? You could call it "colocation" or "renting space in a data center". What are you building? You're racking. Can you say what you mean?