So you want to build your own data center

(blog.railway.com)

Comments

jonatron 6 hours ago
Why would you call colocation "building your own data center"? You could call it "colocation" or "renting space in a data center". What are you building? You're racking. Can you say what you mean?
toddmorey 4 hours ago
Reminds me of the old Rackspace days! Boy we had some war stories:

   - Some EMC guys came to install a storage device for us to test... and tripped over each other and knocked out an entire rack of servers like a comedy skit. (They uh... didn't win the contract.)
   - Some poor guy driving a truck had a heart attack and the crash took our DFW datacenter offline. (There were bollards to prevent this sort of scenario, but the cement hadn't been poured in them yet.)
   - At one point we temporarily laser-beamed bandwidth across the street to another building
   - There was one day we knocked out windows and purchased box fans because servers were literally catching on fire.
Data center science has... well, improved since those early days. We worked with Facebook on the Open Compute Project, which had some very forward-looking infra concepts at the time.
ChuckMcM 1 hour ago
From the post: "...but also despite multi-million dollar annual spend, we get about as much support from them as you would spending $100." -- Ouch! That is a pretty huge problem for Google.

I really enjoyed this post, mostly because we had similar adventures when setting up the infrastructure for Blekko. Blekko had a lot of "east-west" network traffic (that is, traffic that goes between racks rather than to/from the Internet at large), so having physically colocated services that weren't competing with other servers for bandwidth was both essential and much more cost-effective than paying for this special case at SoftLayer (IBM's captive cloud).

There are some really cool companies that will build an enclosure for your cold aisle; basically it ensures all the air coming out of the floor goes into the intakes of your servers and not anywhere else. It also keeps warmer air from being entrained into your servers from the sides.

The calculations for HVAC 'CRAC' capacity in a data center are interesting too. In the first colo facility we had a 'ROFOR' (right of first refusal) on expanding into the floor area next to our cage, but when it came time to expand, the facility had no cooling capacity left, so the option was meaningless.
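
As a rough illustration of the kind of math involved (a back-of-the-envelope sketch; the rack count and per-rack load below are made up, not from the post):

    # Back-of-the-envelope CRAC sizing (illustrative numbers only)
    racks = 10
    kw_per_rack = 8.0                      # assumed average IT load per rack
    it_load_kw = racks * kw_per_rack       # heat you must remove, ~80 kW
    btu_per_hr = it_load_kw * 3412.14      # 1 kW ~= 3,412 BTU/hr
    cooling_tons = btu_per_hr / 12000      # 1 ton of cooling = 12,000 BTU/hr
    print(f"{it_load_kw:.0f} kW IT load needs ~{cooling_tons:.1f} tons of cooling")

If the facility's remaining CRAC capacity can't cover that number, the floor space next to your cage isn't worth much, which is exactly the trap with that ROFOR.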

Once you've done this exercise, looking at the Oxide solution will make a lot more sense to you.

motoboi 33 minutes ago
In my experience, and based on writeups like this: Google hates having customers.

Someone decided they have to have a public cloud, so they did it, but they want to keep clients away with a 3 meter pole.

My AWS account manager is someone I am 100% certain would roll in the mud with me if necessary. They would sleep on the floor with us in a crisis if we asked.

Our Google Cloud representatives make me sad, because I can see that they are even less loved and supported by Google than we are. It's sad watching someone trying to convince their own company to sell and actually do a good job providing service. It's like they are set up to fail.

The Microsoft guys are just bulletproof: they excel at selling, providing good service, and squeezing all the money out of your pockets while leaving you utterly convinced it's for your own good. They also have a very strange cloud… thing.

As for Railway going metal: well, I have some 15 years of experience with running on metal, and I'll never, NEVER, EVER return to it. It's just not worth it. But I guess you'll have to discover that for yourselves. This is the way.

You soon discover what in the freaking world Google is having so much trouble with. Just make sure you really, really love and really, really want to sell a service to people, instead of building Borgs and artificial brains, and you'll do 100x better.

jdoss 5 hours ago
This is a pretty decent write-up. One thing that comes to mind is: why would you write your own internal tooling for managing a rack when NetBox exists? NetBox is fantastic, and I wish I had it back in the mid-2000s when I was managing 50+ racks.

https://github.com/netbox-community/netbox
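
For anyone who hasn't used it, here is a minimal sketch of registering a box through NetBox's API with the pynetbox client. The URL, token, and site/rack/device-type/role names are hypothetical and must already exist in NetBox; older NetBox versions call the role field device_role:

    # pip install pynetbox -- illustrative sketch, not Railway's tooling
    import pynetbox

    nb = pynetbox.api("https://netbox.example.com", token="YOUR_API_TOKEN")

    # Look up pre-existing objects (all names here are hypothetical)
    site = nb.dcim.sites.get(slug="ca-colo-01")
    rack = nb.dcim.racks.get(name="rack01", site_id=site.id)
    dtype = nb.dcim.device_types.get(model="PowerEdge R7615")
    role = nb.dcim.device_roles.get(name="bare-metal-host")

    # Register a new server at U17, front-facing
    device = nb.dcim.devices.create(
        name="metal-ca-rack01-u17",
        device_type=dtype.id,
        role=role.id,          # "device_role" on older NetBox versions
        site=site.id,
        rack=rack.id,
        position=17,
        face="front",
    )
    print(device.id, device.name)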

winash83 57 minutes ago
We went down this path over the last year. Lots of our devs need local and dev/test environments, and AWS was costing us a bomb. With about 7 bare-metal servers (colocation) we are running 200+ VMs, and we could double that number with some capacity to spare. For management, we built a simple wrapper over libvirt. I am setting up another rack in the US, and it will end up costing around $75K per year for similar capacity.
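
Not the commenter's actual wrapper, of course, but here is a minimal sketch of the kind of calls such a libvirt wrapper boils down to, using the libvirt-python bindings (domain XML trimmed to the essentials; paths and names are hypothetical):

    # pip install libvirt-python -- illustrative sketch only
    import libvirt

    DOMAIN_XML = """
    <domain type='kvm'>
      <name>dev-vm-01</name>
      <memory unit='GiB'>4</memory>
      <vcpu>2</vcpu>
      <os><type arch='x86_64'>hvm</type></os>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/var/lib/libvirt/images/dev-vm-01.qcow2'/>
          <target dev='vda' bus='virtio'/>
        </disk>
        <interface type='bridge'><source bridge='br0'/></interface>
      </devices>
    </domain>
    """

    conn = libvirt.open("qemu:///system")   # local KVM hypervisor
    dom = conn.defineXML(DOMAIN_XML)        # make the VM persistent
    dom.create()                            # start it
    for d in conn.listAllDomains():
        print(d.name(), "running" if d.isActive() else "stopped")
    conn.close()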

Our prod is on AWS, but we plan to move everything else; it's expected to save at least a quarter of a million dollars per year.

chatmasta 4 hours ago
This is how you build a dominant company. Good for you ignoring the whiny conventional wisdom that keeps people stuck in the hyperscalers.

You’re an infrastructure company. You gotta own the metal that you sell or you’re just a middleman for the cloud, and always at risk of being undercut by a competitor on bare metal with $0 egress fees.

Colocation and peering for $0 egress is why Cloudflare has a free tier, and why new entrants could never compete with them by reselling cloud services.

In fact, for hyperscalers, bandwidth price gouging isn’t just a profit center; it’s a moat. It ensures you can’t build the next AWS on AWS, and creates an entirely new (and strategically weaker) market segment of “PaaS” on top of “IaaS.”

sitkack 6 hours ago
It would be nice to have a lot more detail. The WTF sections are the best part. Sounds like your gear needs a "this side toward enemy" sign and/or the right affordances so it only goes in one way.

Did you standardize on layout at the rack level? What poka-yoke processes did you put into place to prevent mistakes?

What does your metal->boot stack look like?

Having worked for two different cloud providers and built my own internal clouds with PXE booted hosts, I too find this stuff fascinating.

Also, take utmost advantage of a new DC when you are bringing it up to try out all the failure scenarios you can think of, and the ones you can't, through randomized fault injection.

nyrikki 1 hour ago
> This will likely involve installing some overhead infrastructure and trays that let you route fiber cables from the edge of your cage to each of your racks, and to route cables between racks

Perhaps I am reading this wrong, as you appear to be fiber-heavy and do have space on the ladder rack for copper, but if you are commingling the two, be careful. A possible future iteration would be a smaller Panduit FiberRunner setup plus a wire rack.

Commingling copper and fiber, especially through the large spill-overs, works until it doesn't.

Depending on how adaptive you need to be with technology changes, you may run into this in a few years.

The 4x6 encourages people to put a lot of extra cable up in those runners, and sharing a spout with Cat6, cx-#, PDU serial cables, etc. will almost always end badly for some chunk of fiber. After those outages, it also encourages people to 'upgrade in place'. When you are walking to your cage, look at the older cages: notice the loops sticking out of the tops of the trays, and the switches that look like porcupines, because someone caused an outage and the old cables were left in place.

Congrats on your new cage.

ch33zer 1 hour ago
I used to work on machine repair automation at a big tech company. IMO repairs are one of the overlooked and harder things to deal with. When you run on AWS you don't really think about broken hardware; it mostly just repairs itself. When you do it yourself you don't have that luxury. You need spare parts, technicians to do repairs, a process for draining/undraining jobs off hosts, testing suites, hardware monitoring tools, and 1001 more things to get this right. At smaller scales you can cut corners on some of these things, but they will eventually bite you. And this is just machines! Networking gear has its own fun set of problems, and when it fails it can take down your whole rack. How much do you trust your colos not to lose power during peak load? I hope you run disaster recovery drills to prep for these situations!
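
As a toy illustration of just the drain/undrain piece, with hypothetical scheduler/diagnostics/ticketing objects standing in for whatever actually places jobs on hosts:

    # Hypothetical sketch of a repair loop; none of these objects are real APIs
    import time

    def repair_host(scheduler, host, diagnostics, ticketing):
        scheduler.cordon(host)              # stop new jobs landing on the host
        scheduler.drain(host)               # migrate or evict the running jobs
        while scheduler.jobs_on(host):      # wait for the drain to complete
            time.sleep(30)

        report = diagnostics.run(host)      # burn-in / hardware checks
        if not report.healthy:
            ticketing.open(host, report)    # a human swaps the failed part
            return                          # host stays out of rotation

        scheduler.uncordon(host)            # put it back into service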

Wishing all the best to this team, seems like fun!

dban 8 hours ago
This is our first post about building out data centers. If you have any questions, we're happy to answer them here :)
jpleger 2 hours ago
Makes me remember some of the days I had in my career. There were a couple really interesting datacenter things I learned by having to deploy tens of thousands of servers in the 2003-2010 timeframe.

Cable management and standardization were extremely important (like, you couldn't get by with shitty practices). At one place where we were deploying hundreds of servers per week, we had a menu of what ops people could choose if the server was different from one of the major clusters. We essentially had 2 chassis options: big-disk servers, which were 2U, or 1U pizza boxes. You could then select 9/36/146 GB SCSI drives. Everything was dual-processor with the same processors, and we basically had the bottom of the rack filled with about 10x 2U boxes and the rest filled with 20 or more 1U boxes.

If I remember correctly, we had gotten such an awesome deal on the price of power because we used the facility's racks in the cage or something; I think they threw in the first 2x 30 amp (240V) circuits for free when you used their racks. IIRC we had a 10-year deal on that and there was no metering on them, so we just packed each rack as much as we could. We would put 2x 30A on one side and 2x 20A on the other side. I have to think the DC was barely breaking even because of how much heat we put out and how much power we consumed. Maybe they were making up for it in connection / peering fees.
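
For context on what those circuits actually buy you (using the standard 80% continuous-load derating, and assuming the 20A circuits were also 240V):

    # Usable power per rack from the circuits described above
    def usable_kw(amps, volts, derate=0.8):   # 80% continuous-load rule
        return amps * volts * derate / 1000

    side_a = 2 * usable_kw(30, 240)   # 2x 30A/240V ~= 11.5 kW
    side_b = 2 * usable_kw(20, 240)   # 2x 20A/240V ~= 7.7 kW
    print(f"A side: {side_a:.1f} kW, B side: {side_b:.1f} kW per rack")

Close to 19 kW of unmetered power per rack goes a long way toward explaining why the facility might have been barely breaking even.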

I can't remember the details, will have to check with one of my friends that worked there around that time.

linsomniac 4 hours ago
I was really hoping this was actually about building your own data center. Our town doesn't have a data center; we need to go an hour south or an hour north. The building that a past, failed data center was in (which doesn't bode well for a data center in town, eh?) is up for lease, and I'm tempted.

But I'd need to start off small: probably per-cabinet UPSes and transfer switches, and smaller generators. I've built up cabinets and cages before, but never built out the exterior infrastructure.

pixelesque 4 hours ago
The date and time durations given seem a bit confusing to me...

"we kicked off a Railway Metal project last year. Nine months later we were live with the first site in California".

seems inconsistent with:

"From kicking off the Railway Metal project in October last-year, it took us five long months to get the first servers plugged in"

The article was posted today (Jan 2025). Was it maybe originally written last year, and has the project been going on for more than a year, meaning the Railway Metal project actually started in 2023?

dylan604 3 hours ago
My first colo box came courtesy of a friend of a friend who worked for one of the companies that did that (leaving out names to protect the innocent). It was a true frankenputer built out of whatever spare parts he had lying around. He let me come visit it, and it was an art project as much as a webserver: the mainboard was hung on the wall with some zip ties, the PSU was sitting on the desktop, and the hard drive was suspended as well. Eventually the system was upgraded to newer hardware, put in an actual case, and then racked with an upgraded 100BASE-T connection. We were screaming in 1999.
physhster 2 hours ago
Having done data center builds for years, mostly on the network side but realistically with all the trades, this is a really cool article.
matt-p 4 hours ago
If you're using 7280-SR3 switches, they're certainly a fine choice. However, have you considered the 7280-CR3(K) range? They're much better $/Gbps and have more relevant edge interfaces.

At this scale, why did you opt for a spine-and-leaf design with 25G switches and a dedicated 32×100G spine? Did you explore just collapsing it and using 1-2 32×100G switches per rack, then employing 100G>4×25G AOC breakout cables and direct 100G links for inter-switch connections and storage servers?

Have you also thought about creating a record on PeeringDB? https://www.peeringdb.com/net/400940

By the way, I’m not convinced I’d recommend a UniFi Pro for anything, even for out-of-band management.

Over2Chars 2 hours ago
I guess we can always try to re-hire all those "Sys Admins" we thought we could live without.

LOL?

j-b 6 hours ago
Love these kinds of posts. Tried railway for the first time a few days ago. It was a delightful experience. Great work!
robertclaus 5 hours ago
I would be super interested to know how this stuff scales physically - how much hardware ended up in that cage (maybe in Cloud-equivalent terms), and how much does it cost to run now that it's set up?
coolkil 6 hours ago
Awesome!! Hope to see more companies go this route. I had the pleasure of doing something similar for a company (a lot smaller scale, though).

It was my first job out of university. I will never forget the awesome experience of walking into the datacenter and starting to plug in cables and stuff.

aetherspawn 5 hours ago
What brand of servers was used?
sometalk 5 hours ago
I remember talking to Jake a couple of years ago when they were looking for someone with a storage background. Cool dude, and cool set of people. Really chuffed to see them doing what they believe in.
__fst__ 6 hours ago
Can anyone recommend some engineering reading for building and running DC infrastructure?
enahs-sf 2 hours ago
Curious why California, when the kWh price is so high here vs Oregon or Washington?
Havoc 2 hours ago
Surprised to see PXE. Didn't realise that was in common use in racks.
nextworddev 6 hours ago
First time checking out railway product- it seems like a “low code” and visual way to define and operate infrastructure?

Like, if Terraform had a nice UI?

ramon156 6 hours ago
Weird to think my final internship was running on one of these things. Thanks for all the free minutes! It was a nice experience.
whalesalad 1 hour ago
Cliffhanger! I was mostly excited about the networking/hypervisor setup. Curious to see the next post about the software-defined networking. I had not heard of FRR or SONiC previously.
mirshko 5 hours ago
y'all really need to open-source that rack modeling tool; it would save sooooo many people so much time
renewiltord 5 hours ago
More to learn from the failures than the blog, haha. It tells you what the risks are with a colocation facility. There really isn't any text on how to do this stuff. The last time I wanted to build out a rack, there weren't even any instructions on how to do cable management well. It's sort of learned by apprenticeship and practice.
cyberax 3 hours ago
It looked interesting, until I got to the egress cost. Ouch. $100 per TB is way too much if you're using bandwidth-intensive apps.

Meta-comment: it's getting really hard to find hosting services that provide true unlimited bandwidth. I want to do video upload/download in our app, and I'm struggling to find managed-server providers willing to give me a fixed price for 10/100 Gbps ports.
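
To put rough numbers on that gap (assuming a saturated port and the $100/TB figure above):

    # Why per-TB egress pricing hurts for bandwidth-heavy apps
    port_gbps = 10
    seconds_per_month = 30 * 24 * 3600
    tb_per_month = port_gbps * seconds_per_month / 8 / 1000   # gigabits -> terabytes
    print(f"~{tb_per_month:,.0f} TB/month if saturated")       # ~3,240 TB
    print(f"${tb_per_month * 100:,.0f}/month at $100/TB")      # vs a flat-rate port

Even at a fraction of line rate, metered egress dwarfs what a fixed-price unmetered port costs, which is why the pricing model matters so much for video-heavy workloads.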

exabrial 6 hours ago
I'm surprised you guys are building new!

Tons of colocation is available nearly everywhere in the US, and in the KCMO area there are even a few dark data centers available for sale!

Cool project nonetheless. Bit jealous, actually :P

hintymad 1 hour ago
In my experience with cloud, the most powerful infra abstraction that AWS offers is actually EC2. The simplicity of getting a cluster of machines up and running, with all the metadata readily available via APIs, is just liberating. And it just works: the network is easy to configure, the ASGs are flexible enough to customize, and the autoscaling offers strong primitives for advanced scaling.
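
For a concrete sense of what "metadata readily available via APIs" means here, a minimal boto3 sketch (region and filter values are arbitrary examples):

    # pip install boto3 -- enumerate running instances and their metadata
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                print(inst["InstanceId"], inst["InstanceType"],
                      inst.get("PrivateIpAddress"))

The open-source question in the next paragraph is essentially asking who gives you an equivalent of these calls against your own racks.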

Amazingly, few companies that run their own DCs could build anything comparable to EC2, even at a smaller scale. When I worked at those companies, I sorely missed EC2. I wonder whether there are any sufficiently robust open-source alternatives to EC2's control-plane software for managing bare metal and offering VMs on top of it. That would be awesome for companies that build their own DCs.