We were wrong about GPUs

(fly.io)

Comments

freedomben 14 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs. System engineers may have smart, fussy opinions on how to get their models loaded with CUDA, and what the best GPU is. But software developers don’t care about any of that. When a software developer shipping an app comes looking for a way for their app to deliver prompts to an LLM, you can’t just give them a GPU.

I'm increasingly coming to the view that there is a big split among "software developers" and AI is exacerbating it. There's an (increasingly small) group of software developers who don't like "magic" and want to understand where their code is running and what it's doing. These developers gravitate toward open source solutions like Kubernetes, and often just want to rent a VPS or at most a managed K8s solution. The other group (increasingly large) just wants to `git push` and be done with it, and they're willing to spend a lot of (usually their employer's) money to have that experience. They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

A company like fly.io absolutely appeals to the latter. GPU instances at this point are very much appealing to the former. I think you have to treat these two markets very differently from a marketing and product perspective. Even though they both write code, they are otherwise radically different. You can sell the latter group a lot of abstractions and automations without them needing to know any details, but the former group will care very much about the details.

ryuuseijin 15 February 2025
My heart stopped for a moment when reading the title. I'm glad they haven't decided to axe GPUs, because fly GPU machines are FANTASTIC!

Extremely fast to start on-demand, reliable, and although a little bit pricey, not unreasonably so considering the alternatives.

And the DX is amazing! It's just like any other fly machine, no new set of commands to learn. Deploy, logs, metrics, everything just works out of the box.

Regarding the price: we tried a well-known cheaper alternative, and every once in a while, on restart, inference performance dropped by 90%. We never figured out why, but we never had any such problems on Fly.

If I'm using a cheaper "Marketplace" to run our AI workloads, I'm also not really clear on who has access to our customers' data. No such issues with Fly GPUs.

All that to say, fly GPUs are a game changer for us. I could wish only for lower prices and more regions, otherwise the product is already perfect.

lifeisstillgood 15 February 2025
I have a timeline that I am still trying to work through, but it goes like this:

2012 - Moore's law basically ends - NAND gates don't get smaller, just more cleverly wrapped. Single-threaded clock speeds more or less stop climbing and have remained there.

2012-2022 - no one notices that single-threaded is stalled, because everything moves to VMs in the cloud - the excess parallel compute from each generation is just shared out in data centres.

2022 - data centres realise there is no point buying the next generation of super chips with even more cores: you make massive capital investments but cannot shovel 10x or 100x more processes in, because Amdahl's law means standard computing is not 100% parallel.

2022 - but look, LLMs are 100% parallel, hence we can invest capital once again.

2024 - this is the bit that bakes my noodle - wafer-scale silicon. 900,000 cores with GBs of SRAM - these monsters run Llama models 10x faster than A100s.

We broke Moore's law and hardware just kept giving us more parallel cores, because that's all it can do.

And now software needs to figure out how to use that power - because dammit, someone can run their code a million times faster than a competitor - god knows what that means, but it's got to mean something - and yet AI surely cannot be the only way to use 1M cores?

reilly3000 14 February 2025
I shelled out for a 4090 when they came out thinking it would be the key factor for running local LLMs. It turns out that anything worth running takes way more than 24GB VRAM. I would have been better off with 2+ 3090s and a custom power supply. It's a pity because I thought it would be a great solution for coding and a home assistant, but performance and quality aren't there yet for small models (afaik). Perhaps DIGITS will scratch the itch for local LLM developers, but performant models really want big metal for now, not something I can afford to own or rent at my scale.

serjester 14 February 2025
I respect them for being public about this.

With that said, this seems quite obvious - the type of customer that chooses Fly, seems like the last person to be spinning up dedicated GPU servers for extended periods of time. Seems much more likely they'll use something serverless which requires a ton of DX work to get right (personally I think Modal is killing it here). To compete, they would have needed to bet the company on it. It's way too competitive otherwise.

mmastrac 14 February 2025
It's really a shame GPU slices aren't a thing -- a monthly cost of $1k for "a GPU" is just so far outside of what I could justify. I guess it's not terrible if I can batch-schedule a mega-gpu for an hour a day to catch up on tasks, but then I'm basically still looking at nearly $50/month.

I don't know exactly what type of cloud offering would satisfy my needs, but what's funny is that attaching an AMD consumer GPU to a Raspberry Pi is probably the most economical approach for a lot of problems.

Maybe something like a system where I could hotplug a full GPU into a system for a reservation of a few minutes at a time and then unplug it and let it go back into a pool?

FWIW, there's a large number of ML-based workflows that I'd like to plug into progscrape.com, but it's been very difficult to find a model that works without breaking the hobby-project bank.
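A sketch of what that check-a-GPU-out-of-a-pool idea could look like from the client side; the `pool.reserve`/`pool.release` API here is entirely hypothetical and only illustrates the short-lease shape of it:

    import contextlib

    @contextlib.contextmanager
    def gpu_lease(pool, minutes=10):
        """Borrow a whole GPU from a shared pool, return it when the block ends."""
        lease = pool.reserve(gpu="consumer-class", max_minutes=minutes)  # hypothetical call
        try:
            yield lease                  # point the batch jobs at lease.endpoint
        finally:
            pool.release(lease)          # GPU goes back into the pool; billing stops

    # Usage sketch: batch a day's worth of inference into one short, metered window.
    # with gpu_lease(pool, minutes=30) as gpu:
    #     run_backlog(gpu.endpoint)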

jameslk 15 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

Fly.io seems to attract similar developers as Cloudflare's Workers platform. Mostly developers who want a PaaS-like solution with good dev UX.

If that’s the case, this conclusion seems obvious in hindsight (hindsight is a bitch). Developers who are used to having infra managed for them so they can build applications don’t want to start building on raw infra. They want the dev velocity promise of a PaaS environment.

Cloudflare made a similar bet with GPUs I think but instead stayed consistent with the PaaS approach by building Workers AI, which gives you a lot of open LLMs and other models out of box that you can use on demand. It seems like Fly.io would be in a good position to do something similar with those GPUs.

PeterStuer 15 February 2025
It also seems they got caught in the middle of the system integrator vs product company dilemma.

To me, Fly's offering reads like a system integrator's solution. They assemble components produced mainly by 3rd parties into an offered solution. The business model of a system integrator thrives on doing the least innovation/custom work possible to provide the offering. You position yourself to take maximal advantage of investments and innovations driven by your 3rd-party suppliers. You want to be squarely on their happy path.

Instead, this article reads like Fly, with good intentions, was trying to divert their tech suppliers' offer stream into niche edge cases outside of mainstream support.

This can be a valid strategy for products very late in their maturity lifecycle, where core innovation is stagnant, but for the current state of AI, with extremely rapid innovation waves coursing through the market, that strategy is doomed to fail.

chr15m 14 February 2025
Side note: "we were wrong" - are there any more noble and beautiful words in the English language?

tptacek 14 February 2025
We wrote all sorts of stuff this week and this is what gets to the front page. :P

hansvm 15 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

I don't want GPUs, but that's not quite the reason:

- The SOTA for most use cases, for most classes of models with smallish inputs, is fast enough and more cost-efficient on a CPU.

- With medium inputs, the GPU often wins out, but costs are high enough that a 10x markup isn't worth it, especially since the compute costs are still often low compared to networking and whatnot. Factor in engineer hours and these higher-priced machines, and the total cost of a CPU solution is often still lower (and always more debuggable).

- For large inputs/models, the GPU definitely wins, but now the costs are at a scale where a 10x markup is untenable. It's cheaper to build your own cluster or pay engineers to hack around the deficits of a larger, hosted LLM.

- For xlarge models™ (fuzzily defined as anything substantially bigger than the current SOTA), GPUs are fundamentally the wrong abstraction. We _can_ keep pushing in the current directions (transformers requiring O(params * seq^2) work; pseudo-transformers requiring O(params * seq) work but with a hidden, always-activated state space buried in that `params` term, which has to grow nearly linearly to attain the same accuracy on longer sequences; ...), but the cost of doing so is exorbitant. If you look at what's provably required for those sorts of computations, the "chuck it in a big slice of VRAM and do everything in parallel" strategy gets more expensive relative to theoretical optimality as model size increases. (Rough scaling sketched below.)
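For reference, the standard per-layer scaling for a vanilla transformer with hidden size d over a sequence of length n (textbook figures, not a claim about any particular model):

    \text{attention} \in O(n^2 d), \qquad \text{feed-forward} \in O(n d^2), \qquad \text{per layer} \approx O(n^2 d + n d^2)

Once n dominates, compute grows quadratically with context length, which is the params * seq^2 behavior described above.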

I've rented a lot of GPUs. I'll probably continue to do so in the future. It's a small fraction of my overall spending though. There aren't many products I can envision which could be built on rented GPUs more efficiently than rented CPUs or in-house GPUs.

kristopolous 15 February 2025
They were double wrong. I work at a GPU cloud provider and we can't bring on the machines fast enough. Demand has been overwhelming.

People aren't going to fly.io to rent GPUs. That's the actual reality here.

They thought they could sidecar it to their existing product offering for a decent revenue boost but they didn't win over the prospect's mind.

Fly has compelling product offerings, and boring shovels don't belong in their catalog.

akoculu 14 February 2025
I spent a month last year setting up a serverless endpoint for a custom model with Runpod. It was expensive and unreliable, with long cold-boot times on top. The product was unusable even as a prototype; to cover the costs, I'd have had to raise money first.

For a different product, I was given some Google Cloud credits, which let me put the product in front of customers. This one also needed a GPU, but not as expensive a one as the previous project. It works reliably and it's fast.

Personally, I've had two use cases for GPU providers in the past 3 months.

I think there's definitely demand for reliability and better pricing. Not sure Fly will be able to touch that market though, as it's not known for either (stability or developer-friendly pricing).

P.S. If anyone is working on a serverless provider and wants me to test their product, reach out to me :)

jeffybefffy519 14 February 2025
I feel like these guys are missing a pretty important point in their own analysis. I tried setting up an Ollama LLM on a fly.io GPU machine and it was near impossible because of fly.io limitations such as:

1. Their infrastructure doesn't support streaming responses well at all (which is an important part of the LLM experience in my view).

2. The LLM itself is massive and can't be part of the Docker image I was building and uploading. Fly doesn't have a nice way around this, so I had to set up a whole heap of code to pull it in on the Fly machine's first invocation, which doesn't work well once you start to run multiple machines.

It was messy and ended up as a long support ticket with them that didn't get it working any better, so I gave up.
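For readers hitting the same wall: the usual workaround is to keep the weights out of the image entirely and pull them onto a persistent volume on first boot. A minimal sketch, assuming a volume mounted at /data, an illustrative model tag, and Ollama's OLLAMA_MODELS convention; this is not Fly-specific advice:

    # entrypoint.py - pull model weights onto the attached volume on first boot,
    # so the Docker image stays small. Paths and model tag are illustrative.
    import os
    import subprocess
    import time

    VOLUME = "/data"                                   # assumed volume mount point
    MODEL = "llama3:8b"                                # illustrative model tag

    os.environ["OLLAMA_MODELS"] = os.path.join(VOLUME, "models")  # keep weights on the volume
    server = subprocess.Popen(["ollama", "serve"])     # the CLI talks to this daemon
    time.sleep(3)                                      # crude wait for the API to come up
    subprocess.run(["ollama", "pull", MODEL], check=True)  # multi-GB on first boot, a no-op afterwards
    server.wait()                                      # keep serving requests

This still leaves the multi-machine problem the parent mentions: every new machine either repeats the pull or needs its own pre-warmed volume.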

zacksiri 15 February 2025
Most developers avoid GPUs because of pricing. It's simply too expensive to run 24/7, and there's the overhead of managing/bootstrapping instances and loading large models just to serve intermittent workloads. That's the gist of it, I think.

Unless you have constant load that justifies a 24/7 deployment, most devs will just use an API, or find solutions that don't require paying > $1/hour.

keyle 15 February 2025
It feels to me that instead of quitting on it, you should double down.

The reason we don't want GPUs is that renting isn't priced well enough, and the technology isn't quite there yet for us to make consistently good use of it.

Removing the offer just exacerbates the current situation. It feels like both curves are about to meet.

In either case you'll have the experience to bring back the offer if you feel it's needed.

johntash 14 February 2025
I really liked playing around with fly gpus, but it's just too expensive for hobby-use. Same goes for the rest of fly.io honestly. The DX is great and I wish I could move all of my homelab stuff and public websites to it, but it'd be way too expensive :(

hoppp 14 February 2025
GPUs don't fit the usual "start with a free tier, then upgrade when monetizing" approach most devs take with these kinds of platforms.

For simple inference, it's too expensive for a project that makes no money. Which is most projects.

nitwit005 15 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

My current company has some finance products. There was machine learning used for things like fraud and risk before the recent AI excitement.

Our executives are extremely enthused with AI, and seemingly utterly uncaring that we were already using it. From what I can tell, they genuinely just want to see ChatGPT everywhere.

The fraud team announced they have a "new" AI-based solution. I assume they just added a call to OpenAI somewhere.

devmor 15 February 2025
The emotion in this article hits home to me. There have been several points in my career where I worked hard for long hours to develop a strong, clever solution to a problem that ultimately was solved by something cheap because the ultimate consumer didn’t care about what we expected them to care about.

It sucks from a business perspective of course, but it also sucks from the perspective of someone who takes pride in their work! I like to call it “artisan’s regret”.

scosman 15 February 2025
> But inference latency just doesn’t seem to matter yet, so the market doesn’t care.

This is a very strange statement to make. They are acting like inference today happens with freshly spun up VMs and model access over remote networks (and their local switching could save the day). It’s actually hitting clusters of hot machines with the model of choice already loaded into VRAM.

In real deployments, latency can be small (if implemented well), and speed comes down to having the right GPU config for the model (which Fly doesn't offer).

People have built better shared-resource inference systems for LoRAs (OpenAI, Fireworks, Lorax) - but it's not VMs. It's model-aware: the right hardware for the base model, and optimized caching/swapping of the LoRAs.

I’m not sure the Fly/VM way will ever be the path for ML. Their VM cold start time doesn’t matter if the app startup requires loading 20GB+ of weights.

Companies like Fireworks are working on fast LoRA inference cold starts. Companies like Modal are working on fast serverless VM cold starts with a range of GPU configs (2xH100, A100, etc). These seem more like the two cloud primitives for AI.
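The caching/swapping idea mostly reduces to keeping one hot base model in VRAM and treating adapters as a small LRU cache. A rough sketch of that bookkeeping only; `load_adapter` stands in for whatever your serving stack (PEFT, vLLM, etc.) actually exposes:

    from collections import OrderedDict

    MAX_RESIDENT = 8                 # adapters kept alongside the base model at once
    _adapters = OrderedDict()        # adapter_id -> loaded adapter, in LRU order

    def get_adapter(adapter_id, load_adapter):
        """Return a resident adapter, loading it and evicting LRU entries as needed."""
        if adapter_id in _adapters:
            _adapters.move_to_end(adapter_id)             # mark as most recently used
            return _adapters[adapter_id]
        if len(_adapters) >= MAX_RESIDENT:
            _adapters.popitem(last=False)                 # evict the least recently used adapter
        _adapters[adapter_id] = load_adapter(adapter_id)  # MBs of adapter weights vs. GBs of base weights
        return _adapters[adapter_id]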

aqueueaqueue 15 February 2025
> developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

Is there not a market for the kind of data science stuff where GPUs help but you are not using an LLM? Like statistical models on large amounts of data, and so on.

Maybe fly.io's customer base isn't that sort of user. But I was pushing a previous company to get AWS GPUs because it would have saved us money vs CPUs for the workload.

latchkey 14 February 2025
There is no market for MIG (Multi-Instance GPU) in the cloud. People talk about it a lot, but in reality, nobody wants a partial GPU (at least not to pay for one).

One interesting thing about all this is that 1 GPU / 1 VM doesn't work today with AMD GPUs like the MI300X. You can't do PCIe passthrough, but AMD is working on adding it to ROCm. We plan to be one of the first to offer this.

burnto 15 February 2025
I think they're too early for their core market. It's taking indie and 0-1 devs a while to dig into ML because it's a huge, complex space. But some of us are starting to put together interesting little pipelines with real, solid applications.

hinkley 15 February 2025
I feel like one of the mistakes being made again and again in the virtualization space is not realizing there's a difference between a competitor possibly running containers on the same machine with your proprietary data, and Dave over in Customer Relations running a container on your same machine.

If Dave does something malicious, we know where Dave lives, and we can threaten his livelihood. If your competitor does it you have to prove it, and they are protected from snooping at least as much as you are so how are you going to do that? I insist that the mutually assured destruction of coworkers substantially changes the equation.

In a Kubernetes world you should be able to saturate a machine with pods from the same organization even across teams by default, and if you're worried that the NY office is fucking with the SF office in order to win a competition, well then there should be some non-default flags that change that but maybe cost you a bit more due to underprovisioning.

You got a machine where one pod needs all of the GPUs and 8 cores? Great. We'll load up some 8-core, low-memory pods onto it until the machine is full.

taeric 15 February 2025
I was at another team making a similar bet. Felt off to me at the time, but I assumed I just didn't understand the market.

I also think the call that people want LLMs is slightly off. More correct to say people want a black box that gives answers. LLMs have the advantage that nobody really knows anything about tuning them. So, it is largely a raw power race.

Taking it back to ML, folks would love a high level interface that "just worked." Dealing with the GPUs is not that, though.

hankchinaski 15 February 2025
They should invest in and focus on making their platform more reliable. Without that, they will continue to be just a hobby toy to play with and nothing more.

npn 15 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

No, I want GPUs. BERT models are still useful.

The point is that your service is so expensive that one or two months of renting is enough to build a PC from scratch and put it somewhere in your workplace to run 24/7. For applications that need GPU power, downtime or latency usually doesn't really matter. And you always add an extra server to be safe.
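The ballpark arithmetic, using the ~$1k/month dedicated-GPU figure mentioned elsewhere in this thread and an assumed rough price for a used-consumer-GPU workstation:

    \text{rented: } \approx \$1{,}000 \text{ per month, running 24/7}
    \text{owned: } \approx \$1{,}500\text{--}\$2{,}000 \text{ one-time (assumed ballpark for a used 3090-class box)}

So the box pays for itself in one to two months of continuous use, before counting power.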

Kwpolska 15 February 2025
> Instead, we burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor. At one point, we hex-edited the closed-source drivers to trick them into thinking our hypervisor was QEMU.

What do Nvidia's lawyers think of this? There are some things that are best not mentioned in a blog post, and this is one of them.

djhworld 15 February 2025
I get the impression that running LLMs is a pain in general: it always seems to need the right incantation of Nvidia drivers, Linux kernel, and a boatload of VRAM, along with making sure the Python ecosystem (or whatever you are running for inference) has the right set of libraries - and if you want multi-tenant processing across VMs, forget it, or pay $$$ to Nvidia.

The whole cloud computing world was built on hypervisors and CPU virtualization, I wonder if we'll see a similar set of innovations for GPUs at commodity level pricing. Maybe a completely different hardware platform will emerge to replace the GPU for these inference workloads. I remember reading about Google's TPU hardware and was thinking that would be the thing - but I've never seen anyone other than Google talk about it.

sylware 15 February 2025
GPUs are all about performance. Nearly all the time, very high-level languages have no place there.

The CPU part of high-level user applications will probably be written in very high-level languages/runtimes, with some parts occasionally being bare-metal accelerated (GPU or CPU).

Devs wanting hardcore performance should write their stuff directly in GPU assembly (I think you can do that only with AMD), or at best with a SPIR-V assembler.

Not to mention that doing complex stuff around the closed-source Nvidia driver on Linux is just asking for trouble. Namely, either you deploy hardware/software Nvidia has validated, or prepare to suffer... which means "middle-men" deploying Nvidia-validated solutions have near-zero added value.

silisili 14 February 2025
I'm admittedly a complete LLM noob, so my question might not even make sense. Or it might exist and I haven't found it quite yet.

But have they considered pivoting some of said compute to some 'private, secure LLM in a box' solution?

I've lately been toying with the idea of training from extensive docs and code, some open, some not, for both code generation and insights.

I went down the RAG rabbit hole, and frankly, the amount of competing ideas about "this is how you should do it" - from personal blogs to PaaS companies - overwhelmed me. Vector DBs, Ollama, models, LangChain, and various one-off tools linking to git repos.

I feel there has to be a substantial market for whoever can completely simplify that flow for dummies like me, and not charge a fortune for the privilege.
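For what it's worth, the core of that flow is smaller than the ecosystem makes it look. A minimal sketch with as few moving parts as possible; the embedding model name is illustrative and the final LLM call is left to whatever you run:

    # Minimal retrieval-augmented generation: embed chunks, retrieve by cosine
    # similarity, stuff the top hits into a prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    embedder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative embedding model

    docs = ["chunk of internal docs ...", "another chunk ...", "release notes ..."]
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)   # shape: (n_docs, dim)

    def retrieve(question, k=2):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                                # cosine similarity (vectors are normalized)
        return [docs[i] for i in np.argsort(scores)[::-1][:k]]

    question = "How do I configure the ingest pipeline?"
    context = "\n\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Send `prompt` to whatever LLM you're running (Ollama, an API, etc.).

Vector databases, chunking strategies, and frameworks are optimizations layered on top of this loop, not prerequisites for it.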

the_king 15 February 2025
This is well written. I appreciated the line, "startups are a race to learn stuff."

jonathanlei 15 February 2025
It's as difficult for a serverless provider to grow as it was for CPUs before GPUs came along.

Many companies over-invest in fully-owned hardware rather than renting from clouds. Owning hardware means you underwrite the cost of unrented inventory, and it prevents you from scaling. H100 rental pricing is now lower than any self-hosted option, even without factoring in TCO and headcount.

(Disclaimer: I work at a GPU cloud Voltage Park -- with 24k H100s as low as $2.25/hr [0] -- but Fly.io is not the only one I've noticed purchase hardware when renting might have saved some $$$)

[0] https://dashboard.voltagepark.com/

flockonus 15 February 2025
It feels like giving up on this a bit too soon? I mean, they diagnosed the problem correctly... their offering doesn't entirely make sense for their audience when it comes to GPUs.

_But_ the demand for open-source models is just beginning. If they really have a big inventory of under-utilized GPUs and users want particular solutions on demand... give it to them???

Like TTS, STT, video creation, real-time illustration enhancement, DeepSeek, and many others. You guys are great at DevOps - make useful offerings on demand, similar to what Hugging Face offers, no???

dathinab 14 February 2025
> A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s.

For a lot of use cases you need at least two A100s with a very fast interconnect, potentially many more. This isn't even about scaling with requests, but about running one single LLM instance.

Sure, you will find all sorts of ways people have managed to run this or that on smaller platforms; the problem is that quite often it doesn't scale to what is needed in production, for a lot of subtle and less subtle reasons.

yieldcrv 14 February 2025
Yes, devs want LLMs, but also the price of inference compute plummeted 90% over the last 18 months, and that compute is primarily GPUs.

So it's not just that the OpenAI and Anthropic APIs are good enough - they are also cheap enough, and still overpriced compared to the rest of the industry.

Your GPU investment won't do as well as you thought, but you're also wasting time on security. If the end user and the market don't care, then you can consider not caring as well. Worst case, you can pay for any settlement with... more GPU credits.

KETpXDDzR 18 February 2025
> They want LLMs.

That's why NVIDIA has NIMs [0]. A super easy way to use various LLMs.

[0] https://developer.nvidia.com/nim

onli 14 February 2025
Not sure about this:

> like with our portfolio of IPv4 addresses, I’m even more comfortable making bets backed by tradable assets with durable value.

Is that referencing the GPUs, the hardware? If yes, why should they have durable value? Historically, hardware like that depreciates fast and reaches a value of 0; energy efficiency alone kills e.g. old server hardware. Is something different here?

olibaw 18 February 2025
What a great blog post. Hope you figure it out, Fly.io. "If you will it, it is no dream."

cyberax 14 February 2025
Hah. We're doing AI, but we're doing vision-based stuff and not LLMs. For us, the problem has been deploying models.

Google and AWS helpfully offered their managed LLM AI services, but they don't really have anything terribly more useful than just machines with GPUs. Which are expensive.

I'm going to check fly.io...

jonathanyc 15 February 2025
> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

I considered using a Fly GPU instance for a project and went with Hetzner instead. Fly.io’s GPU offering was just way too expensive to use for inference.

siliconc0w 15 February 2025
They might just be early.

The smaller models are getting more and more capable; for high-frequency use cases it'll probably be worth using local quantized models vs paying for API inference.

imcritic 15 February 2025
What a good, open, and honest blog post. And I liked a lot the way it's interlinked with other interesting posts from the same blog. I hope I'll have some time to read more articles from it.

hamandcheese 15 February 2025
> We were wrong about Javascript edge functions, and I think we were wrong about GPUs.

Actually, you're still wrong about JavaScript edge functions. CF Workers slap.

apineda 15 February 2025
My issue is that I may or may not understand what's going on, but I simply, for the most part, do not want to spend time maintaining any more than I have to.

bleemworks 15 February 2025
Article about GPUs, comments arguing over the definition of complexity in Kubernetes. This is what you call “learned helplessness.”

amelius 15 February 2025
In most cases developers don't want GPUs, they just want a way to express a computation graph, and let the system perform the computation.
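Concretely, that's the promise of graph-level frameworks: you write the computation once and the runtime decides where it executes. A tiny sketch in JAX (any similar framework would do); nothing in the source is GPU-specific:

    import jax
    import jax.numpy as jnp

    @jax.jit                                   # trace the graph once, compile for whatever backend is present
    def step(w, x, y, lr=0.1):
        def loss(w):
            return jnp.mean((x @ w - y) ** 2)  # simple least-squares loss
        return w - lr * jax.grad(loss)(w)      # one gradient-descent update

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (128, 8))
    y = x @ jnp.arange(8.0)                    # synthetic targets
    w = jnp.zeros(8)
    for _ in range(200):
        w = step(w, x, y)
    print(jax.devices())                       # CPU or GPU, same source code either way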

a-r-t 14 February 2025
Off topic, but the font in the article is hard on the eyes.

doctorpangloss 14 February 2025
> GPUs terrified our security team.

Ha ha, it didn't terrify Modal. It ships with all those security problems, and pretends it doesn't have them. Sorry Eric.

mrcwinn 14 February 2025
Has service reliability improved at all? I tried Fly at two different points in time and I’ve never had a worse experience with a service.

sergiotapia 14 February 2025
You guys have all these juicy GPUs and all this infrastructure. Why not offer models as APIs?

I would pay to have APIs for:

SAM 2, Florence, BLIP, Flux 1.1, etc.

Whatever use case I would have reached for Fly GPUs for, I can't justify _not_ using Replicate. Maybe Fly can do better - offer premium queues for that with their juicy infra?

You're right! As a software dev, I see dockerizing and hosting these models myself as a burden, not a necessity.

Philpax 14 February 2025
I noticed quite a few spelling and grammar mistakes - could do with a bit of an edit pass?

VectorLock 14 February 2025
Out of curiosity, how much runway does fly.io have (without raising new funding)?

sgt 15 February 2025
Currently getting a 502 error when trying to access fly.io

abraxas 15 February 2025
If low-cost GPUs are not what they're offering, then what are they offering that I wouldn't get from a big cloud vendor? This looks like a self-inflicted mortal wound.

inetknght 14 February 2025
Kudos to owning up to your failed bet on GPUs even if you are putting a lot of blame on Nvidia for it. And to be fair, you're not wrong. Nvidia's artificial market segmentation is terrible and their drivers aren't that great either.

The real problem is the lack of security-isolated slicing of one or more GPUs for virtual machines. I want my consumer-grade GPU to be split between the host machine and virtual machines, without worrying about cross-talk from co-resident neighbors! Gosh, that sounds like why I moved out of my apartment complex, actually.

The idea of having to assign a whole GPU via PCI passthrough is just asinine. I don't need to do that for my CPU, RAM, network, or storage. Why should I need to do it for my GPU?

iFire 14 February 2025
What GPU services will you keep?

anotherhue 14 February 2025
Next week: fly introduces game streaming technology for indie game devs.

pier25 15 February 2025
"We started this company building a Javascript runtime for edge computing."

Wait... what?

I've been a Fly customer for years and it's the first time I hear about this.

andrewstuart 15 February 2025
Nvidia deliberately makes this hard.

Opportunity for Intel and AMD.