Classic story of a startup taking a "good enough" shortcut and then coming back later to optimize.
---
I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.
Because this cluster of VMs was doing batch processing that the founder believed should be CPU-intensive, everyone just assumed that increasing load came with increasing customer size, and that this was just an annoyance we could get to after we shipped one more feature.
But at one point the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided that I could take some time to hook this very CPU-intensive cluster up to a profiler.
I thought I was going to be in for a 1-2 week project and would follow a few worms. Instead, the CPU load was because we were constantly loading an entire table, which we never deleted from, into the application's process. The table held transient data that should only last a few hours at most.
I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.
I also made a pull request that added a simple filter to the query so it only loaded 3 days of data, and then introduced a background operation to clean out the table periodically.
> One complicating factor here is that raw video is surprisingly high bandwidth.
It's weird to be living in a world where this is a surprise but here we are.
Nice write-up though. WebSockets have a number of nonsensical design decisions, but I wouldn't have expected this to be the one that would be chewing up all your CPU.
Is this really an AWS issue? It sounds like you were just burning CPU cycles, which is not AWS-related. The WebSockets angle makes it sound like it was a data transfer or API Gateway cost.
>In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.
> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.
Just highlights that they do not have enough technical knowledge in house. They should spend the $1M/year savings on hiring some good devs.
Chromium already has a built-in zero-copy IPC mechanism that uses shared memory. It's called Mojo. That's how the various browser processes talk to each other. They could just have passed mojo::BigBuffer messages to their custom process and not had to worry about platform-specific code.
But writing a custom ring buffer implementation is also nice, I suppose...
Love the transparency here. Would also love if the same transparency was applied to pricing for their core product. Doesn't appear anywhere on the site.
The problem is that the developers behind this way of streaming video data seem to have no idea of how video codecs work.
If they are in control of the headless chromium instances, the video streams, and the receiving backend of that video stream...why not simply use RDP or a similar video streaming protocol that is made exactly for this purpose?
This whole post reads like an article from a web dev that is totally in over their head, trying to implement something that they didn't take the time to even think about. Arguing about TCP fragmentation when that is not even an issue, and trying to use a TCP stream when that is literally the worst thing you can do in that situation because of round-trip costs.
But I guess that there is no JS API for that, so it's outside the development scope? Can't imagine any reason not to use a much more efficient video codec here other than this running in node.js, potentially missing offscreen canvas/buffer APIs and C encoding libraries that you could use for that.
I would not want to work at this company, if this is how they develop software. Must be horribly rushed prototypical code, everywhere.
Masking in the WebSocket protocol is kind of a funny and sad fix to the problem of intermediaries trying to be smart and helpful, but failing miserably.
Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)
At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it without any attempt at compression? Even if you want lossless compression there are well known and fast algorithms like FFV1 for that purpose.
Just weird.
...and this is why I will never start a successful business.
The initial approach was shipping raw video over a WebSocket. I could not imagine putting something like that together and selling it. When your first computer came with 64KB in your entire machine, some of which you can't use at all and some you can't use without bank switching tricks, it's really really hard to even conceive of that architecture as a possibility. It's a testament to the power of today's hardware that it worked at all.
And yet, it did work, and it served as the basis for a successful product. They presumably made money from it. The inefficiency sounds like it didn't get in the way of developing and iterating on the rest of the product.
I can't do it. Premature optimization may be the root of all evil, but I can't work without having some sense for how much data is involved and how much moving or copying is happening to it. That sense would make me immediately reject that approach. I'd go off over-architecting something else before launching, and somebody would get impatient and want their money back.
> We read through the WebSocket RFC, and Chromium's WebSocket implementation, dug through our profile data, and discovered two primary causes of slowness: fragmentation, and masking.
So they are only halfway correct about masking. The RFC does mandate that client-to-server communication be masked, but that is only enforced by web browsers. If the client is absolutely anything else, just ignore masking. Since the RFC requires a bit to identify whether a message is masked, and that bit is in no way tied to the client/server role of the connection, there is no way to really mandate enforcement. So just don't mask messages and nothing will break.
Fragmentation is completely unavoidable though. The RFC does allow messages to be fragmented at custom lengths in the protocol itself, and that part is avoidable. However, TLS imposes its own fragmentation. In some runtimes, messages sent at too high a frequency will be concatenated, which requires re-splitting them by message length at the receiving end. Firefox sometimes sends frame headers detached from their frame bodies, which is another form of fragmentation.
You have to account for all of that fragmentation from outside the protocol, and it is very slow. In my own implementation, receiving messages took just under 11x longer to process than sending them on a burst of 10 million messages, largely irrespective of message body length. Even with that slowness, WebSockets in my implementation proved to be almost 8x faster than HTTP/1 in real-world full-duplex use on a large application.
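For anyone curious what skipping the mask looks like on the wire, here is a minimal sketch (mine, not from the article or the parent comment's implementation) of an unmasked binary frame header per RFC 6455. The MASK flag is just the top bit of the second header byte, so a non-browser client can leave it clear and skip XORing every payload byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Write a WebSocket frame header for an unmasked binary message into `hdr`
 * (10 bytes is enough when the MASK bit is clear) and return its length.
 * The payload follows the header unmodified, since no masking key is used. */
static size_t ws_header_unmasked(uint8_t *hdr, uint64_t payload_len) {
    size_t n = 0;
    hdr[n++] = 0x80 | 0x02;                 /* FIN = 1, opcode 0x2 (binary) */

    if (payload_len <= 125) {
        hdr[n++] = (uint8_t)payload_len;    /* MASK bit (0x80) left clear */
    } else if (payload_len <= 0xFFFF) {
        hdr[n++] = 126;                     /* 16-bit extended length */
        hdr[n++] = (uint8_t)(payload_len >> 8);
        hdr[n++] = (uint8_t)payload_len;
    } else {
        hdr[n++] = 127;                     /* 64-bit extended length */
        for (int shift = 56; shift >= 0; shift -= 8)
            hdr[n++] = (uint8_t)(payload_len >> shift);
    }
    return n;
}
```

A 3 MB video frame always takes the 64-bit length branch, so the header itself is tiny either way; the real cost of masking is the extra pass over the payload.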
The title makes it sound like there was some kind of blowout, but really it was a tool that wasn't the best fit for this job, and they were using twice as much CPU as necessary, nothing crazy.
That's a good write-up, with a standard solution from some other spaces. Shared memory buffers are very fast, and it's interesting to see them being used here. It wasn't what I expected, which was that they were doing something dumb with API Gateway WebSockets. This is actual stuff. Nice.
You mentioned in the article that you went searching for an alternative to WebSocket for transporting the raw decoded video out of Chromium's Javascript environment. Have you also considered WebTransport?
> A single 1080p raw video frame would be 1080 * 1920 * 1.5 = 3110.4 KB in size
They seem to not understand the fundamentals of what they're working on.
> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.
You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?
> However there's no standard interface for transporting data over shared memory.
Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
> Instead of the typical two-pointers, we have three pointers in our ring buffer:
You can use two back-to-back mmap(2) calls to create a ring buffer, which avoids this.
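For readers who haven't seen it, here is a rough sketch of that mapping trick (names and error handling are mine, and it assumes a POSIX system where shm_open-backed objects show up under /dev/shm): reserve twice the buffer's worth of address space, then map the same shared-memory object into both halves so a frame that straddles the end of the buffer is still contiguous in memory.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one shared-memory object twice, back to back. `size` must be a
 * multiple of the page size. Returns the base of a 2*size window where
 * base[i] and base[i + size] alias the same byte. */
static void *magic_ring(const char *shm_name, size_t size) {
    int fd = shm_open(shm_name, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0)
        return NULL;

    /* Reserve the address range first... */
    uint8_t *base = mmap(NULL, 2 * size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* ...then pin the same object into both halves of it. */
    if (mmap(base, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
        mmap(base + size, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
        return NULL;

    close(fd);
    return base; /* a write that runs past base+size-1 wraps to base+0 */
}
```

Because the backing object has a name, a second process can shm_open and map the same memory, which is what makes this usable as an IPC channel rather than just an in-process ring.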
WebSockets have become popular only due to this "instant" mindset. IRL, only a handful of messages or notifications need truly real-time priority, such as bank OTPs, transaction notifs, etc. Others can wait for a few seconds, and other architectures like middleware, client-side AJAX polling, etc. could be both cheaper and sufficient there.
I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development. Quite an expensive lesson for them to learn, even though I'm assuming they do have the talent somewhere on the team if they're able to maintain a fork of Chromium.
(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)
I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me linking to iceoryx2 (though of course they'd need to use a python extension).
SHM for IPC has been well understood as the better option for high-bandwidth payloads since the 1990s, and it is a staple of Win32 application development for communication between services (daemons) and clients (GUIs).
They are presumably using the GPU for video encoding....
And the GPU for rendering...
So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.
[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...
> write pointer: the next address to write to
OK
> peek pointer: the address of the next frame to read
> read pointer: the address where data can be overwritten
What? If the "write pointer" is "the next address to write to" then the "read pointer" had better be "the next address to read from".
The "peek pointer" should be the "read pointer", and the pointer to the end of the free sector should be the "stop pointer" or "unfreed pointer" or "in-use pointer" or literally anything else. Even "third pointer" would be less confusing!
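To make that concrete, here is one way the three indices could be named, following the parent comment's suggestion. This is a sketch with my own names (not the article's), assuming a single-producer/single-consumer ring with free-running counters that are reduced modulo the capacity when indexing the buffer:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Three-index SPSC ring: the consumer advances `read` as it picks up the
 * next frame, and advances `release` only once that frame's bytes may be
 * overwritten. Counters grow without bound; wrap with `% capacity` when
 * touching `buf`. */
struct frame_ring {
    uint8_t        *buf;
    size_t          capacity;
    _Atomic size_t  write;    /* producer: next byte to write                 */
    _Atomic size_t  read;     /* consumer: next byte to read (the "peek")     */
    _Atomic size_t  release;  /* consumer: everything before this is reusable */
};

/* Bytes the producer may write without clobbering data that is still in use. */
static size_t ring_writable(struct frame_ring *r) {
    size_t w   = atomic_load_explicit(&r->write,   memory_order_relaxed);
    size_t rel = atomic_load_explicit(&r->release, memory_order_acquire);
    return r->capacity - (w - rel);
}
```

The acquire load here pairs with a release store when the consumer bumps `release`; that pairing, not the atomics by themselves, is what keeps the scheme correct on weakly ordered CPUs.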
I for one would like to praise the company for sharing their failure; hopefully the next time someone Googles "transport video over websocket" they'll find this thread.
I've been toying around with a design for a real-time chat protocol, and was recently in a debate of WebSockets vs HTTP long polling. This should give me some nice ammunition.
WebSockets cost us $1M on our AWS bill (recall.ai)
360 points by tosh | 6 November 2024 | 229 comments
Comments
---
The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3
"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"
then
"and the quest for an efficient high-bandwidth, low-latency IPC"
Shared memory. It has been there for 50 years.
Was it because they didn't want to use some multicast video server?
Always. Always ?!?
Article summary: if you're moving a lot of data, your protocol's structure and overhead matter. A lot.
A more reasonable approach would be to have Chromium save the original compressed video to disk, and then use ffmpeg or similar to reencode if needed.
Even better, don't use Chromium at all.
Cheaper and more straightforward.
Their discussion of fragmentation shows they are clueless as to the details of the stack. All that shit is basically irrelevant.
As a point of comparison, how many TB per second of video does Netflix stream?
Are you sure about that? Atomics are not locks, and not all systems have strong memory ordering.
that’s surprising to.. almost no one? 1TBPS is nothing to scoff at