Why is my CPU usage always 100%?

(downtowndougbrown.com)

Comments

WediBlino 13 January 2025
An old manager of mine once spent the day trying to kill a process that was running at 99% on a Windows box.

When I finally got round to seeing what he was doing, I was disappointed to find he was attempting to kill the 'system idle' process.

veltas 13 January 2025
Reading 4 times doesn't feel like a necessarily portable solution: there may be more versions at different speeds and with different I/O architectures, it's unclear how this will behave under more load, and the original change may have been made to fix some other performance problem OP is not aware of. Still, I'm not sure what else can be done. Unfortunately, many vendors like Marvell can seriously under-document crucial features like this. If anything, it would be good to put some of this info in the code comment itself. That's not very elegant, but how else are we practically meant to keep track of this? Is the mailing list part of the documentation?

Doesn't look like there's a lot of discussion on the mailing list, but I don't know if I'm reading the thread view correctly.

rbanffy 13 January 2025
In the late 1990s I worked at a company that had a couple of mainframes in its fleet. Once, I looked at a resource usage screen (Omegamon, perhaps? Is it that old?) and noticed the CPU was pegged at 100%. I asked the operator if that was normal. His answer was "Of course. We paid for that CPU, might as well use it". Funny though that mainframes are designed for that: most, if not all, non-application work is offloaded to other processors in the system so that the CPU can run applications as fast as it can.

sneela 13 January 2025
This is a wonderful write-up and a very enjoyable read. Although my knowledge about systems programming on ARM is limited, I know that it isn't easy to read hardware-based time counters; at the very least, it's not as simple as the x86 rdtsc [1]. This is probably why the author writes:

> This code is more complicated than what I expected to see. I was thinking it would just be a simple register read. Instead, it has to write a 1 to the register, and then delay for a while, and then read back the same register. There was also a very noticeable FIXME in the comment for the function, which definitely raised a red flag in my mind.

Regardless, this was a very nice read, and I'm glad they got to the bottom of the issue and fixed the problem.

[1]: https://www.felixcloutier.com/x86/rdtsc
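The sequence described in the quoted passage (write a 1, wait, read back the same register) can be sketched roughly as follows. This is only an illustration, not the kernel's actual code: `read_latched_counter`, `short_delay`, and the delay length are hypothetical, and a plain pointer stands in for the memory-mapped register so the snippet compiles in user space.

```c
#include <stdint.h>

/* Hypothetical stand-in for a short hardware delay: spin a fixed number of times. */
void short_delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++) {
        /* busy-wait */
    }
}

/*
 * Sketch of the "write 1, delay, read back" pattern from the quote.
 * On real hardware the read would return the latched counter value;
 * with plain memory it simply returns the 1 that was just written.
 */
uint32_t read_latched_counter(volatile uint32_t *reg, unsigned delay_iterations)
{
    *reg = 1;                      /* request a capture of the counter */
    short_delay(delay_iterations); /* give the hardware time to latch it */
    return *reg;                   /* read back the same register */
}
```

How long the delay needs to be depends on the hardware, which is exactly the part the article had to investigate.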

dmitrygr 13 January 2025
Curiously, instead of "set capture reg, wait for clock edge, read", the "read reg twice, until same result is obtained" approach is ignored. This is strange, as it is usually much faster: reading a 3.25MHz counter at 200MHz+ twice is very likely to see the same value twice. For a 32kHz counter, it is basically guaranteed.

   u32 val;
   do {
       val = readl(...);
   } while (val != readl(...));

   return val;

This compiles to a nice little 6-instruction function on ARM/Thumb too, with no delays:

   readclock:
     LDR  R2, =...
   1:
     LDR  R0, [R2]
     LDR  R1, [R2]
     CMP  R0, R1
     BNE  1b
     BX   LR
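
For anyone who wants to experiment with the read-twice idea outside the kernel, here is a minimal self-contained sketch. It makes some assumptions that are not in the snippet above: a plain `volatile` pointer replaces `readl()`, the name `read_stable_counter` is made up, and a retry bound is added so the loop cannot spin forever if the value never settles.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Read a free-running counter twice and accept the value only when both
 * reads agree. Gives up after max_tries attempts (the bound is an addition
 * for illustration; the loop in the comment above is unbounded).
 * Returns true and stores the value in *out on success.
 */
bool read_stable_counter(const volatile uint32_t *reg, unsigned max_tries,
                         uint32_t *out)
{
    for (unsigned i = 0; i < max_tries; i++) {
        uint32_t first = *reg;
        uint32_t second = *reg;
        if (first == second) {
            *out = first;
            return true;
        }
    }
    return false;
}
```

Whether this approach is valid for a given timer depends on how the hardware behaves when a read races with a counter update, which the vendor documentation would need to confirm.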
askvictor 13 January 2025
My recurring issue (on a variety of laptops, both Linux and Windows): the fans will start going full-blast, everything slows down, then as soon as I open a task manager CPU usage drops from 100% to something negligible.
steventhedev 14 January 2025
Aside from the technical beauty of this post, what is the practical impact of this?

Fan speeds should ideally be driven by temperature sensors, and CPU idling is working, albeit with interrupt waits, as pointed out here. The only impact seems to be the surprise of seeing a number that says the CPU is working harder than it really is.

It's far better to look at the system load (which was 0.0 - already a strong hint this system is working below capacity). It has a formal definition (average waiting cpu task queue depth over 1, 5, 15 minutes) and succinctly captures the concept of "this machine is under load".

Many years ago, a coworker deployed a bad auditd config. CPU usage was below 10%, but system load was 20x the number of cores. We moved all our alerts to system load and used that instead.
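
A rough sketch of that kind of check on Linux, assuming the usual `/proc/loadavg` layout (three load averages followed by other fields). The function names and the per-core threshold are hypothetical, and the core count is passed in by the caller rather than detected.

```c
#include <stdio.h>

/* Parse the three load averages from a /proc/loadavg-style line.
 * Returns 0 on success, -1 if the line does not start with three numbers. */
int parse_loadavg(const char *line, double avg[3])
{
    if (sscanf(line, "%lf %lf %lf", &avg[0], &avg[1], &avg[2]) != 3)
        return -1;
    return 0;
}

/* Read the current load averages from /proc/loadavg (Linux-specific).
 * Returns 0 on success, -1 on any failure. */
int read_loadavg(double avg[3])
{
    char line[128];
    FILE *f = fopen("/proc/loadavg", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? parse_loadavg(line, avg) : -1;
}

/* Hypothetical alert rule: returns 1 when the load is above
 * per_core_threshold times the number of cores, otherwise 0. */
int load_exceeds(double load, long ncpus, double per_core_threshold)
{
    return ncpus > 0 && load > per_core_threshold * (double)ncpus;
}
```

With a rule like this, a load of 80 on a 4-core machine (20x the core count, as in the auditd story) would trip the alert even while CPU usage stays low.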

thrdbndndn 13 January 2025
I don't get the fix.

Why would reading it multiple times fix the issue?

Is it just because reading takes time, so reading multiple times lets the needed time between the write and the read pass? If so, it sounds like a worse solution than just extending the delay, like the author did initially.

If not, then I would like to know the reason.

(Needless to say, a great article!)

evanjrowley 13 January 2025
This headline reminded me of Mumptris, an implementation of Tetris in the old mainframe-oriented language MUMPS, which by design, uses 100% CPU to reduce latency: https://news.ycombinator.com/item?id=4085593
a1o 13 January 2025
This was very well written, I somehow read every single line and didn't skip to the end. Great work too!
RajT88 13 January 2025
TIL there are still Chumbys alive in the wild. My Insignia Chumby 8 didn't last.
rbohac 14 January 2025
This was a well written article! It was nice to read the process of troubleshooting with the rabbit holes included. Glad you stuck it out!
WalterBright 13 January 2025
I noticed that one time. Looked at the process list, and what was running was a program that enabled streaming. But since I wasn't streaming anything, I wondered what it was doing reading the disk drive.

So I uninstalled it.

Not having any programs that are not good citizens.

ndesaulniers 13 January 2025
Great read! Eerily similar to some bugs I've had, but the root cause has been a compiler bug. Debugging a kernel that doesn't boot is... interesting. QEMU+GDB to the rescue.
NotYourLawyer 13 January 2025
That’s an awful lot of effort to deal with an issue that was basically just cosmetic. I suspect at some point the author was just nerd sniped though.
g-b-r 13 January 2025
I expected it to be about holding down the spacebar :/
amelius 13 January 2025
To diagnose, why not run "time top" and look at the user and sys outputs?
markhahn 13 January 2025
very nice investigation.

shame about the unnecessary use of cat :)

InsomniacL 13 January 2025
> Chumby’s kernel did a total of 5 reads of the CVWR register. The other two kernels did a total of 3 reads.

> I opted to use 4 as a middle ground

reminded me of xkcd: Standards

https://xkcd.com/927/

TrickyReturn 13 January 2025
Probably running Slack...
Suppafly 13 January 2025
Isn't this one of those problems that switching to Linux is supposed to fix?
begueradj 13 January 2025
Oops, this is not valid.