Chrome has been very conservative about enabling hardware acceleration features on Linux. Look under chrome://gpu to see a list of what is enabled. It is possible to force some of them on via command-line flags. That said, this is only part of the story.
There are different kinds of transistors that can be used when making chips: slow but efficient ones, and fast but leaky ones. Getting an efficient design is a balancing act where you limit use of the fast transistors to only the most performance-critical areas. AMD historically has used these high-performance, leaky transistors more liberally, which enabled it to reach some of the highest clock frequencies in the industry. Apple, on the other hand, designed for power efficiency first, so its use of such transistors was far more conservative. Rather than use faster transistors, Apple restricted itself to the slower ones but used more of them, resulting in wider core designs that have higher IPC and match the performance of some of the best AMD designs while using less power. AMD recently adopted some of Apple’s restraint in the Zen 5c variant of its architecture, but that is still a modification of a design originally built around heavy use of leaky transistors for high clock speeds:
https://www.tomshardware.com/pc-components/cpus/amd-dishes-m...
The resulting clock speeds of the M4 and the Ryzen AI 340 are surprisingly similar, with the M4 at 4.4GHz and the Ryzen AI 340 at 4.8GHz. That said, the same chip is used in the Ryzen AI 350, which reaches 5.0GHz.
There is also the memory used. Apple uses LPDDR5X on the M4, which runs at lower voltages and has tweaks that sacrifice some latency for a big savings in power. It is also soldered on or close to the SoC, which reduces the power needed to transmit data to and from the CPU. AMD uses either LPDDR5X or DDR5. I have not kept track of the exact difference in power usage between DDR versions and their LP variants, but I expect the LP variants to use half the power, if not less. Memory in many machines can use 5W or more just at idle, so cutting memory power usage can make a big impact.
Additionally, x86 has a decode penalty compared to other architectures. It is often stated that this is negligible, but those statements date to the P4 era, when a single core used ~100W and a ~1W power draw for the decoder really was negligible. Fast forward to today, when x86 is more complex than ever and people want cores to use 1W or less, and the decode penalty is more relevant. ARM, with fixed-length instructions and a fraction of the instruction count, uses less power to decode, since its decoder is simpler. To those who feel compelled to reply with the mantra that this is negligible: please reread what I wrote about it being negligible when cores use 100W each and about how the instruction set is more complex now. Let’s say that the instruction decoder uses 250mW for x86 and 50mW for ARM. That 200mW difference is not negligible when you want sub-1W core power usage; it is at least 20% of the power available to the core. It does become negligible when your cores each draw 10W, as in AMD’s desktops.
Apple also made the design choice to build its own NAND flash controller and integrate it into its SoC, which provides further power savings by eliminating some of the overhead of an external NAND flash controller. Being integrated into the SoC means the signals do not need to travel very far, which saves energy versus more standard designs that assume a long run across a PCB must be supported.
Finally, Apple introduced timer coalescing in Mavericks, which made a fairly big impact:
https://www.imore.com/mavericks-preview-timer-coalescing
On Linux, coalescing is achieved by adding a default slack of 50 microseconds to traditional Unix timers. This can be changed per thread, but I have never seen anyone actually do it:
https://man7.org/linux/man-pages/man2/pr_set_timerslack.2con...
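To illustrate, here is a minimal sketch (assuming Linux and glibc; the 100ms value is an arbitrary example) of how a background service could raise its own timer slack so the kernel has more freedom to coalesce its wakeups:

  #include <stdio.h>
  #include <sys/prctl.h>

  int main(void) {
      /* Allow up to 100ms of slack on this thread's timers
         (nanosleep, poll, epoll_wait, etc.), giving the kernel
         room to coalesce our wakeups with others. */
      if (prctl(PR_SET_TIMERSLACK, 100UL * 1000 * 1000, 0, 0, 0) != 0)
          perror("prctl(PR_SET_TIMERSLACK)");

      /* ... event loop with relaxed timing requirements ... */
      return 0;
  }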
That was done to retroactively support coalescing in UNIX/Linux APIs that did not support it (which was all of them). However, Apple created its own event-handling API, Grand Central Dispatch, which exposes coalescing in a very obvious way via the leeway parameter while leaving the UNIX/BSD APIs untouched, and this is now the preferred way of doing event handling on macOS:
https://developer.apple.com/documentation/dispatch/1385606-d...
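For comparison, here is a sketch of a GCD timer on macOS, where the leeway is a parameter you cannot miss (the 60-second interval and 10-second leeway are arbitrary example values):

  #include <dispatch/dispatch.h>

  int main(void) {
      dispatch_source_t timer = dispatch_source_create(
          DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
          dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_BACKGROUND, 0));

      /* Fire roughly every 60 seconds, and tell the system it may
         defer each firing by up to 10 seconds to coalesce it with
         other wakeups. */
      dispatch_source_set_timer(timer,
          dispatch_time(DISPATCH_TIME_NOW, 60 * NSEC_PER_SEC),
          60 * NSEC_PER_SEC,   /* interval */
          10 * NSEC_PER_SEC);  /* leeway   */

      dispatch_source_set_event_handler(timer, ^{
          /* periodic background work */
      });
      dispatch_resume(timer);

      dispatch_main();  /* never returns */
  }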
Thus, a developer of a background service on macOS that can tolerate long delays can easily set the leeway to multiple seconds, which essentially guarantees the timer will be coalesced with some other timer. A developer of a similar service on Linux could do the same, but probably will not, since the timer slack is something the developer has to go out of his way to modify, rather than something in his face like the leeway parameter in Apple’s API. I did check how this works on Windows: it supports a similar per-timer tolerance via SetCoalescableTimer(), but the developer has to opt in by using it in place of SetTimer(), and it is not clear there is much incentive to do so. To circle back to Chrome, it uses libevent, which uses the BSD kqueue API on macOS. As far as I know, kqueue does not take advantage of timer coalescing on macOS, so the Mavericks changes would not benefit Chrome very much, and the improvements that do benefit Chrome are elsewhere. However, I thought the timer coalescing work was worth mentioning given that it applies to many other things on macOS.
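For completeness, here is a sketch of the Windows side, which shows the opt-in nature: you have to reach for SetCoalescableTimer() instead of SetTimer() to specify a tolerance at all (the 60-second period and 5-second tolerance are arbitrary example values):

  #define _WIN32_WINNT 0x0602  /* SetCoalescableTimer needs Windows 8+ */
  #include <windows.h>

  static void CALLBACK on_timer(HWND hwnd, UINT msg, UINT_PTR id, DWORD time) {
      /* periodic background work */
  }

  int main(void) {
      /* Like SetTimer(), but the last argument is a tolerance in
         milliseconds that lets Windows delay the timer in order to
         coalesce it with other wakeups. */
      SetCoalescableTimer(NULL, 0, 60 * 1000, on_timer, 5 * 1000);

      MSG msg;
      while (GetMessage(&msg, NULL, 0, 0) > 0) {
          TranslateMessage(&msg);
          DispatchMessage(&msg);  /* dispatches WM_TIMER to on_timer */
      }
      return 0;
  }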