Last Week on My Mac: Wobbling plates and bfloat16 support

Few acts can excite an audience as much as the plate-spinner darting between crockery threatening to wobble out of control and smash on the stage below. In last week’s plate-spinning act here, I’ve been developing my new security watchkeeper Skint, looking at troubled relationships between PDF, Live Text and Spotlight, and working out how to explore new features in the instruction set of M2 and M3 CPUs. It’s the last of those that has started to wobble badly, and looks like it might smash on the stage in front of me.

This centres on a new format of floating-point number, known as bfloat16. It emerged from Google’s Brain project in around 2018, too late to be built into the first generation of Apple silicon chips, the M1 family. However, Apple opted for a more recent version of the Arm instruction set for the CPUs in its M2 and M3 chips, and they now support the format and some important arithmetic operations on bfloat16 numbers.

Like the better-established float16 floating-point format (IEEE 754 binary16), bfloat16 values use only 16 bits of storage. This makes them more compact to store in bulk, and allows them to be processed more quickly by vector-processing units such as those built into both the Performance and Efficiency cores in Apple silicon chips.

The Arm NEON vector-processing unit is designed to perform its instructions on multiple values at the same time, in Single-Instruction, Multiple-Data (SIMD) processing. Cores perform arithmetic instructions on data loaded into their registers, typically 128 bits in size. Each register can hold two 64-bit floating-point numbers, four 32-bit, or eight 16-bit. Using 16-bit floating-point numbers should therefore double throughput compared with 32-bit, and quadruple it compared with 64-bit. When you can get away with 16-bit data, there are huge speed gains to be had.
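To make those lane counts concrete, here’s a minimal sketch using the SIMD types in Swift’s standard library (Float16 requires Apple silicon); whether the compiler actually maps these onto NEON registers is down to code generation, so treat it as illustration rather than measurement.

```swift
// Each of these values fits a single 128-bit vector register:
let doubles = SIMD2<Double>(1.0, 2.0)             // 2 lanes of 64 bits
let floats  = SIMD4<Float>(1.0, 2.0, 3.0, 4.0)    // 4 lanes of 32 bits
let halves  = SIMD8<Float16>(repeating: 0.5)      // 8 lanes of 16 bits

// A single vector operation acts on every lane at once.
let sums = floats + SIMD4<Float>(repeating: 10.0)
print(sums)             // SIMD4<Float>(11.0, 12.0, 13.0, 14.0)
print(halves + halves)  // eight lanes of 1.0
```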

Existing float16 numbers can be used for this, but their small size brings significant disadvantages. Most importantly, they can only represent numbers with magnitudes between about 6.1 x 10^-5 and 65,504. Many calculations would therefore need their numbers scaled to fit within that range, imposing further overhead. Neither is it simple to convert between float32 (the 32-bit floating-point format) and float16.
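Swift’s Float16 type (added in Swift 5.3, on Apple silicon) makes those limits easy to demonstrate:

```swift
// float16 runs out of range quickly: 65,504 is the largest finite value.
let big: Float16 = 65504
print(Float16.greatestFiniteMagnitude)  // 65504.0
print(Float16.leastNormalMagnitude)     // about 6.1e-05
print(big + 32)                         // inf, as the result overflows
```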

bfloat16 covers the same range as 32-bit floating-point numbers, with magnitudes from about 1.2 x 10^-38 to 3.4 x 10^38, which is more than enough for most calculations, but at reduced precision. Conversion between float32 and bfloat16 is also far quicker, as bfloat16 uses the same 8-bit exponent as float32 and is in effect just its top 16 bits. Although originally designed for machine learning within the broad field of AI, bfloat16 may prove a faster and more efficient solution in other applications such as display coordinates, where float32 has been standard.
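As Swift has no BFloat16 type, this sketch of that conversion works on raw bit patterns; the function names are my own invention:

```swift
// Convert float32 to bfloat16 bits by keeping the top 16 bits,
// with round-to-nearest-even. (NaN handling is omitted for brevity.)
func toBFloat16Bits(_ x: Float) -> UInt16 {
    let bits = x.bitPattern
    let rounded = bits &+ 0x7FFF &+ ((bits >> 16) & 1)
    return UInt16(truncatingIfNeeded: rounded >> 16)
}

// Converting back is simpler still: the bfloat16 bits become
// the top half of a float32, with zeroes below.
func fromBFloat16Bits(_ b: UInt16) -> Float {
    Float(bitPattern: UInt32(b) << 16)
}

let approx = fromBFloat16Bits(toBFloat16Bits(.pi))
print(Float.pi, approx)   // 3.1415925 3.140625, same range but less precision
```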

With a shiny new MacBook Pro M3 Pro sat next to me, this seemed an ideal opportunity to look at both the performance of bfloat16 arithmetic and the effects of its reduced precision, so I put that plate on its pole and set it spinning.

Although support for the older float16 format was added in Swift 5.3, bfloat16 doesn’t appear to be on its horizon yet. In any case, to ensure that my code did exactly what I intended, I realised that I’d have to write it in assembly language. The first step was to see whether the new instructions were supported by the LLVM back end of Xcode. Although Apple details how you can check whether any given Arm cores support the bfloat16 format, there’s no mention of support in its current advice to those writing assembly code, merely links to Arm’s reference documentation.
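The hardware check itself is straightforward, though. Apple’s documentation lists a sysctl flag for each optional instruction set feature; this sketch assumes the BF16 flag is hw.optional.arm.FEAT_BF16, which should return 1 on M2 and M3 cores and 0 on M1:

```swift
import Darwin

// Query the sysctl that reports whether the CPU implements
// the BF16 instructions (1 if present, 0 or missing if not).
func hasFeatBF16() -> Bool {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    let result = sysctlbyname("hw.optional.arm.FEAT_BF16", &value, &size, nil, 0)
    return result == 0 && value == 1
}

print(hasFeatBF16() ? "bfloat16 instructions supported" : "not supported")
```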

The next step was to try one of the new instructions, BFCVT, which converts float32 to bfloat16. That was flatly rejected by Xcode, which didn’t recognise the instruction. Maybe I needed to add an option to invoke support. Eventually I discovered a list of almost 1,000 options for the clang LLVM compiler, none of which made any reference to the ARMv8.6A instruction set used by M2 and M3 CPU cores. My plate was starting to wobble badly.

Apple has, though, introduced support for the bfloat16 format, as explained by Denis Vieriu in his presentation at WWDC 2023. But that isn’t implemented in the CPU cores, despite their instruction set support: it’s in the Metal machine learning APIs exposed through the Metal Performance Shaders framework, run on the GPU.

There’s also no sign yet of bfloat16 anywhere in Apple’s Accelerate framework, although as that seldom documents which hardware its functions run on, that would only lead to further questions. I think it’s probably best to catch this plate before it falls, and put it back on the stack for the future.