Making Apple silicon faster: 2 Multithreading

In the first of this series of articles, I drew the distinction between processes, threads and tasks, and introduced how threads are allocated to cores to be run. This article considers in more detail how macOS allocates threads to CPU cores in Apple silicon chips. It takes as its model queues of threads managed by an OperationQueue, built on the Dispatch API, with code such as:

import Foundation

// the maximum number of operations to run concurrently,
// here set to the number of cores available
let numProcs = ProcessInfo.processInfo.activeProcessorCount

let processingQueue: OperationQueue = {
    let result = OperationQueue()
    return result
}()

processingQueue.maxConcurrentOperationCount = numProcs
processingQueue.qualityOfService = QualityOfService.userInteractive
processingQueue.addOperation {
    // run code
}

This adds a chunk of code to be run as a thread to an Operation Queue, with a set maximum number of concurrent operations, and at a set Quality of Service (QoS). Operations are dispatched in first-in, first-out order, with threads of higher QoS given preference, to the next available core of the appropriate type, performance (P) or efficiency (E).
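Much the same can be done directly with the Dispatch API. This is only a minimal sketch for comparison, not the code used here, and the queue label is arbitrary:

import Dispatch

// a concurrent Dispatch queue whose work items run at userInteractive QoS
let dispatchQueue = DispatchQueue(label: "processing",
                                  qos: .userInteractive,
                                  attributes: .concurrent)

dispatchQueue.async {
    // run code
}

Unlike an OperationQueue, a concurrent DispatchQueue has no direct equivalent of maxConcurrentOperationCount, which is one reason the OperationQueue pattern is convenient for these tests.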

QoS and core types

QoS is used to determine which types of core a thread can be allocated to. According to the rules currently used in macOS, threads with QoS between 17 and 33 are run preferentially on P cores, but when no P cores are available, they can be run on E cores. However, threads with a QoS of 9, background, are confined to E cores and are never run on P cores, even if the P cores are all idle and the E cores are busy. There's one unusual condition to bear in mind here: when Game Mode is in force, the E cores are reserved for use by the game being run.
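As an illustration of the difference that rule makes, the following sketch dispatches the same CPU-bound work at the two extremes of QoS; the spin() function is just an arbitrary load invented for this example:

import Foundation

// a purely CPU-bound workload, used only as an illustrative load
func spin() {
    var x = 0.0
    for i in 1...50_000_000 { x += sin(Double(i)) }
    print(x)
}

// at userInteractive QoS (raw value 33), these threads go to P cores first
for _ in 0..<4 {
    DispatchQueue.global(qos: .userInteractive).async { spin() }
}

// at background QoS (raw value 9), the same threads stay on the E cores
for _ in 0..<4 {
    DispatchQueue.global(qos: .background).async { spin() }
}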

This is clearest when averaged over time in the CPU history of Activity Monitor.

This is shown when tasks are run using an increasing number of threads at the highest QoS. Here, an original M1 chip was subjected to a series of loads from increasing numbers of CPU-intensive threads. Its two clusters, E0 and P0, are distinguished by the blue boxes. With 1-4 threads at high QoS (from the left), the load is borne entirely by the P0 cluster; with 5-8 threads, the E0 cluster takes its share.

In this sequence of tests at lowest QoS on an M1 Mac mini, I increased the number of identical test threads from 1 at the left to 4, 6 and 8 at the right. Over the first four tests, the height of the bars increases until they reach 100% on each of the four E cores, but the width of each test (the time taken) remains constant at around 5-6 squares until the number of threads exceeds the number of E cores. No matter how many threads an app runs at the lowest QoS, they're confined to the E cores alone.
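A test of this kind can be reproduced with a simple harness along these lines; this is a sketch only, with an arbitrary arithmetic loop standing in for the real test code:

import Foundation

// run `count` identical CPU-bound threads at the given QoS and time them
func runTest(count: Int, qos: QualityOfService) {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = count
    queue.qualityOfService = qos
    let start = Date()
    for _ in 0..<count {
        queue.addOperation {
            var x = 0.0
            for i in 1...100_000_000 { x += Double(i).squareRoot() }
            _ = x
        }
    }
    queue.waitUntilAllOperationsAreFinished()
    print("\(count) threads took \(Date().timeIntervalSince(start)) s")
}

// at .background QoS, the time taken should remain roughly constant
// until the number of threads exceeds the number of E cores
for count in [1, 4, 6, 8] { runTest(count: count, qos: .background) }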

Activity Monitor’s averaging obscures the movement of threads between cores, though; that can only be seen using the smaller sampling windows of the command tool powermetrics, or of Xcode's Instruments. Unfortunately, Instruments can have problems of its own, as it imposes its own load on the cores. However, the next charts confirm this basic rule.
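For example, powermetrics can sample CPU activity over short windows with an invocation along these lines (the interval and sample count here are arbitrary):

sudo powermetrics --samplers cpu_power -i 100 -n 50

That takes 50 samples at 100 ms intervals, reporting frequency and active residency for each core.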

When four threads are run at high QoS on the 6+6 CPU cores of an M3 Pro, each is given the equivalent of one P core.

When charted by core, at any moment in time, four cores are running at full active residency with those threads, although the threads are moved from core to core within the six P cores in that cluster. There’s no sign of any overflow onto the six E cores at the top, as the P cores can comfortably accommodate all four threads.

The same four threads run at low QoS, at lower frequency.

They’re confined to the E cores, though, and appear to move frequently between those six cores. A more detailed picture is available from powermetrics, which confirms that the threads are running on only four of the E cores at any time.

Fine control

macOS not only dispatches threads to CPU cores according to that general rule based on QoS, but exerts fine control over the core frequency, and handles some special circumstances. Two examples have come to light.

The M1 Pro is unusual in having only two E cores, but two four-core clusters of P cores. To compensate for having half the number of E cores of the basic M1, its two E cores are run at high frequency when both are fully loaded, delivering better performance than the four E cores in the basic M1 running at their normal frequency. In other M-series chips, E cores are often run at higher frequency when they're running high-QoS threads that have overspilled from the P cores because of the number of threads in the queue.

Normally, the pattern of allocation of threads run at high QoS is to progressively occupy P cores, one cluster at a time, until there are no P cores available, then to overspill onto E cores.

This diagram shows how high-QoS threads are normally allocated to the cores of the first P cluster in an M1 Max CPU, until it’s fully loaded with four cores each running one thread. Only when a fifth thread is added does macOS allocate that to a core within the second P cluster, which it then fills until there are four threads running in each of its two P clusters. Note also how those threads, with their high QoS, are run on P cores as long as they’re available, leaving the E cores to run background threads uninterrupted.

When running vDSP_mmul threads, though not with any other test, M1 Pro and Max chips (but not the M3 Pro) occupy their CPU cores in a completely different order, summarised in the diagram below.

Differences are obvious from two threads upwards, in that the second thread isn't allocated to a second core in the first P cluster, but to one in the second P cluster. Although run at high QoS, the third thread is then allocated to the E cluster, and moved between its two cores. Core allocation is thus made to balance the load across the three clusters, with similar numbers of P cores running threads in each of the two P clusters. Because these high-QoS threads are also allocated to E cores, the P cores are deliberately under-allocated: although all 8 threads could have been run on P cores alone, 2 of them are run on the E cores, leaving 2 P cores sitting idle. macOS has chosen not to make most use of the P cores when running high-QoS threads, an exceptional allocation strategy that is likely to result from use of an out-of-core processor such as the AMX.
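For reference, a single vDSP_mmul call looks like the sketch below; the matrix sizes are arbitrary, and in the tests each thread presumably ran many such multiplications. On Apple silicon, Accelerate routines like this are believed to be handed off to the AMX rather than run wholly on the CPU core itself.

import Accelerate

// multiply a 512 × 512 matrix A by a 512 × 512 matrix B, giving C (sizes arbitrary)
let n = 512
let a = [Float](repeating: 1.0, count: n * n)
let b = [Float](repeating: 2.0, count: n * n)
var c = [Float](repeating: 0.0, count: n * n)

// arguments: A, strideA, B, strideB, C, strideC, rows of A, columns of B, columns of A
vDSP_mmul(a, 1, b, 1, &c, 1, vDSP_Length(n), vDSP_Length(n), vDSP_Length(n))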

Summary

The general rule for macOS is that threads of high QoS are run preferentially on P cores; if no P core is available, then they will be run on available E cores. However, low QoS threads are confined to E cores and can’t overflow onto P cores.
This strategy involves control over core frequency too.
Fine controls operate in Game Mode to give a game exclusive access to the E cores, to compensate for core numbers in some chips, and to ensure access to out-of-core processors.
The goal is to deliver the best multithreading performance and optimum energy use.

In the next article, I will move from multithreading to Swift’s concurrency to discover what that offers.

Further reading

Dispatch (Apple)
Threading (Apple)
Concurrency (Apple)
Concurrency (Swift 6)