What can macOS Sonoma find in my images?

When it released macOS Ventura, Apple announced that it had extended Spotlight search to include “information from images in Photos, Messages, Notes and the Finder to enable searching by locations, scenes, or even things in the images, like text, a dog or a car”, and those features have been further enhanced in Sonoma. Image analysis is controversial because of its potential for abuse and the sensitivity of image content. Claims are made periodically that Apple secretly exfiltrates data about images stored on our Macs or devices, despite the lack of any evidence. This article looks at the metadata derived from image analysis by macOS, and how it can and can’t be accessed.

Spotlight and its metadata can readily become confusing, and search results may appear inexplicable. Findings reported here have been reproduced on two Macs running macOS 14.0, one an iMac Pro, the other a Mac Studio M1 Max. Both were set with Ask Siri turned off in Siri & Spotlight settings, all Search results there were enabled, and images were stored in folders that hadn’t been added to Spotlight Privacy.

Does macOS analyse and classify images?

One simple test is to use the mdfind command to search for images indexed by Spotlight, using a term appropriate for objects within those images. As my suite of test images contains several featuring countryside views, one test might be
mdfind "(** == 'cattle*'cdw) && (kMDItemContentTypeTree == 'public.image'cd)"

This should return a list of files of different image types with search keys whose values contain words starting with the text cattle, case-insensitive. Because the search term can appear in any of the search keys, this should find files whose names contain that text, as well as those where it appears in keywords or other text metadata, including metadata derived from image analysis and classification.
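
The same search can be run from Swift rather than Terminal using NSMetadataQuery. What follows is only a minimal sketch of that approach, not the method used for these tests: it assumes the same raw query string is accepted by NSPredicate(fromMetadataQueryString:), and prints the path of each matching image once gathering has finished.
import Foundation

let query = NSMetadataQuery()
// Use the same Spotlight query syntax as the mdfind command above.
query.predicate = NSPredicate(fromMetadataQueryString:
    "(** == 'cattle*'cdw) && (kMDItemContentTypeTree == 'public.image'cd)")
query.searchScopes = [NSMetadataQueryLocalComputerScope]

var token: NSObjectProtocol?
token = NotificationCenter.default.addObserver(
    forName: .NSMetadataQueryDidFinishGathering,
    object: query, queue: .main) { _ in
    query.disableUpdates()
    for case let item as NSMetadataItem in query.results {
        // Print the path of each matching file, as mdfind does.
        print(item.value(forAttribute: NSMetadataItemPathKey) as? String ?? "")
    }
    query.stop()
    if let token = token { NotificationCenter.default.removeObserver(token) }
    exit(0)
}
_ = query.start()
RunLoop.main.run()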

Among the results listed were two files
/Users/hoakley/Pictures/1photo2process/BeltedGalloways1.jpg
/Users/hoakley/Pictures/1photo2process/HantsDownsLambs.jpg
with no other metadata that was likely to contain the text cattle. To check in full detail, all keyed metadata for those was then exposed using the mdls command, such as
mdls /Users/hoakley/Pictures/1photo2process/BeltedGalloways1.jpg

Nowhere in the full listing of keyed metadata was the text cattle found. Specifically, keys with names suggesting they might contain metadata obtained from image analysis and classification failed to show any results, returning (null) in every case:
kMDItemMediaAnalysisLastAttempt = (null)
kMDItemPhotosCharacterRecognitionAnalysisVersion = (null)
kMDItemPhotosSceneAnalysisVersion = (null)
kMDItemPhotosSceneClassificationIdentifiers = (null)
kMDItemPhotosSceneClassificationLabels = (null)
kMDItemPhotosSceneClassificationMediaTypes = (null)
kMDItemPhotosSceneClassificationSynonyms = (null)
kMDItemPhotosSceneClassificationSynonymsCounts = (null)
kMDItemPhotosSceneClassificationTypes = (null)

Suspecting this could be a limitation of the mdls command, I then attempted to retrieve the values for those keys using the Swift API call
NSMetadataItem.value(forAttribute: String)
with the same result. The metadata for those keys exists, but in every case appears to be empty.
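
For anyone who wants to repeat that check, here’s a minimal Swift sketch using NSMetadataItem, with the path of one of the two test images above and a selection of those keys. On both test systems, each value came back nil, matching the (null) results from mdls.
import Foundation

let url = URL(fileURLWithPath:
    "/Users/hoakley/Pictures/1photo2process/BeltedGalloways1.jpg")
let keys = ["kMDItemPhotosSceneClassificationLabels",
            "kMDItemPhotosSceneClassificationSynonyms",
            "kMDItemPhotosSceneClassificationIdentifiers",
            "kMDItemPhotosSceneClassificationTypes"]

if let item = NSMetadataItem(url: url) {
    for key in keys {
        // value(forAttribute:) returns Any?; each of these keys returns nil here.
        print("\(key) = \(item.value(forAttribute: key) ?? "(null)")")
    }
} else {
    print("no Spotlight metadata available for \(url.path)")
}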

These tests confirm that macOS Sonoma does analyse and classify images that are accessible to Spotlight search, and can reveal images that meet search criteria for objects found within them, but that it’s unable to return lists of metadata values for those particular keys.

What content is obtained from images?

Examining search results from a broad range of images shows that indexed metadata includes text extracted from images using Live Text, together with objects and other content identified within an image, such as cow and blue sky.

Object classification is loose, and far from accurate or specific. The two images that Spotlight considered to contain cattle give a good idea of this.

In the first image, cattle is an accurate object descriptor, as these are Belted Galloway cattle.

In the second, cattle is inaccurate, as these are Hampshire Down sheep, not cattle at all.

Unlike Visual Look Up, which may attempt specific identification of objects against external databases, this form of object classification remains generic. Searching for ‘Galloway’ or ‘Hampshire’ won’t find either of those images, and they aren’t terms likely to be recognised in object search.

Why doesn’t Spotlight return a vocabulary for matched objects?

To understand why Spotlight doesn’t provide a vocabulary of matched terms for recognised objects, it’s useful to consider how it treats text extracted from a text file, or recognised in an image.

When Spotlight indexes text content, it’s almost certainly not storing that as text. One simple approach to word search is to convert each word in the text into a hash, and to store those hashes in a dictionary. When searching for a word, the word to be found is hashed, and the dictionary searched for that hash. If you were to ask the search engine for that dictionary, all you would get is a long list of hashes, as that’s the only representation the engine has for those words.
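
As a toy illustration of that, assume nothing about Spotlight’s real index format: the Swift sketch below reduces each word to a hash and stores only hashes against file paths, so it can answer which files contain a given word, but can’t give back a readable vocabulary.
import Foundation

// A deliberately simplified word index: words are stored only as hashes.
struct WordIndex {
    private var index: [Int: Set<String>] = [:]

    // A real index would use a stable hash; Swift's hashValue is only
    // consistent within a single run, which is enough for this toy.
    private func hash(_ word: String) -> Int {
        return word.lowercased().hashValue
    }

    mutating func add(words: [String], from path: String) {
        for word in words {
            index[hash(word), default: []].insert(path)
        }
    }

    func paths(matching word: String) -> Set<String> {
        return index[hash(word)] ?? []
    }

    // The only 'vocabulary' the index can offer is its list of hashes.
    var storedHashes: [Int] { return Array(index.keys) }
}

var idx = WordIndex()
idx.add(words: ["belted", "galloway", "cattle"], from: "BeltedGalloways1.jpg")
print(idx.paths(matching: "Cattle"))   // ["BeltedGalloways1.jpg"]
print(idx.storedHashes)                // just numbers; the words are gone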

Conventional word search is simpler than image search, as the result for words is binary: they’re either found or they’re not. Object search in images has to deal with probabilities.

Objects identified in an image are generally assessed using metrics derived from machine learning, and might consist of multi-dimensional arrays of numbers. For the sake of this illustration, we’ll consider just one of those, which might equate to a measure of how strongly an object resembles a cow or other type of cattle. When that value is 1.0, the object is almost certainly cattle, and when it’s 0.0 it almost certainly isn’t.

When VisionKit analyses the objects found in the first of my two images, it might find that the measure of being cattle-like is 0.9, while objects in the second image might have lower scores, with one as high as 0.8. Run mdfind to look for the text cattle, and both images might be deemed to have sufficiently high scores to match that word. Try another search using the term cow, though, and the scores of objects in the second image might be deemed insufficient to match that term.

The only way then to build a vocabulary of recognised objects would be to search using many different terms, and for each to assess whether its score is sufficient for that term to be an appropriate object descriptor. That would only be feasible if there were few images to be analysed.
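
To make that concrete, here’s a toy Swift sketch with entirely made-up numbers, as the real scores produced by image analysis aren’t exposed: each image carries scores for a few labels, a term only matches when its score reaches a threshold, and recovering a vocabulary means probing a list of candidate terms one by one.
import Foundation

// Hypothetical classifier scores for the two test images; purely illustrative.
let scores: [String: [String: Double]] = [
    "BeltedGalloways1.jpg": ["cattle": 0.9, "cow": 0.9],
    "HantsDownsLambs.jpg":  ["cattle": 0.8, "cow": 0.6]
]

// Return the images whose score for a term reaches the threshold.
func images(matching term: String, threshold: Double = 0.75) -> [String] {
    return scores.filter { ($0.value[term] ?? 0.0) >= threshold }.map { $0.key }
}

print(images(matching: "cattle"))   // both images
print(images(matching: "cow"))      // only BeltedGalloways1.jpg

// Building a vocabulary means probing every candidate term in turn,
// which rapidly becomes impractical for more than a handful of images.
let candidates = ["cattle", "cow", "sheep", "car", "tree"]
print(candidates.filter { !images(matching: $0).isEmpty })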

Could object recognition be used to detect banned images?

Over two years ago, Apple expressed its intention to check images that were to be uploaded for sharing, to determine whether they contained Child Sexual Abuse Material (CSAM). After a period of great controversy, Apple has since stated that it won’t progress that proposal, but many remain convinced that it has already introduced such checks, and that ‘Apple is checking images on our Macs’, presumably using object recognition in VisionKit.

A little exploration using mdfind or Spotlight search will demonstrate the high rates of false positives and negatives that result from this type of generalised object classification. Although it’s fairly reliable at distinguishing cattle from parked cars, and we might find it amusing when it returns a photo of a friend matched to the term goat, it isn’t at all suitable even for screening images to determine whether they’re likely to contain CSAM.

Apple’s original proposal was based on image recognition specifically developed to distinguish CSAM from other images. One of the strongest arguments put against its use was its poor performance, with numerous examples of false positives being demonstrated. When a specialist search engine is inadequate, a general object classifier could only be far worse at the task.

Summary

macOS Sonoma does analyse images for text content (‘Live Text’) and classify objects within them, when allowed to by Siri & Spotlight settings.
Spotlight and the mdfind command tool can find images whose objects match recognised search terms.
Unlike explicit text-based metadata, object search can’t be used to discover the vocabulary used in search terms, except on a very small number of images.
Generalised object classification as currently used in Sonoma isn’t suitable for screening images for specific content such as CSAM.
Screening or searching for CSAM images would require a specialist analysis and recognition system. As Apple’s previous proposal demonstrated, such a system doesn’t yet exist, at least not in macOS.

I’m very grateful to John for raising this, and for suggesting its testing. I’m only sorry that it didn’t turn out the way that he expected.