Lift subjects from images in your app

Description: Discover how you can easily pull the subject of an image from its background in your apps. Learn how to lift the primary subject or to access the subject at a given point with VisionKit. We’ll also share how you can lift subjects using Vision and combine that with lower-level frameworks like Core Image to create fun image effects and more complex compositing pipelines. For more information about the latest updates to VisionKit, check out “What’s new in VisionKit." And for more information about person segmentation in images, watch "Explore 3D body pose and person segmentation in Vision" from WWDC23.

Speakers: Lizzy Board and Saumitro Dasgupta

What is a subject?

A subject is the foreground object, or objects, of a photo. It is not always a person or a pet; it can be anything from a building to a plate of food to a pair of shoes.

Supporting frameworks

VisionKit

  • Allows you to very easily adopt system-like subject lifting behavior, right out of the box.
  • Lets you easily recreate the subject lifting UI that we all know and love, with just a few lines of code.
  • Exposes some basic information about these subjects.
  • Everything happens out of process, which has performance benefits but means the image size is limited.

Vision

  • Lower-level framework that doesn't have out-of-the-box UI.
  • Supports multiple input sources and higher image resolutions.
  • Good for more advanced image editing pipelines.

Subject lifting in VisionKit

iOS:

  1. Initialize an ImageAnalysisInteraction.
  2. Add the interaction to a view containing the image (this can be a UIImageView, but doesn't have to be), as sketched below.
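
A minimal sketch of the iOS setup, assuming imageView is the UIImageView (or other view) that displays the image:

import UIKit
import VisionKit

// Create the interaction and attach it to the view showing the image.
let interaction = ImageAnalysisInteraction()
imageView.addInteraction(interaction)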

macOS:

  1. Initialize an ImageAnalysisOverlayView.
  2. Add the view as a subview of the NSView containing the image.
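
A similar sketch on macOS, assuming imageView is an NSImageView; trackingImageView keeps the overlay in sync with the displayed image:

import AppKit
import VisionKit

// Create the overlay view and place it on top of the image view.
let overlayView = ImageAnalysisOverlayView()
overlayView.frame = imageView.bounds
overlayView.autoresizingMask = [.width, .height]
overlayView.trackingImageView = imageView
imageView.addSubview(overlayView)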

A preferred interaction type can be set on the ImageAnalysisInteraction and ImageAnalysisOverlayView to choose which interactions should be supported.

  • The default is .automatic, which mirrors system behavior and supports subject lifting, Live Text, and data detectors.
  • The new .imageSubject type supports only subject lifting, as shown below.
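
Setting the preferred interaction type is a single assignment (shown here for the interaction; the overlay view has the same property):

interaction.preferredInteractionTypes = .imageSubject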

Manually analyzing images

// configuration is an ImageAnalyzer.Configuration describing the analysis types to run
let analyzer = ImageAnalyzer()
let analysis = try? await analyzer.analyze(image, configuration: configuration)

Once the analysis is assigned to the interaction (or overlay view), its subjects property contains the image's Subjects, each of which provides an image and its bounds. There is also a highlightedSubjects property containing the currently highlighted subjects, which can be changed programmatically.
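
A small sketch of wiring this up, assuming it runs in an async context: assign the analysis to the interaction, then await its subjects:

interaction.analysis = analysis

for subject in await interaction.subjects {
  print(subject.bounds)
}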

Looking up a subject

let subject = try? await interaction.subject(at: point)

If there are no subjects at that point, this method will return nil.
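
For example, in a tap handler you could look up the subject under the touch and highlight it programmatically (the point's coordinate space and the gesture handling are assumptions here, not from the session):

if let subject = try? await interaction.subject(at: point) {
  interaction.highlightedSubjects.insert(subject)
}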

Generating subject images

// For an image of a single Subject (an async, throwing accessor):
let subjectImage = try? await subject.image

// For an image composed of multiple Subjects:
let allSubjects = await interaction.subjects
let compositeImage = try? await interaction.image(for: allSubjects)

Subject lifting in Vision

Different kinds of APIs

Saliency:

  • VNGenerateAttentionBasedSaliencyImageRequest
  • VNGenerateObjectnessBasedSaliencyImageRequest

Saliency requests, like the ones for attention and objectness, are best used for coarse, region-based analysis.

The generated saliency maps are at a fairly low resolution and as such, not suitable for segmentation. Instead, you could use the salient regions for tasks like auto-cropping an image.

Person segmentation:

  • VNGeneratePersonSegmentationRequest

It shines at producing detailed segmentation masks for people in the scene. Use this if you specifically want to focus on segmenting people.

Person instance segmentation:

  • VNGeneratePersonInstanceMaskRequest NEW

The new person instance segmentation API takes things further by providing a separate mask for each person in the scene.

See more in the session "Explore 3D body pose and person segmentation in Vision".

Subject lifting:

  • VNGenerateForegroundInstanceMaskRequest NEW

The newly introduced subject lifting API is "class agnostic": any foreground object, regardless of its semantic class, can potentially be segmented.

Concepts

You start with an input image. The subject lifting request processes this image and produces a soft segmentation mask at the same resolution. Taking this mask and applying it to the source image results in the masked image.

Each distinct segmented object is referred to as an instance.

Vision also provides you with pixelwise information about these instances. This instance mask maps pixels in the source image to their instance index. The zero index is reserved for the background, and then each foreground instance is labeled sequentially, starting at 1.

The ordering of the IDs is not guaranteed.
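
As a hedged sketch of how this mapping can be used, here is one way to read the instance index under a normalized point, for example from a tap. The helper name, the top-left origin of the point, and the single-component 8-bit pixel layout are assumptions, not part of the session:

import CoreGraphics
import CoreVideo
import Vision

// Returns the instance index (0 = background) at a point given in
// normalized coordinates with a top-left origin.
func instanceIndex(at point: CGPoint, in observation: VNInstanceMaskObservation) -> Int {
  let mask = observation.instanceMask
  CVPixelBufferLockBaseAddress(mask, .readOnly)
  defer { CVPixelBufferUnlockBaseAddress(mask, .readOnly) }

  let width = CVPixelBufferGetWidth(mask)
  let height = CVPixelBufferGetHeight(mask)
  let bytesPerRow = CVPixelBufferGetBytesPerRow(mask)
  let x = min(max(Int(point.x * CGFloat(width)), 0), width - 1)
  let y = min(max(Int(point.y * CGFloat(height)), 0), height - 1)

  // Assumes a one-component, 8-bit mask where each pixel stores the instance index.
  let base = CVPixelBufferGetBaseAddress(mask)!
  return Int(base.load(fromByteOffset: y * bytesPerRow + x, as: UInt8.self))
}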

Generate a masked image

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let result = request.results?.first else {
  return
}
// The masked image is returned as a CVPixelBuffer.
let output = try result.generateMaskedImage(
  ofInstances: result.allInstances,
  from: handler,
  croppedToInstancesExtent: false)

This is a resource-intensive task and is best deferred to a background thread so as not to block the UI.
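
A minimal sketch of that, assuming image is a CGImage and using a detached task that hops back to the main actor for UI updates:

Task.detached(priority: .userInitiated) {
  let request = VNGenerateForegroundInstanceMaskRequest()
  let handler = VNImageRequestHandler(cgImage: image)
  try handler.perform([request])
  guard let result = request.results?.first else { return }
  let maskedImage = try result.generateMaskedImage(
    ofInstances: result.allInstances,
    from: handler,
    croppedToInstancesExtent: false)
  await MainActor.run {
    // Hand maskedImage (a CVPixelBuffer) to the UI here.
    _ = maskedImage
  }
}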

Working with masks

let mask = try result.generateScaledMaskForImage(
  forInstances: result.allInstances,
  from: handler)

The mask generated above is perfectly suited for use with Core Image. Vision, much like VisionKit, produces SDR output, but performing the masking in Core Image preserves the high dynamic range of the input.

import CoreImage
import CoreImage.CIFilterBuiltins

func apply(mask: CIImage, toImage image: CIImage) -> CIImage {
  let filter = CIFilter.blendWithMask()
  filter.inputImage = image
  filter.maskImage = mask
  filter.backgroundImage = CIImage.empty()
  return filter.outputImage!
}
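
A brief sketch of tying the pieces together, assuming mask is the pixel buffer generated above and inputImage is the source CIImage:

let maskImage = CIImage(cvPixelBuffer: mask)
let composited = apply(mask: maskImage, toImage: inputImage)

// Render for display, for example to a CGImage.
let context = CIContext()
let cgImage = context.createCGImage(composited, from: composited.extent)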

See more in the session "Support HDR images in your app".

Demo app / putting it all together

Between 13:22 and 18:04 in the session, the code snippets above (and a few more) are put together to build a visual effects demo app.

Missing anything? Corrections? Contributions are welcome 😃


Written by

Morten Bjerg Gregersen

Hi 👋 I am Morten - I live in Denmark 🇩🇰 I love creating iOS and macOS apps, and open source some of my Swift code 🚀