Apple Vision Framework - Text Extraction from Image

Does anyone know how to use Apple's Vision framework for real-time text recognition?

For live video, and for performance reasons, I'd prefer not to convert each CMSampleBuffer to a UIImage; instead, use something like the following to back a view with an AVCaptureVideoPreviewLayer:

import UIKit
import AVFoundation

class CameraFeedView: UIView {
    private var previewLayer: AVCaptureVideoPreviewLayer!

    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }

    init(frame: CGRect, session: AVCaptureSession, videoOrientation: AVCaptureVideoOrientation) {
        super.init(frame: frame)
        previewLayer = layer as? AVCaptureVideoPreviewLayer
        previewLayer.session = session
        previewLayer.videoGravity = .resizeAspect
        previewLayer.connection?.videoOrientation = videoOrientation
    }

    required init?(coder: NSCoder) {
        fatalError("init(coder:) has not been implemented")
    }
}

Once you have this, you can work on the live video data using Vision:

import UIKit
import AVFoundation
import Vision

class CameraViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    // Simple error type so setup failures can be thrown from setupAVSession().
    enum CameraError: Error {
        case setupFailed(reason: String)
    }

    private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedDataOutput", qos: .userInitiated,
                                                     attributes: [], autoreleaseFrequency: .workItem)
    private var drawingView: UILabel = {
        let view = UILabel(frame: UIScreen.main.bounds)
        view.font = UIFont.boldSystemFont(ofSize: 30.0)
        view.textColor = .red
        view.translatesAutoresizingMaskIntoConstraints = false
        return view
    }()
    private var cameraFeedSession: AVCaptureSession?
    private var cameraFeedView: CameraFeedView! // preview view backed by AVCaptureVideoPreviewLayer

    override func viewDidLoad() {
        super.viewDidLoad()
        do {
            try setupAVSession()
        } catch {
            print("setup av session failed: \(error)")
        }
    }

    func setupAVSession() throws {
        // Create a device discovery session for a wide-angle camera.
        let wideAngle = AVCaptureDevice.DeviceType.builtInWideAngleCamera
        let discoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [wideAngle], mediaType: .video, position: .back)

        // Select a video device and make an input.
        guard let videoDevice = discoverySession.devices.first else {
            throw CameraError.setupFailed(reason: "Could not find a wide angle camera device.")
        }

        guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
            throw CameraError.setupFailed(reason: "Could not create video device input.")
        }

        let session = AVCaptureSession()
        session.beginConfiguration()
        // Prefer 1080p capture, but fall back to the highest available quality if the camera cannot provide it.
        if videoDevice.supportsSessionPreset(.hd1920x1080) {
            session.sessionPreset = .hd1920x1080
        } else {
            session.sessionPreset = .high
        }

        // Add a video input.
        guard session.canAddInput(deviceInput) else {
            throw CameraError.setupFailed(reason: "Could not add video device input to the session.")
        }
        session.addInput(deviceInput)

        // Add a video data output.
        let dataOutput = AVCaptureVideoDataOutput()
        if session.canAddOutput(dataOutput) {
            session.addOutput(dataOutput)
            dataOutput.alwaysDiscardsLateVideoFrames = true
            dataOutput.videoSettings = [
                String(kCVPixelBufferPixelFormatTypeKey): Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)
            ]
            dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
        } else {
            throw CameraError.setupFailed(reason: "Could not add video data output to the session.")
        }
        let captureConnection = dataOutput.connection(with: .video)
        captureConnection?.preferredVideoStabilizationMode = .standard
        captureConnection?.videoOrientation = .portrait
        // Always process the frames.
        captureConnection?.isEnabled = true
        session.commitConfiguration()
        cameraFeedSession = session

        // Get the interface orientation from the window scene to set the proper video orientation on the capture connection.
        let videoOrientation: AVCaptureVideoOrientation
        switch view.window?.windowScene?.interfaceOrientation {
        case .landscapeRight:
            videoOrientation = .landscapeRight
        default:
            videoOrientation = .portrait
        }

        // Create and set up the video feed view.
        // setupVideoOutputView(_:) is a helper (not shown) that adds the preview view to the view hierarchy.
        cameraFeedView = CameraFeedView(frame: view.bounds, session: session, videoOrientation: videoOrientation)
        setupVideoOutputView(cameraFeedView)
        cameraFeedSession?.startRunning()
    }

The key pieces to implement once you've got an AVCaptureSession set up are the sample buffer delegate method and the request completion handler:

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // The orientation should describe how the buffer is oriented relative to the text being recognized.
        let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .down)

        let request = VNRecognizeTextRequest(completionHandler: textDetectHandler)

        do {
            // Perform the text-detection request.
            try requestHandler.perform([request])
        } catch {
            print("Unable to perform the request: \(error).")
        }
    }

    func textDetectHandler(request: VNRequest, error: Error?) {
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Map each observation to the string of its top VNRecognizedText candidate.
        let recognizedStrings = observations.compactMap { observation in
            return observation.topCandidates(1).first?.string
        }

        DispatchQueue.main.async {
            self.drawingView.text = recognizedStrings.first
        }
    }
}

Note: you will probably want to process the candidates and pick the one with the highest confidence rather than simply taking the first recognized string, but this is a proof of concept. You could also draw bounding boxes; the docs have an example of that.
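For example, here is a minimal sketch of a variant of textDetectHandler that keeps only high-confidence candidates before displaying one (the 0.5 threshold is an arbitrary assumption; VNRecognizedText exposes a confidence value between 0 and 1):

    func textDetectHandler(request: VNRequest, error: Error?) {
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

        // Keep only candidates whose confidence clears an (arbitrary) threshold,
        // then show the single most confident string across all observations.
        let minimumConfidence: VNConfidence = 0.5
        let bestCandidates = observations
            .compactMap { $0.topCandidates(1).first }
            .filter { $0.confidence >= minimumConfidence }
            .sorted { $0.confidence > $1.confidence }

        DispatchQueue.main.async {
            self.drawingView.text = bestCandidates.first?.string
        }
    }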

How do I extract specific text from an image in Swift, matching the text the user entered in a UITextField?

You should watch the WWDC 2019 session on text recognition in the Vision framework. Basically, starting with iOS 13, VNRecognizeTextRequest returns the recognized text along with its bounding box in the image.
The code can be something like this:

// `requests` is assumed to be a stored property, e.g. `var requests = [VNRequest]()`,
// passed to the VNImageRequestHandler when each image or frame is processed.
func startTextDetection() {
    let request = VNRecognizeTextRequest(completionHandler: self.detectTextHandler)
    request.recognitionLevel = .fast
    self.requests = [request]
}

private func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        fatalError("Received invalid observations")
    }
    for lineObservation in observations {
        guard let textLine = lineObservation.topCandidates(1).first else {
            continue
        }

        let words = textLine.string.split { $0.isWhitespace }.map { String($0) }
        for word in words {
            if let wordRange = textLine.string.range(of: word) {
                if let rect = try? textLine.boundingBox(for: wordRange)?.boundingBox {
                    // Here you can check if word == textField.text (see the sketch after this block).
                    // `rect` is in image coordinate space, normalized, with the origin in the bottom-left corner.
                }
            }
        }
    }
}
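For instance, a hypothetical helper you could call from the inner loop above for each word and its normalized rect (the searchField parameter, the imageSize argument, and the case-insensitive comparison are assumptions for illustration, not part of the original answer):

func highlightIfMatch(word: String, normalizedRect rect: CGRect, imageSize: CGSize, searchField: UITextField) {
    guard let query = searchField.text, !query.isEmpty,
          word.compare(query, options: [.caseInsensitive, .diacriticInsensitive]) == .orderedSame
    else { return }

    // Vision's normalized rect has its origin in the bottom-left corner;
    // flip the y-axis to get a UIKit-style rect (origin top-left), in image pixels.
    let imageRect = CGRect(x: rect.minX * imageSize.width,
                           y: (1 - rect.maxY) * imageSize.height,
                           width: rect.width * imageSize.width,
                           height: rect.height * imageSize.height)
    print("Found \"\(word)\" at \(imageRect)")
}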

Swift iOS - Vision framework text recognition and rectangles

VNImageRectForNormalizedRect returns a CGRect with the y coordinate flipped. (I suspect it was written for macOS, which uses a different coordinate system than iOS.)

Instead, I might suggest a version of boundingBox adapted from Detecting Objects in Still Images:

/// Convert Vision coordinates to pixel coordinates within image.
///
/// Adapted from the `boundingBox` method in
/// [Detecting Objects in Still Images](https://developer.apple.com/documentation/vision/detecting_objects_in_still_images).
/// This flips the y-axis.
///
/// - Parameters:
///   - boundingBox: The bounding box returned by the Vision framework.
///   - bounds: The bounds within the image (in pixels, not points).
///
/// - Returns: The bounding box in pixel coordinates, flipped vertically so (0, 0) is in the upper-left corner.
func convert(boundingBox: CGRect, to bounds: CGRect) -> CGRect {
    let imageWidth = bounds.width
    let imageHeight = bounds.height

    // Begin with input rect.
    var rect = boundingBox

    // Reposition origin.
    rect.origin.x *= imageWidth
    rect.origin.x += bounds.minX
    rect.origin.y = (1 - rect.maxY) * imageHeight + bounds.minY

    // Rescale normalized coordinates.
    rect.size.width *= imageWidth
    rect.size.height *= imageHeight

    return rect
}

Note that I changed the method name, because it does not return a bounding box; rather, it converts a normalized bounding box (with values in [0, 1]) to a CGRect in pixel coordinates. I also fixed a little bug in their boundingBox implementation. But it captures the main idea, namely flipping the y-axis of the bounding box.

Anyway, that yields the right boxes (result screenshot omitted).

E.g.

func recognizeText(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }
    let imageRequestHandler = VNImageRequestHandler(cgImage: cgImage, orientation: .up)

    let size = CGSize(width: cgImage.width, height: cgImage.height) // note, in pixels from `cgImage`; this assumes you have already rotated it, too
    let bounds = CGRect(origin: .zero, size: size)

    // Create a new request to recognize text.
    let request = VNRecognizeTextRequest { [self] request, error in
        guard
            let results = request.results as? [VNRecognizedTextObservation],
            error == nil
        else { return }

        let rects = results.map {
            convert(boundingBox: $0.boundingBox, to: CGRect(origin: .zero, size: size))
        }

        let string = results.compactMap {
            $0.topCandidates(1).first?.string
        }.joined(separator: "\n")

        // Draw the recognized boxes on top of the original image.
        let format = UIGraphicsImageRendererFormat()
        format.scale = 1
        let final = UIGraphicsImageRenderer(bounds: bounds, format: format).image { _ in
            image.draw(in: bounds)
            UIColor.red.setStroke()
            for rect in rects {
                let path = UIBezierPath(rect: rect)
                path.lineWidth = 5
                path.stroke()
            }
        }

        DispatchQueue.main.async { [self] in
            imageView.image = final
            label.text = string
        }
    }

    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try imageRequestHandler.perform([request])
        } catch {
            print("Failed to perform image request: \(error)")
            return
        }
    }
}


/// Scale and orient picture for the Vision framework.
///
/// From [Detecting Objects in Still Images](https://developer.apple.com/documentation/vision/detecting_objects_in_still_images).
///
/// - Parameter image: Any `UIImage` with any orientation.
/// - Returns: An image that has been rotated such that it can be safely passed to the Vision framework for detection.
func scaleAndOrient(image: UIImage) -> UIImage {

    // Set a default value for limiting image size.
    let maxResolution: CGFloat = 640

    guard let cgImage = image.cgImage else {
        print("UIImage has no CGImage backing it!")
        return image
    }

    // Compute parameters for transform.
    let width = CGFloat(cgImage.width)
    let height = CGFloat(cgImage.height)
    var transform = CGAffineTransform.identity

    var bounds = CGRect(x: 0, y: 0, width: width, height: height)

    if width > maxResolution ||
        height > maxResolution {
        let ratio = width / height
        if width > height {
            bounds.size.width = maxResolution
            bounds.size.height = round(maxResolution / ratio)
        } else {
            bounds.size.width = round(maxResolution * ratio)
            bounds.size.height = maxResolution
        }
    }

    let scaleRatio = bounds.size.width / width
    let orientation = image.imageOrientation
    switch orientation {
    case .up:
        transform = .identity
    case .down:
        transform = CGAffineTransform(translationX: width, y: height).rotated(by: .pi)
    case .left:
        let boundsHeight = bounds.size.height
        bounds.size.height = bounds.size.width
        bounds.size.width = boundsHeight
        transform = CGAffineTransform(translationX: 0, y: width).rotated(by: 3.0 * .pi / 2.0)
    case .right:
        let boundsHeight = bounds.size.height
        bounds.size.height = bounds.size.width
        bounds.size.width = boundsHeight
        transform = CGAffineTransform(translationX: height, y: 0).rotated(by: .pi / 2.0)
    case .upMirrored:
        transform = CGAffineTransform(translationX: width, y: 0).scaledBy(x: -1, y: 1)
    case .downMirrored:
        transform = CGAffineTransform(translationX: 0, y: height).scaledBy(x: 1, y: -1)
    case .leftMirrored:
        let boundsHeight = bounds.size.height
        bounds.size.height = bounds.size.width
        bounds.size.width = boundsHeight
        transform = CGAffineTransform(translationX: height, y: width).scaledBy(x: -1, y: 1).rotated(by: 3.0 * .pi / 2.0)
    case .rightMirrored:
        let boundsHeight = bounds.size.height
        bounds.size.height = bounds.size.width
        bounds.size.width = boundsHeight
        transform = CGAffineTransform(scaleX: -1, y: 1).rotated(by: .pi / 2.0)
    default:
        transform = .identity
    }

    return UIGraphicsImageRenderer(size: bounds.size).image { rendererContext in
        let context = rendererContext.cgContext

        if orientation == .right || orientation == .left {
            context.scaleBy(x: -scaleRatio, y: scaleRatio)
            context.translateBy(x: -height, y: 0)
        } else {
            context.scaleBy(x: scaleRatio, y: -scaleRatio)
            context.translateBy(x: 0, y: -height)
        }
        context.concatenate(transform)
        context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
    }
}
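And for completeness, a minimal sketch of wiring the two pieces together (originalImage is an assumed UIImage, e.g. from a photo picker):

// Normalize size and orientation first, then recognize text and draw the boxes.
let prepared = scaleAndOrient(image: originalImage)
recognizeText(in: prepared)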

Converting a Vision VNTextObservation to a String

Apple finally updated Vision to do OCR. Open a playground and dump a couple of test images in the Resources folder. In my case, I called them "demoDocument.jpg" and "demoLicensePlate.jpg".

The new class is called VNRecognizeTextRequest. Dump this in a playground and give it a whirl:

import Foundation
import Vision

enum DemoImage: String {
    case document = "demoDocument"
    case licensePlate = "demoLicensePlate"
}

class OCRReader {
    func performOCR(on url: URL?, recognitionLevel: VNRequestTextRecognitionLevel) {
        guard let url = url else { return }
        let requestHandler = VNImageRequestHandler(url: url, options: [:])

        let request = VNRecognizeTextRequest { (request, error) in
            if let error = error {
                print(error)
                return
            }

            guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

            for currentObservation in observations {
                let topCandidate = currentObservation.topCandidates(1)
                if let recognizedText = topCandidate.first {
                    print(recognizedText.string)
                }
            }
        }
        request.recognitionLevel = recognitionLevel

        try? requestHandler.perform([request])
    }
}

func url(for image: DemoImage) -> URL? {
    return Bundle.main.url(forResource: image.rawValue, withExtension: "jpg")
}

let ocrReader = OCRReader()
ocrReader.performOCR(on: url(for: .document), recognitionLevel: .fast)

There's an in-depth discussion of this in the WWDC19 session Text Recognition in Vision Framework.

Apple Vision Framework: LCD/LED digit recognition

Train a model...

Train your own .mlmodel using up to 10K images containing screens of digital clocks, calculators, blood pressure monitors, etc. For that you can use an Xcode playground or Apple's Create ML app. (A sketch of using the resulting model with Vision follows the training code below.)

Here's some code you can copy and paste into a macOS playground:

import Foundation
import CreateML

let trainDir = URL(fileURLWithPath: "/Users/swift/Desktop/Screens/Digits")

// let testDir = URL(fileURLWithPath: "/Users/swift/Desktop/Screens/Test")

var model = try MLImageClassifier(trainingData: .labeledDirectories(at: trainDir),
                                  parameters: .init(featureExtractor: .scenePrint(revision: nil),
                                                    validation: .none,
                                                    maxIterations: 25,
                                                    augmentationOptions: [.blur, .noise, .exposure]))

let evaluation = model.evaluation(on: .labeledDirectories(at: trainDir))

let url = URL(fileURLWithPath: "/Users/swift/Desktop/Screens/Screens.mlmodel")

try model.write(to: url)
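Once the .mlmodel is added to an app target, Xcode generates a class for it (named Screens here, after the file). A minimal sketch of running that classifier through Vision might look like the following; the Screens type name and the use of VNClassificationObservation are assumptions based on the training code above, not code from the original answer:

import CoreML
import UIKit
import Vision

// Hypothetical usage of the trained classifier through Vision.
func classifyScreen(in image: UIImage) {
    guard let cgImage = image.cgImage,
          let coreMLModel = try? Screens(configuration: MLModelConfiguration()).model, // class generated from Screens.mlmodel
          let visionModel = try? VNCoreMLModel(for: coreMLModel)
    else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let results = request.results as? [VNClassificationObservation],
              let best = results.first
        else { return }
        // Print the most likely label (e.g. a digit) and its confidence.
        print("\(best.identifier): \(best.confidence)")
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}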


Extracting text from an image...

If you want to know how to extract text from an image using the Vision framework, look at this post.

Apple Vision image recognition

As of ARKit 1.5 (which shipped with iOS 11.3 in the spring of 2018), a feature is implemented directly on top of ARKit that solves this problem.

ARKit fully supports image recognition.
Upon recognition of an image, its 3D coordinates can be retrieved as an anchor, so content can be placed onto it.
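A minimal sketch of that setup, assuming the reference images live in an "AR Resources" group in the asset catalog and the view controller owns an ARSCNView named sceneView:

import ARKit
import SceneKit
import UIKit

class ImageDetectionViewController: UIViewController, ARSCNViewDelegate {
    @IBOutlet var sceneView: ARSCNView!

    override func viewDidLoad() {
        super.viewDidLoad()
        sceneView.delegate = self
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        let configuration = ARWorldTrackingConfiguration()
        // Reference images defined in an "AR Resources" group in the asset catalog (assumption).
        configuration.detectionImages = ARReferenceImage.referenceImages(inGroupNamed: "AR Resources",
                                                                         bundle: nil) ?? []
        sceneView.session.run(configuration)
    }

    // Called when ARKit recognizes one of the reference images; the anchor carries its 3D transform.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard let imageAnchor = anchor as? ARImageAnchor else { return }
        print("Detected \(imageAnchor.referenceImage.name ?? "image") at \(imageAnchor.transform)")
        // Place content by adding child nodes to node here.
    }
}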

Vision language detection

I have to specify Chinese at the front of the recognitionLanguages property for it to work with Chinese

This is how it was designed. .accurate uses an ML-based recognizer, and because Chinese is particularly complex, it must come first in the list. See WWDC21's Extract document data using Vision at 8:02.

This also means that there is no way to automatically detect the language of the text in the image.
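For example, a minimal sketch with Chinese listed first (the "zh-Hans"/"en-US" identifiers are the language codes Vision reports; adjust them to your needs):

import Vision

let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    let lines = observations.compactMap { $0.topCandidates(1).first?.string }
    print(lines.joined(separator: "\n"))
}

// Chinese must be listed before other languages when using the .accurate recognizer.
request.recognitionLevel = .accurate
request.recognitionLanguages = ["zh-Hans", "en-US"]

// To check which languages the current revision supports (iOS 15+):
// print(try request.supportedRecognitionLanguages())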


