Combining CoreML and ARKit

Don't process images yourself to feed them to Core ML. Use Apple's Vision framework. Vision takes an ML model and any of several image types (including CVPixelBuffer), automatically gets the image to the right size, aspect ratio, and pixel format for the model to evaluate, and then gives you the model's results.

Here's a rough skeleton of the code you'd need:

import ARKit
import Vision

var request: VNRequest!

func setup() {
    // MyCoreMLGeneratedModelClass is the class Xcode generates from your .mlmodel file.
    guard let model = try? VNCoreMLModel(for: MyCoreMLGeneratedModelClass().model) else {
        fatalError("Couldn't load the Core ML model")
    }
    request = VNCoreMLRequest(model: model, completionHandler: myResultsMethod)
}

func classifyARFrame() {
    // currentFrame is optional, so unwrap it before handing the buffer to Vision.
    guard let pixelBuffer = session.currentFrame?.capturedImage else { return }
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: .up) // fix based on your UI orientation
    try? handler.perform([request])
}

func myResultsMethod(request: VNRequest, error: Error?) {
    guard let results = request.results as? [VNClassificationObservation] else {
        fatalError("Unexpected result type from VNCoreMLRequest")
    }
    for classification in results {
        print(classification.identifier, // the scene label
              classification.confidence)
    }
}


Combining CoreML Object Detection and ARKit 2D Image Detection

I found a solution. The problem was that the camera has a limited number of buffers available, and I was enqueueing too many of them while another Vision task was still running.

That is why the camera feed was slow. So the solution is to release the buffer before performing another request.

// Inside your ARSessionDelegate (for example, your view controller):
internal var currentBuffer: CVPixelBuffer?

// Vision requests run on a serial queue so only one is in flight at a time.
private let visionQueue = DispatchQueue(label: "CoreML_request")

func session(_ session: ARSession, didUpdate frame: ARFrame) {

    // Only enqueue a new frame if no other buffer is being processed
    // and tracking is in a normal state.
    guard currentBuffer == nil, case .normal = frame.camera.trackingState else {
        return
    }
    self.currentBuffer = frame.capturedImage

    visionQueue.async {
        // Use the buffer we retained above, not a newer frame.
        guard let pixelBuffer = self.currentBuffer else { return }

        let exifOrientation = self.exifOrientationFromDeviceOrientation()

        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                        orientation: exifOrientation,
                                                        options: [:])
        do {
            // Release the pixel buffer when done, allowing the next buffer to be processed.
            defer { self.currentBuffer = nil }
            try imageRequestHandler.perform(self.requests)
        } catch {
            print(error)
        }
    }
}

Here you can check the documentation:

https://developer.apple.com/documentation/arkit/recognizing_and_labeling_arbitrary_objects

Vision Framework with ARKit and CoreML

Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...

Just about all of the pieces are there for what you want to do... you mostly just need to put them together.


You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)

ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
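
For example, here's a minimal sketch of the delegate route. It assumes a sceneView outlet pointing at your ARSCNView; it's just one way to get at the buffer, not the only one:

// Part of your ARSCNViewDelegate (which inherits from SCNSceneRendererDelegate).
func renderer(_ renderer: SCNSceneRenderer, updateAtTime time: TimeInterval) {
    // Work back from the view to its session to get the frame behind this callback.
    guard let frame = sceneView.session.currentFrame else { return }
    let pixelBuffer: CVPixelBuffer = frame.capturedImage
    // ...hand pixelBuffer to Vision, as described below.
}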

You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.

  • You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
  • You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
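
For the tracking case, here's a rough sketch. The property and function names are mine, and lastObservation is assumed to be seeded from an earlier detection:

let sequenceHandler = VNSequenceRequestHandler()
var lastObservation: VNDetectedObjectObservation?   // seeded from an earlier detection

func track(in pixelBuffer: CVPixelBuffer) {
    guard let observation = lastObservation else { return }
    let trackRequest = VNTrackObjectRequest(detectedObjectObservation: observation) { request, _ in
        // Keep the newest observation so the next frame continues the track.
        self.lastObservation = request.results?.first as? VNDetectedObjectObservation
    }
    // The sequence handler keeps state between frames, so reuse the same instance.
    try? sequenceHandler.perform([trackRequest], on: pixelBuffer)
}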

You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)


One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".

Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).

If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
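
For example, here's a rough sketch of pulling the raw outputs out of such a model. What's inside the multi-array is entirely defined by your model, so this is only a skeleton:

func handleFeatureValues(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else { return }
    for observation in observations {
        // The feature value wraps whatever output type your model declares,
        // e.g. an MLMultiArray of box coordinates and class scores.
        if let multiArray = observation.featureValue.multiArrayValue {
            print(multiArray.shape)
        }
    }
}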

If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
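
For example, a sketch using one of Vision's built-in detectors to find QR codes and their positions in the image. It assumes frame is the ARFrame you got from the session:

let barcodeRequest = VNDetectBarcodesRequest { request, _ in
    guard let barcodes = request.results as? [VNBarcodeObservation] else { return }
    for barcode in barcodes {
        // boundingBox is in normalized image coordinates, origin at the bottom-left.
        print(barcode.payloadStringValue ?? "unknown payload", barcode.boundingBox)
    }
}
let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, orientation: .up)
try? handler.perform([barcodeRequest])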


If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.

Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
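
As a sketch of that last step, assuming screenPoint is the object's location already converted from Vision's normalized, bottom-left-origin image coordinates into view coordinates (that conversion usually involves the frame's displayTransform):

func placeMarker(at screenPoint: CGPoint, in sceneView: ARSCNView) {
    // Hit test against feature points ARKit has detected in the world.
    guard let result = sceneView.hitTest(screenPoint, types: .featurePoint).first else { return }

    let marker = SCNNode(geometry: SCNSphere(radius: 0.01))
    marker.simdTransform = result.worldTransform
    sceneView.scene.rootNode.addChildNode(marker)
}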

Object detection ARKit vs CoreML

Due to refraction through glass and varying lighting conditions, recognizing perfume bottles is one of the most difficult object-recognition tasks (in both ARKit and CoreML).

Look at the following picture – there are three glass balls at different locations:

[Image: three glass balls rendered at different locations under different conditions]

These glass balls have a different Fresnel IOR (index of refraction), environment, camera point of view, size, and lighting conditions, but the same shape, material, and colour.

So, the best way to speed up the recognition process is to use an identical background/environment (for example, a monochromatic light-grey paper background), the same lighting conditions (location, intensity, colour, and direction of the light), good shape readability (thanks to specular highlights), and the same camera POV.


I know it's sometimes impossible to follow these tips, but they do work.

Is it possible to detect an object using a CoreML model and find that object's measurements?

Use ARKit's built-in object detection for that task. It's simple and powerful.

With ARKit's object detection you can detect your door (which you must first scan with a smartphone to create a reference object).

The following code detects real-world objects (like a door) and places a 3D object or 3D text at the ARObjectAnchor's position:

import ARKit

extension ViewController: ARSCNViewDelegate {

    func renderer(_ renderer: SCNSceneRenderer,
                  didAdd node: SCNNode,
                  for anchor: ARAnchor) {

        if let _ = anchor as? ARObjectAnchor {

            let text = SCNText(string: "SIZE OF THIS OBJECT IS...",
                               extrusionDepth: 0.05)

            text.flatness = 0.5
            text.font = UIFont.boldSystemFont(ofSize: 10)

            let textNode = SCNNode(geometry: text)
            textNode.geometry?.firstMaterial?.diffuse.contents = UIColor.white
            textNode.scale = SCNVector3(0.01, 0.01, 0.01)

            node.addChildNode(textNode)
        }
    }
}
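
To answer the measurement part: the anchor's reference object carries a bounding-box extent, which gives you the detected object's approximate size. A minimal sketch (how accurate it is depends on how carefully the object was scanned):

if let objectAnchor = anchor as? ARObjectAnchor {
    // extent is a simd_float3: width, height, and depth in metres.
    let extent = objectAnchor.referenceObject.extent
    print("Detected object is roughly \(extent.x) x \(extent.y) x \(extent.z) metres")
}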

And supply a Resources group in your Xcode asset catalog containing reference objects (.arobject files) scanned from your real-life objects.

class ViewController: UIViewController {

    @IBOutlet var sceneView: ARSCNView!
    let configuration = ARWorldTrackingConfiguration()

    override func viewDidLoad() {
        super.viewDidLoad()

        sceneView.debugOptions = [.showFeaturePoints]
        sceneView.delegate = self

        guard let dObj = ARReferenceObject.referenceObjects(inGroupNamed: "Resources",
                                                            bundle: nil) else {
            fatalError("There are no reference objects")
        }

        configuration.detectionObjects = dObj
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        sceneView.session.pause()
    }
}

