
AR marker detection on GPU using WebGL*

BY Aleksandar Stoj... ON Jun 25, 2018

1. Introduction

While developing a 3D model scanning implementation, I realized that fiducial Augmented Reality (AR) markers are a simpler alternative to the Iterative Closest Point (ICP) algorithm. AR markers have the added benefit of marking a cut-off plane that separates the scanned 3D object from the environment. After evaluating existing solutions, I started designing the algorithm and implementing a proof of concept as open source code on github.com. All feedback is welcome.
 
This blog post includes the following sections:
  • Section 2: Current status of the prototype code and the constraints.
  • Section 3: Algorithm steps, so you can modify them for different types of AR markers or purposes.
  • Section 4: Latency measurements using a slow motion camera and EXT_disjoint_timer_query.
 
The prototype code does not implement camera pose estimation based on the identified marker position. A future article may address solving P3P in a WebGL shader. For the use case in this document (3D model scanning using depth capture camera), P3P is not required: the 3D positions of the marker square's corners are obtained from the depth stream, using the approach described in the Media Capture Depth Stream Extensions W3C API.
 

1.1 Developing the algorithm

Work on this started as part of a 3D model scanning implementation using depth camera capture in the Chrome* web browser. My plan was to use the ICP algorithm for camera movement estimation; however, implementing ICP using only WebGL is a complex task. AR markers were easier to implement and were available sooner than ICP.
 
Around the same time, I had other ideas about using a webcam and AR markers. For example, a webcam could be used to augment an environment, to compute motion data from webcam video frames for a more immersive VR experience, or to improve accessibility by calculating viewing position and direction. At first, I was skeptical about user input based on webcam capture, given that a 30 FPS webcam introduces 33 ms of latency, plus some latency added by the capture pipeline. Altogether, it seemed like too much latency for VR.
 
Then again, reusing current OpenCV-based approaches would mean doing part of the processing in JavaScript*, and reading the pixels back from an HTML5 canvas (2D or WebGL context) would increase latency. So, marker detection needed to happen in WebGL shader code, without reading pixels back from the GPU, and the detection results needed to be available with low latency.
 
Using slow motion capture on my iPhone*, I measured this latency and found it was better than initially assumed; see the Latency Measurements section for details. In short, after the video frame is uploaded to a GPU texture, shaders detect the position of the markers while rendering the frame, with no CPU-side intervention. There is only minor additional latency compared to displaying the webcam video frame alone.
 
The video texture and the detected markers then get rendered. There is no readback from GPU textures to the CPU side - all the recognition processing happens on the GPU (render-to-texture shader passes) and the result is consumed in a render-to-screen shader pass. The demo code also provides a readback implementation, if needed.

2. Current status and constraints

2.1 This is a first step, prototype code

The algorithm is prototype code, a first step toward robust and performant AR marker detection. I will continue to work on it while integrating it with 3D model scanning. I also plan to explore other use cases, as mentioned.

2.2 Implementation supports 3x3 AR codes

A 3x3 marker can encode 64 different codes. For convenience, here is a link to print-ready images containing all of the codes, from 0 to 63. For generators, check the links available at ARToolKit. Modifying the algorithm for other types of markers is explained later in this document, in the Algorithm section.

2.3 It is fast and can detect multiple markers simultaneously

I posted a regular speed video (in GIF form), which shows a browser canvas rendering capture from an integrated 30 Hz laptop webcam.
 
 
Notice that the marker square at the bottom right of the paper contains the label “NOT VALID” and is not recognized. The other markers, with valid codes and layout, are recognized, and their codes are rendered as yellowish numbers near each marker's base corner.
 
I also posted a slow motion video (in GIF form), captured by an iPhone at 240 Hz. If my calculations are correct, it should be 8 times slower than regular speed. The USB camera samples at 60 Hz, and the laptop's display refresh rate is 75 Hz.
 
 
As mentioned, when the marker coordinates and the code are recognized, a yellow number is visible on laptop screen, next to the marker’s base corner.
 
Notice that all of the markers on the paper on the left are detected, and that code 0 on the marker in my hand, on the right and moving very fast, is also detected, except when there is significant motion blur. The same video, at regular speed, is available for download here if you'd like to examine it further.

2.4 Demo requires controlled lighting

Adaptive thresholding should be the very first step in the algorithm so that it works under different lighting conditions. I deliberately left it for later, to be tweaked in the first shader pass without compromising latency. At this time, the algorithm works only in a reasonably lit room, with no direct light shining along the camera-to-marker direction.

2.5 Browser requirements

The algorithm was tested on Chrome version 64 and later and on Firefox* version 60, with the browsers running on Windows* 10, Ubuntu* 16.04, and macOS* Sierra.
To run on Firefox, you must set certain options because the demo.html code uses custom elements.
 
Open the about:config page and set dom.webcomponents.customelements.enabled and dom.webcomponents.shadowdom.enabled to true. This dependency is related only to the page layout code and can easily be removed to avoid dealing with about:config.

3. Algorithm and the code

The algorithm operates as a series of rendering passes. Each pass renders to a texture, and each pass's output is the input for one or more later passes. Finally, the texture with information about the detected AR markers is used in the passes that render to screen.
 
In the code, the output of each pass is defined by the framebuffer property. When framebuffer is null, the output is rendered to screen.
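
As a minimal sketch of that structure (not the demo's actual pass objects), a pass can be modeled as a program, a set of input textures, and a target framebuffer, where a null framebuffer means rendering to the screen. The runPass helper and the pass fields below are illustrative names, assuming an existing WebGL2 context.

    // Minimal sketch of a render pass driver; field names are illustrative only.
    function runPass(gl, pass) {
      // pass.framebuffer === null means "render to screen" (default framebuffer).
      gl.bindFramebuffer(gl.FRAMEBUFFER, pass.framebuffer);
      gl.viewport(0, 0, pass.width, pass.height);
      gl.useProgram(pass.program);
      // Bind the textures produced by earlier passes as inputs to this one.
      pass.inputTextures.forEach((texture, unit) => {
        gl.activeTexture(gl.TEXTURE0 + unit);
        gl.bindTexture(gl.TEXTURE_2D, texture);
      });
      // Full-screen quad; the real code issues its own draw calls per pass.
      gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
    }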
 
The algorithm focuses on identifying the inner square, which is highlighted in the picture below. After identifying straight edges and the four corner pixels of the square, the algorithm samples the cells inside in pass 5 and detects the code.
 
 

3.1 Pass #0: Threshold and edge codes

As mentioned, adaptive thresholding is not currently implemented in the algorithm. Thresholding makes the texture monochromatic - each pixel is either black or white. Instead of adaptive thresholding, we use a simple constant threshold to tell the black pixels from white pixels - sometimes it works, sometimes it doesn’t, depending on the room lighting. In constrained conditions, the algorithm can identify black and white pixels, which enables it to identify the edge pixels. In this scenario, edge pixels refer to black pixels that have white pixels next to them. 
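
To make the thresholding step concrete, here is a minimal GLSL ES 3.00 fragment shader, kept as a JavaScript string as is usual in WebGL code, that performs only the constant-threshold part of this pass. The uniform names and the 0.5 threshold are placeholders rather than the demo's actual values.

    // Sketch of a fixed-threshold fragment shader; names and threshold are illustrative.
    const thresholdFragmentSource = `#version 300 es
    precision mediump float;
    uniform sampler2D cameraTexture;  // current video frame
    in vec2 texCoord;
    out vec4 outColor;
    void main() {
      vec3 rgb = texture(cameraTexture, texCoord).rgb;
      float luma = dot(rgb, vec3(0.299, 0.587, 0.114));
      // Constant threshold: 1.0 for white, 0.0 for black.
      float white = step(0.5, luma);
      outColor = vec4(vec3(white), 1.0);
    }`;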
 
This pass sets the foundation for processing in later steps and encodes edge pixels. For each edge pixel (a black pixel with an immediately adjacent white pixel), the algorithm enumerates all eight neighboring pixels in counterclockwise order and, in that order of enumeration, identifies a sequence of black pixels. For this sequence, or arc, of black pixels, values for start-of-black (S) and end-of-black (E) are calculated, as described in this diagram:
 
This pass also calculates the direction between end-of-black and start-of-black. It is marked as B, the bisector of the white angle, in the diagram above. P in the diagram refers to the pixel currently being analyzed and written to the framebuffer.
 
Using the S (start) and E (end) values calculated in this step, later steps can enumerate all of the edge pixels on the inner edge of the square: start from point P and, following S links, come back to P.
 
There is one specific case where pixels get skipped when following the S-chain. In the diagram above, S points to pixel 7 for both the P and 0 pixels. In this case, following S links from 4 would go 4 -> P -> 7, and 0 would get skipped. However, starting from 0 and following the S links still reaches the neighboring pixels, so this case is handled properly.
 
To verify the connection between two corners, the algorithm does exactly this kind of traversal, following the S-links, but in larger steps; for example, pass 4 samples edge pixels at a distance of eight S-links.
 
Bisector information is used to identify straight lines. Following S links, the next step checks the maximal difference in B values across multiple consecutive pixels.
 
Calculating S, E, and B is done outside the shader, in the calculateEdgeCodeToGradientDirectionMap function, so it is not repeated on every frame. The pre-calculated data for all of the potential neighbor patterns is supplied to the shader as a uniform array. The shader samples neighbors s0 to s7, forms a lookup into the precalculated array, and writes vec4(B/8.0, S/8.0, E/8.0, 0.0) to the framebuffer - that is, to the texture color-attached to the framebuffer, which is the input for the next step.
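
The hedged sketch below illustrates the lookup-table idea: for every possible black/white pattern of the eight neighbors, compute S, E, and B once on the CPU. The actual calculateEdgeCodeToGradientDirectionMap may handle the details (multiple black arcs, all-black or all-white patterns) differently.

    // Sketch: one vec4(B, S, E, 0) per 8-neighbor bit pattern, scaled by 1/8.
    function buildEdgeCodeMap() {
      const map = new Float32Array(256 * 4);
      for (let pattern = 1; pattern < 255; pattern++) {
        // Neighbor i (0..7, counterclockwise) is black when bit i of `pattern` is set.
        const black = i => (pattern >> (i & 7)) & 1;
        let S = 0, E = 0;
        for (let i = 0; i < 8; i++) {
          if (black(i) && !black(i - 1 + 8)) S = i; // white -> black transition
          if (black(i) && !black(i + 1)) E = i;     // black -> white transition
        }
        // B: bisector of the white arc, halfway from E to S going counterclockwise.
        const whiteSpan = (S - E + 8) % 8;
        const B = (E + whiteSpan / 2) % 8;
        map.set([B / 8, S / 8, E / 8, 0], pattern * 4);
      }
      return map; // uploaded as a uniform array for the shader lookup
    }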
 
The following picture shows the S, E, and B values for different pixels. 

3.2 Pass #1: Corner candidates from edge codes

For each of the edge pixels, this step averages the bisector (B) direction over eight consecutive edge pixels in both the S and E directions. For every pixel, it also checks the maximal variation in this value. Based on that, the shader can estimate whether the pixel is on a straight line: either there are no significant changes in direction across the 16 surrounding edge pixels, or the pixel lies on an edge that is not straight.
 
A special case occurs when the 8 pixels in the S (start) direction form a straight line, the 8 pixels in the E (end) direction also form a straight line, and those two lines meet at an angle: the pixel is then a corner. Here, we discard outer-edge corners from further analysis.
 
Since we have already sampled as far as the eighth edge pixel in the S direction, we can store the offset to that eighth edge pixel in the output, so that later traversals can advance quickly through edge pixels without additional sampling.
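
The straightness test itself can be illustrated outside the shader. The hedged JavaScript sketch below takes the B values (in eighths of a turn, as produced by pass 0) of the surrounding edge pixels and checks whether their spread stays within a tolerance; the demo does this directly in the shader, and its exact tolerance is not shown here.

    // Sketch of a straightness test over bisector values that wrap around at 8.
    function isStraight(bisectors, toleranceEighths = 1) {
      const toAngle = b => (b / 8) * 2 * Math.PI;
      const x = bisectors.reduce((sum, b) => sum + Math.cos(toAngle(b)), 0);
      const y = bisectors.reduce((sum, b) => sum + Math.sin(toAngle(b)), 0);
      const mean = Math.atan2(y, x); // circular mean direction
      const maxDeviation = Math.max(...bisectors.map(b => {
        const d = Math.abs(toAngle(b) - mean) % (2 * Math.PI);
        return Math.min(d, 2 * Math.PI - d); // shortest angular distance
      }));
      return maxDeviation <= (toleranceEighths / 8) * 2 * Math.PI;
    }

    // Example: nearly constant bisectors along a straight edge.
    console.log(isStraight([1, 1.1, 0.9, 1, 1, 1.2, 0.8, 1])); // true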

3.3 Pass #2: Refine neighbor corners

In practice, there can be several corner candidates within a pixel or two of each other, and we need to select only one. Based on the data from previous passes, this pass discards any corner candidate that has a more distinguished corner candidate within a sampling distance of 2.

3.4 Pass #3: Reduce corners texture dimensions, divide both by 5

All prior passes operate at full resolution (960x540 at the time of writing). For the client of this computation, sampling all 960x540 pixels to look for a detected marker would not be efficient. Starting with this pass, each pass reduces the size of the output texture. In this pass, the algorithm divides width and height by 5 (960/5 x 540/5). In later passes, the algorithm further divides width and height by 8 and 6, respectively (192/8 x 108/6), to reach the final 24x18 texture.
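
Written out, the size chain looks like this; only the 960x540 input size and the divisors come from the text above.

    const capture = { width: 960, height: 540 };
    const pass3 = { width: capture.width / 5, height: capture.height / 5 };   // 192 x 108
    const finalPass = { width: pass3.width / 8, height: pass3.height / 6 };   // 24 x 18
    console.log(pass3, finalPass); // { width: 192, height: 108 } { width: 24, height: 18 }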
 
When reusing this script in the 3D scanning demo, another pass would likely be appended, for example to produce a 2x2 texture that includes information only for 4 defined marker codes. That should be straightforward, following the code here. For the current use case, we get a 24x18 texture carrying the detection results.
 
As a result of reduced dimensions, the output needs to carry the position of the corners.  

3.5 Pass #4: Detect straight line edges between corners

The input for this pass is a 192x108 texture, and the output keeps the same size. For every corner, the shader attempts to traverse straight-line, 8-step S links until it reaches another corner. If it does, the output carries the information that two corners are connected via a straight inner-edge line: the first corner as the RG color components, the second as BA.
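
To make the output layout concrete, the following hedged sketch shows how one RGBA texel could encode such an edge; normalizing the corner coordinates by the texture size is an assumption here, not necessarily how the demo stores them.

    // Sketch: first corner in R and G, second corner in B and A, normalized to 0..1.
    function packEdge(cornerA, cornerB, width, height) {
      return new Float32Array([
        cornerA.x / width, cornerA.y / height, // R, G: first corner
        cornerB.x / width, cornerB.y / height, // B, A: second corner
      ]);
    }

    // Example: an edge from (40, 12) to (120, 16) in a 192x108 texture.
    const texel = packEdge({ x: 40, y: 12 }, { x: 120, y: 16 }, 192, 108);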

3.6 Pass #5: Identify markers, their edges, corners, orientation and the code

This pass reduces the output size to the 24x18 texture. For each corner connected via a straight line to another corner, the algorithm tries to make four corner hops, following the connections in an attempt to traverse back to the starting corner. If that happens, the algorithm has identified four connected edges and can proceed to read the inner content.
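
A hedged JavaScript sketch of this loop-closure check is shown below. In the demo the check happens in a shader by sampling the pass 4 output; here a plain lookup function stands in for that sampling.

    // `nextCorner` maps a corner key like "x,y" to the corner it is connected to
    // by a straight edge, or null when there is no such edge.
    function isMarkerQuad(startCorner, nextCorner) {
      let corner = startCorner;
      for (let hop = 0; hop < 4; hop++) {
        corner = nextCorner(corner);
        if (!corner) return false; // the chain of straight edges breaks
      }
      return corner === startCorner; // four hops returned to the start: a marker quad
    }

    // Example with a toy connectivity map of four mutually connected corners.
    const edges = { '0,0': '9,0', '9,0': '9,9', '9,9': '0,9', '0,9': '0,0' };
    console.log(isMarkerQuad('0,0', key => edges[key] || null)); // true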
 
This pass could be made more robust, to avoid false positives, but that work will be done after adaptive thresholding. For now, we sample one point per inner cell, nine in total for the 3x3 Hamming code. First we need to identify the base corner of the marker, so let's explain how the 3x3 AR code encodes the data.

3.6.1 3x3 marker data encoding: Base corner

The base corner is defined by the position of two black cells and one white cell, as shown in the diagram below. The numbered cells define how the code is computed: if a cell is black, the corresponding number is added to the calculated code. In the diagram, the marker's code is 0, since all the numbered cells are white.
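
The exact numbering of the cells comes from the printed diagram, so the sketch below only illustrates the decoding idea in JavaScript: given nine sampled cells with the orientation already fixed by the base corner, black data cells add their weights to the code. Which cells carry which weights is an assumption chosen only to match the 64 possible codes; the demo's real layout may differ.

    // Hedged sketch: decode a 3x3 marker from nine sampled cells (true = black).
    function decodeMarker(cells /* 9 booleans, row-major, base corner at index 0 */) {
      const dataCellIndices = [1, 3, 4, 5, 7, 8]; // assumed data cells, not from the demo
      let code = 0;
      dataCellIndices.forEach((cellIndex, bit) => {
        if (cells[cellIndex]) code += 1 << bit; // a black cell adds its weight
      });
      return code; // 0..63; all data cells white gives code 0, as in the diagram
    }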

3.7 Pass #6: Rendering to screen

We have two render passes here: a line loop that renders the detected inner square outline, and the label with the marker code, rendered next to the base corner.
 
When calculating vertex positions, both passes sample the results of the previous recognition passes in the vertex shader. The code label pass also reads the detected code and, based on that, renders the corresponding label from the atlas.

4. Latency measurements

As described in the introduction, there are different use cases for this algorithm: identifying position in space, user input, or fast reading of information. It is interesting to know both the latency introduced by recognition processing alone and the total event-to-screen latency, which includes camera capture, OS and web browser capture, and other latencies such as rendering pipeline and display latency.
 
We measure latency introduced by recognition processing using the EXT_disjoint_timer_query_webgl2 extension.
We estimate the total event-to-screen latency using slow motion camera.

4.1 EXT_disjoint_timer_query

Using the EXT_disjoint_timer_query_webgl2 extension, we can query the GPU for the amount of time the recognition-related processing takes to complete. The extension is available in the Firefox browser after enabling webgl.enable-privileged-extensions in about:config.
 
After enabling the extension, a WebGLQuery object is created and used on every rendering frame to asynchronously query for the timing information.

To query asynchronously, beginQuery is placed just before the drawing calls related to the recognition rendering passes, and a matching endQuery call follows the last of the recognition render passes. We do not attempt to read the query result until the next rendering frame, to give the pending query enough time to complete. If the result is available, it is printed to the console and a new query can begin.
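
Below is a sketch of that pattern, assuming a WebGL2 context gl; runRecognitionPasses and renderToScreen are placeholders for the demo's render-to-texture passes and the final render-to-screen pass, and the bookkeeping details may differ from the actual demo code.

    const ext = gl.getExtension('EXT_disjoint_timer_query_webgl2');
    let pendingQuery = null;

    function drawFrame() {
      // Read the result of a previous frame's query once it is available.
      if (pendingQuery) {
        const available = gl.getQueryParameter(pendingQuery, gl.QUERY_RESULT_AVAILABLE);
        const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
        if (available) {
          if (!disjoint) {
            const ns = gl.getQueryParameter(pendingQuery, gl.QUERY_RESULT);
            console.log('recognition passes took', (ns / 1e6).toFixed(2), 'ms');
          }
          gl.deleteQuery(pendingQuery);
          pendingQuery = null;
        }
      }
      // Start a new query only when the previous one has been consumed.
      const measure = ext && !pendingQuery;
      if (measure) {
        pendingQuery = gl.createQuery();
        gl.beginQuery(ext.TIME_ELAPSED_EXT, pendingQuery);
      }
      runRecognitionPasses(); // passes 0-5, render to texture
      if (measure) gl.endQuery(ext.TIME_ELAPSED_EXT);
      renderToScreen();       // final pass, render to screen
    }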

On an Asus ROG-GL702VM, the recognition-related processing takes 0.2 to 0.3 ms to complete on the GPU.
On a Mid-2014 MacBook Pro with integrated graphics, running macOS, it takes around 5.5 ms.
 
This gives us an estimate of the latency introduced by recognition processing, and a way to monitor upcoming changes and optimizations.

4.2 Slow motion camera

I used an iPhone capturing slow motion video at 240 frames per second and then counted the number of frames that elapsed from the event happening until it was displayed on the laptop screen. The iPhone camera was pointed in the same direction as the web camera, which was capturing 540p video at 60 frames per second. The iPhone was pointed toward the quickly moving AR label in my hand and at the laptop display, which was detecting the label's marker code and rendering the corresponding object, as shown in the following picture.
 
Measuring latency this way is not straightforward. We need to identify the event as it appears when the laptop screen refreshes (at a 75 Hz refresh rate) and find the same event on the slow motion video timescale (240 Hz). Then we need many such measurements, because the camera-sample-to-display period can vary, depending on how the 16.6 ms camera sample period (at 60 Hz) overlaps with the 13.3 ms display sync period. For the measurements, I used VLC* frame-by-frame stepping to count the number of frames from the real-world event to its display on the laptop screen.
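
For a sense of scale, each slow motion frame at 240 fps spans 1000/240 ≈ 4.2 ms, so an event-to-screen latency of about 90 ms shows up as roughly 21 or 22 slow motion frames between the real-world event and its appearance on the laptop display.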
 
In addition, it is important to verify the latency introduced by the marker recognition code itself. This latency was also measured using EXT_disjoint_timer_query, and the same slow motion measurement was repeated without the recognition code, displaying only the captured video.
 
On an Asus ROG-GL702VM with discrete graphics, 60 Hz camera capture, and a 75 Hz display refresh rate, the latency is around 90 ms.
On a Mid-2014 MacBook Pro with integrated graphics, running Windows* (Boot Camp*), with 30 Hz camera capture and a 60 Hz display refresh rate, the latency is around 120 ms.

5. Conclusion

This blog post describes an open source prototype algorithm for AR marker recognition using a browser and a standard webcam. The results of testing the algorithm are very promising, at least when it comes to recognition code latency, which was measured at as little as 0.3 ms. The overall event-to-screen latency, which includes camera latency, browser video capture, media decoding, the WebGL pipeline, and display latency, ranged from 90 to 120 ms.
 
In the future, I plan to explore how the total latency affects using a webcam and browser for user input or for rendering in augmented reality.
 
The algorithm is open source code on github.com - please take a look. Your contributions are welcome. 

5.1 Additional reading

I have published other articles related to using a depth camera instead of a standard webcam. They are available at:
 
Additional depth capture demos and articles are available on the Depth Camera Web Demo project’s front page.