AR Tag Detection System

Overview
At the highest level, the detection process takes an input image, represented as an OpenCV matrix, and returns a list of detected tags with their 3D positions and orientations in space. This depends on a process called 3D reconstruction: given knowledge of how the camera projects a 3D scene onto a 2D plane, and the real-world size of the AR tag we are looking for, we can determine the tag's 3D position and orientation from its appearance in the image.

Definitions
In this document and in the AR system, a tag refers to a physical, real-world instance of an AR tag, which has a position and an orientation relative to our rover. The pattern printed on the AR tag is called a marker.

Since different competitions use different systems of marker patterns (e.g. URC uses ALVAR, CIRC uses ArUco), we define the concept of a marker set: a set of marker patterns, some of which may have special meaning in the context of the competition (e.g. URC uses specific patterns to mark certain landmarks for the autonomous navigation task).

Therefore, the detection process takes an image and a marker set containing the markers to look for, and returns a set of tags representing the locations of those markers in the real world. Additionally, the marker set can be used to associate the detected markers with some kind of meaningful identifier in the context of the competition, like the number of the gate we are looking for.
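To make the shape of this interface concrete, the top-level declaration might look roughly like the following sketch. The names here (Tag, MarkerSet, detectTags) are placeholders for illustration, not the actual identifiers; those are described under Code Structure below.

```cpp
#include <opencv2/core.hpp>
#include <vector>

class Tag;        // placeholder: a detected tag's 3D position and orientation
class MarkerSet;  // placeholder: the marker patterns to search for

// Hypothetical top-level signature: search `image` for the markers in
// `markers` and return the tags found, with their real-world poses.
std::vector<Tag> detectTags(const cv::Mat &image, const MarkerSet &markers);
```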

3D Reconstruction
(For a more in-depth description of the following process, with actual math notation, it is highly recommended to read the “Detailed Description” section of the Camera Calibration and 3D Reconstruction page in the OpenCV documentation.)

Every camera has unique properties of its lens, such as the focal lengths and the coordinates of the image center. These can be expressed in a 3x3 matrix called the camera matrix; when the coordinates of a point in 3D space (relative to the camera) are multiplied by this matrix, the result (after dividing by the third component to account for perspective) is the 2D coordinates of the point projected onto the image. In other words, this matrix defines how the camera projects a 3D scene onto a 2D plane. The matrix is obtained through a process called camera calibration, in which multiple pictures of a pattern with easily detectable features and known dimensions, such as a checkerboard (see calib.io), are used to determine the camera matrix. We also need to know the distortion coefficients, which describe the amount of radial distortion in the camera (like you would see in a fisheye camera such as a GoPro). Every camera has some slight radial distortion, even if it is not explicitly a fisheye camera, and knowing these coefficients lets us correct for it. The camera calibration process is done once ahead of time, and the coefficients are saved to be loaded later.
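As a rough sketch of that last step, loading saved calibration data with OpenCV's FileStorage might look like the following; the file name and key names here are assumptions, not necessarily our actual format.

```cpp
#include <opencv2/core.hpp>

// Sketch: load a previously saved camera matrix and distortion coefficients.
// The file name and the "camera_matrix"/"dist_coeffs" keys are assumptions.
//
// The camera matrix has the layout:   [ fx   0  cx ]
//                                     [  0  fy  cy ]
//                                     [  0   0   1 ]
// where fx/fy are the focal lengths and (cx, cy) is the image center.
int main() {
    cv::FileStorage fs("calibration.yml", cv::FileStorage::READ);
    cv::Mat cameraMatrix, distCoeffs;
    fs["camera_matrix"] >> cameraMatrix;
    fs["dist_coeffs"] >> distCoeffs;
    fs.release();
    return 0;
}
```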

We usually set the origin of 3D space at the camera, with the x/y axes in the same directions as the image axes and the z axis pointing straight out; this is called the camera space. However, we can define another 3D space, centered on the AR tag we want to detect, with the x/y axes parallel to the tag's surface and the z axis normal to it; this is called the object space or the world space. We can then find a matrix that takes coordinates relative to the origin of the object space and transforms them to be relative to the camera space. The coordinates still refer to the same point in space, but they are measured from a different origin. If we have this transformation matrix, we know where the tag is in 3D space relative to the camera: the origin of the object space is a point right on the tag's surface, and if we can get the coordinates of that point relative to the camera using this matrix, then we have the location of the tag relative to the camera.
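As a small illustration (not code from the actual system), applying this transformation to a point is just a matrix multiplication and an addition, where R is the rotation part and t the translation part of the matrix described below:

```cpp
#include <opencv2/core.hpp>

// Illustrative only: transform a point from object space to camera space,
// i.e. p_cam = R * p_obj + t.
cv::Vec3d objectToCamera(const cv::Matx33d &R, const cv::Vec3d &t,
                         const cv::Vec3d &pObj) {
    return R * pObj + t;
}

// The tag's location relative to the camera is simply the object-space
// origin transformed into camera space: R * (0, 0, 0) + t = t.
```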

It is important to note that while coordinates in the image space are measured in pixels, coordinates in the object and camera spaces are measured in real-world units like millimeters; these units are fixed at camera calibration because they affect the scale of the camera matrix. Therefore, if we get the coordinates of the tag, we have its real-world location in space, in real-world units!

Our goal is therefore to determine the transformation matrix: a 3x4 matrix whose leftmost 3 columns describe the rotation of the tag relative to the camera, and whose rightmost column describes the translation (i.e. the location) of the tag relative to the camera. The translation is more important to us, because tags are mounted on all sides of the post, so the rotation cannot really be used to tell which side of the post/gate we are on.

OpenCV has functions (see the OpenCV docs) that, given a camera matrix, the 2D coordinates of points in an image, and the 3D coordinates of those same points in the object space, will calculate and return this matrix, giving us the location of the tag relative to the camera. We know the coordinates of the tag's corners in the object space (i.e. relative to the center of the tag), because they follow directly from the physical size of the tag. For example, the tags in URC (at the time of writing) are 200mm on a side; since we set the origin at the center of the tag, the upper right corner will be at (100mm, 100mm, 0mm), and so on. So we just have to detect the corners of the tag in the image, and then we can pass their coordinates to this function to get the real-world position and orientation of the tag.
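A minimal sketch of this step using cv::solvePnP, assuming a 200mm tag; the corner ordering is illustrative and must match whatever order the 2D detection step returns:

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Sketch: recover a tag's pose from its detected corners. Assumes a 200mm
// tag with the object-space origin at its center; the corner order below
// (top-left, top-right, bottom-right, bottom-left) is illustrative.
void estimatePose(const std::vector<cv::Point2f> &imageCorners,
                  const cv::Mat &cameraMatrix, const cv::Mat &distCoeffs,
                  cv::Vec3d &rvec, cv::Vec3d &tvec) {
    std::vector<cv::Point3f> objectCorners = {
        {-100.0f,  100.0f, 0.0f},  // top-left
        { 100.0f,  100.0f, 0.0f},  // top-right
        { 100.0f, -100.0f, 0.0f},  // bottom-right
        {-100.0f, -100.0f, 0.0f},  // bottom-left
    };
    // rvec/tvec transform object-space coordinates into camera space, so
    // tvec is the tag's location relative to the camera, in millimeters.
    // (cv::Rodrigues can expand rvec into the full 3x3 rotation matrix.)
    cv::solvePnP(objectCorners, imageCorners, cameraMatrix, distCoeffs,
                 rvec, tvec);
}
```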

2D Detection Process
Most of the marker detection process is already implemented in the OpenCV ArUco module. This is a contributed module, but if you installed OpenCV using the instructions in the README, you should already have it.
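For illustration, 2D detection with the ArUco module looks roughly like the following. The predefined dictionary used here is arbitrary; the actual system presumably derives its markers from the marker set described above.

```cpp
#include <opencv2/aruco.hpp>
#include <vector>

// Sketch: detect ArUco markers in an image using the classic contrib API.
// The dictionary choice (DICT_4X4_50) is arbitrary for this example.
void detectMarkers2D(const cv::Mat &image) {
    cv::Ptr<cv::aruco::Dictionary> dictionary =
        cv::aruco::getPredefinedDictionary(cv::aruco::DICT_4X4_50);
    std::vector<int> ids;
    std::vector<std::vector<cv::Point2f>> corners;
    cv::aruco::detectMarkers(image, dictionary, corners, ids);
    // corners[i] holds the four image-space corners of marker ids[i],
    // ready to be passed to the pose-estimation step described above.
}
```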

Code Structure
The current AR tag detection system has a few important parts. All files are located under  except where otherwise specified.


 * contains the definition of a  class which holds the camera matrix and distortion coefficients of a camera. Additionally, it has a   function that will return an instance of a   class, given one of the constants in the enum.
 * defines the  class, which represents a marker data pattern. This class contains fields for the border size, pattern size, and actual data in the marker pattern.
 * defines the  class, which holds a set of markers and additionally contains the physical size of the tags in that competition as well as a dictionary that optionally maps marker IDs to a meaningful ID in the context of the competition. Note that not all markers will be mapped.
 * The  class, defined in , represents a physical tag found in the world, and stores its location and rotation information. Clients should not normally need to construct  instances themselves; the  takes care of that, but you can if you absolutely need or want to.
 * defines a  class, which is given a marker set and a camera matrix at construction, and provides a single method   which takes an input image, performs the detection process described above, and returns a vector of   objects.
 * implements a sample client of the AR detection code that opens a webcam on the computer and displays the video feed from it, with cubes projected onto the detected tags in the image to indicate their position and orientation (a sketch of this projection technique follows below).
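As a sketch of that projection step (not the sample client's actual code), cv::projectPoints can map a cube's object-space corners into the image using the pose recovered by solvePnP; the cube size, color, and z direction here are arbitrary choices.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Sketch: draw a wireframe cube on top of a detected tag, given the pose
// (rvec/tvec) from solvePnP and the camera calibration data.
void drawCube(cv::Mat &image, const cv::Vec3d &rvec, const cv::Vec3d &tvec,
              const cv::Mat &cameraMatrix, const cv::Mat &distCoeffs) {
    const float s = 100.0f;  // half of a 200mm tag's side length
    std::vector<cv::Point3f> cube = {
        {-s, -s, 0}, {s, -s, 0}, {s, s, 0}, {-s, s, 0},          // tag face
        {-s, -s, 2 * s}, {s, -s, 2 * s}, {s, s, 2 * s}, {-s, s, 2 * s}};
    std::vector<cv::Point2f> projected;
    cv::projectPoints(cube, rvec, tvec, cameraMatrix, distCoeffs, projected);
    for (int i = 0; i < 4; i++) {
        // tag-face edge, top-face edge, and the vertical edge between them
        cv::line(image, projected[i], projected[(i + 1) % 4], {0, 255, 0}, 2);
        cv::line(image, projected[i + 4], projected[4 + (i + 1) % 4], {0, 255, 0}, 2);
        cv::line(image, projected[i], projected[i + 4], {0, 255, 0}, 2);
    }
}
```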

Additional Information
Additional information can be found in [our API documentation].