Old AR tag detection system

Please note: What follows is a description of the old (as of Winter 2021) system for AR tag detection. This system only supports ALVAR tags and has many optimization/accuracy issues, and was deprecated and replaced with what is now the current system; see "Problems With This System" below. It is documented here for the purposes of helping others understand how the system is laid out, and the computer vision concepts on which it is based.

Detection Process
The detection process, at the highest level, takes in an input image, represented as an OpenCV matrix, and returns a list of detected tags, with their 3D positions and orientations in space. This depends on a process called 3D reconstruction: with knowledge of how the camera projects a 3D scene onto a 2D plane, and the real-world size of the AR tag we are looking for, we can determine the tag's 3D position and orientation from its appearance in the image.
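As a rough sketch of this interface (the names here are illustrative, not the actual API), the top level looks something like:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Illustrative only: a detected tag and its pose relative to the camera.
struct Tag {
    std::vector<cv::Point2f> corners;  // pixel coordinates of the four corners
    cv::Vec3d translation;             // 3D position in real-world units (e.g. mm)
    cv::Vec3d rotation;                // orientation, e.g. a Rodrigues rotation vector
};

// Takes a camera frame and returns every tag found in it.
std::vector<Tag> findTags(const cv::Mat& image);
```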

3D Reconstruction
(For a more in-depth description of the following process, with actual math notation, it is highly recommended to read the “Detailed Description” section of the Camera Calibration and 3D Reconstruction page in the OpenCV documentation.)

Every camera has unique lens properties, such as the focal lengths and the coordinates of the image center. These can be expressed in a 3x3 matrix called the camera matrix; when the coordinates of a point in 3D space (relative to the camera) are multiplied by this matrix, the result is the 2D coordinates of the point projected onto the image. In other words, this matrix defines how a 3D scene is projected onto a 2D plane by the camera. It is obtained by a process called camera calibration, where multiple pictures of a pattern with easily detectable features and known size and dimensions, such as a checkerboard (see calib.io), are used to determine the camera matrix. We also need to know the distortion coefficients, which describe the amount of radial distortion (like you would see in a fisheye camera such as a GoPro) in the camera. Every camera has some slight radial distortion, even if it is not explicitly a fisheye camera, and knowing these coefficients lets us correct for it. The camera calibration process is done once ahead of time, and then the coefficients are saved to be loaded later.
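A minimal sketch of this one-time calibration step, using OpenCV's standard calibration functions and assuming a 9x6 checkerboard with 25mm squares (the board dimensions and output file name are placeholders):

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Calibrate from a set of checkerboard images and save the results.
void calibrate(const std::vector<cv::Mat>& images) {
    const cv::Size boardSize(9, 6);  // interior corners per row/column
    const float squareSize = 25.0f;  // mm

    // The board's corner positions in its own 3D space (z = 0 plane).
    std::vector<cv::Point3f> boardPoints;
    for (int y = 0; y < boardSize.height; ++y)
        for (int x = 0; x < boardSize.width; ++x)
            boardPoints.emplace_back(x * squareSize, y * squareSize, 0.0f);

    // Collect the detected 2D corners for every image where the full
    // board is visible, paired with the known 3D board coordinates.
    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : images) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            objectPoints.push_back(boardPoints);
            imagePoints.push_back(corners);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, images[0].size(),
                        cameraMatrix, distCoeffs, rvecs, tvecs);

    // Save the coefficients so calibration only has to happen once.
    cv::FileStorage fs("camera_params.yml", cv::FileStorage::WRITE);
    fs << "camera_matrix" << cameraMatrix << "dist_coeffs" << distCoeffs;
}
```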

We usually set the origin of 3D space at the camera, with the x/y axes in the same directions as the image and the z axis pointing straight out; this is called the camera space. However, we can define another 3D space, centered on the AR tag we want to detect, with the x/y axes parallel to the tag's surface and the z axis normal to it; this is called the object space or the world space. We can then find a matrix that takes coordinates relative to the origin of the object space and transforms them to be relative to the camera space. The coordinates will still refer to the same point in space, but they will be measured from a different origin. If we have this transformation matrix, we essentially know where the tag is in 3D space relative to the camera: the origin of the object space is a point right on the tag's surface, and transforming that point into camera space gives us the location of the tag relative to the camera.

It is important to note that while coordinates in the image space are measured in pixels, coordinates in the object and camera spaces are measured in real-world units like millimeters; these units are set by the real-world measurements we supply for the calibration pattern and the tag. Therefore, if we get the coordinates of the tag, we have its real-world location in space, in real-world units!

Our goal is therefore to determine the transformation matrix: a 3x4 matrix whose leftmost three columns describe the rotation, and whose rightmost column describes the translation, that map object-space coordinates into camera space. The translation column is effectively the tag's location in camera coordinates, and it is the more important part for us: the tags are mounted on all sides of the post, so rotation cannot really be used to tell which side of the post/gate we are on.
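Written out (in the same notation as the OpenCV documentation), the whole chain from a point (X, Y, Z) in object space to a pixel (u, v) in the image is the standard pinhole-camera relation, restated here for reference:

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \, \big[\, R \mid t \,\big]
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix},
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
```

Here K is the camera matrix from calibration, [R | t] is the 3x4 transformation matrix we are trying to find, and s is a projective scale factor. solvePnP, described next, solves this relation for R and t.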

OpenCV has functions (see solvePnP in the OpenCV docs) that, given a camera matrix, the 2D coordinates of points in an image, and the 3D coordinates of those same points in the object space, will calculate and return this matrix, which gives us the location of the tag relative to the camera. We already know the coordinates of the tag's corners in the object space (i.e. relative to the center of the tag), because they are determined by the physical size of the tag. For example, the tags in URC (2020) are 200mm on a side; since we set the origin at the center, the upper right corner will be at (100mm, 100mm), and so on. So, we just have to detect the corners of the tag in the image, and then we can pass their coordinates to this function to get the real-world position and orientation of the tag.
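A rough sketch of that call for a single 200mm tag (cv::solvePnP is the real OpenCV function; the surrounding names and the corner ordering are illustrative):

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Recover a tag's pose from its four detected corner pixels.
// Assumes a 200mm tag with the object-space origin at its center.
void tagPose(const std::vector<cv::Point2f>& corners,  // detected in the image
             const cv::Mat& cameraMatrix,
             const cv::Mat& distCoeffs,
             cv::Mat& rvec, cv::Mat& tvec) {
    // Corner coordinates in object space (mm), matching the order the
    // detected corners are given in: top-left, top-right, bottom-right,
    // bottom-left.
    std::vector<cv::Point3f> objectPoints = {
        {-100.0f,  100.0f, 0.0f},
        { 100.0f,  100.0f, 0.0f},
        { 100.0f, -100.0f, 0.0f},
        {-100.0f, -100.0f, 0.0f},
    };

    // rvec/tvec transform object-space coordinates into camera space;
    // tvec is the tag center's position relative to the camera, in mm.
    cv::solvePnP(objectPoints, corners, cameraMatrix, distCoeffs, rvec, tvec);
}
```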

2D Detection Process
The way we detect the corners of the tag in the image is a multi-step process that starts with an image directly from the camera and ends with groups of 4 points that represent the corners of tags in the image. The process is as follows (a condensed sketch of these steps appears after the list):


 * 1) The image is converted to grayscale; the tags are black and white, so this helps the contrast between the black and white regions of the tags stand out better, and it also reduces the image to a single color channel (ranging from black to white) instead of three, which is needed for the edge detection later on.
 * 2) A blur is applied to the image to smooth out noise.
 * 3) Canny edge detection (see Canny in the OpenCV docs) is run on the image. This attempts to determine the edges of shapes in the image; ideally, the outlines of the tags will be exposed in this process. The output is a black image with the detected edges drawn in white.
 * 4) Contours are found in the image (see findContours in the OpenCV docs). This takes the black-and-white edge drawing and returns groups of coordinates that represent the shapes in the image, giving us the coordinates of the edges in the image.
 * 5) Contours are approximated with quadrilaterals (see approxPolyDP in the OpenCV docs). Any shape whose sides deviate from a straight line by more than a certain threshold cannot be approximated and is removed from consideration; the threshold scales with the perimeter of the contour. The goal of this step is to discard any contours that do not have reasonably straight edges and are not quadrilaterals, since we are looking for a square object in the image. At this stage, any resulting quadrilateral whose area is smaller than a certain threshold is also removed from consideration, to weed out small quadrilaterals that might have been detected as a result of noise or other shapes in the image.
 * 6) Any quadrilateral that makes it to this point could well be a candidate for a tag. The tags are square in real life, but since we could be viewing them at an angle there is no guarantee they will be square in the image. Therefore, we transform the region of the grayscale image inside every quadrilateral to a square, use adaptive thresholding to make the difference between black and white clear, and then attempt to detect the marker border, orientation region, and data regions that would be present in a real marker. If the quadrilateral fails any of these tests (e.g. there is white where a black border should be, or it is missing the orientation shape), it is removed from consideration. Any quadrilaterals that make it past this stage are considered valid tags.
 * 7) The coordinates of the corners of the remaining quadrilaterals are fed into the 3D reconstruction process described earlier, and the resulting 3D positions and orientations, along with the corners of the tags, are returned as the result of detection.
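A condensed sketch of steps 1-6, using OpenCV's standard image-processing functions; every numeric threshold here is a placeholder for what would really be a tunable parameter, the corner ordering for the warp is glossed over, and the border/orientation/data checks of step 6 are elided:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Finds quadrilaterals in a frame that could plausibly be tags
// (steps 1-6 above, with the marker-decoding checks elided).
std::vector<std::vector<cv::Point2f>> findTagCandidates(const cv::Mat& frame) {
    // Steps 1-3: grayscale, blur, Canny edge detection.
    cv::Mat gray, blurred, edges;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 0);
    cv::Canny(blurred, edges, 50, 150);

    // Step 4: extract contours from the edge image.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(edges, contours, cv::RETR_LIST, cv::CHAIN_APPROX_SIMPLE);

    std::vector<std::vector<cv::Point2f>> candidates;
    for (const auto& contour : contours) {
        // Step 5: approximate with a polygon, using a tolerance that
        // scales with the contour's perimeter; keep only convex
        // quadrilaterals above a minimum area.
        std::vector<cv::Point> quad;
        double perimeter = cv::arcLength(contour, true);
        cv::approxPolyDP(contour, quad, 0.05 * perimeter, true);
        if (quad.size() != 4 || !cv::isContourConvex(quad)) continue;
        if (cv::contourArea(quad) < 100.0) continue;  // reject noise

        // Step 6: warp the quad's interior to a square and threshold it,
        // so the border/orientation/data-region checks can be run on it.
        std::vector<cv::Point2f> corners(quad.begin(), quad.end());
        std::vector<cv::Point2f> square = {
            {0.f, 0.f}, {64.f, 0.f}, {64.f, 64.f}, {0.f, 64.f}};
        cv::Mat warp = cv::getPerspectiveTransform(corners, square);
        cv::Mat cell, binary;
        cv::warpPerspective(gray, cell, warp, cv::Size(64, 64));
        cv::adaptiveThreshold(cell, binary, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                              cv::THRESH_BINARY, 11, 2);
        // ...check border, orientation, and data regions of `binary` here...

        candidates.push_back(corners);
    }
    return candidates;
}
```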

Code Structure
This AR tag detection system has a few important parts. All files are located under  except where otherwise specified.


 * One file contains global definitions of OpenCV matrices that represent camera matrices for the different cameras we have used.
 * The Tag class represents a physical tag found in the world, and stores its location and rotation information. The Tag class performs the 3D reconstruction process when it is constructed. Clients should not construct Tag objects themselves; the Detector takes care of that.
 * The Detector class is given a camera matrix at construction, and provides a single detection method which takes an input image and optional parameters for edge detection and blur, performs the detection process described above, and returns a vector of Tag objects.
 * A sample client of the AR detection code opens a webcam on the computer and displays the video feed from it, with cubes projected onto the detected tags in the image to indicate their position and orientation. It can optionally display the images produced at intermediate stages of the detection process.
 * A threaded capture class provides an alternative to the OpenCV VideoCapture class, using a separate thread to capture images from the camera. This is helpful because VideoCapture blocks upon reading from the camera until the next frame is available, which would slow down our test implementation considerably; if we had just missed a frame, we had to wait for the next one. With the threaded version, a frame is always available instantaneously, because the latest frame is stored until a new one arrives (a minimal sketch of this idea appears below).
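Here is a minimal sketch of that threaded-capture idea (the class name and details are illustrative, not the actual implementation):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/videoio.hpp>
#include <atomic>
#include <mutex>
#include <thread>

// Illustrative threaded wrapper around cv::VideoCapture: a background
// thread keeps grabbing frames, so read() never blocks and instead
// returns the most recent frame immediately.
class ThreadedCapture {
public:
    explicit ThreadedCapture(int device) : cap_(device) {
        worker_ = std::thread([this] {
            cv::Mat frame;
            while (running_ && cap_.read(frame)) {
                std::lock_guard<std::mutex> lock(mutex_);
                frame.copyTo(latest_);
            }
        });
    }

    ~ThreadedCapture() {
        running_ = false;
        worker_.join();
    }

    // Returns immediately with the most recently captured frame,
    // or false if no frame has arrived yet.
    bool read(cv::Mat& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (latest_.empty()) return false;
        latest_.copyTo(out);
        return true;
    }

private:
    cv::VideoCapture cap_;
    std::atomic<bool> running_{true};
    std::thread worker_;
    std::mutex mutex_;
    cv::Mat latest_;
};
```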

Problems With This System

 * The system only supports ALVAR tags; if we compete in CIRC we will need to detect ArUco tags.
 * The system is not very robust; it has a range of about 5 meters before detection results start becoming wildly inaccurate. Additionally, detection results have a certain degree of inaccuracy even within the operating range; this is currently workable, but it would always be nice for it to be more accurate. For comparison, the GPS coordinates we are given can deviate from the actual landmark by up to 10 meters, so with this system the rover would have to perform some kind of search.
 * The system has performance issues. The develop branch code contains a lot of code to take advantage of the GPU on the Jetson, which worked but was difficult to test and did not give us a huge performance boost. We also attempted to use multithreading to process the image in parallel, which helped somewhat.
 * The code is somewhat messy and does computation in the wrong places; for example, the Tag class should probably just be a data representation, and not do the 3D reconstruction computations. The Detector would be a better place for those.