Artificial Intelligence vs Computer Vision For Counting Applications
Countculture promotes its approach to counting as being based on Computer Vision (CV). We often hear companies promoting their product as being better because it is based on Artificial Intelligence (AI). To many outside our industry, this is quite perplexing as they think of CV as either part of AI or an area that makes heavy use of AI. This paper is not going to delve into that discussion because the problem lies with companies describing their product as AI-based because it's currently an in-vogue marketing buzzword, instead of specifically saying their product is based on the use of neural networks. Instead, we will discuss the difference between the approach that Countculture uses based on background subtraction (BGS) and approaches based on neural networks. Along the way, we will reference SORT(1,2), a widely known neural network-based approach often touted as being state-of-the-art, to illustrate various points.
Let’s start by considering what is meant by being counted. The application area is counting objects, such as pedestrians or people on bicycles/scooters, that perform a particular action based on movement: for example, people entering or exiting a store, people walking along a footpath in different directions, or mixed traffic doing so. To do this, we need the path the object took while in the field of view, that is, the object’s trajectory. Getting the trajectory of an object is a process called tracking. Because multiple objects can be in view simultaneously, the process is multi-object tracking (MOT).
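To make the link between a trajectory and a count concrete, here is a minimal sketch of counting against a horizontal counting line. It is an illustration only, not the Countculture counting rule: the names crossing_direction, count_crossings and line_y are hypothetical, and a trajectory is assumed to be a list of (x, y) positions in image coordinates.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in image coordinates


def crossing_direction(trajectory: List[Point], line_y: float) -> int:
    """Return +1 if the trajectory starts at y < line_y and ends at y >= line_y,
    -1 for the opposite direction, and 0 if it ends on the side it started."""
    if len(trajectory) < 2:
        return 0  # a single point has no direction of travel
    start_side = trajectory[0][1] - line_y
    end_side = trajectory[-1][1] - line_y
    if start_side < 0 <= end_side:
        return 1   # crossed in the direction of increasing y (say, "entering")
    if end_side < 0 <= start_side:
        return -1  # crossed in the direction of decreasing y (say, "exiting")
    return 0


def count_crossings(trajectories: List[List[Point]], line_y: float) -> Tuple[int, int]:
    """Return (count in the increasing-y direction, count in the decreasing-y direction)."""
    directions = [crossing_direction(t, line_y) for t in trajectories]
    return directions.count(1), directions.count(-1)


# Three hypothetical trajectories: one crosses each way, one never reaches the line.
example = [
    [(0.0, 1.0), (0.0, 3.0), (0.0, 6.0)],
    [(2.0, 7.0), (2.0, 4.0)],
    [(5.0, 1.0), (5.0, 2.0)],
]
print(count_crossings(example, line_y=5.0))
```

Only the first and last points are compared here, so an object that crosses and returns is not counted. Whether that is the desired behaviour depends on the application.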
We will only consider the situation where a sequence of measurements is available at discrete points in time, such as you would get from processing frames in a video. This is also common in other MOT applications, such as the sequence of measurements generated by radar sweeps. In fact, while tracking algorithms existed prior to it, much of the automated tracking algorithm development resulted from radar processing, such as for military applications and early air-traffic control (until the latter engineered out the problems by requiring aircraft to have identification transponders). So, given the sequence of measurements over time, we need to identify those that were generated by the same object, thus giving its trajectory. We will further restrict the discussion to one-shot state-based filtering algorithms. If we think of the measurements as having a discrete time index t, then state-based means we define the state at time t for a given object, which is some representation of all the measurements from that object at times up to and including time t. In the simplest cases, the state could be the sequence of measurements for the object up to and including time t. Filtering means we restrict attention to those algorithms that take the state of all objects at time t, plus the new measurements at time t+1, and from those alone estimate the new state of all objects at time t+1. This involves, in some sense, making an association between each object and zero, one, or more of the measurements at time t+1. One-shot refers to making a hard decision about that association. SORT fits into this class of algorithms. The Countculture tracker is close but contains elements that are not one-shot.
The hard part of the above is the association step: determining which new measurements belong to existing trajectories, which new measurements don’t, and which existing trajectories don’t have new measurements. Let’s consider the simple situation where there are N existing trajectories and N new measurements, with each measurement belonging to just one trajectory and vice versa. Then, we can form an NxN matrix where each row corresponds to an existing trajectory and each column corresponds to a new measurement. We fill the matrix with values that represent the cost or penalty of associating the measurement with the trajectory. Such a matrix is called an affinity matrix. A potential solution consists of selecting N cells, such that there is only one per row and one per column. We want to select the potential solution that has the minimum cost (the sum of the values in the selected cells). In mathematics, this is called an assignment problem. Or, in another view, consider a spreadsheet with N rows and N columns, with columns for measurements and rows for trajectories. Then, for example, cell C2 contains the cost of assigning measurement C to trajectory 2. The assignment problem is then colouring N cells, with only one per column and one per row, such that the sum of the costs in the coloured cells is the minimum of all such possibilities. SORT constructs such an affinity matrix. The cost values are determined by using a constant velocity model to predict the locations of the existing trajectories at the time of the new measurements and measuring how far each measurement is from this prediction (SORT measures this by the overlap between the predicted and detected bounding boxes). So, the further away a measurement is from the predicted location, the higher the cost. SORT then uses the well-known Hungarian algorithm(3) to solve the assignment problem. DeepSORT(4) is an extension of SORT that uses a neural network to fill out the affinity matrix, taking other factors, such as appearance similarity, into account.
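The association step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not SORT's or Countculture's implementation. Objects are reduced to points, the cost is the Euclidean distance from a constant-velocity prediction, and the assignment problem is solved by brute force over all permutations as a stand-in for the Hungarian algorithm (which finds the same minimum far more efficiently). All function names are hypothetical.

```python
from itertools import permutations
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def predict_constant_velocity(prev: Point, last: Point) -> Point:
    """Extrapolate one time step ahead, assuming the last displacement repeats."""
    return (2 * last[0] - prev[0], 2 * last[1] - prev[1])


def cost_matrix(predictions: Sequence[Point], measurements: Sequence[Point]) -> List[List[float]]:
    """Rows are trajectories, columns are measurements, entries are distances."""
    return [[hypot(p[0] - m[0], p[1] - m[1]) for m in measurements] for p in predictions]


def solve_assignment(cost: List[List[float]]) -> Tuple[List[int], float]:
    """Brute-force minimum-cost assignment for a square matrix.

    Returns (cols, total), where cols[r] is the measurement given to trajectory r.
    The running time is factorial in N, so this is for illustration only.
    """
    n = len(cost)

    def total(cols: Sequence[int]) -> float:
        return sum(cost[r][cols[r]] for r in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


# Two hypothetical trajectories, each given by its last two positions.
tracks = [((0.0, 0.0), (1.0, 0.0)), ((5.0, 5.0), (5.0, 4.0))]
measurements = [(5.0, 3.2), (2.1, 0.0)]

predictions = [predict_constant_velocity(prev, last) for prev, last in tracks]
assignment, total_cost = solve_assignment(cost_matrix(predictions, measurements))
print(assignment, round(total_cost, 3))
```

In this example the first trajectory is predicted at (2, 0) and the second at (5, 3), so the minimum-cost solution gives the second measurement to the first trajectory and the first measurement to the second.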
The association method, as described above, is clearly incomplete. As new objects come into view, they will generate measurements that don’t correspond to existing trajectories. Similarly, as objects leave the field of view, there will be trajectories for which there are no measurements. The extensions to handle these are relatively simple. Harder to deal with are the cases where multiple measurements correspond to a single trajectory and the cases where a trajectory doesn’t generate any measurements, for example, when the object is occluded by something else or the measurement system fails for some reason. Another hard case is a measurement for a trajectory that is a long way from the correct location, as can happen if an object is only partially detected.
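One common way to handle the simpler extensions (objects entering and leaving) is to gate the proposed associations: any pairing whose cost exceeds a threshold is rejected, leaving the measurement to start a new trajectory and the trajectory to be marked as having no measurement. The sketch below assumes this gating approach. It is not taken from SORT or from the Countculture tracker, and the names apply_gate and gate are hypothetical.

```python
from typing import List, Optional, Tuple


def apply_gate(
    cost: List[List[float]],
    assignment: List[Optional[int]],
    gate: float,
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Split a proposed assignment into three groups.

    cost[t][m] is the cost of giving measurement m to trajectory t, and
    assignment[t] is the proposed measurement for trajectory t (or None).
    Returns (accepted (trajectory, measurement) pairs,
             trajectories left without a measurement,
             measurements left without a trajectory).
    """
    n_measurements = len(cost[0]) if cost else 0
    matches: List[Tuple[int, int]] = []
    unmatched_tracks: List[int] = []
    used = set()
    for track, meas in enumerate(assignment):
        if meas is not None and cost[track][meas] <= gate:
            matches.append((track, meas))
            used.add(meas)
        else:
            unmatched_tracks.append(track)  # candidate for "object left or occluded"
    unmatched_measurements = [m for m in range(n_measurements) if m not in used]
    return matches, unmatched_tracks, unmatched_measurements


# Hypothetical costs: trajectory 1 has no measurement anywhere near it.
costs = [[0.1, 4.4], [4.2, 9.0]]
print(apply_gate(costs, assignment=[0, 1], gate=2.0))
```

Here trajectory 0 keeps measurement 0, trajectory 1 is left unmatched, and measurement 1 becomes a candidate for a new trajectory. The harder cases in the text (several measurements per object, badly located measurements) are not addressed by gating alone.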
The Countculture tracker also constructs an affinity matrix, but in such a way that it addresses all of the above concerns, whereas SORT, which the authors describe as using “rudimentary data association and state estimation techniques”(2), does not. But at the end of the day, tracking is neither AI nor CV. It belongs to the Engineering field of Control Theory, draws on the Statistics field of Time Series Analysis, and is covered in texts such as Bar-Shalom and Fortmann(5). The fundamental differences between the Countculture approach and others do not lie in the tracker. So far, we have described the tracker as being driven by a set of measurements from each frame time without much detail on what is meant by measurements, other than to hint that it means the locations of the detected objects. Of course, the raw input in this situation is a sequence of images. Therefore, these must be processed to detect all the objects and determine their locations. This is commonly referred to as object detection, and the combination of an object detector and a tracker is commonly called tracking by detection. It is in this front-end processing that the fundamental difference between Countculture’s BGS approach and neural network-based methods lies.
Object detection is the process of taking an image and processing it to detect all of the objects of interest (whatever they are defined to be) and possibly identifying what sort of object each one is, if multiple sorts of interest can be visible simultaneously. It is undoubtedly true that neural networks have become the standard approach to object detection for the purposes of competitions such as ILSVRC(6) and PascalVOC(7). But the key question is how performance on a competition benchmark translates to the performance of counting in practice.
SORT uses Faster R-CNN as its object detector, but many people have replaced that with a single-step detector such as YOLO. Many alternatives could be used, representing a range of options based on computing requirements and performance. But independent of the network architecture, the two most important questions are:
1. What is the cost of re-training, and how often does it need to be done?
2. What is the achievable accuracy?
As indicated above, a practical neural network approach will almost certainly use a network architecture developed by some research group instead of designing its own, due to the detailed knowledge needed to do such designs and the phenomenal resources needed for neural architecture search. Access to pre-developed networks is available in model zoos associated with neural network implementation libraries such as TensorFlow, PyTorch, MXNet, Darknet, OpenCV, etc. These networks have multiple millions of parameters. Example parameter sets are often included in the zoo, such as a parameter set trained on the ImageNet image collection for some object detection competition. In the early days, mainly based on a poor interpretation of the universal approximation theorem, it was believed that once a trained parameter set was available, it could be used for all subsequent applications with equal accuracy. This has turned out to be a myth. The ability of a trained network to generalise to other data sets has been shown to be quite limited in many cases and restricted to data sets that have very similar statistics to the ones used for training. For example, the ImageNet image collection contains images of people taken from the viewpoint of a standing person or a vehicle-mounted camera. A neural network trained using this dataset won’t necessarily work with any accuracy on images from a typical camera used for counting, which is mounted much higher and tilted down to some extent. Similarly, the vehicles in the ImageNet collection reflect the period and places in which its images were gathered. The upshot of all this is that a neural network will most often need training for each particular installation. Moreover, it may need retraining at regular intervals to keep it working.
The cost of training comes in two parts:
1. The cost of collecting the data needed.
2. The cost of tuning the parameters, i.e., training the neural network.
The data needed for training is a large number of example images of the objects of interest, annotated manually with a bounding box around each such object and a label of the object type. So, costs and time are associated with collecting and storing the video, plus the costs associated with getting the data annotated.
Training a neural network containing multiple millions of parameters is compute-intensive, often requiring many hours of time with multiple GPUs. Training a network from scratch will often require millions of examples too. Some techniques, such as transfer learning and fine-tuning, reduce the compute resources and the number of examples needed somewhat.
The performance will slowly degrade over time due to changes in the appearance of objects, such as the evolution of vehicle body styles, and even the impact of external factors such as the weather. This means the network will need retraining at regular intervals.
The effort involved in training inevitably leads to adopting an MLOps pipeline for managing the whole process.
From an end-user point of view, three approaches could be taken:
· Go with a low-cost vendor offering a neural network with fixed parameters, and hope it delivers acceptable accuracy.
· Go with a vendor who will customise the parameters to an installation and review them over time. Presumably, the vendor will pass on the associated costs.
· Do the training in-house.
Contrast this with the BGS algorithm used by Countculture. It automatically learns and maintains the background over time. The algorithm is controlled by two parameters, for which one of three combinations is used (one for normal situations, one for the outdoors with sunshine, and one for where people congregate for long periods, such as a queue). In addition, two different algorithms are available (one for normal situations and one for areas with variable backgrounds, such as auto-sliding doors). The rules for selecting which parameter set and which algorithm to use are simple and easily applied before installation. There is also a set of five other parameters, controlling the interpretation of the BGS output, that need tuning. We have an automated optimisation process for tuning these parameters. Moreover, it is driven by the ground-truth data needed to generate an accuracy report, not additional data like object bounding boxes.
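For readers unfamiliar with background subtraction, the general idea can be sketched with the simplest textbook variant, a running-average background model. This is explicitly not the Countculture algorithm, whose details are not given here. The parameters learning_rate and threshold are assumptions chosen for illustration and are not claimed to be the two parameters mentioned above. Images are represented as plain lists of grayscale rows to keep the sketch dependency-free.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def foreground_mask(background: Image, frame: Image, threshold: float) -> List[List[bool]]:
    """Mark pixels that differ from the background model by more than threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def update_background(background: Image, frame: Image, learning_rate: float) -> Image:
    """Blend the new frame into the background model (exponential running average)."""
    return [
        [(1 - learning_rate) * b + learning_rate * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# A hypothetical 2x3 scene: one bright pixel appears, one pixel drifts slightly.
background = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
frame = [[10.0, 200.0, 10.0], [10.0, 10.0, 12.0]]

mask = foreground_mask(background, frame, threshold=30.0)
background = update_background(background, frame, learning_rate=0.05)
print(mask)
print(background[0][1])
```

The bright pixel is flagged as foreground, while the small drift is absorbed into the background model without any labelled training data. Connected foreground pixels would then be grouped into object measurements for the tracker.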
When it comes to measuring accuracy, the measures of concern in a counting application are those related to the accuracy of the generated counts, i.e., the end-to-end accuracy of the system. But most papers about neural network-based object detectors focus just on the object detection aspect, and many only consider mAP, the mean average precision. Loosely speaking, that is the fraction, on average, of the detections generated by the object detector that are actually correct. Few object detectors achieve a mAP above 60%, which is to say that at least 40% of the object detections they generate are errors. A minimal counting scenario would be one object detection opportunity prior to the counting zone/line and one after, with both needed to form a trajectory and generate the count. For this, if we take 60% as the chance of a correct detection at each opportunity, a mAP of 60% translates to a 36% chance (0.6 x 0.6), on average, of correctly detecting the count. It could be argued that there are normally many more opportunities on either side, so the chance of a correct detection will be higher. For example, if there were three opportunities on either side, and the likelihoods of detecting are independent, then the chance is 88% (the chance of at least one detection on a given side is 1 - 0.4^3 = 93.6%, and both sides are needed). But given the nature of a neural network, if it fails on one image, it is highly likely that it will fail on a similar image, such as the next frame of a video, so the detections are not independent, and the chance will be somewhere in between. Note also that since 40% of the detections are errors, there will be significant opportunities for the tracker to associate detection errors into a trajectory, leading to an overcount. The other side of precision is recall, the fraction of true objects that are actually detected. This can be related to missed trajectories and missed counts. From all this, it is clear that the accuracy of a counting system based on a neural network detector could be well short of expectations.
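The two figures quoted above can be reproduced with a short calculation. The sketch assumes, as the text does, that each detection opportunity succeeds independently with probability 0.6, and that a count needs at least one detection on each side of the counting line. The function name chance_of_count is hypothetical.

```python
def chance_of_count(p_detect: float, opportunities_per_side: int) -> float:
    """Chance of at least one detection on each side of the counting line,
    assuming every opportunity is independent with success probability p_detect."""
    p_side = 1 - (1 - p_detect) ** opportunities_per_side
    return p_side ** 2  # both sides are needed to form the trajectory


one_each_side = chance_of_count(0.6, 1)    # 0.6 * 0.6
three_each_side = chance_of_count(0.6, 3)  # (1 - 0.4**3) ** 2
print(f"{one_each_side:.0%} {three_each_side:.0%}")
```

Because consecutive video frames are strongly correlated, the independence assumption is optimistic, which is why the realistic figure lies between the two values.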
Contrast that with the BGS used by Countculture, which detects objects with enough accuracy, and such small amounts of noise, that we are able to guarantee 98% accuracy of the resulting counts.
Another aspect of the comparison is the hardware needed to implement a system. The original neural network would have been developed using multiple GPUs to achieve requisite frame rates. Techniques such as lower floating-point precision, integer quantisation, network trimming, reducing the resolution of the images processed, etc., can be used to reduce the computation requirements so it can be deployed on limited hardware such as an Nvidia Jetson or a Google Coral neural processor. But all these techniques involve a reduction in accuracy to some extent. In contrast, the BGS algorithm can run at full speed and accuracy on a Raspberry Pi.
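As a small illustration of why such techniques cost accuracy, the sketch below applies symmetric 8-bit integer quantisation to a handful of made-up weight values and measures the rounding error introduced. It is a generic textbook scheme, not the procedure of any particular toolkit, and the function names and weight values are hypothetical.

```python
from typing import List, Tuple


def quantise_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map a non-empty list of floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all zero
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantised]


weights = [0.5, -1.27, 0.003, 0.9]  # made-up values for illustration
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, max(errors))
```

The small weight 0.003 is rounded to zero, so its contribution is lost entirely. Across millions of parameters, errors of this kind are what reduce the accuracy of a quantised network.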
In conclusion, we have peeled away the marketing claims and shown that the AI systems of others are actually similar in form to the Countculture system, in that both use tracking by detection. The major difference is in the object detection front end, where Countculture uses background subtraction and the others use neural networks. We have discussed the heavy burden that a neural network imposes due to the need for continual re-training to maintain accuracy, and even then, the counting accuracy attained is often substantially lower than that of the Countculture system.
1. Alex Bewley et al., “Simple Online and Realtime Tracking”, ICIP 2016.
2. SORT source code available at https://github.com/abewley/sort.
5. Bar-Shalom and Fortmann, “Tracking and Data Association”, Academic Press, 1988.