Analysis of Algorithms for Soccer Match Movement Identification

Figure 1: Player detection in a video recorded at Alfheim Stadium (Norway), 2013-11-03. The figure shows the player detection result of our statistical method. Red rectangles/points indicate the red team; blue rectangles/points indicate the blue team.

Abstract

Machine learning methods have been widely used in soccer game analysis for decades. Most of these soccer game analysis models are built for research or for commercial tracking purposes. This project applies three popular methods to identify soccer players in low-resolution videos shot by three static cameras. It separates players from the background and identifies each player with statistical and machine learning algorithms, comparing three models: a statistical model, Dbscan clustering, and a pretrained MaskRCNN model. Identification accuracy is measured by total ICP (Iterative Closest Point) distances. Based on the results, the statistical model performs best, Dbscan performs slightly worse, and the pretrained MaskRCNN model can hardly handle our dataset. The system is highly interactive and provides multiple views of the processed results, making visual validation more convenient.

KEYWORDS

soccer game, machine learning, video processing, human detection

Introduction

Soccer has been popular all over the world for hundreds of years. As one of the world's most popular sports, soccer brings considerable commercial benefits to players, coaches, clubs, and the media. Recording and analyzing soccer games helps clubs predict their opponents' behavior. Soccer fans can also learn skills by analyzing soccer videos. Many people spend plenty of time watching and analyzing matches to rate teams' abilities. However, manual analysis is time-consuming and full of blind spots. Statistical and machine learning methods can help extract hot spots and thus make soccer play analysis more efficient.

Most current soccer game analysis models operate on broadcast streams or commercial videos[1]. These videos are shot by authorized organizations such as the match organizer or the media, which usually own professional equipment and employ professional camera operators. Therefore, these videos are usually of high resolution. However, amateur soccer groups may also want to analyze their match videos. Videos shot by non-professionals may be of low resolution due to camera quality. Besides, an unofficial camera operator (e.g., a fan) may not be allowed to take the best spot for shooting. All these factors can greatly affect video quality.

Based on Manafifard et al.'s survey[6], tasks in soccer match analysis include playfield detection, player detection, occlusion resolution, player tracking, appearance modeling, and player labeling (Figure 2). Types of input videos include broadcast streams, single stationary cameras, multiple moving cameras, and mixed cameras.

Due to time limitations, in this project, we specifically aim to handle playfield detection and player detection problems for multiple static cameras. Our input videos are downloaded from a soccer video dataset. The cameras are far from the playfield, and thus each player occupies only tens of pixels. We applied three different player detection methods to our dataset and compared their performance. The result shows that none of these three methods can efficiently handle the player detection task.

Figure 2: Tasks in soccer match analysis

2 TECHNIQUES TO TACKLE THE PROBLEM

2.1 Related work  

During the past 20 years, plenty of algorithms have been proposed to solve different problems on different types of data. Cuevas et al.[1] suggest that the most popular player detection methods include Key-Point Detection (KPD), Background Subtraction (BGS), Region-Based Segmentation (RBS), Supervised Learning (SL), and Silhouette Detection (SD), without mentioning any machine learning method. Komorowski et al.[5] train a ball detection model based on MaskRCNN[4]. Komorowski claims that a pretrained MaskRCNN model (https://github.com/matterport/Mask_RCNN) works fine for player detection in their videos. Gerke et al.[2] identify players by feeding constellation features and jersey number features to a convolutional neural network. Giancola et al.[3] propose a neural network that identifies soccer game events such as passes and yellow/red cards.

2.2 Dataset

Our dataset is downloaded from https://datasets.simula.no/alfheim/. This dataset provides three static camera videos (left, mid, and right camera views to cover the whole playfield) and a panorama video of two games. For validation, the dataset provides playfield coordinates of 11 players (the red team in the video) every 0.05 seconds. The "playfield coordinate" refers to a 2D Cartesian coordinate system with the bottom-left corner of the playfield at (0, 0) and the top-right corner at (105, 68). Aside from coordinates, the dataset also provides player id, facing direction, moving direction, energy, speed, and total moving distance. Directions are float numbers describing the angle of divergence from the y-axis. Energy is estimated from step frequency. Speed is in m/s.

Due to hardware limitations, only the first 30 seconds of the 2013-11-03 video are used in this project.

2.3 Approach

The videos' pre-processing can be divided into three steps: video stitching, background extraction, and coordinate transformation matrix calculation. The first step is to stitch the three static camera videos together. As the dataset's panorama video is merged by cylindrical projection instead of homography projection, its screen coordinates are hard to transform into ground-truth coordinates. Therefore, we compute a homography transformation matrix ourselves for video stitching. FFmpeg is used to concatenate ten three-second segments into 30-second short videos (900 frames).
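For the concatenation step, FFmpeg's concat demuxer can join segments without re-encoding. The sketch below is a minimal example of this step; the segment and output file names are assumptions, not the exact ones used in the project.

    # Minimal sketch: join ten 3-second segments into one 30-second clip with
    # FFmpeg's concat demuxer. File names are hypothetical.
    import subprocess

    segment_names = [f"segment_{i:02d}.mp4" for i in range(10)]
    with open("segments.txt", "w") as f:
        for name in segment_names:
            f.write(f"file '{name}'\n")

    # "-c copy" avoids re-encoding; the segments must share codec and resolution.
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "segments.txt",
         "-c", "copy", "clip_30s.mp4"],
        check=True,
    )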

For camera view stitching, OpenCV's SIFT is employed for feature extraction. Then, the homography matrices are found with the OpenCV function "findHomography()." Finally, the views are stitched together with the "PerspectiveTransform()" method. Background extraction calculates the mean pixel values over multiple frames. Moving objects are filtered out because each pixel shows the playfield's color most of the time. This approach works well for this dataset because the camera views are static.
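A minimal sketch of these two steps is given below, assuming OpenCV (cv2.SIFT_create requires OpenCV 4.4 or later). The ratio-test threshold and RANSAC reprojection error are assumptions, not values tuned in the project.

    # Sketch: homography estimation between two overlapping camera views, and
    # background extraction by averaging frames. Thresholds are assumptions.
    import cv2
    import numpy as np

    def estimate_homography(img_src, img_dst, ratio=0.75):
        """Estimate the homography mapping img_src onto img_dst via SIFT + RANSAC."""
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img_src, None)
        kp2, des2 = sift.detectAndCompute(img_dst, None)
        matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < ratio * n.distance]  # Lowe's ratio test
        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H

    def extract_background(frames):
        """Mean of many frames; moving players average out to the field color."""
        return np.mean(np.stack(frames).astype(np.float32), axis=0).astype(np.uint8)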

The ground truth and the output result should be in the same coordinate system for model validation. Homography transformation matrices are used to turn the perspective view into the top-view/real-world coordinates. A homography matrix needs at least four control points. We downloaded a top-view image from Google and manually picked the playfield's four corners as control points in both the perspective view and the top view. The transformation result from the perspective view to the top view is shown in Figure 3.
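The sketch below illustrates this corner-based mapping with OpenCV; the pixel coordinates of the corners are hypothetical placeholders for the hand-picked values.

    # Sketch: map detections from the stitched perspective view to the top view
    # using the four manually picked playfield corners (coordinates are hypothetical).
    import cv2
    import numpy as np

    src_corners = np.float32([[120, 640], [2600, 655], [1980, 90], [760, 85]])   # perspective view
    dst_corners = np.float32([[0, 710], [2748, 710], [2748, 0], [0, 0]])          # top view

    H = cv2.getPerspectiveTransform(src_corners, dst_corners)

    # Project player foot points (bottom-center of each bounding box) into the top view.
    foot_points = np.float32([[[1500, 400]], [[900, 520]]])   # hypothetical detections
    top_view_points = cv2.perspectiveTransform(foot_points, H)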

Figure 3: Transformed perspective view based on top view coordinates.  

2.3.1 Statistical method. The statistical method's basic idea is to detect the contours of each player and classify players by the average hue within each contour's bounding box. Hue is chosen over RGB because RGB may vary with lighting conditions. Before detecting players, we need to acquire a standard hue for team1 players, the team1 keeper, team2 players, the team2 keeper, and the referees. The first frame is extracted for feature selection. We manually draw a bounding box around one sample character of each class and calculate the mean hue within each box. These standard hues are then used as the classification centers of this method.
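A minimal sketch of this reference-hue step is shown below, assuming frames are read as BGR images with OpenCV; the box coordinates are hypothetical stand-ins for the manually drawn ones.

    # Sketch: compute the standard hue of each class from manually drawn boxes.
    # Box coordinates are hypothetical; OpenCV's hue channel ranges from 0 to 179.
    import cv2
    import numpy as np

    def mean_hue(frame_bgr, box):
        """Average hue inside a box given as (x, y, width, height)."""
        x, y, w, h = box
        hsv = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        return float(np.mean(hsv[:, :, 0]))

    reference_boxes = {
        "team1_player": (410, 220, 8, 18),
        "team1_keeper": (1200, 300, 8, 18),
        "team2_player": (700, 250, 8, 18),
        "team2_keeper": (1600, 310, 8, 18),
        "referee": (950, 280, 8, 18),
    }
    # standard_hues = {name: mean_hue(first_frame, box) for name, box in reference_boxes.items()}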

Player detection includes four steps. The first step is to extract the foreground by subtracting the background from the video frame (Figure 4).

The second step is to find contours with OpenCV's "findContours()". Then the bounding box of each person is identified. Boxes with a width larger than their height are filtered out, considering that players usually stand upright. Finally, the average hue of each box is computed for classification. We classify each person by computing the distance from the box's hue to each standard hue. If the hue of the current bounding box is five units or more from its closest standard hue, the box is classified as "noise." Otherwise, the bounding box is assigned to the team with the most similar hue. Each box's coordinate is the center of its bottom edge.
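The sketch below strings these four steps together, reusing the mean-hue idea above. The binary threshold on the frame difference is an assumption added to make the foreground mask explicit.

    # Sketch of the four detection steps; the grayscale threshold (25) is an assumption.
    import cv2
    import numpy as np

    def detect_players(frame_bgr, background_bgr, standard_hues, noise_threshold=5.0):
        # Step 1: foreground by background subtraction.
        diff = cv2.absdiff(frame_bgr, background_bgr)
        gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, 25, 255, cv2.THRESH_BINARY)

        # Step 2: contours and bounding boxes.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)

        detections = []
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w > h:                       # Step 3: keep only standing-shaped boxes.
                continue
            hue = float(np.mean(hsv[y:y + h, x:x + w, 0]))
            team, dist = min(((name, abs(hue - ref)) for name, ref in standard_hues.items()),
                             key=lambda pair: pair[1])
            if dist > noise_threshold:      # Step 4: too far from every standard hue -> noise.
                continue
            detections.append((team, (x + w // 2, y + h)))   # bottom-center as the foot point
        return detections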

Figure 4: Absolute difference between the background and the first video frame.

2.3.2 Dbscan. Our dataset only provides ground truth for half of the players, and it is hard to link each coordinate with a specific player. That is, we cannot connect each player with a specific label. In this case, we decided to choose an unsupervised model for classification. Dbscan is a density-based partial clustering method that classifies graph nodes by the distances among them. A Dbscan model requires two parameters:

• eps: two points are considered neighbors if the distance between them is smaller than or equal to eps.

• minPoints: the minimum number of points required to form a cluster. Initially, a point is marked as unvisited. Then the point's neighborhood is retrieved. If the point can find sufficient neighbors, a cluster is established. Otherwise, the point is labeled as noise. The point can still be assigned to a cluster if another point finds it as a neighbor.

Figure 5: Foreground extraction for Dbscan. Background is masked as purple.

In the project, three features of each foreground pixel are fed to the Dbscan model: coordinate x, coordinate y, and hue. The coordinates separate detached players, while hue helps resolve occlusion among crowded players. We first extract pixels whose hue is "far" from the hue of the corresponding background pixel. Here, "far" means the absolute difference is larger than 0.05 ∗ 180 (the hue range is 0 to 179). Pixels outside the playfield's boundary are cut off with linear functions inferred from the four corners (Figure 5). The remaining pixels' screen coordinates and hues are recorded as features. The 3D distances among the pixels are calculated as inputs for Dbscan.
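A minimal sketch of the hue-based foreground extraction is shown below (the playfield boundary cut is omitted for brevity); it assumes the frame and background are BGR images read with OpenCV.

    # Sketch: keep pixels whose hue differs from the background hue by more than
    # 0.05 * 180 = 9 units; the cut-off along the playfield edges is omitted.
    import cv2
    import numpy as np

    def foreground_pixels(frame_bgr, background_bgr, threshold=0.05 * 180):
        frame_hue = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 0].astype(np.int16)
        bg_hue = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV)[:, :, 0].astype(np.int16)
        mask = np.abs(frame_hue - bg_hue) > threshold
        ys, xs = np.nonzero(mask)
        return xs, ys, frame_hue[ys, xs]   # x, y, and hue features for Dbscan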

Considering that each player occupies only a few pixels, we set "minPoints" to 4. The best value of "eps" is found by ten-fold cross-validation. The model is trained on 50% of all the frames.
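The clustering step itself can be sketched with scikit-learn as below; the min-max scaling of the features is an assumption we add so that a single eps on the order of 0.007 is meaningful for both coordinates and hue.

    # Sketch: cluster foreground pixels with Dbscan; eps = 0.007 and min_samples = 4
    # follow Section 3, and the min-max scaling of features is an assumption.
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import MinMaxScaler

    def cluster_foreground(xs, ys, hues, eps=0.007, min_samples=4):
        features = np.column_stack([xs, ys, hues]).astype(np.float32)
        features = MinMaxScaler().fit_transform(features)   # put x, y, hue on comparable scales
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
        return labels   # -1 marks noise; other integers index player clusters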

2.3.3 MaskRCNN. MaskRCNN is considered one of the best deep learning models for image segmentation. Therefore, we chose this model to represent CNN models. MaskRCNN is an evolved form of Faster R-CNN, which is not designed for pixel-to-pixel alignment. MaskRCNN uses a ConvNet backbone to extract features from the image. These features are passed to a Region Proposal Network, which uses a lightweight binary classifier to predict whether an object is present and returns candidate bounding boxes (Regions of Interest, RoIs). The RoIAlign layer then warps the features of each candidate box to a fixed dimension. The warped features are fed to fully connected layers that output a class label and a bounding-box offset for each candidate, and in parallel to a mask branch of convolutional layers that generates a binary mask for every class without competition among classes.

Figure 6: MaskRCNN architecture.

As mentioned in Section 2.3.2, we cannot train any model with our current labels. Considering that Komorowski et al.[5] successfully detect players in their video with a pretrained MaskRCNN model, we decided to use the same model for our dataset. Our code is copied from https://colab.research.google.com/github/tensorflow/tpu/blob/master/models/ofcial/mask_rcnn/mask_rcnn_demo.ipynb#scrollTo=2oZWLz4xXsyQ   and our version is https://colab.research.google.com/drive/1znBYyjlYsNN Ac59bsrIiqOV0iFshObv#scrollTo=CuDqEsMzdHfF.
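For reference, inference with the pretrained matterport model cited in Section 2.1 can be sketched as below. This is an assumption-laden outline of that repository's API rather than the exact notebook code we ran, and the weight file and frame names are placeholders.

    # Sketch: run a COCO-pretrained Mask R-CNN (matterport implementation) on one frame
    # and keep only "person" detections. Paths and file names are hypothetical.
    import skimage.io
    from mrcnn.config import Config
    from mrcnn import model as modellib

    class InferenceConfig(Config):
        NAME = "coco_inference"
        NUM_CLASSES = 1 + 80      # COCO: background + 80 classes
        GPU_COUNT = 1
        IMAGES_PER_GPU = 1

    model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="./logs")
    model.load_weights("mask_rcnn_coco.h5", by_name=True)   # pretrained COCO weights

    frame = skimage.io.imread("frame_0001.png")
    r = model.detect([frame], verbose=0)[0]
    # r["rois"], r["class_ids"], r["scores"], r["masks"]; class id 1 is "person" in COCO
    person_boxes = r["rois"][r["class_ids"] == 1]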

3 EMPIRICAL EVALUATION

From Section 2, we can conclude that the statistical method requires the most manual input and is thus the hardest to use. However, only the statistical model can run in real time, which is evidence of its simplicity. As our MaskRCNN model runs on Google's free TPU, it is hard to compare its time complexity with the other two models. However, MaskRCNN would be the most complex model if every model were programmed from scratch.

As it is hard to link a player with a coordinate label in our dataset, it becomes challenging to evaluate our models quantitatively. Iterative Closest Point (ICP) is chosen for accuracy evaluation. Consider two groups of points A and B. ICP identifies, for each point in group B, the closest point in group A. With ICP, the average distance between closest points from the two groups can be calculated. In our case, we find each ground-truth point's closest point in our classification result and then calculate the average distance over all point pairs. The smaller the distance, the more accurate the classification method is considered to be.
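A minimal sketch of this score is given below, assuming both point sets are already in the same coordinate system; the KD-tree nearest-neighbor lookup is our choice of illustration, not necessarily the implementation used in the project.

    # Sketch: for each ground-truth point, find the nearest detected point and
    # average the distances. Uses scipy's KD-tree for the nearest-neighbor query.
    import numpy as np
    from scipy.spatial import cKDTree

    def average_icp_distance(ground_truth, detections):
        """Both arguments are (N, 2) arrays in the same coordinate system."""
        tree = cKDTree(np.asarray(detections, dtype=float))
        distances, _ = tree.query(np.asarray(ground_truth, dtype=float), k=1)
        return float(np.mean(distances))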

3.1 Qualitative evaluation

The marked video of the statistical model is shown in Figure 7. The figure shows that the algorithm can identify most of the players. However, it tends to mark things such as shadows and boundary line segments as players. Besides, the statistical method can hardly distinguish between the two teams.

Figure 7: Video marked by the statistical method.

We performed a ten-fold cross-validation on 50% of the data to train a good "eps". "eps" ranges from 0.001 to 0.024 and is plotted against average ICP distances. Figure 8 shows an apparent global minimum at around 0.007. The video frame marked by Dbscan is shown in Figure 9. As we can see, Dbscan tends to split a person into several parts due to the hue difference among different body parts.

The pretrained MaskRCNN model performs the worst on our dataset. As we can see in Figures 10 and 11, MaskRCNN can hardly identify the contours of players. For the objects it does mark, MaskRCNN cannot classify them as "person" either. The likely reason is that MaskRCNN relies heavily on the shape of human beings, which is largely lost at this resolution.

3.2 Quantitative evaluation

Considering that the pretrained MaskRCNN model cannot handle our dataset, quantitative evaluation is only performed on our statistical model and Dbscan. The test data consists of the 50% of frames not used to train "eps." The average screen-space ICP distance of our statistical model is 27.84, and that of Dbscan is 28.66. It is hard to map this distance to a real-world distance because distances are distorted by the transformation. On a 2748 ∗ 710 frame, 27.84 is about 1% of 2748, which seems to be an acceptable error rate.

Figure 8: Plot of eps against average ICP distances

Figure 9: Video marked by Dbscan. eps = 0.007, minPoints = 4.

Figure 10: MaskRCNN on original frame image.

Figure 11: MaskRCNN on original frame image.

4 CONCLUSION

In this project, we applied a statistical model, a Dbscan model, and a pretrained MaskRCNN model to player detection in a low-resolution video. Based on Section 3, we can conclude that the statistical model performs best on our dataset. However, its average ICP distance is still not low enough, and visual validation shows plenty of classification errors.

In the future, we will first label the data better so that we can test more supervised machine learning methods. Then, we will train our own MaskRCNN model instead of using a pretrained one. We should also improve our quantitative evaluation method, as it currently measures only the false-negative error. For example, we could add a penalty factor when a method identifies many more or fewer than 22 objects.

References


[1] Carlos Cuevas, Daniel Quilón, and Narciso García. 2020. Techniques and applications for soccer video analysis: A survey. Multimedia Tools and Applications (2020), 1-37.

[2] Sebastian Gerke, Antje Linnemann, and Karsten Müller. 2017. Soccer player recognition using spatial constellation features and jersey number recognition. Computer Vision and Image Understanding 159 (2017), 105-115.

[3] Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. 2018. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1711-1721.

[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961-2969.

[5] Jacek Komorowski, Grzegorz Kurzejamski, and Grzegorz Sarwas. 2019. Deepball: Deep neural-network ball detector. arXiv preprint arXiv:1902.07304 (2019).

[6] Mehrtash Manafifard, Hamid Ebadi, and H Abrishami Moghaddam. 2017. A survey on player tracking in soccer videos. Computer Vision and Image Understanding 159 (2017), 19-46.


Contribution

Yanshen Sun

1. Modify the code of Dbscan and MaskRCNN to get the qualitative results for analysis

2. Polish and add more content to the final report

​Jun Xiao

1. Find the code for Dbscan and MaskRCNN and try to run the MaskRCNN

2. Make the project video, build the project website and write draft for report
