COSC3000 - Visualization, Computer Graphics
& Data Analysis
Data Visualisation Project Report
1 Introduction
“Some people believe football is a matter of life and death, I am very
disappointed with that attitude. I can assure you it is much, much more
important than that...” - Bill Shankly (Former footballer and manager)
For as long as competitive sport has existed, so too has the desire to be (or at least
support) the very best. Whether or not it is a positive development, modern times
have seen sport become a professional endeavour, with vast sums of money involved
in nearly every aspect. This has led to increasingly sophisticated techniques being used
to analyse outcomes, improve performance, and predict results.
While sports statistics and analytics have always been used in this regard, the use
of sabermetrics in baseball and the subsequent book/film Moneyball have created an
upsurge in their popularity and usage. One sport which is notoriously difficult to
analyse due to low scores and tight margins is association football.
Manchester United Football Club is one of the most recognisable names in world
sport. This is due in large part to their great success on the field, especially in modern
times under the tenure of now retired manager Sir Alex Ferguson. Competing mainly
in the English Premier League, they have been crowned champions a record 13 times
since the inception of the competition in 1992.
1.1 Aims
This report will examine the English Premier League data between 1995 and 2013
in an attempt to determine both the level of Manchester United’s success and the
potential reasons for it. This will be achieved mainly through the use of data
visualisation to highlight patterns or interesting trends in the data not readily visible
otherwise.
1.2 Background
For those unfamiliar with the English Premier League, it is the most popular domestic
club football competition in the world. Formed in 1992 from the previous Division 1 of
the English football league system, its season begins in August and runs through until
May of the following year. Since the 1995/96 season it has consisted of 20 teams in
total, with each playing all others twice - once at their home ground and once at
their opponent's home ground. A team is awarded 3 points for a win, 1 point for a
draw, and no points for a loss, with the overall winner being the team with the most
points at the end of the season. The three teams with the fewest points are
relegated to the league below.
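As a brief worked illustration of the format described above, the following sketch computes the number of games per team, the number of fixtures in a 20-team season, and the points total under the 3/1/0 rule. The function and variable names are illustrative only and do not come from the project code.

```python
def league_points(wins, draws, losses=0):
    """Points under the 3 (win) / 1 (draw) / 0 (loss) rule."""
    return 3 * wins + 1 * draws + 0 * losses

TEAMS = 20
# Each team meets every other team twice: once at home, once away.
games_per_team = 2 * (TEAMS - 1)
# Every ordered (home, away) pairing occurs exactly once per season.
total_fixtures = TEAMS * (TEAMS - 1)
# Upper bound on a season's points: winning every game.
max_points = league_points(wins=games_per_team, draws=0)

print(games_per_team, total_fixtures, max_points)
```

A team winning every game would therefore finish on 3 points for each of its games, which gives a sense of scale for the per-game averages used later in the report.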
2 Methods
2.1 Data Collection
Sourcing a single reliable data repository for all desired data proved troublesome.
Such services do exist; however, due to the “big business” nature of professional
sport, access to them was prohibitively expensive. As such, community- and fan-created
resources had to be relied upon. The quality and reliability of any analyses
performed on these data are thus only as good as the data themselves. With that
being said, some measure of insight can still be found.
The data used in this report came mainly from four separate sources:
• Wikipedia list of Manchester United seasons¹
• Bookmaker data aggregation site Football-Data.co.uk²
• Transfer spending data aggregation site transfermarkt.com³
• UK football stadium site doogal.co.uk⁴
The collection method varied between sources, depending on how the
data were presented. For the Wikipedia articles, the Python programming language
and the HTML parsing library BeautifulSoup were used. Firstly, each of the required
links off the main page¹ was looped through and the HTML source saved locally. A
detailed description of how the data were processed into a workable state
is provided in the subsequent section. Some of the details provided in this dataset
included the date, opponent, location, full-time result and score, times of goals,
and attendance for each Manchester United game in each of the seasons.
Figure 1: Example Wikipedia Data
The bookmaker aggregator data² were presented as a series of CSV files. The Chrome
plugin Download Master was used to automatically download every Premier League
data file for the required seasons. The subsequent section details how these data were
processed into a usable format. This dataset included much of the same data as the
Wikipedia entries, but for every game in every season. It also included
referee and betting odds data for games from the 2000/2001 season onwards.
For the transfer spending data³, complex HTML tables made parsing somewhat
difficult. As such, the data were copied and pasted manually for each season
and team into CSV files for later processing. Provided was the total spending of each
Premier League club (and the league as a whole) for each of the Summer transfer
windows. The Summer transfer window is one of the two sanctioned periods in which
player purchases and transfers are allowed. Unfortunately, data for the second (January)
transfer window were unavailable. The Summer window is, however,
the longer period and is often when clubs do most of their purchasing.
The football ground location data were presented nicely in table format⁴. As such
they were simply pasted into a CSV file for later processing. Provided were the
stadium name, the team that plays there, the capacity, and coordinates of the ground.
Figure 2: Example Location Data
2.2 Data Processing and Cleansing
The Python library BeautifulSoup was used to parse the HTML files downloaded
for all Wikipedia entries. A loop over the stored files pulled the required
information into dictionary objects indexed by season and game number. Once this
was done, the entries were searched for anomalous or inconsistent
values.
Due to the nature of the data source, many were found. Inconsistent
naming of opposition teams (e.g. both Blackburn and Blackburn Rovers appearing)
and garbled character combinations in goal scorer names due to incorrect character
encoding were quite common. A combination of modifying the original HTML files
and tweaking the parsing code eventually led to a state where the data were in a
consistent, readable format. Helper functions for data access were then created for
easy retrieval during the analysis and visualisation steps.
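The kind of clean-up described above might be sketched as follows. This is a minimal illustration rather than the project's actual code. The alias table, function names, and sample record are hypothetical. The encoding repair assumes the damage came from UTF-8 text being read as Latin-1, which is a common cause but is not confirmed by the report.

```python
# Hypothetical alias table: maps every spelling seen in the source to one name.
TEAM_ALIASES = {
    "Blackburn": "Blackburn Rovers",
    "Blackburn Rovers": "Blackburn Rovers",
}

def canonical_team(name):
    """Return the agreed spelling for a team, or the stripped input if unknown."""
    cleaned = name.strip()
    return TEAM_ALIASES.get(cleaned, cleaned)

def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1 (assumed cause)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not the assumed kind of damage; leave untouched

# Records indexed by season, then by game number, as described in the text.
games = {}

def add_game(season, game_number, opponent, scorers):
    """Store one cleaned game record under games[season][game_number]."""
    games.setdefault(season, {})[game_number] = {
        "opponent": canonical_team(opponent),
        "scorers": [repair_mojibake(s) for s in scorers],
    }

# Invented record, used only to exercise the helpers above.
add_game("1995/96", 1, " Blackburn ", ["Solskj\u00c3\u00a6r"])
print(games["1995/96"][1])
```

Keeping the aliases in a single table means each newly discovered spelling needs only one added entry, which suits the case-by-case corrections the data required.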
For the remaining data, the built-in csv module within Python was used to pull
all required data into dictionary objects. As above, functions were created
to ensure ease of access during later stages. Again there was much difficulty with
the consistency and quality of the data, a recurring issue
throughout the data collection and processing stages. For example, the same referee
was listed as “C Foy”, “CJ Foy”, “Chris Foy”, and “Foy, C” all within one CSV file.
Often it was simpler to correct these issues manually than to write a small script
to do so. This was judged on a case-by-case basis depending on the scale of the
inconsistencies found.
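Where a script was worthwhile, the referee example above could be handled along the following lines. This is a sketch under stated assumptions: the column names (HomeTeam, AwayTeam, Referee) are assumed for illustration, the team names in the sample are placeholders, and reducing a name to surname plus first initial is one possible heuristic, not necessarily the approach taken in the project.

```python
import csv
import io

def referee_key(raw):
    """Reduce a referee name to (surname, first initial), both lower-case."""
    raw = raw.strip()
    if not raw:
        return ("", "")
    if "," in raw:                      # "Foy, C" style: surname comes first
        surname, given = [part.strip() for part in raw.split(",", 1)]
    else:                               # "C Foy", "CJ Foy", "Chris Foy"
        parts = raw.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())

def load_matches(handle):
    """Read match rows into a list of dicts, adding a normalised referee key."""
    rows = []
    for row in csv.DictReader(handle):
        row["RefereeKey"] = referee_key(row["Referee"])
        rows.append(row)
    return rows

# Placeholder rows; only the referee spellings are taken from the report.
SAMPLE = """HomeTeam,AwayTeam,Referee
Team A,Team B,C Foy
Team C,Team D,"Foy, C"
Team E,Team F,Chris Foy
Team G,Team H,CJ Foy
"""

matches = load_matches(io.StringIO(SAMPLE))
distinct = {m["RefereeKey"] for m in matches}
print(distinct)
```

The heuristic would wrongly merge two referees who share a surname and first initial, so a manual check of the resulting keys would still be advisable.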
3 Results and Discussion
3.1 Success vs Rivals and Strong Opponents
It is widely believed among sports fans that results in “big games” show the mettle
of true champion teams. As such, it was desired to see how Manchester United
performed against their traditional rivals and other strong teams within the league.
The wins, losses, and draws against other traditionally successful teams were tallied
for both home and away games and subsequently plotted as a percentage of the total
number of games. The results can be seen in Figure 3 below:
Figure 3: Wins/Losses/Draws Against Strong Clubs
It can immediately be seen that for many of the bars, the green section stretches over
the 50% mark, whereas no red section does so from the opposing side. Unsurprisingly,
teams also appear to perform better at their
home grounds - red bars are larger for the away results.
It can also be seen that only two teams appear to have a better win percentage
than Manchester United - Chelsea and Arsenal when playing at their respective
home grounds. This is to be expected, as these are the clubs which have won the most
Premier League titles other than Manchester United.
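The tallying behind this comparison can be sketched as below. It is a minimal illustration assuming each game is available as a tuple of opponent, venue, goals for, and goals against. The names and the sample scores are invented and are not results from the dataset.

```python
from collections import Counter, defaultdict

def outcome(goals_for, goals_against):
    """Classify one game from Manchester United's point of view."""
    if goals_for > goals_against:
        return "W"
    if goals_for < goals_against:
        return "L"
    return "D"

def outcome_percentages(games):
    """games: iterable of (opponent, venue, goals_for, goals_against).

    Returns {(opponent, venue): {"W": pct, "D": pct, "L": pct}}.
    """
    tallies = defaultdict(Counter)
    for opponent, venue, gf, ga in games:
        tallies[(opponent, venue)][outcome(gf, ga)] += 1
    percentages = {}
    for key, counts in tallies.items():
        total = sum(counts.values())
        percentages[key] = {r: 100.0 * counts[r] / total for r in ("W", "D", "L")}
    return percentages

# Invented scores for a placeholder opponent, purely to exercise the code.
sample = [
    ("Opponent A", "home", 2, 0),
    ("Opponent A", "home", 1, 1),
    ("Opponent A", "away", 0, 1),
    ("Opponent A", "away", 3, 2),
]
pcts = outcome_percentages(sample)
print(pcts[("Opponent A", "home")])
```

Expressing the tallies as percentages rather than counts keeps opponents comparable even when they have spent different numbers of seasons in the league.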
3.2 Monthly Performance Comparison
It is a commonly held belief of Manchester United fans that they are poor starters.
Inconsistent results at the beginning of the season can make it more difficult to chase
down opponents during the latter part of the season. Thus, the average number of
points won per game (maximum 3, minimum 0) in a given month was calculated
for each season and visualised on a polar plot to determine if this belief held any
credence. The colour of the circle also represents the league position at the end of
the month, to see how the points won affected their overall standing. The results
can be seen in Figure 4 below:
Figure 4: Monthly Average Points Won
Upon brief inspection, it appears there may be some truth to the belief. There
are more “small” circles in the initial few months of the season in comparison
to some of the later months. The appearance of more red circles in the early
months is also a sign of their slow start; however, it is also partly a result of the fact that
any points won by a team may drastically change the composition of the ladder,
since the differences between teams are small when not many games have yet been played.
It can be seen that towards the later months of the season, more consistency is
achieved. Another thing of note is that when heading into the last month of the
season in first place (a green circle in April), they have always ended up winning
the league (a green circle in May). This is an indication that they were either too
far ahead to catch, or they did not “choke” when it came to the important final few
games.
An unfortunate limitation of this graph is that no circle appears for a month in which
no points were won, such as May of the 2000/01 season. Such a month was
unexpected given the success of the club. Nevertheless, they still won the
league in that season 10 points clear of their nearest rival. These results may then
have been an indication of the club playing junior or reserve players to give them
experience, since the outcome of the league had already been decided in their favour.
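The monthly average used in this section can be sketched as follows, assuming each game in a season is reduced to a pair of month and result. The names and the sample season are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

POINTS = {"W": 3, "D": 1, "L": 0}

def monthly_average_points(games):
    """games: iterable of (month, result) pairs for a single season.

    Returns {month: average points per game}; months without games are omitted.
    """
    totals = defaultdict(int)
    played = defaultdict(int)
    for month, result in games:
        totals[month] += POINTS[result]
        played[month] += 1
    return {month: totals[month] / played[month] for month in played}

# Invented results, used only to show the range of possible values.
season = [
    ("Aug", "W"), ("Aug", "D"), ("Aug", "L"),
    ("Sep", "W"), ("Sep", "W"),
    ("May", "L"),
]
averages = monthly_average_points(season)
print(averages)
```

A month in which every game was lost averages exactly 0, so a marker whose size is proportional to this value has no visible area. This is consistent with the missing circle noted above.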
3.3 Geographical Performance Comparison
It was desired to see whether there were any particular geographical locations within
England where Manchester United performed better or worse. As such, the location
of each opponent's ground was plotted, with the size of the circle representing the
number of games played against the given opponent and the colour representing the percentage
of total points available that were won. This can be seen in Figure 5 below:
Figure 5: Geographic Performance
They appear to perform well against clubs in the middle and far North
of the country, perform strongly in the greater London region apart from at two
grounds (the aforementioned Chelsea and Arsenal), and perform only averagely in the region
surrounding their own area. However, this appears to be mainly an
indication of the relative strength of the club that plays at each ground as opposed to
any other factor.
The English league system has clubs from all over the country. Those
that have been in the Premier League at any time during the seasons under examination
in this report have a circle on the above map. Those with larger circles have
been in the league for longer, since circle size is the number of games against Manchester United.
By proxy, this tells us how many games they have played in the Premier League
(since Manchester United have been in the Premier League since its inception).
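The two quantities encoded on the map, the number of games per opponent (circle size) and the percentage of available points won (colour), can be computed along the following lines. This is a sketch only. The data structure, names, and sample results are assumptions made for illustration.

```python
POINTS = {"W": 3, "D": 1, "L": 0}

def opponent_summary(results_by_opponent):
    """results_by_opponent: {opponent: list of 'W'/'D'/'L' results}.

    Returns {opponent: {"games": n, "pct_points": percentage of points won}}.
    """
    summary = {}
    for opponent, results in results_by_opponent.items():
        games = len(results)
        won = sum(POINTS[r] for r in results)
        available = 3 * games  # 3 points are on offer in every game
        pct = 100.0 * won / available if games else 0.0
        summary[opponent] = {"games": games, "pct_points": pct}
    return summary

# Invented results for placeholder grounds, purely to exercise the code.
sample = {
    "Ground A": ["W", "W", "D", "L"],
    "Ground B": ["W"],
}
summary = opponent_summary(sample)
print(summary)
```

Using a percentage of available points rather than raw points keeps the colour scale comparable between opponents faced many times and those faced only a few times.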