Data Mining Project on Performance Analysis (Soccer)

27-01-2013 11:05



The English Premier League (EPL) in the U.K. started in its present format in 1992, and over the last 20 years it has grown into a billion-dollar business. Over the last three years it has become increasingly apparent that the field of performance analytics in English football has been held back for one simple reason: the lack of publicly accessible data.

Analytics across U.S. sports has flourished over the past two decades, with myriad high-profile success stories, the most notable of which saw the story of the Oakland A’s baseball team dramatized in the film Moneyball. Yet their story is the tip of the iceberg, and many more similar successes from within the NBA, NFL, MLB and NHL have yet to be told. Analytics and objective analysis are in the very fabric of those sports. There is also a culture of analytics among the franchises themselves as well as the fans, the media, bloggers and students. [1]

On August 16th 2012, for the first time, one of the leading EPL clubs, Manchester City Football Club (MCFC), made the historic decision to release the entire data for the 2011-2012 season to the public. The data covers every match played and every detail of each match over the entire season. The aim of this project is to analyze the MCFC data set and come up with a robust and insightful player evaluation system.

Rating the players using the metrics developed

The features considered were normalized so that each one contributes on a comparable scale to the rating. Normalization followed this rule:

Final Value = Player’s Original Value for the metric / Max Value for this metric among all players

This ensures that the best player(s) in each metric list have a score of 1 and all other players have a value in the range 0 ≤ Value ≤ 1. Moreover, it ensures that metrics with naturally larger values, such as goals scored, do not have a bigger effect on the rankings.

To rate the players on the features defined, the first step is to identify how effective and relevant each factor is to the success of the club. This can be measured in various ways, depending on how success is defined.
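The normalization rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `normalize_metric`, the player names and the goal tallies are all hypothetical, and the handling of an all-zero metric is an assumption the text does not cover.

```python
def normalize_metric(values):
    """Divide each player's value by the maximum value for the metric.

    `values` maps a player name to that player's raw value for one metric.
    """
    max_value = max(values.values())
    if max_value == 0:
        # Assumption: if no player registered a value, every score is 0.
        return {player: 0.0 for player in values}
    return {player: value / max_value for player, value in values.items()}


# Hypothetical goal tallies, for illustration only.
goals = {"Player A": 30, "Player B": 15, "Player C": 0}
normalized_goals = normalize_metric(goals)
print(normalized_goals)
# {'Player A': 1.0, 'Player B': 0.5, 'Player C': 0.0}
```

The best player in the metric gets exactly 1, and every other player lands between 0 and 1 as long as the raw values are non-negative.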

We consider the following factors as success criteria:

  • The position at which a team finished in the league (1 meaning champions, 20 meaning last).

  • Goals conceded by each team (for Defenders)

  • Goals scored by each team (for Strikers)

Each feature was calculated for all 20 teams by summing over the players in each team, and the correlation between these team totals and the success criteria was determined. The correlation value thus obtained is taken as the weight of each feature in determining the final rating of the players.
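The weighting step described above might look like the following sketch. Several things here are assumptions rather than details from the study: the use of the Pearson coefficient, the sign flip applied when the criterion is league position (where a lower number is better), and the combination of weights and normalized values into a rating by a weighted sum. All names and figures are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def feature_weights(team_totals, criterion, lower_is_better=False):
    """Correlate each feature's team totals with a success criterion.

    Assumption: when a lower criterion value is better (league position),
    the sign is flipped so that helpful features get positive weights.
    """
    weights = {}
    for feature, totals in team_totals.items():
        r = pearson(totals, criterion)
        weights[feature] = -r if lower_is_better else r
    return weights


def rate_player(normalized_values, weights):
    """Assumption: the rating is a weighted sum of normalized feature values."""
    return sum(weights[f] * v for f, v in normalized_values.items())


# Hypothetical team totals for four teams, for illustration only.
team_totals = {"goals": [90, 70, 50, 30], "fouls": [300, 320, 310, 330]}
league_position = [1, 2, 3, 4]

weights = feature_weights(team_totals, league_position, lower_is_better=True)
rating = rate_player({"goals": 0.5, "fouls": 0.25}, weights)
```

With these made-up numbers, goals correlate perfectly with finishing higher and receive a weight of 1, while fouls correlate with finishing lower and receive a negative weight.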


Correlation Example:

In a football league, it is not always true that the team that scores the most goals wins the league. We computed the correlation between the goals scored by each team and its final position in the table and found a value of approximately -0.81. This shows that goals scored are a strong indicator of a team's final position (the sign is negative because a better finish corresponds to a lower position number).


Methodology of Evaluation

The current data available on the Fantasy Premier League website (referred to below as BFPL) is highly influenced by the performance of the players in the current season (2012-13). We therefore calculated the base prices (the prices at the beginning of this season) of these players by subtracting the price rise during this season from the current price of each player.

1. We correlated our rankings (top 50) against the base prices derived from BFPL and obtained the following correlations for the three sets of players:




The low correlation for defenders is attributed to the very myopic evaluation strategy used by BFPL in rating defenders. While the metrics developed above consider about 10 different factors, BFPL considers only 3 of them to calculate its ratings.

2. The top 50 players from each set were looked up in the BFPL ratings. The major hurdle in this comparison is the relatively high number of new foreign players who are playing well this season, together with a few good players, such as Yakubu and Drogba, who left the EPL at the end of last season. Combined with the 14/20 team factor, this means the expected number of matches is less than 70% of 50, that is, fewer than 35. The top 50 players from each set were then compared with the top 50 from the BFPL data, and the following results were obtained:
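The two comparison steps described in this section, recovering base prices and counting how many of our top 50 players also appear in the BFPL top 50, can be sketched as below. The function names, player labels and prices are hypothetical, and the 14/20 factor is taken from the text as given without further interpretation.

```python
def base_price(current_price, price_rise):
    """Price at the start of the season: current price minus the season's rise."""
    return current_price - price_rise


def top_n_overlap(our_ranking, reference_ranking, n=50):
    """Count players appearing in the top n of both rankings.

    Each ranking is a list of player names ordered from best to worst.
    """
    return len(set(our_ranking[:n]) & set(reference_ranking[:n]))


# Upper bound on expected matches implied by the 14/20 factor in the text.
expected_ceiling = 50 * 14 / 20  # 35.0

# Hypothetical rankings, shortened to a top 3 for illustration only.
ours = ["Player A", "Player B", "Player C", "Player D"]
reference = ["Player B", "Player E", "Player A", "Player C"]
matches = top_n_overlap(ours, reference, n=3)  # Player A and Player B match
```

Players who left the league or joined it this season appear in only one of the two lists, which is why the observed overlap is expected to stay below the ceiling.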



The player evaluation metrics developed here are designed to be used by professionals involved in performance analysis at football clubs. We have evaluated all the players in all three capacities, which we believe is a rarity in the field of performance analysis. The main shortcoming of segregating players into different groups and then analyzing their skills only within that category is that the ability of multi-talented players may be overlooked.


Our metrics and ratings can help any coach or manager make decisions in tough situations where there is no ‘labeled’ midfielder or striker on the bench and a different type of player (say, a defender) has to be brought on instead. The evaluation strategy offers a statistical answer to such questions, since every player has been gauged on all three of the metrics developed.


Player recruitment is a very straightforward application of the results given by our metrics. Any team or coach looking to buy new players from within the league can look up the ratings as an indicator of both a player's current form and weaknesses. Robin van Persie topped the ratings for strikers, so it was no wonder that Manchester United signed him in the face of stiff competition.


Full Study in:

For more information contact Nipun Valoor

Figures: Data Mining Project on Performance Analysis (Soccer)