FIFA21 Data Analysis Project

AM Viciana
5 min readApr 4, 2021

For my first team project in the Data Analyst Bootcamp at IronHack, I teamed up with my colleagues Enne, Dash and Joan. We worked together on the FIFA21 Project, where we tried to predict a player’s “Overall Rating” by analysing a dataset of player attributes. This dataset is an extract from https://sofifa.com/ and, to understand the acronyms and abbreviations better, we used the following websites:

The project had to be finished and presented to our cohort in less than a day, and with great teamwork we managed to complete it.

EXPLORATORY ANALYSIS

We were given a table of data on professional football players, covering three categories: personal, professional and skills. Using this data, our goal was to build a model that predicts the overall rating score of new players.

We first imported the packages we needed to get, read and work with the data.

After exploring and reading the data, we decided to split our group and divide the 102 columns into 2 groups of 51 for cleaning, so that each half of the team could check the values of a smaller list of columns.

We performed the following actions in order to clean the data:

- Standardised the column names by converting them to lower case and replacing spaces with underscores.

- After looking at the relationships between the different variables, we dropped the ones we thought would be redundant for our model and kept the ones we considered relevant.

Dropped variables:

- Unnamed: most likely a copy of the ID variable. We noticed this using the “Group by” function.
- Club related variables: we got rid of all club related variables such as Team & Contract, Joined, Loan Date End, etc.
- Position: dropped in favour of the BP (best position) value, which added more value to our model.
- All football skills variables: most of them are already included in Base Stats and Total Stats.

Kept variables:

- Height, Weight, Foot: we thought physical attributes could be a factor that impacts the overall rating of a player (ova).
- Base Stats & Total Stats: highly correlated with ova.
- Variables that describe a player’s monetary valuation: how much a player is paid or valued is likely to be correlated with their ova.
- BP & the player’s best position score: the player’s best position needs to be in the model because the ova is most likely derived from the rating score for that position.
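As a rough illustration of the renaming and dropping steps above, here is a minimal pandas sketch. The toy data frame, its column names and the `standardise_columns` helper are assumptions made for the example; they are not the real sofifa extract or our exact notebook code.

```python
import pandas as pd

# Toy frame; the column names are illustrative, not the real dataset.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ID": [101, 102],
    "Loan Date End": [None, "Jun 30, 2021"],
    "Base Stats": [400, 380],
    "OVA": [85, 80],
})


def standardise_columns(df):
    """Return a copy with lower-case column names and underscores for spaces."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )
    return out


clean = standardise_columns(raw)

# Drop the columns judged redundant; errors="ignore" skips absent names.
to_drop = ["unnamed:_0", "loan_date_end"]
clean = clean.drop(columns=to_drop, errors="ignore")
```

Keeping the list of dropped columns in one variable makes it easy to revisit the decision later if a dropped variable turns out to matter.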

We then divided the variables into numerical and categorical, and carried out the following EDA steps:
- Transforming numerical values stored as text into numerical variables, using our own functions and pandas commands. Example: removing euro symbols.
- Visual analysis: pair plots, distribution plots, box plots and a heat map.
- Creating new data frames with the cleaned values.
- Concatenating the cleaned variables into a final data frame.
- Checking correlations between the final variables with a heat map, in order to decide whether to drop any others before creating the prediction model.
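The euro-symbol clean-up mentioned above could look roughly like the following. This is only a sketch: the `money_to_float` helper is hypothetical, and it assumes the money columns hold strings such as "€67.5M" or "€560K", which may not match every value in the real data.

```python
import pandas as pd


def money_to_float(value):
    """Convert strings such as '€67.5M' or '€560K' to a number of euros."""
    text = str(value).replace("€", "").strip()
    multiplier = 1.0
    if text.endswith("M"):
        multiplier, text = 1_000_000.0, text[:-1]
    elif text.endswith("K"):
        multiplier, text = 1_000.0, text[:-1]
    return float(text) * multiplier


# Assumed example values, not taken from the dataset.
wages = pd.Series(["€67.5M", "€560K", "€0"])
wages_numeric = wages.apply(money_to_float)
```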

Figures: box plot, pair plot and heat map.
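The correlation check behind a heat map like this can be sketched in a few lines of pandas. The numbers below are made up for the example, and the column names are assumptions.

```python
import pandas as pd

# Made-up values; only the shape of the workflow matters here.
df = pd.DataFrame({
    "ova": [60, 70, 80, 90],
    "base_stats": [300, 350, 400, 450],
    "height_cm": [180, 175, 185, 178],
})

# Pearson correlation between every pair of numerical columns.
corr = df.corr()

# Rank the features by how strongly they correlate with the target.
ova_corr = corr["ova"].drop("ova").sort_values(ascending=False)
```

Passing `corr` to a plotting call such as seaborn's `heatmap` (assuming seaborn is installed) gives the coloured grid, which makes highly correlated pairs easy to spot.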

Creation of the prediction model:
- X-y split, defining “ova” as the target variable.
- Building the model using linear regression.
- Train-test split to fit the model with our final data.
- Model validation.
- Calculating R2, MSE, RMSE and MAE.

Our model performed better than expected, with a model score of 0.79 and a mean absolute error of 2.4.
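The modelling steps above can be sketched with scikit-learn as follows. The data here is synthetic and the feature names are assumptions, so the resulting metrics are not the ones we report in this post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data frame (assumed column names).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "base_stats": rng.uniform(250, 480, n),
    "total_stats": rng.uniform(900, 2300, n),
})
data["ova"] = (
    0.1 * data["base_stats"]
    + 0.01 * data["total_stats"]
    + rng.normal(0, 2, n)
)

# X-y split: "ova" is the target, everything else is a feature.
X = data.drop(columns="ova")
y = data["ova"]

# Hold out 20% of the rows for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows, predict on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, pred)
```

Computing the metrics on the held-out rows rather than the training rows gives a fairer picture of how the model would behave on new players.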

We then tried to improve the model’s predictions by normalising the numerical variables with MinMaxScaler, introducing relevant categorical variables such as international rating and BP, and adding a key numerical variable obtained with pandas’ melt function: the player’s best position score. With these changes we reached a model score of 0.98 with a mean absolute error of 0.8.
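One possible way to derive the best position score with melt, followed by min-max scaling, is sketched below. The column layout (one rating column per position, plus a `bp` column naming the best position) and all names such as `bp_score` are assumptions for the example, not our exact code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed layout: one rating column per position plus the best position.
players = pd.DataFrame({
    "id": [1, 2, 3],
    "bp": ["ST", "CB", "ST"],
    "st": [88, 50, 75],
    "cb": [40, 82, 45],
    "base_stats": [450, 400, 380],
})

# Wide to long: one row per (player, position) pair.
long = players.melt(
    id_vars=["id", "bp"],
    value_vars=["st", "cb"],
    var_name="position",
    value_name="position_score",
)

# Keep only the row whose position matches the player's best position.
best = long[long["position"] == long["bp"].str.lower()]
best = best[["id", "position_score"]].rename(
    columns={"position_score": "bp_score"}
)
players = players.merge(best, on="id")

# Rescale a numerical feature to the [0, 1] range.
scaler = MinMaxScaler()
players["base_stats_scaled"] = scaler.fit_transform(
    players[["base_stats"]]
).ravel()
```

In a full pipeline the scaler would normally be fitted on the training rows only and then applied to the test rows, so that no information from the test set leaks into the model.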

To conclude, it is clear that the overall rating of a player depends mostly on the player’s best position score. Although the total and base stats alone can predict a player’s overall rating, they do not do so with the accuracy we expected. The physical and personal information variables turned out not to be useful for predicting a player’s ova.

In the future, if we build a new model, focusing on the player’s best position score would let us use fewer redundant variables and speed up the prediction model.
