Betting on 21.5

AM Viciana
Apr 25, 2021

For this mid-term project in the Data Analyst Bootcamp at IronHack, I was considering doing something related to one of my passions: tennis. I had some ideas on this topic, which you will see very soon in my next project, and after some conversations I teamed up with my colleague Enne. He had a similar challenge in mind: to find a simple strategy to beat the bookies on football matches. The decision was very simple: we could combine my broad knowledge of tennis with his expertise in the betting system. Sounds like a good plan, right?

Getting ready:

Our idea was to run a simple analysis of the betting market and build a strategy, using Machine Learning, to beat the bookies. The open data available for this was crucial and very rich, so we decided to use only the data of the last eleven years. We focused on best-of-three-set tournaments because otherwise the odds would be too difficult for our analysis, as I will explain further on.

But, how does tennis work?

Tennis has a complex scoring system, which in short works like this:

In tennis you have points, games, sets and the match. You play points against your opponent, and to win a game you need to score at least four points with a margin of two. To win a set you need six games with a margin of two; at 5-5 the set goes to seven games, either 7-5 or 7-6 after a tiebreak. You need two sets to win the match. That was easy, wasn't it? So, you are now ready to play!

Finding the data

The database we chose is rich and consists of concrete results of tennis matches, see below:

This file also included betting odds from different betting companies, which was great for our further analysis.

GOAL:

Predicting whether a match is going to finish under or over a given number of games. We chose 21.5 as our target for this project. Will the number of games in the next match be over or under 21.5? Let's check this out!
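To make the target concrete, here is a minimal sketch of how the total number of games in a match relates to the 21.5 line. The function names and the example scores are only illustrative, not taken from the project code.

```python
def total_games(set_scores):
    """Sum the games won by both players over all sets played."""
    return sum(winner_games + loser_games
               for winner_games, loser_games in set_scores)


def over_under(set_scores, line=21.5):
    """Return 'over' if the match had more games than the line, else 'under'."""
    return "over" if total_games(set_scores) > line else "under"


# Two made-up best-of-3 results, written as (winner games, loser games) per set.
straight_sets = [(6, 4), (6, 3)]   # 19 games in total
tight_match = [(6, 4), (7, 6)]     # 23 games in total

print(over_under(straight_sets))
print(over_under(tight_match))
```

Because the line is a half number, a match can never land exactly on it, so every match is either over or under.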

EXPLORATORY ANALYSIS

We collected all the available data from tournaments played on the ATP tour, from 2010 to 2021. We focused on best-of-3 singles matches. Below you can see a bit of the code we used to import all the libraries we needed during the project, and most of the data cleaning. We standardised the column names in PEP 8 style and dropped some irrelevant columns, such as the ones for the fourth and fifth sets of best-of-5 matches.
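A minimal sketch of this kind of cleaning with pandas is shown here. The raw column names ("Best of", "WRank" and so on) and the two rows are assumptions made up for illustration; the real file may be laid out differently.

```python
import pandas as pd

# Hypothetical raw rows; the real dataset has many more columns and matches.
raw = pd.DataFrame({
    "Tournament": ["Doha", "Australian Open"],
    "Best of": [3, 5],
    "WRank": [5, 1],
    "LRank": [20, 7],
})


def to_snake_case(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# Standardise the column names, then keep only best-of-3 matches.
clean = raw.rename(columns=to_snake_case)
best_of_three = clean[clean["best_of"] == 3].reset_index(drop=True)

print(best_of_three)
```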

Importing Python Libraries

Columns and further explanation:

PEP 8 format

Dropped variables:

  • w4: result of the 4th set, for the winner
  • l4: result of the 4th set, for the loser
  • w5: result of the 5th set, for the winner
  • l5: result of the 5th set, for the loser
  • exw: bookies odds (Express) winner
  • exl: bookies odds (Express) loser

The same procedure was applied to the rest of the betting companies, except for Bet365.

dropping columns
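Dropping the columns listed above could look roughly like this. The one-row DataFrame is made up for illustration, and only the Express odds columns are shown for the bookmakers.

```python
import pandas as pd

# Hypothetical single match with a few of the columns described above.
matches = pd.DataFrame({
    "winner": ["Player A"], "loser": ["Player B"],
    "w1": [6], "l1": [4],
    "w4": [None], "l4": [None], "w5": [None], "l5": [None],
    "exw": [1.50], "exl": [2.50],
    "b365w": [1.53], "b365l": [2.40],
})

# 4th/5th set results never exist in best-of-3 matches, and we keep only
# the Bet365 odds, so these columns can go.
columns_to_drop = ["w4", "l4", "w5", "l5", "exw", "exl"]
matches = matches.drop(columns=columns_to_drop, errors="ignore")

print(list(matches.columns))
```

Passing `errors="ignore"` means the call does not fail if one of the columns is already missing.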

Kept values:

  • atp: number of the tournament. This number is repeated every single row until the tournament changes
  • location: place of the tournament
  • tournament: name of the tournament
  • date: date of the beginning of the tournament
  • series: type of tournament
  • court: type of court played
  • surface: type of surface played
  • round: type of round played
  • best_of: in this case only best of 3
  • winner: name of the winner
  • loser: name of the loser
  • wrank: ATP ranking of the winner
  • lrank: ATP ranking of the loser
  • wpts: ATP points of the winner
  • lpts: ATP points of the loser
  • w1, l1, w2, l2, w3, l3: games won or lost by the player
  • wsets: sets won by the player
  • lsets: sets lost by the player
  • comment: if the match is finished or not
  • b365w & b365l: bookies odds (Bet365)
  • avgw & avgl: AVG odds offered by the betting companies

Further steps:
1. Visualising the missing values of the new DataFrame and filling the NaN values of the 3rd set with zeros, since those matches finished after 2 sets.
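As a small sketch of this step, assuming the set columns are named w1 to l3 as in the list above and using made-up scores:

```python
import numpy as np
import pandas as pd

# Two hypothetical matches: the first ended in 2 sets, the second in 3.
matches = pd.DataFrame({
    "w1": [6, 7], "l1": [4, 6],
    "w2": [6, 3], "l2": [3, 6],
    "w3": [np.nan, 6], "l3": [np.nan, 2],
})

# Count the missing values per column before filling them.
missing_per_column = matches.isna().sum()
print(missing_per_column)

# A missing 3rd set means it was never played, so 0 games is a fair value.
matches[["w3", "l3"]] = matches[["w3", "l3"]].fillna(0)
```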

2. Creating our target variable, the total number of games per match, and calculating its AVG.
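One way to build that variable is to add up the six set columns for each match. The column name total_games and the scores here are assumptions for illustration.

```python
import pandas as pd

# Three hypothetical matches, with unplayed 3rd sets already filled with 0.
matches = pd.DataFrame({
    "w1": [6, 7, 6], "l1": [4, 6, 0],
    "w2": [6, 3, 6], "l2": [3, 6, 1],
    "w3": [0, 6, 0], "l3": [0, 2, 0],
})

game_columns = ["w1", "l1", "w2", "l2", "w3", "l3"]

# Total games per match is the row-wise sum of all set columns.
matches["total_games"] = matches[game_columns].sum(axis=1)
average_games = matches["total_games"].mean()

print(matches["total_games"].tolist(), round(average_games, 2))
```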

AVG games played per match using Tableau public.

3. At this point we calculated the AVG of games played per player, when they won or lost a match. This was a crucial step and we used methods such as concat, reset index, rename columns, merge DataFrames, etc. You can see the procedure below:
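A sketch of one possible version of this procedure follows. It stacks every match once from the winner's side and once from the loser's side, averages per player, and merges the result back. The players, totals and new column names are made up, and the original notebook may have combined these methods differently.

```python
import pandas as pd

# Hypothetical matches with the total games already calculated.
matches = pd.DataFrame({
    "winner": ["A", "B", "A"],
    "loser": ["B", "C", "C"],
    "total_games": [19, 30, 13],
})

# One row per player appearance, whether they won or lost.
as_winner = matches[["winner", "total_games"]].rename(columns={"winner": "player"})
as_loser = matches[["loser", "total_games"]].rename(columns={"loser": "player"})
appearances = pd.concat([as_winner, as_loser], ignore_index=True)

# Average games per player across all their matches.
player_avg = (
    appearances.groupby("player")["total_games"].mean()
    .reset_index()
    .rename(columns={"total_games": "avg_games"})
)

# Attach the averages to each match for both the winner and the loser.
matches = matches.merge(
    player_avg.rename(columns={"player": "winner", "avg_games": "winner_avg_games"}),
    on="winner", how="left",
).merge(
    player_avg.rename(columns={"player": "loser", "avg_games": "loser_avg_games"}),
    on="loser", how="left",
)

print(matches)
```

Note that this simple version averages over all matches, including the one being predicted. For a real betting model it would be safer to use only matches played before each date.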

4. Now it was time to drop all the unnecessary columns for our final DataFrame in order to test and run our Machine Learning model.

5. At this point we decided that our target would be under or over 21.5 games per match, and that we would use logistic regression for our model.

Creating the more/less variable for our target, and then a function to assign it.
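Such a function can be very short. In this sketch the names label_over_under, total_games and target are assumptions, as are the example totals.

```python
import pandas as pd


def label_over_under(total_games, line=21.5):
    """Label a match 'more' if it had more games than the line, else 'less'."""
    return "more" if total_games > line else "less"


# Hypothetical totals for four matches.
matches = pd.DataFrame({"total_games": [19, 30, 13, 22]})
matches["target"] = matches["total_games"].apply(label_over_under)

print(matches)
```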

6. Splitting the dataset.
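A minimal sketch of the split with scikit-learn, on a tiny made-up dataset. The 80/20 ratio, the random seed and the feature names are assumptions, not the values used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical modelling table: two features and the more/less target.
data = pd.DataFrame({
    "winner_avg_games": [20.1, 24.3, 19.5, 25.0, 21.0,
                         23.8, 18.9, 26.1, 20.7, 24.9],
    "surface": ["Hard", "Clay", "Hard", "Grass", "Clay",
                "Hard", "Clay", "Grass", "Hard", "Clay"],
    "target": ["less", "more", "less", "more", "less",
               "more", "less", "more", "less", "more"],
})

X = data.drop(columns="target")
y = data["target"]

# Hold out 20% of the matches for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```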

7. Selecting numerical and categorical variables and running the model.
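One common way to handle both kinds of variables is a scikit-learn pipeline that scales the numerical columns and one-hot encodes the categorical ones before the logistic regression. This is only a sketch under assumed feature names and made-up data; the project may have done the preprocessing by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data.
data = pd.DataFrame({
    "winner_avg_games": [20.1, 24.3, 19.5, 25.0, 21.0, 23.8, 18.9, 26.1],
    "loser_avg_games": [21.0, 23.5, 20.2, 24.1, 20.5, 24.4, 19.8, 25.3],
    "surface": ["Hard", "Clay", "Hard", "Grass", "Clay", "Hard", "Clay", "Grass"],
    "target": ["less", "more", "less", "more", "less", "more", "less", "more"],
})

numerical = ["winner_avg_games", "loser_avg_games"]
categorical = ["surface"]

# Scale the numbers and one-hot encode the categories.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

features = data[numerical + categorical]
model.fit(features, data["target"])
predictions = model.predict(features)

print(predictions)
```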

8. Using Gradient Boosting.
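As an illustration of this step, here is a sketch with scikit-learn's GradientBoostingClassifier. The data is synthetic, generated at random just so the example runs, and the hyperparameters are the library defaults written out, not tuned values from the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the two players' average games per match.
rng = np.random.default_rng(0)
X = rng.normal(loc=22, scale=3, size=(200, 2))

# Synthetic target: 1 if a noisy combination of the features is over 21.5.
y = (X.mean(axis=1) + rng.normal(scale=2, size=200) > 21.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on the synthetic test set: {accuracy:.2f}")
```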

Conclusion

After running different models and reaching 60% accuracy, we came to the conclusion that we need richer variables in our dataset in order to get a better result. However, with the data that we worked on, we are confident of making a 10% net profit, regardless of the outcome of single bets.
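As a rough sanity check on how accuracy and profit are connected, the expected return of a flat-stake bet is the hit rate times the decimal odds, minus one. The sketch below assumes that the model's accuracy equals the share of winning bets, and the odds of 1.85 are a hypothetical value, not a figure from our data.

```python
def expected_return_per_unit(hit_rate, decimal_odds):
    """Expected net profit per unit staked, for flat stakes at fixed odds."""
    return hit_rate * decimal_odds - 1


def break_even_odds(hit_rate):
    """Lowest decimal odds at which a given hit rate stops losing money."""
    return 1 / hit_rate


hit_rate = 0.60           # accuracy reported above
hypothetical_odds = 1.85  # assumed average odds, for illustration only

print(round(expected_return_per_unit(hit_rate, hypothetical_odds), 3))
print(round(break_even_odds(hit_rate), 3))
```

In other words, a 60% hit rate only pays off if the average odds taken stay above roughly 1.67.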

Betting companies are always changing the betting system. For example, the over/under target of 21.5 games per match changes continuously to other values, like 22.5 or 24.5, which makes it very unpredictable and difficult to beat.

If we want to improve our model, we need more details about every single player and some deeper information, for example about their training and mental condition. Taking all this into account, we could improve our model and raise our chances by a few percentage points.

Thank you for reading and see you on the tennis court :)
