Author: Phosphen
Compiled by: Gans, Bagel Predictive Market Watch

This man collected data from every professional tennis match of the past 40 years, fed it all into a machine learning model, and asked only one question: Can you predict who will win?
The model answered with just one word: Yes.
Subsequently, at this year's Australian Open, it correctly predicted 99 out of 116 matches, achieving an accuracy rate of 85%!
This was a competition that the model had never seen before during training, and it even correctly predicted every single match that the final champion would win.
All of this was accomplished using only a laptop, free data, and open-source code, by @theGreenCoding .
Next, I will break down this transformative project from raw data to the final successful prediction. This will be the most impressive AI + prediction success story you've ever seen.
The story begins with a data set that can be called the "holy grail of sports data".
This collection covers records of every professional tournament held by the ATP (Association of Tennis Professionals) from 1985 to 2024.
Break points, double faults, forehand, backhand, player height, age, ranking, head-to-head records, match venue... every point-by-point statistic that the ATP has ever tracked is available.
Forty years' worth of CSV files, all stored in one folder.
When he opened the complete dataset, the computer crashed.
But he didn't give up. For the 95,491 matches in the dataset, he calculated a large number of additional derived features:
Head-to-head record between the two players
Age difference, height difference
Winning percentage in the last 10, 25, 50, and 100 matches
First-serve points-won percentage difference
Break-point save rate difference
A custom ELO rating system borrowed from chess (key point)
Final dataset: 95,491 rows × 81 columns.
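The derived features above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual pipeline; the column names (`winner_id`, `AGE_DIFF`, `H2H_WINS`, etc.) are assumptions. The one subtlety worth showing is that the head-to-head count only looks at *earlier* rows, so no result leaks information from the future.

```python
import pandas as pd

# Toy match log in the style of the ATP CSVs (column names are assumed,
# not the project's actual schema).
matches = pd.DataFrame({
    "winner_id": [1, 2, 1, 1],
    "loser_id":  [2, 1, 3, 2],
    "winner_age": [22.1, 25.3, 22.4, 22.6],
    "loser_age":  [25.2, 22.2, 19.8, 25.5],
})

# Age difference between the two players (winner minus loser).
matches["AGE_DIFF"] = matches["winner_age"] - matches["loser_age"]

# Head-to-head record: wins by this exact (winner, loser) pairing in all
# *earlier* rows only, so no information leaks from future matches.
pairs = list(zip(matches["winner_id"], matches["loser_id"]))
matches["H2H_WINS"] = [pairs[:i].count(p) for i, p in enumerate(pairs)]

print(matches[["AGE_DIFF", "H2H_WINS"]])
```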
Every professional match from the past forty years, each annotated with dozens of hand-engineered features.
Before feeding the data into the classifier, he decided to thoroughly understand how the algorithm worked. To do this, he wrote a decision tree from scratch using NumPy.
Decision trees work similarly to a reasoning game—approaching the answer step by step through a series of questions.
To illustrate this concept, he chose a completely different dataset: the Titanic.
For example: Did passenger number 11 survive?
Question 1: Is the passenger in first class? → Yes.
Question 2: Is she female? → Yes.
Predicted outcome: Survival.
How does the algorithm decide which questions to ask?
It starts with all the data and finds the single variable that best distinguishes between "survivor" and "not survivor." In the Titanic data, the answer is cabin class. First-class passengers went to one side, and everyone else went to the other.
However, some people in first class also perished, so there is still "impurity." The algorithm continues to search for the next optimal split point: gender. All the women in first class survived, forming a "pure node," and the branch ends here.
Repeat this process until a complete decision tree covering all situations is built.
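The "find the single variable that best distinguishes the classes" step is usually scored with Gini impurity. Here is a minimal NumPy sketch of that split search, in the spirit of his from-scratch implementation (his actual code isn't shown, so this is an illustration). On a tiny Titanic-flavoured toy dataset, the gender column produces two perfectly pure nodes, so it wins the split.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 0.0 means a perfectly pure node."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Scan every feature and threshold; return the split with the
    lowest weighted Gini impurity, as in a from-scratch decision tree."""
    best = (None, None, gini(y))  # (feature, threshold, impurity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

# Tiny Titanic-flavoured toy: columns = (pclass, is_female), label = survived.
X = np.array([[1, 1], [1, 1], [3, 0], [3, 0], [1, 0], [3, 1]])
y = np.array([1, 1, 0, 0, 0, 1])
feat, thresh, impurity = best_split(X, y)
print(feat, thresh, impurity)  # feature 1 (is_female) splits perfectly
```

A full tree just applies `best_split` recursively to each side until every node is pure (or a depth limit is hit).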
His hand-rolled NumPy code performed well on small datasets but became frustratingly slow on 95,000 tennis matches. So for the actual training phase he switched to scikit-learn's optimized implementation: same logic, much faster.
Before training the model, he first plotted all the variables in pairs into a huge scatter-plot matrix (a seaborn pairplot) to look for patterns that could distinguish winners from losers.
Most features are noise. Player IDs are clearly useless. While the win rate differences show some patterns, they are not obvious enough to support a reliable classifier.
Only one variable far outweighs the others: the ELO difference (ELO_DIFF).
The scatter plots of ELO_DIFF and ELO_SURFACE_DIFF clearly demonstrate the degree of separation between the two categories, which no other feature can match.
This discovery prompted him to build the core of the entire project.
ELO is a method for evaluating the skill level of players, first applied to chess. Currently, the world's number one chess player, Magnus Carlsen, has a rating of 2833.
He decided to apply this system to tennis:
Starting rating for each player: 1500 points
Win: Rating increases; Lose: Rating decreases
Core mechanism: the number of points gained or lost depends on the rating gap between you and your opponent. Beating a higher-rated opponent earns more points, while losing to a lower-rated opponent costs more.
He demonstrated this formula in the 2023 Wimbledon final: Carlos Alcaraz (rated 2063) faced Novak Djokovic (rated 2120), and Alcaraz came from behind to win the title.
Substituting into the formula: Alcaraz +14 points, Djokovic -14 points.
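The standard Elo update is: expected score E = 1 / (1 + 10^((R_opp − R) / 400)), then R_new = R + K·(S − E), where S is 1 for a win and 0 for a loss. The sketch below uses K = 24, which is an assumption (the project's K-factor isn't stated) chosen so the Wimbledon example reproduces the ±14 from the article.

```python
def elo_update(rating, opp_rating, won, k=24):
    """Standard Elo update: expected score from the rating gap, then move
    the rating toward the actual result. k=24 is an assumed K-factor."""
    expected = 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400))
    return rating + k * ((1.0 if won else 0.0) - expected)

# 2023 Wimbledon final: Alcaraz (2063) beats Djokovic (2120).
alcaraz = elo_update(2063, 2120, won=True)
djokovic = elo_update(2120, 2063, won=False)
print(round(alcaraz - 2063))   # +14
print(round(djokovic - 2120))  # -14
```

Because Alcaraz was the lower-rated player, his expected score is below 0.5, so the upset win moves both ratings by the full 14 points.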
Although the calculation is simple, its power is astonishing when applied to 40 years of historical data.
He plotted Federer's ELO ratings throughout his career as a curve, clearly recording every match from his debut on the professional circuit to his retirement.
This curve fully illustrates a legendary period: a rapid rise in the early years, absolute dominance during the peak period (around the 400th game), and fluctuations in the later stages of his career.
But what's truly striking is when Federer is placed on the same chart as all ATP players since 1985:
The three curves stand tall, far surpassing everyone else—Federer (green), Nadal (blue), and Djokovic (red).
The "Big Three" of Grand Slam tournaments is more than just a title. When you visualize 40 years of tournament data, you'll find that this dominance is clearly visible mathematically.
According to his custom ELO system, the current world number one is Jannik Sinner (2176 points), followed by Novak Djokovic (2096 points) and Alcaraz (2003 points).
Remember that Sinner is ranked number one; this is crucial later on.
The court surface completely changes the game:
Clay: Slow speed, high bounce
Grass: Fast, low bounce
Hard court: somewhere in between
A player who dominates on one surface may completely collapse on another.
Therefore, he established ELO ratings for three different surfaces: clay, grass, and hard court.
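Mechanically, surface-specific Elo just means keeping several rating tables and routing each result to the overall table plus the table for that match's surface. A minimal sketch (again assuming K = 24 and the 1500 starting rating from above; the project's exact bookkeeping isn't shown):

```python
from collections import defaultdict

def make_elo_table(start=1500):
    """player_id -> rating, everyone starting at 1500."""
    return defaultdict(lambda: float(start))

def update(table, winner, loser, k=24):
    """Apply one match result to a rating table (standard Elo update;
    k=24 is an assumed K-factor)."""
    exp_w = 1.0 / (1.0 + 10 ** ((table[loser] - table[winner]) / 400))
    table[winner] += k * (1.0 - exp_w)
    table[loser] -= k * (1.0 - exp_w)

# One overall table plus one per surface.
overall = make_elo_table()
by_surface = {s: make_elo_table() for s in ("clay", "grass", "hard")}

# A clay-court win updates the overall table AND the clay table only,
# leaving the grass and hard-court ratings untouched.
update(overall, "nadal", "federer")
update(by_surface["clay"], "nadal", "federer")
print(by_surface["clay"]["nadal"], by_surface["grass"]["nadal"])
```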
The results confirm what every tennis fan already knows, now backed by 40 years of data:
Nadal's peak rating on clay surpasses Federer's highest on grass, Djokovic's highest on hard courts, and anyone's all-time high on any surface.
He has 14 French Open titles and a 112-4 record at Roland Garros.
The ELO formula doesn't care about narratives or fame; it only deals with win-loss records. And its conclusions are completely consistent with forty years of sports news reporting.
With the data prepared and the ELO system set up, he began training the classifier. This process perfectly demonstrates the importance of algorithm selection.
Decision tree: 74% accuracy
A single decision tree achieved 74% accuracy on the complete dataset. Sounds good—until you discover that simply using ELO difference to predict the winner yields 72%.
The decision tree provided almost no improvement over the scoring system he had manually built.
Random Forest: 76% accuracy
The problem with a single decision tree is its "high variance"—it is overly sensitive to the subset of data that happens to be selected during training. The standard solution is a random forest: dozens or even hundreds of decision trees are built, each trained with a different subset of random data and features, and the prediction result is finally determined by majority vote.
Ninety-four distinct decision trees collectively vote on each match.
The result was 76%. An improvement, but he hit a ceiling. No matter how he adjusted the hyperparameters, redesigned features, or manipulated the data, the accuracy just wouldn't exceed 77%.
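In scikit-learn, the bagging-plus-voting scheme described above is a few lines. The sketch below trains on synthetic stand-in data where an `ELO_DIFF`-like column carries the signal and the rest is noise; the feature layout and hyperparameters are illustrative, not the project's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match table: column 0 plays the role of
# ELO_DIFF and drives the label; the other 4 columns are pure noise.
rng = np.random.default_rng(42)
n = 2000
elo_diff = rng.normal(0, 150, n)
X = np.column_stack([elo_diff, rng.normal(0, 1, (n, 4))])
y = (elo_diff + rng.normal(0, 120, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: each tree is trained on a bootstrap sample and considers a
# random feature subset at each split; the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
acc = forest.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```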
He then tried XGBoost, which he called "random forest on steroids."
The key difference lies in the fact that Random Forest builds trees in parallel and then averages the results, while XGBoost builds trees sequentially—each new tree is specifically designed to correct the errors of all the previous trees. It introduces regularization to prevent overfitting and deliberately keeps each tree small to avoid rote memorization of training data.
Result: Accuracy rate 85%.
This is a huge breakthrough compared to the 76% ceiling of random forests. The same data, the same features, the only change is the algorithm.
XGBoost also considers the three most important features to be: ELO difference, field-specific ELO difference, and overall ELO. This scoring system, borrowed from chess, has been validated as the strongest predictor among 81 features.
In comparison, he trained a neural network using the same data and achieved an accuracy of 83%. While this was good, it still lost to XGBoost. On this dataset, tree-based methods won.
Everything above was trained using only data from before December 2024.
The Australian Open in January 2025 lies completely outside the training data, making it the perfect testing ground: had the model truly grasped the essence of tennis, or was it merely memorizing historical patterns?
He input the complete tournament draw into the model, allowing it to predict every match.
Results: Correctly predicted 99 out of 116 matches, with only 17 incorrect predictions. Accuracy rate: 85.3%.
The most crucial prediction: the model accurately predicted Sinner's (the player ranked number one in the world by the ELO system) every victory throughout the entire tournament.
AI predicted the Grand Slam champion before the first ball was even served.
One person, one laptop, no proprietary data, no expensive infrastructure, no research team—and a professional tennis prediction model was built with an accuracy rate of up to 85%, predicting Grand Slam champions before the tournament even started.
The tennis data is available on GitHub and is fully reproducible.
Creating miracles has never been more within reach than it is today.
The real difference lies not in resources, but in whether you are willing to do it.


