If you would like to skip all of the technical info and methodology, my list is here.
Covid-19 shut down the college baseball season on March 15th. Most teams played between 1/4 and 1/3 of their games. This was a small sample size, but the top teams started to emerge. Florida was ranked #1. UCLA was #1 in RPI. Alabama was having a good year as well. They finished 3rd in RPI. Kansas State had a vastly improved pitching staff when compared to last year. They were top 10 in team ERA, team WHIP, and team runs allowed. Oh, what could’ve been.
I have been reflecting on the past semester, and I'm curious what would have happened if the college baseball season hadn't been canceled. I decided to try to project total wins for all 299 D1 baseball teams in 2020. That meant scraping all of the 2019 team data to train the model, as well as all of the 2020 team data to predict wins.
College baseball data is not nearly as easy to find as MLB data. I had to go to TheBaseballCube.com and scrape every single team’s data. I had to get the link for each team's page, then the link for each year I was scraping, and finally all of the data on those pages.
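The link-gathering step looks roughly like the sketch below. The HTML snippet, the `/teams/...` paths, and the helper name `extract_links` are all made up for illustration. The real TheBaseballCube.com markup is different, and a live scraper would download each page first.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded index page; not the site's real markup.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/teams/alpha">Alpha University</a></td></tr>
  <tr><td><a href="/teams/beta">Beta College</a></td></tr>
</table>
"""

def extract_links(html):
    """Return {link text: href} for every anchor tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return {a.get_text(strip=True): a["href"] for a in soup.find_all("a", href=True)}

team_links = extract_links(SAMPLE_HTML)
```

The same helper can be reused at each level: once on the index page to get team links, then on each team page to get the per-year links.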
However, this data was by player. There was a team totals row for a few of the teams, but I summed all of the players' columns, recalculated the more advanced metrics, and added a few that weren’t included. All the scraping and data cleaning/manipulation was done with BeautifulSoup4 and Pandas in Python. Thank God for Python.
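Here is a minimal sketch of that player-to-team rollup in Pandas. The rows are made up, and the column names (`team`, `AB`, `H`, `BB`) are assumptions rather than the exact headers from the scraped pages.

```python
import pandas as pd

# Made-up player rows; column names are assumptions for illustration.
players = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "AB":   [100, 50, 80, 120],
    "H":    [30, 10, 20, 40],
    "BB":   [10, 5, 8, 12],
})

# Sum the counting stats per team, then recompute rate stats from the totals.
# Averaging the player-level rates instead would weight every player equally,
# no matter how many at-bats he had.
teams = players.groupby("team")[["AB", "H", "BB"]].sum()
teams["AVG"] = teams["H"] / teams["AB"]
```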
Once I had all of my 2019 hitting metrics, I ran a multiple linear regression with statsmodels OLS in Python to project win totals. My R-squared was around .97. That’s good. Really good. I tested the model on the 2019 data (the same data it was fit on, so this is an in-sample check) and compared its projections to the actual 2019 win totals. I was off by, at most, 15 wins. However, my average difference was closer to 3.
Below is the breakdown of actual 2019 wins against the model's projection. The Measure column names the statistic computed on each column. I'm using "statistic" loosely, because these are just the number of samples, mean, standard deviation, min/max, and quartiles. As I show the different models, notice how much closer the statistics for 2019 Wins and Projected Wins get.
Additionally, I will be showing the statsmodels OLS model summary. This summary gives us the R-squared, coefficients, and p-values. These are helpful for deciding whether the model should be used and which factors are significant. More on this later.
Projected Wins with Offense Data Only
| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27 | 27.39 | 3.64 |
| std | 9 | 8.12 | 2.65 |
| min | 5 | 4.70 | 0.005 |
| 25% | 20 | 21.80 | 1.50 |
| 50% | 27 | 27.66 | 3.13 |
| 75% | 34 | 32.52 | 5.45 |
| max | 54 | 48.46 | 12.36 |
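The measures in this table are the ones Pandas reports from `describe()`. A small sketch of building such a table is below, with made-up numbers and with the Difference column taken as the absolute gap between actual and projected wins (an assumption that fits the non-negative minimum above).

```python
import pandas as pd

# Made-up numbers purely for illustration.
results = pd.DataFrame({
    "2019 Wins":      [20, 27, 34, 41],
    "Projected Wins": [22.5, 26.0, 31.0, 43.0],
})

# Absolute miss per team, so over- and under-projections don't cancel out.
results["Difference"] = (results["2019 Wins"] - results["Projected Wins"]).abs()

# count, mean, std, min, quartiles, and max for every column.
summary = results.describe()
print(summary)
```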
The P>|t| column describes the significance of the variable. The smaller the p-value, the more significant. Generally, if the p-value is less than 0.05 the variable is considered significant. It’s not a bad model. It’s only using offensive data. I’m shocked it even got that close.
I then built my pitching table with all of my metrics and stats and ran the same type of regression, OLS with statsmodels, but on pitching data. The R-squared is now .98, fantastic! I ran a similar test and found I’m off by as much as 13 wins, with an average of 2.9 per team. Some more info is below.
Projected Wins with Pitching Data Only
| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27.39 | 27.39 | 2.94 |
| std | 9.29 | 8.52 | 2.25 |
| min | 5 | 5.28 | 0.009 |
| 25% | 20 | 21.95 | 1.27 |
| 50% | 27 | 27.80 | 2.55 |
| 75% | 34 | 33.05 | 4.11 |
| max | 54 | 50.82 | 13.07 |
Like the previous model summary, anything with a P>|t| value less than 0.05 is considered significant. Here, we see that innings pitched, saves, and complete games are significant. Runs are close to the 0.05 mark. 0.05 isn't a magic number, but rather a good rule of thumb. It will be interesting to see which stats are significant when we combine offense and pitching data and rerun the regression.
I combined the two tables and reran the regression. My R-squared is .99. I’m now off by fewer than 6 wins at most, with an average of 1.4 wins. More info is below.
Projected Wins with Pitching and Offensive Data
| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27.4 | 27.4 | 1.4 |
| std | 9.3 | 9.1 | 1.06 |
| min | 5 | 6.5 | 0.01 |
| 25% | 20 | 20.54 | 0.61 |
| 50% | 27 | 27.9 | 1.24 |
| 75% | 34 | 33.83 | 2.04 |
| max | 54 | 54.75 | 5.82 |
From the summary above, here are the metrics/stats that are significant predictors (P>|t| < 0.05):
- Offensive Runs (R_x)
- BABIP
- Complete Games (GC)
- Games Relieved (GR)
- Saves (SV)
- Innings Pitched (IP)
- Runs Allowed (R_y)
- Wild Pitches (WP)
- Runs Allowed Per 9 Innings (ra9)
- Strikeout/Walk Ratio (kbb_y)
Interestingly, 8 out of the 10 significant factors are pitching stats. Here are some other stats and metrics that are somewhat significant (P>|t| < 0.1):
- Offensive Walks (BB_x)
- Sacrifice Flies (SF)
- Singles (1b)
- Secondary Average (secA)
- At-Bats per Home Run (abhr)
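Sorting variables into these two buckets can be done straight from a Series of p-values, such as the `pvalues` attribute of a fitted statsmodels result. The helper name `split_by_pvalue` and the example values below are hypothetical; they are not the p-values from my summary.

```python
import pandas as pd

def split_by_pvalue(pvalues, strong=0.05, weak=0.10):
    """Return (significant, borderline) variable names from a p-value Series."""
    significant = pvalues[pvalues < strong].index.tolist()
    borderline = pvalues[(pvalues >= strong) & (pvalues < weak)].index.tolist()
    return significant, borderline

# Made-up p-values for three placeholder variables.
example = pd.Series({"stat_a": 0.001, "stat_b": 0.07, "stat_c": 0.40})
significant, borderline = split_by_pvalue(example)
```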
I now have a trained model, and it’s pretty spot on. The average difference between projected wins and actual wins is small. This is a good sign. Baseball is a difficult sport to project. Being within 6 wins for every team is fantastic. I’m eager to input my 2020 data.
Before I input the 15-20 game samples, I have to extend them to match the full-season sample the model was trained on. I’m trying to predict year-end wins, so shouldn’t I use year-end stats? I took 56 and divided it by each team's games played. This gives a factor for how far each team's current totals must be scaled to reach a full 56-game season. I then multiplied each team's stats by this number and recalculated the statistics and metrics from the extended numbers. I also calculated an on-pace win total. Here are the top 10 projected win totals and schools.
Projected Top 10 Wins - 2020
| Team | Projected Wins | On-Pace Wins (current wins * 56/games played) | Projected Wins - On-Pace Wins | Wins (2019) | Win Diff. From 2019 |
|---|---|---|---|---|---|
| University of Mississippi | 59 | 53 | 6 | 41 | 18 |
| University of Tennessee | 57 | 49 | 8 | 40 | 17 |
| University of Florida | 50 | 53 | -3 | 34 | 16 |
| Tulane University | 49 | 49 | 0 | 32 | 17 |
| University of Miami | 49 | 42 | 7 | 34 | 15 |
| University of North Carolina-Greensboro | 49 | 42 | 7 | 34 | 15 |
| Davidson College | 48 | 46 | 2 | 29 | 19 |
| UCLA | 48 | 49 | -1 | 52 | -4 |
| East Carolina University | 48 | 46 | 2 | 46 | 2 |
| University of Louisville | 47 | 43 | 4 | 49 | -2 |
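The 56/games-played scaling behind the On-Pace Wins column and the extended stats can be sketched like this. The two teams and their totals are made up, and the column names (`G`, `W`, `R`, `AB`, `H`) are assumptions.

```python
import pandas as pd

SEASON_GAMES = 56  # full-season length used for the extension

# Made-up partial-season totals; column names are assumptions.
partial = pd.DataFrame(
    {"G": [16, 14], "W": [12, 7], "R": [96, 70], "AB": [540, 470], "H": [150, 120]},
    index=["Team A", "Team B"],
)

# One scaling factor per team: 56 divided by games played so far.
scale = SEASON_GAMES / partial["G"]

# Extend the counting stats row by row, then recompute rate stats from them.
counting = ["W", "R", "AB", "H"]
full = partial[counting].mul(scale, axis=0)
full["AVG"] = full["H"] / full["AB"]

on_pace_wins = full["W"]  # current wins * 56 / games played
```

Note that a rate stat like batting average comes out the same after scaling, because its numerator and denominator grow by the same factor. Only the counting stats change, which is why the extension matters for a model trained on full-season totals.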
A few important notes:
- Florida State was projected at 148 wins. The data from BaseballCube was wrong. The at-bats column equaled the games column, which made it look like the team hit north of .800.
- Ball State, Florida International, and UConn were taken out of the 2019 model due to incomplete data.
- Some teams were projected to have negative win totals for 2020. Many of these teams were in the Ivy League, and they did not play many games this year.
- RPI or any type of weighting was not taken into account. This is something I can work on for version 2.0.
- Yes, Ole Miss and Tennessee are projected to win more games than they will actually play. Perhaps projecting win percentage might be a better option. This will be heavily considered in version 2.0.
For a full list of projected wins, click here