Wednesday, May 27, 2020

Projecting College Baseball Wins For 2020

If you would like to skip all of the technical info and methodology, my list is here.

Covid-19 shut down the college baseball season on March 15th. Most teams played between 1/4 and 1/3 of their games. This was a small sample size, but the top teams started to emerge. Florida was ranked #1. UCLA was #1 in RPI. Alabama was having a good year as well. They finished 3rd in RPI. Kansas State had a vastly improved pitching staff when compared to last year. They were top 10 in team ERA, team WHIP, and team runs allowed. Oh, what could’ve been. 

I have been reflecting on the past semester, and I'm curious what would have happened if the college baseball season hadn't been canceled. I decided to try to project total wins for all 299 D1 baseball teams in 2020. This meant scraping all of the 2019 team data to train the model, as well as all of the 2020 team data to predict wins. College baseball data is not nearly as easy to find as MLB data. I had to go to TheBaseballCube.com and scrape every single team’s data. I had to get the link for each team's page, then the link to each year I was scraping, and finally all of the data on the pages.

However, this data was by player. There was a team totals row for a few of the teams, but I summed all of the players' columns, recalculated the more advanced metrics, and added a few that weren’t included. All of the scraping and data cleaning/manipulation was done with BeautifulSoup4 and Pandas in Python. Thank God for Python.
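
The roll-up from player rows to team rows can be sketched roughly like this. It is a minimal illustration in plain Python rather than Pandas, and the column names (AB, H, HR, SO, SF) and the numbers are made up for the example, not taken from the scraped pages. The point is that rate stats such as batting average and BABIP are recomputed from the summed totals instead of being averaged across players.

```python
# Hypothetical player rows for one team; column names and values are
# illustrative only, not the actual headers from the scraped pages.
players = [
    {"AB": 200, "H": 60, "HR": 8, "SO": 40, "SF": 3},
    {"AB": 180, "H": 45, "HR": 2, "SO": 50, "SF": 1},
    {"AB": 150, "H": 42, "HR": 5, "SO": 30, "SF": 2},
]


def team_totals(rows):
    """Sum every counting stat across a team's players."""
    totals = {}
    for row in rows:
        for stat, value in row.items():
            totals[stat] = totals.get(stat, 0) + value
    return totals


def add_rate_stats(totals):
    """Recompute rate stats from the summed counting stats."""
    out = dict(totals)
    out["AVG"] = totals["H"] / totals["AB"]
    # BABIP = (H - HR) / (AB - SO - HR + SF)
    out["BABIP"] = (totals["H"] - totals["HR"]) / (
        totals["AB"] - totals["SO"] - totals["HR"] + totals["SF"]
    )
    return out


team = add_rate_stats(team_totals(players))
print(round(team["AVG"], 3), round(team["BABIP"], 3))
```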

Once I had all of my 2019 hitting metrics, I ran a multiple linear regression with OLS from statsmodels in Python to project win totals. My R-squared was around .97. That’s good. Really good. I tested the model with the 2019 data and compared its projections to the actual 2019 win totals. I was off by, at most, 15 wins. However, my average difference was closer to 3.
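
For readers who want to see what an OLS fit is actually solving, here is a small dependency-free sketch. It is a stand-in for the statsmodels call, not the code used for this post: it solves the normal equations directly, it assumes an intercept term (the post does not say whether one was used), and the two predictors and all of the numbers are invented for illustration.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows (one per team) and y is the list of win totals.
    A constant column is prepended so the model has an intercept
    (an assumption; the post does not state this).
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Projected wins for one team's stat row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def r_squared(actual, projected):
    """Centered R-squared: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, projected))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up (runs, hits) rows and wins generated as 2 + 0.05*runs + 0.02*hits.
X = [[300, 450], [350, 500], [280, 470], [400, 520], [320, 430]]
y = [26.0, 29.5, 25.4, 32.4, 26.6]

coefs = fit_ols(X, y)
projected = [predict(coefs, row) for row in X]
r2 = r_squared(y, projected)
```

Because the toy data is exactly linear, the fit recovers the generating coefficients and the R-squared is essentially 1. Real team data will not behave that nicely, and statsmodels additionally reports the p-values discussed below.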

Below, I am showing the breakdown of wins from 2019 and the model's projection. The measure column shows the statistic which is being computed on the numbers. I'm using "statistic" loosely, because this is only showing the number of samples, mean, standard deviation, max/min, and quartiles. As I show the different models, notice how much closer the statistics get between 2019 Wins and Projected Wins.
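
A summary like this can be produced with Python's standard `statistics` module. The sketch below is an assumption about how the Difference column is built (the absolute gap between actual and projected wins, which fits the positive minimums in the tables), and the win totals are made up.

```python
import statistics


def describe(values):
    """Count, mean, sample std, min, quartiles, and max for a list of numbers."""
    # method="inclusive" interpolates linearly between data points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


actual = [30, 25, 41, 18, 36]                # made-up 2019 win totals
projected = [28.5, 27.0, 38.0, 21.5, 36.5]   # made-up model output
# Assumed definition of the Difference column: absolute error per team.
difference = [abs(a - p) for a, p in zip(actual, projected)]

summary = describe(difference)
for measure, value in summary.items():
    print(measure, round(value, 2))
```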

Additionally, I will be showing the statsmodels OLS model summary. This summary gives us the R-squared, coefficients, and p-values. These are helpful for deciding whether the model should be used and which factors are significant. More on this later.

Projected Wins with Offense Data Only

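| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27 | 27.39 | 3.64 |
| std | 9 | 8.12 | 2.65 |
| min | 5 | 4.70 | 0.005 |
| 25% | 20 | 21.80 | 1.50 |
| 50% | 27 | 27.66 | 3.13 |
| 75% | 34 | 32.52 | 5.45 |
| max | 54 | 48.46 | 12.36 |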

testHitting Coefficients

The P>|t| column describes the significance of the variable. The smaller the p-value, the more significant. Generally, if the p-value is less than 0.05 the variable is considered significant. It’s not a bad model. It’s only using offensive data. I’m shocked it even got that close.

I created my pitching table with all of my metrics and stats. I ran the same type of regression, OLS with statsmodels, but with pitching data. The R-squared is now .98, fantastic! I ran a similar test and found I’m off by as much as 13 wins, but by an average of 2.9 per team. Some more info is below.

Projected Wins with Pitching Data Only

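| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27.39 | 27.39 | 2.94 |
| std | 9.29 | 8.52 | 2.25 |
| min | 5 | 5.28 | 0.009 |
| 25% | 20 | 21.95 | 1.27 |
| 50% | 27 | 27.80 | 2.55 |
| 75% | 34 | 33.05 | 4.11 |
| max | 54 | 50.82 | 13.07 |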

testHitting Coefficients

Like the previous model summary, anything with a P>|t| value less than 0.05 is considered significant. Here, we see that innings pitched, saves, and complete games are significant. Runs are close to the 0.05 mark. 0.05 isn't a magic number, but rather a good rule of thumb. It will be interesting to see which stats are significant when we combine offense and pitching data and rerun the regression.

I combined the two tables and ran the regression. My R-squared is .99. I’m now off by at most 5.82 wins and by an average of 1.4 wins. More info is below.

Projected Wins with Pitching and Offensive Data

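| Measure | 2019 Wins | Projected Wins | Difference |
|---|---|---|---|
| count | 296 | 296 | 296 |
| mean | 27.4 | 27.4 | 1.4 |
| std | 9.3 | 9.1 | 1.06 |
| min | 5 | 6.5 | 0.01 |
| 25% | 20 | 20.54 | 0.61 |
| 50% | 27 | 27.9 | 1.24 |
| 75% | 34 | 33.83 | 2.04 |
| max | 54 | 54.75 | 5.82 |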

All Stats Header All Stats Coefficients 1 All Stats Coefficients 2

From the summary above, here are the metrics/stats which are significant predictors (P>|t| < 0.05):
  • Offensive Runs (R_x)
  • BABIP
  • Complete Games (GC)
  • Games Relieved (GR)
  • Saves (SV)
  • Innings Pitched (IP)
  • Runs Allowed (R_y)
  • Wild Pitches (WP)
  • Runs Allowed Per 9 Innings (ra9)
  • Strikeout/Walk Ratio (kbb_y)

Interestingly, 8 out of the 10 significant factors are pitching stats. Here are some other stats and metrics which are somewhat significant (P>|t| < 0.1):
  • Offensive Walks (BB_x)
  • Sacrifice Flies (SF)
  • Singles (1b)
  • Secondary Average (secA)
  • At-Bats per Home Run (abhr)

I now have a trained model and it’s pretty spot on. The average difference between projected wins and actual wins is small. This is a good sign. Baseball is a difficult sport to project. Being within 6 wins for every team is fantastic. I’m eager to input my 2020 data.

Before I input the 15-20 game samples, I have to scale them up to match the full-season sample the model was trained on. I’m trying to predict year-end wins, so shouldn’t I use year-end stats? I took 56, the length of a full regular season, and divided it by each team's games played. This gives the factor by which each team's numbers would grow if the season were played out at the same pace. I then multiplied each team's stats by this number and recalculated the statistics and metrics with the extended numbers. I also calculated an on-pace win total. Here are the top 10 projected win totals and schools.
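
The scaling step boils down to a few lines. This is a minimal sketch with a hypothetical team and hypothetical stat names. Only the counting stats are multiplied; rate stats are then recomputed from the scaled numbers, which leaves them unchanged.

```python
SEASON_GAMES = 56  # full-season length used in this post


def extend_to_full_season(stats, games_played, counting_keys):
    """Scale counting stats by 56 / games_played; leave other keys untouched."""
    factor = SEASON_GAMES / games_played
    extended = dict(stats)
    for key in counting_keys:
        extended[key] = stats[key] * factor
    return extended


def on_pace_wins(current_wins, games_played):
    """Current wins * 56 / games played."""
    return current_wins * SEASON_GAMES / games_played


# Hypothetical team: 16 wins through 17 games.
team = {"W": 16, "R": 120, "H": 160, "AB": 560}
full = extend_to_full_season(team, 17, ["R", "H", "AB"])

print(round(on_pace_wins(team["W"], 17)))   # on-pace win total
print(round(full["R"]))                     # runs scaled to 56 games
print(round(full["H"] / full["AB"], 3))     # batting average is unchanged
```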

Projected Top 10 Wins - 2020

| Team | Projected Wins | On-Pace Wins (current wins * 56 / games played) | Projected Wins - On-Pace Wins | Wins (2019) | Win Diff. From 2019 |
|---|---|---|---|---|---|
| University of Mississippi | 59 | 53 | 6 | 41 | 18 |
| University of Tennessee | 57 | 49 | 8 | 40 | 17 |
| University of Florida | 50 | 53 | -3 | 34 | 16 |
| Tulane University | 49 | 49 | 0 | 32 | 17 |
| University of Miami | 49 | 42 | 7 | 34 | 15 |
| University of North Carolina-Greensboro | 49 | 42 | 7 | 34 | 15 |
| Davidson College | 48 | 46 | 2 | 29 | 19 |
| UCLA | 48 | 49 | -1 | 52 | -4 |
| East Carolina University | 48 | 46 | 2 | 46 | 2 |
| University of Louisville | 47 | 43 | 4 | 49 | -2 |

A few important notes:
  • Florida State was projected at 148 wins. The data from BaseballCube was wrong: the at-bats column equaled the games column, so the team hit north of .800.
  • Ball State, Florida International, and UConn were taken out of the 2019 model due to incomplete data.
  • Some teams were projected to have negative win totals for 2020. Many of these teams were in the Ivy League and did not play many games this year.
  • RPI or any type of weighting was not taken into account. This is something I can work on for version 2.0.
  • Yes, Ole Miss and Tennessee are projected to win more games than they will actually play. Perhaps projecting win percentage might be a better option. This will be heavily considered in version 2.0. 
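
Problems like the ones in these notes could be caught with a couple of sanity checks before and after the model runs. The sketch below is only a suggestion: the thresholds (at-bats must exceed games, a team batting average above .500 is suspicious) and the column names are my own assumptions, not part of the current pipeline.

```python
SEASON_GAMES = 56


def data_problems(team):
    """Return a list of reasons a team's scraped row looks unreliable."""
    problems = []
    if team["AB"] <= team["G"]:
        problems.append("at-bats not greater than games")
    if team["AB"] > 0 and team["H"] / team["AB"] > 0.5:
        problems.append("team batting average above .500")
    return problems


def projection_in_range(projected_wins):
    """A team cannot win fewer than 0 or more than 56 games."""
    return 0 <= projected_wins <= SEASON_GAMES


# Hypothetical rows: one with at-bats equal to games, one that looks normal.
bad_row = {"G": 17, "AB": 17, "H": 14}
good_row = {"G": 17, "AB": 560, "H": 160}

print(data_problems(bad_row))
print(data_problems(good_row))
print(projection_in_range(148), projection_in_range(59), projection_in_range(48))
```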

For a full list of projected wins, click here

Tuesday, March 19, 2019

MLB Pitch Data (2015-2018)



Once I realized I enjoy coding in Python and I love baseball, I decided to see where the two could meet. I talked to some professionals and they said Kaggle has some public data sets I could check out. They were correct. I found this one. Paul Schale, I don't know you, but I love you. Finding this data set was the highlight of my February.

My first idea was to find the percentage of leadoff walks which end up scoring. I personally believe that leadoff walks are the biggest contributor to big innings. I wanted to prove this. After looking at the CSV files, I realized this would be more difficult than I thought. There isn't a runner ID or anything, so I can't be totally sure if the hitter who was walked actually scored. That's fine. I can find something else using this huge data set. So, I did this:

Every MLB pitch (2015-2018) location as the ball crossed home plate. 


That is definitely a jumbled mess of data points, but I'm here to make sense of it. 

First off, there is a black rectangle in the middle of all of those dots which represents the strike zone. The strike zone for each hitter is different, but in the large CSV file, there was a strike zone parameter for each player. The bottom value was the distance from the ground to the bottom of the strike zone and the top value was the distance from the ground to the top of the strike zone. I wrote a simple Python script to go through and average the bottom and top values of the strike zone. The width was fixed, because the dimensions of the plate never change.
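
A minimal sketch of that averaging script is below. The column names `sz_bot` and `sz_top`, the tiny inline CSV, and the assumption that locations are measured in feet are all placeholders for illustration; check them against the actual file before reusing this.

```python
import csv
import io

# A tiny stand-in for the large pitch CSV; column names are assumed.
sample_csv = io.StringIO(
    "px,pz,pitch_type,sz_bot,sz_top\n"
    "0.10,2.50,FF,1.50,3.40\n"
    "-0.40,1.90,CH,1.60,3.50\n"
    "0.75,3.10,SL,1.55,3.45\n"
)

# Home plate is 17 inches wide; this assumes locations are in feet.
PLATE_HALF_WIDTH_FT = 17 / 12 / 2


def average_strike_zone(csv_file):
    """Average the bottom and top of the strike zone over every row."""
    bot_total = 0.0
    top_total = 0.0
    rows = 0
    for row in csv.DictReader(csv_file):
        bot_total += float(row["sz_bot"])
        top_total += float(row["sz_top"])
        rows += 1
    return bot_total / rows, top_total / rows


avg_bot, avg_top = average_strike_zone(sample_csv)
# Corners of the rectangle to draw: (left, bottom) and (right, top).
zone = (-PLATE_HALF_WIDTH_FT, avg_bot, PLATE_HALF_WIDTH_FT, avg_top)
print(zone)
```

With the real file, `sample_csv` would be replaced by an open file handle.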

Next, let’s get into all of the colors. When I look at this picture I see an elephant. It's almost like one of those ink-blot pictures. The green looks like his big ears and that blue streak down the middle could be his trunk. However, that isn't the point of this. When looking at this much data you must make it meaningful. 

Let's start with the red dots. Those are Four-Seam Fastballs. They generally are the fastest and straightest pitch. Pitchers tend to throw more Four-Seam Fastballs than any other pitch. It's no surprise to me that red can be seen almost anywhere on the graph.

Next are the black dots. They represent Two-Seam Fastballs. These are still among the faster pitches, but they move a lot more than most Four-Seam Fastballs. In fact, Two-Seam Fastballs move to the arm side of the pitcher. If a right-handed pitcher throws a Two-Seam Fastball, the ball will move or break from left to right as viewed by the pitcher. The Two-Seam Fastball can be a very effective pitch when set up correctly.

Change-Ups, a slower pitch, are represented by the blue dots. They are often used to throw off the timing of the hitter and get ground balls. If a pitcher has a good Fastball and Change-Up, he can be dangerous. The Change-Up often breaks down towards the bottom of the strike zone or the ground.

The yellow dots are Curveballs. They are slow like Change-Ups; however, Curveballs have more loop in them and are even slower. Curveballs can break straight down or hard to the side. Curves move to the glove side of a pitcher. So if a right-handed pitcher throws a Curveball, the ball will move from right to left, as seen from the pitcher's point of view.

A Cutter is shown by the orange dots. Mariano Rivera made this pitch famous. The ball does the opposite of a Two-Seam Fastball. It breaks to the glove side of the pitcher. Not every pitcher has a Cutter, so they are a little less common than the first four pitches I talked about.

The purple dots show Sliders. Sliders are faster than Curveballs but their break/movement is more horizontal than vertical. They have similar movement to Cutters, but Sliders move more. They aren't as common as Curveballs, but they are still used often. 

Finally, the green dots are everything else. There are about 15-20 pitch types in the file, and I wanted to distinguish the main pitches. I might work with these less common pitch types in a later post.

I bet you're wondering how I used color dots to represent pitches and categorize them. You're in luck. I want to show this off. 




Within my script, I ran a loop which read the CSV file row by row. I appended the pitch types to the list 'pitch'. I did this while I was recording the x and y values of the pitch locations, so position 0 of the lists 'x', 'y', and 'pitch' all correspond to the same pitch. Then I created the function 'pltcolor'. I call the function using cols = pltcolor(pitch), passing the list 'pitch' to 'pltcolor'. From there, the function creates a new list called 'cols', which is where the colors will be stored. 'pltcolor' then uses 'lst' and 'pitch' and runs a loop which goes position by position, setting the color of each pitch in 'cols'.

Later on in the program, while I use matplotlib, I plot x, y, and cols. This gives us the x and y value of the point and then colors the dot.
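
For anyone reading without the screenshot, here is a rough reconstruction of the idea. It is a sketch and not the exact script: the two-letter pitch codes (FF, FT, CH, CU, FC, SL) are my assumption about how the file labels pitch types, and `plot_pitches` is a hypothetical helper name. The colors match the ones described above.

```python
def pltcolor(lst):
    """Map each pitch-type code in lst to a matplotlib color name."""
    colors = {
        "FF": "red",     # Four-Seam Fastball
        "FT": "black",   # Two-Seam Fastball
        "CH": "blue",    # Change-Up
        "CU": "yellow",  # Curveball
        "FC": "orange",  # Cutter
        "SL": "purple",  # Slider
    }
    cols = []
    for pitch_type in lst:
        # Anything not listed above is grouped as "other" and drawn in green.
        cols.append(colors.get(pitch_type, "green"))
    return cols


def plot_pitches(x, y, cols):
    """Scatter every pitch location, one small colored dot per pitch."""
    # Imported here so pltcolor can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    plt.scatter(x, y, c=cols, s=1)
    plt.xlabel("Horizontal location")
    plt.ylabel("Height")
    plt.show()


# Tiny made-up sample; position i of x, y, and pitch describes the same pitch.
x = [0.10, -0.40, 0.75, 0.00]
y = [2.50, 1.90, 3.10, 2.20]
pitch = ["FF", "SL", "KN", "CH"]

cols = pltcolor(pitch)
print(cols)
# plot_pitches(x, y, cols)  # uncomment to draw the scatter plot
```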

If you want to see how to plot certain pitches and expand on this you can contact me or wait until my next post.