Once I realized I enjoy coding in Python and I love baseball, I decided to see where the two could meet. I talked to a professional, who said Kaggle has some public data sets I could check out. They were correct. I found this one. Paul Schale, I don't know you, but I love you. Finding this data set was the highlight of my February.
My first idea was to find the percentage of leadoff walks that end up scoring. I personally believe that leadoff walks are the biggest contributor to big innings, and I wanted to prove it. After looking at the CSV files, I realized this would be more difficult than I thought. There isn't a runner ID or anything similar, so I can't be sure whether the hitter who was walked actually scored. That's fine. I can find something else using this huge data set. So, I did this:
That is definitely a jumbled mess of data points, but I'm here to make sense of it.
First off, there is a black rectangle in the middle of all of those dots, which represents the strike zone. The strike zone is different for each hitter, but the large CSV file has strike zone parameters for each player. The bottom value is the distance from the ground to the bottom of the strike zone, and the top value is the distance from the ground to the top of the strike zone. I wrote a simple Python script to go through the file and average the bottom and top values. The width is fixed, because the dimensions of the plate never change.
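For anyone following along, here is a minimal sketch of that averaging step. It is not the exact script from this post. The column names `sz_bot` and `sz_top` are assumptions about how the file labels the two strike zone values, and the sample rows are made up so the example runs without the real CSV.

```python
import csv
import io


def average_strike_zone(rows):
    """Return (mean bottom, mean top) over rows that have both values."""
    bottoms, tops = [], []
    for row in rows:
        try:
            # "sz_bot" and "sz_top" are assumed column names.
            bottom = float(row["sz_bot"])
            top = float(row["sz_top"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or malformed values
        bottoms.append(bottom)
        tops.append(top)
    if not bottoms:
        raise ValueError("no usable strike zone rows")
    return sum(bottoms) / len(bottoms), sum(tops) / len(tops)


# Tiny in-memory stand-in for the real CSV file (made-up values).
sample = io.StringIO("sz_bot,sz_top\n1.5,3.5\n1.7,3.3\n,\n")
avg_bottom, avg_top = average_strike_zone(csv.DictReader(sample))
```

With the real data you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample, where `path` points at wherever you saved the file.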
Next, let’s get into all of the colors. When I look at this picture I see an elephant. It's almost like one of those ink-blot pictures. The green looks like his big ears and that blue streak down the middle could be his trunk. However, that isn't the point of this. When looking at this much data you must make it meaningful.
Let's start with the red dots. Those are Four-Seam Fastballs. They are generally the fastest and straightest pitch. Pitchers tend to throw more Four-Seam Fastballs than any other pitch. It's no surprise to me that red can be seen almost anywhere on the graph.
Next are the black dots. They represent Two-Seam Fastballs. These are still among the faster pitches, but they move a lot more than most Four-Seam Fastballs. In fact, Two-Seam Fastballs move to the arm side of the pitcher. If a right-handed pitcher throws a Two-Seam Fastball, the ball will move or break from left to right as viewed by the pitcher. The Two-Seam Fastball can be a very effective pitch when set up correctly.
Change-Ups, a slower pitch, are represented by the blue dots. They are often used to throw off the timing of the hitter and get ground balls. If a pitcher has a good Fastball and Change-Up, he can be dangerous. The Change-Up often breaks down towards the bottom of the strike zone or the ground.
The yellow dots are Curveballs. Like Change-Ups, they are slower pitches, but Curveballs have more loop in them and are slower still. Curveballs can break straight down or hard to the side. Curves move to the glove side of a pitcher. So if a right-handed pitcher throws a Curveball, the ball will move from right to left, as seen from the pitcher's point of view.
A Cutter is shown by the orange dots. Mariano Rivera made this pitch famous. The ball does the opposite of a Two-Seam Fastball: it breaks to the glove side of the pitcher. Not every pitcher has a Cutter, so they are a little less common than the first four pitches I talked about.
The purple dots show Sliders. Sliders are faster than Curveballs, and their break is more horizontal than vertical. They have similar movement to Cutters, but Sliders move more. They aren't as common as Curveballs, but they are still used often.
Finally, the green dots are everything else. There are about 15-20 pitch types in the file, and I wanted to distinguish the main pitches. I might work with these less common pitch types in a later post.
I bet you're wondering how I used colored dots to represent pitches and categorize them. You're in luck. I want to show this off.
Within my script, I ran a loop which read the CSV file row by row. I appended each pitch type to the list 'pitch' while also recording the x and y values of the pitch locations. So position 0 of the lists 'x', 'y', and 'pitch' all correspond to the same pitch. Then I created the function 'pltcolor'. I call it with cols = pltcolor(pitch), which passes the list 'pitch' to 'pltcolor'. From there, the function creates a new list called 'cols', which is where the colors will be stored. 'pltcolor' then runs a loop over 'lst' (the 'pitch' list it was passed) which goes position by position, setting the color of each pitch in 'cols'.
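A minimal sketch of a function along those lines is below. It is not the original code. The two-letter pitch codes in the lookup table are assumptions about how the file labels each pitch type, and the table itself is a substitution: the description above only says the function loops position by position, so an if/elif chain would fit it just as well.

```python
# Assumed pitch-type codes mapped to the colors used in the picture.
# The codes are hypothetical; check them against the actual CSV values.
PITCH_COLORS = {
    "FF": "red",     # Four-Seam Fastball
    "FT": "black",   # Two-Seam Fastball
    "CH": "blue",    # Change-Up
    "CU": "yellow",  # Curveball
    "FC": "orange",  # Cutter
    "SL": "purple",  # Slider
}


def pltcolor(lst):
    """Return one color per pitch, in the same order as lst."""
    cols = []
    for pitch_type in lst:
        # Anything not listed above falls into the green "other" bucket.
        cols.append(PITCH_COLORS.get(pitch_type, "green"))
    return cols


pitch = ["FF", "CU", "KN", "SL"]  # made-up sample; "KN" is not in the table
cols = pltcolor(pitch)
```

Because 'cols' is built in the same order as 'pitch', position 0 of 'cols' still lines up with position 0 of 'x' and 'y'.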
Later on in the program, when I use matplotlib, I plot x, y, and cols. The x and y values place each point, and the matching entry in cols sets the color of the dot.
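The plotting step might look roughly like the sketch below. This is an illustration under assumptions, not the script behind the picture: the helper name `plot_pitches` is made up, the sample points are invented, and the coordinates are assumed to be in feet with x measured from the center of the plate. The plate width of 17 inches is the one fixed number here.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle


def plot_pitches(x, y, cols, zone_bottom, zone_top, half_width=17 / 24):
    """Scatter the pitches and outline the averaged strike zone.

    half_width is half of the 17-inch plate, expressed in feet
    (assumes the pitch coordinates are in feet, centered on the plate).
    """
    fig, ax = plt.subplots()
    ax.scatter(x, y, c=cols, s=4)  # one color per point, taken from cols
    zone = Rectangle(
        (-half_width, zone_bottom),  # lower-left corner
        2 * half_width,              # width
        zone_top - zone_bottom,      # height
        fill=False,
        edgecolor="black",
        linewidth=2,
    )
    ax.add_patch(zone)
    ax.set_xlabel("horizontal location")
    ax.set_ylabel("height above ground")
    return fig, ax


# Made-up sample points, one color per point.
x = [-0.5, 0.2, 1.1]
y = [2.0, 2.8, 1.2]
cols = ["red", "yellow", "green"]
fig, ax = plot_pitches(x, y, cols, zone_bottom=1.6, zone_top=3.4)
```

Calling `fig.savefig("pitches.png")` afterwards would write the picture to disk. Here the rectangle is drawn as an outline so the dots inside the zone stay visible, which is a choice made for this sketch.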
If you want to see how to plot certain pitches and expand on this you can contact me or wait until my next post.