NeuRayl - a Neural Network Baseball Projection System
This semester, I finally got a chance to take a class on a topic I’ve wanted to study for a while: Artificial Neural Networks (and Computational Evolution). I’d been fascinated by neural networks for the longest time, especially with all the hype they’d been getting, and had tried multiple times to read books or articles on them to little or no avail. So when I found out that OU was offering a class on them that fit my requirements, taught by a professor whose classes I’ve enjoyed in the past, I quickly signed up. The class was great, covering the basics of neural networks and evolutionary computing before reading and discussing relevant recent academic work in those fields. I learned so much and am so glad I took the class. Maybe the best part was our final project, where we were given free rein to apply what we’d learned to a domain of our choice: I chose baseball.
Baseball salaries have been going through the roof lately, with the average player making about $4 million a year and stars making 5 to 10 times that. Even the average salary is nearly out of reach for my favorite team, the Tampa Bay Rays, who carry a payroll of around $77 million for their 25-man roster. When the Rays sign players to free agent contracts, they must take special care, since any one contract will generally occupy a heavy portion of their payroll, which makes a good projection system that much more important. There are a few public (non-internal) projection systems used in baseball, such as Steamer, ZiPS and Marcel, which all seem to use variations of linear combinations of past years’ data. Although this will accurately regress a player’s stats to their career mean, it fails to capture aging trends in the league: for example, that players consistently have a breakout season in their third year in the league (say), that young players tend to have upward career trajectories, or that older players tend to have downward trajectories. The existing systems fail to capture these trends, which led me to create NeuRayl, a neural network based baseball projection system.
Why Neural Networks?
Well, neural networks are really good at capturing non-linear trends, and projecting baseball stats seems to be a non-linear problem.
What configuration of NNs did you use?
For my class project, I kept it pretty simple and tested two networks: one with a single hidden layer and one with two hidden layers. I tested each of them on three configurations of the data: an input vector with a single year of back data (One Back), one with three years of back data (Three Back) and one with five years of back data (Five Back). Each layer used the linear activation function.
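To make the One Back / Three Back / Five Back idea concrete, here is a minimal sketch of how such an input vector could be assembled. It is not the actual NeuRayl code: the stat columns in `STATS`, the `build_input` helper and the sample numbers are all made up for illustration, and missing seasons are padded with zeroes, which is how this post describes handling players without enough back data.

```python
from typing import Dict, List

# Hypothetical stat columns; the post does not list the exact inputs used.
STATS = ["G", "AB", "H", "HR", "BB"]


def build_input(seasons: Dict[int, Dict[str, float]],
                target_year: int,
                years_back: int) -> List[float]:
    """Flatten the `years_back` seasons before `target_year` into one vector.

    The most recent season comes first. A season with no data is filled
    with zeroes so every vector has the same length.
    """
    vector: List[float] = []
    for offset in range(1, years_back + 1):
        season = seasons.get(target_year - offset, {})
        vector.extend(float(season.get(stat, 0.0)) for stat in STATS)
    return vector


# Made-up stat lines for an imaginary player with two seasons of history.
player = {
    2014: {"G": 150, "AB": 560, "H": 150, "HR": 20, "BB": 60},
    2015: {"G": 158, "AB": 600, "H": 170, "HR": 30, "BB": 70},
}

one_back = build_input(player, 2016, 1)      # 5 values: the 2015 season
three_back = build_input(player, 2016, 3)    # 15 values: 2015, 2014, zero padding
```

The only thing that changes between configurations is `years_back`, so the network's input layer just grows from 5 to 15 to 25 values in this toy setup.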
Tell me more about your data
Okay! I used the Lahman DB, which has statistics going all the way back to the 1870s. Unfortunately, I wasn’t able to use data going back quite that far, because many of the input stats I was using weren’t calculated then and I didn’t want to poison the predictor with false zeroes. In practice, my back data starts around 1970.
Okay, so what do your results look like?
I started with internal comparisons: how do the individual configurations of the neural network compare to one another? I measured this with Mean Squared Error (MSE). The first image is the single hidden layer configuration and the second is the two hidden layer configuration.
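For anyone who hasn't run into MSE before, here is a small sketch of the calculation. The `mean_squared_error` helper and the numbers are illustrative only. How the errors were averaged across players and stats in the real experiments isn't spelled out here, so treat this as the textbook definition applied to a single stat line.

```python
from typing import Sequence


def mean_squared_error(predicted: Sequence[float],
                       actual: Sequence[float]) -> float:
    """Average of the squared differences between predictions and actuals."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one value")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared) / len(squared)


# Made-up projection vs. actual for (hits, home runs, walks).
typical_miss = mean_squared_error([150, 25, 60], [160, 20, 63])

# Because the error is squared, one big miss dominates: projecting 100 hits
# for a player who ends up with 0 contributes 100 ** 2 = 10000 on its own.
big_miss = mean_squared_error([100], [0])
```

Squaring means the metric punishes a few large misses much more than many small ones, which is worth keeping in mind when reading the charts.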
After internal comparisons, I compared my system to the few existing systems for three players: Mike Trout, Evan Longoria and Bryce Harper. I chose these players since they demonstrate three different directions of stats compared to the previous year’s stats: similar, up and down respectively. These players would show the range of NeuRayl.
What were you hoping to achieve with this?
I’d really like to see my system meet or exceed the performance of the current systems. The full MSE data isn’t a great comparator, because to compare properly I would need predictions from each of the other systems going back to the 1970s, which is around where my data starts, and (to the best of my knowledge) those don’t exist. But in the individual player comparisons, NeuRayl seems to stack up pretty nicely against the current projection systems: it’s a little better for some stats and the current systems are a little better for others, which makes sense. So overall I’m pretty happy with it.
What do you think was holding you back?
So I’ve identified a few things that I think really contributed to the MSE being what it was: injuries, lack of playing time, and the linear activation function. Injuries and lack of playing time can’t really be predicted accurately from raw stats (except maybe “if a player is playing under replacement level then he won’t play”, but that would involve projecting a subset of games at a time), yet they hugely affect the MSE. If a player is projected to have ~100 hits but ends up having none due to injury or demotion, then that’s a squared error of 10,000 right there, not even counting the rest of the stats that are also zeroes. The linear activation function also seems like a mistake in hindsight, since stacking multiple layers with linear activation functions is equivalent to a single linear layer. I really need to break up the linearity by using a non-linear activation function such as sigmoid or ReLU.
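The point about linear layers collapsing is easy to check by hand. The sketch below uses tiny made-up weight matrices (nothing from the trained networks) and leaves out biases to keep it short; with biases the same argument holds, since they fold into a single combined bias. Two linear layers give exactly the same output as one layer whose weights are the product of the two, and putting a ReLU between them breaks that equivalence.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product m @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]


# Illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
W2 = [[2.0, -1.0, 0.5]]
x = [1.0, 2.0]

# Two layers with linear (identity) activation...
two_layer = matvec(W2, matvec(W1, x))
# ...match one layer whose weight matrix is W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU in between, the negative hidden value is clipped to zero
# and the output no longer matches the collapsed layer.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `two_layer` and `collapsed` both come out to `[-6.0]`, while `with_relu` is `[0.0]`, so the extra hidden layer only adds modeling power once the activation is non-linear.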
How are you planning on improving it going forward?
- Besides changing the activation function, I’d really like to try a few different configurations of NNs, adding in another layer or two to see if that improves the MSE.
- I’d really like to add in minor league or college stats. Right now if a player doesn’t have one, three or five years of back data, I fill it in with zeroes. If I had college or minor league data, I could adjust it to a major league level and use that as another factor. The only issue with this would be that good minor league or college data is hard to find, especially for the lower minors.
I want to see the code, where can I find it?
It’s just up on my GitHub page!