# Moneyball: How linear regression changed baseball

Friday, July 28, 2017

It’s unbelievable how much you don’t know about the game you’ve been playing all your life.
Mickey Mantle

Moneyball tells the story of the Oakland A’s in 2002 [1]. It was one of the poorest teams in baseball. Billy Beane became its General Manager in 1997, and the team’s performance started to improve. But at the beginning of 2002, the Oakland A’s lost three key players. Could they continue improving?

Billy Beane, with his colleague Paul DePodesta, followed an analytical, sabermetric approach to assembling a competitive baseball team, despite Oakland’s disadvantaged revenue situation. Their analysis suggested that some skills were undervalued and some skills were overvalued. If they could detect the undervalued skills, they could find players at a bargain.

Their analysis indicated that a team needs to win at least 95 games to make the playoffs. Based on this, the A’s calculated that they must score 135 more runs than they allow during the regular season to expect to win 95 games. We can verify this using linear regression. I’m going to use R [3]. The dataset [2], baseball.csv, consists of 15 variables, whose descriptions are given in the codebook.

Thus, a team with at least 95 wins has almost always made the playoffs, which agrees with what DePodesta predicted.

The above plot shows a linear relationship between Wins and Run Difference. Now, let’s build our regression model.
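The model-building step can be sketched in R roughly as follows. This is a minimal sketch, not the original code: the column names `W`, `RS` and `RA` are assumed, the helper `fit_wins_model` is hypothetical, and the five-row data frame is made up purely so the snippet runs without baseball.csv.

```r
# Hypothetical helper: regress wins on run difference.
# Assumes a data frame with columns W (wins), RS (runs scored), RA (runs allowed).
fit_wins_model <- function(teams) {
  teams$RD <- teams$RS - teams$RA  # run difference
  lm(W ~ RD, data = teams)
}

# Made-up rows (NOT real season data), only to make the sketch runnable.
toy <- data.frame(
  RS = c(700, 800, 650, 900, 750),
  RA = c(750, 700, 700, 800, 750),
  W  = c(76, 91, 76, 91, 81)
)

wins_model <- fit_wins_model(toy)
print(coef(wins_model))  # intercept and slope of W on RD
```

Because the toy rows are invented, the coefficients they produce differ from those of the fit on the real data.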

Our regression equation for wins is:
W = 80.8814 + 0.105766 × RD and W >= 95
⇒ 80.8814 + 0.105766 × RD >= 95
⇒ RD >= 133.5 (approximately)
Thus, a team needs to score roughly 135 more runs than it allows to get into the playoffs.
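The threshold can be checked with a few lines of R. This is only a sketch that rearranges the fitted equation quoted above; the variable names are my own.

```r
# Coefficients taken from the wins regression quoted in the text.
intercept <- 80.8814
slope <- 0.105766
target_wins <- 95

# Solve intercept + slope * RD >= target_wins for RD.
rd_needed <- (target_wins - intercept) / slope
print(round(rd_needed, 1))
```

The result lands just below the 135-run figure the A’s worked with.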

Now, how do the A’s score more runs?
The A’s discovered that two baseball statistics were significantly more important than others:

• On-Base percentage (OBP): Percentage of time a player gets on base (including walks) and
• Slugging percentage (SLG): How far a player gets around the bases on his turn (measures power).

Batting Average, by contrast, was overvalued. Let’s verify this:

The linear regression yields an R-squared value of 0.92, so the model fits the data well, and both variables are significant.
Runs Scored (RS) = -804.63 + 2737.77(OBP) + 1584.91(SLG) …(i)

We can use pitching statistics to predict runs allowed:

• Opponents On-Base percentage (OOBP)
• Opponents Slugging percentage (OSLG)

We get the linear regression model as:

Runs Allowed (RA) = -837.38 + 2913.60(OOBP) + 1514.29(OSLG) …(ii)

We can predict how many games the 2002 A’s will win using our models. Using 2001 regular season statistics,

• Team OBP is 0.339 and
• Team SLG is 0.430.

Substituting these values into equation (i) gives Runs Scored (RS) ≈ 805.
In the same way, equation (ii) gives Runs Allowed (RA) ≈ 622, since in 2001,

• Team OOBP was 0.307 and
• Team OSLG was 0.373.

Now, our regression equation to predict wins was: W = 80.8814 + 0.1058 × RD where RD = RS - RA.
Our prediction for wins in 2002 is: W = 80.8814 + 0.1058(805 – 622) ≈ 100
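The whole chain of predictions can be reproduced with a short R sketch. It hard-codes the coefficients of equations (i) and (ii) and of the wins equation as quoted in the text; the function names are hypothetical and chosen only for illustration.

```r
# Equation (i): runs scored from on-base and slugging percentage.
predict_runs_scored <- function(obp, slg) {
  -804.63 + 2737.77 * obp + 1584.91 * slg
}

# Equation (ii): runs allowed from the opponents' OBP and SLG.
predict_runs_allowed <- function(oobp, oslg) {
  -837.38 + 2913.60 * oobp + 1514.29 * oslg
}

# Wins from run difference (RD = RS - RA).
predict_wins <- function(rs, ra) {
  80.8814 + 0.1058 * (rs - ra)
}

# 2001 regular season statistics quoted in the text.
rs <- round(predict_runs_scored(0.339, 0.430))
ra <- round(predict_runs_allowed(0.307, 0.373))
wins <- predict_wins(rs, ra)

print(c(RS = rs, RA = ra, W = round(wins)))
```

Rounding the run totals before computing wins mirrors the arithmetic in the text, which plugs 805 and 622 into the wins equation.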

Paul DePodesta used a similar approach to make predictions.

| | Our prediction | Paul’s prediction | Actual |
|---|---|---|---|
| Runs Scored | 805 | 800-820 | 800 |
| Runs Allowed | 622 | 650-670 | 653 |
| Wins | 100 | 93-97 | 103 |

Our predictions closely match the actual performance. The A’s set an American League record by winning 20 games in a row and made it to the playoffs. Their 2002 record of 103-59 was the joint best in Major League Baseball.

Although they didn’t win the World Series, Billy and Paul revolutionised the game through their data-driven approach, and Moneyball changed the way many major league front offices do business.

Footnotes:
1: Moneyball
2: Baseball-Reference
3: R (programming language)

Resources:
This case study is drawn from Chapter 2, Linear Regression, of the course The Analytics Edge.