Daddy U: Sabermetrics

Okay, let's see if we can use Julia for some sabermetrics.

First, head on over to juliabox.org and register. I have a secret code.

Load some packages

We need to add on some external packages. Sadly these can be a bit slow to load, so let's do it know:

using Gadfly, DataFrames

(These are pre-installed on juliabox.)

Download some data

f = "http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip"
d = "/tmp/lahman"
if !isdir(d)
  mkdir(d)
  fout = "$d/lahman-csv.zip"
  download(f, fout)
  run(`unzip -n -d $d $fout`)
end

Read in some data

Julia had a DataFrame type to hold tabular data. (Akin to Python's Pandas or R's data.frame types.) We can use readtable to read a csv file into a data frame

d = readtable("$d/Batting.csv")
playerIDyearIDstintteamIDlgIDGABRHx2Bx3BHRRBISBCSBBSOIBBHBPSHSFGIDP
1abercda0118711TRONA140000000000NANANANANA
2addybo0118711RC1NA251183032600138140NANANANANA
3allisar0118711CL1NA291372840450193125NANANANANA
4allisdo0118711WS3NA2713328441022271102NANANANANA
5ansonca0118711RC1NA2512029391130166221NANANANANA
6armstbo0118711FW1NA124991121050101NANANANANA
7barkeal0118711RC1NA140100020010NANANANANA
8barnero0118711BS1NA311576663109034116131NANANANANA
9barrebi0118711FW1NA151110010000NANANANANA
10barrofr0118711BS1NA18861313210111000NANANANANA
11bassjo0118711CL1NA228918271103180134NANANANANA
12battijo0118711CL1NA130000000010NANANANANA
13bealsto0118711WS3NA10366700012020NANANANANA
14beaveed0118711TRONA3157600052000NANANANANA
15bechtge0118711PH1NA20942433911214022NANANANANA
16bellast0118711TRONA291282632330234492NANANANANA
17berkena0118711PH1NA140000000003NANANANANA
18berryto0118711PH1NA140100000000NANANANANA
19berthha0118711WS3NA1773171711083142NANANANANA
20biermch0118711FW1NA120000000010NANANANANA
21birdge0118711RC1NA251061928250131032NANANANANA
22birdsda0118711BS1NA291525146330246044NANANANANA
23brainas0118711WS3NA301342430400214072NANANANANA
24brannmi0118711CH1NA3142100000000NANANANANA
25burrohe0118711WS3NA12631115231140011NANANANANA
26careyto0118711FW1NA19871620200105021NANANANANA
27carleji0118711CL1NA291273132810182183NANANANANA
28conefr0118711BS1NA197717203101612182NANANANANA
29connone0118711TRONA7336700020000NANANANANA
30cravebi0118711TRONA271182638810266330NANANANANA

Data frames store variables in columns. To access a single column we use the notation d[:name]. The : makes the name a symbol. This works fine unless there are spaces in the names, in which case use :(name with spaces) as a pattern.

d[:RBI]
99846-element DataArray{Int64,1}:
  0
 13
 19
 27
 16
  5
  2
 34
  1
 11
 18
  0
  1
  ⋮
 13
  0
 28
 10
 30
 17
  0
  0
  1
 38
 52
 60

Subsetting

There are values with no RBI info. These are stored as NA (not 0). To see them we need to

Indexing of d can be done by column (as above) or by row and column. The notation is d[rowindex,columnindex].

The row index can be specified by number (1 or by range 1:3 or by vector [1,3,5], use : to refer to all of them). Similarly the column index.

More commonly we can index by a logical vector with a length that matches the number of rows. These are generated by logical operators or by special "is" functions, as here with isna:

d[isna(d[:RBI]), :yearID]
5573-element DataArray{Int64,1}:
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
 1882
    ⋮
 1999
 1999
 1999
 1999
 1999
 1999
 1999
 1999
 1999
 1999
 1999
 1999

We see the early years, but also some in 1999. What happened there?

d[(isna(d[:RBI])) & (d[:yearID] .== 1999), :]
playerIDyearIDstintteamIDlgIDGABRHx2Bx3BHRRBISBCSBBSOIBBHBPSHSFGIDP
1aguilri0119991MINAL17NANANANANANANANANANANANANANANANA
2alvarju0119991ANAAL8NANANANANANANANANANANANANANANANA
3anderma0119991DETAL37NANANANANANANANANANANANANANANANA
4appieke0119992OAKAL12NANANANANANANANANANANANANANANANA
5arrojro0119991TBAAL24NANANANANANANANANANANANANANANANA
6balejo0119991TORAL1NANANANANANANANANANANANANANANANA
7bradfch0119991CHAAL3NANANANANANANANANANANANANANANANA
8broweji0119991CLEAL9NANANANANANANANANANANANANANANANA
9brunswi0119991DETAL17NANANANANANANANANANANANANANANANA
10buddimi0119991NYAAL2NANANANANANANANANANANANANANANANA
11bunchme0119991SEAAL5NANANANANANANANANANANANANANANANA
12candito0119991OAKAL11NANANANANANANANANANANANANANANANA
13cartela0219991KCAAL6NANANANANANANANANANANANANANANANA
14castica0219991CHAAL18NANANANANANANANANANANANANANANANA
15coopebr0119991ANAAL5NANANANANANANANANANANANANANANANA
16cordefr0119991DETAL20NANANANANANANANANANANANANANANANA
17corsiji0119992BALAL13NANANANANANANANANANANANANANANANA
18daveyto0119992SEAAL16NANANANANANANANANANANANANANANANA
19davisdo0219991TEXAL2NANANANANANANANANANANANANANANANA
20delucri0119991CLEAL6NANANANANANANANANANANANANANANANA
21depause0119991CLEAL11NANANANANANANANANANANANANANANANA
22durbich0119991KCAAL1NANANANANANANANANANANANANANANANA
23falkebr0119991BALAL2NANANANANANANANANANANANANANANANA
24fasseje0119992TEXAL7NANANANANANANANANANANANANANANANA
25fettemi0119991BALAL27NANANANANANANANANANANANANANANANA
26floribr0119992BOSAL14NANANANANANANANANANANANANANANANA
27fossato0119991NYAAL5NANANANANANANANANANANANANANANANA
28frankry0119991SEAAL6NANANANANANANANANANANANANANANANA
29fussech0119991KCAAL17NANANANANANANANANANANANANANANANA
30gailled0119991TBAAL8NANANANANANANANANANANANANANANANA

I'm guessing American League pitchers with no atbats.

That syntax is pretty clunky. You need to put the expressions in (), and use .== instead of just == to get comparison as we are comparing each element of d[:yearID] to 1999. The DataFramesMeta package can make it easier. We have to install this one thoughL

Pkg.add("DataFramesMeta")
using DataFramesMeta

Then the above becomes something like:

@linq d |>
where(isna(:RBI)) |>
where(:yearID .== 1999)
playerIDyearIDstintteamIDlgIDGABRHx2Bx3BHRRBISBCSBBSOIBBHBPSHSFGIDP
1aguilri0119991MINAL17NANANANANANANANANANANANANANANANA
2alvarju0119991ANAAL8NANANANANANANANANANANANANANANANA
3anderma0119991DETAL37NANANANANANANANANANANANANANANANA
4appieke0119992OAKAL12NANANANANANANANANANANANANANANANA
5arrojro0119991TBAAL24NANANANANANANANANANANANANANANANA
6balejo0119991TORAL1NANANANANANANANANANANANANANANANA
7bradfch0119991CHAAL3NANANANANANANANANANANANANANANANA
8broweji0119991CLEAL9NANANANANANANANANANANANANANANANA
9brunswi0119991DETAL17NANANANANANANANANANANANANANANANA
10buddimi0119991NYAAL2NANANANANANANANANANANANANANANANA
11bunchme0119991SEAAL5NANANANANANANANANANANANANANANANA
12candito0119991OAKAL11NANANANANANANANANANANANANANANANA
13cartela0219991KCAAL6NANANANANANANANANANANANANANANANA
14castica0219991CHAAL18NANANANANANANANANANANANANANANANA
15coopebr0119991ANAAL5NANANANANANANANANANANANANANANANA
16cordefr0119991DETAL20NANANANANANANANANANANANANANANANA
17corsiji0119992BALAL13NANANANANANANANANANANANANANANANA
18daveyto0119992SEAAL16NANANANANANANANANANANANANANANANA
19davisdo0219991TEXAL2NANANANANANANANANANANANANANANANA
20delucri0119991CLEAL6NANANANANANANANANANANANANANANANA
21depause0119991CLEAL11NANANANANANANANANANANANANANANANA
22durbich0119991KCAAL1NANANANANANANANANANANANANANANANA
23falkebr0119991BALAL2NANANANANANANANANANANANANANANANA
24fasseje0119992TEXAL7NANANANANANANANANANANANANANANANA
25fettemi0119991BALAL27NANANANANANANANANANANANANANANANA
26floribr0119992BOSAL14NANANANANANANANANANANANANANANANA
27fossato0119991NYAAL5NANANANANANANANANANANANANANANANA
28frankry0119991SEAAL6NANANANANANANANANANANANANANANANA
29fussech0119991KCAAL17NANANANANANANANANANANANANANANANA
30gailled0119991TBAAL8NANANANANANANANANANANANANANANANA

Splitting data

Okay, let's get at some anomalies. Which players have the most home runs?

The data is stored by player so we need to split up the data by each player and add up the :HR values. When we split the data, we will get a smaller data frame. This function will add up the HRs ( paying attention to NA values):

function add_hrs(d)
  hrs = d[:HR]
  hrs = hrs[!isna(hrs)]
  sum(hrs)
end
add_hrs (generic function with 1 method)

Here we split the data and apply the function

a = by(d, :playerID, add_hrs)
playerIDx1
1aardsda010
2aaronha01755
3aaronto0113
4aasedo010
5abadan010
6abadfe010
7abadijo010
8abbated0111
9abbeybe010
10abbeych0119
11abbotda010
12abbotfr011
13abbotgl010
14abbotje0118
15abbotji010
16abbotku0162
17abbotky010
18abbotod010
19abbotpa010
20aberal010
21abercda010
22abercre019
23abernbi010
24abernbr018
25abernte010
26abernte020
27abernwo010
28aberscl015
29ablesha010
30abnersh0111

We want to sort by the numbers. We can sort as follows:

sort(a[:x1])
18405-element DataArray{Int64,1}:
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   0
   ⋮
 569
 573
 583
 586
 609
 612
 630
 654
 660
 714
 755
 762

But to sort the data frame we want to shuffle the items around. The sortperm function helps here. It returns the indices of the permutation that will sort the data. Here we pass in rev=true to get reverse order:

a[sortperm(a[:x1], rev=true),:]
playerIDx1
1bondsba01762
2aaronha01755
3ruthba01714
4mayswi01660
5rodrial01654
6griffke02630
7thomeji01612
8sosasa01609
9robinfr02586
10mcgwima01583
11killeha01573
12palmera01569
13jacksre01563
14ramirma02555
15schmimi01548
16mantlmi01536
17foxxji01534
18mccovwi01521
19thomafr04521
20willite01521
21pujolal01520
22bankser01512
23matheed01512
24ottme01511
25sheffga01509
26murraed02504
27gehrilo01493
28mcgrifr01493
29musiast01475
30stargwi01475

Go Bonds!

On base percentage

What function will compute the on base percentage?

$$ OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF} $$

Do we have these in the data?

names(d)
22-element Array{Symbol,1}:
 :playerID
 :yearID  
 :stint   
 :teamID  
 :lgID    
 :G       
 :AB      
 :R       
 :H       
 :x2B     
 :x3B     
 :HR      
 :RBI     
 :SB      
 :CS      
 :BB      
 :SO      
 :IBB     
 :HBP     
 :SH      
 :SF      
 :GIDP    

We are good to go. Here we define a function to compute the OBP. The only issue is the NA values must be addressed and those who barely played are weeded out (and given a 0.0 value)

function compute_obp(d)
u = d[:, [:H, :BB, :HBP, :AB, :SF]]
## make NA -> 0
for nm in names(u)
    u[isna(u[nm]), nm] = 0
end
sum(u[:AB]) > 100 || return(0.0)	
top =  @with(u, :H + :BB + :HBP)
bottom = @with(u, :AB + :BB + :HBP + :SF)
sum(top) / sum(bottom)
end
compute_obp (generic function with 1 method)

The pattern is as before:

OBP = by(d, :playerID, compute_obp)
OBP = OBP[sortperm(OBP[:x1], rev=true), :]
playerIDx1
1willite010.48170856325362765
2ruthba010.4739598210035228
3mcgrajo010.4656722517368206
4hamilbi010.4551961823966066
5gehrilo010.44735189449445256
6bondsba010.44429455641961596
7joycebi010.4348668280871671
8hornsro010.43374014472405226
9cobbty010.4329654848555999
10foxxji010.42827377851859566
11speaktr010.4279475982532751
12smithol010.4268292682926829
13collied010.4243817787418655
14fainfe010.42407407407407405
15broutda010.42332810867293624
16jacksjo010.42273790250044974
17bishoma010.4226840436773512
18mantlmi010.42051541182415364
19wardpi010.4191866527632951
20cochrmi010.41915772089182496
21thomafr040.41909866984316063
22martied010.41780189332717615
23vottojo010.4172821270310192
24musiast010.41666009308195945
25bassljo010.41612436731742586
26childcu010.4157673860911271
27burkeje010.41511811023622047
28boggswa010.41499393147231817
29pottsjo010.4140625
30heltoto010.41399386048481