A sabermetric database for the GUMBO era
In my last post, I started working on a scraper for armchair sabermatricians in the post-gameday data era. I'll continue that work now, by designing a set of database tables to leverage things that are available in the database.
We can start, as always by looking at the tables in the old pitchRx
databases, which we want to recreate at a minimum. A pitchRx
database would hold a series of tables: game
, coach
, umpire
, atbat
, player
, hip
, action
, runner
, po
, and pitch
. We are most interested in the game
, atbat
, pitch
, and runner
tables.
We can start accordingly with a game
table of our own, which will hold:
name | description | type |
---|---|---|
gamePk |
game primary key | char(6) |
awayId |
away team code | char(3) |
homeId |
home team code | char(3) |
awayTeamName |
away team name | char(22) |
homeTeamName |
home team name | char(22) |
awayGamesPlayed |
away team games played | int |
homeGamesPlayed |
home team games played | int |
awayWins |
away team wins to date | int |
homeWins |
home team wins to date | int |
awayLosses |
away team losses to date | int |
homeLosses |
home team losses to date | int |
awayGB |
away team games back | float |
homeGB |
home team games back | float |
awayWCGB |
away team games behind wildcard | float |
homeWCGB |
home team games behind wildcard | float |
awayRuns |
away runs | int |
homeRuns |
home runs | int |
awayHits |
away hits | int |
homeHits |
home hits | int |
awayErrors |
away errors | int |
homeErrors |
home errors | int |
doubleHeader |
flag for whether this is a doubleheader | boolean |
tiebreaker |
flag for whether this is a tiebreaker | boolean |
dateTime |
time of first pitch | datetime |
gameNumber |
game number | int |
gameType |
game type | char(1) |
sportId |
baseball organization ID | char(3) |
venueId |
game venue ID | char(3) |
originalDate |
originally scheduled game date (if makeup) | datetime |
statusCode |
game status (final, etc) | char(2) |
season |
season | int |
Note that the italicized key(s) in any given table identify the unique primary key of any given observation. With that, the game table should be functionally equivalent to the tables in pitchRx
. pitchRx
doesn't have innings in its database, but I want it so up next is the innings
table:
name | description | type |
---|---|---|
gamePk |
game primary key | char(6) |
inningNum |
inning number (1-index) | int |
awayRuns |
away team runs scored during inning | int |
homeRuns |
home team runs scored during inning | int |
awayHits |
away team hits during inning | int |
homeHits |
home team hits during inning | int |
awayErrors |
away team errors during inning | int |
homeErrors |
home team errors during inning | int |
awayLOB |
away team men left-on-base at inning end | int |
homeLOB |
home team men left-on-base at inning end | int |
awayTotalRuns |
away team total runs scored through inning | int |
homeTotalRuns |
home team total runs scored through inning | int |
awayTotalHits |
away team total hits through inning | int |
homeTotalHits |
home team total hits through inning | int |
awayTotalErrors |
away team total errors through inning | int |
homeTotalErrors |
home team total errors through inning | int |
awayTotalLOB |
away team total men left-on-base through inning end | int |
homeTotalLOB |
home team total men left-on-base through inning end | int |
Up next, we'll work on the atbat
table. I use the convention of at-bats (ABs) from pitchRx
but really, this will include all plate appearances (PAs) and, in fact, every time someone distinctly steps up to bat, even if it doesn't result in anything (due to outs on the basepaths, substitutions, ejections, etc.); in the table below (and nowhere else unless specified), PA will be understood to represent a player stepping into the box, not statistical PAs.
name | description | type |
---|---|---|
gamePk |
game primary key | char(6) |
playNum |
play index (straight out of the GUMBO list) | int |
inningNum |
inning number | int |
inningHalf |
inning half | 'T'/'B' |
batterId |
batter's ID | char(6) |
batterSide |
batter's batting stance | 'L'/'R' |
pitcherId |
pitcher's ID | char(6) |
pitcherHand |
pitcher's throwing hand | 'L'/'R' |
awayScore |
away team score at end of PA | int |
homeScore |
home team score at end of PA | int |
awayScorePre |
away team score just before batter event of PA | int |
homeScorePre |
home team score just before batter event of PA | int |
b |
balls in count at end of PA | int |
s |
strikes in count at end of PA | int |
o |
outs at end of PA | int |
oPre |
outs just before batter event of PA | int |
event |
the outcome of the PA | char(??) |
eventType |
the outcome of the PA, encoded | char(??) |
startTime |
the time the PA starts | datetime |
endTime |
the time the PA ends | datetime |
description |
description of the play | text |
runsScored |
runs scored by the batter event of this play | int |
rbi |
runs batted in by the batter in this PA | int |
earnedRuns |
runs earned against the pitcher in this PA | int |
on1Bpre |
is there a runner on 1B just before batter event of PA | boolean |
on2Bpre |
is there a runner on 2B just before batter event of PA | boolean |
on3Bpre |
is there a runner on 3B just before batter event of PA | boolean |
on1Bpost |
is there a runner on 1B just after batter event of PA | boolean |
on2Bpost |
is there a runner on 2B just after batter event of PA | boolean |
on3Bpost |
is there a runner on 3B just after batter event of PA | boolean |
The terminology "batter event of the PA", distinguishes the causality of steals, pickoffs, errors, etc., that occur independently of the batter's attempts to put the ball in play. This way, we can make sure that the contributions of the batter's PA in terms of transitional base-out states, etc., versus the ones that should be allocated to the steal, defensive error, etc. will be allocated correctly.
Next up, the pitch
and hit
tables.