A sabermetric database for the GUMBO era

A sabermetric database for the GUMBO era

In my last post, I started working on a scraper for armchair sabermatricians in the post-gameday data era. I'll continue that work now, by designing a set of database tables to leverage things that are available in the database.

We can start, as always by looking at the tables in the old pitchRx databases, which we want to recreate at a minimum. A pitchRx database would hold a series of tables: game, coach, umpire, atbat, player, hip, action, runner, po, and pitch. We are most interested in the game, atbat, pitch, and runner tables.

We can start accordingly with a game table of our own, which will hold:

name description type
gamePk game primary key char(6)
awayId away team code char(3)
homeId home team code char(3)
awayTeamName away team name char(22)
homeTeamName home team name char(22)
awayGamesPlayed away team games played int
homeGamesPlayed home team games played int
awayWins away team wins to date int
homeWins home team wins to date int
awayLosses away team losses to date int
homeLosses home team losses to date int
awayGB away team games back float
homeGB home team games back float
awayWCGB away team games behind wildcard float
homeWCGB home team games behind wildcard float
awayRuns away runs int
homeRuns home runs int
awayHits away hits int
homeHits home hits int
awayErrors away errors int
homeErrors home errors int
doubleHeader flag for whether this is a doubleheader boolean
tiebreaker flag for whether this is a tiebreaker boolean
dateTime time of first pitch datetime
gameNumber game number int
gameType game type char(1)
sportId baseball organization ID char(3)
venueId game venue ID char(3)
originalDate originally scheduled game date (if makeup) datetime
statusCode game status (final, etc) char(2)
season season int

Note that the italicized key(s) in any given table identify the unique primary key of any given observation. With that, the game table should be functionally equivalent to the tables in pitchRx. pitchRx doesn't have innings in its database, but I want it so up next is the innings table:

name description type
gamePk game primary key char(6)
inningNum inning number (1-index) int
awayRuns away team runs scored during inning int
homeRuns home team runs scored during inning int
awayHits away team hits during inning int
homeHits home team hits during inning int
awayErrors away team errors during inning int
homeErrors home team errors during inning int
awayLOB away team men left-on-base at inning end int
homeLOB home team men left-on-base at inning end int
awayTotalRuns away team total runs scored through inning int
homeTotalRuns home team total runs scored through inning int
awayTotalHits away team total hits through inning int
homeTotalHits home team total hits through inning int
awayTotalErrors away team total errors through inning int
homeTotalErrors home team total errors through inning int
awayTotalLOB away team total men left-on-base through inning end int
homeTotalLOB home team total men left-on-base through inning end int

Up next, we'll work on the atbat table. I use the convention of at-bats (ABs) from pitchRx but really, this will include all plate appearances (PAs) and, in fact, every time someone distinctly steps up to bat, even if it doesn't result in anything (due to outs on the basepaths, substitutions, ejections, etc.); in the table below (and nowhere else unless specified), PA will be understood to represent a player stepping into the box, not statistical PAs.

name description type
gamePk game primary key char(6)
playNum play index (straight out of the GUMBO list) int
inningNum inning number int
inningHalf inning half 'T'/'B'
batterId batter's ID char(6)
batterSide batter's batting stance 'L'/'R'
pitcherId pitcher's ID char(6)
pitcherHand pitcher's throwing hand 'L'/'R'
awayScore away team score at end of PA int
homeScore home team score at end of PA int
awayScorePre away team score just before batter event of PA int
homeScorePre home team score just before batter event of PA int
b balls in count at end of PA int
s strikes in count at end of PA int
o outs at end of PA int
oPre outs just before batter event of PA int
event the outcome of the PA char(??)
eventType the outcome of the PA, encoded char(??)
startTime the time the PA starts datetime
endTime the time the PA ends datetime
description description of the play text
runsScored runs scored by the batter event of this play int
rbi runs batted in by the batter in this PA int
earnedRuns runs earned against the pitcher in this PA int
on1Bpre is there a runner on 1B just before batter event of PA boolean
on2Bpre is there a runner on 2B just before batter event of PA boolean
on3Bpre is there a runner on 3B just before batter event of PA boolean
on1Bpost is there a runner on 1B just after batter event of PA boolean
on2Bpost is there a runner on 2B just after batter event of PA boolean
on3Bpost is there a runner on 3B just after batter event of PA boolean

The terminology "batter event of the PA", distinguishes the causality of steals, pickoffs, errors, etc., that occur independently of the batter's attempts to put the ball in play. This way, we can make sure that the contributions of the batter's PA in terms of transitional base-out states, etc., versus the ones that should be allocated to the steal, defensive error, etc. will be allocated correctly.

Next up, the pitch and hit tables.