Towards a distributional approach for Run Expectancy

Runs are the currency of baseball, and for that reason they are the ground floor of how we estimate player contributions to team success.

Let's say we have a player standing in the batter's box. We can describe a generic possible outcome of the plate appearance (PA) with an event, which we denote $A$. $A$ could represent the event of getting a single, or reaching base, or even reaching on error. For the current plate appearance, we can describe the as-yet undetermined outcome as a Bernoulli random variable $X$, which you can think of as a hypothetical future flip of a weighted coin. $X$ will be one if $A$ occurs, and zero if it does not. We can describe the probability of $A$ using:

$$
\mathbb{P}(X = 1) = \mathbb{P}(A) = p_A
$$

In the long run, I want to think about how we assess marginal and situational player value, where you can compute, in some sense, more appropriate values of $p_A$ for predicting the outcome of a plate appearance given more knowledge than just "it's a PA": e.g. the ballpark, batter, pitcher, and league environment involved. Before I get on to that bit, I want to convince myself that doing so would actually give us a way to quantify the run production implied by such a prediction.

So let's say we're considering a plate appearance of interest. Any runs that the PA would contribute are scored between the start of the PA and the end of the inning. So, we could say that there is a random variable $Y$ that describes the number of runs scored from the start of the PA to the end of the inning. Theoretically, $Y$ takes values in $\mathbb{Z}_0^+$, the set of non-negative integers. Obviously, every one of those values is theoretically possible, but the large ones are practically unreachable. Now, we can take the expectation of $Y$, $\mathbb{E}(Y)$, which represents the "expected run production of any given plate appearance", and also the expectation of $Y | A$, $\mathbb{E}(Y | A)$. This second value represents the "expected run production of a plate appearance that results in $A$". The difference between the two, $\mathbb{E}(Y | A) - \mathbb{E}(Y) \in \mathbb{R}$, is thus the expected added value of getting outcome $A$ in a PA. Luckily, with the substantial amount of data available, $\mathbb{E}(Y|A)$ and $\mathbb{E}(Y)$ can be well approximated by sample averages, as can the distributions of $Y | A$ and $Y$ themselves.
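To make the difference $\mathbb{E}(Y | A) - \mathbb{E}(Y)$ concrete, here is a minimal Python sketch of how it could be approximated with sample averages. Everything in it is an assumption for illustration: the `plate_appearances` list is made-up toy data rather than real play-by-play, and the names `run_values` and `empirical_distribution` are hypothetical helpers, not part of any existing library.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (NOT real play-by-play): each record is
# (PA outcome, runs scored from the start of that PA to the end of the inning).
plate_appearances = [
    ("single", 1), ("out", 0), ("out", 0), ("home_run", 2),
    ("walk", 0), ("out", 1), ("single", 0), ("out", 0),
    ("home_run", 1), ("out", 0),
]


def mean(values):
    """Sample average, used as the estimate of an expectation."""
    return sum(values) / len(values)


def run_values(pas):
    """Estimate E(Y) and, for each event A, E(Y | A) - E(Y)."""
    baseline = mean([runs for _, runs in pas])  # estimate of E(Y)
    by_event = defaultdict(list)
    for event, runs in pas:
        by_event[event].append(runs)
    # Conditional sample mean minus the overall sample mean, per event.
    added = {event: mean(runs) - baseline for event, runs in by_event.items()}
    return baseline, added


def empirical_distribution(pas, event=None):
    """Empirical distribution of Y (event=None) or of Y | A (event given)."""
    runs = [r for e, r in pas if event is None or e == event]
    counts = Counter(runs)
    return {k: counts[k] / len(runs) for k in sorted(counts)}


baseline, added = run_values(plate_appearances)
print(f"estimated E(Y) = {baseline:.3f}")
for event, value in sorted(added.items()):
    print(f"{event:>9}: E(Y|A) - E(Y) = {value:+.3f}")
print("distribution of Y | out:", empirical_distribution(plate_appearances, "out"))
```

The second helper is the part that goes beyond a single number per event: it keeps the whole empirical distribution of $Y | A$ rather than just its mean.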

The result, ultimately, could be compared to the linear weights in, say, FanGraphs's wOBA calculation. But the sneaky bit about this method is that it could open the door to distributional descriptions of $Y | A$ and $Y$, and it could also open up our ability to describe a posterior estimate along the lines of:

$$
Y_\mathrm{team} | A, \mathrm{team}, Y_\mathrm{mlb}
$$

where we generate an estimate of the expected run contribution of event $A$ on a team-by-team basis, using the MLB data as a prior estimate. This would, theoretically, give an updated idea of how a player's production might contribute to a specific team as opposed to how it contributes in the league-average sense. At some point, I'll circle back and actually run the numbers for this, hopefully.
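I haven't settled on a model for that posterior, but one simple way it could look is a shrinkage estimate: pull the team's sample mean of $Y | A$ toward the MLB-wide mean, more strongly when the team sample is small. The sketch below assumes a normal prior centred on the MLB mean with a known variance ratio, which is my simplification and not a claim about the right model for run data. The function name `shrunk_team_mean` and the `prior_strength` parameter (the ratio of within-team variance to prior variance, acting like a pseudo-count of league-average PAs) are hypothetical.

```python
def shrunk_team_mean(team_runs, mlb_mean, prior_strength):
    """Posterior mean of a team's E(Y | A) under a normal-normal model.

    team_runs      -- runs to end of inning for this team's PAs that ended in A
    mlb_mean       -- league-wide estimate of E(Y | A), used as the prior mean
    prior_strength -- assumed variance ratio; behaves like a pseudo-count of
                      league-average observations added to the team's sample
    """
    n = len(team_runs)
    if n == 0:
        # No team data at all: the posterior mean is just the prior mean.
        return mlb_mean
    team_mean = sum(team_runs) / n
    weight = n / (n + prior_strength)  # weight on the team's own data
    return weight * team_mean + (1 - weight) * mlb_mean


# Made-up numbers, purely illustrative.
print(shrunk_team_mean([1, 1, 1, 1], mlb_mean=0.5, prior_strength=4))
```

With four team observations and a prior worth four pseudo-observations, the estimate lands halfway between the team mean and the MLB mean. As the team sample grows, the weight on the team's own data approaches one.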