Probabilistic Modeling of Flight Model: Inferring Parameters

In another post, I developed a model for baseball flight:
$\def\y{\mathbf{y}}
\def\x{\mathbf{x}}
\def\xdot{\mathbf{\dot{x}}}
\def\ydot{\mathbf{\dot{y}}}
\def\bomega{\boldsymbol\omega}
\def\btheta{\boldsymbol\theta}
\def\trans{^\mathrm{T}}
\def\bigY{\mathbf{Y}}
\def\t{\mathbf{t}}
\def\tdata{\t_\mathrm{data}}
\def\Ydata{\bigY_\mathrm{data}}
\def\given{~|~}
\def\normal{\mathcal{N}}
\def\lognormal{\log \mathcal{N}}$

$$
\ydot= f(\y, t, \btheta)
$$
where
$$
\y= \begin{pmatrix} \xdot \\ \x \\ \bomega \\ \end{pmatrix}
$$
which is parametrized by:
$$\boldsymbol\theta= \begin{pmatrix} \rho \\ A_\mathrm{BB} \\ C_{L_0} \\ \lambda_L \\ C_{D_0} \\ C_{D_\infty} \\ k_D \\ V_\mathrm{crit} \\ m_D \\ \lambda_t \end{pmatrix}
$$

Using the ODE solver, we can also write a forward function:
$$
\mathbf{Y}= g(\t | f, \y_0, \btheta)
$$
which simply maps a handful of discrete times
$$
\mathbf{t}= \begin{pmatrix} t_1 \\ \vdots \\ t_N \end{pmatrix}
$$
to a data matrix of states under the model:
$$
\mathbf{Y}= \begin{bmatrix} \y_1 \trans \\ \vdots \\ \y_N \end{bmatrix}
$$
conditioned on the initial condition $\y_0$, the physical model $f$, and the parametrization of the model $\btheta$.

In the modern baseball data landscape, we often have available a set of observed data $\Ydata$ sampled at a set of times $\tdata$. The credulity of this data, for the moment, we will assume. The question at hand is how we utilize such a dataset to parametrize the model $f(\cdot, \btheta)$.

Inferring parameters from a single trajectory

Consider Bayes' theorem, applied to a given ball trajectory, say a fly ball, with a known launch/initial trajectory $\y_0$:
$$
p(\btheta \given \Ydata)= \frac{p(\Ydata \given \btheta)~p(\btheta)}{p(\Ydata)}
$$
or, to a constant factor:
$$
p(\btheta \given \Ydata) \propto p(\Ydata \given \btheta) ~p(\btheta)
$$
We can describe $p(\btheta)$ using a set of hand-chosen prior distributions, which match observation data, theoretical knowledge and intuition, etc.

Given a $\btheta$ sampled from the distribution, we can write:
$$
\bigY= g(\tdata; f, \y_0, \btheta)
$$
and if we write a probabilistic observation model for $\Ydata$:
$$
p(\Ydata \given \btheta)= p(\Ydata \given g(\tdata; f, \y_0, \btheta))
$$
we can now write the probability:
$$
p(\Ydata \given \btheta) ~p(\btheta)
$$
or, equivalently, we can describe the distribution $p(\btheta \given \Ydata)$ up to a constant factor. That means that, using Markov-chain Monte Carlo (MCMC), we can find a posterior distribution on $p(\btheta \given \Ydata)$, and, therefore, parametrize the single-pitch model.

Inferring the parameters from a season's worth of trajectory data

Ultimately, gathering the data from a single trajectory in this way is very time consuming. Especially considering the "juiced ball" findings of Rob Arthur, which suggest that the properties of the baseball vary not only from pitch to pitch, but also season-to-season and even week-to-week.

His findings are based on:

a constant-acceleration physical model
assumption of standard spin rate off the bat

These lead me to believe, alongside, e.g. Meredith Wills' writings on the seams that it is rather likely that changes from spin-related effects, like hang, are being disguised as drag, and that the drag is not necessarily changing, at least not in the constant factor sense suggested by Rob's work.

Academic insano-mode hierarchical model

We could just as easily write a hierarchical prior that incorporates assumptions about the seasonal, weekly, and pitch to pitch changes. Consider $C_D$ as a simplified example. The drag coefficient for any given pitch on a given week (around 20,000 pitches), $C_D$ might have a Gaussian distribution:
$$
C_D \sim \normal(\mu_p, \sigma_p)
$$
where, for any given week of a season (17 in regular season) the weekly hyperparameters might have Gaussian and lognormal distributions:
$$
\begin{aligned}
\mu_p & \sim \normal(\mu_w, \sigma_w) \\
\sigma_p & \sim \lognormal(\mu_{\sigma_w}, \sigma_{\sigma_w})
\end{aligned}
$$
and finally, seasonal hyperparameters might also have normal and lognormal distributions:
$$
\begin{aligned}
\mu_w & \sim \normal(\mu_\alpha, \sigma_\beta) \\
\sigma_w & \sim \lognormal(\mu_\beta, \sigma_\beta) \\
\mu_{\sigma_w} & \sim \normal(\mu_\gamma, \sigma_\gamma) \\
\sigma_{\sigma_w} & \sim \lognormal(\mu_\delta, \sigma_\delta) \\
\end{aligned}
$$

Given infinite time, money, computer precision, and computer speed, this could be an MCMC problem. We could build up this hierarchical model as a posterior, then describe a massive vector of trajectories $\{(\tdata, \Ydata)_i\}_{\forall (i) | 0 \leq i \leq N_\mathrm{pitch}}$, which will be something on the order of 100 datapoints for each of the 22.5k pitches per week through the 17 weeks of a given season (if you're counting 34M datapoints). Then for each draw in the MCMC (about 10-100k, typically), we could describe the logprobability of that 34M data.

The result would be a mean $C_D$ for the season, for each week of the season, and for every pitch of the season, as well as a description of the posterior distributions of each. You can easily imagine extending this methodology to the full set of parameters, $\btheta$ (or at least the probabilistic subset of them), of course with a necessary increase in the total number of variables.

This is all fine and good. But the sheer amount of data makes one queasy. The questions that I now have are:

For reference: on my computer, I can run about 1000 forward evaluations in ~15 seconds; for every pitch in a year, dividing pitches in parallel across 16 cores, the time cost per MCMC evaluation would be 4 days- so IDing the gigangic hierarchical model would be stupid expensive.
Would it be reasonable to ID $\btheta$ pitch-wise for a random sampling of pitches, and develop low-order models and canonical distributions for the seasonal/weekly effect parts of the hierarchical model? Then these could be used as priors for a pitch on a given day.
Can we do some kind of gradient descent algorithm or use machine learning techniques to ID these parameters?

Practical ID for the model

Skip insano-mode for now. Let's get a workable system ID:

Assume that each parameter has a distribution over the season, and individual pitches are drawn from that sample, regardless of week, etc. Call the seasonal hyperparameter matrix $\boldsymbol\Theta$
Use gradient descent to find the hyperparameter matrix $\boldsymbol\Theta$ that minimizes the negative logprobability of a random sampling of data from each pitch, a random sampling of pitches from the season, $\mathbb{\tilde{Y}}_\mathrm{data}$:
$$
\DeclareMathOperator{\argmin}{argmin}
\boldsymbol\Theta \approx \argmin_{\boldsymbol\Theta^*} (-\log p(\mathbb{Y}_\mathrm{data} | \boldsymbol\Theta^*))
$$
This implies we would need smooth derivatives of $g$ with respect to $\btheta$?? Might be too hard already... Maybe there's a cheap adjoint to get sensitivities?