I spent a few months last year putting together a scraper in Python to build a PostgreSQL database of MLB data. I decided to do so for a few reasons, namely:
- I wanted some practice to get better at carrying a software project through from end to end
- For all the success of pitchRx, and I really can't say enough about how great Carson Sievert's work there is, I had been frustrated that there wasn't a great mechanism within it to update the database if/when MLB went back and improved its own data for old games
- I wanted a native Pythonic implementation of everything pitchRx could do, with some particular additional capabilities that I was interested in
- In the words of Eleanor Roosevelt, "hot nasty bad-ass speed". I wanted to see how fast I could do the equivalent of
I tried to do this and thought that, for the most part, I had succeeded, only for MLB to tear down the old-school Gameday data back-end within a few weeks.
I'd be mad, except that the new backend is really cool. I'm working on a new scraper now, and, in the tradition of the blogs and reddit posts that I used to get myself to the point that I could scrape data, I'm gonna post about it here.
The new MLB data paradigm
In order to start looking at the new MLBAM data, you'll need to be able to parse JSON files. They're basically just structured files that store data. Luckily, Firefox has a great parser built in for them.
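If you'd rather poke at JSON outside the browser, here's a minimal sketch using Python's standard json module. The payload below is a tiny made-up stand-in, not real MLB data, and the key names in it are purely illustrative.

```python
import json

# A tiny made-up JSON document (not real MLB data), just to show the shape.
raw_text = '{"league": "example", "games": [{"id": 1, "venue": "Somewhere Park"}]}'

# json.loads turns JSON text into plain Python dicts and lists.
data = json.loads(raw_text)

# Nested values are reached with ordinary dict keys and list indices.
first_game = data["games"][0]
print(first_game["venue"])
```

Once the text is parsed, everything is just dicts and lists, which is the same nesting Firefox shows you as a collapsible tree.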
Let's start with the aggregation of games. Here's a sample link to the games for 3 April 2020:
When scraping, we'll want to change the parameters on the right-hand side of the URL to get the data of interest. sportId=1 refers to the MLB data, for instance, and date=2020-04-03 is the date of interest. By switching these around, we can get to other leagues and other dates. I don't know how to do other leagues yet; that's for future experimentation.
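Swapping those parameters by hand gets old fast, so here's a rough sketch of building the URL in Python. The endpoint below is a hypothetical placeholder that you'd replace with the real schedule link, and the function name is mine; only the sportId and date parameters come from the link itself.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real schedule endpoint here.
SCHEDULE_ENDPOINT = "https://example.invalid/schedule"


def schedule_url(date, sport_id=1, endpoint=SCHEDULE_ENDPOINT):
    """Build a schedule URL for one date (YYYY-MM-DD) and one sportId."""
    # urlencode handles the key=value&key=value formatting for us.
    query = urlencode({"sportId": sport_id, "date": date})
    return f"{endpoint}?{query}"


print(schedule_url("2020-04-03"))
```

Looping this over a range of dates would give the full list of schedule pages to fetch for a season.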
From this page, under dates/0/games/, we can collect all of the games that happened on a given day, as well as view their metadata. For the most part, this stuff is all human-readable, but I'll point out in particular the piece of data for any given game that is keyed by gamePk. This is the primary key of the game, the main index by which you'll want to match game data for a unique game. After that, just explore. You'll notice that these files exclusively include scheduling metadata, nothing about scores, but we can use them to generate a list of the games we're interested in, and to collect information about their location, time, and the teams going into them.
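Here's a minimal sketch of pulling the gamePk values out of a parsed schedule response. I'm assuming the top-level list of days is keyed "dates" and that each game carries a "gameDate" field; the sample payload is hand-made with invented values, so check the key names against the real page before leaning on this.

```python
def list_games(schedule):
    """Return one small dict per game found in a parsed schedule response."""
    games = []
    # Assumption: the top-level list is keyed "dates", one entry per day.
    for day in schedule.get("dates", []):
        for game in day.get("games", []):
            games.append({
                "gamePk": game["gamePk"],          # primary key for the game
                "gameDate": game.get("gameDate"),  # assumed field name
            })
    return games


# Hand-made sample in the assumed shape; the values are made up.
sample = {
    "dates": [
        {"games": [
            {"gamePk": 1001, "gameDate": "2020-04-03T17:05:00Z"},
            {"gamePk": 1002},
        ]}
    ]
}
print([g["gamePk"] for g in list_games(sample)])
```

Using .get for everything except gamePk means a game with missing metadata still makes it into the list, while a game without its primary key fails loudly.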
You probably noticed in the last section, for example, that there are a bunch of different ways of identifying the teams that are playing, notably team/id values, mostly three-digit numbers. We're gonna want to grab data about teams, and, of course, there's a page for that:
If you're looking for team data, this is the motherlode. Digging through, each team has a venue option, and for that:
is the place to go! Using these two pages, we can build a table of teams and a table of venues in our database.
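As a rough sketch of how those two tables might fall out of the team data, here's one way to split parsed team records into team rows and de-duplicated venue rows. The key names ("id", "name", and a nested "venue" with its own "id" and "name") are my assumptions about the shape, and the sample records are invented.

```python
def split_teams_and_venues(teams):
    """Turn nested team records into flat team rows and unique venue rows."""
    team_rows = []
    venues_by_id = {}
    for team in teams:
        venue = team.get("venue") or {}  # tolerate teams with no venue listed
        team_rows.append({
            "team_id": team["id"],
            "name": team.get("name"),
            "venue_id": venue.get("id"),  # foreign key into the venue table
        })
        if "id" in venue:
            # Keyed by id so a venue shared by several teams appears once.
            venues_by_id[venue["id"]] = {
                "venue_id": venue["id"],
                "name": venue.get("name"),
            }
    return team_rows, list(venues_by_id.values())


# Invented sample records in the assumed shape.
sample_teams = [
    {"id": 901, "name": "Example Team A", "venue": {"id": 11, "name": "Example Park"}},
    {"id": 902, "name": "Example Team B", "venue": {"id": 11, "name": "Example Park"}},
    {"id": 903, "name": "Example Team C"},
]
team_rows, venue_rows = split_teams_and_venues(sample_teams)
print(len(team_rows), len(venue_rows))
```

Each row is a plain dict, so it can be handed to whatever database layer you like for the actual inserts.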
From there: game data
With this stuff in play, we can aggregate games, teams, and locations. The next step: the game data itself. I'll leave it at that for now, and hint at the Grand Unified Master Baseball Object (GUMBO) as the next target for a post.