The Great SCOTUS Data Wrangle, Part 1: A First Look at the Supreme Court Database
In an upcoming post, I'll be answering a question posed on the Opening Arguments podcast to their mathematician/data-oriented listeners about voting behavior on the U.S. Supreme Court. (If you're unfamiliar with OA, check it out! It's a phenomenal resource for laypeople interested in the law.) I'll save the details for the follow-up post, but the question requires knowing the voting record of SCOTUS justices or, at the very least, the distribution of majority and minority votes in all SCOTUS rulings from the last fifty years.
Being new to data science in the legal domain, I went on the hunt for a suitable dataset with tempered expectations, preparing myself to do some web scraping from sources like SCOTUSblog. Much to my surprise, I quickly discovered the Supreme Court Database, a resource that's purpose-built for these sorts of analyses and goes far beyond anything I would normally expect to discover.
Several people have expressed interest in this data source to me, and I hope this post provides a reasonable overview of how one can get started accessing the data in Python. I also started an initiative at work the other year where I show coworkers (usually software engineers) the ways of ~~the force~~ data science from the ground up, and I'm hoping this will be informative for some of its members.
It also serves as a great excuse to start adding some blog posts to this website, which is something I’ve had on the radar for far too long.
import datetime as dt
import functools
import io
import zipfile
from pathlib import Path
import git
import numpy as np
import pandas as pd
import pyreadstat
import requests
from tqdm.notebook import tqdm
REPO_ROOT = Path(git.Repo('.', search_parent_directories=True).working_tree_dir).absolute()
DATA_PATH = REPO_ROOT / 'data'
Data Sources
I'll be using the "case-centered" forms of the Supreme Court Database (SCDB)1 for my analysis, specifically the Legacy 06 and Version 2020 Release 01 revisions. If you're as new to the SCDB as I was before this project, it is a collection of meticulously curated datasets comprising every Supreme Court decision in US history. Each decision is enriched with a large amount of metadata ranging from simple case identifiers to the topics and legal provisions at issue. While the case-centered form of the database suits our immediate needs, it's worth noting that there are also "justice-centered" forms of the databases that break down the voting records of every justice on the court across history. I've kept myself focused on the task at hand for this post, but I'm looking forward to digging deeper into the other forms of the database at a later date.
Some Remarks on Available Data Formats
SCDB files can be downloaded in a wide variety of formats. Unfortunately, these formats cater to Excel, R, SAS, SPSS, and Stata users but not to us Pythonistas. Of the formats with out-of-the-box support in Pandas (without installing extras), only CSV is an option provided by the SCDB team. Using CSVs wouldn't be a major issue all other things being equal, but the available CSV files are mostly made up of categorical features that have been encoded as integers. The user is expected to interpret these integers by referencing a codebook the SCDB team maintains as a PDF and a web app. This is an artifact of the age and funding of the SCDB, which has been maintained and updated for nearly forty years at this point.
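To make that pain concrete, working from the CSVs would mean hand-transcribing mappings like the following out of the codebook for every coded column. This is only a sketch: the file name and the integer codes shown here are placeholder assumptions, not the SCDB's actual encodings.

# Hypothetical excerpt of a codebook mapping; the real codes live in the
# SCDB's PDF/web codebook and would need to be transcribed by hand.
DECISION_TYPE_CODES = {
    1: 'opinion of the court',
    2: 'per curiam (no oral argument)',
    # … and so on, repeated for dozens of other integer-coded columns
}

csv_decisions = pd.read_csv('SCDB_2020_01_caseCentered_Citation.csv')
csv_decisions['decisionType'] = csv_decisions['decisionType'].map(DECISION_TYPE_CODES)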
Scraping metadata from a PDF or the SCDB website seemed a bridge too far for this post. Before going down that path, however, I noticed that the SCDB documentation states the database is exported with "plain English" fields to data formats that support them, which appears to mean the formats for SPSS and the other statistical software packages. This is great news, since it means we can use the `read_spss` method in Pandas (or the `read_sav` method in `pyreadstat`) to access the data in its full glory.2
[1]: Harold J. Spaeth, Lee Epstein, Andrew D. Martin, Jeffrey A. Segal, Theodore J. Ruger, and Sara C. Benesh. 2020 Supreme Court Database, Version 2020 Release 01. URL: http://supremecourtdatabase.org
[2]: The `.sav` files are ISO-8859-1-encoded, but Pandas does not provide a way to specify which encoding is at play when invoking `read_spss`. Instead, `read_spss` relies on `pyreadstat`'s automatic encoding detection, which appears to fail on the Legacy dataset, resulting in `pyreadstat` choking. Assuming I, a non-SPSS user, can generate some small `.sav` test files for this, with any luck I'll be submitting a PR for Pandas that will allow passing an encoding to `pyreadstat` for SPSS datasets Soon™. Who knew that I'd ever be a repeat contributor to the SPSS side of Pandas?
SCDB Dataset Retrieval
There are eight different versions of the modern SCDB, corresponding to a choice between two units of analysis and four different interpretations of what should constitute a single case/vote. Specifically, we can choose versions of the SCDB with
- one record per case or one record per vote of an individual justice
- cases broken down by citation (one record per consolidated SCOTUS case), docket (one record per case without consolidation), individual issue facing the court (one record for every major issue/legal provision), or distinct voting blocs on individual issues.
The most relevant version of the SCDB for a first pass at addressing the question from OA will be the case-centered dataset broken down by citation, although the “justice-centered” and per-issue-voting-bloc variants of the SCDB could definitely play a role in some of the visualization work I’m interested in performing in the future. For now we’ll write a simple method that can retrieve all of these datasets from the SCDB along with the “legacy” versions of the per-citation datasets containing a complete record back to the first case in 1791. (Legacy versions of the other dataset variants are not currently available.)
def download_and_extract_zip(url, destination_dir):
zip_file_stream = io.BytesIO()
zip_file_stream.write(requests.get(url).content)
zipfile.ZipFile(zip_file_stream).extractall(destination_dir)
def retrieve_scdb_dataset(scdb_version, dataset_version, usecols=None, data_dir=DATA_PATH / 'raw', **kwargs):
file_name = f'SCDB_{scdb_version}_{dataset_version}.sav'
url = f'http://supremecourtdatabase.org/_brickFiles/{scdb_version}/{file_name}.zip'
local_file = Path(data_dir) / file_name
if not local_file.exists():
download_and_extract_zip(url, data_dir)
try:
dataset = pd.read_spss(local_file)
except pyreadstat._readstat_parser.ReadstatError:
dataset, _ = pyreadstat.read_sav(
str(local_file),
apply_value_formats=True,
encoding='iso-8859-1'
)
return dataset
case_decisions = pd.concat(
[retrieve_scdb_dataset(scdb_version, 'caseCentered_Citation')
for scdb_version in ['Legacy_06', '2020_01']],
ignore_index=True
)
justice_decisions = pd.concat(
[retrieve_scdb_dataset(scdb_version, 'justiceCentered_Citation')
for scdb_version in ['Legacy_06', '2020_01']],
ignore_index=True
)
other_modern_decisions = {
(unit_of_analysis, granularity): (
retrieve_scdb_dataset('2020_01', f'{unit_of_analysis}Centered_{granularity}')
)
for unit_of_analysis in ['case', 'justice']
for granularity in ['Docket', 'LegalProvision', 'Vote']
}
Inspecting & Cleaning the Datasets
These datasets are incredibly rich with case metadata. As an example, let's take a look at Marbury v. Madison in `case_decisions`.
with pd.option_context('display.max_colwidth', 200):
display(
case_decisions.loc[
case_decisions.caseName.str.contains(r'Marbury.*v\..*Madison.*', case=False)
].T.iloc[:, 0]
)
caseId 1803-005
docketId 1803-005-01
caseIssuesId 1803-005-01-01
voteId 1803-005-01-01-01
dateDecision 6.95399e+09
decisionType opinion of the court
usCite 5 U.S. 137
sctCite
ledCite 2 L. Ed. 60
lexisCite 1803 U.S. LEXIS 352
term 1803
naturalCourt Marshall 1
chief Marshall
docket
caseName WILLIAM MARBURY v. JAMES MADISON, SECRETARY OF STATE OF THE UNITED STATES
dateArgument 6.95287e+09
dateRearg NaN
petitioner judge
petitionerState District of Columbia
respondent Department of State
respondentState NaN
jurisdiction mandamus
adminAction NaN
adminActionState NaN
threeJudgeFdc no mentionof 3-judge ct
caseOrigin NaN
caseOriginState NaN
caseSource NaN
caseSourceState NaN
lcDisagreement no mention of dissent
certReason no cert
lcDisposition NaN
lcDispositionDirection conservative
declarationUncon fed unconstitutional
caseDisposition pet. denied, app. dismiss
caseDispositionUnusual no unusual disposition
partyWinning petitioner lost
precedentAlteration precedent unaltered
voteUnclear vote clearly specified
issue jud. admin.: S.Ct. orig. juris
issueArea Judicial Power
decisionDirection conservative
decisionDirectionDissent dissent in opp. direction
authorityDecision1 statutory construction
authorityDecision2 NaN
lawType Infrequent Litigate (Code)
lawSupp Infrequent litigate (Code)
lawMinor Judiciary Act of 1789
majOpinWriter JMarshall
majOpinAssigner JMarshall
splitVote 1st vote on issue/provision
majVotes 6
minVotes 0
Name: 73, dtype: object
The justice-centered datasets are identical aside from a handful of new columns that break down majority and minority voting coalitions.
justice_decisions.loc[
justice_decisions.lexisCite == '1803 U.S. LEXIS 352',
set(justice_decisions.columns).symmetric_difference(case_decisions.columns)
]
| | secondAgreement | direction | firstAgreement | justiceName | vote | justice | majority | opinion |
|---|---|---|---|---|---|---|---|---|
| 426 | NaN | conservative | NaN | JMarshall | voted w/ majority, plurality | JMarshall | majority | justice wrote an opinion |
| 427 | NaN | conservative | NaN | WCushing | voted w/ majority, plurality | WCushing | majority | justice wrote no opinion |
| 428 | NaN | conservative | NaN | WPaterson | voted w/ majority, plurality | WPaterson | majority | justice wrote no opinion |
| 429 | NaN | conservative | NaN | SChase | voted w/ majority, plurality | SChase | majority | justice wrote no opinion |
| 430 | NaN | conservative | NaN | BWashington | voted w/ majority, plurality | BWashington | majority | justice wrote no opinion |
| 431 | NaN | conservative | NaN | AMoore | voted w/ majority, plurality | AMoore | majority | justice wrote no opinion |
Out of the box, the records are very well-groomed, with modern decisions being entered with laudable consistency.
case_decisions.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28891 entries, 0 to 28890
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 caseId 28891 non-null object
1 docketId 28891 non-null object
2 caseIssuesId 28891 non-null object
3 voteId 28891 non-null object
4 dateDecision 28891 non-null float64
5 decisionType 28890 non-null object
6 usCite 28891 non-null object
7 sctCite 28891 non-null object
8 ledCite 28891 non-null object
9 lexisCite 28891 non-null object
10 term 28891 non-null float64
11 naturalCourt 28891 non-null object
12 chief 28891 non-null object
13 docket 28891 non-null object
14 caseName 28891 non-null object
15 dateArgument 23059 non-null float64
16 dateRearg 628 non-null float64
17 petitioner 28889 non-null object
18 petitionerState 3730 non-null object
19 respondent 28887 non-null object
20 respondentState 6297 non-null object
21 jurisdiction 28889 non-null object
22 adminAction 7213 non-null object
23 adminActionState 2248 non-null object
24 threeJudgeFdc 28817 non-null category
25 caseOrigin 27952 non-null object
26 caseOriginState 7195 non-null object
27 caseSource 28331 non-null object
28 caseSourceState 6942 non-null object
29 lcDisagreement 28859 non-null category
30 certReason 28803 non-null object
31 lcDisposition 16038 non-null category
32 lcDispositionDirection 27730 non-null category
33 declarationUncon 28889 non-null category
34 caseDisposition 28635 non-null category
35 caseDispositionUnusual 28888 non-null category
36 partyWinning 28874 non-null category
37 precedentAlteration 28889 non-null category
38 voteUnclear 28886 non-null category
39 issue 28775 non-null object
40 issueArea 28775 non-null category
41 decisionDirection 28797 non-null category
42 decisionDirectionDissent 28658 non-null category
43 authorityDecision1 28773 non-null category
44 authorityDecision2 4191 non-null category
45 lawType 27556 non-null category
46 lawSupp 27556 non-null object
47 lawMinor 28891 non-null object
48 majOpinWriter 26664 non-null object
49 majOpinAssigner 28589 non-null object
50 splitVote 28891 non-null category
51 majVotes 28891 non-null float64
52 minVotes 28891 non-null float64
dtypes: category(17), float64(6), object(30)
memory usage: 55.5 MB
The SCDB's online codebook covers the encodings for every field in the database in grim detail. I won't go into many of the details here, but as you might expect from the `case_decisions.info` call, for the most part `NaN`s are used throughout the dataset to convey that a field is not applicable for a given record. Little imputation is necessary throughout the data, although in categorical fields that we will rely on in later work we will replace `NaN` values with additional categories to make our lives easier.
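As a preview of that approach, here's a minimal sketch of giving one categorical column an explicit category for missing values (the `'MISSING_VALUE'` placeholder is the one we'll standardize on later in this post; the column choice here is arbitrary):

# Add an explicit category so missing values show up in value_counts, groupbys,
# and feather round-trips instead of silently disappearing.
disposition = case_decisions['caseDisposition'].cat.add_categories(['MISSING_VALUE'])
disposition = disposition.fillna('MISSING_VALUE')
disposition.value_counts().head()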
Sparse Row Cleanup
Before attempting to address issues in each column, let's first look at the records to see if any stand out as glaringly problematic. This will also give us a chance to look at the meaning of each field in a bit more detail.
I’m particularly interested in decisions with a large number of missing fields.
is_modern_decision = case_decisions.term >= 1946
missing_fields = (
case_decisions[is_modern_decision]
.pipe(lambda df: df.isna() | df.astype(str).isin({'', 'NA', 'NULL', 'na', 'null'}))
.sum(axis='columns')
)
print('5 Sparsest Records:')
display(missing_fields.nlargest())
print('Missing Field Count Deciles:')
display(missing_fields.quantile(np.linspace(0, 1, 11)))
print('Record Count by Number of Missing Values:')
display(missing_fields.value_counts(sort=False).sort_index())
5 Sparsest Records:
26661 28
27137 24
27138 24
20345 23
20243 22
dtype: int64
Missing Field Count Deciles:
0.0 2.0
0.1 6.0
0.2 7.0
0.3 7.0
0.4 8.0
0.5 8.0
0.6 8.0
0.7 9.0
0.8 9.0
0.9 11.0
1.0 28.0
dtype: float64
Record Count by Number of Missing Values:
2 7
3 63
4 189
5 323
6 1220
7 1457
8 2392
9 1586
10 661
11 431
12 228
13 202
14 99
15 56
16 51
17 24
18 5
19 4
20 7
21 9
22 12
23 1
24 2
28 1
dtype: int64
Yikes! After imputing values column-by-column, it may behoove us to scrutinize the remaining sparse records in the database. For now, however, one case stands out among all modern decisions with over half its fields missing.
The Great Granite State v. Tandy Corp. Mystery
case_decisions.iloc[26661]
caseId 1992-121
docketId 1992-121-01
caseIssuesId 1992-121-01-01
voteId 1992-121-01-01-01
dateDecision 1.2954e+10
decisionType per curiam (oral)
usCite 507 U.S. 1026
sctCite 113 S. Ct. 1836
ledCite 123 L. Ed. 2d 463
lexisCite 1993 U.S. LEXIS 2702
term 1992
naturalCourt Rehnquist 5
chief Rehnquist
docket 91-2086
caseName GRANITE STATE INS. CO. v. TANDY CORP.
dateArgument 1.29498e+10
dateRearg NaN
petitioner insurance company, surety
petitionerState NaN
respondent business, corporation
respondentState NaN
jurisdiction cert
adminAction NaN
adminActionState NaN
threeJudgeFdc NaN
caseOrigin NaN
caseOriginState NaN
caseSource NaN
caseSourceState NaN
lcDisagreement NaN
certReason NaN
lcDisposition NaN
lcDispositionDirection NaN
declarationUncon NaN
caseDisposition pet. denied, app. dismiss
caseDispositionUnusual NaN
partyWinning petitioner lost
precedentAlteration NaN
voteUnclear NaN
issue NaN
issueArea NaN
decisionDirection NaN
decisionDirectionDissent NaN
authorityDecision1 NaN
authorityDecision2 NaN
lawType NaN
lawSupp NaN
lawMinor NA
majOpinWriter NaN
majOpinAssigner WHRehnquist
splitVote 1st vote on issue/provision
majVotes 9
minVotes 0
Name: 26661, dtype: object
Roughly half of the fields in this record are null, many of which are not permitted to be null per the documentation. It's worth validating that this isn't an artifact of our import process, and indeed the Granite State v. Tandy Corp. page on the SCDB's website seems to be equally lacking in details.
Let’s break down the missing values in Granite State and see if we can’t fill them in by getting our hands a bit dirty.
- `petitionerState` and `respondentState` specify which state actor is party to the case when applicable. Since neither Granite State Insurance Company nor Tandy Corporation are government agents, these values being set to `NaN` is consistent with how the remainder of the dataset is supposed to be coded. We'll leave them be until we address these columns individually later in the series.
- `adminAction` and `adminActionState` indicate which, if any, administrative agency or agency official (and its state, if not a federal agency) performed an action related to the case before litigation began. Again, these variables are irrelevant to the case in question, and we leave them be.
- `threeJudgeFdc` flags cases that were heard by a three-judge federal district court during the early 1900s. As mentioned in the documentation, this court setup was legislated into existence at the turn of the twentieth century and gradually phased out, to the point where this form of case is almost non-existent in the Rehnquist and Roberts Courts. This case was presided over by Judge Hittner in the Federal District Court for the Southern District of Texas, so we may safely fill this field in with `'no mentionof 3-judge ct'`, the unfortunate corresponding string.
- `caseOrigin` and `caseSource` provide the lower courts where the case originated and was appealed from, respectively, with `caseOriginState` and `caseSourceState` being used to capture the state in question for lower state courts. These variables are only ever all supposed to be `NaN` in cases of original jurisdiction for SCOTUS. As we just mentioned, Granite State originated in the Federal District Court for the Southern District of Texas, and it was then appealed to the Fifth Circuit. We leave the two `[…]State` fields as `NaN`s for the time being and fill `caseOrigin` and `caseSource` with the appropriate `'Texas South. U.S. Dist. Ct.'` and `'U.S. Ct. App., Fifth Cir.'` values found throughout `case_decisions`.
- Variables beginning with `lc` reflect aspects of the proceedings of the case at a lower court. The `lcDisagreement` field is a bit of an odd duck, stating whether the SCOTUS opinion mentions that any lower court justices dissented, even if another authoritative source says otherwise. The Supreme Court's decision in this case is a simple dismissal, absent all such details, let alone the breakdown of the votes of Fifth Circuit judges in their decision.
- The remaining two lower court variables, `lcDisposition` and `lcDispositionDirection`, indicate how the `caseSource` court reviewed the `caseOrigin` court's decision and whether or not the `caseSource` court's review resulted in a liberal or conservative outcome. What constitutes a liberal or conservative decision is laid out carefully in the documentation of the related `decisionDirection` variable, which is identical to `lcDispositionDirection` in meaning except that (a) it considers the Supreme Court's decision in relation to the `caseSource` court's decision and (b) its value is also impacted by the `decisionDirectionDissent` variable discussed shortly. In Granite State, the Fifth Circuit affirmed the district court's decision, and SCOTUS dismissed Granite State's cert petition. Given the nature of the parties and the procedural posture of the case, I don't believe a liberal or conservative angle exists here, and I've bolstered my confidence by reading the Fifth Circuit decision and listening to oral arguments. I believe that `'unspecifiable'` is completely appropriate for both of these direction variables.
- The `decisionDirectionDissent` variable captures the relative direction of the majority and dissenting opinions when they agree on the case's issue but differ on what relief is deserved. When that happens, unfortunately, the SCDB changes the meaning of what is encoded in the `decisionDirection` field to reflect whether the `decisionDirection` (the direction of the majority opinion) is more liberal or conservative than the dissenting opinion. This isn't a concern for the unanimous Granite State decision, but it is worth keeping in mind for analytics purposes.
- `caseDispositionUnusual` indicates whether or not the Court's opinion fits into one of the standard `caseDisposition` categories. This isn't a problem in this decision.
- Petitions for writs of certiorari are the most common petitions received by SCOTUS by a considerable margin, and `certReason` captures the reason the Court granted cert as stated in the case's majority opinion. There are a number of issues with `certReason` that we'll cover in the sequel, so we'll leave this field as-is for the time being.
- `authorityDecision1` and `authorityDecision2` state the grounds for the Court's authority in each case. While they appear rather complicated to evaluate, this is clearly a case of supervision of a lower court.
- `declarationUncon` and `precedentAlteration` record whether or not SCOTUS struck down legislation as unconstitutional or altered its own precedent. Neither of these is relevant to a dismissal.
- `voteUnclear` flags the rare opinions where it's unclear who voted in the majority and minority. Again, this is irrelevant for Granite State.
- `majOpinWriter` stores the Justice who wrote the majority opinion. This dismissal was reported as an order without a declared author, so leaving this field missing is appropriate for now.
That brings us to the final five missing fields: `issue`, `issueArea`, `lawType`, `lawSupp`, and `lawMinor`. These fields provide details about the subject matter of a case and the main statutes, orders, etc. at play.
Unfortunately, I do not feel qualified to fill these in with any confidence. If I had to hazard guesses, however, I would say the following:
- The `issue` in the case is "judicial administration: review of non-final order" (or `'jud. admin.: review of order'` in our dataset), which makes `'Judicial Power'` the `issueArea`.
  - It's possible that the `issue` for this case should be "judicial administration: jurisdiction or authority of federal district courts or territorial courts" (or `'jud. admin.: Dist. Ct. jurisd.'` in our dataset), but my first guess seems to be a better fit.
- The fields `lawType`, `lawSupp`, and `lawMinor` together describe the specific clauses of the Constitution, sections of the U.S. Code, etc. that are at issue in the case. This is where I am least equipped to answer as a non-expert, non-SCDB contributor. This is compounded by the lack of an actual Supreme Court opinion in this case. Brief references to Granite State appear in the 506th and 507th bound volumes of the United States Reports as short "orders", granting the cert petition and dismissing it under SCOTUS Rule 46. If I had to guess, Granite State requested their petition be dismissed in anticipation of an unfavorable opinion following oral arguments. Accordingly, I have to go on the opinions from the Federal District Court and Fifth Circuit together with Supreme Court oral arguments.

  After reading through the transcript of oral arguments and skimming the lower court opinions, I think it's fair to say that interpretation of the Declaratory Judgment Act (28 U.S.C. §2201) is the core issue before the Court. There are some other statutes at play, predominantly in arguments before and opinions of the lower courts, that address jurisdictional questions, including 28 U.S.C. §1292(a)(3), 28 U.S.C. §1291, and the collateral order doctrine. These, however, take a backseat to the Declaratory Judgment Act in Granite State Insurance Corporation's oral arguments before the Supreme Court. If my interpretation is consistent with that of the SCDB authors, we should then set `lawType` and `lawSupp` equal to `'Infrequent Litigate (Code)'` and `'Infrequent litigate (Code)'` (resp.) and `lawMinor` equal to `"28 U.S.C. §2201"` (or whichever formatting of this statute has been adopted in the rest of the SCDB … they are a bit inconsistent here).
is_granite_state_v_tandy_corp = case_decisions.caseId == '1992-121'
case_decisions.loc[
is_granite_state_v_tandy_corp,
[
'threeJudgeFdc',
'caseOrigin',
'caseSource',
'lcDisagreement',
'lcDisposition',
'lcDispositionDirection',
'decisionDirection',
'caseDispositionUnusual',
'declarationUncon',
'precedentAlteration',
'voteUnclear',
'authorityDecision1',
'issue',
'issueArea',
'lawType',
'lawSupp',
'lawMinor'
]
] = [
'no mentionof 3-judge ct',
'Texas South. U.S. Dist. Ct.',
'U.S. Ct. App., Fifth Cir.',
'no mention of dissent',
'affirmed',
'unspecifiable',
'unspecifiable',
'no unusual disposition',
'no unconstitutional',
'precedent unaltered',
'vote clearly specified',
'original juris. or supervis.',
'jud. admin.: review of order',
'Judicial Power',
'Infrequent Litigate (Code)',
'Infrequent litigate (Code)',
'28 U.S.C. § 2201'
]
Numeric Columns
With Granite State out of the way, let’s move on to analyzing the fields in the case decisions dataset, beginning with numerics.
display(case_decisions.select_dtypes(include='number').describe().T)
case_decisions.select_dtypes(include='number').info()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| dateDecision | 28891.0 | 1.067607e+10 | 1.545968e+09 | 6.589210e+09 | 9.567374e+09 | 1.052749e+10 | 1.196480e+10 | 1.381363e+10 |
| term | 28891.0 | 1.919980e+03 | 4.882050e+01 | 1.791000e+03 | 1.885000e+03 | 1.915000e+03 | 1.961000e+03 | 2.019000e+03 |
| dateArgument | 23059.0 | 1.081233e+10 | 1.557930e+09 | 6.589123e+09 | 9.688680e+09 | 1.076786e+10 | 1.209764e+10 | 1.380871e+10 |
| dateRearg | 628.0 | 1.031189e+10 | 1.665693e+09 | 6.747235e+09 | 8.551678e+09 | 1.067848e+10 | 1.152088e+10 | 1.376698e+10 |
| majVotes | 28891.0 | 7.721297e+00 | 1.408301e+00 | 1.000000e+00 | 7.000000e+00 | 8.000000e+00 | 9.000000e+00 | 1.000000e+01 |
| minVotes | 28891.0 | 7.300890e-01 | 1.266678e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28891 entries, 0 to 28890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dateDecision 28891 non-null float64
1 term 28891 non-null float64
2 dateArgument 23059 non-null float64
3 dateRearg 628 non-null float64
4 majVotes 28891 non-null float64
5 minVotes 28891 non-null float64
dtypes: float64(6)
memory usage: 1.3 MB
As the documentation suggests, `dateDecision` and `term` values are provided for every record, with supplementary `dateArgument` and `dateRearg` provided for cases that were argued before SCOTUS at least once. Nothing jumps out immediately as out of place here except the data types.
Correcting the integral columns is straightforward.
case_decisions = case_decisions.astype({
'term': 'uint16',
'majVotes': 'uint8',
'minVotes': 'uint8'
})
It's a DATE…12?
Slightly more interesting here is why `pyreadstat` struggles with the date columns `dateDecision`, `dateArgument`, and `dateRearg`. Does `pyreadstat` recognize these as date fields? We can find out by looking at the metadata it generates.
_, metadata = pyreadstat.read_sav(
str(DATA_PATH / 'raw' / 'SCDB_Legacy_06_caseCentered_Citation.sav'),
metadataonly=True
)
{metadata.original_variable_types[column_name]
for column_name in ['dateDecision', 'dateArgument', 'dateRearg']}
{'DATE12'}
Ah yes … `DATE12`.
Not being an SPSS user, I wasn't sure how to interpret this data type (beyond confirming that it's some date format), and, judging from its source code, neither was `pyreadstat`! Online documentation of data formats also appears to be a weak point for SPSS, so I was initially in the dark on this.
Here GNU came to the rescue. Apparently GNU PSPP is a thing, a thing that can even run on macOS … arguably. With PSPP set up, I was able to load one of the `.sav` files. The `DATE12` columns appeared to be in the same date format found throughout the SCDB CSV files, which is to say `DD-MON-YYYY`, where `MON` is the first three letters of the English name of the month, fully capitalized (e.g., `NOV` for November). (For Python users playing along at home, this is `%d-%b-%Y` `strftime` format, except with the locale-specific month short name fully capitalized.)
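For the curious, round-tripping one of these strings in Python looks something like the sketch below: `strptime` is case-insensitive about month names, so the all-caps abbreviation parses fine, while producing the all-caps form on output takes an extra `.upper()`.

# Parse an SCDB-CSV-style DD-MON-YYYY string; %b happily accepts 'AUG'.
argued = dt.datetime.strptime('02-AUG-1791', '%d-%b-%Y')

# strftime emits 'Aug', so upper-case the result to match the SCDB's style.
argued.strftime('%d-%b-%Y').upper()  # '02-AUG-1791'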
Unfortunately, this didn’t explain the meaning of the floats in these columns within the dataset, which clearly are not in Unix time (given that they’re positive and date back to 1791). Without understanding them, our only options would involve joining the corrected versions of these columns from the dataset in a different file format, a far less than ideal outcome.
Fortunately, playing with common date/time constants gives us a huge clue regarding the `DATE12` data type.
case_decisions.dateDecision.dropna().astype(int) // 86400
0 76264
1 76264
2 76459
3 76634
4 76638
...
28886 159866
28887 159870
28888 159863
28889 159877
28890 159880
Name: dateDecision, Length: 28891, dtype: int64
As with other SPSS date formats, the floats appear to encode seconds, and they divide evenly into whole days. Again, however, this isn't epoch time. What is the "epoch" that these timestamps have been counting up from? If they are consistent, we can find out by looking at the date values for West v. Barnes (or any other case of our choice).
west_v_barnes = case_decisions.iloc[0]
display(west_v_barnes[['caseName', 'dateArgument', 'dateDecision']])
west_v_barnes[['dateArgument', 'dateDecision']].astype(int) // 86400
caseName WEST, PLS. IN ERR. VERSUS BARNES. et al.
dateArgument 6.58912e+09
dateDecision 6.58921e+09
Name: 0, dtype: object
dateArgument 76263
dateDecision 76264
Name: 0, dtype: int64
We know from Wikipedia that West v. Barnes was argued on August 2nd, 1791 and decided the following day. The "epoch" for `DATE12` dates appears to be $76,263$ days before August 2nd, 1791, which would be …
dt.datetime(1791, 8, 2) - dt.timedelta(days=76_263)
datetime.datetime(1582, 10, 14, 0, 0)
Right … October 14th, 1582. Naturally.
Unexpected though this date representation may seem, it actually makes perfect sense for SPSS. This is Gregorian Epoch Time! October 15th, 1582 was the start of Gregorian Calendar usage in Europe, following a period of days that were simply skipped for participating countries. This is about as far back as one can go with a date or datetime field without having to add Julian-to/from-Gregorian logic to everything in one’s program, which makes it a reasonable minimum date for SPSS files.
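As a quick sanity check, adding the $76,264$ days we computed above for West v. Barnes's `dateDecision` onto this epoch should land on the day after arguments:

# 76,264 is dateDecision // 86_400 for West v. Barnes (row 0 above); adding it
# to the putative epoch gives August 3rd, 1791, the day the case was decided.
dt.datetime(1582, 10, 14) + dt.timedelta(days=76_264)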
For further evidence, notice that this date also serves as the `spss_origin` variable in `pyreadstat`, which is used throughout the library to perform SPSS date and datetime conversions analogous to the following:
date_column_names = ['dateArgument', 'dateRearg', 'dateDecision']
def gregorian_epoch_time_to_datetime64(timestamp_seconds):
return dt.datetime(1582, 10, 14) + dt.timedelta(seconds=timestamp_seconds)
case_decisions = case_decisions.assign(
**{
date_column_name: functools.partial(
lambda df, column_name: df[column_name].map(
gregorian_epoch_time_to_datetime64, na_action='ignore'
),
column_name=date_column_name
)
for date_column_name in date_column_names
}
)
justice_decisions = justice_decisions.assign(
**{
date_column_name: functools.partial(
lambda df, column_name: df[column_name].map(
gregorian_epoch_time_to_datetime64, na_action='ignore'
),
column_name=date_column_name
)
for date_column_name in date_column_names
}
)
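As an aside, a fully vectorized version of this conversion is possible but slightly awkward: the 1582 epoch itself falls outside the range of pandas' nanosecond-resolution `Timestamp`, so the seconds have to be rebased onto an in-range reference date first. A sketch (an alternative to the `map`-based conversion above, not something to run on top of it):

# Rebase the SPSS seconds onto a reference date pandas can represent, since
# pd.Timestamp cannot hold 1582-10-14 directly. NaNs become NaT for free.
reference = dt.datetime(1700, 1, 1)
offset_seconds = (reference - dt.datetime(1582, 10, 14)).total_seconds()

def convert_spss_date_columns(df, column_names):
    return df.assign(**{
        name: pd.Timestamp(reference) + pd.to_timedelta(df[name] - offset_seconds, unit='s')
        for name in column_names
    })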
Tying It All Together with DVC
This post originally was several times the length of what's above, with initial data munging performed across the majority of the columns in the `case_decisions` dataset. I decided it was worth breaking up into a series of posts for readability, and this decision has at least one auxiliary benefit: it gives me an excuse to try out DVC!
DVC (short for “Data Version Control”) is a lightweight tool for versioning data, experiments, models, and the pipelines that connect them together. It’s like Git and Make had a baby, gave it a Python SDK for its first birthday, and taught it good practices in data engineering, data science, and MLOps along the way. DVC is designed to complement Git, providing a similar CLI (and Git hooks) that allows for defining data pipelines and tracking the evolution of data, model, and experimental artifacts without bogging down Git with large files. Avoiding committing large files to version control is accomplished with a pointer system similar to the one used by Git-LFS but without assuming you have a Git LFS server set up (or will use GitHub’s or Atlassian’s). Instead, DVC supports a wide variety of remote storage options including object storage in the big three public clouds, HDFS, SSH/SFTP, read-only HTTP, and even local paths. Local and in-repository storage are also supported, and remote data is cached locally when that’s convenient and physically possible.
Multi-stage ML pipelines are defined in YAML files (or interactively via the CLI), and like Make, the essential ingredients of each stage in the pipeline are a name, a set of inputs (dependencies), a set of outputs, and a command to execute. And, again just like Make targets, these stages, changes in their dependencies, and the DAG they form (that is, the ML pipeline) are all known to DVC, which makes keeping data up-to-date easy and efficient.
There’s a lot to like here with DVC. It’s not just “Make with Git, storage integrations, and experiment metrics”, since it handles all of the I/O and pathing logic for versioned remote file storage, cached local file storage, and so on. Its CLI also seems intuitive as a Git user, and I’m a big fan of how it remains lightweight while also checking off all the essential boxes for data versioning and provenance. Tools like Pachyderm are great for enterprise settings, but who wants to keep a Kubernetes cluster running somewhere to track the data lineage of a side project? And DVC doesn’t depend on tab characters, which needless to say puts it lightyears ahead of Make. It’ll be interesting to see how DVC further distinguishes itself from more traditional and familiar tools as I try it out.
DVC Basics
Getting started with DVC (after installing its CLI) should be incredibly familiar to Git users. From within a pre-existing Git project, you might do something like the following to get off the ground:
- `dvc init` in the project's root to create a `.dvc` directory and a minimal set of DVC files in your repository. These files will be staged by DVC so that you can commit them with Git.
- Register a remote for DVC-managed files with `dvc remote add […]`. This can be something like an S3 bucket and key prefix.
- Use `dvc add […]` to start tracking your data, experiment, and model artifacts. Say you have a dataset `path/to/some_data.feather` in your repository. If you add the dataset to DVC (using `dvc add path/to/some_data.feather`), DVC will create a new `path/to/some_data.feather.dvc` file in your repository and stage it with Git. This new `.dvc` file contains some simple YAML with all of the information DVC needs to retrieve your `some_data.feather` dataset from remote storage. You'll want to commit this `.dvc` file in your Git repository in lieu of the dataset. Edit: While you can use `dvc add` for this purpose, if you're planning on ingesting, processing, or emitting a data file as part of a pipeline, I recommend specifying it as an input or output of a pipeline stage using `dvc run […]` instead (see what we do below in this repo for more details).
- Use `dvc push […]` commands to upload all of your DVC-tracked files to a remote.
From there, you can start defining a multi-stage pipeline for retrieving, wrangling, and transforming your data and then iteratively training and evaluating models. You can re-run an entire data pipeline by running `dvc repro`, as can collaborators. If your collaborators have access to your DVC remotes (as they should), they can also skip the pipeline and immediately download the most recent data files using `dvc pull` (or `dvc fetch` followed by `dvc checkout`, for further Git analogies).
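DVC also ships a small Python API that covers the consumer side of this workflow without shelling out to the CLI. The sketch below reads the wrangled feather we produce later in this post straight from a repository's DVC remote; the repository URL is a placeholder, and the bytes are wrapped in a `BytesIO` because `read_feather` wants a seekable file.

import dvc.api

# Read a DVC-tracked artifact from wherever the repo's DVC remote lives.
with dvc.api.open(
    'data/processed/scdb/SCDB_Legacy-and-2020r1_caseCentered_Citation.feather',
    repo='https://github.com/<your-username>/<this-repo>.git',  # placeholder URL
    mode='rb'
) as feather_file:
    wrangled_cases = pd.read_feather(io.BytesIO(feather_file.read()))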
Using DVC in This Series
By breaking up my original post into a series on different aspects of SCDB data processing, I’m going to be able to get a better sense of DVC’s feature set. I’ll take this a step farther by breaking up the processing done above into logical stages analogously to how I would in an actual complex data pipeline. A multi-stage pipeline is arguably overkill for what we’ve done and are planning to do to the data, but it’ll give us a way to test out more of the features in DVC as I update and expand this pipeline in future posts.
While DVC does support running a notebook as a stage (or part of a stage) in a pipeline3, I strongly discourage this practice for reasons4 that are mostly beyond the scope of this post. As part of fitting data science into agile team processes, I try to ensure that work going into a notebook appears there as part of a spike. Normally, after laying out changes to be made to an ML pipeline through data exploration and prototyping in a notebook, I would proceed to test-drive those transformations into a Python module. For the sake of keeping this in blog post format, however, I’ll just pare down what we’ve done above and logically distribute the code into several scripts representing different data pipeline stages: acquisition, featherization, and wrangling.
[3]: as shown in an example `dvc.yaml` file
[4]: This article by Kristina Young summarizes many of the core issues with this practice.
%mkdir -p "$REPO_ROOT/data_pipeline"
Data Acquisition Script
Here we'll pull in our SCDB data sources. While we focused mostly on the case- and citation-centric version of the dataset, we'll keep things interesting by pulling in the remaining datasets as well. We'll convert all of these datasets to feathers in the next stage, with the only goal for now being to expose another convenient format to Pythonistas.
%%writefile "$REPO_ROOT/data_pipeline/acquire.py"
#!/usr/bin/env python3
import io
import zipfile
from pathlib import Path
import git
import requests
REPO_ROOT = Path(git.Repo('.', search_parent_directories=True).working_tree_dir).absolute()
DATA_PATH = REPO_ROOT / 'data'
def acquire_scdb_datasets():
download_path = DATA_PATH / 'raw' / 'scdb'
for unit_of_analysis in ['case', 'justice']:
acquire_scdb_dataset(scdb_version='Legacy_06',
dataset_version=f'{unit_of_analysis}Centered_Citation',
data_dir=download_path)
for record_granularity in ['Citation', 'Docket', 'LegalProvision', 'Vote']:
acquire_scdb_dataset(scdb_version='2020_01',
dataset_version=f'{unit_of_analysis}Centered_{record_granularity}',
data_dir=download_path)
def acquire_scdb_dataset(scdb_version, dataset_version, data_dir):
data_dir.mkdir(parents=True, exist_ok=True)
file_name = f'SCDB_{scdb_version}_{dataset_version}.sav'
url = f'http://supremecourtdatabase.org/_brickFiles/{scdb_version}/{file_name}.zip'
local_file = Path(data_dir) / file_name
if not local_file.exists():
download_and_extract_zip(url, data_dir)
def download_and_extract_zip(url, destination_dir):
zip_file_stream = io.BytesIO()
zip_file_stream.write(requests.get(url).content)
zipfile.ZipFile(zip_file_stream).extractall(destination_dir)
if __name__ == '__main__':
acquire_scdb_datasets()
Now we can make DVC run `acquire.py` as the first stage in the repository's one and only pipeline.
%pushd "$REPO_ROOT"
!dvc run --name acquire \
--outs-persist data/raw/scdb/SCDB_2020_01_caseCentered_Citation.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_caseCentered_Docket.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_caseCentered_LegalProvision.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_caseCentered_Vote.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_justiceCentered_Citation.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_justiceCentered_Docket.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_justiceCentered_LegalProvision.sav \
--outs-persist data/raw/scdb/SCDB_2020_01_justiceCentered_Vote.sav \
--outs-persist data/raw/scdb/SCDB_Legacy_06_caseCentered_Citation.sav \
--outs-persist data/raw/scdb/SCDB_Legacy_06_justiceCentered_Citation.sav \
poetry run python data_pipeline/acquire.py
%popd
This results in DVC running the `acquire.py` script and raising an error if one of the files set as an output (using an `--outs*` parameter) fails to be created. This is why I specified every output file manually in the `dvc run` call, even though DVC allows for directories to be outputs as well; I want DVC to fail if any of the datasets fail to download.
If all downloads are successful, the acquire stage emits a message like the following on its first run:
WARNING: Build cache is ignored when persisting outputs.
Running callback stage 'acquire' with command:
poetry run python data_pipeline/acquire.py
Adding stage 'acquire' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock dvc.yaml data/raw/.gitignore
Note that using `--outs-persist` here rather than `--outs` tells DVC not to re-run this stage when `dvc repro` is called if all of the stage's outputs already exist locally. I try to avoid bogging down someone else's servers with unnecessary requests, especially when they're the hand that feeds me.
Featherization Script
Next up, we transform the
.sav
files to feathers. I’ll probably wind up moving these to a more
permanent location than
data/interim
once I make the underlying repository public.
%%writefile "$REPO_ROOT/data_pipeline/featherize.py"
#!/usr/bin/env python3
from pathlib import Path
import git
import pandas as pd
import pyreadstat
REPO_ROOT = Path(git.Repo('.', search_parent_directories=True).working_tree_dir).absolute()
DATA_PATH = REPO_ROOT / 'data'
def featherize_scdb_datasets():
sav_files_dir = DATA_PATH / 'raw' / 'scdb'
interim_data_dir = DATA_PATH / 'interim' / 'scdb'
interim_data_dir.mkdir(parents=True, exist_ok=True)
for scdb_sav_path in sav_files_dir.glob('*.sav'):
scdb_feather_path = interim_data_dir / f'{scdb_sav_path.stem}.feather'
scdb_dataset = scdb_sav_to_dataframe(scdb_sav_path)
categorical_columns = scdb_dataset.select_dtypes(include='category').columns
(scdb_dataset
.pipe(lambda df: df.astype({column: str for column in categorical_columns}))
.pipe(lambda df: df.fillna({
column_name: 'MISSING_VALUE'
for column_name in df.select_dtypes(include='object').columns
}))
.pipe(lambda df: df.astype({column: 'category' for column in categorical_columns}))
.to_feather(scdb_feather_path))
def scdb_sav_to_dataframe(scdb_sav_path):
try:
dataset = pd.read_spss(str(scdb_sav_path)) # <-- str(…) due to a bug in pandas before v1.1.3
# see https://github.com/pandas-dev/pandas/pull/36174
except Exception:
dataset, _ = pyreadstat.read_sav(
str(scdb_sav_path),
apply_value_formats=True,
encoding='iso-8859-1'
)
return dataset
if __name__ == '__main__':
featherize_scdb_datasets()
Given that (a) the first stage will fail unless each of the desired SCDB files is retrieved and (b) a stage will fail if the command it runs exits abnormally, it should be safe to just use directories for the dependencies and outputs of the next stage.
%pushd "$REPO_ROOT"
!dvc run --name featherize \
--deps data/raw/scdb \
--outs data/interim/scdb \
poetry run python data_pipeline/featherize.py
%popd
Again this should result in an informative message with files to add/update in Git if it succeeds.
Running stage 'featherize' with command:
poetry run python data_pipeline/featherize.py
Adding stage 'featherize' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add data/processed/.gitignore dvc.lock dvc.yaml
Data Wrangling Script
And finally the main event. We drop our dataset unions, record
corrections, casts, and date conversions into a single
wrangle.py
script for the time being, with all of these transformations only
touch the case- and citation-centric dataset.
%%writefile "$REPO_ROOT/data_pipeline/wrangle.py"
#!/usr/bin/env python3
import datetime as dt
import functools
from pathlib import Path
import git
import pandas as pd
REPO_ROOT = Path(git.Repo('.', search_parent_directories=True).working_tree_dir).absolute()
DATA_PATH = REPO_ROOT / 'data'
def wrangle_scdb_data():
interim_data_dir = DATA_PATH / 'interim' / 'scdb'
processed_data_dir = DATA_PATH / 'processed' / 'scdb'
processed_data_dir.mkdir(parents=True, exist_ok=True)
scdb_datasets = {
'SCDB_Legacy-and-2020r1_caseCentered_Citation': pd.concat(
[pd.read_feather(interim_data_dir / f'SCDB_{scdb_version}_caseCentered_Citation.feather')
for scdb_version in ['Legacy_06', '2020_01']],
ignore_index=True
)
}
for dataset_name, scdb_dataset in scdb_datasets.items():
(scdb_dataset.pipe(
correct_record,
'1992-121',
threeJudgeFdc='no mentionof 3-judge ct',
caseOrigin='Texas South. U.S. Dist. Ct.',
caseSource='U.S. Ct. App., Fifth Cir.',
lcDisagreement='no mention of dissent',
lcDisposition='affirmed',
lcDispositionDirection='unspecifiable',
decisionDirection='unspecifiable',
caseDispositionUnusual='no unusual disposition',
declarationUncon='no unconstitutional',
precedentAlteration='precedent unaltered',
voteUnclear='vote clearly specified',
authorityDecision1='original juris. or supervis.',
issue='jud. admin.: review of order',
issueArea='Judicial Power',
lawType='Infrequent Litigate (Code)',
lawSupp='Infrequent litigate (Code)',
lawMinor='28 U.S.C. § 2201')
.pipe(set_dtypes)
.to_feather(processed_data_dir / f'{dataset_name}.feather')
)
def correct_record(scdb_df, scdb_case_id, **corrections):
scdb_df.loc[
scdb_df.caseId == scdb_case_id,
list(corrections.keys())
] = list(corrections.values())
return scdb_df
def set_dtypes(scdb_df):
return scdb_df.astype({
'term': 'uint16',
'majVotes': 'uint8',
'minVotes': 'uint8'
}).assign(
**{
date_column_name: functools.partial(
lambda df, column_name: df[column_name].map(
gregorian_epoch_time_to_datetime64, na_action='ignore'
),
column_name=date_column_name
)
for date_column_name in ['dateArgument', 'dateRearg', 'dateDecision']
}
)
def gregorian_epoch_time_to_datetime64(timestamp_seconds):
return dt.datetime(1582, 10, 14) + dt.timedelta(seconds=timestamp_seconds)
if __name__ == '__main__':
wrangle_scdb_data()
There are different ways we could handle the dependencies for this stage, but for now we'll just declare the whole `data/interim/scdb` directory as one big dependency, for simplicity and to make relying on or modifying other datasets in `wrangle.py` easy in future pipeline updates.
%pushd "$REPO_ROOT"
!dvc run --name wrangle \
--deps data/interim/scdb \
--outs data/processed/scdb/SCDB_Legacy-and-2020r1_caseCentered_Citation.feather \
poetry run python data_pipeline/wrangle.py
%popd
Again with the informative message regarding file changes.
Running stage 'wrangle' with command:
poetry run python data_pipeline/wrangle.py
Adding stage 'wrangle' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock dvc.yaml data/processed/scdb/.gitignore
Now calling `dvc dag` produces an interactive ASCII depiction of our data pipeline!
!dvc dag | cat
+---------------------+
| ../dvc.yaml:acquire |
+---------------------+
*
*
*
+------------------------+
| ../dvc.yaml:featherize |
+------------------------+
*
*
*
+---------------------+
| ../dvc.yaml:wrangle |
+---------------------+
This corresponds to the following contents of our `dvc.yaml` file:
!cat "$REPO_ROOT/dvc.yaml"
stages:
acquire:
cmd: poetry run python data_pipeline/acquire.py
outs:
- data/raw/scdb/SCDB_2020_01_caseCentered_Citation.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_caseCentered_Docket.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_caseCentered_LegalProvision.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_caseCentered_Vote.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_justiceCentered_Citation.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_justiceCentered_Docket.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_justiceCentered_LegalProvision.sav:
persist: true
- data/raw/scdb/SCDB_2020_01_justiceCentered_Vote.sav:
persist: true
- data/raw/scdb/SCDB_Legacy_06_caseCentered_Citation.sav:
persist: true
- data/raw/scdb/SCDB_Legacy_06_justiceCentered_Citation.sav:
persist: true
featherize:
cmd: poetry run python data_pipeline/featherize.py
deps:
- data/raw/scdb
outs:
- data/interim/scdb
wrangle:
cmd: poetry run python data_pipeline/wrangle.py
deps:
- data/interim/scdb
outs:
- data/processed/scdb/SCDB_Legacy-and-2020r1_caseCentered_Citation.feather
We can now ensure our data pipeline is tracked in Git by committing each of the scripts executed during stages of the pipeline and the files called out by each `dvc run` command (`dvc.yaml`, `dvc.lock`, and a couple of `.gitignore` files).
git add ${REPO_ROOT}/data_pipeline/{acquire,featherize,wrangle}.py
git add ${REPO_ROOT}/data/{raw,interim}/.gitignore
git add ${REPO_ROOT}/data/processed/scdb/.gitignore
git add ${REPO_ROOT}/dvc.{yaml,lock}
git commit -m 'Define initial DVC data pipeline'
Next Steps
We now have a minimal data pipeline defined in DVC that retrieves data sources from the SCDB and performs some basic data formatting tasks. In our next post, we'll expand this data wrangling code, wading neck deep into the waters of imputing and, primarily, manually hunting down missing and corrupt values across the $53$ columns in the `case_decisions` dataset, or at least a subset thereof that is important for what we're hoping to analyze later.
There are plenty of opportunities for diversions along the way, including digging further into DVC, discussions of how to make this pipeline more readily maintainable, and about a thousand unexpected pieces of trivia about all of these cases. (SCOTUS cases are full of juicy and unexpected conflicts.) I’m hoping to get the “boring” data engineering work out of the way in the first couple of posts so that the remainder of the series can focus mostly on analyzing the data.
Beyond the next couple of blog posts, I'll want to expand the data wrangling stages of the data pipeline to cover the other modern datasets, most especially the justice-centered one, in order to simplify analyses of the voting records of individual justices. I would also like to contribute my feather conversion script back to the SCDB project after increasing its flexibility, but that may occur outside this blog. Let me know if you have similar interests in the SCDB and would like to collaborate.
Updates
- 2021-05-23: Fixed a bug in the `preprocessing.py` timestamp column-processing code.