Estimating Gender Disparities in Federal Criminal Cases, Journal Article by Sonja Starr, 2012
This text is machine-read, and may contain errors. Check the original document to verify accuracy.
LAW AND ECONOMICS RESEARCH PAPER SERIES
PAPER NO. 12-018
AUGUST 2012

ESTIMATING GENDER DISPARITIES IN FEDERAL CRIMINAL CASES
SONJA B. STARR

THE SOCIAL SCIENCE RESEARCH NETWORK ELECTRONIC PAPER COLLECTION: HTTP://SSRN.COM/ABSTRACT=2144002
FOR MORE INFORMATION ABOUT THE PROGRAM IN LAW AND ECONOMICS VISIT: HTTP://WWW.LAW.UMICH.EDU/CENTERSANDPROGRAMS/LAWANDECONOMICS/PAGES/DEFAULT.ASPX

Electronic copy available at: http://ssrn.com/abstract=2144002

Estimating Gender Disparities in Federal Criminal Cases
Sonja B. Starr*
University of Michigan Law School
email@example.com
August 29, 2012

This paper assesses gender disparities in federal criminal cases. It finds large gender gaps favoring women throughout the sentence length distribution (averaging over 60%), conditional on arrest offense, criminal history, and other pre-charge observables. Female arrestees are also significantly likelier to avoid charges and convictions entirely, and twice as likely to avoid incarceration if convicted. Prior studies have reported much smaller sentence gaps because they have ignored the role of charging, plea-bargaining, and sentencing fact-finding in producing sentences. Most studies control for endogenous severity measures that result from these earlier discretionary processes and use samples that have been winnowed by them. I avoid these problems by using a linked dataset tracing cases from arrest through sentencing. Using decomposition methods, I show that most sentence disparity arises from decisions at the earlier stages, and use the rich data to investigate causal theories for these gender gaps.
* Thanks to Ing-Haw Cheng, John DiNardo, Nancy Gertner, Sam Gross, Jim Hines, JJ Prescott, Eve Brensike Primus, Adam Pritchard, and Marit Rehavi for helpful comments and conversations, to Ryan Gersowitz, Michael Farrell, Seth Kingery, and Adam Teitelbaum for research assistance, and to participants in the Law and Economics Lunch, Fawley Lunch Workshop, and Criminal Justice Roundtable at the University of Michigan Law School, the University of Michigan Labor Lunch, and the Ninth Circuit Judicial Conference.

Introduction

In the United States, men are fifteen times as likely to be incarcerated as women are. But can this gap be explained by differences in criminal behavior or circumstances, or are courts or prosecutors treating genuinely equivalent cases differently on the basis of gender? The latter would violate the Constitution, undercut the criminal justice system’s punishment objectives, and contribute to the social consequences of demographically concentrated mass incarceration. So the reasons for the gender gap are of considerable legal and policy interest. This study explores them using a dataset that traces federal criminal cases from arrest through sentencing. I find that gender gaps widen at every stage of the justice process and that men and women ultimately receive dramatically different sentences.

Existing studies of demographic disparities in criminal justice focus on narrow slices of the justice process in isolation. Most assess the judge’s final sentencing decision, controlling for conviction severity or “presumptive sentence” measures that are themselves produced by discretionary decisions and negotiations.
Ignoring disparities in those earlier stages could bias sentencing disparity estimates, both because the key control variable is endogenous and because of sample selection from the winnowing of cases at each procedural stage. Current sentencing literature typically ignores this “funnel.” There is a small literature addressing disparities in prosecutorial decisions, but it addresses only certain pieces of the process and does not estimate their ultimate sentencing consequences. These limitations represent a surprising gulf between the quantitative empirical scholarship and the theoretical literature on the criminal justice system, which widely recognizes that sentencing is heavily shaped by prosecutors’ capacious charging and bargaining discretion.

This study seeks to close this gap, using a multi-agency linked dataset that traces cases from arrest through sentencing. I estimate sentence outcomes conditioned on characteristics that are fixed near the beginning of the justice process, rather than near the end of it: the arrest offense, criminal history, and other prior characteristics. This approach generates a measure of the aggregate gender disparity introduced in the post-arrest justice process. I then use sequential decomposition methods to assess how much of this gap appears to be explainable by decision-making at each procedural stage (see Altonji, Bharadwaj, and Lange 2008; DiNardo, Fortin, and Lemieux 1996). In short, I ask: do otherwise-similar men and women who are arrested for the same crimes end up with the same punishments, and if not, at what points do their fates diverge? Although the arrest offense is not a perfect proxy for underlying criminal conduct, it is a big improvement on the highly endogenous controls used in current research.

I also use estimation strategies (reweighting of the mean and the distribution) that offer a useful solution to a problem with which sentencing researchers have long struggled: how to treat non-prison sentences.
The leading approach is a Two-Part Model that separates the incarceration decision from the length decision, but that introduces serious sample selection concerns if there is disparity in the first stage. The best solution is simply to treat sentencing as a single process and estimate disparities in all sentences, including the zeros. Doing so with reweighting rather than regression obviates the functional form concerns that underlie many researchers’ preference for the Two-Part Model.

The estimated gender disparities are strikingly large, conditional on observables. Most notably, treatment as male is associated with a 63% average increase in sentence length, with substantial unexplained gaps throughout the sentence distribution. These gaps are much larger than those estimated by previous research. This is because, as the sequential decomposition demonstrates, the gender gap in sentences is mostly driven by decisions earlier in the justice process, most importantly sentencing fact-finding, a prosecutor-driven process that other literature has ignored.

But why do these disparities exist? Despite the rich set of covariates, unobservable gender differences are still possible, so I cannot definitively answer the causal question. However, several plausible theories have testable implications, and I take advantage of the unusually rich dataset to explore them. I find substantial support for some theories (particularly accommodation of childcare responsibilities and perceived role differences in group crimes), but these appear to explain the observed disparities only partially.

1. Discretion and Gender Disparity in Criminal Justice

1.1. Sources of Discretion in the Federal Criminal Justice Process

Just as the states do, the federal justice system gives enormous power to prosecutors.
The United States in effect has a system of negotiated justice, and prosecutors hold most of the chips. They have broad discretion to choose charges from numerous overlapping criminal statutes, and then to determine the terms of plea deals. Plea-bargaining does not necessarily focus mainly on dropping of charges—indeed, the lead charge was dropped only 17% of the time in this study’s sample. The parties also often negotiate stipulations to key “sentencing facts”—for instance, the quantity of drugs trafficked or the defendant’s major or minor role in a conspiracy. The prosecutor also may make non-binding sentencing recommendations or request special leniency to reward cooperators. Federal sentencing is guided by two main legal frameworks. First, each criminal statute specifies a sentencing range. Most are broad and start at zero (for instance, 0-20 years), but some specify a “mandatory minimum.” Second, since 1987, the statutory sentencing constraints have been supplemented by much narrower ranges (for instance, 27 to 33 months) found in the U.S. Sentencing Guidelines. The Guidelines sought to reduce unwarranted disparities in sentencing, including gender disparities (see Breyer 1988), by constraining judicial discretion. They were mandatory until 2005, when the Supreme Court’s decision in United States v. Booker (543 U.S. 220) rendered them advisory. But advisory does not mean unimportant—judges are still required to calculate the Guidelines sentence, and most sentences are still within the Guidelines range (U.S. Sentencing Commission 2010). The Guidelines sentencing ranges are found in the cells of a grid, the two axes of which are the “offense level” and the defendant’s criminal history. 
Judges determine the offense level based on the crime(s) of conviction and the “sentencing facts.” Although judges have independent factfinding authority, in practice they usually defer to the plea agreement’s stipulations (Stith 2008; Schulhofer and Nagel 1997; Powell and Cimino 1995). One survey found that 92% of judges said their findings of fact diverge from the plea agreement either “infrequently” or “never” (Gilbert and Johnson 1996).

Legal scholars widely agree that the Guidelines greatly empowered prosecutors because the sentence was now far more constrained by the charges of conviction and especially by the negotiated “sentencing facts” (Stith 2008; Bibas 2009). Prosecutors thus could both threaten long sentences and virtually promise much lower ones in exchange for guilty pleas, and plea rates rose from 87% to 97%, where they remain today (Alschuler 2005; Miller 2004). Although Booker expanded judicial discretion, the continued high rate of Guidelines compliance means these sources of prosecutorial influence have not disappeared. In addition, prosecutors can still firmly bind judges using mandatory minimums.

Prosecutors have a variety of incentives to balance, including career incentives that push toward maximizing sentences and resource constraints that discourage going to trial (see, for example, Baker and Mezzetti 2001; Easterbrook 1983). In addition, prosecutors may be affected by sympathy or a sense of fairness. Schulhofer and Nagel (1997) review federal prosecutors’ case files and find evidence of deliberate charge manipulation to avoid excessive sentences. Prosecutorial discretion is often described as the power not to seek to maximize punishment, that is, to be selectively lenient (see Stith 2008).
Although there may be good policy reasons for allowing such discretion, it is a potential source of unwarranted disparity if it is influenced by legally irrelevant factors such as gender.

1.2. Existing Empirical Research

Existing studies of demographic disparities in criminal justice have typically focused on single stages of the criminal process in isolation, usually the judge’s final sentencing decision. In the federal-court literature, the usual approach is to estimate gaps in sentence outcomes when controlling for the Guidelines offense level and the defendant’s criminal history. These two key controls are often combined into a “presumptive sentence,” usually the lower end of the Guidelines range (U.S. Sentencing Commission 2010), or into dummies for the Guidelines grid cell (see, for example, Mustard 2001). Similarly, state-level studies generally control for some measure of conviction severity as well as criminal history (see, for example, Steffensmeier, Kramer, and Streifel 1993).

Studies of gender disparity that take this approach have usually found that women receive shorter sentences, conditional on observables. The size of this effect has varied considerably, even among studies that use federal data. Sarnikar et al. (2007) find about a 30% unexplained gender gap in sentence length, as does a prominent recent U.S. Sentencing Commission (2010) study. Many studies, however, have estimated considerably smaller disparities: for instance, Stacey and Spohn (2006), Schanzenbach (2005), and Mustard (2001) all find average gender gaps in sentence length of around 10%.

The problem with the dominant approach is that the key control variable is itself the result of a host of discretionary decisions made earlier in the justice process, which these studies ignore. The resulting sentencing disparity estimates are potentially biased by the endogeneity of the key control variable as well as sample selection introduced by the dismissal of cases prior to sentencing.
Although there have been occasional studies of plea-bargaining disparities (see, for example, Spohn and Spears 1997; Shermer and Johnson 2010), they concern only certain bargaining outcomes, such as binary measures of whether any charges were dropped, and ignore negotiation over sentencing facts, which is the key aspect of bargaining in the modern federal system. Moreover, without assessing disparities in prosecutors’ initial choice of charges, the charge-bargaining results are not very meaningful.1 Further, the plea-bargaining studies tend to assess that stage in isolation too, rather than assessing its ultimate sentencing-disparity consequences.

1 Spohn, Gruhl, and Welch (1987) found gender disparities favoring women in the rate of filing felony charges in Los Angeles County, but did not analyze charge severity as an outcome.

1.3. The Dataset

This study uses data from four different federal sources: the U.S. Marshals Service (USMS), the Executive Office of U.S. Attorneys (EOUSA), the Administrative Office of the U.S. Courts (AOUSC), and the U.S. Sentencing Commission (USSC). The Bureau of Justice Statistics provided inter-agency linking files that allow cases to be traced from arrest through sentencing. The main sample consists of federal property and fraud crimes, drug crimes, regulatory offenses, and violent crimes sentenced between FY 2001 and FY 2009.2 Immigration cases, which have different stakes centering on deportation, were excluded. To reduce common support concerns, offense categories that were over 95% male were dropped: weapons, sex and pornography, conservation, and family offenses.
The data include rich offense and offender information, including arrest offense (which USMS identifies with 430 codes),3 gender, race, age, marital status, district, citizenship, a string field describing the offense, criminal history, number of dependents, education, Hispanic ethnicity, counsel type, co-defendant information, and county. AOUSC also lists the initial and final charges; these statutory sections then had to be coded on a numeric charge severity scale. I constructed three such scales based on combined severity of all charges: the statutory maximum, the statutory minimum, and a Guidelines-based measure. If the statute prescribed varying sentences depending on case facts, I used default assumptions grounded in legal research. For further details, see the Data Appendix.

2. Analysis and Results

2.1. Filing and Conviction-Stage Disparity

This study principally focuses on whether male and female arrestees ultimately receive the same sentences, but a threshold question is whether they are equally likely to be sentenced at all. Disparities in charging and conviction rates are important outcomes in their own right, and also are potential sources of sample selection bias in the sentencing analysis. To be included in the sentencing data, defendants must first face charges before a district court judge, a close proxy for felony charges because misdemeanors are usually handled by magistrates. Second, defendants must be convicted of a non-petty offense: a felony or a Class A misdemeanor. Accordingly, I begin by estimating the probability of these events.
Columns 1 and 2 of Table 2 report the “male” odds ratios from logistic regressions.4 Conditional on arrest offense, district, race, citizenship, and age (the variables observed for all arrested defendants), male arrestees face a modestly but significantly higher probability of a charge before a district judge: 92.2% for the average male and 90.7% for the equivalent female.5 Conditional on the same variables plus multi-defendant case structure, male district court defendants are also significantly more likely to be convicted of a non-petty offense (93.2% versus 91.4%; Table 2, Col. 2).6 Sample selection bias from filing and conviction is likely to downward-bias the sentencing disparity estimates reported below, but only slightly, because these initial disparities affect relatively few cases. I therefore do not correct the sentencing-stage estimates below for sample selection at these threshold stages.

2 For the filing and conviction analyses, the sample consists of cases charged or disposed of during that period.
3 I grouped certain closely related codes and subdivided certain drug codes based on a separate drug-type field. There were 123 arrest offenses after this recoding, and the results are robust to use of the original codes.
4 Except where other clustering is noted, all standard errors are clustered on arrest offense and district (combined), due to concern that local crime patterns or the U.S. Attorney’s Office’s priorities might introduce correlations. Results are robust to clustering on arrest offense or district alone.

2.2. The “Two-Part Model” of Incarceration Probability and Sentence Length

When estimating sentencing disparity, a threshold question is how to treat non-prison sentences such as probation or fines (18% of the sample). This question has been hotly debated in sentencing research.
The leading practice is to break sentencing into two decision processes, each estimated parametrically: whether to order incarceration and, if so, for how long (see, for example, Berk 1983). The theory is that non-prison sentences have no obvious “prison equivalent,” and moreover, some covariates might be more influential in the incarceration decision than the length decision or vice versa. A practical advantage is that constraining the length sample to positive-length cases allows log transformation without having to assign some arbitrary small value to the zeros.7 This is ideal because sentencing law is structured so that inputs to sentencing will generally have multiplicative effects: each Guidelines grid cell is a multiplier of the ones adjacent to it.

Although I prefer a different approach (discussed below), for comparability to the current literature, I begin with estimates for this “Two-Part Model” (TPM). Table 2, Column 3 shows the results of a logistic regression of an incarceration indicator on gender, arrest offense, criminal history, district, race, age, education level, U.S. citizenship, and the multidefendant case flag. The average male in the sample faces an 86% probability of incarceration; comparable females face a 74% probability, making them nearly twice as likely to avoid incarceration (26% versus 14%). Conditional on incarceration, men receive sentences that are approximately 34% longer.

The complication is that the gender disparity in the incarceration decision almost surely means that the length estimates are downward biased by sample selection.8 Criminologists have often responded to this problem with Heckman-style corrections (see Heckman et al. 1988; see Ulmer and Bradley for sentencing examples), but this approach is not ideal because there is no plausible exclusion restriction.9 In addition, the approach assumes that the estimand is the average treatment effect (the “ATE”) on the underlying population.
In this context, that is a strange object: the gender disparity in prison sentence length that would be observed in a hypothetical world in which all defendants had to go to prison. This thought exercise is unlikely to interest policymakers.

5 This sample consisted of arrestees facing some charge. Cases that were entirely declined were dropped because they often represent unknown outcomes (transfers to other authorities or districts). When declinations citing a favorable reason (such as lack of evidence) are included as zeros, the gender disparity stays significant.
6 Petty offense convictions and jury acquittals are rare, so this disparity is driven by dismissals by prosecutors.
7 The resulting estimates would be extremely sensitive to the choice of small value. Note that there are also a very small number of life sentences, which I code as 540 months based on life expectancy data.
8 The direction of bias is clear because the incarceration decision and the prison length decision are both driven by observable and unobservable factors affecting case severity. If selection-on-observables holds in the full sample, it almost surely will not hold in the sample of nonzero prison cases, because the incarceration regression indicates that conditional on the observed covariates, men are more likely to be incarcerated; that is, it takes less severe unobservables to push a given male case into the incarceration sample.
9 As Bushway, Johnson, and Slocum (2007) point out, the sentencing literature tends to ignore this problem.

If one is to follow the Two-Part Model at all, it is better instead to ask: If we went from treating everyone like women to treating everyone like men,
(1) what percentage of non-prison sentences would be replaced with prison, and
(2) among cases that already would have received prison sentences, how would the average length of those sentences change?
More formally, the quantities of interest are:

(1) $E(P_M \mid X) - E(P_F \mid X)$
(2) $E(Y_M \mid X, P_F = 1) - E(Y_F \mid X, P_F = 1)$

where $P$ indicates a prison sentence, $Y$ is prison sentence length, $M$ and $F$ denote the male and female treatment conditions, and $X$ is the covariate distribution for the population noted.10 Object (2), in my view, is of more policy interest than the full-population ATE, requiring no speculation about a world in which probation and fines were not possible. With the estimand framed this way, the selection bias problem is not that the estimation sample contains too few females, but that it contains “extra” males who would not have been incarcerated if they were female. If it were possible to identify who those extra males were, OLS regression in a sample excluding them would be an unbiased estimator of object (2). Unfortunately, while the number of extra males can be readily estimated based on the incarceration logit,11 they cannot be identified; $P_F$ is unobserved for males (see Lee, who discusses an analogous problem).

In Table 3, I apply varying assumptions as to which males were marginal to produce different trimmed-sample estimates. Table 3, Column 1 replicates the “male” coefficient on log prison sentence length from the full-sample OLS regression. Because sample selection bias is almost surely downward, this should be treated as a lower bound on the true sentence length disparity within the pool of cases that would have been subject to incarceration regardless of gender. Column 2 provides something roughly approximating an upper bound, based on a near-worst-case assumption about selection bias. The Column 2 sample trims the males with the lowest (most negative) individual influence on the “male” coefficient.12 In this case, the Column 2 length-disparity estimate is about 67%, approximately double the estimate for the untrimmed sample.

10 This notation assumes monotonicity, such that $P_M = 1$ whenever $P_F = 1$.
11 This assumes gender monotonically affects incarceration probability, a reasonable assumption: being male greatly increased that probability in every one of dozens of analyzed subsamples.
12 Lee (2009) proposes a similar trimming method for estimating bounds on the effect of a randomly assigned treatment when treatment monotonically affects attrition. In that case worst-case bounds can be more readily estimated; the trim that will raise the treatment effect estimate by the most is just the lower tail of the treated outcome distribution (see Lee 2009). The trim I conduct in Table 3, Column 2 is based on the same intuition. But rather than assuming random treatment, I assume selection on observables within the full sentenced sample, and use regression to estimate the number of “extra males” and to model the outcome. This assumption could certainly be challenged, as I discuss below, but it already underlies both parts of the TPM; my method simply gives a near-worst-case adjustment for the second-part estimate assuming that the first part is correct. When there are covariates, one cannot just trim the lower tail; rather, the trim is based on the observations’ influence on the partial effect of being male. Estimating a true upper bound would require trimming the group with the most negative joint influence on the “male” coefficient. Identifying that group is computationally impossible. But ranking observations by individual influence is easy and is, in practice, probably a “bad enough” assumption about sample selection to provide useful guidance as to its possible scope.

Columns 3 and 4 of Table 3 show results for samples trimmed based on a plausibly realistic (rather than worst-case) assumption about who the marginal males are.
The assumption is simply that they are those with short sentences; that is, gender is likelier to be the deciding factor in closer cases. The Column 3 sample trims the males with the very shortest nonzero sentences (one year or less), while the Column 4 sample picks them randomly from the bottom quarter of the distribution (two years or less). The estimates for these two trimmed samples are 63% and 47%, respectively.

This trimming exercise is not meant to “correct” sample selection bias, but rather to provide a general sense of its possible magnitude. Unfortunately, the potential bias here is large, which limits how informative the TPM can be. The TPM remains appealing when the disparity in incarceration probability is small, such that selection bias is likely minor; for this reason, Rehavi and Starr (2012a) used it to assess racial disparity. In the gender context, however, more useful guidance can be found using other methods.

2.3. Inverse Propensity-Score Weighting Estimates of Gender Disparities

The sample selection problem described above would not exist but for the choice to model the determination of sentences as two distinct decision processes, a choice that is not compelled by theory.13 I propose a simpler approach: keeping non-prison sentences in the sample for the length-disparity estimates, and treating them as zeros.

While the Two-Part Model dominates the sentencing literature, a substantial minority of the literature rejects it. Researchers following the minority approach typically instead treat sentencing as a single process in which the non-prison cases are censored, applying a Tobit model that estimates average disparity in an underlying latent variable (see Tobin 1958; see Sarnikar et al.; Bushway and Piehl; Kurlychek and Johnson; and Albonetti for sentencing examples). This approach avoids the sample selection concern, but raises other practical problems.
The Tobit is not robust to violations of its assumptions of normality and homoskedasticity (see, for example, Arabmazar and Schmidt 1982; Cameron and Trivedi 2010), and in this sample, the Tobit decisively fails specification tests. Moreover, while the Tobit allows researchers to avoid assigning a specific value to the non-prison sentences, they still must choose a censoring point below which their value is assumed to fall. This choice is arguably equally arbitrary, and if the length variable is log-transformed, it will have a big effect on the Tobit estimates.14

13 Bushway and Piehl (2001) provide strong reasons that a single-decision model (in particular, the Tobit) is a better fit to the Guidelines process, in which zeros are just values in the lower end of the sentencing grid.
14 For instance, using a lower limit of half a day in the Tobit log prison model (and the same covariates as in the TPM above) produces a gender disparity estimate of 128%, while a limit of one month produces an estimate of 72%. Either limit is theoretically defensible, as are many others. While the very lowest observed nonzero sentence is one day, only 0.3% are below one month. One might reasonably set the limit to censor these cases, to avoid giving excessive weight to large multiplicative differences between trivially short sentences.

The approach I propose is conceptually simpler than either the Tobit approach or the Two-Part Model, and avoids the practical weaknesses of both. If incarceration disparities are the outcome of policy interest, then there is nothing unknown about the value of non-prison sentences: they are correctly valued at zero. The main practical drawback of including them is that it precludes log transformation, but this functional form concern is only a problem for parametric estimation. I instead estimate the average length disparity in months by inverse propensity score weighting (“IPW”), without specifying any functional relationship between
the covariates and the outcome variable. I then extend this method to the distribution, allowing assessment of disparities in incarceration probability as well as other possible heterogeneity in gender effects on sentences of different lengths.

The IPW estimates of average gender disparities in sentence length are given in Table 4. The probability of being male for each observation, $E(M \mid X_i)$ (the “propensity score”), is first estimated by a logistic regression of “male” on the covariates $X$: gender, arrest offense, criminal history, race, age, education level, U.S. citizenship, and the multi-defendant case indicator.15 Estimates of average gender disparities are then produced via weighted regression where the weights are inverse functions of the propensity score. To refer to the estimands, I use the common language of “treatment effects,” where “treatment” refers to being male. But note that for these “effects” to be given a causal interpretation, one must assume there are no confounding variables; I return to this point below.

In Column 1 of Table 4, I estimate the overall average gender disparity in sentence length conditional on the pre-charge covariates. This “average treatment effect” (ATE) represents the difference between two counterfactuals: the mean sentence if everybody were treated like males and the mean sentence if everybody were treated like females (see DiNardo 2002).16 Table 4, Columns 4 and 7 reflect separate estimates of the average effects of gender disparity on male and female sentences. The “average treatment effect on the treated” (TOT) reflects the estimated effect of being male on male sentences, and is estimated by comparing the observed male average to a reweighted female average (Col.
4).17 After this reweighting, the female endowments of covariates are similar to those of the males, so the reweighted female mean can be interpreted as a counterfactual mean if males were treated like females. The “average treatment effect on the untreated” (TUT) is conversely estimated by reweighting the males, and represents the counterfactual increase in sentence if females were treated like males (Col. 7).

As Table 4 shows, even after reweighting, the average gender gaps in sentence length are strikingly large. The overall average disparity (the ATE) in Column 1 is 23 months, which translates into a 63% increase in sentence length. When measured in months, gender appears to have a bigger effect on males than females (compare Columns 4 and 7): being male increases male sentences by 25 months, and would increase female sentences by 15 months. But this difference is mostly because of a higher baseline average: in percentage terms, the TOT and TUT are not very different (64% versus 61%).

15 District fixed effects, which were included in the Two-Part Model, are not included in the weights. When reweighting, parsimony makes it easier to balance the most important variables, and gender composition does not vary much by district in any event. The results are robust to including the districts.
16 The weights are given by $1/(1 - E(M \mid X_i))$ for female observations and $1/E(M \mid X_i)$ for males, before rescaling to average 1 (see Busso, DiNardo, and McCrary 2008).

A drawback of propensity score reweighting is its vulnerability to the problem of limited overlap between the male and female samples (see Busso, DiNardo, and McCrary 2008). Although the large sample size reduces this concern, women are only 19% of the sample and are thinly represented in certain offenses and high criminal history categories.18 The reweighting of the female distribution risks giving unduly high weight to women with unusual covariate values. In Table 4, Columns 2 and 5, I report the ATE and TOT for a
17 The weights are E(M|Xi )/(1-E(M|Xi) for female observations, before rescaling to average 1. 18 See Figure 1a for the propensity score distribution. 8 Starr—Estimating Gender Disparities in Federal Criminal Cases sample that eliminates those problematic covariate combinations by trimming extreme propensity score values (see, for example, Heckman et al. 1998).19 The drawback with this method is that the sample to which the estimates apply is not very intuitively or transparently defined. In Columns 3 and 6, I report the ATE and TOT for an alternate sample that excludes the highest three criminal history categories.20 Both trimming strategies produce gender disparity estimates that are fairly similar in percentage terms to the full-sample estimates (compare Columns 1 through 3 and Columns 4 through 6). I report only the full-sample results for the TUT (the effect of gender on women), because estimating it depends on reweighting only the males, and no males have propensity scores anywhere near zero. For this reason, as I proceed below to analyze the gender disparity in more detail, I focus on the counterfactual effects if women were treated like men. The effects of gender on men and women are of equal policy interest, but analyzing the TUT is simpler because the full sample can be used without limited-overlap concerns. Table 5 accordingly shows TUT estimates for subsamples and alternate specifications. Column 1 replicates the main estimate from Table 4 for comparison purposes. Columns 2 and 3 show estimates for two large offense-type categories: drug offenses (Column 2) and property, fraud, and regulatory offenses (Column 3). In percentage terms the effects are similar. The disparity is likewise almost identical in percentage terms before and after the watershed Booker decision (Columns 4 and 5).21 It is smallest for non-parents and largest for single parents (51.6% versus 67.3%; compare Columns 6-8). 
It is larger for defendants in multi-defendant cases than for sole defendants (66% vs. 51.2%, Columns 910), much larger among blacks than non-blacks (74% vs. 51.1%, Columns 11-12), and slightly larger in states without federal women’s prisons (Columns 13-14). Many of these subsample comparisons are useful in assessing possible causal theories for the unexplained gender gap, and they will be further addressed in the Discussion. The remainder of Table 5 shows the robustness of the TUT estimates to alternate specifications of the gender-propensity model. Columns 15 and 16 show that the TUT is unchanged by the addition of a set of flags for case characteristics mentioned in a text field based on the arresting officers’ notes (in 2001-2007, the years the field is available). The flags are for mentions of guns, other weapons, drug seizures, official victims, minor victims, conspiracy and racketeering. Columns 17 through 20 show that the estimates are robust to adding controls for marital and parental status and defense counsel type. Disparities decline slightly when controlling for pleas and time elapsed before conviction (Col. 21). The gender disparities in drug cases decline slightly when drug quantity seized at arrest, as recorded in the EOUSA investigation files, is added to the controls. This check could only be performed for arrests before 2004 because of data limitations (compare Columns 22 and 23).22 19 The propensity-score cutoff (approximately 0.93) is optimized to minimize variance (see Crump et al. 2009). The trim drops about 4% of women and 21% of men from the sample. 20 The main sample already excludes the most male-dominated crime categories. Adding the criminal history constraint does not entirely eliminate the limited overlap problem, but mitigates it considerably (see Figure 1b). 21 This does not preclude the possibility that Booker changed disparities; this analysis does not seek to disentangle Booker’s causal effects from longer-term trends. 
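The reweighting behind the ATE, TOT, and TUT estimates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Tables 4 and 5: the propensity scores are taken as already estimated, the nine-observation sample is invented, and the function names `ipw_weights` and `reweighted_gap` are hypothetical. The female weights follow notes 16 and 17. The male weights for the TUT, (1-E(M|Xi))/E(M|Xi), are not stated in the notes and are assumed here as the mirror image of the TOT weights. Rescaling to average 1 is applied within each gender group, which is also an assumption.

```python
import numpy as np


def ipw_weights(male, pscore, estimand="ATE"):
    """Inverse-propensity weights for one estimand (hypothetical helper).

    male   : 1/0 (or True/False) indicator, where "treatment" = being male
    pscore : estimated probability of being male given the covariates
    """
    male = np.asarray(male, dtype=bool)
    p = np.asarray(pscore, dtype=float)
    w = np.ones_like(p)
    if estimand == "ATE":
        w[male] = 1.0 / p[male]                    # note 16, males
        w[~male] = 1.0 / (1.0 - p[~male])          # note 16, females
    elif estimand == "TOT":
        w[~male] = p[~male] / (1.0 - p[~male])     # note 17, females only
    elif estimand == "TUT":
        w[male] = (1.0 - p[male]) / p[male]        # assumed mirror of note 17
    else:
        raise ValueError("estimand must be 'ATE', 'TOT', or 'TUT'")
    # Rescale to average 1 within each gender group (an assumption).
    for group in (male, ~male):
        w[group] = w[group] / w[group].mean()
    return w


def reweighted_gap(y, male, w):
    """Weighted male mean minus weighted female mean of the outcome y."""
    y = np.asarray(y, dtype=float)
    male = np.asarray(male, dtype=bool)
    w = np.asarray(w, dtype=float)
    return (np.average(y[male], weights=w[male])
            - np.average(y[~male], weights=w[~male]))


# Invented sample with one binary covariate: the first four observations
# share a propensity score of 0.5, the last five a score of 0.8.
male = [1, 1, 0, 0, 1, 1, 1, 1, 0]
pscore = [0.5, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8, 0.8]
sentence = [15, 15, 10, 10, 35, 35, 35, 35, 30]   # months

raw_gap = reweighted_gap(sentence, male, np.ones(len(sentence)))
gaps = {e: reweighted_gap(sentence, male, ipw_weights(male, pscore, e))
        for e in ("ATE", "TOT", "TUT")}
```

In this invented sample the raw gap is about 11.7 months, but men are 5 months above women within each covariate cell, so all three reweighted gaps equal 5 months. The difference between the raw and reweighted figures is the part of the gap explained by the covariate.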
Finally, a comparison of Column 1 and Column 24 of Table 5 illustrates the importance of the choice to condition on arrest offense rather than on the end result of sentencing fact-finding. The Column 24 reweighting substitutes the final Guidelines offense level for the arrest offense, and the estimated disparity is reduced by 63%. This comparison suggests that by conditioning on an endogenous variable and ignoring gender disparities introduced earlier in the justice process, the current literature may have substantially understated the size of the gender gap.

22 Results are also robust to the use of the original ungrouped arrest codes; the addition of district controls, Hispanic ethnicity, and county-level controls for poverty rate, unemployment, per capita income, and crime rate; and various exclusions from the sample: cases in which the indictment was issued before the arrest, cases from the South, and arrests by each of the two enforcement agencies (the FBI and the DEA).

In Figure 2, I extend the reweighting method to estimate the effect of gender on the distribution of sentences for females following the method proposed by DiNardo, Fortin, and Lemieux (1996). The white and black bars reflect the observed distribution of sentence lengths for male and female defendants, respectively; non-prison sentences have their own bin and need not be assigned a numeric value. The checkered bars represent the counterfactual distribution if females were treated like males. Comparison of the checkered to the black bars shows large unexplained gaps throughout the distribution. The unexplained gap in the share sentenced to non-prison sentences (about 11 percentage points) is similar to the regression estimate in Table 2. The gap is not confined to the low end—the whole reweighted male distribution is shifted to the right relative to the female distribution.

2.4. Decomposing the Gender Gaps

The estimates presented above represent the aggregate disparities introduced throughout the post-arrest justice process, raising the further question of when in the justice process those disparities emerge. Table 6 shows a sequential decomposition of the observed average gender disparity into components explainable by pre-charge covariates and by each subsequent stage of the process: charging, charge-bargaining, sentencing fact-finding, and sentencing. The method is a sequence of inverse-propensity score reweightings, in which new variables are added to the propensity score estimation at each step (see, for example, Altonji, Bharadwaj, and Lange 2008; DiNardo, Fortin, and Lemieux 1996).

In this part of the analysis, data limitations require separate assessment of drug and non-drug cases. For non-drug crimes, the initial and final charges were coded with the statutory minimum, maximum, and Guidelines measures described above. But in drug cases, the AOUSC charge data are too ambiguous to permit that coding; the same statutory subsections encompass a vast array of drug types, quantities, and sentences. The only usable measure of statutory severity available for drug cases is the mandatory minimum for the crime of conviction, which the Sentencing Commission records. Thus, in drug cases I cannot disentangle the effects of initial charging and subsequent charge-bargaining. The mandatory minimum variable represents the combined effect of those stages.

The non-drug decomposition is shown in Panel A of Table 6. Column 1 shows the raw observed gender gap to be decomposed. In Column 2, the men have been reweighted based on pre-charge covariates. Columns 3, 4 and 5 sequentially add the initial charge severity measures, the conviction measures, and the final offense level (the product of sentencing fact-finding). The drug decomposition (Panel B) has one stage fewer: the conviction mandatory minimum substitutes for the separate charging and conviction variables. The explanatory value attributed to each stage is the change in the unexplained gender gap when one adds that stage’s measures. What remains after the final reweighting is attributed to the sentencing decision. In the last two lines of each panel, I express each component as percentages of the raw observed gender gap and of the gender gap that was unexplained by the pre-charge covariates. That is, the last line decomposes the gender disparity that appears to be introduced during the criminal justice process.

This method of decomposition is path-dependent: explanatory value is preferentially attributed to the covariates that are added first. Path-dependence is often a drawback to sequential decomposition, because in many contexts, when multiple correlated covariates together explain a certain portion of an outcome gap, there is no theoretical reason to “blame” one over the others (see Fortin, Lemieux, and Firpo 2011; DiNardo, Fortin, and Lemieux 1996). But here path-dependence is desirable, because the justice process is itself path-dependent: earlier decisions constrain later ones.23 The decomposition tracks the divergence of men’s and women’s fates as the process advances, so it would not make sense to attribute to a later stage a disparity that already existed. When there is a natural ordering like this, sequential decomposition is appropriate (see Altonji, Bharadwaj, and Lange 2008).

The decompositions show that significant new disparity favoring women is introduced at every stage of the justice process, but sentencing fact-finding is especially crucial.
In non-drug cases, an eight-month gender gap remained unexplained after reweighting by arrest offense and the other pre-charge covariates—this is the gap attributed to the justice process as a whole. Initial charging and charge-bargaining contribute about 9% and 4% of the gap, respectively; Guidelines fact-finding explains 60%, leaving 27% for the final sentencing stage to explain. In drug cases, the mandatory minimum can explain 29.5% of the 23-month gender gap attributed to the justice process. Guidelines fact-finding can explain one third, leaving 37% attributed to the final sentencing decision.

In Figures 3a through 3d, I show a similar sequential decomposition of the sentencing distributions (see DiNardo, Fortin, and Lemieux 1996). Figure 3a shows the distribution of non-drug sentences observed for males and females and, between them, the distributions produced by the same series of reweightings described above. Each step in the sequence makes the male distribution look somewhat more like the female. Figure 3b presents these results in a way that (while it does not show the underlying distributions) allows the procedural sources of the gaps in the distribution to be more readily discerned. The full height of each bar represents the gap in the cumulative distribution at the denoted sentence threshold after reweighting by the pre-charge covariates—that is, the gap in the probability of getting a sentence exceeding the threshold. The patterned sections decompose these gaps into charging, charge-bargaining, fact-finding, and sentencing components. Figures 3c and 3d repeat these exercises for drug cases. The decompositions again show the central role of sentencing fact-finding, especially in explaining gaps higher in the length distribution. Judges’ final sentencing decisions appear to be more important in explaining disparities at the lower end, particularly in the incarceration decision (Figs. 3b, 3d).
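The arithmetic of the sequential decomposition can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions: the function name `sequential_decomposition` is hypothetical, and the unexplained gaps in `gaps_months` are back-computed from the rounded non-drug shares reported above (9%, 4%, 60%, and 27% of an eight-month gap), so they are not entries from Table 6. Each stage is credited with the drop in the unexplained gap when its measures are added, and whatever remains after the last reweighting is attributed to the final sentencing decision.

```python
def sequential_decomposition(stages, unexplained_gaps):
    """Attribute an unexplained gap to ordered stages (hypothetical helper).

    unexplained_gaps[0] is the gap left after reweighting on pre-charge
    covariates; unexplained_gaps[i] is the gap left after also adding the
    measures produced by stages[i - 1]. The last stage receives the residual.
    """
    if len(stages) != len(unexplained_gaps):
        raise ValueError("need exactly one unexplained gap per stage")
    baseline = unexplained_gaps[0]
    results = []
    for i, stage in enumerate(stages):
        if i + 1 < len(unexplained_gaps):
            # Explanatory value = change in the unexplained gap at this step.
            contribution = unexplained_gaps[i] - unexplained_gaps[i + 1]
        else:
            # Nothing further to add: the remainder goes to the last stage.
            contribution = unexplained_gaps[i]
        results.append((stage, contribution, contribution / baseline))
    return results


stages = ["charging", "charge-bargaining", "fact-finding", "sentencing"]
# Illustrative values in months, back-computed from the rounded shares.
gaps_months = [8.00, 7.28, 6.96, 2.16]

decomposition = sequential_decomposition(stages, gaps_months)
for stage, months, share in decomposition:
    print(f"{stage:18s} {months:5.2f} months  {share:6.1%}")
```

Because each stage is credited only with the change that occurs when its measures are added, reordering the stages would change the attribution. That is the path-dependence discussed above, which is appropriate here only because the justice process has a natural ordering.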
Because fact-finding and Guidelines departures are both stages in which men’s and women’s outcomes appear to diverge substantially, it is worth inquiring whether any particular findings of fact and departures appear to be key factors. Table 7 shows the explanatory value attributed to each of several findings and departures when they are added to the mean decompositions from Table 6. These variables were not added sequentially with one another because there is no natural ordering among them; each was added independently. If they are correlated, the sum of the shares reported likely overstates their collective importance.24 Each share is thus best interpreted as the maximum the variable can explain.

The factors listed in Table 7 were assessed because they are factors that one might expect to vary by gender. Their relevance to possible causal theories for gender disparity is addressed in the Discussion below. Other than the factors analyzed here, sentencing fact-finding involves a vast array of context-specific inquiries. Likewise, other stated reasons for departures vary widely, and are often vague, such as “the interests of justice.”

23 For instance, the initial charges define the range of possible outcomes to charge-bargaining; charges are almost never added (and in most cases are not dropped).

3. Discussion

The unexplained gender disparities identified above are large—much larger than those estimated via the prevailing method of conditioning on presumptive sentence. The key interpretive question is why these gaps exist—and, in particular, whether unobserved differences between men and women might justify them. One cannot instrument for inborn traits or manipulate them, so estimation of demographic disparities always risks omitted variables bias, and one must be cautious about inferring gender discrimination. Still, some often-advanced causal theories have testable implications.
In this Part, I consider the leading theories suggested in the literature and in my informal conversations with criminal lawyers.

3.1. Unobserved differences in offense severity. One obvious question is whether the crimes differ in ways not captured by the arrest offense codes. The arrest offense is not a perfect proxy for underlying criminal conduct, and if it overstates the severity of female conduct relative to that of men, that might explain some of the observed disparity. In particular, one might wonder whether the disparities introduced at sentencing fact-finding merely represent the process’s proper accounting for nuanced differences in facts within offense categories, which is, after all, fact-finding’s purpose.

Unobserved differences naturally cannot be ruled out, but there are good reasons to doubt that they explain much of the observed disparity. First, the observable covariates are detailed, capturing considerable nuance. They include not just the 430 arrest codes and the multi-defendant flag (a proxy for group criminality, an important severity criterion), but also additional flags based on the written offense description (see Table 5, Columns 15-16). Second, the disparities are similar across all case types (and across arresting agencies), suggesting it is not a matter of a few crimes being “worse” when men commit them. Such differences would have to be prevalent across a variety of crimes and agencies to explain the result.

Third, there is some reason to believe unobserved divergences between the arrest offense and actual criminal conduct may bias disparity estimates downward. If police tend to treat men more harshly, one might expect them to record arrest offenses that overstate men’s culpability relative to women’s. The empirical evidence on gender and policing is limited.
Traffic stop studies reach divergent conclusions about whether there is bias against men (compare Rowe 2009 with Persico and Todd 2006), but at least do not suggest bias against women. A study covering a wider range of crimes (Stolzenberg and D’Alessio 2004) found that, other factors equal, reported crimes with female offenders are substantially less likely to lead to arrests, a result that the authors interpret as showing police leniency toward women.

24 This is almost surely the case with the fact-finding results in drug cases, where the shares reported in Table 7 add up to slightly more than the total months of disparity attributed to fact-finding in Table 6.

Nonetheless, there are some easily imaginable differences between male and female cases that might not be observed. For instance, men might well commit violent crimes with greater force, a difference not fully captured by the arrest code (beyond the labeling of some assaults as “aggravated”). There are fewer obvious potential differences in property, regulatory, or drug offenses, but perhaps women might commit smaller-scale offenses. Scale is captured to some degree by the arrest offense codes (for instance, pickpocketing versus vehicle theft), but not entirely—for instance, wire fraud could be in any amount. Findings of fact on loss value appear capable of explaining up to 20% of the otherwise-unexplained gap in non-drug crimes (Table 7). Unfortunately, there is no way to tell how much of that fact-finding difference reflects true underlying differences in the facts.

With respect to drug quantity, the data are more informative. Drug quantity and type determine eligibility for mandatory minimums, which explain 29.5% of the post-arrest gender gap in drug cases (Table 6); related Guidelines adjustments can explain a further 3% (Table 7).25 For arrests before FY 2004, the drug quantity and type seized at arrest is recorded in the EOUSA investigation file.
Within that pool, there are substantial gender disparities in the drug quantity found at the sentencing stage, even after controlling for drug quantity at arrest and the other standard covariates. The estimated gender gap in sentences in pre-2004 drug cases is only slightly reduced by adding arrest-stage drug quantity controls to the reweighting (Table 5, Cols. 22-23). These findings suggest that quantity findings at sentencing diverge from the underlying facts in ways that differ by gender.

Another key factor affecting drug sentencing is the “safety valve” loophole built into the drug mandatory minimum statutes and the related Guidelines safety valve. The safety valves can explain up to 9% of the sentence gap in drug cases, and one might wonder whether this reflects “real” case differences. Eligibility for the safety valve is defined by statute, and cases can be coded as seemingly eligible or not based on the case’s observed characteristics: criminal history, certain offense features, lack of aggravating role, and lack of obstruction. Conditional on apparent eligibility, women are significantly more likely to get safety-valve reductions. This is only suggestive evidence of disparate treatment, however, because the observables do not perfectly track the eligibility requirements.26

25 Drug quantity findings drive both the application of mandatory minimums and the more nuanced gradations under the Guidelines. The 3% figure in Table 7 reflects only the latter component: the additional gender disparity explained by quantity findings after mandatory minimums had already been accounted for.
26 The key source of discretion in safety valve application is the prosecutor’s choice whether to characterize the defendant as having been fully truthful in describing the crime (see 18 U.S.C. 3553(f)). Beyond the absence of obstruction and the presence of acceptance-of-responsibility reductions, discussed above, the data do not provide a way to assess whether the defendant was in fact truthful.

3.2. The “girlfriend theory.” In group offenses, another factor affecting culpability is relative role. Women might be viewed as minor players—perhaps mere accessories of their male romantic partners. Prosecutors and judges may consider such women less dangerous, less morally culpable, or useful sources of testimony. While leniency may be appropriate in such cases (see Raeder), some lawyers I spoke to suggested that such perceptions are not always justified by the facts; in cases involving couples, it may just be assumed that the female is the “follower.” The data provide no way to test whether role perceptions are well founded, but they do suggest that they can partially explain the gender gap.

Other than its implications for cooperation departures, the “girlfriend theory” has two testable implications: first, the gender gap should be larger in multi-defendant cases, and second, part of it should be attributable to sentencing adjustments for role in the offense. Both predictions are supported by the data. The gender gap is significantly larger in multi-defendant cases: 66% compared to 51% (Table 5). Approximately 14% of the otherwise-unexplained disparity in non-drug cases and 20% in drug cases can potentially be explained by role adjustments (Table 7). The girlfriend theory appears to explain part, but not most, of the gender gap; it is hard for it to explain the large disparities that persist even in single-defendant cases.27

3.3. Parental responsibilities. Another possibility is that prosecutors and/or judges worry about the effect of maternal incarceration on children. The estimates are robust to controls for marital status and number of dependents, but these variables do not capture all differences in care responsibilities, including custody status.
Other research shows that female defendants are far more likely than men to have primary or sole custody, and incarcerating women more often results in foster care placements (see Hagan and Dinovitzer for a review of the literature; Koban 1983). In an experiment asking judges to give hypothetical sentences based on short vignettes, Freiburger (2010) found that mentioning childcare reduced judges’ probability of recommending prison, but mentioning financial support for children did not. The childcare theory suggests that one would expect to see the largest gender disparities among single parents, and the smallest among defendants with no children. That expectation is borne out by the data: compare Table 5, Columns 6-8. The TUT estimate is still over 50% among childless defendants, however, so the childcare theory appears not to fully explain the gender gap, but it probably explains part of it.28

On the other hand, the decompositions in Table 7 indicate that, at most, between 1% and 2% of the sentencing gap can be explained by disproportionate invocation of the official “family hardship” departure in the Sentencing Guidelines. Women in the sample receive that departure at three times the rate of men: 2.4% of cases versus 0.8%. But because the departure is so rare for both genders, it cannot explain much of the overall disparity. This is presumably because it requires “extraordinary circumstances,” and judges typically hold that single parenthood does not suffice (see U.S.S.G. 5H1.6; Raeder 2006). Likewise, the main federal sentencing statute, 18 U.S.C. 3553, does not mention family hardship, and the Guidelines affirmatively instruct that family ties are “not ordinarily relevant.” Federal sentencing law is not designed to provide much accommodation for defendants’ children.

In short, the family status-gender interaction appears to be more substantial than the one formal legal mechanism for accommodating family hardship can explain. Prosecutors and/or judges seem to use their discretion to accommodate family circumstances in sub rosa ways—but not for male defendants. Among single men, conditional on observables, having children significantly increases sentences, and among married men, children make no significant difference. There are many competing arguments concerning whether family status is a proper sentencing consideration (see, for example, Markel, Collins, and Leib 2007), and I will not address them here. However, if family hardship is a legitimate consideration, one might expect it to play at least some role in men’s cases as well. Numerous studies have suggested that paternal incarceration harms children even when the father was already a noncustodial parent (see Hagan and Dinovitzer for a review).

27 The formal departure for duress or coercion (U.S.S.G. 5K2.12), while given to women at five times the rate of men (0.4% versus 0.08%), is far too rare to be a significant explanation for the gender gap.
28 The gender gap is also slightly smaller in states with federal women’s prisons (see Table 5, Columns 13-14), which may suggest that judges do not want to move women far from their families, although this is not a dramatic difference and other characteristics of those seven states might explain it.

3.4. Cooperativeness. Another often-advanced theory is that female defendants receive leniency because they are more cooperative with the government. These data provide, at best, limited support for that theory. Conditional on observables, women are modestly but significantly more likely to receive downward departures for cooperation in another case (20% versus 17%), have higher guilty plea rates (97.5 vs. 96.2%), and have their cases resolved about two weeks sooner on average (a 10% difference). But the interpretation of these differences is not clear.
Plea rates, timing, and cooperation are all endogenous, turning on the deals being offered. Moreover, women could be being rewarded more for the same level of cooperation; the actual assistance they provide is unobserved. On all four charge- and conviction-severity scales, women receive modestly but significantly larger charge reductions in plea-bargaining than men do, and far more favorable findings of fact, suggesting that they may be offered better factual stipulations. If women really are inherently more cooperative (or risk-averse), one might think prosecutors could get away with offering them lesser discounts, and still induce frequent guilty pleas. Yet the opposite appears to be true. Whatever the merits of these indicators of cooperativeness, they seem to explain only fairly modest portions of the gender gap. Adding a plea and elapsed-time indicator to the reweighting reduces the unexplained disparity by about 8% (Table 5, Col. 21). Disparities in departures for cooperation can explain up to 9% of the otherwise-unexplained gap in drug cases, but no significant share in non-drug cases (Table 7). In addition, the “acceptance of responsibility” reduction and the obstruction of justice enhancement do not explain any substantial portion of the gender gap; in non-drug cases these offset one another, while in drug cases neither is significant (Table 7). Unlike that of the family hardship departure, the limited explanatory power of these adjustments and departures cannot be attributed to rarity or tight legal constraints—all are very common. Formal mechanisms for recognizing women’s purportedly greater cooperativeness are readily available, and yet they explain only a modest share of the disparity in drug cases and none in non-drug cases. 3.5. Mental health, addiction, abuse, and other sympathetic life circumstances. Another theory is that female defendants may have more troubled life circumstances, such as poverty, mental illness, addiction, and abuse histories. 
If so, they may be perceived as less morally culpable or as candidates for rehabilitation. Criminal defendants often come from difficult backgrounds. This could well be disproportionately true for females; perhaps because women more rarely commit crime, those who do are likelier to be in the upper tail of the life-hardship distribution. Prisoner studies show more self-reported mental illness and prior abuse among women (see James and Glaze 2006; Harlow 1999).

Socioeconomic status is not unobserved, however, and does not seem to explain the gender gap. The main specification includes education, and the results are robust to adding county-level socioeconomic controls and defense counsel type (a strong proxy for poverty). But mental health, addiction, and abuse are not observable unless judges cite them as the basis for a departure. The Guidelines permit departures for “unusual” mental and emotional conditions (U.S.S.G. 5H1.3) and for “significantly reduced mental capacity” (U.S.S.G. 5K2.13). They prohibit departures for “disadvantaged upbringing” (U.S.S.G. 5H1.12) and in most cases for addiction (U.S.S.G. 5H1.4), although judges have more flexibility to disregard these restrictions after Booker. Together, all such cited bases for departures explain only between 1 and 2% of the otherwise-unexplained gap in sentence length; they are too rare to explain more. If prosecutors or judges take such factors into account in informal ways (as they seem to with family hardship, above), it would be unobservable.

3.6. Race-Gender Interactions. Columns 11-12 of Table 5 show that the gender gap is substantially larger among black than non-black defendants (74% versus 51%). The race-gender interaction adds to our understanding of racial disparity: racial disparities among men significantly favor whites,29 but among women, the race gap in this sample is insignificant (and reversed in sign).
The interaction also offers another theory for the gender gap: it might partly reflect a “black male effect”—a special harshness toward black men, who are by far the most incarcerated group in the U.S. This possibility is not really an “explanation” for the gender gap, much less a reason to worry less about it—but it might cause policymakers to understand it differently, as an issue of intersectional race-gender disparity. This theory only goes so far, however—the gender gap even among non-blacks is over 50%, far larger than the race gap among men.

3.7. Gender discrimination: preference-based and statistical. Although several of the factors above appear to explain portions of the gender gap, that gap is large enough that it is plausible that gender discrimination also contributes. If so, several types of discrimination could be at play. The theoretical literature suggests “chivalry” and “paternalism” (see, for example, Franklin and Fearn). Another theory is selective sympathy: perhaps circumstances like family hardship or “bad influence” appear more sympathetic when it is women who are in them. Psychology experiments have found that attributions of blame and credit are often filtered through expectations that males are “agentic” and active and women are “communal” and passive (see Eagly, Wood, and Diekman for a review). If so, prosecutors or judges might more readily credit societal or situational explanations for females’ crimes than for males’.

Statistical discrimination is also possible. Perhaps the likeliest such mechanism is that prosecutors or judges might assume men are more dangerous than women. Studies generally find that women have lower recidivism rates, though some of the difference may be explained by characteristics that this study controls for (see Gendreau, Little, and Goggin for a meta-analysis). I do not have recidivism data to test whether statistical discrimination might be “rational” here.
Note that if recidivism risk perceptions are based on individual information about the offender (not based on gender), then it is perfectly permissible to consider them. But punishment decisions based on statistical generalizations about men and women are unconstitutional. The Supreme Court has repeatedly ruled that reliance on gender stereotypes is impermissible even if those stereotypes are statistically well founded (see J.E.B. v. Alabama ex rel. T.B., 511 U.S. 127).

29 Rehavi and Starr (2012) explore these more extensively, finding a 10% unexplained disparity.

Conclusion

This study finds dramatic unexplained gender gaps in federal criminal cases. Conditional on arrest offense, criminal history, and other pre-charge observables, men receive 63% longer sentences on average than women do. Women are also significantly likelier to avoid charges and convictions, and twice as likely to avoid incarceration if convicted. There are large unexplained gaps across the sentence distribution, and across a wide variety of specifications, subsamples, and estimation strategies. The data cannot disentangle all possible causes of these gaps, but they do suggest that certain factors (such as childcare and offense roles) are partial but not complete explanations, even combined.

These estimates are much larger than those of prior studies, which have probably substantially understated the sentence gap by filtering out the contribution of pre-sentencing discretionary decisions. In particular, this study highlights the key role of sentencing fact-finding, a prosecutor-dominated stage that existing disparity research ignores. Mandatory minimums—prosecutors’ most powerful tools—are also important contributors to gender gaps in drug sentencing. Understanding the relative roles of prosecutors and judges is important.
Gender disparities have been cited to support constraints on judicial discretion, including when the Sentencing Guidelines were adopted. But such constraints typically empower prosecutors, so if prosecutors drive disparities, they could backfire.

Policymakers might simply be untroubled by leniency toward women. They are a small minority of defendants, and when disparities favor traditionally disempowered groups, they might raise fewer concerns. But the gender disparity issue need not be framed in terms of how women are treated. One could ask: why are men treated so harshly, if women are (apparently) treated otherwise? It is hard to dismiss this question as trivial: over two million American men are behind bars. While males generally are not a disadvantaged group, men in the criminal justice system generally are; they are mostly poor and disproportionately nonwhite. The especially high rate of incarceration of men of color is a serious social concern, and gender disparity is one of its key dimensions.

From this perspective, one might think differently about some of the possible explanations for the gender gap. Most defendants of both genders have suffered serious hardship, have mental health or addiction issues, have minor children, and/or have “followed” others onto a criminal path. Sentencing law provides very limited formal mechanisms to account for such factors, which is probably why, with women, they appear to mostly be considered sub rosa. If prosecutors, judges, and legislators are comfortable with those factors playing a role in the sentencing of women, then perhaps it is worth explicitly reconsidering their place in criminal sentencing more generally.

Reference List

Albonetti, Celesta A. 1997. “Sentencing Under the Federal Sentencing Guidelines.” Law and Society Review 31:601-634.
Altonji, Joseph G., Prashant Bharadwaj, and Fabian Lange. 2008. “Changes in the Characteristics of American Youth: Implications for Adult Outcomes.” Working Paper no. 13883. National Bureau of Economic Research, Cambridge, Mass.
Alschuler, Albert W. 2005. “Disparity: The Normative and Empirical Failure of the Federal Guidelines.” Stanford Law Review 58:85-118.
Arabmazar, Abbas, and Peter Schmidt. 1982. “An Investigation of the Robustness of the Tobit Estimator to Non-Normality.” Econometrica 50:1055-63.
Ashcroft, John. 2003. “Department Policy Concerning Charging Offenses, Disposition of Charges, and Sentencings.” Memorandum, September 22.
Baker, Scott, and Claudio Mezzetti. 2001. “Prosecutorial Resources, Plea Bargaining, and the Decision to Go to Trial.” Journal of Law, Economics, and Organization 17:149-67.
Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data.” American Sociological Review 48:386-98.
Bibas, Stephanos. 2009. “Prosecutorial Regulation Versus Prosecutorial Accountability.” University of Pennsylvania Law Review 157:959-1016.
Breyer, Stephen. 1988. “The Federal Sentencing Guidelines and the Key Compromises Upon Which They Rest.” Hofstra Law Review 17:1-50.
Bushway, Shawn, and Anne Morrison Piehl. 2001. “Judging Judicial Discretion: Legal Factors and Racial Discrimination in Sentencing.” Law and Society Review 35:733-67.
Bushway, Shawn, Emily Owens, and Anne Morrison Piehl. 2012. “Sentencing Guidelines and Judicial Discretion: Quasi-experimental Evidence from Human Calculation Errors.” Journal of Empirical Legal Studies 9:291-319.
Bushway, Shawn, Brian D. Johnson, and Lee Ann Slocum. 2007. “Is the Magic Still There? The Use of the Heckman Two-Step Correction for Selection Bias in Criminology.” Journal of Quantitative Criminology 23:151-78.
Busso, Matias, John DiNardo, and Justin McCrary. 2009. “Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects.” Working paper. University of Michigan, Ann Arbor, Mich.
Cameron, Colin, and Pravin K. Trivedi. 2010. Microeconometrics Using Stata, Revised Edition. College Station, Tex.: Stata Press.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. “Dealing With Limited Overlap in Estimation of Average Treatment Effects.” Biometrika 96:187-99.
DiNardo, John. 2002. “Propensity Score Reweighting and Changes in Wage Distributions.” Working paper. University of Michigan, Ann Arbor, Mich.
DiNardo, John, Nicole M. Fortin, and Thomas Lemieux. 1996. “Labour Market Institutions and the Distribution of Wages, 1973-1992: A Semiparametric Approach.” Econometrica 64:1001-46.
Eagly, Alice H., Wendy Wood, and Alice B. Diekman. 2000. “Social Role Theory of Sex Differences and Similarities: A Current Appraisal.” Pp. 123-174 in The Developmental Social Psychology of Gender, edited by Thomas Eckes and Hanns Trautner. Sussex: Psychology Press.
Easterbrook, Frank H. 1983. “Criminal Procedure as a Market System.” Journal of Legal Studies 12:289-332.
Fortin, Nicole, Thomas Lemieux, and Sergio Firpo. 2011. “Decomposition Methods in Economics.” In Handbook of Labor Economics, vol. 4, 1-102, edited by Orley Ashenfelter and David Card. Amsterdam: Elsevier.
Franklin, Cortney A., and Noelle E. Fearn. 2008. “Gender, Race, and Formal Court Decision-Making Outcomes: Chivalry/Paternalism, Conflict Theory, or Gender Conflict?” Journal of Criminal Justice 36:279-90.
Freiburger, Tina L. 2010. “The Effects of Gender, Family Status, and Race on Sentencing Decisions.” Behavioral Sciences and the Law 28:378-95.
Gendreau, Paul, Tracy Little, and Claire Goggin. 1996. “A Meta-Analysis of the Predictors of Adult Offender Recidivism: What Works!” Criminology 34:575-608.
Gilbert, Scott A., and Molly T. Johnson. 1996. “The Federal Judicial Center’s 1996 Survey of Judicial Experience.” Federal Sentencing Reporter 9:87-93.
Hagan, John, and Ronit Dinovitzer. 1999. “Collateral Consequences of Imprisonment for Children, Communities, and Prisoners.” Crime and Justice: A Review of Research 26:121-62.
Harlow, Caroline Wolf. 1999. “Prior Abuse Reported by Inmates and Probationers.” Bureau of Justice Statistics Report, NCJ 172879.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. 1998. “Characterizing Selection Bias Using Experimental Data.” Econometrica 66:1017-98.
James, Doris J., and Lauren Glaze. 2006. “Mental Health Problems of Prison and Jail Inmates.” Bureau of Justice Statistics Report, NCJ 213600.
Koban, L. 1983. “Parents in Prison: A Comparative Analysis of the Effects of Incarceration on the Families of Men and Women.” Research in Law, Deviance, and Social Control 5:171-83.
Kurlychek, Megan C., and Brian D. Johnson. 2004. “The Juvenile Penalty: A Comparison of Juvenile and Young Adult Sentencing Outcomes in Criminal Court.” Criminology 42:485-515.
Lee, David S. 2009. “Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects.” Review of Economic Studies 76:1071-1102.
Markel, Dan, Jennifer M. Collins, and Ethan J. Leib. 2007. “Privilege or Punish: Criminal Justice and the Challenge of Family Ties.” University of Illinois Law Review 2007:1148-1228.
Miller, Marc L. 2004. “Domination and Dissatisfaction: Prosecutors as Sentencers.” Stanford Law Review 56:1211-69.
Mustard, David B. 2001. “Racial, Ethnic, and Gender Disparities in Sentencing: Evidence from the U.S. Federal Courts.” Journal of Law and Economics 44:285-314.
Persico, Nicola, and Petra E. Todd. 2006. “Generalizing the Hit Rates Test For Racial Bias in Police Enforcement, With an Application to Vehicle Searches in Wichita.” The Economic Journal 116:F351-F367.
Powell, William J., and Michael T. Cimino. 1995. “Prosecutorial Discretion Under the Federal Sentencing Guidelines: Is the Fox Guarding the Hen House?” West Virginia Law Review 97:373-95.
Raeder, Myrna S. 2006. “Gender-Related Issues in a Post-Booker Federal Guidelines World.” McGeorge Law Review 37:691-756.
Rehavi, M. Marit, and Sonja Starr. 2012. “Racial Disparity in Federal Criminal Charging and its Sentencing Consequences.” Working Paper no. 12-002. University of Michigan Law and Economics, Empirical Legal Studies Center, Ann Arbor, Mich.
Rowe, Brian. 2009. “Gender Bias in the Enforcement of Traffic Laws: Evidence Based on a New Empirical Test.” Unpublished manuscript. University of Michigan, Department of Philosophy, September.
Sarnikar, Supriya, Todd Sorensen, and Ronald L. Oaxaca. 2007. “Do You Receive a Lighter Prison Sentence Because You Are a Woman? An Economic Analysis of Federal Criminal Sentencing Guidelines.” Working paper no. 2870. Institute for the Study of Labor (IZA), Bonn, Germany.
Schanzenbach, Max M. 2005. “Racial and Gender Disparities in Prison Sentences: The Effect of District-Level Judicial Demographics.” Journal of Legal Studies 34:57-92.
Schulhofer, Stephen J., and I. H. Nagel. 1997. “Plea Negotiations Under the Federal Sentencing Guidelines.” Northwestern University Law Review 91:1284-1316.
Scott, Ryan W. 2012. “Inter-Judge Sentencing Disparity After Booker: A First Look.” Stanford Law Review 63:1-66.
Shermer, Lauren O’Neill, and Brian Johnson. 2010. “Criminal Prosecutions: Examining Prosecutorial Discretion and Charge Reductions in U.S. Federal District Courts.” Justice Quarterly 27:394-430.
Spohn, Cassia, John Gruhl, and Susan Welch. 1987. “The Impact of the Ethnicity and Gender of Defendants on the Decision to Reject or Dismiss Felony Charges.” Criminology 25:175-92.
Spohn, Cassia, and Jeffrey W. Spears. 1997. “Gender and Case Processing Decisions.” Women and Criminal Justice 8:29-59.
Stacey, Ann Martin, and Cassia Spohn. 2006. “Gender and the Social Costs of Sentencing: An Analysis of Sentences Imposed on Male and Female Offenders in Three U.S.
District Courts.” Berkeley Journal of Criminal Law 11:43-75.
Steffensmeier, Darrell, John Kramer, and Cathy Streifel. 1993. “Gender and Imprisonment Decisions.” Criminology 31:411-46.
Stolzenberg, Lisa, and Stewart J. D’Alessio. 2004. “Sex Differences in the Likelihood of Arrest.” Journal of Criminal Justice 32:443-54.
Stith, Kate. 2008. “The Arc of the Pendulum: Judges, Prosecutors, and the Exercise of Discretion.” Yale Law Journal 117:1420-97.
Tobin, James. 1958. “Estimation of Relationships for Limited Dependent Variables.” Econometrica 26:24-36.
Ulmer, Jeffrey T., and Mindy S. Bradley. 2006. “Variation in Trial Penalties Among Serious Violent Offenses.” Criminology 44:631-70.
U.S. Sentencing Commission. 2010. Demographic Differences in Federal Sentencing Practices: An Update of the Booker Report’s Multivariate Regression Analysis.

Table 1
SUMMARY STATISTICS

                                              (1)       (2)       (3)       (4)
                                              Mean      Female    Male      Observations
                                                        Mean      Mean
District court defendants sentenced for non-petty crimes:
  Male                                        0.808     0         1         231,694
  White                                       0.646     0.652     0.645     231,694
  Black                                       0.310     0.295     0.313     231,694
  Other Race                                  0.044     0.053     0.042     231,694
  Age (Years)                                 34.1      34.5      34.0      231,694
  U.S. Citizen                                73.7      82.6      71.6      231,694
  Non-Parent                                  0.368     0.374     0.366     187,651
  Married Parent                              0.300     0.244     0.313     187,651
  Single Parent                               0.333     0.383     0.321     187,651
  Multi-Defendant Case                        0.473     0.472     0.473     231,694
  Education:
    HS Dropout                                0.418     0.342     0.436     231,694
    HS Diploma                                0.213     0.236     0.208     231,694
    GED/Vocational                            0.130     0.123     0.132     231,694
    College                                   0.239     0.300     0.224     231,694
  Criminal History:
    Category 1 (low)                          0.565     0.737     0.524     231,694
    Category 2                                0.106     0.093     0.109     231,694
    Category 3                                0.127     0.091     0.136     231,694
    Category 4                                0.066     0.034     0.074     231,694
    Category 5                                0.038     0.018     0.043     231,694
    Category 6 (high)                         0.097     0.028     0.114     231,694
  Offense Category:
    Property/Fraud                            0.282     0.468     0.237     231,694
    Regulatory                                0.055     0.054     0.055     231,694
    Drug                                      0.590     0.446     0.625     231,694
    Violent                                   0.073     0.032     0.083     231,694
  Sentenced to Prison                         0.818     0.639     0.861     231,617
  Prison Sentence Length (Months)             56.9      25.2      64.4      231,617
  Prison Sentence Length (If Incarcerated)    69.5      39.5      74.8      161,032
All arrestees in filing-stage sample:
  Filed in District Court                     0.919     0.905     0.922     386,205
All district-court defendants in conviction-stage sample:
  Convicted (Non-Petty)                       0.928     0.913     0.932     286,709

Table 2
REGRESSION ESTIMATES OF MEAN GENDER DISPARITIES IN CASE PROCESSING*

                    (1) Filing in        (2) Non-Petty        (3) Incarceration    (4) Log Prison Length
                    District Court       Conviction           (Odds Ratios)        (If Incarcerated)
                    (Odds Ratios)        (Odds Ratios)
                    Coefficient  SE      Coefficient  SE      Coefficient  SE      Coefficient  SE
Male                1.213***     .044    1.293***     .029    2.193***     .052    0.347***     .014
Black               1.023        .045    0.919**      .025    0.909***     .023    0.063***     .012
Other               1.544**      .201    0.928        .043    0.929        .050    0.0170       .029
Age                 1.009***     .002    0.989***     .001    1.001        .001    0.0063***    .000
U.S. citizen        1.480**      .215    1.061        .035    0.674***     .027    -0.037*      .016
Multi-defendant                          0.680***     .020    1.115***     .031    0.158***     .017
Ed. 2: HS Grad                                                0.864***     .020    -0.0205*     .008
Ed. 3: GED                                                    0.902***     .026    0.0217**     .007
Ed. 4: College                                                0.944*       .027    0.001        .008
Crim. His. Cat. 2                                             2.165***     .070    0.261***     .015
Crim. His. Cat. 3                                             3.525***     .124    0.364***     .015
Crim. His. Cat. 4                                             7.336***     .370    0.511***     .016
Crim. His. Cat. 5                                             11.573***    .820    0.650***     .017
Crim. His. Cat. 6                                             19.424***    1.238   0.944***     .014
N                   379,148              282,938              231,613              189,498

NOTE. – Ed. Cat. = educational category; Crim His. Cat. = criminal history category. Odds ratios/coefficients are from logistic and OLS regressions that also include arrest-offense and district fixed effects. *Standard errors clustered on arrest-district, respectively. *p<0.05; **p<0.01; ***p<0.001.

Table 3
POSSIBLE EFFECTS OF SAMPLE SELECTION ON ESTIMATION OF DISPARITY IN NON-ZERO PRISON SENTENCES: COMPARISON OF TRIMMED-SAMPLE ESTIMATES*

                    (1)                  (2)                  (3)                  (4)
                    Coefficient  SE      Coefficient  SE      Coefficient  SE      Coefficient  SE
Male                0.347***     .014    0.669***     .020    0.629***     .018    0.497***     .014
Sample Trim         Untrimmed            Influence-Based      Shortest             Random Short
N                   189,498              166,586              166,586              166,586

NOTE. – This table compares the "male" coefficient from Table 2, Column 4 to estimates for the same regression in samples that have male observations dropped so that the gender ratio in the trimmed sample matches the counterfactual ratio predicted by the Table 2, Column 3 regression if males were, conditional on observables, incarcerated only at the rate of women. The samples in Columns 2-4 are trimmed based on differing assumptions about which males are on the incarceration margin. Column 2 trims the male cases with the most negative individual influence on the "male" coefficient, Column 3 trims those with the shortest nonzero sentences, and Column 4 trims randomly from the male cases that have sentences no longer than 24 months. *Standard errors are clustered on the offense-district.
*p<0.05, **p<0.01, ***p<0.001

Table 4
AVERAGE GENDER DISPARITIES IN PRISON SENTENCE LENGTH (INCLUDING ZEROS): INVERSE PROPENSITY-SCORE REWEIGHTING ESTIMATES*

            Average Treatment Effect            Treatment on Treated (Men)          Treatment on
            (Treated=Male)                                                          Women
            (1)         (2)         (3)         (4)         (5)         (6)         (7)
Male        23.23***    17.60***    17.29***    25.13***    18.67***    18.55***    15.34***
            (2.716)     (1.923)     (1.373)     (2.908)     (1.936)     (1.409)     (1.701)
Constant    36.58***    29.76***    27.85***    39.28***    31.57***    30.98***    25.20***
            (3.393)     (2.986)     (2.254)     (3.505)     (2.985)     (2.183)     (2.472)
Percent     63.5        59.1        62.1        64.0        59.1        59.9        60.9
Sample      Full        PS Trim     Low CH      Full        PS Trim     Low CH      Full
N           231,582     190,535     173,407     231,582     190,535     184,787     231,582

NOTE. – Columns 1-3 show the average increase in sentence in months associated with changing all cases from the female to the male treatment condition, estimated by inverse propensity-score reweighting. Covariates used to estimate propensity scores are arrest offense, criminal history, education, age, race, U.S. citizenship, and multi-defendant case flag. Column 1 shows full-sample results. The Column 2 sample is trimmed to eliminate extreme propensity score values (p(m)>.93), and the Column 3 sample is limited to cases in criminal history categories 1-3. For the same samples, Columns 4-6 show the "average treatment effect on the treated" (men) obtained by comparing the observed male average to the reweighted female average. Column 7 shows the counterfactual "average treatment effect on the untreated" (women) obtained by comparing the observed female average to the reweighted male average, for the full sample. The "constant" line is the average in the female treatment condition and the "percent" line expresses the treatment effect as a percent of this female average. *Standard errors are clustered on the strata within which propensity scores are balanced. *p<0.05, **p<0.01, ***p<0.001.
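The reweighting described in the Table 4 note can be illustrated with a small numerical sketch. The Python below is not the paper's code: it assumes that the propensity scores p(m) (the estimated probability that a case is male, given the pre-charge covariates) have already been estimated, and the function name `ipw_gender_gap`, its arguments, and the toy data are hypothetical. It applies the standard inverse-propensity weights for the three estimands reported in the table and returns the gap, the average in the female treatment condition (the "constant" line), and the gap as a percent of that average.

```python
import numpy as np


def ipw_gender_gap(sentence_months, is_male, p_male, estimand="atu"):
    """Return (gap, female_condition_mean, percent) for one estimand.

    sentence_months: sentence length in months, zeros included.
    is_male:         1/True for male cases, 0/False for female cases.
    p_male:          propensity score for each case, assumed to be
                     estimated elsewhere and strictly between 0 and 1.
    estimand:        "ate" (all cases), "att" (treatment on the treated,
                     men), or "atu" (treatment on the untreated, women).
    """
    y = np.asarray(sentence_months, dtype=float)
    male = np.asarray(is_male, dtype=bool)
    p = np.asarray(p_male, dtype=float)

    if estimand == "ate":    # reweight both groups to the pooled sample
        w_male, w_female = 1.0 / p, 1.0 / (1.0 - p)
    elif estimand == "att":  # reweight women to resemble men
        w_male, w_female = np.ones_like(p), p / (1.0 - p)
    elif estimand == "atu":  # reweight men to resemble women
        w_male, w_female = (1.0 - p) / p, np.ones_like(p)
    else:
        raise ValueError("estimand must be 'ate', 'att', or 'atu'")

    male_mean = np.average(y[male], weights=w_male[male])
    female_mean = np.average(y[~male], weights=w_female[~male])
    gap = male_mean - female_mean
    return gap, female_mean, 100.0 * gap / female_mean


# Toy data with two covariate strata. Within each stratum men receive
# exactly 10 more months than women, so every estimand recovers 10.
y = [20, 10, 40, 40, 40, 40, 30]
male = [1, 0, 1, 1, 1, 1, 0]
p = [0.5, 0.5, 0.8, 0.8, 0.8, 0.8, 0.8]

gap, female_mean, percent = ipw_gender_gap(y, male, p, estimand="atu")
print(round(gap, 2), round(female_mean, 2), round(percent, 1))
```

In the toy data the raw male-female difference in means is 16 months (36 versus 20), because men are concentrated in the stratum with longer sentences; reweighting removes that compositional difference and leaves the 10-month within-stratum gap. The sketch omits propensity-score estimation, trimming, and the clustered standard errors described in the note.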
Table 5
ALTERNATE SAMPLES AND SPECIFICATIONS: INVERSE PROPENSITY-SCORE REWEIGHTING ESTIMATES, TREATMENT ON UNTREATED (WOMEN)*

Sample                        Male        (SE)      Constant    (SE)      Percent   N
(1)  Main                     15.34***    (1.70)    25.20***    (2.472)   60.9      231,582
(2)  Drug                     23.35***    (1.115)   40.00***    (2.064)   58.4      136,730
(3)  Prop./Reg.               5.975***    (0.408)   11.01***    (0.893)   54.3      77,989
(4)  Non-Parent               12.82***    (1.717)   24.85***    (3.154)   51.6      68,890
(5)  Married Parent           13.40***    (1.877)   22.60***    (2.531)   59.3      56,085
(6)  Single Parent            17.63***    (2.608)   27.26***    (3.749)   67.3      62,419
(7)  Black                    17.52***    (2.645)   23.68***    (3.80)    74.0      71,737
(8)  Non-Black                13.20***    (1.087)   25.83***    (1.947)   51.1      159,801
(9)  Multi-Defendant          21.42***    (2.421)   32.43***    (2.90)    66.0      109,487
(10) Sole Defendant           9.599***    (1.553)   18.73***    (2.99)    51.2      121,875
(11) Pre-Booker               14.65***    (1.855)   23.81***    (2.554)   61.5      109,663
(12) Post-Booker              15.89***    (1.961)   26.46***    (3.067)   60.1      121,883
(13) States w/ W. Pris.       14.45***    (1.626)   25.79***    (2.78)    56.0      91,470
(14) States w/o W. Pris.      15.59***    (2.149)   24.81***    (2.96)    62.8      139,932
(15) Police Notes Rec'd       15.75***    (1.665)   26.44***    (2.71)    59.6      134,613
(16) Police Notes Flags       15.57***    (1.606)   26.44***    (2.78)    58.9      134,613
(17) Family Rec'd             15.07***    (1.842)   25.22***    (2.777)   59.8      187,553
(18) Family Added             15.19***    (1.407)   25.23***    (2.339)   60.2      187,549
(19) Counsel Rec'd            15.33***    (1.531)   26.70***    (2.224)   57.4      135,471
(20) Counsel Added            14.83***    (1.351)   26.70***    (2.521)   55.5      135,470
(21) Plea/Time Added          14.06***    (1.607)   25.20***    (2.523)   55.8      231,582
(22) Drug Qty Rec'd           19.28***    (1.943)   33.20***    (2.060)   58.1      37,074
(23) Drug Qty Ctrl.           17.83***    (1.720)   33.20***    (2.372)   53.7      37,074
(24) Presumpt. Sentence       5.661***    (0.748)   25.20***    (4.218)   22.5      231,617

NOTE.
– The constant reflects the observed female average sentence length in months for the designated sample (including zeros) and the "male" coefficient is the average additional sentence length predicted if these cases were treated as male, based on inverse propensity score reweighting of the observed male sentences using the same covariates as in Table 4 except as noted. Standard errors are clustered on the strata within which propensity scores are balanced. *p<0.05, **p<0.01, ***p<0.001.

Table 6
SERIAL DECOMPOSITION OF AVERAGE GENDER DISPARITY BY PROCEDURAL SOURCES: IPW ESTIMATES OF TREATMENT ON UNTREATED (WOMEN)

Panel A: Non-Drug Cases (Observed Female Mean: 13.27 Months)

                                  (1)          (2)          (3)          (4)          (5)
                                  No Controls  Add Pre-     Add Charge   Add Conv.    Add Fact-    Remainder
                                  (Total Gap)  Charge       Sev.         Sev.         finding      (Attrib. to
                                               Controls     Measures     Measures                  Sentencing)
Unexplained Gap (Months)          26.90***     7.89***      7.18***      6.88***      2.13***      N/A
                                  (0.37)       (0.31)       (0.30)       (0.29)       (0.27)
As % of Female Mean               202          59.5         54.1         51.8         16.1         N/A
Share Explained by This Stage     N/A          19.01***     0.71***      0.30***      4.75***      2.13***
                                               (0.29)       (0.08)       (0.05)       (0.13)       (0.27)
This Stage As % of Total Gap      N/A          70.7         2.6          1.1          17.7         7.9
This Stage as % of Disparity
  in Justice Process              N/A          N/A          9.0          3.8          60.2         27.0

Panel B: Drug Cases (Observed Female Mean: 40.04 Months)

                                  (1)          (2)          (3)          (4)
                                  No Controls  Pre-Charge   Add Conv.    Add Fact-    Remainder
                                  (Total Gap)  Controls     Mand. Min.   finding      (Attrib. to
                                                                                      Sentencing)
Unexplained Gap (Months)          38.92***     23.38***     15.57***     8.67***      N/A
                                  (0.42)       (0.38)       (0.35)       (0.29)
As % of Female Mean               97.2         58.4         38.9         21.7         N/A
Share Explained by This Stage     N/A          15.54***     7.81***      6.90***      8.67***
                                               (0.30)       (0.22)       (0.20)       (0.29)
This Stage As % of Total Gap      N/A          39.9         20.1         17.7         22.3
This Stage as % of Disparity
  in Justice Process              N/A          N/A          33.4         29.5         37.1

NOTE.
– Column 1 shows the average observed male-female sentence gap in months, while Column 2 shows the gap when males are reweighted on the inverse propensity score using the pre-charge covariates from Table 4. In the other columns, additional covariates have been added sequentially. In Panel A, Column 3 adds the mandatory minimum, statutory maximum, and guidelines sentence associated with the initial charges; Column 4 further adds the same measures for the charges of conviction; and Column 5 further adds the final Guidelines offense level. In Panel B, Column 3 adds the mandatory minimum at conviction, and Column 4 further adds the final offense level. The "Share Explained by This Stage" row is based on the reduction of the unexplained relative to the preceding step, and the rows below it express this share as a percentage of the total gap and the gap unexplained by pre-charge covariates. The last column in each panel ("Share Remaining") expresses the residual unexplained in the preceding column, which is attributed to the final sentencing decision, in percentage terms, showing that the percentages the decomposition attributes to the procedural stages sum to 100%. *Standard errors are bootstrapped (500 replications). *p<0.05, **p<0.01, ***p<0.001.

Table 7
SHARE OF MEAN GENDER GAP EXPLAINED BY SPECIFIC FINDINGS OF FACT AND DEPARTURES: IPW DECOMPOSITION OF TREATMENT ON UNTREATED (WOMEN)*

                                                    Non-Drug Crimes                   Drug Crimes
                                                    (Gap Unexpl. by Pre-Charge        (Gap Unexpl. by Pre-Charge
                                                    Controls: 7.9 Months)             Controls: 23.4 Months)
                                                    Months      (SE)      Share of    Months      (SE)      Share of
                                                                          Gap (%)                           Gap (%)
Findings of Fact:
  Aggravating/Mitigating Role                       1.138***    (0.062)   14.4        4.578***    (0.128)   19.6
  Acceptance of Responsibility                      -0.261***   (0.037)   -3.3        0.039       (0.094)   0.2
  Obstruction of Justice                            0.228***    (0.042)   2.9         0.076       (0.070)   0.3
  Loss Amount                                       1.585***    (0.065)   20.1        N/A
  Drug Quantity                                     N/A                               0.740***    (0.103)   3.2
  Drug Safety Valves (Guidelines/Mand Min Waiver)   N/A                               2.074***    (0.111)   8.9
Departures:
  Family Ties                                       0.123***    (0.018)   1.6         0.287***    (0.032)   1.2
  Substantial Assistance/Cooperation                0.069       (0.037)   0.9         2.141***    (0.108)   9.2
  Mental Health/Abuse/Addiction                     0.136***    (0.024)   1.7         0.235***    (0.030)   1.0

NOTE. – Incremental reductions in unexplained disparity when particular findings of fact or departures are added to the IPW reweightings in Table 6. Findings of fact are added to weights that already include all variables through the conviction stage as noted in Table 6. Departures are added to weights that also included the final Guidelines offense level. *Standard errors are bootstrapped (500 replications). Because these figures are based on adding each of these variables independently (rather than together or sequentially), their collective explanatory power may be overstated if the variables are collinear with one another. *p<0.05, **p<0.01, ***p<0.001.

FIGURES

Figure 1a. – Distribution of Gender Propensity Scores for Full Sample

Figure 1b. – Distribution of Gender Propensity Scores for Low Criminal History Sample

Figure 2. – Gender Disparities in the Sentencing Distribution: Females vs. Reweighted Males
[Figure. Series: Observed Male, Reweighted Male, Observed Female. Vertical axis: Share of Distribution. Horizontal axis: Sentence (Months), in bins NP, <12, 12 to 36, 36 to 60, 60 to 84, 84 to 150, >=150.]

Figure 3a. – Sequential Reweighting of the Sentencing Distribution: Non-Drug Cases
[Figure. Series: Observed Male, Pre-Chg Controls, + Chg Controls, + Conv Ctrls, + Factfinding. Vertical axis: Share of Distribution. Horizontal axis: Sentence (Months), in bins NP, <12, 12 to 24, 24 to 36, 36 to 60, 60 to 120, >120.]

Figure 3b. – Decomposition of Gender Gaps in the Sentencing Distribution by Procedural Source: Non-Drug Cases
[Figure. Series: Sentencing, Factfinding, Charge-Bargaining, Charging. Vertical axis: Percentage Point Gap in Prob. of Exceeding Threshold (Females vs. Reweighted Males). Horizontal axis: Sentence Length Threshold (Months), at Any Pris, 12, 24, 36, 60, 120.]

Figure 3c. – Sequential Reweighting of the Sentencing Distribution: Drug Cases
[Figure. Series: Observed Male, Pre-Chg Controls, + Mand Min, + Factfinding, Observed Female. Vertical axis: Share of Distribution. Horizontal axis: Sentence (Months), in bins NP, <24, 24 to 48, 48 to 72, 72 to 120, 120 to 180, >=180.]

Figure 3d. – Decomposition of Gender Gaps in the Sentencing Distribution by Procedural Source: Drug Cases
[Figure. Series: Sentencing, Factfinding, Mand Min. Vertical axis: Percentage Point Gap in Prob. of Exceeding Threshold (Females vs. Reweighted Males). Horizontal axis: Sentence Length Threshold (Months), at Any Pris, 24, 48, 72, 120, 180.]

DATA APPENDIX

1. The Linked Dataset

This project is based on a linked, multi-agency dataset from four federal agencies: the U.S. Marshals’ Service (USMS), the Executive Office for the U.S. Attorneys (EOUSA), the Administrative Office of the U.S. Courts (AOUSC) and the U.S. Sentencing Commission (USSC).30 These datasets are collected by the Bureau of Justice Statistics (BJS) and, pursuant to security conditions, made available to researchers via the National Archive of Criminal Justice Data along with linking files that allow records to be linked at an individual level across the agencies.31 Together, these files trace cases from arrest through sentencing.
USMS collects information upon booking of arrestees in federal custody, based on arrest-stage information drawn from law enforcement. Their data include arrest offense, race, age, gender, marital status, a written offense description based on information from law enforcement, U.S. citizenship status, arrest date, the federal judicial district, and the arresting agency. EOUSA collects investigation- and case-related data for prosecutors; its fields were used to determine whether arrestees were charged before a district judge and for information on the type and quantity of drugs seized in arrests. Data on the initial and final charges in the case (and their disposition) as well as the number of co-defendants, defense counsel type, and the county of the offense came from the AOUSC, which compiles district court records. The USSC provides information recorded by judges on the sentence, the mandatory minimum applicable at sentencing, the defendant’s criminal history, education level, number of dependents, and Hispanic ethnicity, as well as rich detail on the specific findings of “sentencing facts” entered by judges and the reasons given for departure from the Sentencing Guidelines range.

The linking algorithm is dyadic and includes both inter- and intra-agency links, because EOUSA and AOUSC each have multiple kinds of files. There are two possible linking pathways that incorporate all the relevant fields. The first runs from USMS to the EOUSA suspect investigation file to the EOUSA “cases terminated” file to the AOUSC “cases terminated” file to the USSC. The second runs from USMS to the EOUSA suspect investigation file to the EOUSA “cases filed” file to the AOUSC “cases filed” file to the AOUSC “cases terminated” file to the USSC. The sample for the sentencing analysis is limited to cases that linked all the way through one of these two pathways.
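The two linking pathways can be pictured as chains of pairwise lookups, one per pair of adjacent files. The sketch below is illustrative only: the file labels, the layout of the `links` dictionary, the record identifiers, and the function names are all hypothetical and do not reflect the actual format of the BJS linking files. It shows only the logic of keeping a case when it links all the way through at least one pathway.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical labels for the files named in the text.
PATHWAY_1 = ["usms", "eousa_suspect", "eousa_terminated",
             "aousc_terminated", "ussc"]
PATHWAY_2 = ["usms", "eousa_suspect", "eousa_filed",
             "aousc_filed", "aousc_terminated", "ussc"]

# One dyadic link table per (source file, destination file) pair,
# mapping a record ID in the source to a record ID in the destination.
LinkTables = Dict[Tuple[str, str], Dict[str, str]]


def follow_pathway(usms_id: str, pathway: List[str],
                   links: LinkTables) -> Optional[str]:
    """Follow one pathway step by step; return the USSC ID or None."""
    current = usms_id
    for src, dst in zip(pathway, pathway[1:]):
        current = links.get((src, dst), {}).get(current)
        if current is None:  # the chain breaks at this step
            return None
    return current


def link_to_sentencing(usms_id: str, links: LinkTables) -> Optional[str]:
    """Return the USSC ID if the case links through either pathway."""
    for pathway in (PATHWAY_1, PATHWAY_2):
        ussc_id = follow_pathway(usms_id, pathway, links)
        if ussc_id is not None:
            return ussc_id
    return None


# Toy link tables: A1 links via pathway 1, A2 only via pathway 2,
# and A3 has a suspect file but never reaches a sentencing record.
toy_links: LinkTables = {
    ("usms", "eousa_suspect"): {"A1": "S1", "A2": "S2", "A3": "S3"},
    ("eousa_suspect", "eousa_terminated"): {"S1": "T1"},
    ("eousa_terminated", "aousc_terminated"): {"T1": "C1"},
    ("eousa_suspect", "eousa_filed"): {"S2": "F2"},
    ("eousa_filed", "aousc_filed"): {"F2": "D2"},
    ("aousc_filed", "aousc_terminated"): {"D2": "C2"},
    ("aousc_terminated", "ussc"): {"C1": "U1", "C2": "U2"},
}

sentencing_sample = {
    usms_id: link_to_sentencing(usms_id, toy_links)
    for usms_id in ("A1", "A2", "A3")
}
print(sentencing_sample)
```

A case such as A3 that fails to link is not necessarily a linking error; as the appendix notes, cases can also drop out for substantive reasons such as declination, non-filing, dismissal, or acquittal.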
Link-through rates were 81% (USMS to EOUSA investigation files),32 93% (EOUSA to AOUSC, among cases filed in district court only), and 90% (AOUSC to USSC, among cases with convictions of non-petty offenses only), and did not significantly differ by gender.33

30 The underlying linked dataset is the same as that used in a related paper on racial disparity by Rehavi and Starr (2012). However, the samples are constructed differently, in part because of different common-support concerns; this study uses more years of data, includes different case types, and includes all federal districts. This study also includes a number of additional covariates.
31 Descriptions of the files are at http://www.icpsr.umich.edu/icpsrweb/content/NACJD/guides/fjsp.html.
32 The lower link rate at this stage is probably because there are substantive reasons cases might not link through, in addition to failures of the linking algorithm. Cases would not link through if they were immediately either declined or transferred to some other authority (before opening a suspect investigation file).
33 Rates of filing in district court and conviction of non-petty offenses are outcomes assessed in the paper; cases that drop out of the sample due to non-filing, dismissal, or acquittal do not reflect linking failures.

2. Sample Restrictions

2.1. Timing

The main sentencing sample consisted of cases sentenced between October 1, 2000 and September 30, 2009. The analyses of filing and conviction rates required case initiation (arrest or opening of the EOUSA investigation file, whichever is later) or disposition, respectively, during the same period.

2.2. Case Type

Immigration cases were excluded because their stakes typically center on deportation rather than sentencing and because they often are handled via a very different fast-track process.34 In order to achieve better overlap between the male and female samples, I also excluded several case categories in which the arrestees were over 95% male: sex and pornography-related offenses (except for prostitution), weapons offenses, conservation offenses (mainly illegal hunting and fishing), and family offenses (mainly failure to pay child or spousal support). The remaining case types were property and fraud offenses, regulatory offenses (excluding those mentioned above), non-sexual violent crimes, and drug offenses. All case type exclusions were based on the USMS arrest code. Defining the sample based on the arrest stage data alone avoided potentially serious sample selection issues that could have emerged had the exclusions been based on the prosecutor's discretionary decisions.

The USMS codes are based on the principal arrest offense and may exclude some secondary criminal conduct (although in most cases, because concurrent sentencing is the default rule, secondary conduct will not affect the sentence). While virtually nobody in the sample was convicted of any immigration, sex/pornography, family, or conservation offenses, overlap between weapons cases and other cases is more common: about 6% of the sample was convicted of a weapons charge. The presence of weapons in violent crimes is often captured by the arrest code, and their presence in any kind of case is often captured by the police-notes-based description field that I use in robustness checks. Cases with arrest codes indicating a reason for detention other than a criminal offense (material witness warrants and violations of the conditions of parole or probation) were also excluded from the sample.

34 Citizenship was included as a covariate, and non-citizens were excluded in robustness checks, because deportation is also an important concern when non-citizens are charged even in non-immigration cases.

3. Construction of Key Independent Variables

3.1. Demographics

Gender, race, U.S. citizenship, marital status, and age are recorded by USMS. Race is coded as white, black, Asian, Native, and Other/Unknown; the last three groups together constitute only 4% of the sample, and I combined them. USSC provides number of dependents, education level, and Hispanic ethnicity (ethnic Hispanics are overwhelmingly coded as white in the race data). Marital status, number of dependents, and Hispanic ethnicity are sometimes missing and are included only in robustness checks. Also as a robustness check, the county fields in AOUSC were linked to county-level unemployment, poverty, and wage data from the Census Bureau and to crime data from the FBI’s Uniform Crime Reporting Program.

3.2. Arrest Offense

There are 430 unique arrest offenses listed in the USMS data. The original arrest offense codes included many very similar offense descriptions, including some that were slightly more detailed versions of others (for instance, “vehicle theft” and “vehicle theft by bailee”). Often the more detailed ones were rarely used. Therefore, the smallest categories were combined with others that could describe the same legal offense.35 In addition, I subdivided some of the drug arrest offense codes based on the drug type information in the EOUSA investigation-stage file. This is because many drug cases are simply given the arrest code “dangerous drugs,” and because the cocaine arrest codes combine crack and powder, which have different sentencing schemes.
There were 123 resulting arrest-code groups. The results are robust to the use of the original offense codes. Note that the drug offense codes do not specify quantity. The drug quantity at arrest (in addition to type) is usually identified by the EOUSA investigation-stage file; however, the quantity field is unreliable starting in FY 2004.36 Therefore, the main analyses do not include quantity in the controls, but robustness checks confined to FY 2001-03 do. To enable quantity comparisons across drug types, quantities were translated into “marijuana equivalents” according to the conversion tables in the Sentencing Guidelines.

3.3. Criminal History

Criminal history data are only available in the USSC data and are accordingly only available for those sentenced for guideline offenses. The variable used was the defendant's criminal history category, which ranges from 1 to 6 and forms the basis of the Guidelines sentencing grid. In 0.2% of the sentencing sample, this field was originally missing but could be calculated based on another Sentencing Commission field called "criminal history points," according to the rules laid out in the Guidelines.

3.4. Charge Severity Measures

The raw charge and conviction data are the statutory provisions under which the defendant was charged and convicted. These provisions had to be translated into measures of charge severity in order to assess the contribution of charging and conviction severity to sentence disparities.37 On the basis of legal research, I identified the statutory maximum and minimum sentence, and the Guidelines-recommended sentence associated with each combination of charges and convictions. Because the cited statutory provisions sometimes contain varied sentencing schemes depending on the facts of the case, I researched the most common ways in which these statutes are charged in order to be able to make realistic assumptions in the face of such ambiguities.

35 No single number defined what categories were small enough to be combined, because the combination depended on the legal assessment that the crimes were sufficiently similar.

36 There are drastic changes in the apparent quantity distribution in this field from 2003 to 2004 as well as large inconsistencies in quantity between this field and the sentencing-stage quantities recorded by USSC beginning in 2004. EOUSA adopted a new data entry system in 2004, and it seems apparent that the problem is with this system; unfortunately the inconsistencies appeared neither to be uniformly applicable nor confined to particular drug types or districts, so there is no way to identify which cases are problematic.

37 While the AOUSC data include a “severity” field, which is ostensibly based on the statutory maximum, it is not very useful because it appears to be based automatically on the very highest maximum contained anywhere in the statute cited, even when that maximum is only triggered by an exceptional circumstance that rarely applies. For instance, charges under 18 U.S.C. § 1347 (health care fraud) are coded by AOUSC as having a statutory maximum of life, even though that maximum only applies when the fraud leads to a death.

When possible, ambiguities were resolved by reference to the other charges in the case, when the legal elements of those charges revealed additional facts that the prosecutor must have been alleging. For instance, suppose Charge 1 is a burglary offense that usually has a maximum sentence of 10 years, but has a 20-year maximum if someone is seriously injured in the course of the burglary. Charge 2 is an aggravated assault charge, with a 15-year maximum, in which aggravated assault is defined to require that serious injury be proven. Because Charge 2’s presence indicates that the prosecutor was alleging serious injury, the maximum sentence for Charge 1 is raised to 20 years.
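The burglary example can be written out as a short Python sketch. It is only an illustration of the logic, not the coding scheme actually used for the paper: the `Charge` class, the `resolved_max` function, and the flag name `serious_injury` are hypothetical, and the month values are simply those of the hypothetical example (10, 15, and 20 years).

```python
from dataclasses import dataclass, field


@dataclass
class Charge:
    """One charged statute (hypothetical structure, for illustration only)."""
    name: str
    base_max_months: int                    # usual statutory maximum
    elements: frozenset = frozenset()       # facts that must be proven for this charge
    # Higher maximums that apply when a given fact is alleged in the case.
    enhanced_max_months: dict = field(default_factory=dict)


def resolved_max(charge, all_charges):
    """Statutory maximum for `charge`, raised when the elements of any charge
    in the case show that the prosecutor is alleging a triggering fact."""
    alleged = set().union(*(c.elements for c in all_charges))
    maximum = charge.base_max_months
    for fact, months in charge.enhanced_max_months.items():
        if fact in alleged:
            maximum = max(maximum, months)
    return maximum


# Charge 1: burglary, 10 years normally, 20 years if serious injury occurred.
burglary = Charge("burglary", 120, enhanced_max_months={"serious_injury": 240})
# Charge 2: aggravated assault, 15 years, with serious injury as an element.
assault = Charge("aggravated assault", 180, elements=frozenset({"serious_injury"}))

print(resolved_max(burglary, [burglary, assault]))  # 240
print(resolved_max(burglary, [burglary]))           # 120
```

When the burglary charge stands alone, nothing in the case establishes serious injury, so the sketch falls back to the usual maximum; the real coding instead relied on default assumptions for such remaining ambiguities.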
Implementing this approach required constructing a number of flags for every federal criminal statute, a complicated statutory interpretation task. The flags indicated whether certain facts were elements of the crime: death, injury, serious injury, drug crime, sex crime, fraud, official victim, minor victim, terrorist motive, an assault, use of a weapon, use of a gun specifically, a “crime of violence,” obstruction of justice, taking a person for ransom, and whether the crime was a predicate offense for the crime of felony murder. Statutes also had to be coded to reflect adjustments to the statutory or guidelines sentences that would be triggered by the presence of particular facts as identified by the flags for the other charges in the case. Remaining ambiguities were resolved according to default assumptions that varied between the severity measures.38

Constructing a measure of the Guidelines sentence involved additional challenges. First, the statutory provisions cited by AOUSC had to be matched to corresponding Sentencing Guidelines. The actual Guidelines range calculated by the judge is not solely determined by the charges; rather, it is heavily driven by sentencing fact-finding. However, the point of the charge-severity measures is to distinguish the effect of charging and conviction severity itself from that of sentencing fact-finding. Thus, the Guidelines-based measures of charge and conviction severity represent base offense levels determined solely by what the prosecutor charged (or what the defendant was convicted of), that is, the elements of the crime. They are based on applying the Guidelines assuming the elements of all charges brought were proven, but no additional findings of fact were made at sentencing.

The Guidelines define the “offense level”—a severity scale running from 1 to 43—associated with each offense.
In order for the units of this measure to be comparable to the other metrics, this offense level had to be converted into an implied sentence length in months. Under the Guidelines, offense levels translate mechanically into sentence ranges based on a grid, with criminal history as the other axis. The same column (Column 6) was used for the translation in all cases, such that the charging and conviction measures are blind to the defendant’s actual criminal history—they reflect charge severity alone, and criminal history is a separate covariate. The number of months used was the low end of the range in the applicable grid cell.

38 A detailed spreadsheet showing these flags and other details on coding choices is available on request.

Once the severity of the individual charges was coded, the charges were combined into total severity measures for all charges. In general, the severity of federal cases is determined by the most serious charge alone, because concurrent sentencing is the default rule. Thus, secondary charges affected the charge severity measures only when one of the charging statutes was an offense specifying that consecutive sentencing was required. As described above, however, information drawn from secondary charges could be used to adjust the coding of the primary charge. This approach to combining charges follows the method specified in the Sentencing Guidelines (see U.S.S.G. § 5G1.2). Two final adjustments were then made. First, the statutory minimum and the sum of the individual-charge maximums were imposed as lower and upper constraints, respectively, on the Guidelines sentence, which also tracks sentencing law (see U.S.S.G. § 5G1.2).
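The combination rule and this first adjustment can be illustrated with a brief Python sketch. It is a simplified reading of the description above, not the paper's actual code: the function name, the dictionary keys, and the example values are hypothetical, and the way the case-level statutory minimum is combined across charges is an assumption made here for illustration.

```python
def case_guidelines_months(charges):
    """Combine per-charge severity (in months) into a case-level Guidelines measure.

    Each charge is a dict with hypothetical keys:
      'guidelines'  - implied Guidelines sentence for the charge
      'stat_min'    - statutory minimum for the charge
      'stat_max'    - statutory maximum for the charge
      'consecutive' - True if the statute requires consecutive sentencing
    """
    concurrent = [c for c in charges if not c["consecutive"]]
    consecutive = [c for c in charges if c["consecutive"]]

    # Concurrent sentencing is the default, so only the most serious
    # concurrent charge counts; mandatory consecutive charges are added on top.
    total = max((c["guidelines"] for c in concurrent), default=0)
    total += sum(c["guidelines"] for c in consecutive)

    # Assumption: the case-level statutory minimum is combined the same way.
    lower = max((c["stat_min"] for c in concurrent), default=0)
    lower += sum(c["stat_min"] for c in consecutive)

    # Upper constraint: the sum of the individual-charge maximums.
    upper = sum(c["stat_max"] for c in charges)

    return min(max(total, lower), upper)


case = [
    {"guidelines": 30, "stat_min": 0, "stat_max": 120, "consecutive": False},
    {"guidelines": 18, "stat_min": 0, "stat_max": 60, "consecutive": False},
    {"guidelines": 60, "stat_min": 60, "stat_max": 60, "consecutive": True},
]
print(case_guidelines_months(case))  # 90: most serious concurrent charge (30) plus 60
```

In this example neither constraint binds. A single charge with a 24-month Guidelines value and a 60-month statutory minimum would instead be raised to 60.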
Second, zeros on the statutory maximum, guidelines, and mean sentence scales were replaced with half a month—half of the lowest nonzero values otherwise calculated—to reflect the fact that no criminal charge truly has zero severity, even if no incarceration is imposed. This adjustment affected only 0.05% of cases for the statutory maximum measure, 0.2% of cases for the guidelines measure, and 0.5% for the mean sentence measure. The mandatory minimum measure was turned into an indicator variable for whether there was any nonzero mandatory minimum and (for alternate specifications) a categorical variable designating whether the mandatory minimum was 0, less than 10 years, or 10 years or more. Similar variables were constructed based on the actual mandatory minimum of conviction recorded at the sentencing stage in the USSC data.

3.5. Conviction and Sentence Outcomes

A dummy variable for whether the defendant was convicted of a non-petty offense was constructed based on AOUSC records. Non-petty offenses are those carrying more than six months as a statutory maximum, so the classification of offenses is based on the statutory maximum measure described above. Conviction of a non-petty offense is a prerequisite for inclusion in the Sentencing Commission data. Sentence data were drawn from the Sentencing Commission and are therefore only available for those convicted of offenses covered by the sentencing guidelines. Sentence lengths were truncated at 540 months, and life sentences were given that value. This length is longer than the highest non-life statutory maximum found in federal law (480 months), and corresponds approximately to the remaining life expectancy of an American of the sample-average age. Only 0.7% of sentenced cases were affected by this truncation.

Data Sources

U.S. Census Bureau, 2000.
“Census of Population and Housing, Summary File 3.” http://www.census.gov/census2000/sumfile3.html (last updated October 13, 2011).

U.S. Department of Justice. Office of Justice Programs. Bureau of Justice Statistics. 2009. Federal Justice Statistics Program: Paired-Agency Linked Files, 2009. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Number 30701-v3 (2011-11-11).

U.S. Department of Justice. Office of Justice Programs. Bureau of Justice Statistics. Federal Justice Statistics Program: Arrests and Bookings for Federal Offenses, 2000-2009. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers 24126-v2 (2011-03-08), 24145-v2 (2011-03-08), 24164.v2 (2011-03-08), 24181.v2 (2011-03-08), 24216.v2 (2011-03-08), 24199.v2 (2011-03-08), 24211.v2 (2011-03-08), 24226.v2 (2011-03-08), 24231.v2 (2011-03-08), 29428.v2 (2011-03-08), 30794-v1 (2011-07-22). Original Data Source: U.S. Marshals’ Service (“USMS”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Suspects in Federal Criminal Matters Concluded, 2000-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 24120-v2 (2011-03-08), 24139.v2 (2011-03-08), 24158.v2 (2011-03-08), 24175.v2 (2011-03-08), 24193.v2 (2011-03-08), 24210.v2 (2011-03-08), 24225.v2 (2011-03-08), 29424.v2 (2011-03-08), 30790.v1 (2011-06-03). Original Data Source: Executive Office of U.S. Attorneys (“EOUSA Matters Out”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Defendants Charged in Criminal Cases Filed in District Court, 2000-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research.
ICPSR Study Numbers: 24121-v2 (2011-03-08), 24140.v2 (2011-03-08), 24159.v2 (2011-03-08), 24176.v2 (2011-03-08), 24194.v2 (2011-03-08), 24211.v2 (2011-03-08), 24226.v2 (2011-03-08), 29426.v2 (2011-03-08), 30791.v1 (2011-06-03). Original Data Source: Executive Office of U.S. Attorneys (“EOUSA Cases In”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Defendants in Federal Criminal Cases -- Terminated, 2000-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 24122-v2 (2011-03-08), 24141.v2 (2011-03-08), 24160.v2 (2011-03-08), 24177.v2 (2011-03-08), 24195.v2 (2011-03-08), 24212.v2 (2011-03-08), 24227.v2 (2011-03-08), 29433.v2 (2011-03-08), 30792.v1 (2011-06-03). Original Data Source: Executive Office of U.S. Attorneys (“EOUSA Cases Out”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Defendants in Federal Criminal Cases Filed in District Court, 2000-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 24114-v2 (2011-03-08), 24133.v2 (2011-03-08), 24152.v2 (2011-03-08), 24169.v2 (2011-03-08), 24186.v2 (2011-03-08), 24204.v2 (2011-03-08), 24221.v2 (2011-03-08), 29402.v2 (2011-03-08), 30781.v1 (2011-06-03). Original Data Source: Administrative Office of the U.S. Courts (“AOUSC Cases In”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Defendants in Federal Criminal Cases in District Court -- Terminated, 2000-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 24115-v2 (2011-03-08), 24134.v2 (2011-03-08), 24153.v2 (2011-03-08), 24170.v2 (2011-03-08), 24187.v2 (2011-03-08), 24205.v2 (2011-03-08), 24222.v2 (2011-03-08), 29242.v2 (2011-03-08), 30784.v1 (2011-06-03).
Original Data Source: Administrative Office of the U.S. Courts (“AOUSC Cases Out”).

U.S. Department of Justice. Bureau of Justice Statistics. Federal Justice Statistics Program: Defendants Sentenced Under the Sentencing Reform Act, 2001-2009 [United States]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 24127-v2 (2011-03-08), 24146.v2 (2011-03-08), 24165.v2 (2011-03-08), 24182.v3 (2011-03-08), 24200.v3 (2011-03-08), 24217.v3 (2011-03-08), 24232.v2 (2011-03-08), 29381.v2 (2011-03-08), 30795.v1 (2011-06-06). Original Data Source: U.S. Sentencing Commission (“USSC”).

U.S. Department of Justice. Office of Justice Programs. Federal Bureau of Investigation. Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, 2007-2009. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. ICPSR Study Numbers: 30763-v1 (2012-01-25), 27644-v1 (2011-04-21), 25114-v1 (2009-07-31).