
Imputing Race/Ethnicity: Part 1

Lisa M. Lines, Senior Health Services Researcher

Jamie Humphrey, Health Geographer

September 02, 2021

This blog originally appeared on The Medical Care Blog and is republished here with permission.

In Part 1 of this two-part series, we lay out arguments for and shortcomings of imputing race/ethnicity from the perspective of health equity. In Part 2, we’ll talk about evidence gaps and research needed, as well as a few alternative approaches.

The Biden administration is focusing on health equity and improved data collection to measure and analyze disparities and inequities. Imputation is a method of inferring or assigning values, or a vector of probabilities, to missing data. How does imputing race and/or ethnicity fit with the administration’s efforts, as well as the broader reckoning with the racial equity imperative?

We (researchers, policymakers) often want to know what every person’s race/ethnicity is because it helps us understand and track quality, costs, access to care, and outcomes for different groups. Many have asked for COVID-19-related data to be released by race/ethnicity, so we can measure the disproportionate impact on Black, Indigenous, and other People of Color (BIPOC) communities. Moreover, the Federal government requires that various entities collect and report race and ethnicity data. Recent efforts have focused on reporting quality measures and other outcomes stratified by race/ethnicity. However, people sometimes leave race/ethnicity questions blank (on a survey, for example), and this is more likely if they don’t feel they fit into any of the answer categories.

Beyond Black and White

Race and ethnicity have historically been measured in a variety of ways. Since race and ethnicity are social constructs, rather than biological ones, definitions are bound to evolve. In earlier years, races and ethnicities aside from white and Black were measured inconsistently, if at all. In fact, Social Security categories for race were just Black, White, and “Other” until 1980.

From 1790 to 2020, every US Census has asked about race – using different categories nearly every time. Here’s one illustration: people from the Indian subcontinent were categorized as “Hindu” from 1920 to 1940, as “White” in the next three censuses, and as “Asian” since 1980.

In another example, some people of Middle Eastern and North African (MENA) descent have lobbied to be included in the “White” category on the Census. Others from MENA communities have lobbied to have their own category, arguing that being lumped into the white category erases their community.

A small study in two diverse clinics nearly 20 years ago found that “many patients became angry when asked about race/ethnicity, and some did not understand the question… many respondents identified with a national origin instead of a race or ethnicity.” Is it any wonder that the government’s standard categories, as seen on many surveys, are still contentious? Relying on self-reporting thus means dealing with under-reporting and missing data. Increasing self-reporting takes rebuilding broken trust, which is not a quick or easy thing.

Imputing Missing Race/Ethnicity Data is a Long-Established and Common Practice

Attempts to address missing data began as early as the 1950s. Older approaches involved assigning the mean or mode value to missing data. In more complex analyses, researchers used other variables that weren’t missing to predict values in a single regression imputation.
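To make these older approaches concrete, here is a minimal Python sketch of mode imputation, mean imputation, and single regression imputation. The group labels ("A", "B"), variable names, and numbers are hypothetical, and the helper functions are ours for illustration rather than taken from any particular study or package.

```python
from collections import Counter
from statistics import mean


def impute_mode(values):
    """Replace each None with the most common observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def impute_regression(x, y):
    """Fill missing y values with predictions from a least-squares
    line fitted to the rows where y was observed."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi
            for xi, yi in zip(x, y)]


# Hypothetical toy data: None marks a missing answer.
groups = ["A", "A", "B", None, "A", None]
print(impute_mode(groups))  # every gap is filled with the majority label

age = [30, 40, 50, 60]
income = [30.0, 40.0, None, 60.0]
print(impute_regression(age, income))
```

Note that mode imputation always fills gaps with the largest group, and single regression imputation treats each filled-in value as if it had been observed, with no allowance for uncertainty.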

Some Federal data resources use hot-deck imputation. This approach involves imputing data by randomly selecting the value from a similar record. The Medical Expenditure Panel Survey (MEPS), for example, imputes missing data on income and employment in this manner, but not on disability or race/ethnicity. For race/ethnicity, MEPS creates edited/imputed versions of the race/ethnicity indicators, filling in from other data sources (where available) and the race/ethnicity of family members.
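As a rough illustration of the hot-deck idea, the sketch below copies a missing value from a randomly chosen "donor" record that matches on a few fields. This is a simplified sketch under our own assumptions, not the procedure MEPS actually uses; the field names, matching rule, and records are hypothetical.

```python
import random


def hot_deck(records, target, match_on, rng):
    """Fill missing `target` values by copying the value from a random
    donor record that has the same values on the `match_on` fields."""
    donors = {}
    for r in records:
        if r[target] is not None:
            key = tuple(r[f] for f in match_on)
            donors.setdefault(key, []).append(r[target])

    completed = []
    for r in records:
        r = dict(r)  # copy, so the input records are left unchanged
        if r[target] is None:
            pool = donors.get(tuple(r[f] for f in match_on))
            if pool:  # leave the gap if no similar donor exists
                r[target] = rng.choice(pool)
        completed.append(r)
    return completed


# Hypothetical records: None marks a missing income.
records = [
    {"region": "NE", "age_band": "18-44", "income": 52000},
    {"region": "NE", "age_band": "18-44", "income": 61000},
    {"region": "NE", "age_band": "18-44", "income": None},
    {"region": "S", "age_band": "65+", "income": 30000},
    {"region": "S", "age_band": "45-64", "income": None},
]
filled = hot_deck(records, "income", ["region", "age_band"], random.Random(0))
for row in filled:
    print(row)
```

In this toy example, the third record borrows one of the two donor incomes from its matching cell, while the last record stays missing because no similar donor exists.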

Methods Have Evolved Over the Last Few Decades

In recent years, multiple imputation by chained equations (MICE) has overcome the limitations of single regression approaches. MICE uses information from multiple regression models and random, bootstrapped samples. Bayesian and random forest-based regression approaches have also shown promise in terms of reducing misclassification bias.
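The chained-equations idea can be sketched in a few dozen lines of Python. The version below is a deliberately simplified sketch under our own assumptions: two numeric variables, simple linear regressions, and bootstrap resampling to make each completed dataset differ. Production implementations handle many variables and categorical outcomes (such as race/ethnicity) with more careful models; all function names and data here are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope; slope is 0 if xs has no spread."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def impute_column(target, predictor, missing, rng):
    """Re-impute the missing entries of `target` from `predictor`, using a
    regression fitted to a bootstrap sample of rows where target was observed."""
    obs = [i for i in range(len(target)) if i not in missing]
    boot = [rng.choice(obs) for _ in obs]  # bootstrap resample of observed rows
    a, b = fit_line([predictor[i] for i in boot], [target[i] for i in boot])
    residuals = [target[i] - (a + b * predictor[i]) for i in boot]
    for i in missing:
        # prediction plus a randomly drawn residual, so imputations vary
        target[i] = a + b * predictor[i] + rng.choice(residuals)


def chained_imputation(x, y, m=5, cycles=10, seed=0):
    """Return m completed copies of (x, y); None marks a missing value."""
    rng = random.Random(seed)
    miss_x = {i for i, v in enumerate(x) if v is None}
    miss_y = {i for i, v in enumerate(y) if v is None}
    mean_x = mean(v for v in x if v is not None)
    mean_y = mean(v for v in y if v is not None)
    datasets = []
    for _ in range(m):
        # start from mean-filled columns, then refine each column in turn
        cx = [mean_x if v is None else v for v in x]
        cy = [mean_y if v is None else v for v in y]
        for _ in range(cycles):
            impute_column(cx, cy, miss_x, rng)
            impute_column(cy, cx, miss_y, rng)
        datasets.append((cx, cy))
    return datasets


# Hypothetical toy data with gaps in both variables.
x = [1.0, 2.0, None, 4.0, 5.0, 6.0]
y = [2.1, None, 6.2, 8.1, 9.9, None]
completed = chained_imputation(x, y)
print(len(completed), "completed datasets")
```

The key design point is that the output is several plausible completed datasets rather than one, so the spread across them carries information about how uncertain the imputations are.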

A complete description of approaches to imputation is beyond the scope of this blog post. However, it’s worth noting that modern approaches often generate estimated probabilities for statistical modeling, rather than assigning people to specific categories. This avoids the potentially problematic issue of directly assigning people to the wrong categories. However, when using these probabilities in models, their coefficients cannot be interpreted the same way as the coefficients estimated with categorical race/ethnicity data.
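The difference between using probabilities and assigning categories can be illustrated with a small Python sketch. Here each person carries an estimated probability of belonging to each group, and we compare a probability-weighted group mean with the mean obtained after assigning everyone to their most probable group. The groups, probabilities, and outcome values are hypothetical.

```python
def weighted_group_means(probabilities, outcomes):
    """Group means of an outcome, weighting each person by their
    estimated probability of belonging to each group."""
    means = {}
    for g in probabilities[0]:
        weight = sum(p[g] for p in probabilities)
        total = sum(p[g] * y for p, y in zip(probabilities, outcomes))
        means[g] = total / weight
    return means


def hard_assigned_group_means(probabilities, outcomes):
    """Group means after assigning each person to their most probable group."""
    totals, counts = {}, {}
    for p, y in zip(probabilities, outcomes):
        g = max(p, key=p.get)
        totals[g] = totals.get(g, 0.0) + y
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


# Hypothetical probabilities for three people and two groups.
probs = [{"A": 0.9, "B": 0.1}, {"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}]
outcome = [10.0, 20.0, 30.0]

print(weighted_group_means(probs, outcome))
print(hard_assigned_group_means(probs, outcome))
```

The two approaches give different group means from the same inputs, which is one reason estimates based on probabilities cannot be read the same way as estimates based on recorded categories.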

An argument can be made that, done correctly, imputation is imperfect but better than nothing. It reduces bias and variance and improves the quality of the data. Multiple imputation methods also account for uncertainty in the imputed data. Indirect estimation is certainly less “burdensome” – from the government’s perspective – than gathering this information directly.
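One way multiple imputation accounts for uncertainty is by pooling results across the completed datasets with the standard combining rules usually attributed to Rubin. The post itself does not name these rules, so treat the Python sketch below as an illustration; the estimates and variances are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine one estimate and one variance per imputed dataset.

    Total variance = average within-imputation variance
                     + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divides by m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed datasets.
estimates = [10.0, 12.0, 11.0]
variances = [4.0, 4.0, 4.0]
pooled, total_variance = pool_estimates(estimates, variances)
print(pooled, total_variance)
```

The pooled variance is larger than the variance from any single imputed dataset because it adds the disagreement between imputations to the usual sampling variance.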

Shortcomings of Imputing Race/Ethnicity from a Health Equity Perspective

From a health equity perspective, however, it is worth digging deeper. Can a statistical model actually be constructed to predict race/ethnicity that satisfies different kinds of validity – including face validity, construct validity, replicability, and predictive validity?

One major statistical issue with imputation is that the methodology assumes that the missing data are not systematically missing and/or that the missing values follow the same patterns as the nonmissing data. However, research shows that people who do not volunteer identification data tend to come from underrepresented groups.
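A little arithmetic shows why systematic missingness matters. In the Python sketch below, the population sizes and response rates are hypothetical: group B makes up 20% of the population but answers the race/ethnicity question less often than group A.

```python
def observed_shares(population, response_rates):
    """Share of each group among people who answered the question."""
    respondents = {g: population[g] * response_rates[g] for g in population}
    total = sum(respondents.values())
    return {g: n / total for g, n in respondents.items()}


# Hypothetical population counts and response rates.
population = {"A": 800, "B": 200}
response_rates = {"A": 0.9, "B": 0.5}

shares = observed_shares(population, response_rates)
true_share_b = population["B"] / sum(population.values())

print(f"True share of group B:     {true_share_b:.1%}")
print(f"Observed share of group B: {shares['B']:.1%}")
```

An imputation model that fills gaps using the distribution among respondents would start from the lower observed share, so it would carry the under-representation of group B forward instead of correcting it.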

Statistically speaking, imputing race/ethnicity creates bias in terms of misidentification, which is particularly problematic in this context. If we assess the impact of the healthcare system on health outcomes through stratification by race/ethnicity, using an algorithm that induces bias in a metric so highly related to our outcome(s) of interest seems ill-advised unless it corrects more bias than it introduces.

Ethics and Identities

Ethically, we should be concerned about filling in information that has been withheld deliberately. For example, someone who agreed to provide sensitive financial or health data may not have done so if plans to impute race or ethnicity to their data were disclosed. Choosing not to answer is a valid response category. Imputation should only be done for truly missing responses.

So much of an individual’s experience of the healthcare system can be shaped by their race/ethnicity because of systemic bias and structural racism. Race and ethnicity are also associated with the effects of segregation, a relative lack of generational wealth, and many other things – largely as a direct result of federal, state, and local policy and practice. Race and ethnicity are very different from other kinds of characteristics that could be imputed, like cholesterol levels.

Race and ethnicity are essential parts of our identities, our cultures, and our experiences. Understandably, given historical precedents, minority racial and ethnic identities also may be correlated with mistrust of the medical profession and mistrust of government. Racism and stigma are – independent of economics – factors in the care that people receive. For example, care providers often underestimate and undertreat the physical pain felt by Black people. Those with chronic illnesses face additional stigma that worsens their quality of life.

Given this, some argue that self-report is the only standard for personal identification – not a benchmark for validation, nor merely the best of many ways to determine a person’s identity. Algorithms that impute racial/ethnic data could exacerbate racial/ethnic biases in clinical decision-making and public policy-making. If imputed race and ethnicity variables do not accurately predict actual race and ethnicity, the conclusions policymakers draw from the imputed data could lead to misinformed policy choices that harm BIPOC populations.

Under-representation

The imputation methodologies currently in use, by their very nature, perpetuate underrepresentation: less-represented identifications are going to be less likely to be assigned (by definition) and BIPOC representation will continue to suffer. This seems backward: shouldn’t the point be to understand the experiences of those least likely to be identified? Echoing the language of the disability-rights movement – “nothing about us without us” – how can we help inform good policy without good data on those who are known to experience worse care and outcomes?
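The mechanics of this problem are easy to see in a toy Python example. Suppose a model gives each of ten people a 70% probability of belonging to group A and a 30% probability of belonging to group B (hypothetical numbers). The expected number of group B members is three, but assigning each person to their most likely group yields none.

```python
from collections import Counter


def expected_counts(probabilities):
    """Expected group sizes: the sum of each group's probabilities."""
    totals = Counter()
    for p in probabilities:
        for group, value in p.items():
            totals[group] += value
    return dict(totals)


def modal_counts(probabilities):
    """Group sizes after assigning each person to their most probable group."""
    return dict(Counter(max(p, key=p.get) for p in probabilities))


# Hypothetical: ten people with identical estimated probabilities.
probs = [{"A": 0.7, "B": 0.3} for _ in range(10)]

print(expected_counts(probs))
print(modal_counts(probs))
```

Group B disappears entirely under most-likely-group assignment, even though the model itself implies that about three of the ten people belong to it.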

For example, electronic health records are the source of race/ethnicity in some cancer registries. One recent study found that American Indians were frequently miscategorized in those registries as white. Similarly, individuals with multiple racial/ethnic identities and Indigenous people are often misidentified on death certificates. Some healthcare facilities are better than others in collecting race/ethnicity accurately. In these cases, how are we to analyze the care provided to under-represented groups if they are misidentified in our data?

In another example, researchers used people’s surnames and where they live to assign probabilities for different races/ethnicities. The method of using surnames has obvious shortcomings: people change their surnames at marriage, people can be adopted by parents of a different racial identification, etc. The researchers note that the accuracy of their approach ranges from 88-95% for Hispanic, Black, white, and Asian/Pacific Islander people. However, American Indian/Alaska Native and multiracial people had much lower correlations between imputed and self-reported information: 12-54%.
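One common way to combine two signals like these is a Bayesian update: start from the probabilities associated with a surname, then reweight them by how likely each group is to live in the person's area. The Python sketch below shows that general idea only. It is our assumption about how such a combination can work, not a reproduction of the researchers' method, and the group labels and every number in it are hypothetical.

```python
def combine_surname_and_geography(p_group_given_surname, p_area_given_group):
    """Posterior P(group | surname, area), proportional to
    P(group | surname) * P(area | group).

    Assumes surname and area are independent once group is known,
    which is a simplifying assumption.
    """
    unnormalized = {
        g: p_group_given_surname[g] * p_area_given_group[g]
        for g in p_group_given_surname
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical inputs for one person.
surname_prior = {"A": 0.70, "B": 0.25, "C": 0.05}       # P(group | surname)
area_likelihood = {"A": 0.001, "B": 0.004, "C": 0.010}  # P(area | group)

posterior = combine_surname_and_geography(surname_prior, area_likelihood)
for group, prob in posterior.items():
    print(group, round(prob, 3))
```

In this example the surname alone points to group A, but the geographic information shifts the most probable group to B. The output is still only as good as the reference tables behind it, which is where small groups tend to be poorly served.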

The shortcomings of imputation approaches could be magnified with more use of algorithms. Algorithmic bias can be hard to detect and understand. More research is needed on the implications of this issue.

Conclusions

Take-away messages: Don’t impute race/ethnicity crudely or thoughtlessly – think carefully about the validity of your models. Doing it “wrong” has potential repercussions. A better-than-nothing approach could be dangerous – issues with regard to poor care or poor outcomes stemming from systemic racism can hardly be mitigated by math. The onus needs to remain on improving the collection of self-reported data on race and ethnicity, as well as other relevant factors of interest.

Imputing missing race/ethnicity information is routinely done – but just because it’s common doesn’t mean it’s right. Many practices that were once common in health and medicine have gone by the wayside. Someday, imputing race/ethnicity may be seen as another archaic practice from a less-enlightened era.

The question of how to handle missing data on race/ethnicity is not a simple matter. In Part 2, we will talk about the need for more and better data to inform health equity analyses. We will also discuss where we could be heading as a field, including approaches involving population-level and neighborhood-level data.

Disclaimer: This piece was written by Lisa M. Lines (Senior Health Services Researcher) and Jamie Humphrey (Health Geographer) to share perspectives on a topic of interest. Opinions expressed within are those of the author or authors.