Every decade, the U.S. Census Bureau counts Americans, trying to strike a balance between gathering accurate information and protecting the privacy of the people described in the data. But current technology can reveal a person’s transgender identity by linking seemingly anonymous information, such as their neighbors and age, to discover how their gender is reported differently in successive censuses. The ability to de-anonymize gender and other data could spell disaster for trans people and families living in states trying to criminalize them.
In places like Texas, where families seeking medical care for trans children can be accused of child abuse, the state needs to know which teens are trans in order to investigate them. We are concerned that census data could be used to make such investigations and punishments easier. Could the weak anonymization of publicly released datasets be exploited to find trans children and punish them and their families? Similar concerns underpinned the public outcry in 2018 over a proposed census question about citizenship status, data that could be used to find and punish people living in the U.S. without authorization.
Using our expertise in data science and data ethics, we took mock data designed to mimic the Census Bureau’s publicly released datasets and tried to re-identify trans teens, or at least narrow down where they might live. Unfortunately, we succeeded. Against the data anonymization method the Census Bureau used in 2010, we were able to identify 605 transgender children. Thankfully, the Census Bureau is adopting a new approach, based on differential privacy, that will improve overall protection, though that effort remains a work in progress. When we applied our attack to the recently published data, we found that the bureau’s new method reduced the re-identification rate by 70 percent: much better, but there is still room for improvement.
Even as researchers who use census data in our own work to answer questions about life in the United States, we firmly believe that privacy matters. The bureau is currently holding a public comment period on the design of the 2030 census. Submissions could shape how the census is conducted and how the bureau will anonymize the data it releases. That is why this moment matters.
The federal government collects census data to determine the size and shape of congressional districts and how to allocate funds. But government agencies are not the only ones using this data. Researchers in a variety of fields, including economics and public health, use the publicly released information to study national conditions and make policy recommendations.
But the risks of data de-anonymization are real, and not just for trans children. In a world where private data collection and access to powerful computing systems are increasingly common, it may be possible to defeat the privacy protections the Census Bureau has built into the data. In perhaps the most famous demonstration, computer scientist Latanya Sweeney showed that almost 90 percent of Americans can be uniquely re-identified from just their ZIP code, date of birth, and sex.
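The core of Sweeney’s observation is easy to illustrate: reduce each person to the triple (ZIP code, date of birth, sex) and count how many triples occur exactly once. The five-person population below is entirely synthetic; this is a minimal sketch of the counting step, not a reproduction of her study.

```python
from collections import Counter

# Entirely synthetic population: each person reduced to the
# quasi-identifier triple (ZIP code, date of birth, sex).
people = [
    ("02139", "1985-04-12", "F"),
    ("02139", "1985-04-12", "F"),  # shares a triple: not re-identifiable
    ("02139", "1990-07-01", "M"),
    ("60601", "1972-11-30", "F"),
    ("60601", "1988-02-14", "M"),
]

counts = Counter(people)
unique = [p for p in people if counts[p] == 1]
print(f"{len(unique) / len(people):.0%} uniquely identifiable")  # 60%
```

In a real population the triples are far more dispersed than in this toy list, which is why the uniquely identifiable fraction climbs toward 90 percent.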
In August 2021 the Census Bureau responded. It now protects its redistricting data with differential privacy, a method favored by cryptographers. Mathematicians and computer scientists are drawn to the mathematical elegance of this approach, which deliberately introduces a controlled amount of error into key census counts and then cleans up the results so they remain internally consistent. For example, if the census accurately counted 16,147 people who identified as Native American in a particular county, it might report a close but different number, such as 16,171. It sounds simple, but counties are made up of census tracts, and census tracts are made up of census blocks. This means that to produce a number close to the raw count, the census must also adjust the Native American count in every block and tract; the art of the Census Bureau’s method is to make all of these close-but-different numbers add up to another close-but-different number.
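The noise-then-reconcile idea can be sketched in a few lines. To be clear, this is not the bureau’s actual algorithm, which uses different noise distributions and constrained optimization over the full geographic hierarchy; it is a minimal illustration with Laplace noise, an arbitrarily chosen privacy parameter, and the consistency step reduced to spreading the discrepancy evenly across blocks.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def release(block_counts, epsilon=0.5):
    """Noise the county total and each block count, then post-process so
    released blocks are non-negative and roughly sum to the released total."""
    scale = 1.0 / epsilon  # epsilon is an illustrative choice, not the bureau's
    total = round(sum(block_counts) + laplace_noise(scale))
    noisy = [c + laplace_noise(scale) for c in block_counts]
    gap = total - sum(noisy)  # spread the discrepancy evenly over blocks
    blocks = [max(0, round(c + gap / len(noisy))) for c in noisy]
    return total, blocks

raw = [312, 48, 1204, 9, 77]  # hypothetical raw block-level counts
total, blocks = release(raw)
print(total, blocks)
```

The post-processing is the hard part in practice: the real system must keep thousands of overlapping tabulations consistent at once, not just one column of blocks under one total.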
One might think that protecting people’s privacy is a no-brainer. But some researchers, mainly those whose work relies on data released under the existing privacy methods, see it differently. The changes, they argue, will make research harder in practice, while the privacy risks the Census Bureau is guarding against are largely theoretical.
But as we have shown, the risk is not theoretical. Here is a bit more about how we did it.
We reconstructed the full list of people under the age of 18 in each census block, recovering their age, gender, race, and ethnicity as reported in 2010. We then matched this list against a similar list from 2020 to find people who were now 10 years older and whose reported gender differed. This approach, called a reconstruction-supported linkage attack, requires only publicly released datasets. When we presented it and formally submitted it to the Census Bureau, it proved powerful and concerning enough that researchers at Boston University and Harvard contacted us to learn more about our work.
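A toy version of the linkage step makes the idea concrete. Every record below is synthetic, and the matching rule (same block and race, age exactly ten years older, different reported sex) is a deliberate simplification: a real attack must first reconstruct these records from published tables and must handle ambiguous, non-unique matches.

```python
# Synthetic "reconstructed" records: (block, age, sex, race).
recon_2010 = [
    ("B001", 7, "M", "White"),
    ("B001", 7, "F", "Asian"),
    ("B002", 12, "F", "Black"),
]
recon_2020 = [
    ("B001", 17, "F", "White"),
    ("B001", 17, "F", "Asian"),
    ("B002", 22, "F", "Black"),
]

def flag_sex_changes(r2010, r2020):
    """Link minors across the two censuses by block, race, and age + 10,
    keeping only pairs whose reported sex differs between releases."""
    flagged = []
    for blk, age, sex, race in r2010:
        for blk2, age2, sex2, race2 in r2020:
            if (blk2, race2, age2) == (blk, race, age + 10) and sex2 != sex:
                flagged.append((blk, age, race))
    return flagged

print(flag_sex_changes(recon_2010, recon_2020))  # [('B001', 7, 'White')]
```

Note that the attack never needs private data: both input lists are derived from tables the bureau publishes, which is exactly what makes the threat model worrying.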
We modeled what a bad actor could do; so how do we make sure such an attack never happens in earnest? The Census Bureau is taking this aspect of privacy seriously, and researchers who use its data should not stand in the way.
Census collection is labor-intensive and costly, and we all benefit from the data that this work produces. But the data can also cause harm, and the Census Bureau’s work to protect privacy has come a long way in mitigating that risk. We must encourage them to continue.
This is an opinion and analysis article, and the views expressed by the authors are not necessarily those of Scientific American.