
Agencies eye synthetic data to help train and test AI

The Department of Homeland Security and the Chief Data Officers Council recently put out calls for products and insight on synthetic data generation.

Government agencies are on the hunt for vendors and best practices that can help them make use of artificially generated data, also known as synthetic data, that can be used to build or test artificial intelligence applications and machine learning models.

The Department of Homeland Security’s Science and Technology Directorate released a solicitation Dec. 15 for synthetic data solutions that can “generate synthetic data that models and replicates the shape and patterns of real data, while safeguarding privacy.”

The technique has the potential to help the department train machine learning models in instances where there is no real-world data available or when using that data would be a privacy, civil rights and liberties or security risk.

The agency’s Silicon Valley Innovation Program, which invests in startup companies with tech that could meet operational needs for DHS, says synthetic data generators could be particularly useful to the Cybersecurity and Infrastructure Security Agency for developing realistic training and exercise scenarios or modeling cyber and physical environments in real time.

A National Strategy on Privacy-Preserving Data Sharing and Analytics, issued by two subcommittees of the National Science and Technology Council in 2023, notes that the vast amounts of data existing today have great potential, but are often restricted by the challenges around sharing and using sensitive information. 

The strategy lists synthetic data as a type of privacy-preserving data sharing and analytics technology that could “unlock the beneficial power of data analysis while protecting privacy.”

Adoption of synthetic data has been slow, the report notes, because of limited awareness, a lack of standards, varying stages of maturity and more.

The report’s authors call out the need for verification and validation techniques for the use of synthetic data to address accuracy and data quality issues, as well as the need for research on the effectiveness of those different techniques.

At DHS, “the ability to generate and use synthetic data would be a gamechanger in the department’s use of complex and rapidly evolving technologies to meet its critical mission while protecting privacy,” DHS Chief Privacy Officer Mason Clutter said in a statement. 

The solicitation notes that currently, although DHS generates a lot of data, it is “highly challenging to utilize or share that data across organizational boundaries” because of its sensitive nature. 

The department’s solicitation is open through April 10, and companies that participate are eligible for up to $1.7 million in funding to develop the tech for homeland security use cases. 

The Chief Data Officers Council is also asking for input on synthetic data, as the council works to establish best practices for synthetic data generation. 

A request for information published in the Federal Register on Friday seeks input on a more formalized definition for synthetic data as well as answers to questions about its applications, challenges and limitations. 

Among the questions the council is asking are how synthetic data can be used, what challenges come with it and what best practices should be considered for ethics and equity.

That RFI is open through Feb. 5.