Credit: Marysia Machulska
Companies committed to data-based decision-making share common concerns about privacy, data integrity, and a lack of sufficient data.
Synthetic data aims to solve those problems by giving software developers and researchers something that resembles real data but isn’t. It can be used to test machine learning models or build and test software applications without compromising real, personal data.
A synthetic data set has the same mathematical properties as the real-world data set it’s standing in for, but it doesn’t contain any of the same information. It’s generated by taking a relational database, creating a generative machine learning model for it, and generating a second set of data.
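The fit-then-sample idea can be sketched in a few lines of Python. This is a toy illustration, not the Synthetic Data Vault's API or its actual model: it assumes a two-column table, uses a simple bivariate Gaussian as the "generative model," and the column meanings (age, annual spend) are made up.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit(rows):
    """Learn the toy model: column means, standard deviations, correlation."""
    xs, ys = zip(*rows)
    n = len(rows)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": pearson(xs, ys)}

def sample(model, n, rng):
    """Draw n brand-new rows from the fitted bivariate Gaussian."""
    rho = model["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Hypothetical "real" table of (age, annual_spend) rows, invented for this sketch.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 50 * age + rng.gauss(0, 300)))

model = fit(real)                     # step 1: model the real data
synthetic = sample(model, 2000, rng)  # step 2: generate a second data set

real_corr = pearson(*zip(*real))
synth_corr = pearson(*zip(*synthetic))
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {synth_corr:.2f}")
```

The synthetic rows share the relationship between the two columns, but none of them is copied from the original table. Real tools handle many columns, mixed data types, and links between tables, which this sketch ignores.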
The result is a data set that contains the general patterns and properties of the original — which can number in the billions — along with enough “noise” to mask the data itself, said Kalyan Veeramachaneni, principal research scientist with MIT’s Schwarzman College of Computing.
Gartner has estimated that 60% of the data used in artificial intelligence and analytics projects will be synthetically generated by 2024. Synthetic data offers numerous value propositions for enterprises, including its ability to fill gaps in real-world data sets and replace historical data that’s obsolete or otherwise no longer useful.
“You can take a phone number and break it down. When you resynthesize it, you’re generating a completely random number that doesn’t exist,” Veeramachaneni said. “But you can make sure it still has the properties you need, such as exactly 10 digits or even a specific area code.”
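The phone number example might look something like the following in Python. It is a minimal sketch under stated assumptions: the function name `synthesize_phone` is hypothetical, the sample numbers are made up, and the code only preserves the two properties mentioned in the quote (10 digits, same area code). It does not check whether a generated number is actually in service.

```python
import random

def synthesize_phone(area_code, rng):
    """Return a random 10-digit number that keeps the given 3-digit area code."""
    if len(area_code) != 3 or not area_code.isdigit():
        raise ValueError("area_code must be exactly 3 digits")
    # The remaining 7 digits are drawn at random, so they carry no
    # information about the original number.
    rest = "".join(rng.choice("0123456789") for _ in range(7))
    return area_code + rest

rng = random.Random(42)
real_numbers = ["6175550101", "6175550147", "4155550199"]  # made-up examples
synthetic_numbers = [synthesize_phone(n[:3], rng) for n in real_numbers]
print(synthetic_numbers)
```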
Synthetic data: “no significant difference” from the real thing
A decade ago, Veeramachaneni and his research team were working with large amounts of student data from an online educational platform. The data was stored on a single machine and had to be encrypted. This was important for security and regulatory reasons, but it slowed things down.
At first, Veeramachaneni’s research team tried to create a fake data set. But because the fake data was randomly generated, it did not have the same statistical properties as the real data.
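A small Python sketch shows why purely random fake data falls short. The student data here is invented for illustration (hours studied and exam score), and the "fake" generator is an assumption about what naive random generation might look like: each column is drawn independently from its observed range.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(1)
# Hypothetical "real" data: scores rise with hours studied.
hours = [rng.uniform(0, 20) for _ in range(1000)]
scores = [40 + 2.5 * h + rng.gauss(0, 5) for h in hours]

# Naive fake data: each column generated on its own, ignoring the other.
fake_hours = [rng.uniform(min(hours), max(hours)) for _ in range(1000)]
fake_scores = [rng.uniform(min(scores), max(scores)) for _ in range(1000)]

real_corr = pearson(hours, scores)
fake_corr = pearson(fake_hours, fake_scores)
print(f"real correlation: {real_corr:.2f}, fake correlation: {fake_corr:.2f}")
```

The fake columns cover the right ranges, but the link between them is gone, so anything built on that link would behave differently on the fake data than on the real data.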
That’s when the team began developing the Synthetic Data Vault, an open-source software tool for creating and using synthetic data sets. It was built using real data to train a generative machine learning model, which then generated samples that had the same properties as the real data, without containing the specific information.
To begin, researchers created synthetic data sets for five publicly available data sets. They then invited freelance data scientists to develop predictive models on both the synthetic and the real data sets and to compare the results.
In a 2016 paper, Veeramachaneni and co-authors Neha Patki and Roy Wedge, also from MIT, demonstrated that there was “no significant difference” between predictive models generated on synthetic data and real data.
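One way to picture this kind of comparison is to train the same model twice, once on real data and once on synthetic data, and score both on held-out real data. The Python sketch below is an assumption-laden illustration, not the protocol from the 2016 paper: the data is invented, the model is a one-variable least-squares line, and the generator is a toy bivariate Gaussian.

```python
import math
import random

def fit_line(rows):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    b = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return my - b * mx, b

def mean_abs_error(line, rows):
    """Average absolute prediction error of the line on the given rows."""
    a, b = line
    return sum(abs(y - (a + b * x)) for x, y in rows) / len(rows)

def synthesize(rows, n, rng):
    """Toy generator: sample n rows from a bivariate Gaussian fitted to rows."""
    k = len(rows)
    mx = sum(x for x, _ in rows) / k
    my = sum(y for _, y in rows) / k
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (k - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (k - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (k - 1)
    slope = sxy / sxx
    resid_sd = math.sqrt(syy - slope * sxy)  # std. dev. of y given x
    out = []
    for _ in range(n):
        x = rng.gauss(mx, math.sqrt(sxx))
        out.append((x, my + slope * (x - mx) + rng.gauss(0, resid_sd)))
    return out

rng = random.Random(7)
# Hypothetical data set, split into training rows and held-out test rows.
data = []
for _ in range(1500):
    x = rng.gauss(50, 15)
    data.append((x, 3 + 0.8 * x + rng.gauss(0, 4)))
train, test = data[:1000], data[1000:]

synthetic = synthesize(train, 1000, rng)
error_real = mean_abs_error(fit_line(train), test)
error_synth = mean_abs_error(fit_line(synthetic), test)
print(f"trained on real: {error_real:.2f}, trained on synthetic: {error_synth:.2f}")
```

If the two error figures are close, the synthetic data has captured what the model needs. Both models are scored on real rows, since that is the data the model would ultimately face.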
“We were starting to realize that we can do a significant amount of software development with synthetic data,” Veeramachaneni said. Between his work at MIT and his role with PatternEx, an AI cybersecurity startup, “I started getting more and more evidence every day that there was a need for synthetic data,” he said.
Use cases have included offshore software development, medical research, and performance testing, which can require data sets significantly larger than most organizations have on hand.
The Synthetic Data Vault is freely available on GitHub, and the latest of its 40 releases was issued in December 2022. The software, now part of DataCebo, has been downloaded more than a million times, Veeramachaneni said, and is used by financial institutions and insurance companies, among others.
It’s also possible for an organization to build its own synthetic data sets. Generally speaking, it requires an existing data set, a machine learning model, and the expertise needed to train a model and evaluate its output.
A step above de-identification
Software developers and data scientists often work with data sets that have been “de-identified,” meaning that personal information, such as a credit card number, birth date, bank account number, or health plan number, has been removed to protect individuals’ privacy. This is required for publicly available data, and it’s a cornerstone of health care and life science research.
But it’s not foolproof. A list of credit card transactions might not display an account number, Veeramachaneni said, but the date, location, and amount might be enough to trace the transaction back to the night you met a friend for dinner. On a broader scale, even health records de-identified against 40 different variables can be re-identified if, for example, someone takes a specific medication to treat a rare disease.
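The credit card example can be made concrete with a short Python check that counts how many "de-identified" rows are still unique on the fields left behind. The transactions and field names below are hypothetical, and the check is only a rough indicator of re-identification risk.

```python
from collections import Counter

# Hypothetical de-identified transactions: account numbers already removed.
transactions = [
    {"date": "2023-01-12", "city": "Boston", "amount": 84.50},
    {"date": "2023-01-12", "city": "Boston", "amount": 12.00},
    {"date": "2023-01-12", "city": "Boston", "amount": 12.00},
    {"date": "2023-01-13", "city": "Cambridge", "amount": 61.25},
]

def quasi_key(row):
    """Combine the remaining fields that, together, could point to one person."""
    return (row["date"], row["city"], row["amount"])

counts = Counter(quasi_key(t) for t in transactions)

# Rows whose combination of date, city, and amount appears only once.
unique_rows = [t for t in transactions if counts[quasi_key(t)] == 1]

# Size of the smallest group of rows sharing the same combination.
k = min(counts.values())
print(f"{len(unique_rows)} of {len(transactions)} rows are unique; smallest group size is {k}")
```

A row that is unique on these fields can be matched by anyone who already knows when, where, and how much someone spent, even though no account number appears.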
A synthetic data set doesn’t suffer these shortcomings. It preserves the correlations among data variables — the rare disease and the medication — without linking the data to the individual with that diagnosis or prescription. “You can model and sample the properties in the original data without having a problem of data leakage,” Veeramachaneni said.
This means that synthetic data can be shared much more easily than real data. Industry best practices in health care and finance suggest that data should be encrypted at rest, in use, and in transit. Even if this isn’t explicitly required in federal regulations, it’s implied by the steep penalties assessed for the failure to protect personal information in the event of a data breach.
In the past, that’s been enough to stop companies from sharing data with software developers, or even sharing it within an organization. The intention is to keep data in (purportedly) safe hands, but the effect is that it hinders innovation, as data isn’t readily available for building a software prototype or identifying potential growth opportunities.
“There are a lot of issues around data management and access,” Veeramachaneni said. It gets even thornier when development, testing, and debugging teams have been offshored. “You have to increase productivity, but you don’t want to put people in a situation where they have to make judgment calls about whether or not they should use the data set,” he said.
Synthetic data eliminates the need to move real data sets from one development team to another. It also lets individuals store data locally instead of logging into a central server, so developers can work at the pace they’re used to.
An additional benefit, Veeramachaneni said, is the ability to address bias in data sets as well as the models that analyze them. Since synthetic data sets aren’t limited to the original sample size, it’s possible to create a new data set and refine a machine learning model before using the data for development or analysis.
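As a rough Python sketch of one narrow case, rebalancing an underrepresented group, a model can be fitted per group and then sampled in equal numbers. Everything here is an assumption for illustration: the groups "A" and "B", the income values, and the per-group Gaussian model. Equalizing group counts is only one of many ways bias might be addressed.

```python
import random
import statistics

rng = random.Random(3)
# Hypothetical data: (group, income) rows in which group "B" is underrepresented.
real = [("A", rng.gauss(60, 10)) for _ in range(900)]
real += [("B", rng.gauss(55, 12)) for _ in range(100)]

def fit_by_group(rows):
    """Fit a mean and standard deviation of the value for each group."""
    groups = {}
    for group, value in rows:
        groups.setdefault(group, []).append(value)
    return {g: (statistics.mean(vs), statistics.stdev(vs)) for g, vs in groups.items()}

def sample_balanced(model, per_group, rng):
    """Generate the same number of synthetic rows for every group."""
    return [
        (g, rng.gauss(mu, sd))
        for g, (mu, sd) in model.items()
        for _ in range(per_group)
    ]

model = fit_by_group(real)
balanced = sample_balanced(model, 500, rng)
group_counts = {g: sum(1 for gg, _ in balanced if gg == g) for g in model}
print(group_counts)
```

Because the generator is not tied to the original row count, the 900-to-100 split in the real data becomes 500-to-500 in the synthetic set.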
Access to data means access to opportunities
The ability to freely share and work with synthetic data might be its greatest benefit: It’s broadly available and ready to be used.
For Veeramachaneni, accessing synthetic data is like accessing computing power. He recalled going to the computer lab at night in graduate school about 20 years ago to run data simulations on 30 computers at the same time. Today, students can do this work on their laptops, thanks to the availability of high-speed internet and cloud computing resources.
Data today is treated like the computer lab of yesteryear: Access is restricted, and so are opportunities for college students, professional developers, and data scientists to test new ideas. With far fewer necessary limitations on who can use it, synthetic data can provide these opportunities, Veeramachaneni said.
“If I hadn’t had access to data sets the way I had in the last 10 years, I wouldn’t have a career,” he said. Synthetic data can remove the speed bumps and bottlenecks that are slowing down data work, Veeramachaneni said, and it can enhance both individual careers and overall efficiency.
A 3D dog from a single photograph
Synthetic data can be more than rows in a database — it can also be art. Earlier this year, social media was enamored with DALL-E, the AI and natural language processing system that creates new, realistic images from a written description. Many people appreciated the possibility for whimsical art: NPR put DALL-E to work depicting a dinosaur listening to the radio and legal affairs correspondent Nina Totenberg dunking a basketball in space.
This technology has been years in the making as well. Around the same time that Veeramachaneni was building the Synthetic Data Vault, Ali Jahanian was applying his background in visual arts to AI at MIT’s Computer Science and Artificial Intelligence Laboratory. AI imaging was no stranger to synthetic data. The 3D flight simulator is a prime example, creating a realistic experience of, say, landing an airplane on an aircraft carrier.
These programs require someone to input parameters first. “There’s a lot of time and effort in creating the model to get the right scene, the right lighting, and so on,” said Jahanian, now a research scientist at Amazon. In other words, someone needed to take the time to describe the aircraft carrier, the ocean, the weather, and so on in data points that a computer could understand.
As Veeramachaneni did with his own data set, Jahanian focused on developing AI models that could generate graphical outputs based on observations of and patterns in real-world data, without the need for manual data entry.
The next step was developing an AI model that could transform a static image. Given a single 2D picture of a dog, the model can let you view the dog from different angles, or with fur of a different color.
“The photo is one moment in time, but the synthetic data could be different views of the same object,” Jahanian said. “You can exhibit capabilities that you don’t have in real data.”
And you can do it at no cost: Both the Synthetic Data Vault and DALL-E are free. Microsoft (which is backing the DALL-E project financially) has said that users are creating more than 2 million images per day.
There are concerns about data privacy, ownership, and misinformation. An oil painting of a Tyrannosaurus rex listening to a tabletop radio is one thing; a computer-generated image of protestors on the steps of the U.S. Capitol is another.
Jahanian said these concerns are valid but should be considered in the larger context of what the technology makes possible. One example is medicine: A visualization of a diseased heart would be a lot more impactful than a lengthy clinical note describing it.
“We need to embrace what these models provide to us, rather than being skeptical of them,” Jahanian said. “As people see how they work, they’ll start to influence how they are shaped and trained and used, and we can make them more accessible and more useful for society.”