One billion photos were used to train Meta’s latest photo recognition algorithm, a powerful demonstration of today’s appetite for data. For those companies without access to platforms like Instagram, there is another answer: synthetic data.
Synthetic data is artificially created by a computer rather than collected from the real world. These computer-generated images can be automatically annotated by the machine that creates them. Annotation, an important part of AI training, is the process of labeling key elements in a photo, such as people or objects, to help machine learning models understand what the image represents. Synthetic images also avoid compliance and privacy issues, because they are original images that do not show real people.
Such technology saves businesses the challenge of sourcing and collecting thousands of real-world images, while avoiding privacy, GDPR, and copyright issues.
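To make the idea of automatic annotation concrete, here is a minimal Python sketch. It is a toy illustration, not how any vendor named in this article works: the function name `make_synthetic_sample`, the 32x32 grid, and the single "object" label are all assumptions made for the example. The point is that the program that draws the object already knows where it is, so the label comes for free.

```python
import random

def make_synthetic_sample(width=32, height=32, rng=None):
    """Render a toy 'image' with one rectangle and return it with its label.

    Hypothetical helper for illustration only. Because the program chooses
    where the object goes, it already knows the bounding box, so no human
    annotator is needed.
    """
    if rng is None:
        rng = random.Random()

    # Pick the object's size and position at random, keeping it inside the frame.
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    # Draw the scene: 0 is background, 1 is the object.
    image = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 1

    # The annotation is a by-product of generation, not a separate manual step.
    annotation = {"label": "object", "bbox": (x0, y0, box_w, box_h)}
    return image, annotation

image, annotation = make_synthetic_sample(rng=random.Random(0))
print(annotation)
```

A real pipeline would render photorealistic 3D scenes rather than grids of ones and zeros, but the principle is the same: the label is recorded at the moment the image is created.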
“The biggest bottleneck in AI is the dearth of privacy-compliant, real-world data,” says Steve Harris, chief executive of UK-based synthetic data startup Mindtech Global. “Even a simple image recognition application needs up to 100,000 training images, and each image must be privacy compliant and perfectly annotated by a human.” Obtaining, annotating, and cleaning real-world data is “a monumental task,” he says, that can take up to 80% of a data scientist’s time.
Marek Rei is Professor of Machine Learning at Imperial College London. “Manual data collection is time consuming and expensive,” he says. “If you can generate data from scratch, you can essentially create infinite amounts of it. For some rare events, getting even 10 real examples can be difficult, while synthetic data can potentially provide unlimited examples.”
Given these benefits, Gartner predicts that 60% of the data used to develop AI and analytics projects will be synthetically generated by 2024, leading the consultancy to describe synthetic data as “the future of AI”.
Traditionally, AI development has involved collecting data, training the model, testing it, and making any necessary changes before testing it again.
The problem with this method is that the data used remains the same, according to Ofir Chakon, CEO and co-founder of synthetic data company Datagen.
“The performance gain you get from this model-centric approach is relatively low,” he says. “To really get a significant performance improvement from your AI algorithms, you need to change your mindset. Instead of iterating over the model parameters, you should iterate over the data itself.”
Datagen produces synthetic data for a variety of AI applications, from facial recognition technology to driver monitoring systems, security cameras, and even gesture recognition. Chakon believes that such applications will become increasingly popular as more companies expand into the metaverse.
To produce the computer-generated data for a facial recognition system, Datagen scans the faces of real people of a variety of ages and demographics. Based on this 3D information, its AI learns the composite parts of the human face so it can start generating completely new images of people. “By scanning 100 base identities, we can create millions of new identities,” says Chakon.
For example, given enough information, the generative model can be asked to create the face of a 30-year-old white male with brown hair; it will produce a completely new image each time.
“Based on what you learn from the real-world scans and the conditions in which they are set, you can generate a completely new identity that has nothing to do with what was in the original collection of faces,” says Chakon.
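A rough piece of arithmetic shows how 100 scans could plausibly lead to millions of identities. The Python sketch below is a deliberately naive illustration and not Datagen's method, which the article describes only as an AI that learns the composite parts of the face. It assumes, hypothetically, that a face splits into three independent components and that each can be taken from any base identity; the component names and both function names are invented for the example.

```python
from itertools import product

def count_recombined_identities(num_base_identities, components):
    """Count faces formed by taking each component from any base identity."""
    return num_base_identities ** len(components)

def recombine(base_ids, components):
    """Yield every assignment of a base identity to each component (toy model)."""
    for choice in product(base_ids, repeat=len(components)):
        yield dict(zip(components, choice))

# Hypothetical breakdown of a face into independent parts.
components = ["face_shape", "eyes", "mouth"]

total = count_recombined_identities(100, components)
print(total)  # 1000000

# A tiny enumeration to show what the combinations look like.
for face in recombine(["id_a", "id_b"], components):
    print(face)
```

Under these assumptions, three components are already enough to reach one million combinations, and a learned generative model is not limited to copying whole parts, so the real space of possible faces is larger still.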
Proponents of synthetic data say this can help reduce the bias that often creeps into algorithms at the training stage. “Skewed training data can result in technology solutions and products that reinforce and perpetuate discrimination in the real world,” says Harris. “For example, AI systems have often been found to be poor at recognizing darker skin tones. This is because the AI in question has been trained on data sets that lack diversity.”
In 2015, Google’s image recognition algorithm was criticized for mislabeling images of black people as “gorillas.” With synthetic data, it is theoretically possible for AI developers to generate endless faces of people of different ethnicities to train their models, meaning there are less likely to be gaps in AI understanding.
Harris says that some of his clients use Chameleon, Mindtech’s AI training platform, to generate diverse data from scratch, while others use it to address a lack of diversity in their existing real data sets. “By using computers to train AI, we are removing the biggest obstacle to progress: human bias.”
Computers training computers
Inevitably, there are problems with using computer-generated imagery to train AI for real-world applications. “Synthetic data almost never gives the same results as a comparable amount of real data,” explains Rei. “We typically have to make some assumptions and simplifications to model the data generation process. Unfortunately, this also means missing out on many of the nuances and complexities present in real data.”
This is apparent from a cursory glance at some synthetically generated faces; they are unlikely to fool anyone into thinking they are real. Datagen is currently investing in its photorealism capabilities, but Chakon argues that realism is not crucial for all applications.
“If you’re developing a blemish-detecting AI for makeup application, having details is important,” he says. “But if you’re developing a security system, it’s much less relevant if you can identify small details on a person’s face.”
Synthetic data is also not a panacea for AI bias; it relies on the people who generate the data using those platforms responsibly. Rei adds: “Any bias that is present in the data generation process, whether intentionally or unintentionally, will be detected by models trained on it.”
An Arizona State University study showed that a generative model trained on images of predominantly white, male engineering professors amplified the biases in the data set, meaning it produced images of minority modes less often. Worse still, when generating new faces, the AI began to “lighten the skin color of non-white faces and transform female facial features to be masculine.”
Because synthetic data programs give developers access to unlimited amounts of data, a mistake made at any point in the build process has the potential to dramatically exacerbate the problem of bias.
If used correctly, synthetic data can still help improve the diversity of some data sets. “If the data distribution is very unnatural, for example, it doesn’t contain any examples of people of a particular race, then synthetically creating these examples and adding them to the data may be better than doing nothing,” says Rei. “But it probably won’t be as good as collecting real data with more precise coverage of all the races.”
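The kind of top-up Rei describes can be sketched in a few lines of Python. This is an illustrative assumption about how a developer might plan it, not a procedure taken from Mindtech, Datagen, or Rei: the function `synthetic_top_up`, the group names, and the counts are all hypothetical. It simply works out how many synthetic examples each group would need to match the best-represented group in the real data.

```python
from collections import Counter

def synthetic_top_up(group_labels, all_groups):
    """Return how many synthetic examples each group needs to match the largest group.

    Hypothetical helper: group_labels holds one group label per real example,
    and all_groups lists every group that should be covered, including any
    that are entirely missing from the real data.
    """
    counts = Counter(group_labels)
    target = max(counts.values(), default=0)
    return {group: target - counts.get(group, 0) for group in all_groups}

# Made-up real data set: group D has no examples at all.
real = ["A"] * 800 + ["B"] * 150 + ["C"] * 50
plan = synthetic_top_up(real, ["A", "B", "C", "D"])
print(plan)  # {'A': 0, 'B': 650, 'C': 750, 'D': 800}
```

Equalizing the counts does not guarantee an unbiased model. As Rei notes, the synthetic examples are only as good as the generator that made them, so a plan like this is a starting point rather than a substitute for collecting more representative real data.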
While synthetic data can make the AI modeling process faster, cheaper, and easier for programmers, it still presents many of the same challenges as its real-world counterpart. “Whether synthetic data is better than real-world data is not really the right question,” argues Harris. “What AI developers need to do is find or create adequate amounts of appropriate data to train their system.” Using a mix of real and artificial data may be the answer.