Generative AI’s Popularity Could Cause Its Models to Collapse
At some point, the software could begin training on the data it creates itself and forget what the original, human-created data looked like.
With the advent of generative artificial intelligence, more AI-created material is appearing online without being explicitly tagged as such. Some observers have already begun to wonder what might happen if these systems start pulling in that generated content and basing what they create on the earlier output that users have already made available online.
New research in the journal Nature says that the ultimate result is bad. Researchers from the University of Oxford, the University of Cambridge, Imperial College London, the University of Toronto, the Vector Institute, and the University of Edinburgh looked at so-called large language models, which create new text through sophisticated statistical analysis of existing text.
They found that “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” In other words, AI-generated content begins to spoil the existing pools of training data when people indiscriminately use gen AI systems to create text and then post it on the Internet. When developing the latest versions of their products, software vendors pull in that AI-generated content and use it as new training data for their systems.
It’s like the ouroboros, the symbol of a serpent eating its own tail. The researchers described it as models “indiscriminately learning from data produced by other models,” presumably including previous versions of those same models. The official term is model collapse, and the effect works much like a social network echo chamber: people increasingly hear only the things that confirm what they already believe.
Probable events get overestimated and improbable events get underestimated; a model loses the ability to recognize the full range of available data and becomes “poisoned with its own projection of reality.”
As in an online discussion of politics, the models grow more certain even as reality becomes less well represented, because the algorithms keep feeding back what is already expected.
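The statistics behind that collapse can be seen in miniature. The following toy sketch in Python is not taken from the Nature study or its code; it simply refits a one-dimensional Gaussian “model,” generation after generation, on samples drawn from its previous version, with the sample size and generation count chosen arbitrarily for illustration. Over time the estimated spread shrinks toward zero, which is a small-scale picture of the distribution’s tails disappearing.

# Toy illustration of model collapse (not code from the Nature study):
# a Gaussian "model" is refit, generation after generation, only on data
# sampled from the previous generation's fit. The estimated spread shrinks,
# so rare tail values stop being generated at all.
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples = 200        # training-set size per generation (arbitrary choice)
n_generations = 2000   # how many times the model trains on its own output

mu, sigma = 0.0, 1.0   # generation 0: the "human-written" distribution N(0, 1)
for _ in range(n_generations):
    data = rng.normal(mu, sigma, n_samples)  # synthetic data from the current model
    mu, sigma = data.mean(), data.std()      # maximum-likelihood refit on that data

print(f"after {n_generations} generations: mu = {mu:.4f}, sigma = {sigma:.6f}")
# sigma ends up far below 1.0: values near the mean are over-represented,
# while the improbable tail values have effectively vanished.

Nothing in this sketch depends on the details of language models; the point is only that a model trained on its own output keeps narrowing what it can produce.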
The authors say that this degradation “is an inevitable outcome of AI models that use training datasets created by previous generations.” Apparently, there is no way to avoid the problem.