OpenAI recently introduced new restrictions aimed at preventing unauthorized use of its API. While the official explanation points to “misuse,” the real concern may run deeper: rivals using OpenAI’s own outputs to train their AI.
A new study suggests DeepSeek is copying OpenAI’s homework — literally.
Copyleaks reveals suspicious similarities
Copyleaks, a company specializing in AI content detection, analyzed the stylistic fingerprints of various large language models. According to its research, an astonishing 74% of outputs from the Chinese model DeepSeek-R1 were classified as matching OpenAI's writing style.
“This doesn’t just suggest overlap — it implies imitation.”
In contrast, other major models such as Microsoft's Phi-4 and xAI's Grok-1 showed almost no similarity. Copyleaks classified 99.3% and 100% of their outputs, respectively, as having "no agreement" with OpenAI's style, indicating they were likely trained independently.
Mistral’s Mixtral showed some overlap, but DeepSeek’s numbers were off the charts.
AI fingerprints are hard to hide
Even when an AI model is prompted to write in varied styles, it leaves behind detectable signatures, like linguistic fingerprints. These traces persist across formats, topics, and tasks, and Copyleaks' system can now match them back to their source with surprising accuracy.
This tech could change the game in policing unauthorized model training, enforcing licenses, and defending intellectual property.
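Copyleaks hasn't published how its detector works, but the core idea, stylometric attribution, can be illustrated with a toy classifier. The sketch below is not Copyleaks' system: it trains a character n-gram model (scikit-learn assumed) to guess which model produced a text, and every sample text and label is invented for illustration.

```python
# Toy stylometric attribution: guess which model wrote a text.
# Illustrative sketch only; not Copyleaks' actual detector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: outputs labeled by the model that produced them.
texts = [
    "Certainly! Here's a step-by-step explanation of the process.",
    "Sure, let's walk through this carefully, one step at a time.",
    "Answer: compute the sum directly from the given values.",
    "Result: the claim follows immediately from the definition.",
]
labels = ["model_a", "model_a", "model_b", "model_b"]

# Character n-grams capture punctuation and phrasing habits that persist
# across topics, which is what makes style work as a "fingerprint".
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Attribute an unseen output to its closest stylistic match.
print(clf.predict(["Certainly! Let's go through it step by step."]))
```

A production system would train on vastly more labeled output and report a confidence score rather than a hard label, but the attribution principle is the same.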
OpenAI responds with stricter verification
OpenAI has stayed relatively quiet but hinted at its motivations: “Unfortunately, a small minority of developers intentionally use the OpenAI APIs in violation of our usage policies.”
And the company went even further earlier this year, after DeepSeek's new reasoning models made their splashy debut:
“We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models.”
Distillation or duplication?
Model distillation, the process of training a new model on the outputs of another, is common in AI. But doing it with proprietary outputs, without permission, could violate the provider's terms of service.
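In its black-box form, distillation can be as simple as fine-tuning a smaller model on text another model generated. The sketch below is a minimal illustration of that idea, not DeepSeek's or anyone's actual pipeline: it assumes the Hugging Face transformers library, uses "gpt2" as a stand-in student, and hardcodes two invented "teacher" responses where a real pipeline would query the teacher at scale.

```python
# Minimal sequence-level distillation sketch: fine-tune a "student" LM
# on text generated by a "teacher" model. All data here is invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: collect teacher outputs. In practice these would come from
# querying the teacher model, which is exactly where terms-of-service
# questions arise if the teacher is a proprietary API.
teacher_outputs = [
    "Q: What is distillation?\nA: Training a small model to mimic a larger one.",
    "Q: Why distill?\nA: Smaller models are cheaper to run at similar quality.",
]

# Step 2: fine-tune the student on those outputs with the ordinary
# causal language-modeling objective (predict the next token).
tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in student model
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for text in teacher_outputs:
    batch = tok(text, return_tensors="pt", truncation=True, max_length=128)
    # Setting labels = input_ids trains the student to reproduce the
    # teacher's text token by token.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Nothing in that loop is exotic; the legal question is entirely about where teacher_outputs came from.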
DeepSeek claims its R1 model used open-source data, though it doesn’t deny overlap with OpenAI. When asked about potential imitation, DeepSeek offered no comment.
A question of consent and competition
Critics argue OpenAI isn’t entirely innocent — its early models were reportedly trained using scraped internet content without consent. But some experts believe there’s a key difference.
“It really comes down to consent and transparency,” said Alon Yamin, CEO of Copyleaks.
Using human content without permission is murky. But training on the outputs of other proprietary AI systems? That’s more like reverse-engineering a competitor’s product.
Yamin believes that training on OpenAI outputs raises competitive risks, as it essentially transfers hard-earned innovations without the original developer’s knowledge or compensation.
What’s next for AI ownership?
As the AI arms race intensifies, so does the debate over data rights, consent, and originality. New tools like Copyleaks’ digital fingerprinting may help resolve disputes — or spark new ones.
For OpenAI and its rivals, that’s both a shield and a signal: the age of anonymous AI imitation may be coming to an end.