DeepSeek, a prominent Chinese artificial intelligence laboratory, has made headlines with the launch of its new AI model, DeepSeek V3. The company claims the model outperforms many peers on standard benchmarks, demonstrating strength in text-based tasks such as coding and essay writing. Released earlier this week, DeepSeek V3 is described as both large and efficient.
However, the unveiling of DeepSeek V3 has raised eyebrows because of the model's peculiar tendency to identify itself as ChatGPT, OpenAI's well-known AI-powered chatbot. In tests performed by TechCrunch, as well as in posts from users on X, DeepSeek V3 insisted that it is a version of OpenAI's GPT-4 model released in June 2023.
This self-identification has sparked questions about the model's training process. When prompted for information about DeepSeek's application programming interface (API), DeepSeek V3 instead provided instructions for using OpenAI's API, fuelling concerns about its originality and its possible reliance on a rival's outputs. It has even displayed the same sense of humour as GPT-4, repeating some of the same jokes down to the punchlines.
Experts believe the behaviour stems from the statistical nature of such models. According to Mike Cook, a research fellow at King's College London, DeepSeek V3 may have been trained on publicly available datasets containing text generated by GPT-4, which could lead the model to memorise some of those outputs and reproduce them verbatim during interactions. Cook described this as potentially harmful to model quality, noting that training on another model's outputs can produce "hallucinations" and inaccurate responses. "Like taking a photocopy of a photocopy, we lose more and more information and connection to reality," he said.
DeepSeek's training practices may also carry legal implications. OpenAI's terms of service explicitly prohibit users from employing the output of its products to develop competing models. Neither DeepSeek nor OpenAI has commented on the issue at this time. However, Sam Altman, OpenAI's CEO, appeared to address the situation on X, suggesting that while it is relatively easy to copy something that already works, it is far harder to create something new and innovative.
DeepSeek V3 is not the first AI model to misidentify itself; similar behaviour has been documented in models such as Google's Gemini, which has also claimed to be a competing product under certain prompts. The proliferation of AI-generated content across the web further complicates matters, since most AI systems are trained on vast quantities of data scraped from online sources. By one estimate, 90% of web content may be AI-generated by 2026, raising the risk of contamination in training datasets.
According to Heidy Khlaaf, chief AI scientist at the AI Now Institute, the cost savings of training on an existing model's outputs can make the approach compelling for developers. She noted that models may inadvertently incorporate ChatGPT or GPT-4 outputs into their training sets. Even so, Khlaaf explained, in a web awash with AI-generated content, a model that merely absorbed such material by accident would not necessarily produce responses reflecting the distinctive qualities of OpenAI's customised messages.
The most likely scenario, then, appears to be that a significant amount of ChatGPT- and GPT-4-derived data entered DeepSeek V3's training set, which would explain its confused self-identification. More troubling, this reliance on absorbed outputs risks amplifying the biases and flaws present in GPT-4, potentially undermining the credibility of DeepSeek V3's performance across tasks.
Source: Noah Wire Services