DeepSeek, a Chinese AI startup that emerged from the quantitative hedge fund High-Flyer Capital Management, has unveiled its latest innovation in artificial intelligence: the ultra-large model DeepSeek-V3. The model was made publicly available today via the Hugging Face platform under the company's licensing terms. With 671 billion parameters, DeepSeek-V3 uses a mixture-of-experts architecture designed to activate only the parameters needed for a given task.
In benchmarks conducted by DeepSeek, the new model outperformed prominent open-source alternatives such as Meta's Llama 3.1-405B, while closely matching the capabilities of closed models from established players such as Anthropic and OpenAI.
DeepSeek's ongoing ambition is to contribute to the development of artificial general intelligence (AGI), meaning models capable of comprehending and learning any intellectual task a human can perform. The release of DeepSeek-V3 marks a significant step towards closing the gap between open-source models and their closed-source counterparts.
The architecture of DeepSeek-V3 builds on its predecessor, DeepSeek-V2, and is centred on multi-head latent attention (MLA) and DeepSeekMoE. This design supports efficient training and inference by routing each token through "experts", specialised and shared smaller neural networks, so that only 37 billion of the model's 671 billion parameters are activated per token.
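To illustrate the general idea behind such an architecture, the following Python sketch shows how a mixture-of-experts layer can route each token to a small subset of experts, so that only a fraction of the total parameters are used per token. It is a minimal illustration, not DeepSeek's implementation; the expert count, layer sizes and top-k value are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Minimal mixture-of-experts sketch (illustrative only, not DeepSeek's code).
# A router scores each token and only the top-k experts run for that token,
# so just a fraction of the layer's total parameters are active per token.
class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts)
            ]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.router(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```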
To push performance further, DeepSeek introduced two key innovations with DeepSeek-V3. The first is an auxiliary-loss-free load-balancing strategy, which continuously monitors and adjusts the workload across the experts so that each is used effectively without degrading overall performance. The second is multi-token prediction (MTP), which allows the model to forecast several future tokens at once, improving training efficiency and enabling a generation rate of 60 tokens per second.
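The load-balancing idea can be sketched roughly as follows: keep a per-expert bias that only influences which experts are chosen, and nudge it after each step so overloaded experts become less likely to be picked, removing the need for an auxiliary balancing loss. The update rule and step size below are assumptions for illustration, not DeepSeek's published method.

```python
import torch

# Hypothetical sketch of auxiliary-loss-free load balancing: a bias term steers
# expert selection and is adjusted after each step based on observed load.
def route_with_bias(scores: torch.Tensor, expert_bias: torch.Tensor,
                    top_k: int = 2, step_size: float = 0.01):
    # scores: (tokens, n_experts) raw router scores for one batch
    biased = scores + expert_bias            # bias affects selection only
    _, idx = biased.topk(top_k, dim=-1)      # experts chosen per token

    # Count how many tokens each expert received in this step.
    n_experts = scores.shape[-1]
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    target = load.mean()

    # Lower the bias of overloaded experts, raise it for underloaded ones.
    expert_bias = expert_bias - step_size * torch.sign(load - target)
    return idx, expert_bias

# Usage: start from a zero bias and carry it across training steps.
# expert_bias = torch.zeros(n_experts)
# idx, expert_bias = route_with_bias(router_scores, expert_bias)
```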
DeepSeek said in its technical documentation that DeepSeek-V3 was trained on a dataset of 14.8 trillion high-quality, diverse tokens. Training included a two-stage context-length extension, first raising the maximum context length to 32,000 tokens and then to 128,000 tokens. After pre-training, the model was fine-tuned further to align it with human preferences.
The training of DeepSeek-V3 was notably economical, costing approximately $5.57 million for roughly 2.8 million GPU hours on Nvidia H800 hardware. This is in stark contrast to the far larger sums spent on other large language models; Llama 3.1, for instance, is estimated to have cost more than $500 million to train.
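The reported figure can be sanity-checked with a back-of-the-envelope calculation, assuming a rental rate of roughly $2 per H800 GPU hour; that rate is an assumption for illustration, not a number quoted above.

```python
# Back-of-the-envelope check of the reported training cost.
gpu_hours = 2.8e6        # ~2.8 million H800 GPU hours, as reported
rate_per_hour = 2.0      # assumed USD per H800 GPU hour (illustrative)
total_cost = gpu_hours * rate_per_hour
print(f"Estimated training cost: ${total_cost / 1e6:.2f}M")  # -> about $5.6M
```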
Current benchmarks indicate that DeepSeek-V3 is the strongest open-source model available. It surpassed other open models and performed especially well on Chinese-language and mathematics assessments, scoring 90.2 on the MATH-500 test against 80 for its nearest rival. It was, however, outperformed by Anthropic's Claude 3.5 Sonnet on several specific benchmarks.
The advances in DeepSeek-V3 suggest that the performance gap between open and closed-source models is narrowing, a development that could significantly influence the AI landscape by giving businesses a wider choice of models to integrate into their operations.
The code for DeepSeek-V3 is available on GitHub under an MIT licence, and commercial API access is being offered at introductory pricing until February 8. After that date, the cost will rise to $0.27 per million input tokens and $1.10 per million output tokens, bringing it more in line with standard pricing in the AI market.
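To give a sense of what that pricing means in practice, the short calculation below estimates the bill for an illustrative workload at the post-introductory rates; the token volumes are hypothetical.

```python
# Cost estimate at the post-February-8 API prices quoted above.
INPUT_PRICE = 0.27 / 1_000_000    # USD per input token
OUTPUT_PRICE = 1.10 / 1_000_000   # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Example: 10 million input tokens and 2 million output tokens in a month.
print(f"${estimate_cost(10_000_000, 2_000_000):.2f}")  # $2.70 + $2.20 = $4.90
```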
Source: Noah Wire Services