Computer Industry Weekly: Doubao Real-Time Voice Model and Doubao 1.5 Pro Go Live; U.S. Government Joins Tech Giants to Launch "Stargate"

DATE: Jan 30 2025

Computing power: Hengyuan Cloud's 13 cores + 128G capacity is in tight supply

This week, Hengyuan Cloud's 13 cores + 128G configuration was in tight supply. For the A100-40G graphics card configuration, Tencent Cloud's 16 cores + 96G is priced at 28.64 yuan/hour and Alibaba Cloud's 12 cores + 94 GiB at 31.58 yuan/hour. For the A100-80G configuration, Hengyuan Cloud's 13 cores + 128G is in tight supply, while Alibaba Cloud's 16 cores + 125 GiB is priced at 34.74 yuan/hour. For the A800-80G configuration, Hengyuan Cloud's 16+256G is priced at 9.00 yuan/hour.
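To make the listed prices easier to compare, they can be converted into the cost of one day of continuous use. The sketch below is illustrative only: the `Quote` type and the `daily_cost` helper are names introduced here, and the configurations differ in GPU model, core count, and memory, so the output is not a like-for-like ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    provider: str
    gpu: str
    config: str
    yuan_per_hour: float


# Hourly prices as listed in this week's report.
# Hengyuan Cloud's A100-80G offer is omitted because no price is quoted for it.
QUOTES = [
    Quote("Tencent Cloud", "A100-40G", "16 cores + 96G", 28.64),
    Quote("Alibaba Cloud", "A100-40G", "12 cores + 94 GiB", 31.58),
    Quote("Alibaba Cloud", "A100-80G", "16 cores + 125 GiB", 34.74),
    Quote("Hengyuan Cloud", "A800-80G", "16+256G", 9.00),
]


def daily_cost(quote: Quote, hours: float = 24.0) -> float:
    """Cost in yuan of running one instance for `hours` hours."""
    return round(quote.yuan_per_hour * hours, 2)


# Print the quotes from cheapest to most expensive hourly rate.
for q in sorted(QUOTES, key=lambda q: q.yuan_per_hour):
    print(f"{q.provider:<15} {q.gpu:<9} {q.config:<19} {daily_cost(q):>8.2f} yuan/day")
```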

AI application: Doubao team releases the latest real-time voice model and large model 1.5 Pro version

On January 20, ByteDance's Doubao team officially released the Doubao real-time voice model, an integrated speech understanding and generation model that enables end-to-end voice dialogue. Compared with the traditional cascade approach, the model performs strongly in speech expressiveness, controllability, and emotional responsiveness, offers low latency, and can be interrupted at any point in a conversation. These features make the model more flexible and efficient in practice and better suited to users' needs across different scenarios.

In terms of architecture, the Doubao team has developed an end-to-end framework that deeply integrates the speech and text modalities, models speech generation and understanding jointly, and supports multimodal input and output. The model supports a variety of modes, including speech-to-speech (S2S), speech-to-text (S2T), text-to-speech (T2S), text-to-text (T2T), and more. In terms of voice control, the model can follow not only basic instructions but also complex ones, showing strong control over its spoken output. In terms of voice role-play, the model can mimic a variety of dialects and accents.
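The four named modes are simply every pairing of an input modality with an output modality. A minimal sketch of that enumeration follows; the `mode_label` helper and the `MODES` table are illustrative names introduced here, not part of any Doubao API, and the sketch covers only the four modes the text names.

```python
from itertools import product

MODALITIES = ("speech", "text")


def mode_label(source: str, target: str) -> str:
    """Build the short label used in the text, e.g. speech -> text gives 'S2T'."""
    return f"{source[0].upper()}2{target[0].upper()}"


# Every (input, output) pairing of the two modalities.
MODES = {mode_label(s, t): (s, t) for s, t in product(MODALITIES, repeat=2)}

for label, (source, target) in MODES.items():
    print(f"{label}: {source} in, {target} out")
```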

In addition, the Doubao real-time voice model supports real-time web access, dynamically retrieving the latest information relevant to a question so that answers stay up to date. According to the evaluation results, the Doubao real-time voice model has significant advantages in emotional understanding and emotional expression. On overall satisfaction, the Doubao real-time voice model scored 4.36, while GPT-4o scored 3.18. The new real-time voice call feature is now fully available in the Doubao app.

On January 22, 2025, the Doubao large model 1.5 Pro was officially released. The model uses a MoE architecture and, through an integrated training-inference design, seeks a balance between model quality and inference performance. With only a small number of activated parameters, Doubao-1.5-pro outperforms top-tier ultra-large dense pre-trained models and achieves excellent results across multiple benchmarks. The specific highlights are as follows:

1) Leading comprehensive capabilities: Doubao-1.5-pro posts world-leading results on public benchmarks for knowledge (MMLU_PRO, GPQA), code (McEval, FullStackBench), reasoning (DROP), and Chinese (CMMLU, C-Eval), among others.

2) Efficient model structure and ultra-low cost: Doubao-1.5-pro is pre-trained with a small number of activated parameters, which keeps training cost extremely low while maintaining strong performance. It adopts a large-scale sparse MoE architecture whose performance is equivalent to that of a dense model with 7 times the activated parameters, far exceeding the roughly 3x leverage that MoE architectures conventionally achieve in the industry. A self-developed server cluster solution flexibly supports low-cost chips, greatly reducing hardware cost compared with industry solutions. A self-developed network card and network protocol significantly improve the efficiency of small-packet communication, and efficient overlap of computation and communication at the operator level ensures stable, efficient multi-machine distributed inference. Techniques such as fine-grained quantization and PD (prefill-decode) separation, together with flexible use of compute and mixed multi-task scheduling, deliver higher compute utilization.

3) Comprehensive improvement of multimodal capabilities: In vision, Doubao-1.5-pro improves on the previous version across multimodal data synthesis, dynamic resolution, multimodal alignment, and hybrid training. These changes further strengthen the model's visual reasoning, text and document recognition, fine-grained information understanding, and instruction following, and make its responses more concise and friendly. In speech, the team proposes a new Speech2Speech end-to-end framework that natively integrates the speech and text modalities and achieves genuinely end-to-end speech understanding and generation in voice dialogue, a qualitative leap in dialogue quality over the traditional ASR+LLM+TTS cascade approach.

4) Stronger deep thinking ability: Building on the Doubao 1.5 base model, and through breakthroughs in RL algorithms and engineering optimization, the team developed the Doubao deep thinking model without using data from other models. The interim version, Doubao-1.5-Pro-AS1-Preview, has achieved industry-leading results on AIME.
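Highlight 2 above turns on two ideas: a sparse MoE layer activates only a few experts per token, and the resulting "leverage" is the size of the dense model it is said to match divided by its activated parameters. The sketch below illustrates both under stated assumptions. It uses a generic top-k gating scheme, which is a common MoE technique and not a disclosed detail of Doubao's design, and the 10-billion activated-parameter figure is purely hypothetical because the report gives no parameter counts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def dense_equivalent(activated_params, leverage):
    """Size of the dense model that a sparse model is said to match."""
    return activated_params * leverage


# Hypothetical gate scores for one token over four experts; only two are activated.
weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print("active experts and weights:", weights)

# Hypothetical activated-parameter count (not from the report).
activated = 10e9
print(f"7x leverage: matches a {dense_equivalent(activated, 7) / 1e9:.0f}B dense model")
print(f"3x leverage: matches a {dense_equivalent(activated, 3) / 1e9:.0f}B dense model")
```

With the same activated parameters, and therefore similar per-token compute, higher leverage means the sparse model is claimed to stand in for a proportionally larger dense model.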

It is important to note that Doubao did not use data generated by any other model at any stage of model training. The Doubao model relies on an independent data production system that combines an annotation team with model self-play techniques to optimize data quality efficiently, increase the diversity and difficulty of annotated data, and ensure the independence and reliability of its data sources.

AI Financing Trends: The U.S. government, OpenAI, SoftBank, and Oracle jointly launched the "Stargate" project, with a planned investment of $500 billion over four years

On January 22, Beijing time, newly inaugurated U.S. President Donald Trump announced a partnership with OpenAI, Oracle, and SoftBank to jointly invest $500 billion in building artificial intelligence infrastructure in the United States, a project called "Stargate". According to the plan, the participants will form a joint venture with an initial investment of $100 billion, and total investment may reach $500 billion over the next four years. Oracle co-founder Larry Ellison said that the first joint project is a data center in Texas, where work has already begun.

OpenAI says the project will not only support the reindustrialization of the United States but will also provide strategic support for protecting the national security of the United States and its allies. According to OpenAI's statement, SoftBank's Masayoshi Son will serve as chairman of the joint venture's board of directors. SoftBank and OpenAI are the main partners in the project, with SoftBank providing financial support and OpenAI responsible for operations. Arm, Microsoft, Nvidia, Oracle, and OpenAI will be the key technology partners.

Investment Advice

On January 27, DeepSeek reached first place on the free chart of Apple's App Store in China, a milestone for domestic large models in overtaking their rivals.

Instead of using the supervised fine-tuning (SFT) training paradigm common in the industry, DeepSeek R1 applies reinforcement learning directly, allowing the model to develop complex reasoning capabilities on its own, including reflection and long chains of thought. Compared with OpenAI's o1, the DeepSeek model cuts the input cost per million tokens from $15 to $0.55 and the output cost from $60 to $2. With the dual attributes of open source and high cost performance, DeepSeek will accelerate AI's shift from the training era to the inference era and further promote the development of AI software and hardware.
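The price gap quoted above can be made concrete with a short calculation. The per-million-token prices below are the ones quoted in the text; the workload size (10 million input tokens and 2 million output tokens) and the helper names `workload_cost` and `price_ratio` are assumptions introduced for illustration.

```python
# USD per million tokens, as quoted in the text.
PRICES = {
    "OpenAI o1": {"input": 15.00, "output": 60.00},
    "DeepSeek R1": {"input": 0.55, "output": 2.00},
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload with the given token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def price_ratio(kind: str) -> float:
    """How many times more o1 costs than DeepSeek R1 for `kind` tokens."""
    return PRICES["OpenAI o1"][kind] / PRICES["DeepSeek R1"][kind]


# Hypothetical workload: 10M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 10_000_000, 2_000_000):,.2f}")
print(f"input price ratio:  {price_ratio('input'):.1f}x")
print(f"output price ratio: {price_ratio('output'):.1f}x")
```

On the quoted prices, o1 costs roughly 27 times more per input token and 30 times more per output token.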

We remain firmly optimistic about opportunities in AI software and hardware, and recommend paying attention to iFLYTEK (002230.SZ), Cambricon (688256.SH), Dingtong Technology (688668.SH), whose high-speed communication connector business may benefit significantly from GB200, Emdoor Information (001314.SZ), and others.

Risk Warning:

1) The underlying AI technology iterates more slowly than expected. 2) Policy, regulatory, and copyright risks. 3) AI applications are adopted less effectively than expected. 4) The performance of the recommended companies falls short of expectations.
