How Strong Are High-Fidelity Digital Humans at Live-Stream Selling?
2025-11-10
When a live broadcast approaches the six-hour mark and most e-commerce hosts are showing signs of fatigue, the hosts in this live-stream room are still answering viewers' questions with enthusiasm and occasionally dropping a joke or two to nudge viewers into placing orders.

With the rapid development of artificial intelligence, large models are quickly evolving toward multimodality, and digital humans have become an innovative application that integrates large language models with multimodal technologies. E-commerce live streaming is an excellent scenario for deploying digital humans. The technology lets businesses stream without committing large amounts of manpower and resources, significantly reducing costs such as venue rental, equipment procurement, and personnel training. Digital humans can also stream around the clock without interruption, which increases product exposure time and sales opportunities and improves economic returns.

However, traditional digital human generation technology often suffers from a disconnect among the speech, language, and visual modalities, which shows up as stiff dialogue, intonation that does not match the emotion, and monotonous facial expressions and gestures. Wang Haifeng, Baidu's Chief Technology Officer, said that to address these pain points Baidu has developed a script-driven, multimodal collaborative, high-fidelity digital human technology.

The foundation of a script is the dialogue. Generating dialogue is not only a matter of producing content; it must also fit the host's persona and language style so that expression stays personalized and consistent. In multi-host scenarios, semantic logic, intonation and rhythm, and emotional style must also be coordinated as a whole.
At the same time, to give the dialogue more depth, content planning, knowledge enhancement, and fact-checking mechanisms need to be introduced to reduce the risk of AI hallucinations. Based on the dialogue, the large model can directly generate a digital human live-streaming script. The script carries "visual tags" and "voice tags" that tell the system what actions the character speaking each line needs to take.

Strong interactivity is a defining characteristic of e-commerce live streaming. The naturalness of the synthesized speech is a key factor in how immersed viewers feel while interacting with the host. Viewers want to hear a voice with emotion and inflection, not a rigid, mechanical reading. Wang Haifeng said that, in response to this demand, Baidu has proposed a "text-controlled speech synthesis" solution. The text-controlled speech synthesis model not only reproduces a voice with high fidelity but can also combine the live-stream lines with the host's personal characteristics to turn the text into natural, expressive speech, so that digital humans do not merely produce sound but accurately convey subtleties such as intonation, pride, and emphasis.

Beyond interacting with users, digital human hosts also need to interact physically with products and with the space around them during a live stream. How is this achieved? The high-consistency, hyper-realistic long-video generation technology for digital humans can analyze and understand multimodal input signals such as historical video data, the script, speech information, and skeleton-driven motion, and on that basis generate highly expressive segments, segments with complex "human-object-scene" interactions, and segments with large movements and expressions.
The system can schedule these segments in a unified way over a long time span, ensuring that speech, mouth movements, expressions, and actions remain highly consistent and synchronized throughout.

Digital humans are now gradually moving out of the laboratory and into a range of application scenarios, and commercialization is accelerating markedly. It is foreseeable that, as key capabilities such as deep thinking and multimodal interaction advance, more and more digital humans will appear on screens and enter people's lives.

At the same time, industry experts point out that the "Draft Measures for the Supervision and Administration of Live E-commerce" proposes that when artificial intelligence or similar technologies are used to generate character images and videos for live marketing, the operator of the live-stream room should mark this prominently on the live-stream page and continuously remind consumers that the images and videos are generated by artificial intelligence or similar technologies, so that they are clearly distinguished from the names or likenesses of natural persons. Han Jizhong, a senior engineer at the Institute of Information Engineering of the Chinese Academy of Sciences, said that while embracing digital human technology, people need to set clear boundaries and guard against the use of high-fidelity technology for fraud or false advertising. Technological development must go hand in hand with legal and ethical constraints so that innovation stays on the right track and makes steady progress. (New Society)
Editor: Momo  Responsible editor: Chen zhaozhao
Source: Science and Technology Daily