Accelerate the construction of high-quality datasets for artificial intelligence

2025-02-10

Artificial intelligence is in a critical period of rapid development and is reshaping the model of economic and social development. The 2024 Central Economic Work Conference called for implementing the "Artificial Intelligence+" initiative and cultivating future industries. Data is one of the three core elements of artificial intelligence development: it is the basic input for training models and the core resource for applying them. Accelerating the construction of high-quality datasets for artificial intelligence is therefore of great significance for putting "AI+" scenarios into practice.

Resolving the problems in the construction of high-quality datasets and the supply of high-quality data is key to accelerating the development of the new generation of artificial intelligence. At present, data supply for the new generation of artificial intelligence is still short, specialized data processing technologies need further breakthroughs, the data industry and data ecosystem need to be enriched, and the overall planning and supporting policies for high-quality datasets need to be improved.

First, high-quality data remains in short supply in the general domain, vertical domains, and the embodied intelligence domain. On the one hand, Chinese public data lags behind English data in both quality and quantity. On the other hand, public data in China needs to be opened up and used more fully, and different regions have no unified standard for opening it. High-quality industry datasets designed specifically for the development of artificial intelligence are still relatively scarce.
Real interactive data in the field of embodied intelligence is insufficiently collected, mainly because interaction data between intelligent robots and their environment is difficult and costly to obtain, and because enterprises lack unified reference standards for data collection.

Second, the technologies for synthesizing, processing, and using high-quality data urgently need improvement. Technologies that use deep learning and reinforcement learning to generate high-precision, diverse synthetic data urgently need breakthroughs in maturity and scope of application. As society becomes more automated and intelligent, the requirements for data processing keep rising, so processing technologies for structured, semi-structured, and unstructured data need iterative optimization to further improve processing efficiency.

Third, data entities and business models are not yet mature. China lacks high-quality data aggregation and governance entities comparable to those under the "data+artificial intelligence" model in the United States, and too few companies have large-scale data aggregation, management, and analysis capabilities. Entities authorized to operate public data in fields such as healthcare, law, insurance, finance, industry, and scientific research are still being cultivated, and business models for building, operating, and using datasets are not yet mature.

Finally, dedicated planning and supporting policies for high-quality datasets need to be improved. China has issued a series of guidelines and policies on data development, but it has not yet introduced a dedicated plan or supporting policies for high-quality datasets for training new-generation artificial intelligence models and for scenario applications.
Measures for their construction, operation, circulation, and utilization need to be further refined. In data collection, applicable standards and specifications are lacking for data in various fields; in data usage, the lack of mechanisms promoting data sharing and circulation for training large models and embodied intelligence models has to some extent limited the rapid improvement of model capabilities.

Multiple measures should be taken to build high-quality datasets. In response to the current problems in resources, technology, business models, and policy systems, and in light of the needs of the new generation of artificial intelligence, it is recommended to leverage the synergy between the government and the market and to promote the construction of high-quality datasets through multiple measures.

First, accelerate the opening of public data and the circulation of enterprise data, and build high-quality datasets for the new generation of artificial intelligence. It is suggested that a collaborative mechanism spanning departments, industries, and regions be established to focus on building high-quality datasets, expand the scope and scale of data supply, improve public and industry data standards, and accelerate the construction of trusted data spaces. Build big data centers and large-model industry application innovation (engineering) centers for key fields such as healthcare, education, scientific research, law, industry, agriculture, logistics, finance, energy, and transportation, so as to break down information silos, build a complete data ecosystem, construct high-quality datasets, and enhance the capabilities of vertical artificial intelligence models.
Focusing on the needs of future industries such as autonomous driving and embodied intelligence, open up relevant public data, develop industry data standards, explore mechanisms for data circulation between enterprises, and encourage enterprises and research institutions to create high-quality industry datasets.

Second, step up efforts to tackle the key technologies for building high-quality datasets. Accelerate the development of key common technologies: for data synthesis and governance, focus on data synthesis and processing technologies; for data circulation and aggregation, vigorously promote technologies such as privacy computing and blockchain; for the "data+artificial intelligence" application model, focus on developing data management technology and exploring new model structures and training architectures. Encourage data product and service enterprises oriented toward artificial intelligence to take the lead in undertaking major national projects and to carry out applied basic research and breakthroughs in key core technologies. Promote industry-university-research cooperation and the construction of innovation consortia, and create a new cooperation model that deeply integrates data technology, products, and services. Targeting key scenarios, create "testing grounds" for data technology that provide real data environments and simulated application scenarios, establish pilot bases, attract enterprises, universities, and research institutions to take part in innovating and verifying data technology, and accelerate the promotion and application of new technologies.

Third, guide innovation by enterprises and in business models, and build an artificial intelligence data industry ecosystem.
Vigorously cultivate enterprises across fields such as artificial intelligence data resources, technology, services, applications, security, and infrastructure, and focus on building data industry innovation platforms for the artificial intelligence industry. Encourage enterprises to explore multi-domain business models based on "data+artificial intelligence", support cooperation between enterprises and all other parties, and create industrial innovation chains and ecosystems based on high-quality datasets. Encourage enterprises to explore large-scale models and specific intelligent application scenarios to drive the development of the data industry. Support enterprises engaged in model application, model development, data services, data products, and related areas in forming innovation consortia, developing high-quality datasets, and developing new business formats such as "data-as-a-service", "knowledge-as-a-service", and "model-as-a-service".

Fourth, increase policy support for the construction of high-quality datasets for artificial intelligence. To meet the development and application needs of new-generation artificial intelligence technology, improve the data resource construction system, cultivate the data industry, support the development of data technology, systematically promote the construction of high-quality datasets, and strengthen industry applications. Coordinate central and local fiscal funds, industry guidance funds, and various policy investments, and increase investment in the construction of high-quality datasets. Encourage financial institutions to innovate in products and services and to increase financing support for data-related enterprises. Guide social capital to participate in the development and utilization of high-quality artificial intelligence datasets in an orderly manner.

(Liao Xinshe) Author: Wang Xiaoming (Researcher of the Chinese Academy of Sciences Science and Technology Strategy Consulting Institute)

Editor: Luo yu    Responsible editor: Wang xiao jing

Source: stdaily.com
