
Data annotation: processing 'high-quality raw materials' for AI development

2025-02-01   

With the rapid development of artificial intelligence, the shortage of high-quality training data has gradually become a major bottleneck restricting the industry's progress, and the data annotation industry can provide strong impetus for the innovative development of artificial intelligence. The National Development and Reform Commission, the National Data Administration, the Ministry of Finance, and the Ministry of Human Resources and Social Security recently jointly issued the "Implementation Opinions on Promoting the High Quality Development of the Data Labeling Industry" (hereinafter referred to as the "Implementation Opinions"), which set development goals for 2027: the specialization, intelligence, and technological innovation capabilities of the data annotation industry will be significantly improved, the scale of the industry will grow substantially, and the compound annual growth rate will exceed 20%.

What is the current state of China's data annotation industry? What "thresholds" must be crossed for the industry to achieve high-quality development? Reporters from Science and Technology Daily conducted interviews on these questions.

"Simply put, the process of training large artificial intelligence models is like a teacher teaching students how to read," explained Zhang Tong, Vice Dean of the School of Computer Science and Engineering at South China University of Technology. Data annotation means "labeling" or "marking" data: professionals explain to the large model the label of each piece of data and the task it corresponds to. They "teach" the large model what the training data contains, annotating many kinds of data such as images, speech, and text. High-quality data annotation helps machines understand accurately, learn quickly, and train efficiently, significantly improving the accuracy and generalization ability of large models.
During the training of ChatGPT, the US artificial intelligence research organization OpenAI invested substantial resources in data annotation. To ensure that annotation tasks were completed to a high standard, to enable ChatGPT to better understand human instructions, and to ensure the accuracy and reliability of its large models, OpenAI hired numerous "teachers". These "teachers" ranged from general data annotators and professionals to doctoral-level experts.

According to Zhang Tong, data annotation is one of the core cornerstones of the development of artificial intelligence. The data annotation industry is an emerging industry that processes data through screening, cleaning, classification, annotation, labeling, and quality inspection. Its core task is to process raw data into high-quality "raw materials" that can be used to train artificial intelligence models. As a crucial part of training large models, he said, data annotation directly affects the performance of machine learning models and plays an important role in supporting the improvement of artificial intelligence capabilities.

In Zhang Tong's view, unprocessed raw data is only a potential resource, while data that has been annotated and processed can be effectively traded and circulated in the market, thereby fully unleashing the value of data as a factor of production. Cultivating and strengthening the data annotation industry is essential for improving the quality of data supply and promoting the innovative development of artificial intelligence.

Industry insiders believe that as artificial intelligence technology continues to mature and its application areas continue to expand, the data annotation industry will see a broader market, especially in emerging technology fields that show enormous potential, such as the low-altitude economy, smart cities, autonomous driving, and smart healthcare.
The industry has entered a stage of rapid development

The global data annotation market is currently in a period of rapid growth. In recent years, China's data annotation industry has also entered a stage of rapid development, with the industrial chain continuously improving and technological innovations gradually reaching market-oriented applications. According to estimates, the scale of China's data annotation industry reached around 80 billion yuan in 2023.

Seven cities that have undertaken the task of building data annotation bases, including Chengdu in Sichuan, Shenyang in Liaoning, Hefei in Anhui, and Changsha in Hunan, have made significant breakthroughs in areas such as large-model annotation and automated annotation. Changsha Information Industry Park, one of the first data annotation bases in Changsha, has attracted more than 10,000 digital enterprises in fields such as intelligent connected vehicles, data annotation, and network security, successfully building an artificial intelligence innovation center computing power service platform.

Guangdong is actively promoting pilot programs and base construction for data annotation and training, providing solid data support for large-model training. In September 2023, the pilot program for public data annotation and training in Guangdong Province was officially launched. At the Guangdong Public Data Labeling Base (Qingyuan), a group of outstanding companies in fields such as autonomous driving and government public data labeling, including Baidu, Yanhu Technology, and Haosida, have taken the lead in settling in. Driven by leading enterprises and the agglomeration effect of the digital economy industry, Qingyuan's data annotation industry is thriving.
"We take the digital economy industry as the core, cooperate closely with its leading enterprises, and are committed to building a national-level data annotation industry cluster and a demonstration zone for the integration of industry and education," said Li Yankang, the person in charge of the Guangdong Public Data Labeling Base (Qingyuan). He said that the Baidu AI Cloud (Qingyuan) Artificial Intelligence Basic Data Industrial Base, which has settled there, has so far brought in and incubated 5 data labeling enterprises and trained more than 300 professional data annotators. In the future, the base will continue to cultivate and incubate more excellent data annotation enterprises, promoting the continued growth of Qingyuan's data service industry.

The introduction of the Implementation Opinions will further improve the quality of data supply and help solve the shortage of high-quality data that hinders the development of the artificial intelligence industry.

There is still a significant gap in interdisciplinary talent

It is worth noting that as artificial intelligence applications continue to deepen, the demand for data annotation has become increasingly segmented and specialized.

In July 2024, Zhang Tong's team and Guangzhou Huayinkang Medical Group Co., Ltd. jointly established an AI Pathology Research Center at the Guangdong Provincial Laboratory of Artificial Intelligence and Digital Economy (Guangzhou) to develop a large artificial intelligence pathology model, enabling the AI model to diagnose patients like a professional doctor. In the data preprocessing stage, the center specially hired three senior chief physicians to annotate the data.

In professional fields such as healthcare and materials, the annotation process involves combinations of specialized objects and terminology, and only professional practitioners are competent to do the annotation work.
Moreover, annotation tasks are extremely time-consuming, labor-intensive, and resource-intensive. The annotation work cannot be completed overnight; it requires optimization and continuous iteration in real application scenarios to keep raising the model's level of intelligence, Zhang Tong said. He noted that China's data annotation industry still has a large talent gap and urgently needs to cultivate interdisciplinary data annotation talent. This is the "threshold" that the industry must cross to achieve high-quality development.

The Implementation Opinions make arrangements for strengthening the data annotation talent pool: using talent programs and technology projects as a starting point to cultivate and bring in high-end professional talent; developing (or revising) national occupational standards for professions related to artificial intelligence training and data annotation; and supporting mutual recognition of vocational qualifications and skill levels in the field of data annotation. These measures will provide support for the high-quality development of the data annotation industry.

Building a sound industrial ecosystem is equally important for the development of the data annotation industry. The Implementation Opinions propose to smooth the industry chain of data collection, annotation, and artificial intelligence application, and to promote the coordinated development of the upstream and downstream of the data annotation industry; to support leading enterprises and third-party organizations in building open-source platforms for data annotation that assist the development of small and medium-sized enterprises; and to cultivate a group of third-party institutions that serve the data annotation industry in areas such as human resources, supply-demand matching, international cooperation, and legal auditing, thereby improving the industry ecosystem.
"The future development of the data annotation industry can also consider the idea of 'using artificial intelligence to promote artificial intelligence', that is, allowing AI that has already completed its learning to feed back into data annotation work and improve efficiency. This is a research direction of great value that is worth exploring in depth," Zhang Tong said. He believes that the development of the data annotation industry is expected to accelerate the deep integration of the digital economy and the real economy, and to accelerate the formation of new quality productive forces. (New Society)

Edit: CAIYITONG  Responsible editor: MENGMENG

Source: Science and Technology Daily
