China has mountains of ancient books, and repairing and sorting them all by hand would take years. With AI, this daunting task can now be done faster and more accurately, making the digitization of this literature far easier.
Twenty times faster
Liu Shuai, a PhD candidate in classical philology at East China Normal University, said he sorted through two million Chinese characters from ancient books in a single month last September, with AI assisting the work.
That was 20 times faster than he had previously managed for similar work, Liu said. It was made possible by Shidianguji, a smart platform jointly developed by Peking University (PKU) and ByteDance to digitize ancient Chinese books.
Character recognition, text proofreading, structure organization and punctuation proofreading are the major steps in sorting ancient books, and all are steps where AI can save human labor.
When a picture of a page from an ancient book is uploaded to the platform, its Optical Character Recognition (OCR) technology automatically tags the places, book names, dates, and people's names and titles that appear in it. OCR marks characters it is uncertain about in different colors to show their position on the page, so that they can be corrected against the original text.
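The color marking described above can be pictured with a minimal sketch. It assumes the OCR step returns a confidence score between 0 and 1 for each character; the article does not say how Shidianguji measures uncertainty, so the thresholds, color names and the RecognizedChar structure below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognizedChar:
    """One OCR result (hypothetical structure, not the platform's format)."""
    char: str
    row: int
    col: int
    confidence: float  # assumed to lie in [0, 1]


def highlight_color(confidence, review_below=0.9, doubtful_below=0.6):
    """Pick a highlight color for a character; thresholds are illustrative."""
    if confidence < doubtful_below:
        return "red"
    if confidence < review_below:
        return "yellow"
    return None  # confident enough, no highlight


def flag_uncertain(chars):
    """Return (row, col, char, color) for every character needing review."""
    flagged = []
    for c in chars:
        color = highlight_color(c.confidence)
        if color is not None:
            flagged.append((c.row, c.col, c.char, color))
    return flagged


# Made-up confidence values for three characters on one page.
page = [
    RecognizedChar("刑", 0, 0, 0.98),
    RecognizedChar("法", 0, 1, 0.72),
    RecognizedChar("志", 0, 2, 0.41),
]
flags = flag_uncertain(page)
```

A proofreader would then only need to compare the flagged positions with the page image, rather than rereading every character.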
In the project Liu took part in, AI handled the first steps of sorting, volunteers from the public then did the proofreading, and experts dealt with the difficult parts left unresolved by the earlier steps.
AI changed the workflow, Liu said: this "subcontracting" system made the process easier and turned traditional book-sorting workshops into factories with assembly lines.
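The three-stage division of labor can be sketched as a simple routing loop. This is an illustration only, not the platform's implementation: the Passage structure and the ai_draft, volunteer_check and expert_resolve callables are hypothetical stand-ins for the AI first pass, the volunteer proofreading and the expert review.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Passage:
    """A unit of work moving through the pipeline (hypothetical structure)."""
    image_id: str
    text: str = ""
    resolved: bool = False
    handled_by: List[str] = field(default_factory=list)


def run_pipeline(
    passages: List[Passage],
    ai_draft: Callable[[Passage], str],
    volunteer_check: Callable[[Passage], Optional[str]],
    expert_resolve: Callable[[Passage], str],
) -> List[Passage]:
    """AI drafts everything, volunteers proofread, experts take the rest."""
    expert_queue = []
    for p in passages:
        p.text = ai_draft(p)
        p.handled_by.append("ai")
        corrected = volunteer_check(p)  # None means the volunteer is unsure
        p.handled_by.append("volunteer")
        if corrected is None:
            expert_queue.append(p)
        else:
            p.text = corrected
            p.resolved = True
    for p in expert_queue:
        p.text = expert_resolve(p)
        p.handled_by.append("expert")
        p.resolved = True
    return passages


# Stub stages for demonstration; real stages would involve models and people.
results = run_pipeline(
    [Passage("page-1"), Passage("page-2-damaged")],
    ai_draft=lambda p: "draft:" + p.image_id,
    volunteer_check=lambda p: None if "damaged" in p.image_id
    else p.text.replace("draft", "checked"),
    expert_resolve=lambda p: "expert:" + p.image_id,
)
```

In this sketch only the passages a volunteer cannot settle reach the expert queue, which is the point of the assembly-line comparison: scarce expert time is spent on the hard cases alone.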
Higher recognition accuracy
Ancient books are often marred by creases, missing words or faded ink, which makes them extremely difficult to decipher. Traditional character recognition software is designed for printed materials and struggles with ancient books, where a single Chinese character may be written in different ways and styles, and where the page also contains strokes and symbols that are not characters.
These challenges are the main reasons large libraries have been slow to digitize their ancient books. Recognition accuracy rose sharply once AI was brought in to handle strokes and shapes that are difficult to recognize.
AI recognition of ancient books is like restoring old photos. The project Liu participated in last year managed to restore the ancient book The Book of Han: Treatise on Punishment and Law, discovered in Dunhuang, northwest China's Gansu province.
AI learned the characters, styles of strokes and texture of pages of the original ancient book, and restored the missing characters based on the original font, color and background, striving to make the restored sections as close as possible to the original ancient book.
The accuracy of the platform's automatic AI punctuation surpassed 90 percent in tests, and its translation of ancient texts also reached expert level, according to Yang Hao, deputy director of the Research Center for Digital Humanities of PKU.
Wang Yu, who is responsible for the ancient book project products at the corporate social responsibility department of Douyin Group, ByteDance, said the team had simplified recognition work with a one-click process that takes a user to the original text for manual comparison and calibration.
The team is also upgrading its algorithms for recognizing handwritten characters, variant characters, complicated formats and illustrations, aiming to keep improving recognition accuracy, Wang added.
Experts for sorting ancient books needed
There are only around 10,000 people working on sorting ancient books in China currently, according to Wu Guowu, deputy secretary general of the ancient book sorting and research committee of Chinese universities.
This is far from enough, given the huge number of ancient books. It is estimated that there are over 200,000 categories, 500,000 extant versions, and more than 3.2 million volumes of ancient Chinese books.
AI's involvement in sorting ancient books is also changing how experts in the field are trained. Wu said that most classical philology programs at universities have introduced interdisciplinary courses in digital humanities, and seven universities have applied to establish a bachelor's degree in digital humanities, in which ancient book sorting is a crucial component.
Liu once worried that he might lose his job, as AI works at a speed humans cannot match. But he has since changed his mind. AI recognition of ancient books relies on high-quality data organized by humans, Liu said, adding that humans are still needed to explore the mysteries of ancient books and pass down cultural heritage, no matter how technology evolves.
Source: Science and Technology Daily