Posted by: Hilde de Weerdt 6 years, 5 months ago


Isn't the Siku quanshu enough?
Reflections on the impact of new digital tools for classical Chinese

At a recent workshop, a Chinese cultural historian whom I hold in high esteem raised the following question: "Isn't the Siku quanshu enough?" The implication was that the search functionality of one of the largest digital corpora of classical Chinese texts has made a great contribution to Chinese cultural studies, that this is sufficient, and that no more precious research time should be spent on the creation, application, and revision of digital tools. The position is representative of a good proportion of humanities scholars. We have all become avid users of databases and search engines, but we are concerned about the digitization of everything. Below I respond to the specific question regarding the Siku quanshu; the points I raise can also be read as a response to the more general question of why humanities scholars should not rely on a limited set of large commercial text databases and why they should take an active interest in the question of which databases and which tools can best serve humanities research questions and methods in the future.

When it first started appearing in libraries in the late 1990s and early 2000s, the electronic edition of the Wenyuange Siku quanshu (Siku quanshu dianziban 文淵閣四庫全書電子版) [hereafter SKQS] was a massive hit amongst those studying any aspect of pre-nineteenth-century Chinese culture. The Siku quanshu was compiled between 1773 and 1782 and included 3460+ titles covering all periods up to the time of compilation. Published by Digital Heritage Publishing in Hong Kong, the database included the full text and the page images of all titles in the original Wenyuange edition. In addition, the electronic edition included related titles such as the catalogs Siku quanshu zongmu 四庫全書總目 and Siku quanshu jianming mulu 四庫全書簡明目錄. It also featured a character dictionary and a date conversion tool. The electronic edition allowed for basic search functionality across the entire corpus or in subsets defined by author, subject category, or specific title.[1]

My interlocutor noted that its use initiated significant changes in the study of Chinese texts. In the last ten years it had become clear that with the use of a tool that could search thousands of texts at the same time, readers could put together their own text collections based on topics or genres of interest to them. The net effect of this had been that it had become possible to read outside of the canon and far easier to trace how the canon had come into being and what canonization had erased from memory. 

The benefits of natural language searching are thus acknowledged, but how effective a tool is SKQS in locating relevant primary source texts? I will raise four points to suggest that there are good reasons why 1) SKQS should not be the preferred tool for sinologists and should definitely not be used to the exclusion of other sources; 2) commercial providers in the humanities often fail to offer the flexibility required for the effective use of digital resources.


The first reason why SKQS is not enough is rather well known. The current edition of the Siku quanshu resulted from a long selection procedure during which many more titles than the number included were deselected. [2] Despite appearances, then, SKQS is not a complete set of texts for the study of pre-nineteenth-century Chinese culture.

More importantly, even for the texts it includes it is often unreliable. The editors not only weeded out large numbers of titles; in the copying process they also altered earlier editions to fit Qing publishing regulations. Complaints about the poor editions used as base texts, especially for older texts such as early imprints, and about the interference of Qing editors have been voiced ever since the days of its compilation and continue to this day. Since editors did not indicate where they made changes, the extent of their interference has by and large gone unnoticed.

Other text databases and the transcription of large numbers of out-of-copyright texts provide alternatives. Free digital corpora of texts have not only added to the ocean of texts to search through but also allowed for more advanced query methods to be applied to the texts. Such methods allow us, among other things, to determine to what extent SKQS editions differ from other editions. Below is an example drawn from my own work on notebooks. When searching for terms relating to foreigners, for example, my results from SKQS were vastly different from those obtained from the Zhonghua shuju edition included in Hanji quanwen ziliaoku 漢籍全文資料庫. [3]

[Table body not preserved. Column headings: Siku freq, Zhonghua freq. The only surviving cell reads: 18 (3).]

Table. Frequency of terms associated with the Jin state and other non-Han people in the SKQS edition of Huizhu lu (Waving the Duster) and in the collated Zhonghua shuju edition (which is based on a Song edition).  

One could obtain this result by reading through one of the editions, compiling a list of terms of interest, and then searching for them in the respective databases one by one or in small groups (most databases are not set up to compare the usage of entire lists of terms). It is far easier and more informative to apply corpus linguistics methods. Such methods would not only reveal which terms tend to occur most frequently; they could also indicate their statistical relevance across the entire text or the adjectives and verbs with which such terms co-occur. Such indicators provide clues that could be relevant for how we ought to interpret the entire text and selected passages from it. For example, the fact that the qualifier "ugly" only occurs in combination with "slaves" (referring to foreigners) and appears mostly in the same passage (a call to arms by a pro-war activist) tells us something about its strongly emotive connotation. An overview of the terms most frequently co-occurring with references to non-Chinese would further highlight that in this text foreigners are portrayed mainly as perpetrators of violence and breakers of treaties.
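To give a sense of how little code such a comparison requires, here is a minimal Python sketch. It is not the procedure I used for the table above. The two "editions" are invented toy strings rather than quotations from Huizhu lu, and the function names (`term_frequencies`, `compare_editions`, `collocates`) are hypothetical. It counts a list of terms in two editions at once and lists the characters found next to a given term.

```python
from collections import Counter

def term_frequencies(text, terms):
    """Count non-overlapping occurrences of each term in the text."""
    return {term: text.count(term) for term in terms}

def compare_editions(edition_a, edition_b, terms):
    """Return (term, count_in_a, count_in_b) for terms whose counts differ."""
    freq_a = term_frequencies(edition_a, terms)
    freq_b = term_frequencies(edition_b, terms)
    return [(t, freq_a[t], freq_b[t]) for t in terms if freq_a[t] != freq_b[t]]

def collocates(text, term, window=1, ignore="，。"):
    """Count characters within `window` characters of each hit for `term`."""
    counts = Counter()
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        neighbours = text[max(0, start - window):start] + text[end:end + window]
        counts.update(ch for ch in neighbours if ch not in ignore)
        start = text.find(term, end)
    return counts

# Invented toy strings standing in for two editions of the same passage.
toy_collated = "醜虜犯邊，虜背盟。"
toy_siku = "敵人犯邊，敵背盟。"

for term, a, b in compare_editions(toy_collated, toy_siku, ["虜", "敵"]):
    print(term, a, b)
print(collocates(toy_collated, "虜").most_common())
```

On real editions one would load the full transcriptions and a much longer term list. Counting single neighbouring characters is a crude stand-in for proper collocation statistics, which would also weigh how often each character occurs in the text as a whole.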

In sum, the SKQS as it is provides its readers with poor editions of texts. One can learn much about how eighteenth-century censors wished readers of the Siku quanshu to read earlier texts, but one would miss much if one relied exclusively on searches in it to compile sets of texts for other historical research. Moreover, the SKQS provides little help to the reader in overcoming such known deficiencies.

Beyond basic and advanced search

As noted by my interlocutor the SKQS has made a huge literature available to scholars. How well does it make this literature legible to present-day readers? A few examples may suggest that the task of the individual scholar trawling through the huge literature that SKQS makes available is not always an easy one.

Example one: the Princeton University instructions to the SKQS include one example showing that when readers look for "lijia" (里甲, a form of social organization) they will find 1161 results. They can refine these results by limiting the search to all texts in the history section or to a specific title, but would that really be the most useful way for a scholar to navigate through such a long results list? Individual authors' collections are quite likely to include material on this topic, material that could usefully supplement what one might find in histories and local gazetteers. More likely a reader looking up this term would be interested to know in which places one can find lijia and perhaps with which places the term is most closely associated. Or, one might want to know when lijia were set up or altered in some way. A more useful way to navigate these results would be to allow significant terms co-occurring with the search term to serve as a further navigation aid: placenames, time references, or personal names could appear in a ranked list or tag cloud with the results. At this time, readers can only see a list of titles and the number of results in each title; they cannot even sort the results.
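The ranked list suggested here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any existing database feature. The passages are invented toy strings, the two short vocabularies stand in for a real gazetteer and a list of reign periods, and `rank_cooccurring` is a hypothetical name.

```python
from collections import Counter

def rank_cooccurring(passages, search_term, vocabulary):
    """Rank vocabulary entries by the number of hit passages they appear in.

    Only passages containing `search_term` are considered; each passage
    contributes at most one count per vocabulary entry.
    """
    counts = Counter()
    for passage in passages:
        if search_term not in passage:
            continue
        for entry in vocabulary:
            if entry in passage:
                counts[entry] += 1
    return counts.most_common()

# Invented toy passages, not drawn from SKQS.
toy_passages = [
    "洪武十四年，蘇州府編里甲。",
    "萬曆間，蘇州里甲重編。",
    "杭州府志載里甲之役。",
    "蘇州府學記。",  # no hit for the search term, so it is skipped
]
toy_places = ["蘇州", "杭州"]   # stand-in for a placename gazetteer
toy_periods = ["洪武", "萬曆"]  # stand-in for a list of reign periods

print(rank_cooccurring(toy_passages, "里甲", toy_places))
print(rank_cooccurring(toy_passages, "里甲", toy_periods))
```

Simple substring matching of this kind will miss variant names and produce false hits for short names, which is one reason why curated authority files matter for this sort of navigation aid.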

Example two: when looking for "inscribing or inscriptions on walls" (tibi 題壁), the reader would find 1465 results in the main text only (excluding footnotes and annotations). If one wanted to find all poetic texts inscribed on walls, one would have to extend this search by looking for additional related terms. This is not an impossible task, but it is at the very least an uninviting one, and one that could be rendered easier if database designers were to implement available technology to help users locate relevant passages more effectively. For example, the result list would be far more legible if personal and place names associated with the text results were made available to readers so that inscriptions written on the same building or in the same places became instantly visible; it would be relatively easy to place them on a map. Even if the author was absent, a comparison of the inscribed texts could suggest which lines and which poems were inscribed most often. Or, one could look at how particular titles appeared in slight variations over time. There are various options here, not all of which may be equally necessary, but it is readily apparent that investment in the creation of tools that can help us navigate ever-growing corpora (the SKQS has now been supplemented by many other collections of court archives, Buddhist scriptures, modern editions of prose texts and poems, etc.) can help us find texts more effectively than the SKQS can. In addition, digital methods can also help us interpret and analyze these texts and share them with others.
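The comparison of inscribed texts mentioned above could, for instance, start from something like the following Python sketch. It is only one possible approach. The "poems" are placeholder strings made up of calendrical characters rather than real inscriptions, the function names are hypothetical, and the similarity threshold of 0.7 is an arbitrary assumption that would need tuning on real material.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def split_lines(poem):
    """Split a poem into lines at common punctuation marks and whitespace."""
    return [seg for seg in re.split(r"[，。！？；\s]+", poem) if seg]

def most_inscribed_lines(poems):
    """Count in how many poems each line occurs (once per poem)."""
    counts = Counter()
    for poem in poems:
        counts.update(set(split_lines(poem)))
    return counts.most_common()

def near_variants(line, candidates, threshold=0.7):
    """Return candidate lines that resemble `line` without being identical."""
    return [
        c for c in candidates
        if c != line and SequenceMatcher(None, line, c).ratio() >= threshold
    ]

# Placeholder "poems" built from calendrical characters, not real texts.
toy_poems = [
    "甲乙丙丁戊，己庚辛壬癸。",
    "甲乙丙丁戊，子丑寅卯辰。",
    "甲乙丙丁己，子丑寅卯辰。",
]

ranked = most_inscribed_lines(toy_poems)
print(ranked)
all_lines = [line for line, _ in ranked]
print(near_variants("甲乙丙丁戊", all_lines))
```

The first function answers the question of which lines recur most often; the last flags lines that differ by a character or two, which is where variation over time would show up.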

None of this is entirely new. Chinese Studies is blessed with a good number of (printed) indexes to major historical and prose collections. These typically include place and personal name indexes and keyword indexes. Many have relied on those to find materials relevant to their research. The topical index to the SKQS, for example, is often a more effective tool than the SKQS itself if one wants to locate chapter or piece titles relevant to a given topic. When looking for writings on the civil service examinations, for example, the printed topical index proved to be a far more effective tool than the digital database as less relevant results had been weeded out by its editors. Digital methods now allow us to go back to the entire database and to navigate the results in a variety of ways more or less simultaneously. 

To conclude, the SKQS and similar text databases suffer from three major defects that tend to lead to unmanageable results lists:

1. The search functionality is (even in advanced mode) rather basic. Boolean searches are conducted across the range of an entire chapter (juan), which yields far too many results when multiple terms are searched at the same time. Regular expressions, which would allow for more flexibility in defining search criteria, are not made available to advanced readers.

2. The results list cannot be sorted and only includes basic information such as title, author of the main title, dynasty, and number of hits per title.

3. Text results cannot be easily exported for further reading and analysis.
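The first and third points can be illustrated with a short Python sketch. It makes no claim about how SKQS works internally. The juan texts are invented toy strings, the function names are hypothetical, and the ten-character gap is an arbitrary choice. It contrasts a Boolean AND across a whole juan with a regular expression that requires the two terms to occur close together, and then writes the hits to CSV for further reading and analysis.

```python
import csv
import io
import re

def juan_level_and(juan_text, term_a, term_b):
    """Boolean AND over a whole juan: true if both terms occur anywhere."""
    return term_a in juan_text and term_b in juan_text

def proximity_hits(juan_text, term_a, term_b, max_gap=10):
    """Find passages where the two terms lie within `max_gap` characters."""
    a, b = re.escape(term_a), re.escape(term_b)
    gap = f".{{0,{max_gap}}}?"  # lazy gap of at most max_gap characters
    pattern = re.compile(f"{a}{gap}{b}|{b}{gap}{a}", re.DOTALL)
    return [m.group(0) for m in pattern.finditer(juan_text)]

def export_hits(rows):
    """Serialize (title, juan, passage) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "juan", "passage"])
    writer.writerows(rows)
    return buffer.getvalue()

# Invented toy juan texts: the terms are adjacent in one, far apart in the other.
near_juan = "某人題壁於寺。"
far_juan = "題壁" + "〇" * 30 + "寺"

for label, text in [("near", near_juan), ("far", far_juan)]:
    print(label, juan_level_and(text, "題壁", "寺"), proximity_hits(text, "題壁", "寺"))

rows = [("Toy title", "1", hit) for hit in proximity_hits(near_juan, "題壁", "寺")]
print(export_hits(rows))
```

Both toy texts satisfy the juan-wide Boolean search, but only the first yields a proximity hit, which is the kind of narrowing that keeps a results list manageable.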

Whence innovation?

Why is it that, 15 years or so after it was first implemented, the SKQS still looks more or less the same (with some additions and minor changes to the Intranet versions now available at most subscribing institutions) and little has been done to address the known weaknesses described above? The high cost of the initial project may well have been a factor, constraining further development until costs had been recovered (and profits made). In addition, the fact that many humanities scholars continue to be mere consumers of mass digital products contributes to the poor design and basic functionality of key databases. What is needed is a critical engagement with all manner of traditional philological and digital tools, a willingness to help develop better tools for teaching and research, and better communication between users and engineers/designers.

The development of databases within the scholarly community, such as the China Biographical Database and China Historical GIS as well as other authority databases, and the design of text analytical methods offer some pointers towards solutions for some of the problems indicated above. For example, the integration of the China Biographical Database in MARKUS, an online markup tool under development, allows for far more flexible searching, and can link search terms to place and personal names, temporal references, official titles, etc. occurring in their vicinity. Text mining methods may similarly lead to the identification of better search results and aid in their interpretation. Such improvements require scholarly input and critical engagement.

The question that became the title for the post (Isn't the SKQS enough?) was a rhetorical one suggesting that, yes, this massive database should satisfy the needs of all interested in interpreting classical Chinese predating the 1770s. I hope to have demonstrated that despite the key role it has occupied in the field, it really isn't enough. Humanities scholars would be well served by replacing an uncritical adoption of databases with some poor editions and limited search functionality, and a fear of quantitative approaches to text analysis, with a critical and constructive engagement in the creation and improvement of philological tools for the future.



