The following article was posted by Jon Noring on the TeleRead blog in February 2007. This is an excellent discussion of why Digital Text Master files should be created, along with ideas on how to implement the concept. — Ed
‘Digital Text Masters’ (Digitizing the classic public domain books)
by Jon Noring
The recent TeleBlog articles about the Project Gutenberg (PG) text Tarzan of the Apes (see 1, 2) suggest that not all is well in the existing corpus of public domain digital texts.
My personal experience over the last twelve years digitizing several public domain books has helped me see a number of problems, which I’ve mentioned in various forums, including the PG forums and The eBook Community. For the sake of not turning this already long article into a whole book, I won’t cover here the complete list of problems I found, nor those found by others.
To summarize what I believe should be done to resolve most of the known problems: when it comes to creating a digital text of any work in the public domain, we should first produce and make available what we call a “digital text master,” which meets a high standard of textual accuracy relative to an acceptable and known print source. From the “master,” various display formats and derivative types of texts (e.g., modernized, corrected, composite, bowdlerized, parodied, etc.) can then be produced to meet a variety of user needs.
(Btw, what better example to illustrate the concept of a “digital text master” than to show the self-portrait of the great 17th century Dutch master painter, Rembrandt van Rijn, whose attention to detail and exactness is renowned.)
We must have some fixed frame of reference by which we produce digital texts of public domain works; otherwise we invite problems of all kinds (a couple of these problems, though by no means all of them, are illustrated by the Tarzan of the Apes PG etext). This is especially true for projects which intend to produce unqualified texts of public domain works (thereby implying faithfulness and accuracy): such projects have an obligation to offer a digital text faithful and reasonably accurate to a known source book, so the user knows what they are getting. This is somewhat like food labeling, which lets one know what ingredients are in one’s food.
Fortunately, Distributed Proofreaders (DP) is dedicated to this very goal, and their finished digital texts are being donated to the Project Gutenberg collection at a fairly fast clip. However, DP came on the scene relatively late in the game, so the most popular, classic works were already in the PG collection when DP arrived. As a result, DP has mostly focused on the lesser-known works, many of which are good, but will never be widely read compared to the great classics.
Unfortunately, however, the great classic works in the PG collection, such as Tarzan of the Apes, are of unknown faithfulness and accuracy to an unknown (not recorded) source work (is that enough unknowns?). Even if they were digitally transcribed with “rigor” (to be clear, I believe a number of the earlier PG texts are of high quality), how does one know? In effect, PG does not support the concept of a “digital text master,” preferring to be a “free-for-all archive” of whatever someone wants to submit. Until recently, when the policy was changed for new texts, PG wouldn’t even tell you the provenance of what had been submitted; that information was intentionally stripped out.
The ultimate losers here are the users of the digitized public domain texts. They are, by and large, a trusting group, and simply assume that those who created the digital texts did their homework and faithfully transcribed the best sources. One reason for this TeleBlog article is to point out to users that if it is of concern to them, they should be more demanding and wary of the digital texts they find and use on the Internet. Be good consumers!
Especially beware of boilerplate statements saying that such-and-such a text may not be faithful to any particular source book; if not, what is it “faithful” to? Should you spend a significant part of your valuable free time reading something of unknown provenance and faithfulness? This is especially true in education, where it is important that the digital texts students use be of known provenance, and that the process of text digitization was guided by experts (and in effect “signed” by them) to assure faithfulness and accuracy, in short, to be trustworthy.
The “Digital Text Masters” Project
For the above reasons, a few of us are now studying a non-profit project to digitally remaster the most well-known public domain works of the English language (including translations). We will focus on roughly 500 to 1,000 works over the next decade or so. Unlike DP, which is understandably focused on numbers because there is a lot to digitally transcribe (and they will do a good job of getting those books done), our focus will be on a very small number of the great works, and we will give them the full, royal treatment with little compromise. We will “do them right,” and when in doubt we will come down on the side of rigor, even if it appears to some to be overkill.
Here’s what we tentatively have in mind:
- Of course, we have to begin generating the ranked list of works we’d like to digitally master over time. This list will not be etched in concrete, but will continue to morph. We will not focus only on fiction (although fiction may dominate the early works due to its generally simpler text structure and layout); we will also consider some of the great works of non-fiction which have had significant influence on human progress.
- For each Work, we will consult with scholars and lay enthusiasts to select the one (or more) source books that should be digitally mastered. The Internet now makes it very easy to bring together a large number of experts and enthusiasts and draw upon their collective wisdom. (Importantly, note that for some Works there may be more than one source edition selected to be digitally mastered. We do NOT plan to choose one particular edition, call it “canonical,” and then eschew all others. Selection of source books to digitally master is on a case-by-case basis. If someone wants to put in the work and resources to focus on a particular source book, their labor of love, we won’t stop them so long as what they do follows all the requirements and the resources are there to get the job done properly.)
- We will find, or make ourselves, archival-quality scans of the selected source books. (Purchasing source books will be considered.) Archival quality means the master scans will likely be done at about 600 dpi, full color, with minimum distortion, and saved in a lossless format. Calibration chart scans should accompany each scan set, allowing for quality checking and normalization (a small quality-check sketch follows this list). Care will be taken to assure complete, quality scans of all pages, including the cover, back and spine. In essence, we don’t want someone to hurry to scan the source book, but rather to take their time and do it right. Derivative page scan images (such as lower-resolution versions) will be made available online alongside, and linked from, the digital text masters.
- We will use a variety of processes to generate a very highly accurate text transcription of the source book. Such processes include OCR, multiple key entry, and a mix of the two in various combinations, along with running machine-checking algorithms to look for anomalies (a sketch of one such cross-check also follows the list). The goal is a very low error rate. DP may be used, if DP agrees to participate (they are overwhelmed as it is with their current focus on the more obscure works), but we need to investigate multiple ways to do the actual textual transcription and to get a good measure of the likely error rate. Preservation of the actual characters used in the source books (including accented and special characters) will be done using UTF-8 encoding. The process used to digitally master a given text will be meticulously recorded in a metadata supplement, including special notes particular to the source book. (Unusual and unique exceptions requiring special decisions and handling are likely to be encountered in most source books; this is where the consulting expertise of DP will be of great help.)
- An XML version of the digital master will be created using a high-quality, structurally oriented vocabulary, such as a selected subset of TEI. Original page numbers, exact line breaks, unusual errors which have to be corrected rather than flagged, and other information will be recorded right in the markup (an illustrative markup sketch likewise follows the list).
- Library-quality metadata/cataloging will be produced for each digital master.
- Several derivative user formats will be generated and distributed for each original digital master.
- A database archive of all the digital text masters, associated page scan images, and all derivatives will be put together to allow higher level searching, annotation, and other kinds of interactivity.
- A robust system will be set up to allow continued error reporting and correction of the existing digital text masters in the archive. Even though we plan for very low error rates, we know some errors will slip through.
- The project will bring together scholars and enthusiasts to build a library of annotations for each digital text master (especially useful for educational purposes), as well as encourage the addition of derivative editions (fully identified as such) for each Work.
- We also would like to build real communities around the various works. For example, for each of the most popular works we may build an island in Second Life (or whatever will supplant Second Life in the future; Google is rumored to be working on a “Second Life killer”). We want to make the books come alive, and not just be staid XML documents sitting in a dusty repository following the old-fashioned library model.
- We will heavily promote the Digital Text Masters archive, especially to education and libraries, where the collection will find ready acceptance because of its quality, trustworthiness, and metadata/cataloging. It will also be easier to produce and sell authoritative paper editions.
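To make the “quality checking” of scan sets a bit more concrete, here is a minimal sketch in Python of an automated sanity check over a set of master page images. The 600 dpi, full-color, lossless targets come from the list above; the use of the Pillow imaging library, the accepted file formats, and the directory layout are assumptions made purely for illustration.

```python
# Minimal sketch of an automated scan-set check (assumptions: Pillow is
# available, masters are stored as TIFF/PNG, one directory per source book).
from pathlib import Path

from PIL import Image  # third-party: pip install Pillow

TARGET_DPI = 600                      # stated archival target
LOSSLESS_FORMATS = {"TIFF", "PNG"}    # assumed acceptable lossless formats

def check_scan(path: Path) -> list[str]:
    """Return human-readable problems found with a single page scan."""
    problems = []
    with Image.open(path) as im:
        if im.format not in LOSSLESS_FORMATS:
            problems.append(f"{path.name}: {im.format} is not a lossless format")
        x_dpi, y_dpi = im.info.get("dpi", (0, 0))
        if min(x_dpi, y_dpi) < TARGET_DPI:
            problems.append(f"{path.name}: {x_dpi}x{y_dpi} dpi is below {TARGET_DPI}")
        if im.mode not in ("RGB", "RGBA"):
            problems.append(f"{path.name}: mode {im.mode} is not full color")
    return problems

if __name__ == "__main__":
    # Hypothetical layout: scans/<source-book-id>/*.tif
    for page in sorted(Path("scans/example_source_book").glob("*.tif")):
        for problem in check_scan(page):
            print(problem)
```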
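Similarly, here is a minimal sketch of the kind of machine cross-check mentioned in the transcription item: reconciling two independently produced transcriptions of the same page (say, an OCR pass and a keyed pass) and flagging every disagreement for a human proofreader. The line-by-line granularity, the normalization choices, and the file names are assumptions, not a fixed part of the project’s workflow.

```python
# Minimal sketch: flag disagreements between two independent transcription
# passes of the same page so a proofreader only examines the differences.
import difflib
import unicodedata

def normalize(line: str) -> str:
    """NFC-normalize and collapse whitespace so trivial differences
    do not hide real transcription disagreements."""
    return " ".join(unicodedata.normalize("NFC", line).split())

def flag_disagreements(pass_a, pass_b):
    """Yield (line number, version A, version B) wherever the passes differ."""
    a = [normalize(line) for line in pass_a]
    b = [normalize(line) for line in pass_b]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for offset in range(max(i2 - i1, j2 - j1)):
            left = a[i1 + offset] if i1 + offset < i2 else "<missing line>"
            right = b[j1 + offset] if j1 + offset < j2 else "<missing line>"
            yield (i1 + offset + 1, left, right)

if __name__ == "__main__":
    # Hypothetical inputs: one OCR pass and one keyed pass of the same page.
    with open("page_042_ocr.txt", encoding="utf-8") as f:
        ocr_pass = f.read().splitlines()
    with open("page_042_keyed.txt", encoding="utf-8") as f:
        keyed_pass = f.read().splitlines()
    for lineno, left, right in flag_disagreements(ocr_pass, keyed_pass):
        print(f"line {lineno}: OCR={left!r} KEYED={right!r}")
```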
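And to illustrate the markup item: below is one way, purely illustrative and not a decided format, that a short passage might be encoded so the original page number, exact line breaks, and a corrected printer’s error are preserved right in the markup. The element names follow common TEI conventions (pb, lb, choice/sic/corr), but the actual TEI subset the project would adopt has not been chosen; the page number and the misprint are invented for the example, and the sample text is the opening sentence of Tarzan of the Apes.

```python
# Illustrative only: a TEI-flavored fragment preserving a page break, line
# breaks, and an invented printer's error with its correction, plus the
# minimum automated check: that the markup is well-formed XML.
import xml.etree.ElementTree as ET

SAMPLE = """\
<div type="chapter" n="1">
  <pb n="17"/>
  <p>
    <lb/>I had this story from one who had no business
    <lb/>to tell it to me, or to any
    <choice><sic>otherr</sic><corr>other</corr></choice>.
  </p>
</div>
"""

root = ET.fromstring(SAMPLE)          # raises if not well-formed
print("page breaks:", [pb.get("n") for pb in root.iter("pb")])
print("print lines:", len(list(root.iter("lb"))))
print("recorded corrections:", len(list(root.iter("corr"))))
```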
I could go on and list a few more, and expand on each one in more detail, but that should give a representative overview of the general vision.
We are looking at a few funding/revenue models (one of which is quite innovative) to help launch and maintain the project. The highest costs may be for double or triple key entry, should we have that done commercially for some if not all source books. The remaining major cost may be the design and maintenance of the database, as well as the development of other tools, some of which might be useful to other text digitization projects such as DP, and of course to PG as well.
The project plans to start small and controlled, especially in the early phase, where R&D will be conducted to shake out remaining unknowns, and to work its way up from there. Proper governance and management will be put into place as early as possible. Ties to academia and education, the library community, and various organizations involved with digitizing or adding to the public domain (such as the Internet Archive, the Open Content Alliance, Wikipedia, etc.) will be actively pursued, but the general vision must not be compromised by those ties. I am in informal talks now with several organizations.
We Need You!
Of course, the most important thing we need is you. If you agree with the general goals and approach of the “Digital Text Masters” project, and are interested in being involved in some capacity, then step forward. We are especially looking for people who enjoy the great classics, are detail-oriented, and believe in doing things right the first time even if it takes more effort. If interested, contact me in private, jon@noring.name, and let’s talk about your thoughts and what interests you. A teleconference among the interested people (who will be known as the founders) is being planned once we get a minimum critical mass brought together.
I look forward to hearing from you. And it would not surprise me if we see a number of comments below, both critical and supportive of the idea (if you support it, I hope you will comment!). I’m already anticipating some of the arguments about points which were not covered in this article for brevity’s sake.
(p.s., the use of the Digital Text Masters collection as a “test suite” to improve the quality of processes to auto-generate text from page scan images is discussed in my comment to the original Tarzan of the Apes TeleBlog article.)
(p.p.s., the “We Need You!” graphic is by Ben Bois. Although associated with OpenOffice, it is nevertheless a cool graphic!)
Several interesting comments were posted about this article at TeleRead. If you wish to read those or post your own comment, you can do so by clicking the link: ‘Digital Text Masters’ (Digitizing the classic public domain books).