Towards A Universal Digital Library: A Few Milestones

by Marie Lebert on March 25, 2008

Many of us dream of a universal digital library freely available on the web, i.e. available anywhere and at any time. Thanks to Project Gutenberg, the Internet Archive and others, we are getting there, at least for books in the public domain. The process began a while ago with a few pioneers, and it is now running at full speed. We still need to see copyright issues worked out in order to provide free access to as many works as possible. We still need large-scale knowledge-building projects to get reliable reference, scholarly and educational content. We still need better-quality OCR technology and, in the future, a way to go back to the original image files to provide a higher-quality book. And we still need more effort: there are currently 25 million books belonging to the public domain and, as of mid-2007, just over 2 million freely available on the internet.

1968: The ASCII Code
1971: WorldCat, by OCLC
1971: Project Gutenberg
1989: The World Wide Web
1991: Unicode
1993: The Online Books Page
1993: The PDF Format
1994: The W3C Consortium
1996: The Internet Archive
2000: The Public Library of Science
2000: Distributed Proofreaders
2001: Wikipedia
2002: Bookshare.org
2002: The MIT OpenCourseWare
2004: Project Gutenberg Europe
2004: Google Print
2005: The Open Content Alliance
2006: Google Book Search
2006: Microsoft Live Search Books
2007: Citizendium
2007: The Encyclopedia of Life
2008: A Perspective


1968: The ASCII Code

Used since the beginnings of computing, ASCII (American Standard Code for Information Interchange) is a 7-bit coded character set for information interchange in English. It was published in 1968 by ANSI (American National Standards Institute), with updates in 1977 and 1986. The 7-bit plain ASCII, also called Plain Vanilla ASCII, is a set of 128 characters with 95 printable unaccented characters (A-Z, a-z, numbers, punctuation and basic symbols), i.e. the ones that are available on the English/American keyboard. Plain Vanilla ASCII can be read, written, copied and printed by any simple text editor or word processor. It is the only format compatible with 99% of all hardware and software. It can be used as it is or to create versions in many other formats. Extensions of ASCII (also called ISO-8859 or ISO-Latin) are sets of 256 characters that include the accented characters used in languages such as French, Spanish or German, for example ISO 8859-1 (Latin-1) for French. Another widely used character set is Unicode, a universal character encoding (originally double-byte) launched in 1991 to support any language and any platform.
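As a small illustration of the character counts above, here is a minimal Python sketch. The sample word and the variable names are illustrative assumptions, not part of the original article. It counts the printable 7-bit ASCII characters and shows why an accented word needs an extension such as ISO 8859-1.

```python
# Plain ASCII uses 7 bits, so it has 2**7 = 128 code values (0 to 127).
ascii_size = 2 ** 7

# Codes 32 (space) through 126 (~) are the printable characters.
printable = [chr(code) for code in range(32, 127)]

# An accented word (illustrative sample, "cafe" with e-acute)
# cannot be stored as plain ASCII...
word = "caf\u00e9"
try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# ...but it fits in ISO 8859-1 (Latin-1), at one byte per character.
latin1_bytes = word.encode("iso-8859-1")

print(ascii_size, len(printable), fits_in_ascii, len(latin1_bytes))
```

The 33 remaining ASCII values (codes 0 to 31, plus 127) are control characters, which is why only 95 of the 128 are printable.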

1971: WorldCat, by OCLC

WorldCat was created in 1971 by the non-profit OCLC (Online Computer Library Center) as the union catalog of the university libraries in the State of Ohio. Over the years, OCLC became a national and worldwide library cooperative, and WorldCat the largest library catalog in the world. In 2005, WorldCat had 61 million bibliographic records in 400 languages from 9,000 member libraries (paid subscription) in 112 countries. In 2006, 73 million bibliographic records were linking to 1 billion documents available in these libraries. In August 2006, WorldCat began to migrate to the web through the beta version of the new website WorldCat.org. Member libraries now provide free access to their catalogs and electronic resources: books, audiobooks, abstracts and full-text articles, photos, music CDs and videos. Another pioneer site was RedLightGreen, launched in spring 2004 (with a beta version in fall 2003) as the web version of the RLG Union Catalog, another major union catalog created in 1980 by the Research Libraries Group (RLG). RedLightGreen ended its service in November 2006, after a successful three-year run, and RLG joined OCLC.

1971: Project Gutenberg

In July 1971, Michael Hart created Project Gutenberg with the goal of making literary works belonging to the public domain available electronically and for free. A pioneer site in a number of ways, Project Gutenberg was the first information provider on the internet and is the oldest digital library. When the internet became popular, in the mid-1990s, the project got a boost and an international dimension. The number of electronic books rose from 1,000 (in August 1997) to 5,000 (in April 2002), 10,000 (in October 2003), 15,000 (in January 2005) and 20,000 (in December 2006), with a current production rate of around 370 new books each month. With 50 languages and 38 mirror sites around the world, books are being downloaded by the tens of thousands every day. Project Gutenberg promotes digitization in “text format”, meaning that a book can be copied, indexed, searched, analyzed and compared with other books. Unlike other formats, the files are accessible for low-bandwidth use. The main source of new Project Gutenberg eBooks is Distributed Proofreaders, conceived in October 2000 by Charles Franks to help in the digitizing of books from the public domain.

1989: The World Wide Web

The World Wide Web (which became the Web, or web) was invented by Tim Berners-Lee in 1989. “The dream behind the Web is of a common information space in which we communicate by sharing information. Its universality is essential: the fact that a hypertext link can point to anything, be it personal, local or global, be it draft or highly polished. There was a second part of the dream, too, dependent on the Web being so generally used that it became a realistic mirror (or in fact the primary embodiment) of the ways in which we work and play and socialize. That was that once the state of our interactions was on line, we could then use computers to help us analyse it, make sense of what we are doing, where we individually fit in, and how we can better work together.” (Tim Berners-Lee, The World Wide Web: A very short personal history, 7 May 1998.) According to the network tracking firm Netcraft, there were 100 million websites on November 1st, 2006. Previous milestones in the survey were reached in April 1997 (1 million sites), February 2000 (10 million), September 2000 (20 million), July 2001 (30 million), April 2003 (40 million), May 2004 (50 million), March 2005 (60 million), August 2005 (70 million), April 2006 (80 million) and August 2006 (90 million).

1991: Unicode

First published in January 1991, Unicode is the “universal” character encoding maintained by the Unicode Consortium. “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” (excerpt from the website) This platform-independent encoding, originally double-byte, provides a basis for the processing, storage and interchange of text data in any language, and in any modern software and information technology protocol. Unicode is a component of the W3C (World Wide Web Consortium) specifications.
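The idea of "a unique number for every character" can be seen in a short Python sketch. The sample characters are illustrative choices, not taken from the article. Each one has a single code point whatever its script, and each of these particular characters takes two bytes in UTF-16, while UTF-8 uses a variable number of bytes.

```python
# Illustrative samples: Latin A, e-acute, Cyrillic Zhe, and a CJK ideograph.
samples = ["A", "\u00e9", "\u0416", "\u4e2d"]

for ch in samples:
    code_point = ord(ch)                     # the character's unique number
    utf8_len = len(ch.encode("utf-8"))       # variable-length encoding
    utf16_len = len(ch.encode("utf-16-be"))  # two bytes for these characters
    print(f"U+{code_point:04X} utf-8: {utf8_len} bytes, utf-16: {utf16_len} bytes")
```

Characters outside the first 65,536 code points need four bytes in UTF-16, so the two-byte figure holds for the samples here but not for every character.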

1993: The Online Books Page

Founded in 1993 by John Mark Ockerbloom while he was a student at Carnegie Mellon University, The Online Books Page is “a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all.” (excerpt from the website) John Ockerbloom first maintained this page on the website of the School of Computer Science of Carnegie Mellon University. In 1999, he moved it to its present location at the University of Pennsylvania Library, where he is a digital library planner and researcher. The Online Books Page listed 12,000 books in 1999, 20,000 books in 2003 (including 4,000 books published by women), and 25,000 books in 2006. The books “have been authored, placed online, and hosted by a wide variety of individuals and groups throughout the world”, with 6,300 books from Project Gutenberg. The FAQ also lists copyright information about most countries in the world with links to further reading.

1993: The PDF Format

PDF (Portable Document Format) was conceived by Adobe in 1992, launched in June 1993 with Adobe Acrobat software, and perfected over 15 years as the global standard for distribution and viewing of information. It “lets you capture and view robust information from any application, on any computer system and share it with anyone around the world. Individuals, businesses, and government agencies everywhere trust and rely on Adobe PDF to communicate their ideas and vision.” (excerpt from the website) Adobe Acrobat gives the tools to create and view PDF files and is available in many languages and for many platforms (Macintosh, Windows, Unix, etc.). Today, over 500 million copies of PDF-based Adobe Reader (formerly Acrobat Reader, until May 2003) have been downloaded worldwide. Approximately 10% of the documents on the internet are available in PDF.

1994: The W3C Consortium

Founded in October 1994, the W3C (World Wide Web Consortium) develops interoperable technologies (specifications, guidelines, software and tools) for the web, as a forum for information, commerce, communication and collective understanding. The W3C develops common protocols to lead the evolution of the web, for example the specifications of HTML (HyperText Markup Language) and XML (eXtensible Markup Language). HTML is the lingua franca for publishing hypertext on the web. XML was originally designed as a tool for large-scale electronic publishing. It now plays an increasingly important role in the exchange of a wide variety of data on the web and elsewhere.

1996: The Internet Archive

Founded in April 1996 by Brewster Kahle, the Internet Archive is a non-profit organization that builds an “internet library” to offer permanent access to historical collections in digital format for researchers, historians and scholars. An archive of the web is stored every six months. In October 2001, with 30 billion web pages stored, the Internet Archive launched the Wayback Machine, so that users could surf the archive of the web by date. In 2004, there were 300 terabytes of data, with a growth of 12 terabytes per month. In 2006, there were 65 billion pages from 50 million websites. In late 1999, the organization also started to include more collections of archived web pages on specific topics. It also became an online digital library of text, audio, software, image and video content. In October 2005, the Internet Archive launched the Open Content Alliance (OCA) with other contributors as a collective effort to build a permanent archive of multilingual digitized text (Text Archive) and multimedia content.

2000: The Public Library of Science

The Public Library of Science (PLoS) was founded in October 2000 by biomedical scientists Harold Varmus, Patrick Brown and Michael Eisen, from Stanford University, Palo Alto, and University of California, Berkeley. Headquartered in San Francisco, PLoS is a nonprofit organization whose mission is to make the world’s scientific and medical literature a public resource. In early 2003, PLoS created a nonprofit scientific and medical publishing venture to provide scientists and physicians with high-quality, high-profile journals in which to publish their most important work: PLoS Biology (launched in 2003), PLoS Medicine (2004), PLoS Genetics (2005), PLoS Computational Biology (2005), PLoS Pathogens (2005), PLoS Clinical Trials (2006), PLoS Neglected Tropical Diseases (2007). All PLoS articles are freely available online, and deposited in the free public archive PubMed Central. They can be freely redistributed and reused, including for translations, as long as the author(s) and source are cited. PLoS also hopes to encourage other publishers to adopt the open access model, or to convert their existing journals to an open access model.

2000: Distributed Proofreaders

Conceived in October 2000 by Charles Franks, Distributed Proofreaders was launched online in March 2001 to help in the digitizing of public domain books. The method is to break up the tedious work of checking ebooks for errors into small, manageable chunks. Originally meant to assist Project Gutenberg in the handling of shared proofreading, Distributed Proofreaders has become the main source of Project Gutenberg eBooks. In 2002, Distributed Proofreaders became an official Project Gutenberg site. The number of eBooks processed through Distributed Proofreaders has grown fast. In 2003, about 250-300 people were working each day all over the world producing a daily total of 2,500-3,000 pages, the equivalent of two pages a minute. In 2004, the average was 300-400 proofreaders participating each day and finishing 4,000-7,000 pages per day, the equivalent of four pages a minute. Distributed Proofreaders processed a total of 3,000 books in February 2004, 5,000 books in October 2004, 7,000 books in May 2005, 8,000 books in February 2006 and 10,000 books in March 2007, with the help of 36,000 volunteers.
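The pages-per-minute equivalents above can be checked with a few lines of Python. This is only a rough sketch: it assumes the daily totals are spread evenly over a full 24-hour day, which the article does not state, and the function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes, assuming round-the-clock activity

def pages_per_minute(pages_per_day):
    """Convert a daily page total into an average per-minute rate."""
    return pages_per_day / MINUTES_PER_DAY

# Daily totals quoted in the article for 2003 and 2004.
rates_2003 = [round(pages_per_minute(p), 1) for p in (2500, 3000)]
rates_2004 = [round(pages_per_minute(p), 1) for p in (4000, 7000)]

print(rates_2003)  # [1.7, 2.1]
print(rates_2004)  # [2.8, 4.9]
```

Under that assumption, the 2003 range sits around two pages a minute, and the 2004 range runs from just under three to nearly five, with four pages a minute near its midpoint.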

2001: Wikipedia

Launched in January 2001 by Jimmy Wales and Larry Sanger (Sanger resigned later on), Wikipedia has quickly grown into the largest reference website on the internet. Its multilingual content is free and written collaboratively by people worldwide. Its website is a wiki, which means that anyone can edit, correct and improve information throughout the encyclopedia. The articles stay the property of their authors, and can be freely used according to the GFDL (GNU Free Documentation License). Wikipedia is hosted by the Wikimedia Foundation, which runs a number of other projects, for example Wiktionary (launched in December 2002), followed by Wikibooks, Wikiversity, Wikinews and Wikiquote. In December 2004, Wikipedia had 1.3 million articles from 13,000 contributors in 100 languages. Two years later, in December 2006, it had 6 million articles in 250 languages.

2002: Bookshare.org

Bookshare.org was launched in February 2002 by Benetech, a company which develops “technology that truly helps to change the world.” Bookshare.org is an online library for people with vision and reading disabilities living in the US (it requires written proof of disability and a paid subscription). It went online with 7,620 books in two formats, BRF (Braille Format) and DAISY (Digital Accessible Information System). The books can be downloaded to Braille printers, portable Braille devices (using electronic Braille systems) and synthetic speech software. One year later, in February 2003, Bookshare.org had 11,500 books and 200 volunteers. The catalog had 17,000 books in February 2004, 20,000 books in January 2005, 23,000 books in July 2005 and 30,000 books in November 2006. In March 2005, Bookshare.org began offering collections in Spanish (500 in March 2005 and 1,000 in December 2006). In March 2006, Bookshare.org began to offer local and national newspapers and magazines accessible online daily. In May 2007, Bookshare.org began its international expansion.

2002: The MIT OpenCourseWare

The MIT OpenCourseWare (MIT OCW) is a large-scale, web-based electronic publishing initiative launched by MIT (Massachusetts Institute of Technology) to promote open dissemination of knowledge and information. A pilot version of MIT OCW was available online in September 2002, with 32 MIT course materials. In September 2003, the site was officially launched with several hundred course materials. In March 2004, 500 course materials were available in 33 different topics, and regularly updated. In May 2006, 1,400 course materials were offered by 34 departments belonging to the five schools of MIT. A steady state should be reached in 2008, with the publishing of 1,800 course materials, virtually all of MIT’s undergraduate and graduate courses. In November 2005, MIT launched the OpenCourseWare Consortium (OCW Consortium) as a collaboration of educational institutions creating a broad body of open educational content using a shared model. One year later, the OCW Consortium included the courses of 100 universities worldwide.

2004: Project Gutenberg Europe

In January 2004, Project Gutenberg spread across the Atlantic with the launching of Project Gutenberg Europe (PG Europe) and Distributed Proofreaders Europe (DP Europe) by Project Rastko, a non-governmental cultural and educational project located in Belgrade, Serbia. DP Europe uses the software of the original Distributed Proofreaders. DP Europe is a multilingual website, with its main pages translated into several European languages by volunteer translators. In April 2004, DP Europe was available in 12 languages. The long-term goal is 60 languages and 60 linguistic teams representing all European languages. DP Europe supports Unicode to be able to proofread eBooks in numerous languages. Unicode is an encoding system that gives a unique number for every character in any language. DP Europe finished processing its 100th book in May 2005 and its 400th book in December 2006. DP Europe operates under “life +50” copyright laws. When it gets up to speed, DP Europe will provide eBooks for several national and/or linguistic digital libraries.

2004: Google Print

In October 2004, Google launched the first part of Google Print as a project aimed at publishers, for users to be able to see snippets of their books and order them online. The beta version of Google Print (http://print.google.com) went online in May 2005. In December 2004, Google launched the second part of Google Print as a project intended for libraries, to build up a digital library of 15 million books by scanning and digitizing the collections of major libraries, beginning with the University of Michigan (7 million books), Harvard, Stanford and Oxford universities, and the New York Public Library. The planned cost was an average of $10 per book, and $150 to $200 million over ten years. In August 2005, Google Print was stopped until further notice because of lawsuits filed by publishers for copyright infringement. The program resumed in October 2006 under the new name of Google Book Search.

2005: The Open Content Alliance

The Open Content Alliance (OCA) was conceived by the Internet Archive in early 2005 to offer broad, public access to the world's culture. It was launched in October 2005 as a group of cultural, technology, nonprofit and governmental organizations willing to build a permanent archive of multilingual digitized text and multimedia content. The project aims at digitizing public domain books around the world and making them searchable through any web search engine and downloadable for free. Unlike the Google Print project, the OCA scans and digitizes only public domain books, except when the copyright holder has expressly given permission. The first contributors to the OCA were the University of California, the University of Toronto, the European Archive, the National Archives in the United Kingdom, O’Reilly Media and Prelinger Archives. The digitized collections are freely available in the Text Archive of the Internet Archive. In December 2006, they reached a milestone of 100,000 digitized books publicly available, with 12,000 new books added per month.

2006: Google Book Search

Google Book Search was launched in August 2006 to replace the controversial Google Print, stopped in August 2005 because of major copyright concerns. Google Book Search offers excerpts of books digitized by Google in the participating libraries (Harvard, Stanford, Michigan, Oxford, California, Virginia, Wisconsin-Madison, Complutense of Madrid and New York Public Library). Google scans 3,000 books a day, including copyrighted books. The inclusion of copyrighted books is widely criticized by authors and publishers worldwide. In the US, lawsuits were filed by the Authors Guild and the Association of American Publishers (AAP) for alleged copyright infringement. Their argument is that the full scanning and digitizing of copyrighted books infringes copyright laws, even if only snippets are made freely available on the search engine. To counteract copyright concerns and the problems of a closed platform, the Internet Archive launched the Open Content Alliance (OCA) with the goal of digitizing only public domain books and making them searchable and downloadable through any search engine.

2006: Microsoft Live Search Books

In December 2006, Microsoft released the beta version of Live Search Books. The book search engine performs keyword searches for non-copyrighted books digitized by Microsoft from the collections of the British Library, University of California, and University of Toronto, followed in January 2007 by the New York Public Library and Cornell University. Books offer full text views and can be downloaded as PDF files. In the future Microsoft intends to add copyrighted works with the permission of their publishers. Microsoft has also participated in the Open Content Alliance (OCA), launched by the Internet Archive in October 2005.

2007: Citizendium

Citizendium was launched in October 2006 as a pilot project to build a new encyclopedia, at the initiative of Larry Sanger, who co-founded Wikipedia (with Jimmy Wales) in January 2001, but resigned later on over policy and content quality issues. Citizendium – which stands for a “citizen’s compendium of everything” – is a wiki project open to public collaboration, but combining “public participation with gentle expert guidance”. The project is expert-led, not expert-only. Contributors use their own names, not anonymous pseudonyms, and they are guided by expert editors. “Editors will be able to make content decisions in their areas of specialization, but otherwise working shoulder-to-shoulder with ordinary authors.” (Larry Sanger, Toward a New Compendium of Knowledge, September 2006) Constables make sure the rules are respected. Citizendium was publicly launched on March 25th, 2007, with 1,100 articles, 820 authors and 180 editors.

2007: The Encyclopedia of Life

Launched in May 2007, the Encyclopedia of Life is a global scientific effort to document all known species of animals and plants (1.8 million), and expedite the cataloguing of the millions of species yet to be discovered (8 to 10 million). This collaborative effort is led by several major institutions: Field Museum of Natural History, Harvard University, Marine Biological Laboratory, Missouri Botanical Garden, Smithsonian Institution, Biodiversity Heritage Library (BHL). The initial funding comes from the MacArthur Foundation (10 million dollars) and the Sloan Foundation (2.5 million dollars). A number of pages will be available by mid-2008. The encyclopedia will be operational in 3-5 years and completed (with all known species) in 10 years. Built on the scientific integrity of thousands of experts around the globe, the Encyclopedia will be a moderated wiki-style environment, freely available to all users everywhere.

2008: A Perspective

While refraining from digitizing copyrighted books at this stage, many institutions are interested in seeing copyright issues worked out, to provide free access to as many works as possible. Another main issue is the proofreading of digitized books, which ensures better accuracy of the text, without any loss from the print version. Good OCR software is said to ensure 99% accuracy, which still leaves about ten mistakes per page. While proofreading is seen as essential by Project Gutenberg, this step is mostly skipped by the Internet Archive, Google, Bookshare.org (except for textbooks) and many others.
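To see how 99% accuracy can still mean about ten mistakes per page, here is a minimal Python sketch. It assumes accuracy is measured per character and that a page holds about 1,000 characters. Neither the page size nor the 300-page book length is given in the article, and the function name is illustrative.

```python
def expected_errors(chars_per_page, accuracy_percent):
    """Expected number of misread characters on one page."""
    return chars_per_page * (100 - accuracy_percent) / 100

# Hypothetical page of 1,000 characters at the 99% accuracy quoted above.
per_page = expected_errors(1000, 99)

# The same rate over a hypothetical 300-page book.
per_book = per_page * 300

print(per_page, per_book)
```

A denser page would carry proportionally more errors at the same accuracy, which is part of why a proofreading pass matters for text meant to match the print version.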

— Marie Lebert, March 2008

With many thanks to Mike Cook, who kindly edited this article.

John Mark Ockerbloom March 26, 2008 at 5:21 pm

Just an update on the OLBP figures: As of today, there are over 31,000 books listed, with 7400 being Project Gutenberg listings. For expedited listings of your favorite titles from Gutenberg and elsewhere, I recommend the submission form at

http://onlinebooks.library.upenn.edu/webbin/suggest