December 28, 2006

Digitizing Yearbooks

No matter where I turn on the Internet, I can't seem to get away from ads offering to help me find high school chums and college buddies. But this is just someone monetizing the ways we already use the Internet to locate folks who, while they may not have interests in common with us, at least shared a common experience. These reunion services, in fact, differ little from MySpace and other social networking websites that try to bring like-minded or like-experienced individuals together.

In the world of education, this common experience is traditionally captured through the "yearbook." (In fact, many high school and college reunion services draw their data from yearbooks.) This nostalgic itch that afflicts all who fondly recall their time in school is recognized by institutions. After all, alumni are donors, and many long to recall times when they made their earliest career choices, met future spouses, or made lifelong friends. Colleges and universities have capitalized on this longing: they throw reunion parties, support alumni publications, build local alumni associations, set up clubs (with discounted lodgings), partner with credit card companies, and even sponsor cruises and overseas tours with their faculty as guides.

One unusual way these institutions have tried to reach out to their alumni is through the digitization of their student publications. These include alumni newsletters and magazines, student newspapers and publications, and yearbooks.

Of all these publications, the yearbook is the most important because it represents the entire senior class of that year and not merely those who had enough gumption to write for the student newspaper or enough notoriety to show up in it. The yearbook is more census than chronicle, and for many alumni, it offers the broadest survey of their institution during their years of attendance. It is also the next best potential source of information for locating fellow students outside of the institution's own alumni office. It is a source of nostalgia, personal genealogy, and contextual biography.

Like any other digitization project, digitizing a yearbook requires balancing costs, results, and goals. Here are three yearbook projects, with my comments, that exemplify approaches to this kind of work. (I'll add more as I locate new and distinct examples.)

University of Wisconsin, Milwaukee
Format: This collection appears to use 8-bit grayscale and 24-bit color imaging. Displayed are individual JPGs, which are organized by volume with the original text available for viewing.
Pros: This collection has some real strengths: you can conduct full text searches of all of the yearbooks, with the hit terms highlighted on the JPG image. Moreover, upon command, you can see the raw (presumably keyed) text alongside each original page.
Cons: On the other hand, the "table of contents" for each volume is a list of every single page of the yearbook itself. The JPG images tend to be weak while the zoom feature is limited (probably because of the weakness of the image). The visible full text is entirely unformatted, although the complications of yearbook data make presentation of it in any other way cost prohibitive. And while you can print out each page as a PDF, you can print only a single page. This manages to control the size of the file for the printing queue, which is a good thing, but there are no allowances for printing a range of pages.

University of South Carolina
Format: This collection appears to use 8-bit grayscale and 24-bit color imaging. However, instead of single pages, the output is volume-level PDFs. Moreover, the PDFs are nonsearchable, meaning you can't use the find function in Acrobat Reader.
Pros: Basically the advantages are those inherent in the PDF software itself: you can page through one at a time or jump to the desired page; the zooming is robust, as are the images themselves; and you have the ability to download the entire work or print large portions of it.
Cons: Obviously, there is no full text search available across or within each volume. Unfortunately, the PDF file includes no "bookmarks," which could have served as an ad hoc table of contents. And the files are large...very large. The 1957 yearbook alone is 142 MB.

University of Wisconsin, Madison
Format: This collection appears to use 8-bit grayscale and 24-bit color imaging as well. It loads individual GIFs, which are aggregated by volume.
Pros: Full text searching is possible for this collection, with your hit terms highlighted on the GIF image. This collection, unlike the other two, includes a true electronic table of contents that allows you to leap directly to the yearbook's sections. Moreover, the results list is extremely robust, giving complete metadata for all of the volumes that have the term, down to the page number itself.
Cons: There are a few. The paging mechanism is primitive, using a simple next/previous command structure (there is no "go to page" command). GIF images tend to be weak, as is the zoom feature, which allows only four zoom levels, the largest not being all that large. The full text is not visible, suggesting that the text file is probably raw (i.e., uncorrected) data captured using optical character recognition software. Finally, the print option is a screen shot of the GIF opened in a new browser window.

Because yearbooks tend to be messy affairs, mixing photographs and text in all sorts of ways, there is an emphasis on high-quality image capture (hence none of these projects uses bitonal images). As for text capture, that is where cost effectiveness and platform capability--to display or not display text--come into play. Those about to digitize a yearbook collection would be wise to think carefully about their intended audience and the type of information that audience wants. My own view inclines toward high-quality imaging, to maintain the robustness, texture, and color of the photographs and artwork that appeared in the original yearbook, combined with some text searchability. Whether that means full text searching across all of the yearbooks or just within each yearbook using searchable volume-level PDFs is a matter for the institution launching such a project to decide as it figures out what kind of functionality it wants to support. After all, I'm likely to look only at the yearbooks for my year of graduation and probably the two years before and after at most.

Bennett Lovett-Graff
Publisher, Content Solutions
National Archive Publishing Company
Digitization, Microfilming, and Publisher Services

December 6, 2006

Bitonal TIFFs

Bitonal imaging forces the scanner to record each dot, at the dots-per-inch density set by the resolution, as either black or white. Bitonal images are ideal for text for several reasons.

1. "Blackened" text, particularly at a high resolution, is easier for an OCR engine to recognize. Text in a grayscale format (see next entry) has a "blur" effect around the edges of the text, making it more difficult for OCR engines to recognize a letter correctly.

2. Bitonal whitens the background, creating an even "page tone" for all of the documents, whereas grayscale or color imaging would pick up all of the variants in page coloration and even lighting!

3. Bitonal images are much smaller than grayscale and color images. This makes the images themselves much more manageable for purposes of scanning, data transfer, Web or FTP uploading and downloading, and Web displaying.
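To put rough numbers on the points above, here is a minimal Python sketch. It is only an illustration, not how any particular scanner or imaging package works: the function names, the cutoff value of 128, and the 8.5 x 11 inch page are all assumptions of mine, and the sizes are for raw, uncompressed pixel data (real TIFF files add headers and row padding, and are usually compressed).

```python
# Hypothetical helpers for illustration only; names and defaults are assumptions.

def to_bitonal(gray_rows, threshold=128):
    """Turn 8-bit grayscale samples (0-255) into 0 (black) or 1 (white).

    Every sample below the threshold becomes black; everything else
    becomes white. The cutoff of 128 is an assumed midpoint.
    """
    return [[1 if sample >= threshold else 0 for sample in row]
            for row in gray_rows]


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw pixel data size in bytes, ignoring headers, padding, compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # A tiny 2 x 2 "scan": dark ink, light paper, and two borderline grays.
    print(to_bitonal([[12, 200], [128, 127]]))

    # Assumed letter-size page (8.5 x 11 inches) scanned at 300 dpi.
    for label, bits in [("bitonal", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]:
        size = uncompressed_bytes(8.5, 11, 300, bits)
        print(f"{label}: {size:,} bytes")
```

Under those assumptions, a single page comes to 1,051,875 bytes as bitonal, 8,415,000 bytes as 8-bit grayscale, and 25,245,000 bytes as 24-bit color: the bitonal page is one-eighth the size of the grayscale one and one-twenty-fourth the size of the color one before any compression is applied.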

Here is a page spread from a religious publication called Restoration. On display is a bitonal, 300 dpi TIFF file.