Wednesday, December 31, 2008

Google Book Settlement for Librarians

Permission is granted to circulate this article, and to print and copy for non-commercial purposes.

Adam Corson-Finnerty is a senior library administrator at the University of Pennsylvania. He is the author of three obscure books that will be affected by the Google Book Settlement. The sentiments expressed in this article are purely his own and do not represent the views of his library or his university.

Comments are invited: corsonf@pobox.upenn.edu 215-573-1376


Google Book Settlement—For Librarians
By Adam Corson-Finnerty


Introduction

I think that the Google Book Settlement is good for publishers, good for authors, good for libraries, and good for the people.

The reading public will have millions of books available to enjoy, and copyright holders will get new revenue. It will bring us much closer to the Great Library in the Sky.

Once ratified by a judge, the privately negotiated settlement gives Google a green light to digitize virtually every English book in the world, not to mention a few million books in other languages. Whatever money Google makes from this enterprise will be split 37-63, with the lion’s share going to authors and publishers.

What’s Going to Happen

If the Settlement is approved by the presiding judge, as is likely, then here is what will result:

• We will be able to search for, and find, every book; we will be told whether it is available in the nearest library; and we will be able to purchase online reading access, or a print copy at a fair price, delivered to our office or home;

• We will be able to walk into any public library or college library, and read the full text online of every book in Google’s vast database, for free;

• Colleges and universities will be able to license community access to Google’s entire book database--allowing students and faculty to read and print every book.

This will happen because Google--at its own expense--will scan, process, and save a digital copy of every book it can get its hands upon. We readers will thereby benefit from the enlightened self-interest of our rich "Uncle Google."

Uncle Google will include all books in the “public domain,” that is, books that are out of copyright. There are a few million of those.

Google will include every in-copyright book that has gone out-of-print. In the Settlement, these are referred to as “commercially unavailable.” Out-of-print books are those titles that the publisher has decided to cease printing, cease stocking, and cease selling. These titles far outnumber the books that are currently “commercially available.” There are an estimated 20-30 million such titles in the US copyright arena.

Finally, we can predict that the Google database will also include virtually every in-print book, because publishers would be crazy not to make their stuff available through what will quickly become “the largest bookstore on earth.”

Google’s Folly?

My biggest concern about the GBS is that Uncle Google will decide it is a bad investment and back out of it. The Settlement envisions that possibility, and provides a mechanism for transferring the deal to another entity or entities, should Google back out.

The settlement establishes a non-profit organization to administer the deal on behalf of the rights-holders. It is called the Book Rights Registry. This Registry will be controlled by eight Directors: four selected by the Association of American Publishers, and four by the Authors Guild.

Why them? Because they were the ones who brought a class action suit against Google, on behalf of all US authors and all US publishers.

The economics work like this: Google collects its licensing fees, sales income, and whatever other revenue it manages to wrest from the book database, and sends 63% to the BRR. The BRR divides the lucre between various rights-holders. In this, the BRR will be very much like ASCAP, which collects royalties for music performance, and divides them among its "Composers, Authors and Publishers."
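To make the split concrete, here is a minimal sketch of the arithmetic. Only the 63/37 ratio comes from the settlement as described above; the function name and the sample revenue figure are hypothetical, chosen purely for illustration.

```python
# Illustrative only: the 63% share is the figure quoted in the article.
# The function name and the sample amount are hypothetical.
RIGHTS_HOLDER_SHARE_PERCENT = 63  # portion Google sends to the BRR


def split_revenue(total_cents: int) -> tuple[int, int]:
    """Return (registry_cents, google_cents) for revenue given in cents."""
    # Integer arithmetic avoids floating-point rounding on money.
    registry_cents = total_cents * RIGHTS_HOLDER_SHARE_PERCENT // 100
    google_cents = total_cents - registry_cents
    return registry_cents, google_cents


registry, google = split_revenue(1_000_000)  # a hypothetical $10,000.00
print(f"Registry: ${registry / 100:,.2f}  Google: ${google / 100:,.2f}")
```

How the BRR then divides its share among individual authors and publishers is a separate question that this sketch does not attempt to model.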

Under most book contracts, the rights to the work revert to the author once the publisher decides to stop selling the book. Under the terms of the Settlement, Google will be free to scan any book that is deemed not to be “commercially available.” A rights-holder can opt out of this arrangement, but there is very little reason to do so. Allowing your book to be brought back to life by Google will make you a little bit of money, make you “searchable,” and share your wit and wisdom with generations to come. And the deal is non-exclusive, so it doesn’t prevent you from making other deals.

Rights-holders do need to let the BRR know that they are out there, and give an address so the check can be sent. The agreement envisions that each rights-holder will receive a $200 per title "inclusion fee" for allowing Google to scan the book and make sections of it viewable on its site. Once Google launches its Subscription Service, rights-holders will get a small cut of the receipts. Should anyone pay for the privilege of viewing and “owning” an entire book, the author or publisher gets a piece. And should someone want to order a physical copy of a book, produced through a print-on-demand service like Lightning Source or BookSurge, then the rights-holder gets a cut of that too.

There are many ways that Uncle Google can make money from his book database. First, he will sell annual licenses to the full database, and to subsets of the base (Poetry, perhaps, or Self-Improvement). These licenses can be sold to school, college, and university libraries, as well as to printshops, corporations, and other commercial entities.

Public and college libraries get one free license for one machine in each library branch (or for every 4,000 – 10,000 students). The long lines at this one machine may cause them to purchase additional licenses (at a discount, one would hope, but a price will be paid).

Google will allow individuals to "purchase" digital access to its books, with the price being set either by the rights-holder or by Google. Google will also allow individuals to print out sections or an entire work, at a per-page fee. Advertisements will be placed alongside books: a microscope supply company next to Germ Hunter, for instance, or a nanny service next to Jane Eyre.

The draft settlement indicates that Google does not intend immediately to sell books through print-on-demand, but that it may decide to do so in the future. Similarly, the company may undertake sales of e-books for the Kindle, the Sony Reader, and other handheld devices.

In the strictest economic terms, Google is one of the few companies that has figured out how to monetize eyeballs. To Google, books are just more "content," another chunk of stuff that can be spidered and indexed and given an algorithmic massage, so that more people, spending ever more time in front of their screens, will keep on googling.

Disruptive Technology

New technologies are almost always disruptive. Once they are adopted, things begin to change. Sometimes whole industries change. Sometimes whole societies change. In this case, the combination of the Internet, e-commerce, and print-on-demand technology is upending the book publishing business. It will also upend the library business.

While most librarians may be aware of POD, it is somewhat less likely that they will know about the Espresso Book Machine. Put this machine together with a book database, and you have the makings of yet another revolution.

The EBM does something very cool. It can print out a 300 page book, with color cover, in four minutes. The end product looks just like a “real” book—because it is a real book. Perfect bound, good paper, clear type. It’s a book.

The soon-to-be-released 2.0 version has a modest footprint, something like 6 ft. by 3 ft. It will fit nicely in a bookstore, or the lobby of a library, or in a coffeeshop. The EBM has been called “an ATM for books.” It is not quite that yet, but the analogy is spot-on.

The materials cost for a 300-page book is just under $3.00. Amortize the cost of the machine itself, and you have a per-book cost of at most $6.00. Hook it up to the one-million-title Internet Archive and you can publish a lot of interesting and valuable titles—all free of copyright charges, because the books are in the public domain.

Now imagine hooking the EBM up to the Google database. If Google keeps on trucking, then what you will have is the ability to print pretty much every book ever written. Right there in your library.

From a business point of view, it makes perfect sense. The old model of “print, then distribute,” is completely flipped to “distribute, then print.” Very much like what has happened in the Music business—except that in the near term, the physical object (the book) will be the preferred outcome.

No doubt some librarians will worry about what these new books will cost. Indeed, what will it cost to buy a permanent “view” of a book on Google’s database? And what will libraries have to pay to have seven, then ten, then thirty million books available as a subscription?

The answer is, most likely, less than you might think. Remember that digital music didn’t really start to sell until Apple’s iTunes priced everything at 99 cents a song.

Google has proposed that initially the purchase of permanent e-access to a book will range from $1.99 to $21.99. Nothing higher, and a staggering 65% of the titles will cost less than $7.99. Google will seek permission from the BRR to adjust these rates after three years, and to price titles according to computer-driven algorithms that produce maximum revenues.

As for printed books from the Espresso Book Machine, one should expect that they would cost no more, and probably significantly less, than a book one buys at a Barnes & Noble store, or through Amazon.com.

Library subscriptions to the entire Google Book database are harder to predict. One of the most expensive databases in the academic library world is ScienceDirect. Each of America’s top research universities pays more than $1,000,000 per year to license its contents for their students, researchers, and faculty. They pay this astronomical sum because ScienceDirect provides access to more than 4,000 top journals in Science, Technology, and Medicine.

Google’s strategy will probably be quite different from that of Elsevier, the owner of ScienceDirect. Rather than “sell high” to a few institutions, Google will want to “sell low” so that its base becomes ubiquitous. After all, it wants customers for life, and college students will lose access to “their” books when they graduate, unless they purchase the individual books in print, or purchase permanent viewing rights.

An Alternate Universe

The Academic Universe is not the same as the Business World. The denizens think different. Differently. Different-ur.

Thus the academic “take” on the Google Book Settlement may be quite distinct from the Publishers’ take, or the Wall Street take, or the view at the Department of Justice. I know this because I live and work in the cat-bird seat of the Academy: the Library.

The Academic Library has its nose in everything that every scholar has done, is doing, or hopes to do. To us, Google is just another information feed, but it’s one hell of a game-changer.

Libraries keep things. And Research Libraries keep almost everything. Harvard keeps twelve million volumes. Stanford, Yale, UNC, and Penn all have between six and ten million books. At the Penn Libraries, we keep almost two million volumes in a high-density storage facility on the old printing floor of the Philadelphia Bulletin Building. Princeton, Columbia, and the New York Public Library have a shared storage facility on a large tract of land in central New Jersey. Books from these units can be retrieved within 48 hours.

But once Google comes fully online and virtually every book in any of these facilities is discoverable, readable, printable, even print-on-demandable at your local library or bookstore or coffeeshop—then do hundreds of libraries really need to keep copies?

The answer will certainly be that they do not, and the trend toward regionalization and consolidation of holdings will accelerate dramatically. One can see the day when only a few copies will be preserved, with one or two designated as “master” copies, to be held forever.

They will be held forever for at least two reasons. The first reason is that the best preservation method for the text that is contained in a book is still a book. Ink on paper outlasts every other text storage methodology that we have devised—certainly every digital technology.

Paper and parchment have proven to be very durable long-term storage devices. A book of Shakespeare’s sonnets, printed in the 17th century, can still be read by today’s reader without the aid of any device, save perhaps glasses.

The second reason that physical copies of books will be maintained is that the book itself is an object of scholarly interest. Dozens of centers for the “History of the Book” have developed around the world. To such scholars, the physical object itself, the book as artifact, is essential.

It is reasonable to expect that some vast library collections will become, in effect if not in name, Museums of the Book. And that people will travel to these museums to look at books, read books, and study books. Many of the great “special collections” libraries already play this role. One thinks of the Newberry Library in Chicago, the Morgan Library in New York, the Huntington Library in California, the Houghton Library at Harvard. And, of course, the Library of Congress.

But most research libraries will have the option of getting out of the massive book storage business.

The Digital Preservation Blues

Some of the “participating libraries” in the Google Books program have described it as having “preservation” as a major outcome. This is a dubious claim.

Oya Rieger is a senior librarian at Cornell. Her responsibilities include electronic scanning of books and manuscripts, the maintenance of a digital repository for the articles and papers that are produced by Cornell faculty, "e-scholarship" programs, "e-publishing," and digital preservation.

Long before Google decided to hoover up the world's books, Cornell, Michigan, Penn, and a handful of other institutions had begun the slow, careful process of scanning print materials, putting digital "facsimiles" up on line for all the world to read, and worrying about how these digital records would be maintained for future generations of scholars.

These efforts look rather puny in comparison to the Google operation. Cornell was digitizing about 1.5 million pages a year. That sounds like a lot, until you realize that this represents only 5,000 books. In contrast, the Google-University of Michigan initiative is scanning 30,000 books per week.
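The gap is easier to appreciate when both operations are put on the same yearly basis. The sketch below uses only the figures quoted above; the one added assumption is that the Google-Michigan rate holds for all 52 weeks of the year.

```python
# Figures are the ones quoted in the article; WEEKS_PER_YEAR is the only
# added assumption (scanning continues at the same rate all year).
CORNELL_PAGES_PER_YEAR = 1_500_000
CORNELL_BOOKS_PER_YEAR = 5_000
GOOGLE_MICHIGAN_BOOKS_PER_WEEK = 30_000
WEEKS_PER_YEAR = 52

# Average book length implied by Cornell's numbers.
pages_per_book = CORNELL_PAGES_PER_YEAR // CORNELL_BOOKS_PER_YEAR

# Google-Michigan output converted to a yearly figure.
google_books_per_year = GOOGLE_MICHIGAN_BOOKS_PER_WEEK * WEEKS_PER_YEAR

# How many times faster the Google-Michigan operation runs.
speedup = google_books_per_year / CORNELL_BOOKS_PER_YEAR

print(f"Implied pages per book: {pages_per_book}")
print(f"Google-Michigan books per year: {google_books_per_year:,}")
print(f"Ratio to Cornell: {speedup:.0f}x")
```

Put another way, at these rates the Google-Michigan operation would match Cornell's entire annual output in a little over one day of a seven-day scanning week.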

Rieger was asked to undertake a study of such large-scale digitization initiatives, and to ask whether they served the need for digital "preservation." Her conclusion:

[T]here is no evidence to suggest that the corporate and non-profit partners have any long-term business plans for maintaining access to digitized collections or for migrating delivery platforms through future technology cycles.

In other words, No.

Google has not undertaken its enormous scanning project with preservation in mind. The goal is current online access. Therefore, the initial scans were considered "good enough" if they could be read easily on a screen, and if 95-99% of the words could be "recognized" by OCR software (which converts pictures of words to machine-readable, searchable text).

This is a perfectly understandable decision, from a business point of view. And the cooperating libraries—Michigan, Stanford, Harvard, Oxford—are to be commended for allowing Google to create a very good thing, even if it is not the ideal thing.

Rieger's study, "Preservation in the Age of Large-Scale Digitization," sets out what "the ideal thing" might require. A true preservation program requires very high quality-control standards. It may sound downright unappetizing, but a preservationist must deal with such things as ingest workflow, file format migration, and bit corruption. Suffice it to say that a complete re-scanning of every volume—under strict quality control standards-- may be just the start of a true digital book preservation program.

GBS Questions for Librarians


Broadly speaking, the Google Book Settlement is a good thing. However, I have some very specific questions that I have not seen addressed by the Library community.

1. The Book Rights Registry is a non-profit entity that plays a critical role in administering the agreement. The BRR is controlled by four author representatives and four publisher representatives, with five votes needed for decisions. Why aren’t there any voting library representatives on this board? Or “public” representatives?

2. This question is made more important by my reading of what happens if Google decides that the book-scanning business is a money sinkhole. We saw Microsoft bail out of its Live Search Books scanning program, so this is not moot. If Google bails, then the BRR takes over the business (with some library participation) and must seek new commercial or non-commercial partners. All the more reason to have some “public” directors.

3. Here is something cool to think about. In-print books and public domain books appear to be the tip of the iceberg. The greatest number of titles are out-of-print but still in copyright. I have seen estimates of 20-30 million titles. In most cases, the rights to such works may have reverted to the author. Google is going to have a green light to scan these titles for inclusion in its database, and for selective display, and commercial use, unless the author formally objects. The author will get a cut of any revenue, through the distributive mechanism of the BRR. All well and good. But this also opens an interesting opportunity. Allowing your book to be in the GBS is non-exclusive. Therefore, authors could also give publication rights to a non-profit entity, perhaps their university library, perhaps to a coalition of libraries.

4. I have heard that Google’s scans are not preservation quality, and perhaps not even print-on-demand quality, and that mass scanning and machine-only OCR cause many quality problems. These include missed pages, pages that are blurry, pages that are cut off, foldouts that are skipped or distorted, meaningless word translations, and so on. Google itself is at pains to say something about this in the draft settlement:

17:10 Scan Quality. Google will strive to detect and eliminate errors in the Digitization quality or Metadata. Google makes no guarantees, however, regarding the Digitization quality or Metadata quality of any Book or Insert….


5. A related question: the Google Agreement is between the company and authors and publishers. Artists, photographers, and illustrators are not included. I have heard that this will mean the images in an in-copyright book will be blanked out. This would be a terrific loss to general readers as well as scholarly readers. One hopes that Google is pursuing a comparable “deal” with these groups.


6. A different sort of question is this: If Google is successful, then virtually every book ever printed in English, and millions of titles in other languages, will be available to read, print out, and purchase through print-on-demand. So most academic research libraries can get out of the book storage business, right?

You can see what I mean: save a few preservation master copies, and a dozen circulating copies for those who want to study the book as artifact, and dump the rest. For most of our patrons, if they want to read the book on paper, a printed facsimile should do just fine. Are we ready to crawl out on that limb?


7. I am really puzzled by what is said about “mining” the GBS database. Only “non-consumptive” research will be permitted, and scholars must apply for permission to use the database, stating their intent.

What the heck is "non-consumptive" research? Here is how the draft settlement describes the term:

"Non-Consumptive Research" means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book.

Got that?

This might appear to mean that you can count words and analyze patterns, but you cannot see the words or phrases in context, if seeing is indeed "consuming". Take this bit of possible research: Suppose you wanted to study how widely the term “fulsome praise” has transmuted from having a negative connotation to having a positive one. You would have to see the phrase in context, which means that you have to "consume" some additional words, maybe even a paragraph or two.

I have been assured by a representative of one of the chief library partners that such research will be allowed. And, indeed, the settlement indicates "Linguistic Analysis" will be allowed. This is defined as "Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books."

OK. So we can eat a few words, but not a "substantial" number of words. One hopes, and assumes, that "consumptive" research which allows reading to "understand the intellectual content" of a work will be provided for under the "subscription" service. However, the subscription database would have to be organized to aid massive analysis by computer, and the ability to jump out to the text, and back in to the data. There is no indication in the settlement that the subscription database will be optimized for scholarly inquiry. Indeed, it appears quite the opposite: that there will be, in effect, two databases, one for substantial reading, and one for "non-consumptive" research.

Google will allow the establishment of two outside research bases, both of which are restricted to “non-consumptive” research. It is likely that a coalition of libraries led by the University of Michigan will manage one such database. And I wouldn't be surprised to learn that Stanford will have first dibs on managing the other (but see my blog on the shakeup at the Stanford Library: http://musingsofcorsonf.blogspot.com/2008/12/shakeup-at-sul-stanford-university.html )

If my conjecture is correct, then the settlement represents a very significant loss to the academic community--the loss of true "consumptive" research.





The full text of the draft Google Book Settlement can be downloaded from
http://books.google.com/booksrightsholders/agreement-contents.html

See Also:

1. ALA/ARL Overview of Settlement:
http://www.arl.org/bm~doc/google-settlement-13nov08.pdf

2. Principles and Recommendations for the Google Book Search Settlement, by James Grimmelmann:
http://laboratorium.net/archive/2008/11/08/principles_and_recommendations_for_the_google_book

3. "Preservation in the Age of Large-Scale Digitization," by Oya Y. Rieger. A report to the Council on Library and Information Resources, February 2008:
http://www.clir.org/pubs/abstract/pub141abst.html


