JCDL 2006 Conference Notes

Day 2 – Opening Session

Monday lunch – Open JCDL community; future of JCDL
Tues lunch – demonstration at HSL Collaboration Center;

Plenary Panel – Getting Books Online

a lot of debate at the past conference over the topic "Google as Library"

Dan Clancy – Engineering Director, Google Books; used to be IS director at NASA; undergrad at Duke
David Ferriero – Chief Executive of the Research Libraries at NY Public Library, a G5 site contributing to the Google Books project; previously director of the library at Duke for 8 years
Dan Greenstein – University Librarian, California Digital Library
Moderator: Cliff Lynch – 18 years at U of CA as Director of Library Automation; now Executive Director of the Coalition for Networked Information (CNI)

– online resources capture the public imagination; implications for teaching and learning, preservation, copyright, and economic models for the publishing industry – ideas that go back a long way, to the beginning of the digital age. much controversy

– a lot of talk about getting books online, but not about what we'll do with them once they are online

– Google has talked about showing only 'snippets' of books in order to respect copyright

– what happens if we succeed? what does it mean to have open content in the context of large collections of digital books?

Google – Dan Clancy

– how many Google interns have been in a library in the last year? about half raised their hands. a testament to the ease of use of the internet; access to primary sources of information has increased, with an inverse impact for things not on the internet

– Google's mission: "organize the world's information and make it universally accessible to all"

– goal: create a comprehensive virtual card catalog of all books in all languages, while respecting publishers' rights

– initiative intended to get at the 85% of books out-of-print

– try not to make editorial decisions

– 92% of the world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers

– argued that digitization is fair use

– 3 user experiences: sample pages view ~15%; snippet view 65%; full view

– snippet view – controversial; book is in copyright; cannot page through the book; Google picks which snippets you see

– most popular searches – harry potter(!)

– scale – 30 million books

– Google developed its own scanning technology

research challenges
– "scanning is the easy part"; what are some of the research challenges?

– how to balance copyright with the public interest?
technical, social

– previewing vs. using for long periods of time – challenge in digital world;

– ontology of objects – how do we connect objects? link structure makes the web: annotated relationships between objects, all user created. for books, a web of implicit relationships between the content: FRBR hierarchy; references; authorship; temporal relationships; topical similarity. the more content, the harder it is to find information
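
The "web of implicit relationships" idea can be pictured as a graph with typed edges. The sketch below is purely illustrative and not anything the panel presented: the class name `BookGraph`, the relation names, and the placeholder identifiers are all assumptions made up for this example.

```python
from collections import defaultdict


class BookGraph:
    """Toy store of typed relationships between objects (hypothetical design)."""

    def __init__(self):
        # Maps (source, relation) to the set of related targets.
        self._edges = defaultdict(set)

    def relate(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Return the targets of one relation type, sorted for stable output."""
        return sorted(self._edges.get((source, relation), set()))


graph = BookGraph()
# Placeholder identifiers, not real bibliographic records.
graph.relate("edition:1", "edition_of", "work:W")   # FRBR-style hierarchy
graph.relate("edition:2", "edition_of", "work:W")
graph.relate("work:W", "cites", "work:X")           # reference
graph.relate("work:W", "written_by", "author:A")    # authorship

print(graph.related("work:W", "cites"))
```

Unlike hyperlinks, none of these edges is written down by a user; each one would have to be inferred from the scanned text and its metadata, which is what makes it a research challenge.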

David Ferriero – NY Public Library

– why would NY Public Library get involved with Google?
under discussion with Google for 3 years before he arrived; lots of committees – a very conservative org. rationale – already involved with massive digitization projects (Digital Gallery and In Motion – 13 different African American migrations; streaming video, etc.; worked with NY public schools to create lesson plans); Google offered a way to get non-unique material digitized. could give them experience with a corporate partnership; put NY Public Library at the table with other partners. library partner perspective – incredible collaboration with Google; not an "evil corporate empire". resulted in attitude changes; learning about librarianship on the part of Google. he spent 31 years at MIT – working with Google is very similar

– public domain books; duplication throughout the research libraries involved with Google – what does this mean for research library collections?

– concerns about how much info Google has about library user communities

– long-term preservation issues; libraries are trying to turn this into a digital preservation activity

Dan Greenstein – Open Content Alliance

project of the Internet Archive – scanning out-of-copyright books; funding from corporate and private donors

– RLG has undertaken a standards process for scanning…

– scanning furiously; results of scanning: raw files thrown away; JPEG 2000 archival masters; searchable PDF; XML file that contains the OCR; bibliographic metadata

– funding streams much smaller than Google's; in OCA, there's a thin lead organization; not an alternative to Google Books online

– qualities of open – access to the public domain; both OCA and Google are making out-of-copyright books accessible

Cliff's questions to panel:

1. Scanning; OCR – how much of an issue is OCR? do we need large-scale investments in OCR research?

Google: books pose different challenges for OCR: in the past, 98% OCR quality was fine; however, when turning the text into speech, it becomes more difficult. currently OCR is the most expensive phase of the Google process, computationally. Google is not focusing as much on the optical side but on the software on top of it: more complex language models. an AI problem; our language models are a far cry from human ability; humans use book-specific models, learning the curvature of the letters in each book. computers don't do this well. need more emphasis on more obscure languages. OCR is very text based today; need a data-driven, machine learning approach.

NY Public: 99% isn't quite good enough for most research libraries

OCA: Adobe is looking at this; OCR needs to be researched further

Google is releasing an open-source OCR package
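
To put the 98% and 99% figures in perspective, here is a rough back-of-the-envelope sketch. It assumes accuracy is measured per character, and the page size (2,000 characters) and book length (300 pages) are illustrative assumptions, not numbers from the panel.

```python
def expected_errors(accuracy: float, chars_per_page: int = 2000, pages: int = 300):
    """Return (errors per page, errors per book) for a per-character accuracy.

    chars_per_page and pages are assumed, illustrative defaults.
    """
    error_rate = 1.0 - accuracy
    per_page = error_rate * chars_per_page
    return per_page, per_page * pages


for accuracy in (0.98, 0.99):
    per_page, per_book = expected_errors(accuracy)
    print(f"{accuracy:.0%} accuracy: ~{per_page:.0f} errors/page, ~{per_book:.0f} errors/book")
```

Under these assumptions even 99% leaves about 20 wrong characters on every page, which helps explain why that level falls short for research libraries.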

2. international aspects of putting books online? strategies/priorities for diversification?

OCA: aspiration is to digitize internationally

NY Public Lib: 5 partners analysis – 49% of collections English language; 430 different languages represented

Google: 50% foreign in terms of what gets digitized. OCR can handle most Latin scripts now; Arabic and Chinese pose challenges

3. preservation

Google did not design an optimized process for high-dollar, brittle content/collections. quality of content and digitization questions: libraries have different notions of what preservation quality is. focused on scale, but try to ensure that quality is improved.

OCA: for single editions of materials – won't be preservation quality; think of it in terms of collection mgmt: what quality is sufficient to get rid of redundant materials in the collection, avoid cost, and spend on materials that are particularly special?

4. Google is amassing huge textual databases; theoretically an intellectual advantage on the part of Google; can be translated (maybe?) into a business advantage; creation of enormous computational resources; access is private now


research advantage:
interested in how Google can help further research; what can researchers tell Google in terms of helping researchers do their work

business advantage: a company investing money can gain an advantage; the intellectual pursuit is to encourage research that can be done at scale; what is the balance? says that the AI algorithms are not sophisticated enough to mine all the data

NY Public Library: a topic that came up at the original meeting

OCA: nobody has secure right and title to the entire source of Google Books; recognition that over time folks relax; more a problem with universities – how do we platform this information in a way that enables the computational aspects to happen? expectation from the university side that we cannot go through the information computationally; only certain infrastructures can support it; non-trivial problem to support this type of environment in business and some universities

what does persistence mean? will Google provide persistent links? are folks pointing to the book or to a page in the book? what is a persistent identifier? redundancy across organizations as well – how do you organize the links?

