JCDL 2006 Conference Notes

Day 2 – Opening Session

Monday lunch – Open JCDL community; future of JCDL
Tues lunch – demonstration at HSL Collaboration Center

Plenary Panel – Getting Books Online –

a lot of debate about this topic at the past conference, prompted by the theme: Google as Library

Dan Clancy – Engineering Director, Google Books; used to be IS director at NASA; undergrad at Duke
David Ferriero – Chief Exec, Research Libraries at NY Public Library. A G5 site, contributing to the Google Books project. Director of the library at Duke for 8 years
Dan Greenstein – University Librarian, California Digital Library
Moderator: Cliff Lynch – 18 years at U of CA – Director of Library Automation; now Internet Archive Director

– online resources; captures public imagination; implications for teaching and learning; preservation, copyright, economic models for publishing industry – ideas that go back a long way, to the beginning of the digital age. much controversy

– a lot of talk about getting books online but not about what we'll do with them once online

– Google has talked about showing 'snippets' of books to respect copyright

– what happens if we succeed? what does it mean to have open content in the context of large collections of digital books?

Google – Dan Clancy

– how many Google interns have been in a library in the last year? about half raised their hands. testament to ease of use of the internet; access to primary sources of information increased. inverse impact for things not on the internet

– Google's mission: "organize the world's information and make it universally accessible and useful"

– goal: create a comprehensive virtual card catalog of all books in all languages, while respecting publishers' rights

– initiative intended to get at the 85% of books that are out of print

– try not to make editorial decisions

– 92% of the world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers

– argued that digitization is fair use

– 3 user experiences: sample pages view (~15%); snippet view (65%); full view

– snippet view – controversial; book in copyright; cannot page through book; google picks which snippets you see

– most popular searches – harry potter(!)

– scale – 30 million books

– Google developed its own scanning technology

research challenges
– "scanning is the easy part"; what are some of the research challenges?

– how to balance copyright with public interest? technical and social challenges

– previewing vs. using for long periods of time – challenge in digital world;

– ontology of objects – how do we connect objects? link structure makes the web; annotated relationships between objects. all user created. web of implicit relationships between the content; FRBR hierarchy; references; authorship; temporal relationships; topical similarity; the more content the harder it is to find information
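
As a rough illustration of the "FRBR hierarchy" mentioned above, here is a minimal sketch of its four levels (work, expression, manifestation, item) as nested records. The field names and the sample values are my own placeholders, not anything shown in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One physical copy, e.g. the volume a library hands to the scanner."""
    holder: str


@dataclass
class Manifestation:
    """One published edition of an expression."""
    publisher: str
    year: int
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One realization of a work, e.g. a particular translation."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual creation."""
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every physical copy hanging off a work, across all editions."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


# Placeholder data: one work, two expressions, three copies in total.
example = Work(
    title="Example Work",
    author="Example Author",
    expressions=[
        Expression("en", [Manifestation("Publisher A", 1900,
                                        [Item("Library 1"), Item("Library 2")])]),
        Expression("fr", [Manifestation("Publisher B", 1910, [Item("Library 1")])]),
    ],
)
print(count_items(example))
```

Grouping scans this way is one possible answer to the duplication question: several libraries' copies collapse under a single manifestation instead of showing up as unrelated books.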

David Ferriero – NY Public Library

– why would NY Public Library get involved with Google?
under discussion with Google 3 years before he arrived; lots of committees – very conservative org. rationale – involved with massive digitization projects (Digital Gallery and In Motion – 13 different African American migrations; streaming video, etc.; worked with NY public schools to create lesson plans); Google offered a way to get non-unique material digitized. could give them experience with corporate partnership; put NY Public Library at the table with other partners; library partner perspective – incredible collaboration with Google; not "evil corporate empire". resulted in attitude changes; learning about librarianship on the part of Google; spent 31 years at MIT – very similar working with Google

– public domain books; duplication throughout the research libraries involved with Google – what does this mean for research library collections?

– concerns about how much info Google has about library user communities

– long-term preservation issues; libraries are trying to turn this into a digital preservation activity

Dan Greenstein – Open Content Alliance

project of the Internet Archive – scanning out-of-copyright books; funding from corporate and private donors

– RLG – has undertaken a standards process for scanning…

– scanning furiously; results of scanning: raw files thrown away; JPEG 2000 archival masters; searchable PDF; XML file that contains the OCR; bibliographic metadata

– funding streams much smaller than Google's; in OCA, there's a thin lead organization; not an alternative to Google Books online

– qualities of open – access to public domain; both OCA and Google are making out-of-copyright books accessible

Cliff's questions to panel:

1. Scanning; OCR – how much of an issue is OCR? do we need large-scale investments in OCR research?

Google: books pose different challenges for OCR: in the past, 98% OCR quality was fine; however, when turning these into speech, it becomes more difficult. currently OCR is the most expensive phase of the Google process, computationally. Google is not focusing as much on the optical side but on the software on top of that. more complex language models. AI problem; our language models are a far cry from human ability; humans use book-specific models, learning the curvature of letters in each book. computers don't do this well. need more emphasis on more obscure languages. OCR is very text based today; need a data-driven, machine learning approach.

NY Public: 99% isn't quite good enough for most research libraries
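
To get a feel for why 98% or even 99% may fall short, here is a back-of-the-envelope sketch. It assumes the percentages are per-character accuracy, that errors are independent, and that a page holds about 2,000 characters in 5-letter words. None of those numbers came from the panel; they are only there to make the arithmetic concrete.

```python
def expected_char_errors(accuracy: float, chars_per_page: int = 2000) -> float:
    """Expected number of wrong characters on one page."""
    return (1.0 - accuracy) * chars_per_page


def word_accuracy(accuracy: float, chars_per_word: int = 5) -> float:
    """Chance that every character of a word is right (independent errors)."""
    return accuracy ** chars_per_word


for acc in (0.98, 0.99):
    print(
        f"{acc:.0%} char accuracy: "
        f"~{expected_char_errors(acc):.0f} errors/page, "
        f"~{word_accuracy(acc):.1%} of words fully correct"
    )
```

Under these assumptions a "98% accurate" page still carries dozens of wrong characters, which matters a great deal for search and for text-to-speech.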

OCA: Adobe is looking at this; OCR needs to be researched further

Google is releasing an open-source OCR package

2. international aspects to putting books online? strategies/priorities for diversification?

OCA: aspiration is to digitize internationally

NY Public Lib: 5 partners analysis – 49% of collections English language; 430 different languages represented

Google: 50% foreign in terms of what gets digitized. OCR can handle most Latin scripts now; Arabic and Chinese pose challenges

3. preservation

Google did not design an optimized process for high-dollar brittle content/collections. quality of content and digitization questions: libraries have different notions of what is preservation quality. focused on scale but try to ensure that quality is improved.

OCA: for single editions of materials – won't be preservation quality; think of it in terms of collection management. what quality is sufficient to get rid of redundant materials in the collection; avoid cost and spend on materials that are particularly special.

4. Google is amassing huge textual databases; theoretically an intellectual advantage on the part of Google; can be translated (maybe?) into a business advantage; creation of enormous computational resources; access is private now


research advantage:
interested in how Google can help further research; what can researchers tell Google in terms of helping researchers do their work

business advantage: company investing money can gain advantage; intellectual pursuit is to encourage research that can be done at scale; what is the balance? says that the AI algorithms are not sophisticated enough to mine all the data

NY Public Library: topic that came up at original meeting

OCA: nobody has secure right and title to the entire source of Google Books; recognition that over time folks relax; more a problem with universities – how do we platform this information in a way that enables the computational aspects to happen; expectation from the university side that we cannot go through the information computationally; only certain infrastructures can support it; non-trivial problem to support this type of environment in business and some universities

what does persistence mean? will Google provide persistent links? are folks pointing to the book or a page in the book? what is a persistent identifier? redundancy across organizations as well – how do you organize the links?
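
One way to picture the persistence questions above is a small resolver: a stable identifier names the book, an optional fragment names a page, and a lookup table maps the identifier to whichever organizations currently host a copy. This is only a sketch under my own assumptions. The `book-id#page=N` syntax, the table, and the `.example` URLs are all made up for illustration and are not any scheme the panel described.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BookRef:
    book_id: str
    page: Optional[int] = None  # None means "the whole book"


def parse_ref(identifier: str) -> BookRef:
    """Split a hypothetical 'book-id#page=N' identifier into its parts."""
    book_id, _, fragment = identifier.partition("#")
    page = None
    if fragment.startswith("page="):
        page = int(fragment[len("page="):])
    return BookRef(book_id, page)


# Hypothetical resolver table: one stable id, several hosting organizations.
LOCATIONS: Dict[str, List[str]] = {
    "example-book-001": [
        "https://library-a.example/books/001",
        "https://library-b.example/scans/abc",
    ],
}


def resolve(identifier: str) -> List[str]:
    """Return every current URL for the book, pointing at the page if given."""
    ref = parse_ref(identifier)
    urls = LOCATIONS.get(ref.book_id, [])
    if ref.page is None:
        return list(urls)
    return [f"{url}?page={ref.page}" for url in urls]


print(resolve("example-book-001#page=12"))
```

The point of the indirection is that the identifier stays fixed while the table changes as hosts come and go.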


Day 1 – DSpace Tutorial

It's been an *awesome* day – the DSpace tutorial was excellent. The tutorial was led by Dorothea Salo, the Digital Repository Services Librarian for MARS at GMU, and Tim Donohue, Research Programmer for IDEALS at the University of Illinois at Urbana-Champaign.

Highlights of the morning:

  1. DSpace community is very active!! 130 institutions using the system (that have registered)
  2. DSpace community is moving toward a committee approach. Today, MIT and HP still guard the code (even though it's open source). If you have customizations you want in the code base, you can email 5 guys in charge of the API; today, they determine what goes in the next release. Very soon this will change and working committees will be formed to address specific layers of DSpace
  3. Texas A&M is creating a *very* cool project to be released this summer. The project is the result of the first DSpace working group (mentioned above) and will be called Manakin. It is a purely XML-based UI that sits on top of DSpace. Texas Digital Library is already using Manakin – it allows much greater flexibility in terms of where things are placed on the page; XML, XSLT and CSS make up the architecture
  4. Version 1.4 of DSpace will be out at the end of the summer with new features such as browsing by subjects and better modularization of code.
  5. DSpace architecture includes: Ant (a Java build tool based on XML config files), JSP, Tomcat, XML, XSLT, WAR files…
  6. JSP knowledge is needed to modify the front-end layer – all customizations rely on Messages.properties, which contains all "text" for the web portion of the site (they didn't mention this, but a properties file like this is a popular way of building Java applications). Dorothea did not have prior JSP or Tomcat experience, however, and was able to customize; she was a Python programmer before taking on the MARS project
  7. DSpace out of the box will require Tomcat and Ant knowledge, and XML as well.
  8. To modify the business layer (written in Java), you will need to have knowledge of Java. Not many folks other than programmers contributing to DSpace have tackled the business layer.
  9. If someone knows Java/JSP it would take about 6-8 weeks to customize.
  10. You will need an IDE (Integrated Development Environment) to modify JSPs. (my own notes here: use Eclipse – it's free and *very* good).
  11. DSpace's CSS is over-specified according to Dorothea. Dorothea completely rewrote the CSS when customizing MARS
  12. Create a local directory to save all new JSPs. When you build the app with Ant, it will look into /local first and use customizations before out-of-the-box files
  13. Admin interface is available.
  14. Do not delete Dublin Core data elements that ship with DSpace!! You can delete elements that you've added.
  15. In Version 1.4, DSpace will introduce other metadata schemes (other than Dublin Core). Still no rich, hierarchical schemas, but DSpace developers are looking into this
  16. Adding Dublin Core elements and having them searchable requires changes in many files…(will go over in person)
  17. Make sure you have both a test and a prod installation
  18. Take into account and plan for the fact that you will have downtime in production – building the app takes a few minutes
  19. Submission forms that ship with DSpace need to be altered a bit if you want to change the workflow (Tim has some great code for submission workflow that we can use)
  20. Forms for submission are written strictly in XML (very cool stuff – entire DSpace application may move in this direction)
  21. Server space will vary from institution to institution based on what you want to store in the system. Illinois has been in a beta pilot for 1 year; they have been testing how files can be uploaded, how interested folks around campus are, etc. They've estimated that they won't go over 1 terabyte the 1st year they roll out. Both Tim & Dorothea suggested that partnering with Academic Technologies is a good thing on this type of project…
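
The "look in /local first" rule from item 12 can be sketched in a few lines. This is only my illustration of the lookup order, not DSpace's actual Ant build logic. The `pick_jsp` helper, the directory layout, and the file names are all hypothetical.

```python
from pathlib import Path
import tempfile


def pick_jsp(source_root: Path, relative_path: str) -> Path:
    """Prefer a customized copy under local/, else fall back to the stock file."""
    local_copy = source_root / "local" / relative_path
    stock_copy = source_root / relative_path
    return local_copy if local_copy.is_file() else stock_copy


# Small demo in a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "local").mkdir()
    (root / "home.jsp").write_text("stock home page")
    (root / "about.jsp").write_text("stock about page")
    (root / "local" / "home.jsp").write_text("customized home page")

    print(pick_jsp(root, "home.jsp").read_text())   # the local override wins
    print(pick_jsp(root, "about.jsp").read_text())  # no override, stock file used
```

Keeping customizations in a separate directory like this means the stock files are never edited, which should make upgrades much less painful.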

There's tons more that I have documented separately – I also have a DSpace How-to Guide and slides from the presentation. Met many folks looking at DSpace – not many who have already installed it.

Talked a lot with the Web Applications Manager from libraries at Albany, the Systems Librarian at University of Kansas and Tim Donohue at University of Illinois.

DSpace lunch is Tuesday with 2 folks from MIT who are the main coders for the API

Have talked to UNC system folks and they were open to my suggestion for a site visit. They are doing 3 pilots over the summer. They are also beginning a project campus-wide to get DSpace and Fedora to talk to each other…