JCDL 2006 Conference Notes

Day 2 – People

Talked about Code4lib with Bess Sadler, Metadata Specialist at UVA Digital Library; they have recently installed a new open-source CM system – she said it was a great tool.

Met Christina Deane, Project Manager at UVA Digital Library;

Talked to Dorothea Salo at GMU again about DSpace;

Met the Brandeis University Metadata Librarian – talked about IRs;

Talked with former professors, Barbara Wildemuth and Stephanie Haas

Met Ingrid Hsieh-Yee, Library Science Professor at Catholic University; they are very interested in student opportunities at VCU Libraries

Talked with recent SILS grad who will head up the new Digital Initiatives at NC State


Day 2 – Afternoon Panel Session

Augmenting Interoperability Across Scholarly Repositories

Don Waters – Mellon Foundation

history: late 1999 – provisional agreement – santa fe convention – now we know this as OAI, for metadata harvesting; simplicity; does allow for complex features – exchange of native metadata structures; growing frustration with dublin core; repositories need new ways to interchange complex objects; demand for something more than oai; microsoft, mellon, etc. are interested in forming this new protocol; need new data model; framework must be intelligent about various objects; basic question: how do we enable communities who care?

Tony Hey – Microsoft

looking to support scientists and engineers in scholarly communities

new science paradigms: e-science / data-centric science; microsoft understands it needs to embrace open standards

need more than text – working on IVO: astronomy data grid; skyserver.sdss.org

chemistry – go from the e-print text of a paper to a graphic in the paper to the raw data behind it; analyze the data yourself

pubmed central – portable version by microsoft; federate through web services;

e-science mashups – combine services to give added value – combined datasets used to perform analysis;

interoperable repositories?

arXiv at Cornell –

NIH PubMed Central – Microsoft funded
EPrints project in Southampton – JISC-funded TARDis project

Herbert Van de Sompel – Los Alamos

– pathways project – nsf grant – cornell and los alamos
– context – emergence of repositories; ir; publisher repositories; dataset repositories
– compound digital objects – multiple media and content: paper, dataset, simulations, software, etc

– leverage materials in IRs; use and reuse them; treat repositories not as stores accessible only to local users but as active nodes in a global environment

motivators for something other than oai

– motivation 1: richer cross-repository services; objects as source materials; e.g. chemical search engine – machine readable chemical formulas; no foundation today to achieve; one would need a digital object representation of the formula; need semantics

– motivation 2: scholarly communication workflow; global workflow across repositories; recombine existing material, add value and store new object

– looking to a shared data model and services across repositories

– scholarly communication is a long-term endeavor; abstract definitions of repository interfaces; selective framework;

– new model: 3 interfaces: obtain, harvest, put; e.g. submit surrogates -> available through harvest and obtain interfaces -> service is populated by harvesting surrogates -> need lightweight service registry (like an object catalog in a federation – we don't need this as the surrogates carry their own information)
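A minimal sketch of the three interfaces as I understood them from the talk. The class names, the `datestamp` field, and the `populate_service` helper are my own assumptions for illustration; they are not taken from the Pathways project.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterator, Optional


@dataclass
class Surrogate:
    """Hypothetical surrogate: a small record that stands in for a digital object."""
    identifier: str
    content: str
    datestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Repository:
    """In-memory stand-in for a repository exposing put, obtain and harvest."""

    def __init__(self) -> None:
        self._store: Dict[str, Surrogate] = {}

    def put(self, surrogate: Surrogate) -> None:
        # Deposit (or overwrite) a surrogate under its identifier.
        self._store[surrogate.identifier] = surrogate

    def obtain(self, identifier: str) -> Optional[Surrogate]:
        # Fetch one surrogate by identifier; None if it is unknown.
        return self._store.get(identifier)

    def harvest(self, since: Optional[datetime] = None) -> Iterator[Surrogate]:
        # Yield surrogates in datestamp order, optionally only those
        # deposited at or after `since`.
        for surrogate in sorted(self._store.values(), key=lambda s: s.datestamp):
            if since is None or surrogate.datestamp >= since:
                yield surrogate


def populate_service(repo: Repository, index: Dict[str, str]) -> None:
    # A service fills its own index by harvesting surrogates from a repository.
    for surrogate in repo.harvest():
        index[surrogate.identifier] = surrogate.content
```

In this sketch a service only ever talks to `harvest` and `obtain`, which matches the flow in the notes: surrogates are submitted with put, then become available through the other two interfaces.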

Carl Lagoze – Cornell Information Science

– Pathways Project; NSF grant http://www.infosci.cornell.edu/pathways

– set of metadata like dublin core is not sufficient; want to address modeling complex objects; datamodels (e.g. DSpace, Fedora, METS, EPrints, etc.)

– pathways core data model: sits above individual models; abstract model vs. pkg for asset transfer

– avoid IP issues; allow 'live' references rather than static objects;

– key requirements of data model: 1 – identity; 2 – persistence; 3 – lineage; 4 – semantics; 5 – recursion; 6 – link to concrete representation;

– serialize data model; ship surrogates back and forth between services; obtain and harvest; deposit via pdf;
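To make the six requirements concrete, here is a rough sketch of a recursive object that can be serialized and shipped between services. The field names, the `version_date` stand-in for persistence, and the JSON serialization are all my assumptions; the talk did not specify a format in my notes.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    identifier: str                                          # 1 - identity
    semantic_type: str                                       # 4 - semantics
    version_date: str = ""                                   # 2 - persistence (assumed: a fixed version date)
    derived_from: List[str] = field(default_factory=list)    # 3 - lineage
    parts: List["Entity"] = field(default_factory=list)      # 5 - recursion
    locations: List[str] = field(default_factory=list)       # 6 - links to concrete representations

    def to_dict(self) -> dict:
        # Recursively turn the entity and its parts into plain dicts.
        return {
            "id": self.identifier,
            "type": self.semantic_type,
            "version_date": self.version_date,
            "derived_from": list(self.derived_from),
            "parts": [part.to_dict() for part in self.parts],
            "locations": list(self.locations),
        }


def from_dict(data: dict) -> Entity:
    # Rebuild an entity (and its nested parts) from the dict form.
    return Entity(
        identifier=data["id"],
        semantic_type=data["type"],
        version_date=data["version_date"],
        derived_from=list(data["derived_from"]),
        parts=[from_dict(part) for part in data["parts"]],
        locations=list(data["locations"]),
    )


def serialize(entity: Entity) -> str:
    return json.dumps(entity.to_dict(), sort_keys=True)


def deserialize(text: str) -> Entity:
    return from_dict(json.loads(text))
```

Note that `locations` holds references rather than the bytes themselves, which is one way to read the 'live' references point above.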

meeting website: http://msc.mellon.org/Meetings/Interop

Day 2 – Session 2 – Document Analysis

DOM Tree and Geometric Layout Analysis for online medical journal article segmentation

information retrieved from articles: medline citations; use the DOM to search through html documents;

DOM node categorization: insignificant node, inline node, line-break node
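A small sketch of what a three-way node categorization could look like, using Python's standard `html.parser`. The tag sets below are my own guesses for illustration, and treating unknown tags as inline is also an assumption; the speakers' actual rules were not in my notes.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Assumed tag groupings; the paper's real lists may differ.
INSIGNIFICANT_TAGS = {"script", "style", "head", "meta", "link"}
LINE_BREAK_TAGS = {"p", "div", "br", "hr", "h1", "h2", "h3", "h4", "h5", "h6",
                   "table", "tr", "ul", "ol", "li", "blockquote"}
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "font", "sub", "sup", "u"}


def categorize(tag: str) -> str:
    """Map a tag name to one of the three categories."""
    tag = tag.lower()
    if tag in INSIGNIFICANT_TAGS:
        return "insignificant"
    if tag in LINE_BREAK_TAGS:
        return "line-break"
    # Known inline tags and anything unrecognized are treated as inline.
    return "inline"


class NodeCategorizer(HTMLParser):
    """Collects (tag, category) pairs in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.nodes: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((tag, categorize(tag)))


def categorize_html(html: str) -> List[Tuple[str, str]]:
    parser = NodeCategorizer()
    parser.feed(html)
    parser.close()
    return parser.nodes
```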

visually same pages can have completely different DOM trees

zone tree model: basic assumption – journal article html doc authors use geometric layout to organize the page

Automatically Categorizing Figures in Scientific Documents

goal: location and extraction of non-textual information from scientific/academic documents

problems: identification, categorization; data extraction; indexing and retrieval

figures/images in scientific documents (e.g., line graphs, flow charts, photographs): today, we cannot search by these images

data within figures: automatic data extraction – time consuming; automated tools exist for this

e.g., a document containing gardening pictures…or a paper reporting experiments on human-computer interfaces

overview of work: semantics-sensitive; content-based feature extraction; machine-learning based classification

prior work: document retrieval (metadata extraction, name disambiguation); document image understanding (image representation to semantics, structure analysis);

extraction of figures: use adobe acrobat image extraction; does not work for scanned documents

classification: support vector machine

experiment setup: C, MySQL, Linux; dataset: ~2000 pdf files from citeseer; adobe acrobat extraction tool; manual annotation

XML Views for Electronic Editions

electronic editions; document-centric xml for electronic editions

xtagger 2005 – mapping between the data presentation model and the data access model; replication must occur between the 2 layers

software filters out the interesting parts of the document-centric xml;

xml views: a set of xml nodes; reduces the size of the data access model; smaller serialization; based on xpath; and regex for tag/attribute names, attribute values
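A rough sketch of the idea of a view as a filtered set of nodes, using the limited XPath support in Python's `xml.etree.ElementTree` plus `re`. The function name, its parameters, and the sample element names are hypothetical; this is not the speakers' software.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional


def xml_view(root: ET.Element, xpath: str, tag_pattern: str = ".*",
             attr_name_pattern: Optional[str] = None,
             attr_value_pattern: Optional[str] = None) -> List[ET.Element]:
    """Return the nodes selected by `xpath` whose tag name matches
    `tag_pattern`; if attribute patterns are given, at least one attribute
    must match both the name and the value pattern."""
    tag_re = re.compile(tag_pattern)
    name_re = re.compile(attr_name_pattern) if attr_name_pattern else None
    value_re = re.compile(attr_value_pattern) if attr_value_pattern else None

    view = []
    for node in root.findall(xpath):
        if not tag_re.fullmatch(node.tag):
            continue
        if name_re is not None or value_re is not None:
            matched = any(
                (name_re is None or name_re.fullmatch(name))
                and (value_re is None or value_re.fullmatch(value))
                for name, value in node.attrib.items()
            )
            if not matched:
                continue
        view.append(node)
    return view


SAMPLE = (
    '<text>'
    '<line n="1"><w>Hwaet</w><damage type="stain"/></line>'
    '<line n="2"><dmg type="tear"/><note type="editorial">x</note></line>'
    '</text>'
)
```

For example, `xml_view(ET.fromstring(SAMPLE), ".//*", tag_pattern=r"d(amage|mg)")` keeps only the two damage-related nodes, so the view is much smaller than the full tree.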

Day 2 – Session 1 – Visualization for Libraries

1st talk: ETANA-DL presentation: archaeology digital library
browse by space, object, time for archeologists

visualization system with hyperbolic trees; used for analytics – e.g., looking at a specific time period and % of bones found

28 students worked with the system for usability study; various tasks students performed

interesting feature of page: can save the breadcrumb path – well-liked by the students in usability test

conclusions: approach dl based on dl theory;

lemmas: searching and browsing are the same; can produce the same results; go through a searching process and get a set of results and go through browsing, there's an inverse relationship; browsing results associated with a navigated path: relationships with the browsing sequences;

q: presenting things visually when dealing with more formats; but, it's not.

discussion: information exploration; visualization; exploratory searching; tough to evaluate information visualization systems

difficulty with usability studies in these systems; maryland will run a workshop on how to evaluate info visualization systems;

2nd talk: mixed-initiative system for representing collections as compositions of image and text surrogates – combinFormation presentation (Andruid Kerne)

Interface Ecology Lab – Texas A&M computer science department


information discovery: emergence of new ideas; information serves as a stimulus; intellectual tasks; collecting new sets of existing information resources; ability to manipulate to support information discovery

surrogate: comes from library science; representing the original – lets you access the original

digital surrogate: special type of hyperlink; formed systematically from metadata; eclipses the object;

text-based surrogates break down when you want to see relationships

images + text: working memory; separate cognitive resources; seems like a good idea to engage both together; e.g., video surrogates, navigational surrogates – overwhelmingly users prefer compositional format compared to text-based formats

composition: assemble collected elements to form a connected whole; visual composition; spatial organization; compositing; fading

people understand this visual composition better;

mixed-initiatives: user and agents working concurrently; requires dialogue and feedback

combinFormation: mixed-initiative composition of surrogates; composition space: space where composition is created; user can manipulate it; search it

launch demo: combinFormation – can search rss feeds, flickr, the web, or a web site, and allows crawling into them; can have as many searches as you like; can put together information from more than 1 search; crawler paths: crawling deeper into each site and crawl anywhere (follows cross links)

(java based) socket connection between browser and application

uses google to run the searches and then it will download pages from initial searches and match images up with text – VERY COOL!!

when you mouse over the surrogate, there's structured metadata: location, the gist (what google said about the page), title

user can move images using the grab tool; resize; you can create the composition in the middle of the page; the "cool space" just for user; the "hot space" is shared with the agent

MAIN THEME OF THE SYSTEM: users can create a composition of their own images and search results by drag and drop technology; this becomes the user space;

metadata: details on demand in context rather than forcing the user to look elsewhere; deployment was in a class called the "design process"; experimented with 182 students; asked them to use combinFormation (divided the class – group A & group B). could only use combinFormation on 1 project; findings: students did better on both parts of the assignment when using combinFormation;

benefits: qualitative approach to collection visualization – using image and text together; using visual communications techniques (get both html & xml version of your composition space); users see some unexpected results; serendipity like physical library stacks;

future work: better semantic modeling; alternative visualizations; discovery tasks are not finding tasks;

3rd talk: infogallery: informative arts services for physical library spaces
Center for Interactive Spaces; research project on future hybrid libraries;

iFloor: interactive floor

info-column: digital "poster-column" exhibits electronic publications; librarians promote library resources;

info-gallery concept: exhibit digital resources in the physical space; natural pick up of digital materials;

look at chalmers university (PLAY group) and georgia tech (InfoCanvas)

presentation interface: "remarkable" visualizations and animations of infobjects; surrogates floating around; users can come by and pick them up in the physical environment

look at royal library for information distribution channels; editors showing up on different channels

pick up of digital resources by bluetooth phone or by email address; fully web-integrated architecture; http, mysql, xml, .net, flash; rss integration; sensor layer for bluetooth

different appearances: walls, floors, smart boards, plasma, projections; workstations

can't really describe how incredibly new age and cool this product is!!! bouncing balls on an interactive display, push the ball and you get more information about the digital object or go directly to the original resource; can send yourself an email of the digital resource; can inspect digital channels; can create new concepts for saving as rss feeds

how it's used: 65 librarians producing content; 50 channels; 250 interactive objects on avg.

placed the galleries in the physical library at the desk, the refreshment areas, the book-return machine; and moving these galleries to various parts of the city – reach users that don't come to the physical library;

future: working with local artists on different "skins"


Day 2 – Opening Session

Monday lunch – Open JCDL community; future of JCDL
Tues lunch – demonstration at HSL Collaboration Center;

Plenary Panel – Getting Books Online –

a lot of debate about this topic at a past conference: Google as Library

Dan Clancy – Engineering Director: Google Books; used to be IS director at NASA; undergrad at Duke
David Ferriero – Chief Exec – Research Libraries at NY Public Library. A G5 site – contributing to google books project. Director of library at Duke for 8 years
Dan Greenstein – UL CA Dig Lib Project;
Moderator: Cliff Lynch – 18 years at U of CA – Director of Library Automation; now Director of the Coalition for Networked Information (CNI)

– online resources; captures public imagination; implications for teaching and learning; preservation, copyright, economic models for publishing industry – ideas that go back a long way – beginning of digital age. much controversy

– a lot of talk about getting books online but not about what we'll do with them once online

– google has talked about 'snippets' of books to ensure copyright

– what happens if we succeed? what does it mean to have open content in the context of large collections of digital books?

Google – Dan Clancy

– how many google interns have been in a library in the last year? about half raised their hands. testament to ease of use of internet; access to primary sources of information increased. inverse impact for things not on the internet

– google's mission: "organize the world's information and make it universally accessible to all"

– goal: create a comprehensive virtual card catalog of all books in all languages, while respecting publishers' rights

– initiative intended to get at the 85% of books out-of-print

– try not to make editorial decisions

– 92% of world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers

– digitization was fair use

– 2 user experiences: sample pages view ~15%; snippet view 65%; full view

– snippet view – controversial; book in copyright; cannot page through book; google picks which snippets you see

– most popular searches – harry potter(!)

– scale – 30 million books

– google developed its own scanning technology

research challenges
– "scanning is the easy part"; what are some of the research challenges?

– how to balance copyright with public interest?
technical, social

– previewing vs. using for long periods of time – challenge in digital world;

– ontology of objects – how do we connect objects? link structure makes the web; annotated relationships between objects. all user created. web of implicit relationships between the content; FRBR hierarchy; references; authorship; temporal relationships; topical similarity; the more content the harder it is to find information

David Ferriero – NY Public Library

– why would ny public library get involved with google?
under discussion with google 3 years before he arrived; lots of committees – very conservative org. rationale – involved with massive digitization projects (digital gallery and in motion – 13 different african american migrations; streaming video, etc.; worked with ny public schools to create lesson plans); google offered a way to get non-unique material digitized. could give them experience with corporate partnership; put ny public library at table with other partners; library partner perspective – incredible collaboration with google; not "evil corporate empire". resulted in attitude changes; learning about librarianship on the part of google; spent 31 years at mit – very similar working with google

– public domain books; duplication throughout the research libraries involved with google – what does this mean for research library collections

– concerns about how much info google has about library user communities

– longterm preservation issues; libraries are trying to turn this into a digital preservation activity

Dan Greenstein – Open Content Alliance

project of the internet archive – scanning out-of-copyright books; funding from corporate and private donors

– rlg – undertaken standards process for scanning…

– scanning furiously; results of scanning: raw files thrown away; jpeg2000 archival masters; searchable pdf; xml file that contains the ocr ; bibliographic metadata

– funding streams much smaller than google; in oca, there's a thin lead organization; not an alternative to google books online;

– qualities of open – access to public domain; both oca and google making access available to out-of-copyright books

Cliff's questions to panel:

1. Scanning; ocr – how much of an issue is ocr? do we need large scale investments in ocr research?

google: books pose different challenges for ocr: in the past 98% ocr quality was fine; however, when turning these into speech, it becomes more difficult. currently ocr is the most expensive phase of the google process – computationally. google is not focusing as much on the optical but is focusing on software on top of that. more complex lang models. ai problem; our lang models are a far cry from human ability; humans use book-specific models – learning the curvature of letters in each book. computers don't do this well. need more emphasis on more obscure languages. ocr is very text based today; need data-driven, machine learning approach.

ny public: 99% isn't quite good enough for most research libraries

oca: adobe is looking at this; ocr needs to be researched further

google releasing an open-source ocr package

2. international aspects to putting books online? strategies/priorities to diversification?

oca: aspiration is to digitize internationally

ny public lib: 5 partners analysis – 49% of collections english lang; 430 different languages represented

google: 50% foreign in terms of what gets digitized. ocr can handle most latin scripts now; arabic and chinese pose challenges

3. preservation

google did not design optimized process for high dollar brittle content/collections. quality of content and digitization questions: libraries have different notions of what is preservation quality. focused on scale but try to ensure that quality is improved.

oca: for single editions of materials – won't be preservation quality; think of it in terms of collection mgmt. what's the quality sufficient to get rid of redundant materials in the collection; avoid cost and spend on materials that are particularly special.

4. google is amassing huge textual databases; theoretically an intellectual advantage on the part of google; can be translated (maybe?) into a business advantage; creation of enormous computational resources; access is private now


research advantage:
interested in how google can help further research; what can researchers tell google in terms of helping researchers do their work

business advantage: company investing money can gain advantage; intellectual pursuit is to encourage research that can be done in scale; what is the balance? says that the ai algorithms are not sophisticated enough to mine all the data

ny public library: topic that came up at original meeting;

oca: nobody has secure right and title to the entire source of google books; recognition that over time folks relax; more a problem with universities – how do we platform this information in a way that enables the computational aspects to happen; expectation from university side that we cannot go through the information computationally; only certain infrastructures can support it; non-trivial problem to support this type of environment in business and some universities

what does persistence mean? google will provide persistent links? are folks pointing to the book or a page in the book? what is a persistent identifier? redundancy across organizations as well – how do you organize the links?