JCDL 2006 Conference Notes

Day 4 – Usage and Relationships

An architecture for the aggregation and analysis of scholarly usage data

partnering with sfx; blackbox project

also partnered with california state university; have made sfx usage logs available

findings – increased activity in fall semester

journal ranking – usage-based; journals deemed similar if downloaded by the same people (similar to google's approach)

top was jama-j am med assoc, 2nd science

looked at both journals cited but not read and those used but not cited…

usage-based recommender system 
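The "downloaded by the same people" ranking above can be sketched as item-item cosine similarity over (user, journal) download pairs – a toy sketch with hypothetical data, not the actual sfx pipeline:

```python
from collections import defaultdict
from math import sqrt

# hypothetical download log: (user, journal) pairs from usage data
downloads = [
    ("u1", "jama"), ("u1", "science"), ("u2", "jama"),
    ("u2", "science"), ("u3", "jama"), ("u3", "nature"),
]

# build journal -> set of users who downloaded it
users_of = defaultdict(set)
for user, journal in downloads:
    users_of[journal].add(user)

def similarity(a, b):
    """Cosine similarity between two journals' downloader sets."""
    shared = len(users_of[a] & users_of[b])
    return shared / sqrt(len(users_of[a]) * len(users_of[b]))

def recommend(journal):
    """Rank other journals by co-download similarity."""
    others = (j for j in users_of if j != journal)
    return sorted(others, key=lambda j: similarity(journal, j), reverse=True)
```

A recommender built this way suggests journals whose readership overlaps most with the one the user is viewing.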


Day 4 – Metadata in Action

first session

talk 1: frbr – enriching and integrating digital libraries

– proposal – not a standard

publications: group 1 entities (work, expression, manifestation, item); heart of FRBR model; four-level hierarchical tree
used greenstone;
frbr is currently in corba; moving to soap
automatic extraction – difficult and unreliable
manual entry – expensive
authority – whose frbr to trust?

talk 2: scaffolding the infrastructure of the computational science digital library

shodor education foundation – computational science digital library
ohio supercomputing center
evaluate, modify, collect and tag the best of computational model-based materials;
outlet for peer reviewed publication
using plone, cms and metadata repository – cwis – collection workflow integration system
multifaceted search

talk 3: dynamic generation of oai servers

university of americas – mexico
libraries are willing to share digital collections; face obstacles to build oai servers
goal is to have a server generator that automatically creates an oai server
voai – for relational dbs
and xoai
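A generator like voai could emit OAI-PMH responses from a simple collection config; a minimal sketch of building the Identify response (config values and function name are hypothetical, the element names come from the OAI-PMH 2.0 spec):

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# hypothetical collection configuration a generator might read
config = {
    "repositoryName": "Example Digital Collection",
    "baseURL": "http://example.edu/oai",
    "adminEmail": "admin@example.edu",
}

def identify_response(cfg):
    """Build the <Identify> portion of an OAI-PMH response from config."""
    root = ET.Element(f"{{{OAI_NS}}}OAI-PMH")
    ident = ET.SubElement(root, f"{{{OAI_NS}}}Identify")
    for key in ("repositoryName", "baseURL", "adminEmail"):
        ET.SubElement(ident, f"{{{OAI_NS}}}{key}").text = cfg[key]
    ET.SubElement(ident, f"{{{OAI_NS}}}protocolVersion").text = "2.0"
    return ET.tostring(root, encoding="unicode")
```

ListRecords would be generated the same way, with the record fields mapped from relational columns.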


talk 4: looking back. looking forward: a metadata standard for lanl's adore

nnsa – los alamos labs

"the library without walls"

adore http://purl.lanl.gov/aDORe/projects/adoreArchive

consolidate metadata schemas from vendors
requirements: granularity, transparency, extensibility, xml based

marcxml – looked to be the best despite its perceived downfalls (which actually turned into strengths)

fears about rigidity were unfounded; have mapped 85 million records into marcxml; 40 million used by downstream applications
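The vendor-to-MARCXML mapping above can be sketched minimally – the datafield/subfield structure and tags 245 (title) and 100 (main entry, personal name) are real MARC, everything else here is illustrative:

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"

def to_marcxml(title, author):
    """Wrap a vendor (title, author) pair in a minimal MARCXML record."""
    record = ET.Element(f"{{{MARCXML_NS}}}record")
    # 245$a = title statement, 100$a = personal name main entry
    for tag, code, value in (("245", "a", title), ("100", "a", author)):
        field = ET.SubElement(record, f"{{{MARCXML_NS}}}datafield",
                              tag=tag, ind1=" ", ind2=" ")
        sub = ET.SubElement(field, f"{{{MARCXML_NS}}}subfield", code=code)
        sub.text = value
    return ET.tostring(record, encoding="unicode")
```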

talk 5: learning from artifacts: metadata utilization analysis (texas center for digital knowledge)

research project looking at how catalogers are using marcxml descriptive elements

metadata assumptions

– essential in dig library apps

– will end up with different schemes

– increasing use of machine-generated metadata

– questioning of role of handcrafted metadata

metadata record as artifact

mcdu project

– provide empirical evidence of catalogers' use of marc content designation

– to what extent are metadata catalogers exploiting the complexity of marc?

– dataset 56 million from worldcat (store these records in mysql)

– project website: http://www.mcdu.unt.edu

– ended up with 20 datasets

Day 3 – Panel – NDIIPP Preservation Network

NDIIPP Preservation Network:

Progress, problems & promise


LC – William LeFurgy


US context for dig preservation

Gov at various levels




Consortial domains


Decentralized in focus – library is node on network

Facilitating digital preservation in US/UK has more centralized approach


3 areas of focus:

– network of pres partners

– arch framework for pres

– dig pres research


2 phases of investment

2004 & 2006; library has used this to fund official projects, research


congress provided $100 million for project;


network of preservation partners –

content scope: public tv, dot-com era documents



– identify and preserve significant at-risk content

– leverage resources thru collaboration

– digital stewardship network

– technical infrastructure

– public policy issues

– 2010 report to congress


national network – interoperability

value chain




los Alamos tools

storage- distributed – san diego supercomputer center, ejournal edeposit

joint or shared repositories at the state level

          ask sam about va – could write grant for this

study group section 108 – looks at IP issues – group is ½ from libraries/archives and ½ from content industries – recommendations for how to rewrite the law


phase 2 investments:

– preserving creative america – commercial content producers

– working with states

– additional bus models

– projects:

  – data replication

  – risk assessment

  – data integrity assurance

  – content validation


DigArch Program – Helen Tibbo

VidArch Team – preserving video; preserving meaning and context;

Goals: make video accessible and understandable in the future; context; preservation framework;

Background: oais reference model; finding aids merged with rich nature of video

METS, NLNZ, PREMIS – metadata schemas today;

This project looks at long-term understandability;

Expensive to capture context;

Part of project is partnering with NASA

ACM also has a collection;

OAIS framework – need to develop better articulation

VidArch – typology of elements to be documented within video collection

FAs – considering these as digital objects that should be ingested into repository

Collaborations: sils, ibiblio, open video; renaissance computing center; internet archive & prelinger archive



jim tuttle – geospatial data librarian at nc state

nc geospatial data archiving project

state & local content

NC Onemap – provides framework

Content: vector data;

Local data often more detailed…

Enormous amount of data

Risk data: future support of data formats; web services; no metadata; geospatial databases – difficult to archive


Trying to influence data producers in NC

Using Dspace repository;

Changing thinking: ajax


Odom institute:

Data-pass: meeting the challenges of a digital data world

Survey, polls data – how to preserve these archives? Social science purposes

Largest repository: ICPSR

Sas data files

Today can do text searches of questionnaires

Day 3 – First Session – Time and Space

Talk 1 – Supporting Literary Scholars with Data Mining and Visual Interfaces:

visual interfaces: accessible, provocative

text mining just beginning in the humanities

nora project: http://www.noraproject.org

systems today provide access not necessarily text analysis

text analysis – new area; classification problems; scholars typically need assistance;

other work being done to visualize metadata;

users: small group of computer programmers; broad base of scholars uninterested in computational tools themselves, but doing the work

users' needs: classifying documents; reading; finding indicators – what makes a document fall into one class or another

case study: emily dickinson's letters; 300 xml encoded documents


manual classification -> automatic classification -> correlations with document metadata

manually rate documents through system; this serves as training set for data mining classifier

start analysis -> data mining algorithm determines likelihood and ratio of being in 1 class or another

manual classification takes a bit of time;

found that the word indicators were not as helpful as the computational probability

after classification want to understand relationship btw the documents you've classified. look for correlations

uses naive bayes algorithm;
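The manual-ratings-as-training-set pipeline above can be sketched with a toy naive bayes classifier over word counts (documents and class labels are hypothetical, not the dickinson data):

```python
from collections import Counter
from math import log

# hypothetical training set: manually classified documents
training = [
    ("hope feathers soul", "classA"),
    ("hope soul bright", "classA"),
    ("letter business account", "classB"),
]

# per-class word counts and document counts
word_counts = {}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.split())

def classify(text):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = log(doc_counts[label] / sum(doc_counts.values()))
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

The per-word likelihood ratios are what the talk calls "word indicators"; the overall posterior is the "computational probability" the presenters found more helpful.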

Talk 2 – Time Period Directories

search in humanities – chronology, geo, bio, subject

trying to develop search capabilities to search 4 facets

want to try to use metadata as infrastructure; search across genres

what metadata to use for temporal aspect? chronology?

date/time standards, hard to put on a timeline

named time period problems: unstable; multiple names; ambiguous; how to disambiguate between periods and dates; all problems occur with places as well

place name gazetteer; use structure – associate with a date and associate where it happened and the time of event -> this becomes the time period directory

this was then put into an xml schema

prototype developed from LC SH authority records
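A directory entry pairs a period name with a place and a date range, so an ambiguous name can be resolved by place; a minimal sketch (entries and years are hypothetical, not from the LCSH prototype):

```python
# hypothetical directory: period name -> list of (place, start_year, end_year)
directory = {
    "renaissance": [("Italy", 1350, 1600), ("England", 1500, 1650)],
}

def resolve(name, place=None):
    """Disambiguate a named period, optionally narrowing by place."""
    entries = directory.get(name.lower(), [])
    if place is None:
        return entries
    return [e for e in entries if e[0] == place]
```

The resolved (place, start, end) triples are what the map and timeline interfaces would plot.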


map interface: location data and puts on a map

timeline browse

country browse – list



Day 3 – Opening Session


open information: redaction | restriction | removal

keynote: jonathan zittrain – harvard university and university of oxford

google search: “milk supply terrorists”

security breach information act; law about metadata; 2003 if you are a company with a lot of data and it could be compromised, you must alert the users

ways to protect personal data – borrow from ip?

Sysinternals blog – software can possibly spy to get habit usage

Solution is to not continue fighting the war – antispy, etc. – but to think about privacy and the expression of your identity; often that means to contextualize data about you; different than traditional view of privacy; more accepting of open environment

YouTube – encourage folks to broadcast yourself;

Mashups – podcast, music, etc. retracting any of this is difficult to do; ppl are willing to put themselves out there

What does redaction mean in an open environment?

Best example: enron – shredding content: “accurate document destruction”

Technological future makes it more difficult to retract, recall information

e.g. omniva – every email generated is encrypted; key generated for each day of the week; for a company, you would only have to destroy the key which destroys all relevant documents;
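The omniva idea – one key per day, destroy the key and every message from that day becomes unreadable – can be sketched with a toy keystore; the XOR "cipher" here is a stand-in for real encryption, and all names are illustrative:

```python
import hashlib
import os

day_keys = {}  # date string -> key bytes; deleting an entry "shreds" that day

def key_for(day):
    """Lazily create one random key per day."""
    if day not in day_keys:
        day_keys[day] = os.urandom(32)
    return day_keys[day]

def toy_encrypt(day, message):
    """XOR the message with a keystream derived from the day key (toy only)."""
    key = key_for(day)
    stream = hashlib.sha256(key).digest() * (len(message) // 32 + 1)
    return bytes(m ^ s for m, s in zip(message, stream))

def toy_decrypt(day, blob):
    if day not in day_keys:
        raise KeyError("key destroyed: message is unrecoverable")
    return toy_encrypt(day, blob)  # XOR is its own inverse

def destroy_day(day):
    """Destroying one small key effectively destroys all that day's messages."""
    day_keys.pop(day, None)
```

The design point is that retraction reduces to deleting one key rather than tracking down every copy of every document.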

libraries decent point of control for distribution; libraries “best friends” to publishers/content providers/book sellers rather than adversaries; creating systems that resemble systems like omniva; libraries would be where you go to retrieve content rather than the “open jungle” of the web

libraries: what’s a library for? Are there commonalities between public, academic;

LOCKSS mentioned: mirror and synchronize across libraries

Libraries are so far the best hope for those in a position to release something; privacy with libraries; largest advances in digital library space from “left-field”

When to pull something back?

is running a library just about indexing? or is it like brewster kahle?

what is the purpose of a library?

one conception: the fortress; keeping non-scholars away; filtering what's important and what's not; if there's no limit on what dig libraries store, is there a reason to discriminate?

ask jeeves – every time someone asks him instead of a librarian…jeeves doesn't have authority control;

idea of collections – libraries have collections that become archives;

non-institutional collections that mirror the library:

– "gawker stalker" email gawker if you're in ny and see a celebrity; up within 15 minutes

– facebook; 90% of american college students have entries;

– riya photo search – face recognition technology; new incoming photos are automatically tagged using face recognition; makes the libraries' "castle" seem like the outside; gps tagging;

– databases that transform the way we understand information

– protest/gatherings you bring your identity to

non-institutional judgements

today rudimentary system like ebay's star system

cyworld – one of the most popular sites in the world in korea; wake up in the morning check the world's collective judgement about you; as you interact with ppl, they rank you;

systems of collective judgement for which library can play role in saying what information is credible; maybe the decision is not about whether to keep it; wikipedia ex: seigenthaler article that was removed. should the history have been removed? muhammad cartoon controversy – one of wikipedia's best moments that libraries and news have not done





Day 3 – Afternoon Session – DL Education

Supporting Digital Library Education:

Factors Motivating Use of Digital Libraries

– findings: faculty don't necessarily distinguish between a web page with a series of links and a digital library

– google preferred overall by academics; they use it for pages they go to regularly

– using google to find things quickly – looking to update existing lecture materials

– barriers:

– lack of awareness; information overload; priorities not lack of time; no motivation to use digital learning materials

emerging questions

– should we match what faculty are using

– granularity of items

– what are faculty dev strategies that work?

– faculty do their own analysis of information they find;


recruiting institutions that might be interested in survey

a lot of visitors are coming through google

Day 2 – People

Talked about Code4lib with Bess Sadler, Metadata Specialist at UVA Digital Library; They have recently installed a new CM system that is open source – she said it was a great tool.

Met Christina Deane, Project Manager at UVA Digital Library;

Talked to Dorothea Salo at GMU again about DSpace;

Met Brandeis university Metadata Librarian – talked about IR;

Talked with former professors, Barbara Wildemuth and Stephanie Haas

Met Ingrid Hsieh-Yee, Library Science Professor at Catholic University; They are very interested in student opportunities at VCU libraries

Talked with recent SILS grad who will head up the new Digital Initiatives at NC State

Day 2 – Afternoon Panel Session

Augmenting Interoperability Across Scholarly Repositories

Don Waters – mellon foundation

history: late 1999 – provisional agreement – santa fe conventions – now we know this as OAI for metadata harvesting; simplicity; does allow for complex features – exchange of native metadata structures; growing frustration with dublin core; repositories need new ways to interchange complex objects; demand for something more than oai; microsoft, mellon, etc. are interested in forming this new protocol; need new data model; framework must be intelligent about various objects; basic question: how do we enable communities who care?

tony hey – microsoft

looking to support scientists and engineers in scholarly communities

new science paradigms: e-science / data-centric science; microsoft understands it needs to embrace open standards

need more than text – working on IVO: astronomy data grid; skyserver.sdss.org

chemistry – e-prints of text of paper to graphic of paper to see raw data; analyze the data yourself

pubmed central – portable version by microsoft; federate through web services;

e-science mashups – combine services to give added value – combined datasets used to perform analysis;

interoperable repositories?

arXIV at Cornell –

NIH PubMedCentral – Microsoft funded
EPrints project in Southampton – JISC-funded TARDis project

Herbert Van de Sompel – Los Alamos

– pathways project – nsf grant – cornell and los alamos
– context – emergence of repositories; ir; publisher repositories; dataset repositories
– compound digital objects – multiple media and content: paper, dataset, simulations, software, etc

– leverage materials in ir; reuse and use them; rather than making them accessible only to local users, but as active nodes in a global environment

motivators for something other than oai

– motivation 1: richer cross-repository services; objects as source materials; e.g. chemical search engine – machine readable chemical formulas; no foundation today to achieve; one would need a digital object representation of the formula; need semantics

– motivation 2: scholarly communication workflow; global workflow across repositories; recombine existing material, add value and store new object

– looking to a shared data model and services across repositories

– scholarly communication is a long-term endeavor; abstract definitions of repository interfaces; selective framework;

– new model: 3 interfaces: obtain, harvest, put; e.g. submit surrogates -> available through harvest and obtain interfaces -> service is populated by harvesting surrogates -> need lightweight service registry (like an object catalog in a federation – we don't need this as the surrogates carry their own information)
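The three-interface model above can be sketched as a toy repository class; the method names (obtain, harvest, put) follow the talk, everything else is a hypothetical illustration:

```python
import datetime

class Repository:
    """Toy repository exposing obtain / harvest / put interfaces."""

    def __init__(self):
        self._store = {}  # identifier -> (timestamp, surrogate)

    def put(self, identifier, surrogate):
        """Deposit a surrogate for a compound digital object."""
        self._store[identifier] = (datetime.datetime.now(), surrogate)

    def obtain(self, identifier):
        """Dereference one surrogate by identifier."""
        return self._store[identifier][1]

    def harvest(self, since=None):
        """Incrementally list surrogates deposited after `since`."""
        return [s for ts, s in self._store.values()
                if since is None or ts > since]
```

A downstream service would populate itself by calling harvest periodically, then obtain individual surrogates on demand.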

Carl Lagoze – Cornell Information Science

– Pathways Project; NSF grant http://www.infosci.cornell.edu/pathways

– set of metadata like dublin core is not sufficient; want to address modeling complex objects; data models (e.g. DSpace, Fedora, METS, ePrints, etc.)

– pathways core data model: sits above individual models; abstract model vs. pkg for asset transfer

– avoid IP issues; allow 'live' references rather than static objects;

– key requirements of data model: 1 – identity; 2 – persistence; 3 – lineage; 4 – semantics; 5 – recursion; 6 – link to concrete representation;

– serialize data model; ship surrogates back and forth between services; obtain and harvest; deposit via pdf;

meeting website: http://msc.mellon.org/Meetings/Interop

Day 2 – Session 2 – Document Analysis

DOM Tree and Geometric Layout Analysis for online medical journal article segmentation

Information Retrieved From articles: medline citations; use the DOM model to search through html documents;

DOM node categorization: insignificant node, inline node, line-break node

visually same pages can have completely different DOM trees

zone tree model: basic assumption – journal article html doc authors use geometric layout to organize the page
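The three-way node categorization above can be sketched as a tag-based classifier; the tag sets here are illustrative, and a real segmenter would also use the geometric layout the talk describes:

```python
# illustrative tag sets; a real segmenter would also use rendering geometry
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "sup", "sub"}
LINE_BREAK_TAGS = {"p", "div", "br", "table", "tr", "h1", "h2", "h3", "li"}

def categorize(tag, text=""):
    """Classify an HTML node for article segmentation."""
    if tag in INLINE_TAGS:
        return "inline"
    if tag in LINE_BREAK_TAGS:
        return "line-break"
    if not text.strip():
        return "insignificant"  # e.g. empty script/style nodes
    return "line-break"         # default: unknown block-level content
```

Classifying by rendered effect rather than raw tree shape is what lets visually identical pages with different DOM trees segment the same way.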

Automatically Categorizing Figures in Scientific Documents

goal: location and extraction of non-textual information from scientific/academic documents

problems: identification, categorization; data extraction; indexing and retrieval

Figures/Images in scientific documents (e.g., line graphs, flow charts, photographs) today, we cannot search by these images

data within figures: automatic data extraction – time consuming; automated tools exist for this

examples: a document containing gardening pix… or a paper reporting experiments on human computer interface

overview of work: semantics-sensitive; content-based feature extraction; machine-learning based classification

prior work: document retrieval (metadata extraction, name disambiguation); document image understanding (image representation to semantics, structure analysis);

extraction of figures: use adobe acrobat image extraction; does not work for scanned documents

classification: support vector machine

experiment setup: C, mySQL, Linux; dataset: ~2000 pdf files from citeseer; adobe acrobat extraction tool; manual annotation

XML Views for Electronic Editions

electronic editions; document-centric xml for electronic editions

xtagger 2005 – mapping between the data presentation model and the data access model; replication must occur between the 2 layers

software filters out teh intersting parts of the document-centric xml;

xml views: a set of xml nodes; reduces the size of the data access model; smaller serialization; based on xpath; and regex for tag/attribute names, attribute values
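An xpath-selected view over document-centric xml can be sketched with ElementTree's limited xpath support; the document, element names, and path are hypothetical:

```python
import xml.etree.ElementTree as ET

# hypothetical document-centric XML for an electronic edition
doc = ET.fromstring("""
<edition>
  <page n="1"><line>first line</line><note>editor note</note></page>
  <page n="2"><line>second line</line></page>
</edition>
""")

def xml_view(root, path):
    """Materialize a view holding only the nodes matching the XPath."""
    view = ET.Element("view")
    for node in root.findall(path):
        view.append(node)
    return view

# view of just the <line> elements, dropping editorial notes
lines = xml_view(doc, ".//line")
```

The view is a smaller tree, so its serialization and the data access model built on it are correspondingly smaller, which is the point made in the talk.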

Day 2 – Session 1 – Visualization for Libraries

1st talk: etana – dl presentation: archeology dig library
browse by space, object, time for archeologists

visualization system with hyperbolic trees; used for analytics – e.g., looking at a specific time period and % of bones found

28 students worked with the system for usability study; various tasks students performed

interesting feature of page: can save the breadcrumb path – well-liked by the students in usability test

conclusions: approach dl based on dl theory;

lemmas: searching and browsing are the same; can produce the same results; go through a searching process and get a set of results and go through browsing, there's an inverse relationship; browsing results associated with a navigated path: relationships with the browsing sequences;

q: presenting things visually when dealing with more formats; but, it's not.

discussion: information exploration; visualization; exploratory searching; tough to evaluate information visualization systems

difficulty with usability studies in these systems; maryland will run a workshop on how to evaluate info visualization systems;

2nd talk: mixed-initiative system for representing collections as compositions of image and text surrogates – combinFormation presentation (andruid kerne)

interface ecology lab -texas a&m computer science department


information discovery: emergence of new ideas; information serves as a stimulus; intellectual tasks; collecting new sets of existing information resources; ability to manipulate to support information discovery

surrogate: comes from library science; representing the original – lets you access the original

digital surrogate: special type of hyperlink; formed systematically from metadata; eclipses the object;

text-based surrogates breakdown when you want to see relationships

images + text: working memory; separate cognitive resources; seems like a good idea to engage both together; e.g., video surrogates, navigational surrogates – overwhelmingly users prefer compositional format compared to text-based formats

composition: assemble collected elements to form a connected whole; visual composition; spatial organization; compositing; fading

people understand this visual composition better;

mixed-initiatives: user and agents working concurrently; requires dialogue and feedback

combinformation: mixed-initiative composition of surrogates; composition space: space where composition is created; user can manipulate it; search it

launch demo: combinFormation – can search rss feeds, flickr, allow crawling into the web, web site; can have as many searches as you like; can put together information from more than 1 search; crawler paths: crawling deeper into each site and crawl anywhere (follows cross links)

(java based) socket connection btw browser and application

uses google to run the searches and then it will download pages from initial searches and match up images with text – VERY COOL!!

when you mouse over the surrogate, there's structured metadata: location, the gist (what google said about the page), title

user can move images using the grab tool; resize; you can create the composition in the middle of the page; the "cool space" just for user; the "hot space" is shared with the agent

MAIN THEME OF THE SYSTEM: users can create a composition of their own images and search results by drag and drop technology; this becomes the user space;

metadata: details on demand in context rather than forcing the user to look elsewhere; deployment was in a class called the "design process"; experimented with 182 students; asked them to use combinFormation (divided the class – group A & group B). could only use combinFormation on 1 project; findings: students did better on both parts of the assignment when using combinFormation;

benefits: qualitative approach to collection visualization – using image and text together; using visual communications techniques (get both html & xml version of your composition space); users see some unexpected results; serendipity like physical library stacks;

future work: better semantic modeling; alternative visualizations; discovery tasks are not finding tasks;

3rd talk: infogallery: informative arts services for physical library spaces
center for interaction spaces; research project on future hybrid libraries;

i floor: interactive floor

info-column: digital "poster-column" exhibits electronic publications; librarians promote library resources;

info-gallery concept: exhibit digital resources in the physical space; natural pick up of digital materials;

look at chalmers university, play group and georgia tech, info canvas

presentation interface: "remarkable" visualizations and animations of infobjects; surrogates floating around; users can come by and pick them up in the physical environment

look at royal library for information distribution channels; editors showing up on different channels

pick up of digital resources by bluetooth phone or by email address; fully web-integrated architecture; http, mysql, xml, .net, flash; rss integration; sensor layer for bluetooth

different appearances: walls, floors, smart boards, plasma, projections; workstations

can't really describe how incredibly new age and cool this product is!!! bouncing balls on an interactive display, push the ball and you get more information about the digital object or go directly to the original resource; can send yourself an email of the digital resource; can inspect digital channels; can create new concepts for saving as rss feeds

how it's used: 65 librarians producing content; 50 channels; 250 interactive objects on avg.

placed the galleries in the physical library at the desk, the refreshment areas; book-return machine and moving these galleries to various parts of the city – reach users that don't come to the physical library;

future: working with local artists on different "skins"