JCDL 2006 Conference Notes

Day 2 – Session 2 – Document Analysis

DOM Tree and Geometric Layout Analysis for online medical journal article segmentation

information retrieved from articles: MEDLINE citations; use the DOM model to search through HTML documents

DOM node categorization: insignificant node, inline node, line-break node
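A minimal sketch of the three-way node categorization, in Python. The notes do not record which tags fall into which category, so the tag sets below are assumptions for illustration only, as are the names `categorize` and `categorize_html`; anything not listed is treated as a line-break node.

```python
from html.parser import HTMLParser

# Assumed tag sets; the talk's actual lists were not captured in these notes.
INSIGNIFICANT = {"script", "style", "meta", "link", "head", "title"}
INLINE = {"a", "b", "i", "em", "strong", "span", "sub", "sup", "font"}


def categorize(tag):
    """Return the assumed category for an HTML tag name."""
    tag = tag.lower()
    if tag in INSIGNIFICANT:
        return "insignificant"
    if tag in INLINE:
        return "inline"
    # Everything else (div, p, table, h1, ...) is assumed to break the line.
    return "line-break"


class CategoryCollector(HTMLParser):
    """Record (tag, category) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, categorize(tag)))


def categorize_html(html):
    """Parse an HTML string and return the categorized start tags."""
    parser = CategoryCollector()
    parser.feed(html)
    parser.close()
    return parser.seen
```

usage: `categorize_html("<div><p>Hello <b>world</b></p></div>")` returns one `(tag, category)` pair per start tag, in document order.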

visually identical pages can have completely different DOM trees

zone tree model: basic assumption – authors of journal article HTML documents use geometric layout to organize the page

Automatically Categorizing Figures in Scientific Documents

goal: location and extraction of non-textual information from scientific/academic documents

problems: identification, categorization; data extraction; indexing and retrieval

figures/images in scientific documents (e.g., line graphs, flow charts, photographs): today, we cannot search by these images

data within figures: automatic data extraction – time consuming; automated tools exist for this

e.g., a document containing gardening pictures… or a paper reporting experiments on human-computer interfaces

overview of work: semantics-sensitive; content-based feature extraction; machine-learning based classification

prior work: document retrieval (metadata extraction, name disambiguation); document image understanding (image representation to semantics, structure analysis)

extraction of figures: use Adobe Acrobat image extraction; does not work for scanned documents

classification: support vector machine
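A toy sketch of the classification step, assuming a linear, bias-free SVM trained by subgradient descent on the regularized hinge loss. The notes do not say which kernel, library, or features the authors used, so the two centered features and all values below are hypothetical, and the bias term is omitted only to keep the example short.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=50):
    """Minimize lam/2 * |w|^2 + hinge loss by subgradient descent (no bias)."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:
                # Inside the margin or misclassified: subgradient is lam*w - y*x.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            else:
                # Only the L2 regularizer contributes: subgradient is lam*w.
                w = [wi - lr * lam * wi for wi in w]
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical, already-centered feature vectors for two figure classes
# (+1 and -1); these are made-up values, not data from the talk.
TRAIN_X = [(2.0, 1.5), (1.0, 2.0), (1.5, 1.0),
           (-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5)]
TRAIN_Y = [1, 1, 1, -1, -1, -1]

WEIGHTS = train_linear_svm(TRAIN_X, TRAIN_Y)
```

usage: `predict(WEIGHTS, (1.2, 0.8))` labels an unseen feature vector; a real system would use a multi-class SVM over content-based image features.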

experiment setup: C, MySQL, Linux; dataset: ~2000 PDF files from CiteSeer; Adobe Acrobat extraction tool; manual annotation

XML Views for Electronic Editions

electronic editions; document-centric XML for electronic editions

xtagger 2005 – mapping between the data presentation model and the data access model; replication must occur between the 2 layers

software filters out the interesting parts of the document-centric XML

XML views: a set of XML nodes; reduces the size of the data access model; smaller serialization; based on XPath, plus regex for tag/attribute names and attribute values
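A rough sketch of how such a view could be computed, assuming Python's `xml.etree.ElementTree` (which supports only a subset of XPath) plus `re` for the name and value filters. The function `xml_view`, its parameters, and the sample markup are hypothetical and are not taken from the system presented in the talk.

```python
import re
import xml.etree.ElementTree as ET


def xml_view(root, xpath, tag_pattern=None,
             attr_name_pattern=None, attr_value_pattern=None):
    """Return elements matched by xpath that also pass the regex filters."""
    selected = []
    for node in root.findall(xpath):
        # Filter on the tag name, if a pattern was given.
        if tag_pattern and not re.fullmatch(tag_pattern, node.tag):
            continue
        # Filter on attributes: at least one attribute must satisfy
        # both the name pattern and the value pattern that were given.
        if attr_name_pattern or attr_value_pattern:
            matched = any(
                (attr_name_pattern is None
                 or re.fullmatch(attr_name_pattern, name))
                and (attr_value_pattern is None
                     or re.fullmatch(attr_value_pattern, value))
                for name, value in node.attrib.items()
            )
            if not matched:
                continue
        selected.append(node)
    return selected


# Made-up document-centric markup, for illustration only.
SAMPLE = """<text>
  <line n="1"><w>first</w><dmg type="stain"/></line>
  <line n="2"><w>second</w><res type="editorial"/></line>
</text>"""

ROOT = ET.fromstring(SAMPLE)
VIEW = xml_view(ROOT, ".//*", tag_pattern=r"dmg|res", attr_name_pattern=r"type")
```

usage: `VIEW` holds only the `dmg` and `res` elements, so serializing the view is smaller than serializing the whole tree.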


