Web pages often contain “clutter” (which we define as unnecessary images, navigational menus, and extraneous links) around the body of an article, which may distract the user from the actual content. Extracting “useful and relevant” content from web pages has many applications, including speech rendering for visually impaired users, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing the font size or removing HTML and data components such as images, which can detract from a web page’s inherent look and feel. Unlike “Content Reformatting”, which aims to reproduce the entire web page in a more convenient form, our solution directly addresses “Content Extraction”.

We use DOM tree-based content extraction rather than processing HTML directly as flat files. Crunch is a versatile framework that allows programmers and administrators to add their own heuristics. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces the human involvement needed to set thresholds for the heuristics by automatically detecting the content genre (context) of a given website and using it. Genre detection relies on frequency distributions of the words associated with the web pages. These distributions are compared with previously known results that work well for certain genres of sites, and the corresponding settings are then applied to improve the extraction.
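To make the idea concrete, here is a minimal Python sketch of genre-driven DOM filtering. It is not Crunch's actual code: the genre profiles, the filter names (`remove_images`, `remove_link_lists`), the `max_link_ratio` threshold, and the use of cosine similarity to compare word distributions are all assumptions chosen for illustration, and the input is assumed to be well-formed XHTML so that the standard-library `xml.dom.minidom` parser can build the tree.

```python
import re
from collections import Counter
from math import sqrt
from xml.dom import minidom

# Hypothetical genre profiles: a reference word distribution plus the
# filter settings assumed to work well for that genre.
GENRE_PROFILES = {
    "news": ({"report": 0.4, "said": 0.4, "government": 0.2},
             {"remove_images": True, "max_link_ratio": 0.5}),
    "shopping": ({"price": 0.5, "cart": 0.3, "shipping": 0.2},
                 {"remove_images": False, "max_link_ratio": 0.8}),
}

def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

def word_distribution(text):
    """Relative frequency of each lowercase word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {}
    return {w: c / len(words) for w, c in Counter(words).items()}

def cosine(a, b):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect_genre(text):
    """Pick the known genre whose profile is closest to the page's words."""
    dist = word_distribution(text)
    return max(GENRE_PROFILES, key=lambda g: cosine(dist, GENRE_PROFILES[g][0]))

def remove_images(doc, settings):
    """Filter: drop <img> elements when the genre settings enable it."""
    if not settings["remove_images"]:
        return
    for img in list(doc.getElementsByTagName("img")):
        img.parentNode.removeChild(img)

def remove_link_lists(doc, settings):
    """Filter: drop <ul> lists whose text is mostly link text."""
    for ul in list(doc.getElementsByTagName("ul")):
        total = len(text_of(ul).strip())
        linked = sum(len(text_of(a).strip())
                     for a in ul.getElementsByTagName("a"))
        if total and linked / total > settings["max_link_ratio"]:
            ul.parentNode.removeChild(ul)

# Filters run in order; each one reads its own parameters from the settings.
FILTERS = [remove_images, remove_link_lists]

def extract(xhtml):
    """Parse the page, detect its genre, apply the filters, return the text."""
    doc = minidom.parseString(xhtml)
    genre = detect_genre(text_of(doc.documentElement))
    settings = GENRE_PROFILES[genre][1]
    for apply_filter in FILTERS:
        apply_filter(doc, settings)
    return genre, " ".join(text_of(doc.documentElement).split())
```

Calling `extract` on a small page that holds a navigation list, an article paragraph, and an image returns the detected genre together with the remaining text. Because every filter operates on the parsed tree and takes its thresholds from a settings dictionary, adding a heuristic in this sketch means writing one more function and appending it to `FILTERS`.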