CRUNCH – Programming Systems Laboratory

Crunch is a web proxy, usable with essentially all web browsers, that performs content extraction (clutter reduction) on HTML web pages. It includes a flexible plug-in API through which various heuristics can be integrated as filters that collectively remove non-content and extract the content.
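
To illustrate the idea of heuristics acting collectively as filters, here is a minimal Python sketch. It is not Crunch's actual plug-in API: the `Filter` type, the `remove_tags` and `remove_link_heavy` heuristics, the `extract` function, and the sample page are all hypothetical names chosen for this example, and the page is assumed to be well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable

# Hypothetical filter type: takes the document root and prunes it in place.
Filter = Callable[[ET.Element], None]


def text_len(element: ET.Element) -> int:
    """Count the non-whitespace characters of all text under an element."""
    return len("".join("".join(element.itertext()).split()))


def remove_tags(*tags: str) -> Filter:
    """Build a filter that drops every element whose tag is in `tags`."""
    def apply(root: ET.Element) -> None:
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag in tags:
                    parent.remove(child)
    return apply


def remove_link_heavy(tag: str, max_link_ratio: float) -> Filter:
    """Build a filter that drops `tag` elements whose text is mostly links."""
    def apply(root: ET.Element) -> None:
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag != tag:
                    continue
                total = text_len(child)
                linked = sum(text_len(a) for a in child.iter("a"))
                if total and linked / total > max_link_ratio:
                    parent.remove(child)
    return apply


def extract(page: str, filters: Iterable[Filter]) -> str:
    """Parse the page, run each filter in order, and return the remaining text."""
    root = ET.fromstring(page)
    for apply_filter in filters:
        apply_filter(root)
    return " ".join("".join(root.itertext()).split())


# Assumed sample page: a navigation bar, a script, and one content block.
PAGE = """<html><body>
<div><a href="/a">Home</a> <a href="/b">News</a> <a href="/c">Sports</a></div>
<script>track();</script>
<div><p>Crunch removes clutter from pages.</p> <a href="/more">More</a></div>
</body></html>"""

FILTERS = [remove_tags("script"), remove_link_heavy("div", 0.5)]

print(extract(PAGE, FILTERS))
```

Each filter only knows about one heuristic, so new heuristics can be added to the list without changing the others. Real-world HTML is rarely well-formed, so a tolerant HTML parser would be needed in practice.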

The proxy has evolved from a program whose individual settings had to be tweaked by hand by the end user into an extraction system designed to adapt to the user’s workflow and needs. It classifies web pages by genre and uses that classification to extract content in a similar manner from similar sites. This reduces the human involvement needed to apply heuristic settings to websites, automating the job by detecting and using the content genre of a given website.
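
The idea of reusing settings across sites of the same genre can be sketched as follows. This is only an illustration under stated assumptions: the `FilterSettings` fields, the genre names, the keyword sets, and the toy keyword-overlap classifier are all hypothetical stand-ins and do not describe the classifier or settings that Crunch actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterSettings:
    """Hypothetical bundle of heuristic settings applied to a page."""
    remove_scripts: bool
    remove_images: bool
    max_link_ratio: float


# Assumed per-genre settings; the values are made up for illustration.
GENRE_SETTINGS = {
    "news": FilterSettings(remove_scripts=True, remove_images=False, max_link_ratio=0.5),
    "shopping": FilterSettings(remove_scripts=True, remove_images=False, max_link_ratio=0.8),
}
DEFAULT_SETTINGS = FilterSettings(remove_scripts=True, remove_images=False, max_link_ratio=0.7)

# Assumed keyword sets standing in for a real genre classifier.
GENRE_KEYWORDS = {
    "news": {"headline", "breaking", "reporter", "editorial"},
    "shopping": {"cart", "price", "checkout", "shipping"},
}


def classify_genre(description: str) -> Optional[str]:
    """Pick the genre whose keywords overlap most with the description."""
    words = set(description.lower().split())
    best_genre, best_overlap = None, 0
    for genre, keywords in GENRE_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_genre, best_overlap = genre, overlap
    return best_genre  # None when no keyword matches


def settings_for(description: str) -> FilterSettings:
    """Look up the settings for the detected genre, else fall back to defaults."""
    return GENRE_SETTINGS.get(classify_genre(description), DEFAULT_SETTINGS)


print(settings_for("Add to cart and checkout with free shipping"))
```

The point of the lookup is that a site never seen before still receives sensible settings as long as its genre is recognized, so the user does not have to tune each site by hand.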

One of the major goals of Crunch is to make web pages more accessible to people with disabilities. We believed that preprocessing web pages with Crunch would make otherwise inaccessible pages more accessible.

Publications

Suhit Gupta, Gail Kaiser, “CRUNCH – Web-based Collaboration for Persons with Disabilities”, W3C Web Accessibility Initiative, Teleconference on Making Collaboration Technologies Accessible for Persons with Disabilities, Apr 2003.

Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm, “DOM-based Content Extraction of HTML Documents”, WWW2003

Suhit Gupta, Gail E Kaiser, Peter Grimm, Michael F Chiang, Justin Starren, “Automating Content Extraction of HTML Documents”, World Wide Web Journal, January 2004

Michael F. Chiang, Roy G. Cole, Suhit Gupta, Gail E Kaiser, Justin Starren, “World Wide Web Accessibility by Visually Disabled Patients: Problems and Solutions”, Submitted to the Journal of Ophthalmology, January 2004

Suhit Gupta, Gail E Kaiser, Salvatore Stolfo, “Extracting Context To Improve Accuracy For HTML Content Extraction”, Poster at the World Wide Web Conference 2005

Suhit Gupta, Gail E Kaiser, Salvatore Stolfo, Hila Becker, “Genre Classification of Websites Using Search Engine Snippets for Content Extraction”, Submitted to SIGIR 2005

Suhit Gupta, Gail Kaiser, “Extracting content from accessible webpages”, Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A), May 2005