| Crunch is a web proxy, usable with essentially all web browsers, that performs content extraction (or clutter reduction) from HTML web pages. Crunch includes a flexible plug-in API so that various heuristics can be integrated to act as filters, collectively, to remove non-content and perform content extraction. This proxy has evolved from a program where individual settings had to be tweaked by hand by the end user, to an extraction system that is designed to adapt to the user’s workflow and needs, classifying web pages based on genre and utilizing this information to extract content in similar manners from similar sites. It reduces human involvement in applying heuristic settings for websites and instead tries to automate the job by detecting and utilizing the content genre of a given website. One of the major goals of Crunch is to be able to make web pages more accessible to people with disabilities and we believed that preprocessing web pages with Crunch would make inaccessible web pages more accessible. |

Home | About | People | Publications | Software | Register