unfluff

Statistical HTML content extraction in Python

unfluff Ranking & Summary

  • License: BSD License
  • Publisher Name: Tim Cuthbertson

unfluff Description

unfluff is a statistical content extraction tool written in Python: it removes the useless fluff from arbitrary HTML pages. It is an experiment and a work in progress.

It is based on methods discussed (and implemented) in various places, but most directly:

  • http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/
  • http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html

Usage:

The command line tool can take either a file or a URL to extract. It prints the content tree to stdout:

    unfluff /path/to/something.html

or

    unfluff -u 'http://some-website.com/interesting-article.html'

The unfluff library has a few functions, which pretty much all do the same thing via different formats:

    import unfluff
    unfluff.from_url('http://whatever/')
    unfluff.from_file('/tmp/input.html')
    unfluff.from_string("<html>inline content</html>")

Requirements:

  • Python
  • lxml
  • SciPy

lxml and SciPy are both native (C) extensions, which means you're best off looking for them in your friendly neighborhood package manager.
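To give a rough feel for what "statistical" content extraction can mean, here is a minimal sketch using only the Python standard library. It is not unfluff's code and does not claim to reproduce the methods in the links above. It is a simplified heuristic chosen for illustration: split the page into blocks, then keep blocks that are at least as long as the average block and are not mostly link text. The names BlockCollector and extract_content, the list of block tags, and the 0.5 link-density threshold are all assumptions made up for this example.

```python
from html.parser import HTMLParser
from statistics import mean

# Tags whose text is never page content.
SKIP_TAGS = {"script", "style"}
# Tags treated as block boundaries in this sketch (an arbitrary choice).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}


class BlockCollector(HTMLParser):
    """Hypothetical helper: split a page into (text, link_chars) blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_content(html, max_link_density=0.5):
    """Return the text blocks that look like content under this heuristic."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    if not parser.blocks:
        return []
    average = mean(len(text) for text, _ in parser.blocks)
    kept = []
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= average and link_density <= max_link_density:
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = (
        '<html><body>'
        '<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>A long paragraph of real article text that the heuristic '
        'should keep because it is long and contains no links.</p>'
        '<div><a href="/terms">Terms</a></div>'
        '</body></html>'
    )
    for block in extract_content(sample):
        print(block)
```

Short, link-heavy blocks such as navigation bars and footers fall below the average length or exceed the link-density threshold, so they are dropped. A real tool would work on a parsed tree (for example with lxml) rather than a flat list of blocks.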

