Tag Soup

TagSoup is a SAX2 parser written in Java.

Tag Soup Ranking & Summary

  • License: GPL
  • Price: FREE
  • Publisher Name: John Cowan
  • Publisher web site: http://mercury.ccil.org/~cowan/XML/tagsoup/


Tag Soup Description

TagSoup is a SAX2 parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. It is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly.

The following options are understood:

  • --files: Output into individual files, with html extensions changed to xhtml. Otherwise, all output is sent to the standard output.
  • --html: Output is in clean HTML: the XML declaration is suppressed, as are end-tags for the known empty elements.
  • --omit-xml-declaration: The XML declaration is suppressed.
  • --method=html: End-tags for the known empty HTML elements are suppressed.
  • --pyx: Output is in PYX format.
  • --pyxin: Input is in PYXoid format (need not be well-formed).
  • --nons: Namespaces are suppressed. Normally, all elements are in the XHTML 1.x namespace, and all attributes are in no namespace.
  • --nobogons: Bogons (unknown elements) are suppressed. Normally, they are treated as empty.
  • --nodefaults: Default attribute values are suppressed.
  • --nocolons: Explicit colons in element and attribute names are changed to underscores.
  • --norestart: Normally restartable elements are not restarted.
  • --any: Bogons are given a content model of ANY rather than EMPTY.
  • --lexical: Pass through HTML comments. Has no effect when output is in PYX format.
  • --reuse: Reuse a single instance of the TagSoup parser throughout. Normally, a new one is instantiated for each input file.
  • --nocdata: Change the content models of the script and style elements to treat them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than with the special CDATA content model.
  • --encoding=encoding: Specify the input encoding. The default is the Java platform default.
  • --help: Print help.
  • --version: Print the version number.

Requirements:

· Java 1.4.2 or later

What's New in This Release:

· The main issue was with HTML comments, which were very badly broken: any > character would terminate one, so commenting out elements did not work properly.
· Everything should now be correct.
· Everyone who possibly can should update.
· Additionally, &#Xnnnn (with capital X) now works, some debugging code was removed from PYXWriter, a Unicode BOM at the beginning of a document is skipped, and the new version of Saxon is supported as an XSLT processor.
· Documentation has been added on SAX features and properties specific to TagSoup.
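
Because TagSoup exposes the standard SAX2 interface, code written against org.xml.sax.XMLReader should work with it unchanged. The sketch below is a minimal illustration of that idea, not part of TagSoup itself: the class name SaxElementLister and its helper methods are hypothetical, and to keep the example runnable without extra jars it uses the JDK's built-in XML parser on well-formed input. With the TagSoup jar on the classpath, the reader would presumably be swapped for an instance of org.ccil.cowan.tagsoup.Parser, which would allow messy HTML as input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: lists element names seen by any SAX2 XMLReader.
public class SaxElementLister {

    /** Parses markup with the given SAX2 reader and returns element names in document order. */
    static List<String> elementNames(XMLReader reader, String markup) throws Exception {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Fall back to the qualified name when the reader is not namespace-aware.
                names.add(localName.isEmpty() ? qName : localName);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    /** The JDK's own XML reader; a stand-in here for a TagSoup parser instance (assumption). */
    static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed input, because the JDK parser (unlike TagSoup) rejects tag soup.
        String markup = "<html><body><p>Hello</p><br/></body></html>";
        System.out.println(elementNames(jdkReader(), markup));
    }
}
```

The handler only depends on the SAX2 ContentHandler callbacks, so nothing in elementNames needs to change when a different XMLReader implementation is passed in.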


Tag Soup Related Software