Jericho HTML Parser

A simple but powerful Java library allowing analysis and manipulation of parts of an HTML document.

Jericho HTML Parser Ranking & Summary

  • License: LGPL
  • Price: FREE
  • Publisher Name: Martin Jericho

Jericho HTML Parser Description

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Jericho HTML Parser is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.

Here are some key features of "Jericho HTML Parser":

· No parse tree of the entire document is ever generated. The document source text is searched only for the markup relevant to the current operation. This allows the library to analyse and modify documents containing incorrect or badly formatted HTML or any other server or client side code, script, macro or markup. Most other parsers can't handle content that they are not explicitly programmed to accept.
· The beginning and end positions in the source text of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a parse tree. This feature, in combination with the one above, makes the toolkit extremely powerful in its simplicity.
· Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
· ASP, JSP, PSP, PHP and Mason server tags can be registered for recognition by the parser, and are recognised as accurately as is possible without incorporating actual parsers for these languages into the library. The library then allows any of these segments to be ignored when parsing the rest of the document so that they do not interfere with the HTML syntax (see Segment.ignoreWhenParsing()).
· Custom tag types can be easily defined and registered for recognition by the parser.

What's New in This Release:

Bug Fixes:
· Infinite loop on Segment.getAllStartTags()
· Infinite loop on Segment.getAllElements()
· Segment.getFirst* methods returned segments outside the bounding segment.
· Segment.getAllElements methods did not return all enclosed elements in some circumstances.
· Fixed documentation errors in Segment.getAllElements methods.
· Added StreamedSource class.

CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
· Changed ParseText from class to interface.
· Segment.getNodeIterator() now returns character references as separate nodes.
· Added tag search methods based on attribute value regular expressions.
· Added tag search methods based on HTML class attribute.
· Added static Source.LegacyNodeIteratorCompatabilityMode property temporarily to restore Segment.getNodeIterator() functionality to that of previous versions.
· Removed char[] based search methods in ParseText.
· Added CharacterReference.appendCharTo(Appendable) method.
· Added OutputDocument(Segment) constructor.
· Added StreamedSourceCopy sample program.
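
To make the position-based editing described in the key features more concrete, here is a minimal sketch in plain Java. It is not the Jericho API: the class SegmentEditSketch, its Segment type and both methods are hypothetical names invented for this illustration, and the regular expression is a deliberate simplification of real tag recognition. It only shows the general idea of locating markup by its begin and end offsets and copying everything else verbatim, without building a parse tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: hypothetical names, NOT part of the Jericho library.
final class SegmentEditSketch {

    // A span of the source text, identified purely by its offsets.
    static final class Segment {
        final int begin; // inclusive
        final int end;   // exclusive

        Segment(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Scans the text for start tags with the given name; no tree is built,
    // and markup that is irrelevant to this search is never interpreted.
    static List<Segment> findStartTags(String source, String tagName) {
        Pattern pattern = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(\\s[^>]*)?>",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(source);
        List<Segment> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(new Segment(matcher.start(), matcher.end()));
        }
        return found;
    }

    // Copies the source verbatim, substituting the replacement text for each
    // segment. Segments must be in ascending order and must not overlap.
    static String replaceSegments(String source, List<Segment> segments, String replacement) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Segment segment : segments) {
            out.append(source, pos, segment.begin); // untouched text before the segment
            out.append(replacement);
            pos = segment.end;
        }
        out.append(source, pos, source.length());   // untouched tail
        return out.toString();
    }

    public static void main(String[] args) {
        // Invalid HTML plus a server tag: both pass through unchanged.
        String source = "<p>Unclosed paragraph <% server.code() %> <img src=a.png> <b>bold</b";
        List<Segment> images = findStartTags(source, "img");
        System.out.println(replaceSegments(source, images, "<img src=\"b.png\" alt=\"\">"));
    }
}
```

Because only the matched offsets are rewritten, the unclosed paragraph, the server tag and the truncated closing tag in the sample input come out exactly as they went in. The real library's classes (for example OutputDocument and Segment, both named in the release notes above) should be consulted for the actual interface.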

