|
Relax end-tag parsing for </script> and </style> so they accept attribute content like the other end-tags. This does not strictly conform to the HTML specifications but is sometimes found in real-world HTML.
New EndLine, EndCol, and EndPos functions determine the end of the current HTML piece.
Parse <![CDATA[ beginning of ptCDataSection case-sensitively, as per specification.
Parse <![CDATA[ … ]]> sections separately inside JavaScript comments. This fixes a problem with pages that use a commented CDATA section inside a script element but do not properly close this comment before the closing </script> end tag. Such end tags are now recognized by DIHtmlParser.
ExtractText demo works better with Delphi Unicode versions.
Library source code compiles with FreePascal (Win32).
New TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
New TDIHtmlParser.DataAsStrTrim8 convenience method.
Change case of HTML tag constants to lower case. This achieves slightly better results for HTML compression.
Bring DIHtmlParser_BookmarkParser demo up to date with latest Mozilla and Chrome bookmark files.
Improved documentation layout.
TDIHtmlParser: When parsing JavaScript, a forward slash "/" inside a regular expression character class was not recognized as such and could lead to an infinite loop.
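The regular expression fix above can be illustrated with a small, language-neutral sketch in Python. It is not DIHtmlParser's code; the function name find_regex_literal_end is hypothetical. It only shows why a "/" inside a character class must not be taken as the end of the literal.

```python
def find_regex_literal_end(src: str, start: int) -> int:
    """Return the index just past the closing '/' of a JavaScript regex literal.

    `start` must point at the opening '/'. Returns -1 if the literal is
    not terminated on the same line. Hypothetical helper for illustration.
    """
    i = start + 1
    in_class = False              # True while inside a [...] character class
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2                # skip the backslash and the escaped character
            continue
        if ch == "\n":
            return -1             # regex literals cannot span lines
        if in_class:
            if ch == "]":
                in_class = False  # the character class is closed
        elif ch == "[":
            in_class = True
        elif ch == "/":
            return i + 1          # only a '/' outside a class ends the literal
        i += 1                    # always advance, so the scan cannot loop forever
    return -1


src = "var re = /[/]+x/g;"
start = src.index("/")
end = find_regex_literal_end(src, start)
print(src[start:end])
```

Every branch advances the index or returns, so the scan terminates even on malformed input.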
TDIHtmlCharSetPlugin: Correct decoding function for "GBK" encoding which did not read the 1 to 127 character range.
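As a rough illustration of the expected behaviour (using Python's built-in "gbk" codec, not the TDIHtmlCharSetPlugin decoder), bytes in the 1 to 127 range decode to the same ASCII characters, while lead bytes above 127 start a two-byte sequence. The function name decode_gbk is an assumption made for this sketch.

```python
def decode_gbk(data: bytes) -> str:
    """Decode GBK bytes; single bytes below 128 map directly to ASCII."""
    return data.decode("gbk")


# ASCII text mixed with two double-byte GBK characters.
mixed = b"Hi \xc4\xe3\xba\xc3!"
print(decode_gbk(mixed))
```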
Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value when it is immediately followed by a '{' character (HTML 4.01, Section B.7.1).
Multiple fixes for filtering, most notably for TDITagFilters.SetStart.
Better HTML title parsing according to how Firefox does it.
TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
Modify DIHtmlParser_C6.bpk so that it should compile and install again with C++ Builder 6.
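A minimal sketch of the escaping rule, written in Python for illustration only. The name escape_attr_value is hypothetical and this is not the .ConCatValue implementation. It assumes the input value is raw, unescaped text.

```python
import re

# Match '&' unless it is immediately followed by '{' (the reserved
# script-macro syntax "&{...};" described in HTML 4.01, Section B.7.1).
_AMP_NOT_MACRO = re.compile(r"&(?!\{)")


def escape_attr_value(value: str) -> str:
    """Escape '&' and '"' for a double-quoted attribute, keeping '&{' intact."""
    value = _AMP_NOT_MACRO.sub("&amp;", value)
    return value.replace('"', "&quot;")


print(escape_attr_value('Tom & Jerry &{macro};'))
```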
CharSetConverter demo: Add BOM detection.
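BOM detection in general can be sketched as follows. This is a Python illustration under stated assumptions, not the demo's Delphi code, and detect_bom is a hypothetical name. The one subtle point is ordering: the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark, so the longer marks are tested first.

```python
import codecs

# Longer marks first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


encoding, skip = detect_bom(b"\xef\xbb\xbfhello")
print(encoding, skip)
```

A caller would decode data[skip:] with the detected encoding and fall back to another detection method when no BOM is found.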
ExtractText demo: Optional Unicode output controlled by compiler directive. Also add more tags to improve HTML → Text conversion.
WebDownload demo: Improve generation of document names if URI has a query part.
WriterPlugin demo: Support DIHtmlParser1.EnableHtmlTags.
Some new, simple console demos inspired by support questions.
Improve compatibility for parallel installation with other DI packages.
Some code cleanup.
Delphi 2007 support.
-
Compatibility with DIConverters 1.11. If you are using DIHtmlParser with DIConverters and encounter incompatibility problems after upgrading to this new version, be sure to use the new version of DIConverters as well.
Add XP Themes to Demo projects.
Fixed a problem when parsing certain kinds of regular expression escapes in JavaScript.
Reduced memory requirements for quickly skipping over JavaScript.
Fixed filtering bugs in TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.
Added compatibility with Delphi 2006 Win32.
New TDIHtmlParser.EnableHtmlTags property which controls if HTML tags are properly recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
New TDIHtmlParser.TrimAttribValues property which controls whether whitespace is automatically trimmed when parsing the attribute values of tags.
Improved parsing of CustomTags and ASP.
Fixed an error which could prematurely stop TDIUnicodeReader when a pushed source was popped at the end of a nested document.
Added Delphi 3 compatibility to the utility units.
Resolved dependency issues when DIHtmlParser is used in parallel with other DI products.
Added the option to link DIHtmlParser against DIConverters, which enables DIHtmlParser to read and write 130+ character encodings.
Added native Pascal implementation for reading / decoding and writing / encoding the following character sets:
Mac Arabic, Mac Dingbats, Mac Central Europe, Mac Croatian, Mac Cyrillic, Mac Farsi, Mac Greek, Mac Hebrew, Mac Iceland, Mac Roman, Mac Romanian, Mac Thai, Mac Turkish
UCS-2 LE, UCS-2 BE
UCS-4 LE, UCS-4 BE
UTF-32 LE, UTF-32 BE
UTF-7 (Write_UTF_7 / Read_UTF_7)
UTF-7 Optional Direct Characters (Write_UTF_7_ODC / reads as Read_UTF_7)
JIS X0201, NextStep, TIS 620
Improved the parser's handling of malicious markup frequently used in Spam E-Mail: The parser now treats invalid tags (like '<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
Improved parsing of non-conformant XML Processing Instruction (XmlPI), marked as '<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification conformant parsing would then cause DIHtmlParser unintentionally to interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
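The lenient termination rule can be sketched like this. It is a Python illustration with a hypothetical function name, not the parser's actual code. Because the conformant terminator '?>' itself ends in '>', accepting both variants amounts to stopping at the first '>' and noting whether a '?' preceded it. One assumption of this sketch: a '>' inside the instruction's content would also end it.

```python
def find_xml_pi_end(src: str, start: int = 0):
    """Locate the end of a processing instruction that begins at `start`.

    Returns (end_index, conformant). end_index is just past the closing '>'
    or -1 if no '>' follows. conformant is True only for a '?>' terminator.
    """
    if not src.startswith("<?", start):
        raise ValueError("no processing instruction at this position")
    close = src.find(">", start + 2)
    if close < 0:
        return -1, False
    # The '?' must belong to the terminator, not to the opening '<?'.
    conformant = close - 1 >= start + 2 and src[close - 1] == "?"
    return close + 1, conformant


broken = '<?xml version="1.0"><p>x</p>'
end, ok = find_xml_pi_end(broken)
print(broken[:end], ok)
```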
Improved the recognition of HTML entities lacking a terminating semicolon character (like '&nbsp') in some cases.
Added mapping of some illegal but commonly used HTML numeric entities into their appropriate Unicode value.
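For comparison, Python's standard html.unescape shows the same two kinds of leniency: named entities without a trailing semicolon, and numeric references in the 128 to 159 range mapped to the characters commonly intended. The mapping table DIHtmlParser uses is not listed here, so this is only an analogous illustration, and decode_entities_leniently is a hypothetical wrapper name.

```python
import html


def decode_entities_leniently(text: str) -> str:
    """Decode HTML entities using the HTML5 rules built into html.unescape."""
    return html.unescape(text)


# '&nbsp' lacks its semicolon; '&#150;' is an illegal but common numeric entity.
print(repr(decode_entities_leniently("a&nbsp b")))
print(repr(decode_entities_leniently("1990&#150;2000")))
```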
Changed the TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll as well as to TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which did not yet return to the caller.
Introduced TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag. Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
Added a TDIHtmlWriterPlugin.PredefinedEntities option which specifies a set of known predefined entities that are always encoded when writing HTML text, regardless of other entity registrations.
Shortened procedure name of TDITag.ForceAttribValue to TDITag.ForceAttrib.
TDITag and descendant classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.
products/htmlparser/history.txt · Last modified: 2011/12/08 17:34 (external edit)
|