Yunqa • The Delphi Inspiration

Delphi Components and Applications

DIHtmlParser: Version History

DIHtmlParser is a component suite to parse, analyze, extract information from, and generate HTML, XHTML, and XML documents for Delphi (Embarcadero, CodeGear, Borland).

DIHtmlParser 8.3.0 – 22 Nov 2023

  • Support Delphi 12 Athens Win32 and Win64.
  • New TDIHtmlTablesPlugin.CurrentTable property.
  • New TDIHtmlTable.TableNum property.
  • Fix buffer underrun in DIUri_3986 when cleaning up leading /../ and /.. dot segments on an empty base.
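
The dot-segment cleanup mentioned in the last item is specified in RFC 3986, section 5.2.4. The following is a minimal Python sketch of that algorithm, not the DIUri_3986 Pascal source; the function name remove_dot_segments is chosen here for illustration only. It shows the case the fix is about: a leading /../ or /.. arrives while the output is still empty, so there is no previous segment to remove.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path (RFC 3986, section 5.2.4)."""
    output = []  # finished segments, each stored together with its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" collapses to "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" collapses to "/"
            if output:               # guard: nothing to remove on an empty output
                output.pop()
        elif path == "/..":
            path = "/"
            if output:               # same guard for a trailing "/.."
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, up to but excluding the next "/", to the output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("/../x"))
```

Without the two `if output:` guards, the pop on an empty list would fail, which is the Python analogue of reading before the start of a buffer.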

DIHtmlParser 8.2.0 – 16 Sep 2021

  • Support Delphi 11 Alexandria Win32 and Win64.
  • Update DIUtils.pas Unicode functions to Unicode 14.0.0.

DIHtmlParser 8.1.0 – 5 Jun 2020

  • Support Delphi 10.4 Sydney Win32 and Win64.

DIHtmlParser 8.0.1 – 30 Oct 2019

Delphi compilers with support for the inline directive (starting with Delphi 2005) failed to compile DIHtmlParser *.bpl packages for the Demo and Commercial editions. They generated a “[dcc32 Fatal Error] DIUtils: F2051 Unit DIContainers was compiled with a different version of DIUtils.StrSameIW”. Regular *.exe applications compiled without problems. The DIHtmlParser Source Code also compiled to both *.bpl packages and *.exe applications with no problems.

DIHtmlParser 8.0.0 – 8 Oct 2019

Extend character support to the full range of Unicode Code Points from $000000 to $10FFFF.

Up to now, DIHtmlParser stored code points as WideChars. This limited Unicode support to the Basic Multilingual Plane (BMP) from $0000 to $FFFF. Code points from the Supplementary Planes were converted to the $FFFD replacement character. This worked well for a great number of languages, but less common scripts did not work, nor did the increasingly popular emojis from the Symbols and Pictographs Unicode blocks.

DIHtmlParser 8.0.0 overcomes these limitations and now covers the complete Unicode range. Changes are almost entirely internal and maintain backwards compatibility as much as possible. Existing applications should compile with no or only minor changes. WideChar routines are marked as deprecated and hint at their new complementary UCP routines.

TDIHtmlParser.Data is still a WideChar buffer. However, its contents are now fully UTF-16 encoded. This means that it may contain code points > $FFFF, which take up two WideChars (a surrogate pair). As a result, a buffer index no longer necessarily corresponds to one code point, so indexed access by character position is no longer guaranteed. TDIHtmlParser.Data related methods, like TDIHtmlParser.DataAsStrTrimW, are adjusted accordingly.

UnicodeString utility routines are rewritten to handle full UTF-16, including surrogate pairs. Most of them are in DIUtils.pas. YuUtf.pas also contains new utility routines for UTF-16 testing, encoding, and decoding. Where possible, string handling routines now take NativeInt type parameters for the buffer length.
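
To make the surrogate pair arithmetic concrete, here is a small Python sketch of UTF-16 testing, encoding, and decoding. It is an illustration under assumptions, not the YuUtf.pas code: all function names are invented for this example, and replacing unpaired surrogates with U+FFFD is one possible policy rather than documented DIHtmlParser behavior.

```python
from typing import List


def is_high_surrogate(unit: int) -> bool:
    """True for the first 16-bit unit of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True for the second 16-bit unit of a surrogate pair."""
    return 0xDC00 <= unit <= 0xDFFF


def encode_utf16(cp: int) -> List[int]:
    """Encode one code point as one or two 16-bit units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # BMP: a single unit
    cp -= 0x10000                        # 20 bits remain
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def decode_utf16(units: List[int]) -> List[int]:
    """Decode 16-bit units to code points; unpaired surrogates become U+FFFD."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2                       # one code point consumed two units
        elif is_high_surrogate(unit) or is_low_surrogate(unit):
            result.append(0xFFFD)        # assumption: replace unpaired surrogates
            i += 1
        else:
            result.append(unit)
            i += 1
    return result


units = encode_utf16(ord("A")) + encode_utf16(0x1F600)
print([hex(u) for u in units])
print([hex(cp) for cp in decode_utf16(units)])
```

The example buffer holds three 16-bit units but only two code points, which is why a plain index into such a buffer can land in the middle of a character.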

Other noteworthy changes:

  • TDIHtmlParser.UCP complements TDIHtmlParser.Char.
  • The WideChar property TDIHtmlParser.CustomTagStartChar has a new UCS4Char complement CustomTagStartUcp. The same holds for TDIHtmlWriterPlugin.CustomTagStartChar and CustomTagStartUcp.
  • TDICustomTag.GetStartCode has a new UCS4Char overload. So do GetEmptyElementCode and GetEndCode.
  • Changed the type of TDIHtmlParser.StartCol, EndCol, StartLine, EndLine, StartPos, and EndPos from unsigned Cardinal to signed NativeInt.
  • Removed conditional compilation directives DI_No_Classes and DI_No_Unicode_Component (source code only). TDIHtmlParser and TDIHtmlParserPlugin now always descend from TComponent and the Classes unit is always used.
  • Improve DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
  • Update DIUtils.pas Unicode functions to Unicode 12.1.0.
  • Delphi 4 and Delphi 5 crash when compiling DIUtils.pas. There is no error message, so it is not possible to work around the problem. Support for these compilers is therefore removed. At least Delphi 6 is now required to compile DIHtmlParser.
  • Remove include file. Directly link in instead.

DIHtmlParser 7.12.0 – 7 Mar 2019

  • Fix potential TDIUnicodeWriter memory leak if TDIUnicodeWriteMethods.Init allocates its own memory.
  • TDIUnicodeWriter.Clear calls TDIUnicodeWriteMethods.Flush to reset encoder state.
  • KOI8-U converter now maps 0xB4 to U+0404 instead of U+0403.
  • Update DIUtils.pas Unicode functions to Unicode 12.
  • Compatibility update with DIConverters 1.18.0. These changes only affect projects using DIConverters:
    • Add ISO-2022-JP-MS encoding: Read_iso_2022_jp_ms read methods and Write_iso_2022_jp_ms write methods. This is recognized by TDIHtmlCharSetPlugin.
    • DIConverters converter functions now use the native unsigned integer type for the length of a string and support strings longer than 2 GB.
    • UTF-8 converter functions reject surrogates and out-of-range code points, namely those in the ranges 0xD800..0xDFFF and >= 0x110000.
    • Fix error handling in UCS-2, UCS-4, and UTF-32 decoder functions.
    • Tweak the GB18030 converter functions to map 0x8135F437 to U+E7C7.
    • Update the CP1255 converter functions to map 0xCA to U+05BA.
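
The stricter UTF-8 checks listed above (rejecting surrogates and out-of-range code points) can be illustrated with a short Python sketch. This is not the DIConverters implementation; the function name decode_utf8_strict and the choice to raise an error rather than substitute a replacement character are assumptions made for this example.

```python
from typing import List


def decode_utf8_strict(data: bytes) -> List[int]:
    """Decode UTF-8 bytes to code points, rejecting ill-formed input."""
    result = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Determine payload bits, sequence length, and smallest legal value.
        if lead < 0x80:
            cp, length, minimum = lead, 1, 0x0
        elif 0xC0 <= lead < 0xE0:
            cp, length, minimum = lead & 0x1F, 2, 0x80
        elif 0xE0 <= lead < 0xF0:
            cp, length, minimum = lead & 0x0F, 3, 0x800
        elif 0xF0 <= lead < 0xF8:
            cp, length, minimum = lead & 0x07, 4, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte near offset %d" % i)
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum:
            raise ValueError("overlong encoding at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point at offset %d" % i)
        if cp >= 0x110000:
            raise ValueError("code point out of range at offset %d" % i)
        result.append(cp)
        i += length
    return result


print([hex(cp) for cp in decode_utf8_strict("a\u00e9\u20ac\U0001F600".encode("utf-8"))])
```

The byte sequences ED A0 80 (which would decode to the surrogate 0xD800) and F4 90 80 80 (which would decode to 0x110000) are both rejected by this sketch.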

DIHtmlParser 7.11.0 – 24 Dec 2018

  • Support Delphi 10.3 Rio Win32 and Win64.

DIHtmlParser 7.10.0 – 3 Apr 2017

  • Support Delphi 10.2 Tokyo Win32 and Win64.

DIHtmlParser 7.9.0 – 7 May 2016

  • Support Delphi 10.1 Berlin Win32 and Win64.

DIHtmlParser 7.8.0 – 5 Apr 2016

  • New TDIHtmlWriterPlugin.PredefinedEntities:
    • peLtAttribValue to encode “<” as &lt; in attribute values. Required for XML conformance.
    • peGtAttribValue to encode “>” as &gt; in attribute values.
    • peQuotNum to encode quotation mark as numeric &#34; instead of &quot;.
  • Fix: peAposNum was not applied to attribute values.

DIHtmlParser 7.7.0 – 3 Mar 2016

  • New TDIHtmlWriterPlugin properties to force the character used to quote attribute values:
    • QuoteHtmlTagsChar
    • QuoteCustomTagsChar
    • QuoteSsiTagsChar

DIHtmlParser 7.6.2 – 15 Sep 2015

  • Support Delphi 10 Seattle Win32 and Win64.

DIHtmlParser 7.6.1 – 25 Apr 2015

  • Add support for Delphi XE8 Win32 and Win64.

DIHtmlParser 7.6.0 – 3 Oct 2014

  • Support Delphi XE7 Win32 and Win64.
  • Mark unit DIUri as deprecated.
  • TDIHtmlChangeLinksPlugin uses unit DIUri_3986 instead of the deprecated unit DIUri.
  • Improved documentation shows inherited class members.

DIHtmlParser 7.5.0 – 28 Apr 2014

  • Support Delphi XE6 Win32 and Win64.
  • Minor improvements to demo projects.

DIHtmlParser 7.0.1 – 17 Feb 2014

  • Compatibility update with other Yunqa products.

DIHtmlParser 7.0.0 – 25 Sep 2013

  • Support Delphi XE5 Win32 and Win64.

DIHtmlParser 6.6.0 – 14 Jun 2013

  • Support Delphi XE4 Win32 and Win64.

DIHtmlParser 6.5.1 – 24 Jan 2013

  • Compatibility update with other Yunqa products.

DIHtmlParser 6.5.0 – 4 Oct 2012

  • Support Delphi XE3 Win32 and Win64.
  • TDIHtmlCharSetPlugin: Fix: A second <meta http-equiv> tag that does not declare a content type no longer resets the decoding to the default decoding.
  • Fix the DIHtmlParser_CharSetConverter demo so that the new character encoding is always written to the document, even if auto-detection is disabled.

DIHtmlParser 6.3.0 – 22 Jun 2012

  • HTML5 Updates:
    • Add new HTML5 tag and attribute names and IDs, for example TAG_SECTION, TAG_SECTION_ID and ATTRIB_PLACEHOLDER and ATTRIB_PLACEHOLDER_ID. The new HTML5 tags and attributes are automatically registered by calling RegisterHtmlTags and RegisterHtmlAttribs.
    • Add new HTML5 named character references, known as entities in HTML4. After calling RegisterHtmlDecodingEntities, DIHtmlParser now recognizes all 2231 references listed in the current HTML5 draft.
    • Parse named character references / entities according to HTML5. In particular, a terminating semicolon ';' is no longer required. For example, &amp is recognized as '&' just as &amp;, &AMP, and &AMP;.
    • Named character references / entities can now be registered with and without terminating semicolon ';'. Change: To match a reference that ends with a semicolon ';', RegisterDecodingEntity now requires the semicolon to be part of the registered entity name.
    • TDIHtmlCharSetPlugin recognizes the new HTML5 <meta charset="name"> character encoding declaration.
  • Add DIUri_3986.TDIUri.AssignPath and DIUri_3986.TDIUri.AssignHost methods, plus DIUri_3986.UriToFileName with DIUri_3986.TDIUri URI input and UnicodeString filename output.
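
The HTML5 rule for named character references described in this release, where some references are recognized with and without a terminating semicolon, can be observed with Python's standard html module, which follows the HTML5 reference table. This only demonstrates the HTML5 behavior and says nothing about how DIHtmlParser implements it; the variable names are chosen for this example.

```python
import html
from html.entities import html5  # HTML5 reference names mapped to their characters

samples = ["&amp;", "&amp", "&AMP;", "&AMP"]

# Decode each spelling; under HTML5 rules all four yield an ampersand.
decoded = {text: html.unescape(text) for text in samples}
for text, value in decoded.items():
    print(repr(text), "->", repr(value))

# The table lists legacy names both with and without the trailing semicolon,
# which mirrors registering an entity name in both forms.
print("amp" in html5, "amp;" in html5)
```

Names introduced by HTML5 itself are generally listed only with the semicolon, so a lookup has to treat the semicolon as part of the name rather than as a separate terminator.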

DIHtmlParser 6.2.0 – 14 Apr 2012

  • Fix: When parsing from TDIHtmlParser.SourceStream, the size of the internal source buffer was not correctly calculated. Depending on the decoding, this slowed down reading or even stopped it before the end of the stream was reached.
  • Fix: Parsing JavaScript, a regular expression character class containing just a single forward slash was not properly terminated.
  • New DIUri_3986.pas unit implements URI parsing and resolution according to RFC 3986.
  • DIUri.UriToFileName removes 'localhost' from authority, if present. Despite this change, DIUri is now deprecated. Use DIUri_3986 instead.
  • ColorFromHtml: Improve parsing of #color values, in particular different lengths. Parse non-conforming #color values as legacy color values.
  • Add optional EmptyAttribValues parameter (default = false) to
    • TDIHtmlTag.GetCode, TDIHtmlTag.GetStartCode, TDIHtmlTag.GetEmptyElementCode,
    • TDICustomTag.GetCode, TDICustomTag.GetStartCode, TDICustomTag.GetEmptyElementCode,
    • TDISsiTag.GetCode, TDISsiTag.GetStartCode, TDISsiTag.GetEmptyElementCode.
  • Work around a compiler warning in TDIHtmlParser.FillSourceBuffer (source code edition only).

DIHtmlParser 6.1.1 – 8 Dec 2011

  • Relax end-tag parsing for </script> and </style> so they accept attribute content like the other end-tags. This does not strictly conform to the HTML specifications but is sometimes found in real-world HTML.
  • New EndLine, EndCol, and EndPos functions determine the end of the current HTML piece.

DIHtmlParser 6.1.0 – 9 Nov 2011

  • Support Delphi XE2 Win64.
  • Fix AV when sorting empty TDIVector or descendants like TDITag and TDIHtmlTag.

DIHtmlParser 6.0.0 – 15 Oct 2011

  • Support Delphi XE2 Win32 (binary editions) and Win64 (source code edition only right now).
  • Fix a JavaScript parsing endless loop if the script ended with a slashes comment and its </SCRIPT> end tag was missing.

DIHtmlParser 5.2.2 – 7 Jul 2011

  • Improve handling of comments and CDATA for JavaScript content between <script> and </script> tags.

DIHtmlParser 5.2.1 – 21 Feb 2011

  • Parse <![CDATA[ beginning of ptCDataSection case-sensitively, as per specification.
  • Parse <![CDATA[]]> sections separately inside JavaScript comments. This fixes a problem with pages that use a commented CDATA section inside a script element but do not properly close this comment before the closing </script> end tag. Such end tags are now recognized by DIHtmlParser.
  • ExtractText demo works better with Delphi Unicode versions.
  • Library source code compiles with FreePascal (Win32).

DIHtmlParser 5.2.0 – 28 Sep 2010

  • Delphi XE support.
  • Fix DIHtmlParser_ColoredCode demo for Unicode Delphis.

DIHtmlParser 5.1.2 – 24 Apr 2010

  • New TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
  • New TDIHtmlParser.DataAsStrTrim8 convenience method.
  • Change case of HTML tag constants to lower case. This achieves slightly better results for HTML compression.
  • Bring DIHtmlParser_BookmarkParser demo up to date with latest Mozilla and Chrome bookmark files.
  • Improved documentation layout.

DIHtmlParser 5.1.1 – 17 Dec 2009

  • Additions and bug fixes to DIUtils.pas.

DIHtmlParser 5.1 – 14 Sep 2009

  • Delphi 2010 support.
  • Added the following TDIHtmlParser parsing options:
    • TDIHtmlParser.EnableComments.
    • TDIHtmlParser.EnableEntities.
    • TDIHtmlParser.EnableExclamationMarkups.
  • Allow custom tag attributes from a wider range of characters than for HTML tags.
  • New DIHtmlParser_MailMerge demo.

DIHtmlParser 5.0.1 – 31 Jan 2009

  • TDIHtmlParser: When parsing JavaScript, a forward slash “/” inside a regular expression character class was not recognized as such and could lead to an infinite loop.
  • TDIHtmlCharSetPlugin: Correct decoding function for “GBK” encoding which did not read the 1 to 127 character range.
  • Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.

DIHtmlParser 5.0.0 – 24 Nov 2008

  • Delphi 2009 support.

DIHtmlParser 4.5.0 – 30 Jul 2008

  • TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value immediately followed by a '{' character (HTML 4.01 Section B.7.1).
  • Multiple fixes for filtering, most notably for TDITagFilters.SetStart.
  • Better HTML title parsing according to how Firefox does it.
  • TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
  • Modify DIHtmlParser_C6.bpk so that it should compile and install again with C++ Builder 6.
  • CharSetConverter demo: Add BOM detection.
  • ExtractText demo: Optional Unicode output controlled by compiler directive. Also add more tags to improve HTML → Text conversion.
  • WebDownload demo: Improve generation of document names if URI has a query part.
  • WriterPlugin demo: Support DIHtmlParser1.EnableHtmlTags.
  • Some new, simple console demos inspired by support questions.
  • Improve compatibility for parallel installation with other DI packages.
  • Some code cleanup.

DIHtmlParser 4.4.1 – 15 May 2007

  • Add some missing units to the DIHtmlParser *.dpk packages to suppress irritating hints during compilation.

DIHtmlParser 4.4.0 – 13 May 2007

  • Delphi 2007 support.
  • New HTML parser plugins:
    • TDIHtmlLinksPlugin2.
    • TDIHtmlCollectLinksPlugin.
    • TDIHtmlChangeLinksPlugin.
  • Compatibility with DIConverters 1.11. If you are using DIHtmlParser with DIConverters and encounter incompatibility problems after upgrading to this new version, be sure to use the new version of DIConverters as well.
  • Add XP Themes to Demo projects.

DIHtmlParser 4.3.1 – 20 Jun 2006

  • Fixed a problem when parsing certain kinds of regular expression escapes in JavaScript.
  • Reduced memory requirements for quickly skipping over JavaScript.
  • Fixed filtering bugs in TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.

DIHtmlParser 4.3 – 28 Dec 2005

  • Added compatibility with Delphi 2006 Win32.

DIHtmlParser 4.2 – 14 Oct 2005

  • New TDIHtmlParser.EnableHtmlTags property which controls if HTML tags are properly recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
  • New TDIHtmlParser.TrimAttribValues property which controls whether whitespace is automatically trimmed when parsing the attribute values of tags.
  • Improved parsing of CustomTags and ASP.
  • Fixed an error which could prematurely stop TDIUnicodeReader when a pushed source was popped at the end of a nested document.
  • Added Delphi 3 compatibility to the utility units.
  • Resolved dependency issues when DIHtmlParser is used in parallel with other DI products.

DIHtmlParser 4.1.1 – 2 Sep 2005

  • Eliminated some compiler warnings regarding C++ Builder compatibility.
  • Fixed a small packaging bug in the Demo edition which unfortunately slipped into the last update.

DIHtmlParser 4.1 – 31 Aug 2005

  • Improved parsing of script contents:
    • Extended the internal JavaScript parser in order to improve the recognition of '/…/' regular expressions within JavaScript. Due to the nature of the JavaScript syntax, there is no 100% safe way to tell the difference between '/' as a division sign and '/' as the beginning of a regular expression, but the algorithm applied does a pretty good job and fixes a problem which occurred with certain HTML documents.
    • The new advanced JavaScript parsing is now the default, unless the script is identified as not being JavaScript.
    • The appropriate <META …> tag is being read to determine the default scripting language. The current content script type is available via the TDIHtmlParser.ContentScriptType property.
    • New TDIHtmlParser.DefaultContentScriptType property to determine the content script type from outside the HTML document.
  • Compatibility with other DI products.

DIHtmlParser 4.0 – 14 Apr 2005

  • Added the option to link DIHtmlParser against DIConverters, which enables DIHtmlParser to read and write 130+ character encodings.
  • Added native Pascal implementation for reading / decoding and writing / encoding the following character sets:
    • Mac Arabic, Mac Dingbats, Mac Central Europe, Mac Croatian, Mac Cyrillic, Mac Farsi, Mac Greek, Mac Hebrew, Mac Iceland, Mac Roman, Mac Romanian, Mac Thai, Mac Turkish
    • UCS-2 LE, UCS-2 BE
    • UCS-4 LE, UCS-4 BE
    • UTF-32 LE, UTF-32 BE
    • UTF-7 (Write_UTF_7 / Read_UTF_7)
    • UTF-7 Optional Direct Characters (Write_UTF_7_ODC / reads as Read_UTF_7)
    • JIS X0201, NextStep, TIS 620
  • Improved the parser's handling of malicious markup frequently used in Spam E-Mail: The parser now treats invalid tags (like '<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
  • Improved parsing of non-conformant XML Processing Instruction (XmlPI), marked as '<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification conformant parsing would then cause DIHtmlParser unintentionally to interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
  • Improved the recognition of HTML entities lacking a terminating semicolon character (like '&nbsp') in some cases.
  • Added mapping of some illegal but commonly used HTML numeric entities into their appropriate Unicode value.
  • Changed the TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll as well as to TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which did not yet return to the caller.
  • Introduced TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag. Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
  • Added a TDIHtmlWriterPlugin.PredefinedEntities option which allows specifying some known predefined entities which will always be encoded by default when writing HTML text, regardless of other entity registrations.
  • Shortened procedure name of TDITag.ForceAttribValue to TDITag.ForceAttrib.
  • TDITag and descendant classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.

products/htmlparser/history.txt · Last modified: 2023/11/23 10:18