products:htmlparser:index
Delphi 12 Athens Updates Available!
To download, click your product: DIContainers, DIConverters, DICreole, DIFileFinder, DIGoogleReader, DIHtmlLabel, DIHtmlParser, DIMime, DIRegEx, DISQLite3, DITidy, DIUcl, DIUnicode, DIXml, YuBrotli, YuImage, YuNetSurf, YuOpenSSL, YuPcre2, YuPdf, YuStemmer, YuXmlSec, YuZip.
— | products:htmlparser:index [2016/01/22 15:08] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
+ | ====== DIHtmlParser ====== | ||
+ | {{page> | ||
+ | |||
+ | ===== Overview ===== | ||
+ | |||
+ | * Full Unicode support (UnicodeString or WideString, depending on Delphi version). | ||
+ | * Reads and writes over 70 character sets natively (independent of the OS). More than 150 are supported with the help of [[products: | ||
+ | * Operates on TStreams, memory buffers or strings. | ||
+ | * Returns a single piece of HTML to the application at a time. | ||
+ | * Extends easily via the [[plugins|TDIHtmlParserPlugin]] interface. | ||
+ | |||
+ | ===== Recognized HTML Pieces ===== | ||
+ | |||
+ | DIHtmlParser recognizes 10 pieces of HTML plus 4 pieces of Non-HTML. | ||
+ | |||
+ | The HTML pieces are: | ||
+ | |||
+ | * **CData Sections:** CData Sections, found in XML, are used to escape blocks of text containing characters which would otherwise be recognized as markup. A CData section begins with ''< | ||
+ | |||
+ | * **Comments: | ||
+ | |||
+ | * **Document Type Definitions: | ||
+ | |||
+ | * **HTML Processing Instructions: | ||
+ | |||
+ | * **HTML-Tags: | ||
+ | |||
+ | * **Scripts: | ||
+ | |||
+ | * **Styles:** DIHtmlParser returns the contents between the ''< | ||
+ | |||
+ | * **Text:** Text is everything which is not markup. If the '' | ||
+ | |||
+ | * **Titles:** DIHtmlParser returns the contents between the ''< | ||
+ | |||
+ | * **XML Processing Instructions: | ||
+ | |||
+ | The Non-HTML pieces are: | ||
+ | |||
+ | * **Active Server Pages (ASP):** Active Server Page markup is often used to enclose scripting macros. It begins with ''< | ||
+ | |||
+ | * **Custom-Tags: | ||
+ | |||
+ | * **PHP:** PHP is a powerful and popular scripting language. Its markup begins with ''<? | ||
+ | |||
+ | * **Server Side Includes (SSI):** SSI, an extension of the Apache Web Server, starts with ''< | ||
+ | |||
+ | ===== Parsing Efficiency ===== | ||
+ | |||
+ | DIHtmlParser is extremely fast, especially when parsing huge files. Thanks to its internal buffer mechanism, it does not need to load the entire file into memory at once but reads one small chunk at a time. DIHtmlParser parses up to 50 000 tags per second even on an outdated 166 MHz processor. On modern machines, throughput exceeds 15 MB of HTML data per second. | ||
+ | |||
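The buffered reading described above can be illustrated with a minimal Python sketch. It is not DIHtmlParser code: `count_tags` and `chunk_size` are hypothetical names, and the scanner is deliberately naive (it ignores comments, scripts and `>` inside attribute values). It only shows how parser state can survive chunk boundaries so that the whole file never has to be in memory.

```python
import io


def count_tags(stream, chunk_size=4096):
    """Count '<...>' runs in a text stream, one small chunk at a time.

    Hypothetical illustration only: a real HTML parser must also handle
    comments, scripts and quoted attribute values.
    """
    count = 0
    in_tag = False  # state carried across chunk boundaries
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty string signals end of stream
            break
        for ch in chunk:
            if ch == "<":
                in_tag = True
            elif ch == ">" and in_tag:
                in_tag = False
                count += 1
    return count


# A tiny chunk size forces tags to straddle chunk boundaries.
sample = io.StringIO("<p>Hello <b>world</b></p>")
total = count_tags(sample, chunk_size=3)
```

Because `in_tag` lives outside the read loop, a tag split across two chunks is still counted once.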
+ | DIHtmlParser only parses what it needs to parse. Thanks to its filtering mechanism, the parser can skip all pieces of HTML which the application did not request. Even though the parser must eventually touch every character of an HTML document, it might only need to store a fraction of that data for further processing. We call this "Smart Parsing", | ||
+ | |||
+ | Another trick of "Smart Parsing" | ||
+ | |||
+ | ===== Individual Tag Filtering ===== | ||
+ | |||
+ | Tag filtering extends the general filtering to individual tags. It lets the programmer instruct the parser to hold back all tags which are not relevant to the application. Why bother with ''< | ||
+ | |||
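As a rough illustration of per-tag filtering, the following Python sketch uses the standard library's `html.parser.HTMLParser` rather than DIHtmlParser. `FilteredTagParser` and `wanted_tags` are hypothetical names. The sketch filters inside a callback after the tag has been parsed, whereas the text above describes filtering inside the parser itself, so it shows the idea and not the mechanism.

```python
from html.parser import HTMLParser


class FilteredTagParser(HTMLParser):
    """Report only start tags whose names are in `wanted_tags` (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        # HTMLParser passes tag names in lower case, so normalize the filter too.
        self.wanted_tags = {t.lower() for t in wanted_tags}
        self.found = []

    def handle_starttag(self, tag, attrs):
        # All other tags are held back and never reach the application.
        if tag in self.wanted_tags:
            self.found.append((tag, dict(attrs)))


parser = FilteredTagParser({"a", "img"})
parser.feed('<p><a href="x.html">link</a><b>bold</b><img src="y.png"></p>')
parser.close()
```

After the call, `parser.found` holds entries for the `a` and `img` tags only. The `p` and `b` tags were seen by the parser but never passed on.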
+ | ===== Further Customization ===== | ||
+ | |||
+ | [[plugins|DIHtmlParser Plugins]] are the next step to customized HTML parsing. A single instance of TDIHtmlParser can run any number of parsing processes in parallel to its main parsing process. Each [[plugins|plugin]] features its own flexible filtering mechanism, just like the main parser. The plugin architecture keeps overhead to a minimum because each plugin informs the parser of its requirements before parsing starts. So even with many plugins in effect, DIHtmlParser will never parse more than what your application actually asks for. | ||
+ | |||
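One way to picture plugins that declare their requirements ahead of parsing is the Python sketch below. It is an assumption-laden illustration, not the TDIHtmlParserPlugin interface: `Plugin`, `PluginHost`, `wanted_tags` and `on_tag` are invented names, and the host is built on the standard library's `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class Plugin:
    """Hypothetical plugin: declares the tags it needs and records what it receives."""

    def __init__(self, wanted_tags):
        self.wanted_tags = set(wanted_tags)
        self.seen = []

    def on_tag(self, tag):
        self.seen.append(tag)


class PluginHost(HTMLParser):
    """Hypothetical host: a single parsing pass that serves several plugins."""

    def __init__(self, plugins):
        super().__init__()
        self.plugins = list(plugins)
        # Requirements are collected once, before any input is parsed.
        self.needed = set().union(*(p.wanted_tags for p in self.plugins))

    def handle_starttag(self, tag, attrs):
        if tag not in self.needed:
            return  # no plugin asked for this tag, so do no further work
        for plugin in self.plugins:
            if tag in plugin.wanted_tags:
                plugin.on_tag(tag)


links = Plugin({"a"})
images = Plugin({"img"})
host = PluginHost([links, images])
host.feed('<a href="#"></a><p><img src="i.png"><a></a>')
host.close()
```

Computing the union of all requirements once lets the host reject unrequested tags with a single set lookup, however many plugins are attached.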
+ | More information on DIHtmlParser Plugins is available [[plugins|on this page]]. | ||
+ | |||
+ | {{tag> |
products/htmlparser/index.txt · Last modified: 2016/01/22 15:08 by 127.0.0.1