Delphi Inspiration

Components and Applications


YuPcre2: Version History

YuPcre2 is a new regular expression library for Delphi with Perl syntax. It directly supports UnicodeString, AnsiString, and UCS4String, as well as UTF-8 and UTF-16.

YuPcre2 1.7.0 – 16 Aug 2017

  • Implement PCRE2_ENDANCHORED, coEndAnchored, and moEndAnchored.
  • Add an explicit limit on the amount of heap used by pcre2_match, set by pcre2_set_heap_limit, TDIPerlRegEx2_8.HeapLimit, TDIDfaRegEx2_16.HeapLimit, and the pattern start (*LIMIT_HEAP=xxx).
  • Extend auto-anchoring etc. to ignore groups with a zero quantifier and single-branch conditions with a false condition (e.g. DEFINE) at the start of a branch. For example, (?(DEFINE)…)^A and (…){0}^B are now flagged as anchored.
  • Implement PCRE2_EXTENDED_MORE and coExtendedMore, and related /xx and (?xx) features.
  • Implement (?n: for PCRE2_NO_AUTO_CAPTURE and coNoAutoCapture, because Perl now has this.
  • Implement extra compile options in the compile context:
    • PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES and coAllowSurrogateEscapes;
    • PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL and coBadEscapeIsLiteral;
    • PCRE2_EXTRA_MATCH_LINE and coMatchLine;
    • PCRE2_EXTRA_MATCH_WORD and coMatchWord.
  • Implement newline type PCRE2_NEWLINE_NUL.
  • A lookbehind assertion that had a zero-length branch caused undefined behaviour when processed by pcre2_dfa_match.
  • The match limit value now also applies to pcre2_dfa_match as there are patterns that can use up a lot of resources without necessarily recursing very deeply.
  • Implement PCRE2_LITERAL and coLiteral.
  • Increased the limit for searching for a “must be present” code unit in subjects from 1000 to 2000 for 8-bit searches, since they are much faster.
  • Arrange for anchored patterns to record and use “first code unit” data, because this can give a fast “no match” without searching for a “required code unit”. Previously only non-anchored patterns did this.
  • Upgraded the Unicode tables from Unicode 8.0.0 to Unicode 10.0.0.
  • Update extended grapheme breaking rules to the latest set that are in Unicode Standard Annex #29.
  • Added experimental foreign pattern conversion facilities (pcre2_pattern_convert and friends).
  • If a hyphen that follows a character class is the last character in the class, Perl does not give a warning. PCRE2 now also treats this as a literal.
  • PCRE2 was not throwing an error for [\d-X] (and similar escapes), as is documented.
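
The match-line, match-word, and literal options listed above can be pictured as pattern rewrites. The sketch below is only an analogue written with Python's re module, not the YuPcre2 or PCRE2 API: it assumes, following the upstream PCRE2 documentation, that PCRE2_EXTRA_MATCH_LINE behaves like wrapping the pattern in ^(?: … )$ and PCRE2_EXTRA_MATCH_WORD like wrapping it in \b(?: … )\b, and it uses re.escape as a stand-in for PCRE2_LITERAL. The helper names match_line and match_word are hypothetical.

```python
import re

# Hypothetical helper: emulate "match complete lines only" by wrapping the pattern.
def match_line(pattern: str) -> str:
    return "^(?:" + pattern + ")$"

# Hypothetical helper: emulate "match whole words only" by wrapping the pattern.
def match_word(pattern: str) -> str:
    return r"\b(?:" + pattern + r")\b"

text = "cat\nconcatenate\ncat nap"

# Only the first line consists of nothing but "cat".
whole_lines = re.findall(match_line("cat|dog"), text, flags=re.MULTILINE)

# "cat" inside "concatenate" is not a whole word, so it is skipped.
whole_words = re.findall(match_word("cat|dog"), text)

# Literal matching: every metacharacter is treated as an ordinary character.
literal = re.compile(re.escape("a.b*"))
literal_hit = literal.search("xa.b*y")
literal_miss = literal.search("aXbbb")

print(whole_lines)
print(whole_words)
print(literal_hit is not None, literal_miss is None)
```

In YuPcre2 itself these behaviours are selected through the compile options named in the list (coMatchLine, coMatchWord, coLiteral) rather than by rewriting the pattern text.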

YuPcre2 1.6.0 – 3 Apr 2017

New features:

  • Support Delphi 10.2 Tokyo Win32 and Win64.
  • The main interpreter, pcre2_match, has been refactored into a new version that does not use recursive function calls (and therefore the stack) for remembering backtracking positions. The new implementation allows backtracking into recursive group calls in patterns, making it more compatible with Perl, and also fixes some other hard-to-do issues.
    • Now that pcre2_match no longer uses recursive function calls (see above), the “match limit recursion” value seems misnamed. It still exists, and limits the depth of tree that is searched. To avoid future confusion, it has been renamed as “depth limit” in all relevant places (TDIRegEx2Base.MatchLimitDepth, PCRE2_INFO_DEPTHLIMIT, PCRE2_CONFIG_DEPTHLIMIT, PCRE2_ERROR_DEPTHLIMIT, pcre2_set_depth_limit, etc.) but the old names are still available for backwards compatibility.
    • PCRE2_CONFIG_STACKRECURSE is no longer used and is deprecated.
  • Added the PCRE2_INFO_FRAMESIZE item to pcre2_pattern_info and the InfoFrameSize property to TDIRegEx2_8 and TDIRegEx2_16.
  • The depth (formerly recursion) limit now applies to DFA matching.

Bug fixes:

  • In the 32-bit library in non-UTF mode, an attempt to find a Unicode property for a character with a code point greater than 0x10ffff (the Unicode maximum) caused a crash.
  • If a lookbehind assertion that contained a back reference to a group appearing later in the pattern was compiled with the PCRE2_ANCHORED option, undefined actions (often a segmentation fault) could occur, depending on what other options were set. An example assertion is (?<!\1(abc)) where the reference \1 precedes the group (abc).
  • Fix memory leak in pcre2_serialize_decode when the input is invalid.
  • Fix potential nil dereference in pcre2_callout_enumerate if called with a nil pattern pointer.
  • The alternative matching function, pcre2_dfa_match, misbehaved if it encountered a character class with a possessive repeat, for example [a-f]{3}+.

YuPcre2 1.5.0 – 17 Feb 2017

New features:

  • Implemented pcre2_code_copy_with_tables.
  • \g{+<number>} (e.g. \g{+2}) is now supported. It is a “forward back reference” and can be useful in repetitions (compare \g{-<number>}). Perl does not recognize this syntax.

Optimizations:

  • When a pattern is too complicated, PCRE2 gives up trying to find a minimum matching length and just records zero. Typically this happens when there are too many nested or recursive back references. If the limit was reached in certain recursive cases it failed to be triggered and an internal error could be the result.
  • The pcre2_dfa_match function now takes note of the recursion limit for the internal recursive calls that are used for lookrounds and recursions within the pattern.
  • Detecting patterns that are too large inside the length-measuring loop saves processing ridiculously long patterns to their end.
  • When autopossessifying, skip empty branches without recursion, to reduce stack usage. Example pattern: X?(R||){3335}.
  • A pattern with very many explicit back references to a group that is a long way from the start of the pattern could take a long time to compile because searching for the referenced group in order to find the minimum length was being done repeatedly. Now up to 128 group minimum lengths are cached and the attempt to find a minimum length is abandoned if there is a back reference to a group whose number is greater than 128. (In that case, the pattern is so complicated that this optimization probably isn't worth it.)

Bug fixes:

  • In any wide-character mode (8-bit UTF or any 16-bit or 32-bit mode), without PCRE2_UCP set, a negative character type such as \D in a positive class should cause all characters greater than 255 to match, whatever else is in the class. There was a bug that caused this not to happen if a Unicode property item was added to such a class, for example [\D\P{Nd}] or [\W\pL].
  • There has been a major re-factoring of pcre2_compile. Most syntax checking is now done in the pre-pass that identifies capturing groups. While doing this, some minor bugs and Perl incompatibilities were fixed, including:
    1. \Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead of giving an invalid quantifier error.
    2. {0} can now be used after a group in a lookbehind assertion; previously this caused an “assertion is not fixed length” error.
    3. Perl always treats (?(DEFINE) as a “define” group, even if a group with the name “DEFINE” exists. PCRE2 now does likewise.
    4. A recursion condition test such as (?(R2)…) must now refer to an existing subpattern.
    5. A conditional recursion test such as (?(R)…) misbehaved if there was a group whose name began with “R”.
    6. A hyphen appearing immediately after a POSIX character class (for example [[:ascii:]-z]) now generates an error. Perl does accept this as a literal, but gives a warning, so it seems best to fail it in PCRE.
    7. An empty \Q\E sequence may appear after a callout that precedes an assertion condition (it is, of course, ignored).

      One effect of the refactoring is that some error numbers and messages have changed, and the pattern offset given for compiling errors is not always the right-most character that has been read. In particular, for a variable-length lookbehind assertion it now points to the start of the assertion. Another change is that when a callout appears before a group, the “length of next pattern item” that is passed now just gives the length of the opening parenthesis item, not the length of the whole group. A length of zero is now given only for a callout at the end of the pattern.
  • Back references are now permitted in lookbehind assertions when there are no duplicated group numbers (that is, (?| has not been used), and, if the reference is by name, there is only one group of that name. The referenced group must, of course, be of fixed length.
  • Automatic callouts are no longer generated before and after explicit callouts in the pattern.
  • A number of bugs have been mended relating to match start-up optimizations when the first thing in a pattern is a positive lookahead. These all applied only when PCRE2_NO_START_OPTIMIZE was *not* set:
    1. A pattern such as (?=.*X)X$ was incorrectly optimized as if it needed both an initial 'X' and a following 'X'.
    2. Some patterns starting with an assertion that started with .* were incorrectly optimized as having to match at the start of the subject or after a newline. There are cases where this is not true, for example, (?=.*[A-Z])(?=.{8,16})(?!.*[\s]) matches after the start in lines that start with spaces. Starting .* in an assertion is no longer taken as an indication of matching at the start (or after a newline).
  • A pattern with PCRE2_DOTALL (/s) set but not PCRE2_NO_DOTSTAR_ANCHOR, and which started with .* inside a positive lookahead was incorrectly being compiled as implicitly anchored.
  • Fix out-of-bounds read for partial matching of . against an empty string when the newline type is CRLF.
  • The appearance of \p, \P, or \X in a substitution string when PCRE2_SUBSTITUTE_EXTENDED was set caused a segmentation fault (nil dereference).
  • If the starting offset was specified as greater than the subject length in a call to pcre2_substitute, an out-of-bounds memory reference could occur.
  • Incorrect data was compiled for a pattern with PCRE2_UCP set without PCRE2_UTF if a class required all wide characters to match (for example, [\s[:^ascii:]]).
  • The limit in the auto-possessification code that was intended to catch overly-complicated patterns and not spend too much time auto-possessifying was being reset too often, resulting in very long compile times for some patterns. Now such patterns are no longer completely auto-possessified.
  • Ignore PCRE2_CASELESS when processing \h, \H, \v, and \V in classes as it just wastes time. In the UTF case it can also produce redundant entries in XCLASS lists caused by characters with multiple other cases and pairs of characters in the same “not-x” sublists.
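
The two start-up optimization fixes for patterns that begin with a positive lookahead can be checked by hand. The sketch below uses Python's re module purely as a stand-in, on the assumption that it agrees with PCRE2 on these particular lookahead patterns; it is not the YuPcre2 API. It shows that (?=.*X)X$ needs only one 'X', and that the password-style pattern can match after the start of a line that begins with spaces, which is why treating a leading .* inside an assertion as an implicit anchor was wrong. The subject strings are made up for illustration.

```python
import re

# Pattern 1: the lookahead and the literal X are satisfied by the same character,
# so a subject with a single 'X' must match.
single_x = re.search(r"(?=.*X)X$", "aX")

# Pattern 2: lookaheads that start with ".*" do not force a match at position 0.
password = re.compile(r"(?=.*[A-Z])(?=.{8,16})(?!.*[\s])")
subject = "   Password1"  # three leading spaces

unanchored = password.search(subject)  # tries every start position
at_start = password.match(subject)     # only tries position 0

print(single_x.span())     # (1, 2)
print(unanchored.start())  # 3, the first position with no whitespace ahead
print(at_start)            # None, an anchored attempt fails
```

An engine that wrongly assumed the second pattern was anchored would behave like the at_start case and report no match.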

YuPcre2 1.4.0 – 31 Jul 2016

New features:

  • Implemented pcre2_code_copy to make a copy of a compiled pattern.
  • Implemented the PCRE2_NO_JIT option for pcre2_match and moNoJit option for TDIRegEx2Base.MatchOptions.
  • Calls to pcre2_get_error_message with error numbers that are never returned by PCRE2 functions were returning empty strings. Now the error code PCRE2_ERROR_BADDATA is returned.
  • Allow \C in lookbehinds and DFA matching in UTF-32 mode.

Bug fixes:

  • Detect unmatched closing parentheses and give the error in the pre-scan instead of later. Previously the pre-scan carried on and could give a misleading incorrect error message. For example, (?J)(?'a'))(?'a') gave a message about invalid duplicate group names.
  • A pattern that included (*ACCEPT) in the middle of a sufficiently deeply nested set of parentheses of sufficient size caused an overflow of the compiling workspace (which was diagnosed, but of course is not desirable).
  • Detect missing closing parentheses during the pre-pass for group identification.
  • Fix a race condition in JIT.
  • Fix register overwrite in JIT when SSE2 acceleration is enabled.

YuPcre2 1.3.0 – 7 May 2016

  • Support Delphi 10.1 Berlin Win32 and Win64.

YuPcre2 1.2.0 – 4 Mar 2016

New features:

  • New option to limit the length of a pattern: TDIRegEx2Base.MaxPatternLength and pcre2_set_max_pattern_length.
  • New option to limit the offset of unanchored matches: TDIRegEx2Base.OffsetLimit and pcre2_set_offset_limit.
  • New pcre2_substitute options PCRE2_SUBSTITUTE_EXTENDED, PCRE2_SUBSTITUTE_UNSET_EMPTY, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, and PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

Bug fixes:

  • In a character class such as [\W\p{Any}] where both a negative-type escape (“not a word character”) and a property escape were present, the property escape was being ignored.
  • Fixed integer overflow for patterns whose minimum matching length is very, very large.
  • The special sequences [[:<:]] and [[:>:]] gave rise to incorrect compiling errors or other strange effects if compiled in UCP mode.
  • Adding group information caching improves the speed of compiling when checking whether a group has a fixed length and/or could match an empty string, especially when recursion or subroutine calls are involved.
  • If [:^ascii:] or [:^xdigit:] are present in a non-negated class, all characters with code points greater than 255 are in the class. When a Unicode property was also in the class (if PCRE2_UCP is set, escapes such as \w are turned into Unicode properties), wide characters were not correctly handled, and could fail to match. Negated classes such as [^[:^ascii:]\d] were also not working correctly in UCP mode.
  • If PCRE2_AUTO_CALLOUT was set on a pattern that had a (?# comment between an item and its quantifier (for example, A(?#comment)?B), pcre2_compile misbehaved.
  • Similarly, if an isolated \E was present between an item and its quantifier when PCRE2_AUTO_CALLOUT was set, pcre2_compile misbehaved.
  • The error for an invalid UTF pattern string always gave the code unit offset as zero instead of where the invalidity was found.
  • An empty \Q\E sequence between an item and its quantifier caused pcre2_compile to misbehave when auto callouts were enabled.
  • If both PCRE2_ALT_VERBNAMES and PCRE2_EXTENDED were set, and a (*MARK) or other verb “name” ended with whitespace immediately before the closing parenthesis, pcre2_compile misbehaved. Example: (*:abc ), but only when both those options were set.
  • In a number of places pcre2_compile was not handling NUL (zero) characters correctly.
  • If a pattern that was compiled with PCRE2_EXTENDED started with white space or a #-type comment that was followed by (?-x), which turns off PCRE2_EXTENDED, and there was no subsequent (?x) to turn it on again, pcre2_compile assumed that (?-x) applied to the whole pattern and consequently mis-compiled it. The fix for this bug means that a setting of any of the (?imsxU) options at the start of a pattern is no longer transferred to the options that are returned by PCRE2_INFO_ALLOPTIONS. In fact, this was an anachronism that should have changed when the effects of those options were all moved to compile time.
  • An escaped closing parenthesis in the “name” part of a (*verb) when PCRE2_ALT_VERBNAMES was set caused pcre2_compile to malfunction.

YuPcre2 1.1.0 – 15 Sep 2015

  • Support Delphi 10 Seattle Win32 and Win64.
  • Match limit check added to recursion.
  • Arrange for the UTF check in pcre2_match and pcre2_dfa_match to look only at the part of the subject that is relevant when the starting offset is non-zero.
  • Improve first character match in JIT with SSE2 on x86.
  • Fixed two assertion failures in JIT.
  • Fixed a corner case of range optimization in JIT.
  • Add the ${*MARK} facility to pcre2_substitute.
  • Implemented PCRE2_ALT_VERBNAMES and coAltVerbnames.
  • Fixed two issues in JIT.

YuPcre2 1.0.1 – 8 Aug 2015

  • Pathological patterns containing many nested occurrences of [: caused pcre2_compile to run for a very long time.
  • A missing closing parenthesis for a callout with a string argument was not being diagnosed, possibly leading to a buffer overflow.
  • A conditional group with only one branch has an implicit empty alternative branch and must therefore be treated as potentially matching an empty string.
  • If (?R was followed by - or +, incorrect behaviour occurred instead of a diagnostic being given.
  • Conditional groups whose condition was an assertion preceded by an explicit callout with a string argument might be incorrectly processed, especially if the string contained \Q.
  • Fix buffer overflow while checking a UTF-8 string if the final multi-byte UTF-8 character was truncated.
  • Finding the minimum matching length of complex patterns with back references and/or recursions can take a long time. There is now a cut-off that gives up trying to find a minimum length when things get too complex.
  • An optimization has been added that speeds up finding the minimum matching length for patterns containing repeated capturing groups or recursions.
  • If a pattern contained a back reference to a group whose number was duplicated as a result of appearing in a (?|…) group, the computation of the minimum matching length gave a wrong result, which could cause incorrect “no match” errors. For such patterns, a minimum matching length cannot at present be computed.
  • Added a check for integer overflow in conditions (?(<digits>) and (?(R<digits>).
  • Fixed an issue when \p{Any} inside an xclass did not read the current character.
  • The JIT compiler did not restore the control verb head in case of *THEN control verbs.
  • The way recursive references such as (?3) are compiled has been re-written because the old way was the cause of many issues. Now, conversion of the group number into a pattern offset does not happen until the pattern has been completely compiled. This does mean that detection of all infinitely looping recursions is postponed till match time. In the past, some easy ones were detected at compile time.
  • A test for a back reference to a non-existent group was missing for items such as \987. This caused incorrect code to be compiled.
  • Error messages for syntax errors following \g and \k were giving inaccurate offsets in the pattern.
  • Improve the performance of starting single character repetitions in JIT.
  • (*LIMIT_MATCH=) now gives an error instead of setting the value to 0.
  • Error messages for syntax errors in *LIMIT_MATCH and *LIMIT_RECURSION now give the right offset instead of zero.
  • The JIT compiler should not check repeats after a {0,1} repeat byte code.
  • The JIT compiler should restore the control chain for empty possessive repeats.
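
The fix concerning conditional groups with only one branch rests on the implicit empty alternative. A minimal sketch of that semantics follows, using Python's re module as a stand-in on the assumption that it handles this conditional syntax the same way as PCRE2; the pattern and subjects are made up for illustration.

```python
import re

# (a)? optionally captures group 1; (?(1)b) requires 'b' only if group 1 matched.
# There is no explicit "else" branch, so the implicit one matches the empty string.
cond = re.compile(r"(a)?(?(1)b)")

matches_ab = cond.fullmatch("ab") is not None    # condition true, 'b' consumed
matches_empty = cond.fullmatch("") is not None   # condition false, empty branch
matches_a = cond.fullmatch("a") is not None      # 'b' missing, no full match

print(matches_ab, matches_empty, matches_a)
```

Because the whole pattern can match the empty string through the implicit branch, a compiler that computes a minimum match length has to treat such a group as potentially empty.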

YuPcre2 1.0.0 – 22 Jul 2015

  • Initial release.
products/pcre2/history.txt · Last modified: 2017/08/17 18:13 (external edit)