Yunqa • The Delphi Inspiration

Delphi Components and Applications

DIRegEx: Version History

DIRegEx is a library of components and procedures for Delphi (Embarcadero / CodeGear / Borland) that implement regular expression pattern matching using the same syntax and semantics as Perl.

DIRegEx 8.16.0 – 22 Nov 2023

  • Support Delphi 12 Athens Win32 and Win64.

DIRegEx 8.15.0 – 16 Sep 2021

  • Support Delphi 11 Alexandria Win32 and Win64.

DIRegEx 8.14.0 – 14 Jul 2021

In mid-2015, DIRegEx was superseded by the YuPcre2 Delphi Regular Expression Library. Since then, both DIRegEx and YuPcre2 have been developed in parallel. DIRegEx continued to receive security fixes, optimizations, and even improvements. Development of new features, however, took place in YuPcre2.

Beginning with DIRegEx 8.12.0 (25 Mar 2020), development has focused entirely on YuPcre2, and DIRegEx has received only a few updates. Starting with this version, DIRegEx updates will be even less frequent. This may even be the last release.

Users of DIRegEx are encouraged to deploy YuPcre2 for the newest Delphi regular expression technology. To ease migration, the YuPcre2 units and components are named as closely as possible to those of DIRegEx. A demo is available to start the conversion of existing projects, including testing.


Changes in this version:

  • Fix a memory leak when a compile error occurs in a pattern with more than 20 named groups.
  • Fix a (*MARK) bug in the interpreter.

DIRegEx 8.13.0 – 5 Jun 2020

  • Support Delphi 10.4 Sydney Win32 and Win64.

DIRegEx 8.12.0 – 25 Mar 2020

  • JIT compiler update.
  • Check the size of the number after (?C as it is read, in order to avoid integer overflow.
  • Update DIUtils.pas Unicode functions to Unicode 13.0.

DIRegEx 8.11.0 – 8 Oct 2019

  • JIT compiler update.
  • Improve DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
  • Update DIUtils.pas Unicode functions to Unicode 12.1.0.
  • Remove DI.inc include file. Directly link in DICompilers.inc instead.
  • Remove support for the DI_No_RegEx_Component compiler directive. TDIRegEx now always descends from TComponent. Source code only.
  • Remove support for the DI_No_Classes compiler directive. The Classes unit is always used. Source code only.

DIRegEx 8.10.0 – 7 Mar 2019

  • Fix: TDIRegEx.Replace and TDIRegEx16.Replace did not return the start of the string if StartOffset > 0.
  • Adjust TDIRegExSearchStream_Enc to DIConverters 1.18.0: Converter functions now use the native unsigned integer type for the length of a string and support strings longer than 2 GB. This change only affects projects using DIConverters.
  • In a pattern such as [^\x{100}-\x{ffff}]*[\x80-\xff] which has a repeated negative class with no characters less than 0x100 followed by a positive class with only characters less than 0x100, the first class was incorrectly being auto-possessified, causing incorrect match failures.
  • If the only branch in a conditional subpattern was anchored, the whole subpattern was treated as anchored, when it should not have been, since the assumed empty second branch cannot be anchored. Demonstrated by test patterns such as (?(1)^())b or (?(?=^))b.
  • Fix subject buffer overread in JIT when UTF is disabled and \X or \R has a fixed quantifier greater than 1.
  • If a pattern started with a subroutine call that had a quantifier with a minimum of zero, an incorrect “match must start with this character” could be recorded. Example: (?&xxx)*ABC(?<xxx>XYZ) would (incorrectly) expect 'A' to be the first character of a match.
  • Using pcre_dfa_exec, in UTF mode when UCP support was not defined, there was the possibility of a false positive match when caselessly matching a “not this character” item such as [^\x{1234}] (with a code point greater than 127) because the “other case” variable was not being initialized.
  • Although pcre_jit_exec checks whether the pattern is compiled in a given mode, it was also expected that at least one mode is available. This is fixed and pcre_jit_exec returns with PCRE_ERROR_JIT_BADOPTION when the pattern is not optimized by JIT at all.
  • If a backreference with a minimum repeat count of zero was first in a pattern, apart from assertions, an incorrect first matching character could be recorded. For example, for the pattern (?=(a))\1?b, “b” was incorrectly set as the first character of a match.
  • Fix out-of-bounds read for partial matching of . against an empty string when the newline type is CRLF.
  • Matching the pattern (*UTF)\C[^\v]+\x80 against an 8-bit string containing multi-code-unit characters caused bad behaviour and possibly a crash.
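
Two of the fixes above (the auto-possessification bug and the backreference "first character" bug) concern optimizations that must not change what a pattern matches. The sketch below is only an illustration of the expected results. It uses Python's `re` module, which is a different engine from DIRegEx and serves here purely as a reference for the pattern semantics. As an assumption of this sketch, the `\x{...}` escapes are rewritten as `\uXXXX` because Python does not accept the brace form.

```python
import re

# Python spelling of [^\x{100}-\x{ffff}]*[\x80-\xff] (assumption: \uXXXX replaces \x{...}).
# The first class must be allowed to give back a character so the second class can match.
auto_possess = re.compile(r"[^\u0100-\uffff]*[\x80-\xff]")
m1 = auto_possess.fullmatch("abc\xe9")

# (?=(a))\1?b: a match has to begin at "a", so "b" must not be assumed as first character.
first_char = re.compile(r"(?=(a))\1?b")
m2 = first_char.search("xab")

print(m1 is not None)          # whole subject matched after backtracking
print(m2.start(), m2.group())  # match starts at the "a"
```

If the repeated negated class were treated as possessive, the first check would fail, which is the kind of incorrect match failure the changelog entry describes.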

DIRegEx 8.9.0 – 24 Dec 2018

  • Support Delphi 10.3 Rio Win32 and Win64.

DIRegEx 8.8.1 – 19 Jul 2017

  • A (?# style comment is now ignored between a basic quantifier and a following '+' or '?' (example: X+(?#comment)?Y).
  • In the 32-bit library in non-UTF mode, an attempt to find a Unicode property for a character with a code point greater than #$10FFFF (the Unicode maximum) caused a crash.
  • The alternative matching function pcre_dfa_exec misbehaved if it encountered a character class with a possessive repeat, for example [a-f]{3}+.

DIRegEx 8.8.0 – 3 Apr 2017

  • Support Delphi 10.2 Tokyo Win32 and Win64.

DIRegEx 8.7.2 – 13 Jan 2017

  • Fix register overwrite in JIT when SSE2 acceleration is enabled.
  • Fix JIT unaligned accesses on x86.
  • In any wide-character mode (8-bit UTF or any 16-bit or 32-bit mode), without PCRE_UCP set, a negative character type such as \D in a positive class should cause all characters greater than 255 to match, whatever else is in the class. There was a bug that caused this not to happen if a Unicode property item was added to such a class, for example [\D\P{Nd}] or [\W\pL].
  • A pattern such as (?<RA>abc)(?(R)xyz) was incorrectly compiled such that the conditional was interpreted as a reference to capturing group 1 instead of a test for recursion. Any group whose name began with R was misinterpreted in this way. (The reference interpretation should only happen if the group's name is precisely R.)
  • A number of bugs have been mended relating to match start-up optimizations when the first thing in a pattern is a positive lookahead. These all applied only when PCRE_NO_START_OPTIMIZE was not set:
    1. A pattern such as (?=.*X)X$ was incorrectly optimized as if it needed both an initial 'X' and a following 'X'.
    2. Some patterns starting with an assertion that started with .* were incorrectly optimized as having to match at the start of the subject or after a newline. There are cases where this is not true, for example, (?=.*[A-Z])(?=.{8,16})(?!.*[\s]) matches after the start in lines that start with spaces. Starting .* in an assertion is no longer taken as an indication of matching at the start (or after a newline).
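
The two start-up optimization cases above can be checked against any Perl-compatible engine. The following is a minimal sketch using Python's `re` module as a stand-in (an assumption of this example, since it is not the DIRegEx engine); the subject strings are made up for illustration.

```python
import re

# Case 1: a single "X" satisfies both the lookahead and the literal X,
# so the pattern must not require two separate "X" characters.
one_x = re.search(r"(?=.*X)X$", "X")

# Case 2: the match does not have to begin at the start of the line.
# The leading spaces make the negative lookahead fail at offsets 0 and 1.
password = re.compile(r"(?=.*[A-Z])(?=.{8,16})(?!.*[\s])")
after_start = password.search("  Password1")

print(one_x is not None)
print(after_start.start())  # offset of the first position without later white space
```

The second pattern consists only of assertions, so the match itself is empty; what matters is the offset at which it succeeds.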

DIRegEx 8.7.1 – 14 Jun 2016

  • Fix an intermittent access violation when compiling a pattern with JIT, reported on the mailing list. This affected Win32 only.
  • Fix a race condition in JIT.

DIRegEx 8.7.0 – 7 May 2016

  • Support Delphi 10.1 Berlin Win32 and Win64.

DIRegEx 8.6.9 – 7 Mar 2016

  • If PCRE_AUTO_CALLOUT was set on a pattern that had a (?# comment between an item and its quantifier (for example, A(?#comment)?B), pcre_compile misbehaved.
  • Similar to the above, if an isolated \E was present between an item and its quantifier when PCRE_AUTO_CALLOUT was set, pcre_compile misbehaved.
  • Negated classes such as [^[:^ascii:]\d] were not working correctly in UCP mode.
  • An empty \Q\E sequence between an item and its quantifier caused pcre_compile to misbehave when auto callouts were enabled.
  • If a pattern that was compiled with PCRE_EXTENDED started with white space or a #-type comment that was followed by (?-x), which turns off PCRE_EXTENDED, and there was no subsequent (?x) to turn it on again, pcre_compile assumed that (?-x) applied to the whole pattern and consequently mis-compiled it.
  • A call of pcre_copy_named_substring for a named substring whose number was greater than the space in the ovector could cause a crash.
  • Fix a buffer overflow bug which involved duplicate named groups with a group that reset capture numbers.
  • pcre_get_substring_list crashed if the use of \K in a match caused the start of the match to be earlier than the end.
  • Improvements to JIT.
  • A pattern that included (*ACCEPT) in the middle of a sufficiently deeply nested set of parentheses of sufficient size caused an overflow of the compiling workspace (which was diagnosed, but of course is not desirable).
  • Fix a buffer overflow bug involving nested duplicate named groups with a nested back reference.
  • An invalid pattern fragment such as (?(?C)0 was not diagnosing an error (“assertion expected”) when (?(?C) was not followed by an opening parenthesis.

DIRegEx 8.6.8 – 24 Nov 2015

  • Fixed a corner case of range optimization in JIT.
  • An incorrect error “overran compiling workspace” was given if there were exactly enough group forward references such that the last one extended into the workspace safety margin. The next one would have expanded the workspace. The test for overflow was not including the safety margin.
  • A match limit issue is fixed in JIT.
  • In a character class such as [\W\p{Any}] where both a negative-type escape (“not a word character”) and a property escape were present, the property escape was being ignored.
  • Fix crash caused by very long (*MARK) or (*THEN) names.
  • A sequence such as [[:punct:]b], that is, a POSIX character class followed by a single ASCII character in a class item, was incorrectly compiled in UCP mode. The POSIX class got lost, but only if the single character followed it.
  • [:punct:] in UCP mode was matching some characters in the range 128-255 that should not have been matched.
  • If [:^ascii:] or [:^xdigit:] or [:^cntrl:] are present in a non-negated class, all characters with code points greater than 255 are in the class. When a Unicode property was also in the class (if PCRE_UCP is set, escapes such as \w are turned into Unicode properties), wide characters were not correctly handled, and could fail to match.

DIRegEx 8.6.7 – 15 Sep 2015

  • Support Delphi 10 Seattle Win32 and Win64.

DIRegEx 8.6.6 – 24 Aug 2015

  • Quantification of certain items (e.g. atomic back references) could cause incorrect code to be compiled when recursive forward references were involved. Example pattern: (?1)()((((((\1++))\x85)+)|)).
  • A repeated conditional group whose condition was a reference by name caused a buffer overflow if there was more than one group with the given name.
  • A recursive back reference by name within a group that had the same name as another group caused a buffer overflow. Example pattern: (?J)(?'d'(?'d'\g{d})).
  • A forward reference by name to a group whose number is the same as the current group, for example in this pattern: (?|(\k'Pm')|(?'Pm')), caused a buffer overflow at compile time.
  • A lookbehind assertion within a set of mutually recursive subpatterns could provoke a buffer overflow.
  • Another buffer overflow bug involved duplicate named groups with a reference between their definition, with a group that reset capture numbers, for example: (?J:(?|(?'R')(\k'R')|((?'R')))).
  • There was no check for integer overflow in subroutine calls such as (?123).
  • The table entry for \l in EBCDIC environments was incorrect, leading to its being treated as a literal 'l' instead of causing an error.
  • There was a buffer overflow if pcre_exec was called with an ovector of size 1.
  • If a non-capturing group containing a conditional group that could match an empty string was repeated, it was not identified as matching an empty string itself. For example: ^(?:(?(1)x|)+)+$().
  • A pattern with an unmatched closing parenthesis that contained a backward assertion which itself contained a forward reference caused buffer overflow. An example pattern is: (?=di(?<=(?1))|(?=(.)))).
  • JIT should return with error when the compiled pattern requires more stack space than the maximum.
  • A possessively repeated conditional group that could match an empty string, for example, (?(R))*+, was incorrectly compiled.
  • Fix infinite recursion in the JIT compiler when certain patterns such as (?:|a|){100}x are analysed.
  • Some patterns with character classes involving [: and \\ were incorrectly compiled and could cause reading from uninitialized memory or an incorrect error diagnosis.
  • Pathological patterns containing many nested occurrences of [: caused pcre_compile to run for a very long time.
  • A conditional group with only one branch has an implicit empty alternative branch and must therefore be treated as potentially matching an empty string.
  • If (?R was followed by - or + incorrect behaviour happened instead of a diagnostic.
  • Arrange to give up on finding the minimum matching length for overly complex patterns.
  • In a pattern with duplicated named groups and an occurrence of (?| it is possible for an apparently non-recursive back reference to become recursive if a later named group with the relevant number is encountered. This could lead to a buffer overflow.
  • The JIT compiler did not restore the control verb head in case of *THEN control verbs.
  • Error messages for syntax errors following \g and \k were giving inaccurate offsets in the pattern.
  • Added a check for integer overflow in conditions (?(<digits>) and (?(R<digits>).
  • The JIT compiler should not check repeats after a {0,1} repeat byte code.
  • The JIT compiler should restore the control chain for empty possessive repeats.
  • Match limit check added to JIT recursion.

DIRegEx 8.6.5 – 4 May 2015

  • If a group that contained a recursive back reference also contained a forward reference subroutine call followed by a non-forward-reference subroutine call, for example .((?2)(?R)\1)(), pcre_compile failed to compile correct code, leading to undefined behaviour or an internally detected error.
  • The use of \K in a positive lookbehind assertion in a non-anchored pattern (e.g. (?<=\Ka)) could make pcregrep loop.
  • If a greedy quantified \X was preceded by \C in UTF mode (e.g. \C\X*), and a subsequent item in the pattern caused a non-match, backtracking over the repeated \X did not stop, but carried on past the start of the subject, causing reference to random memory and/or a segfault. There were also some other cases where backtracking after \C could crash.
  • The function for finding the minimum length of a matching string could take a very long time if mutual recursion was present many times in a pattern, for example, ((?2){73}(?2))((?1)). A better mutual recursion detection method has been implemented.

DIRegEx 8.6.4 – 25 Apr 2015

  • Support Delphi XE8 Win32 and Win64.

DIRegEx 8.6.3 – 2 Apr 2015

  • Fix a crash if /K and /F were both set with the option to save the compiled pattern.
  • Fix a crash if the option to print captured substrings in a callout was combined with setting a null ovector, for example \O\C+ as a subject string.
  • A pattern such as ((?2){0,1999}())?, which has a group containing a forward reference repeated a large (but limited) number of times within a repeated outer group that has a zero minimum quantifier, caused incorrect code to be compiled, leading to the error “internal error: previously-checked referenced subpattern not found” when an incorrect memory address was read.
  • A pattern such as ((?+1)(\1)) containing a forward reference subroutine call within a group that also contained a recursive back reference caused incorrect code to be compiled.
  • A pattern such as (?i)[A-`], where characters in the other case are adjacent to the end of the range, and the range contained characters with more than one other case, caused incorrect behaviour when compiled in UTF mode. In that example, the range a-j was left out of the class.
  • Fix JIT compilation of conditional blocks whose assertion is converted to (*FAIL). E.g. (?(?!)).
  • The pattern (?(?!)^) caused references to random memory.
  • The assertion (?!) is optimized to (*FAIL). This was not handled correctly when this assertion was used as a condition, for example (?(?!)a|b). In pcre_exec it worked by luck; in pcre_dfa_exec it gave an incorrect error about an unsupported item.
  • For some types of pattern, for example Z*(|d*){216}, the auto-possessification code could take exponential time to complete. A recursion depth limit of 1000 has been imposed to limit the resources used by this optimization.
  • In a pattern such as (*UTF)[\S\V\H], which contains a negated special class such as \S in non-UCP mode, explicit wide characters (> 255) can be ignored because \S ensures they are all in the class. The code for doing this was interacting badly with the code for computing the amount of space needed to compile the pattern, leading to a buffer overflow.
  • A pattern such as ((?2)+)((?1)) which has mutual recursion nested inside other kinds of group caused stack overflow at compile time.
  • A pattern such as (?1)(?#?'){8}(a) which had a parenthesized comment between a subroutine call and its quantifier was incorrectly compiled, leading to buffer overflow or other errors.
  • The illegal pattern (?(?<E>.*!.*)?) was not being diagnosed as missing an assertion after (?(. The code was failing to check the character after (?(?< for the ! or = that would indicate a lookbehind assertion.
  • A pattern such as X((?2)()*+){2}+ which has a possessive quantifier with a fixed maximum following a group that contains a subroutine reference was incorrectly compiled and could trigger buffer overflow.
  • A mutual recursion within a lookbehind assertion such as (?<=((?2))((?1))) caused a stack overflow instead of the diagnosis of a non-fixed length lookbehind assertion.

DIRegEx 8.6.2 – 27 Feb 2015

  • If an assertion condition was quantified with a minimum of zero (an odd thing to do, but it happened), SIGSEGV or other misbehaviour could occur.
  • Fixed a memory leak during matching that could occur for a subpattern subroutine call (recursive or otherwise) if the number of captured groups that had to be saved was greater than ten.
  • Catch a bad opcode during auto-possessification after compiling a bad UTF string with PCRE_NO_UTF[8|16|32]_CHECK. This is a tidyup, not a bug fix, as passing bad UTF with PCRE_NO_UTF[8|16|32]_CHECK is documented as having an undefined outcome.
  • A UTF pattern containing a “not” match of a non-ASCII character and a subroutine reference could loop at compile time. Example: [^\xff]((?1)).
  • When a pattern is compiled, it remembers the highest back reference so that when matching, if the ovector is too small, extra memory can be obtained to use instead. A conditional subpattern whose condition is a check on a capture having happened, such as, for example in the pattern ^(?:(a)|b)(?(1)A|B), is another kind of back reference, but it was not setting the highest backreference number. This mattered only if pcre_exec was called with an ovector that was too small to hold the capture, and there was no other kind of back reference (a situation which is probably quite rare). The effect of the bug was that the condition was always treated as FALSE when the capture could not be consulted, leading to incorrect behaviour by pcre_exec. This bug has been fixed.
  • A reference to a duplicated named group (either a back reference or a test for being set in a conditional) that occurred in a part of the pattern where PCRE_DUPNAMES was not set caused the amount of memory needed for the pattern to be incorrectly calculated, leading to overwriting.
  • A mutually recursive set of back references such as (\2)(\1) caused a segfault at study time (while trying to find the minimum matching length). The infinite loop is now broken (with the minimum length unset, that is, zero).
  • If an assertion that was used as a condition was quantified with a minimum of zero, matching went wrong. In particular, if the whole group had unlimited repetition and could match an empty string, a segfault was likely. The pattern (?(?=0)?)+ is an example that caused this. Perl allows assertions to be quantified, but not if they are being used as conditions, so the above pattern is faulted by Perl. PCRE has now been changed so that it also rejects such patterns.
  • A possessive capturing group such as (a)*+ with a minimum repeat of zero failed to allow the zero-repeat case if pcre_exec was called with an ovector too small to capture the group.
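
For readers unfamiliar with capture-test conditions, the pattern ^(?:(a)|b)(?(1)A|B) mentioned above takes the "A" branch only when group 1 has participated in the match. A small sketch of the intended behaviour follows; it uses Python's `re` module as a stand-in engine (an assumption, not the DIRegEx API) and made-up subject strings.

```python
import re

# (?(1)A|B): require "A" if group 1 (the "a") was captured, otherwise require "B".
cond = re.compile(r"^(?:(a)|b)(?(1)A|B)")

# Map each sample subject to whether the pattern matches it.
results = {s: cond.match(s) is not None for s in ("aA", "bB", "aB", "bA")}

for subject, matched in results.items():
    print(subject, matched)
```

The bug described above made the condition behave as if group 1 were never set when the ovector was too small, so a subject like "aA" would have been judged against the wrong branch.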

DIRegEx 8.6.1 – 19 Nov 2014

  • Fix bug when there are unset groups prior to (*ACCEPT) within a capturing group. When an (*ACCEPT) is triggered inside capturing parentheses, it arranges for those parentheses to be closed with whatever has been captured so far. However, it was failing to mark any other groups between the highest capture so far and the current group as “unset”. Thus, the ovector for those groups contained whatever was previously there. An example is the pattern (x)|((*ACCEPT)) when matched against “abcd”.

DIRegEx 8.6.0 – 3 Oct 2014

  • Support Delphi XE7 Win32 and Win64.
  • Potential security fix: A pattern such as ((?(R)a|(?1)))+, which contains a recursion within a group that is quantified with an indefinite repeat, caused a compile-time loop which used up all the system stack and provoked a segmentation fault which could cause the application to crash.

DIRegEx 8.5.2 – 5 Aug 2014

Improvements

  • New TDIRegEx.FormatSubStrChar property.
  • New TDIRegEx.FormatOptions and TDIRegEx16.FormatOptions properties.
  • New Occurrence argument for TDIRegEx.Replace2 and TDIRegEx16.Replace2.
  • If a pattern that started with a caseless match for a character with more than one “other case” was studied, PCRE did not set up the starting code unit bit map for the list of possible characters. Now it does.
  • The Unicode data tables have been updated to Unicode 7.0.0.
  • Documentation update: Inherited members for classes, constant groups, layout, plus lots of small changes.

Bug Fixes

  • Incorrect code was compiled if a group that contained an internal recursive back reference was optional (had a quantifier with a minimum of zero). This example compiled incorrect code: (((a\2)|(a*)\g<-1>))* and other examples caused segmentation faults because of stack overflows at compile time.
  • The JIT compiler did not generate match limit checks for certain bracketed expressions with quantifiers. This may lead to exponential backtracking, instead of returning with PCRE_ERROR_MATCHLIMIT. This issue should be resolved now.
  • Fixed an issue that occurred when nested alternatives were optimized with table jumps.
  • Fixed a bug concerned with zero-minimum possessive groups that could match an empty string, which sometimes were behaving incorrectly in the interpreter (though correctly in the JIT matcher).
  • Fixed a bug that was incorrectly auto-possessifying \w+ in the pattern ^\w+(?>\s*)(?<=\w) which caused it not to match “test test”.
  • Give a compile-time error for \o{} (as Perl does) and for \x{} (which Perl does not).
  • Fix a bug that caused the amount of memory needed to hold a pattern to be incorrectly computed (too small) when there were named back references to duplicated names. This could cause “internal error: code overflow” or “double free or corruption” or other memory handling errors.
  • When named subpatterns had the same prefixes, back references could be confused. For example, in this pattern: (?P<Name>a)?(?P<Name2>b)?(?(<Name>)c|d)*l the reference to 'Name' was incorrectly treated as a reference to a duplicate name.
  • A caseless pattern such as ^s?c where the optional character has more than one “other case” was incorrectly compiled such that it would only try to match starting at “c”.
  • When a pattern starting with \s was studied, VT was not included in the list of possible starting characters.
  • If a character class started [\Qx]… where x is any character, the class was incorrectly terminated at the ].

DIRegEx 8.5.0 – 28 Apr 2014

  • Support Delphi XE6 Win32 and Win64.

DIRegEx 8.1.1 – 8 Apr 2014

  • Improvements:
    • Optimize property checks when studying XCLASS-es.
    • Improve auto-possessification of character sets: a normal and an extended character set can be compared now. Furthermore the JIT compiler optimizes more character set checks.
    • Fast forward search is improved in JIT. Instead of the first three characters, any three characters with fixed position can be searched. Search order: first, last, middle.
    • Improve character range checks in JIT. Characters are now read by an imprecise function, which returns an unknown value if the character code is above a certain threshold (e.g. 256). The only limitation is that the value must be bigger than the threshold as well. This function is useful when the characters above the threshold are handled in the same way.
    • JIT uses table jumps for selecting the correct backtracking path, when more than four alternatives are present inside a bracket.
    • JIT avoids empty match checks if the minimum length is greater than zero, and there is no \K in the pattern.
    • Improve pattern prefix search by a simplified Boyer-Moore algorithm in JIT. The algorithm provides a way to skip certain starting offsets, and is usually faster than linear prefix searches.
    • Add a new global variable called pcre_stack_guard that can be set to point to an external function to check stack availability. It is called at the start of processing every parenthesized group.
  • Bug Fixes:
    • In a caseless character class with UCP support, when a character with more than one alternative case was not the first character of a range, not all the alternative cases were added to the class. For example, s and \x{17f} are both alternative cases for S: the class [RST] was handled correctly, but [R-T] was not.
    • The fast forward newline mechanism could enter an infinite loop on certain invalid UTF-8 input.
    • In ungreedy mode the max/min quantifier behaved like a min-possessive quantifier, and, for example, a{1,3}b did not match “ab”.
    • When UTF was disabled, the JIT program reported some incorrect compile errors. These messages are silenced now.
  • Other Changes:
    • Remove support for the DI_No_RegEx_Range_Checking compiler directive.
    • WildCardToPcre, WildCardToPcreA, and WildCardToPcreW replace the overloaded WildCardToPcre. Delphi 5 could not figure out the correct overload.
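
The ungreedy-mode fix above is about a bounded quantifier that should start with its minimum count and then extend as needed. Python's `re` module has no ungreedy option, so the sketch below writes the lazy form explicitly as a{1,3}?b. This substitution is an assumption made for illustration and does not use the DIRegEx API.

```python
import re

# a{1,3}?b: try one "a" first, then extend up to three if "b" does not follow yet.
lazy = re.compile(r"a{1,3}?b")

short_match = lazy.fullmatch("ab")   # minimum repeat count is enough
long_match = lazy.search("aaab")     # quantifier must extend to three "a"s

print(short_match is not None)
print(long_match.group())
```

A quantifier that wrongly behaved as possessive would be unable to adjust its repeat count, which is why the changelog lists the simple subject “ab” as a failing case.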

DIRegEx 8.1.0 – 2 Jan 2014

  • Feature Improvements:
    • Refactored and extended the amount of “auto-possessification”. coNoAutoPossess compile option added.
    • Improvement: Implemented TDIRegEx.InfoMatchEmpty / TDIRegEx16.InfoMatchEmpty which yields 1 if the pattern can match an empty string.
    • Unicode character properties were updated from Unicode 6.3.0.
    • pcre_jit_free_unused_memory, pcre16_jit_free_unused_memory, and pcre32_jit_free_unused_memory forcibly free unused JIT executable memory.
    • Processing unduplicated named groups should now be as fast as numerical groups, and processing duplicated groups should be faster than before.
    • Whereas an item such as A{4}+ ignored the possessiveness of the quantifier (because it's meaningless), this was not happening when coCaseLess was set. Not wrong, but inefficient.
    • Implement coNeverUtf to lock out the use of UTF, in particular, blocking (*UTF) etc.
    • Added support for [[:<:]] and [[:>:]] as used in the BSD POSIX library to mean “start of word” and “end of word”, respectively, as a transition aid.
  • Potential Incompatibility Changes:
    • There is now a limit (default 250) on the depth of nesting of parentheses. This limit is imposed to control the amount of system stack used at compile time.
  • Perl Compatibility:
    • A back reference to a named subpattern when there is more than one of the same name now checks them in the order in which they appear in the pattern. The first one that is set is used for the reference. Previously only the first one was inspected.
    • Perl has changed its handling of \8 and \9. If there is no previously encountered capturing group of those numbers, they are treated as the literal characters 8 and 9 instead of a binary zero followed by the literals. DIRegEx now does the same.
    • Added \o{} to specify codepoints in octal, making it possible to specify values greater than 0777 and also making them unambiguous.
    • In UCP mode, \s was not matching two of the characters that Perl matches, namely NEL (U+0085) and MONGOLIAN VOWEL SEPARATOR (U+180E), though they were matched by \h.
    • Upgraded the handling of the POSIX classes [:graph:], [:print:], and [:punct:] when coUCP is set so as to include the same characters as Perl does in Unicode mode.
    • Perl no longer allows group names to start with digits, so this change is now also in DIRegEx.
    • Perl now gives an error for missing closing braces after \x{… instead of treating the string as literal. DIRegEx now does the same.
    • Character classes such as [A-\d] or [a-[:digit:]] now cause compile-time errors. Perl warns for these when in warning mode, but DIRegEx has no facility for giving warnings.
    • In extended mode, Perl ignores spaces before a + that indicates a possessive quantifier. DIRegEx allowed a space before the quantifier, but not before the possessive +. It now does.
    • The vertical tab character (VT) has been added to the default set of characters that match \s and are generally treated as white space, following this same change in Perl 5.18. There is now no difference between “Perl space” and “POSIX space”. Whether VT is treated as white space in other locales depends on the locale.
  • Bug Fixes:
    • In UTF mode, the code for checking whether a group could match an empty string (which is used for indefinitely repeated groups to allow for breaking an infinite loop) was broken when the group contained a repeated negated single-character class with a character that occupied more than one data item and had a minimum repetition of zero (for example, [^\x{100}]* in UTF-8 mode). The effect was undefined: the group might or might not be deemed as matching an empty string, or the program might have crashed.
    • The code for checking whether a group could match an empty string was not recognizing that \h, \H, \v, \V, and \R must match a character.
    • Fixed two related bugs that applied to Unicode extended grapheme clusters that were repeated with a maximizing quantifier (e.g. \X* or \X{2,5}) when matched by pcre_exec() without using JIT:
      1. If the rest of the pattern did not match after a maximal run of grapheme clusters, the code for backing up to try with fewer of them did not always back up over a full grapheme when characters that do not have the modifier quality were involved, e.g. Hangul syllables.
      2. If the match point in a subject started with a modifier character, and there was no match, the code could incorrectly back up beyond the match point, and potentially beyond the first character in the subject, leading to a segfault or an incorrect match result.
    • A conditional group with an assertion condition could lead to recording an incorrect first data item for a match if no other first data item was recorded. For example, the pattern (?(?=ab)ab) recorded “a” as a first data item, and therefore matched “ca” after “c” instead of at the start.
    • If coAutoCallout and coUCP were set for a pattern that contained character types such as \d or \w, too many callouts were inserted, and the data that they returned was rubbish.
    • The use of \K (reset reported match start) within a repeated possessive group such as (a\Kb)*+ was not working.
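
The vertical tab change listed above can be illustrated with a minimal sketch. It uses Python's built-in `re` module, which is a different engine from DIRegEx and PCRE and is used here only because it happens to follow the same convention; it says nothing about DIRegEx internals.

```python
import re

VT = "\x0b"  # vertical tab, U+000B

# Python's own regex engine (an assumption for illustration, not DIRegEx)
# also counts VT as white space, for text patterns and byte patterns alike.
vt_is_space_text = re.fullmatch(r"\s", VT) is not None
vt_is_space_bytes = re.fullmatch(rb"\s", VT.encode("ascii")) is not None

print(vt_is_space_text, vt_is_space_bytes)
```

Both checks succeed, which matches the behaviour that \s has in this release.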

DIRegEx 8.0.0 – 25 Sep 2013

  • Support Delphi XE5 Win32 and Win64.

DIRegEx 7.5.0 – 14 Jun 2013

  • Support Delphi XE4 Win32 and Win64.
  • Changed TDIRegEx.InfoSize, TDIRegEx.InfoStudySize, and TDIRegEx.InfoJitSize output type from Cardinal to C_size_t, which is different in Win32 and Win64. If you are compiling for Win64, please adjust your code.
  • Fix: TDICustomRegExSearch did not find some matches on buffer boundaries.
  • JIT improvements:
    • Add support for callouts.
    • Inline subpatterns in recursions, when certain conditions are fulfilled.
    • Optimize fast forward start searches.
    • Auto-detect and optimize limited repetitions.
    • (*PRUNE), (*SKIP) and (*THEN) are now supported.
    • Control verbs are handled in the same way in JIT and interpreter.
  • JIT fixes:
    • Unoptimized capturing brackets incorrectly reset on backtrack.
    • Minimum length was not checked before the matching was started.
    • Two buffer over-read issues in 16-bit and 32-bit modes.
  • Syntax improvements:
    • Add the property \p{Xuc} for matching characters that can be expressed in certain programming languages using Universal Character Names.
    • Implemented (*LIMIT_MATCH) and (*LIMIT_RECURSION).
    • Perl confines (*SKIP) and (*PRUNE) to within a recursive subpattern. The same is now done here, just as with (*COMMIT).
    • The way backtracking verbs are handled has been changed in two ways:
      1. Previously, in something like (*COMMIT)(*SKIP), COMMIT would override SKIP. Now, PCRE acts on whichever backtracking verb is reached first by backtracking. In some cases this makes it more Perl-compatible, but Perl's rather obscure rules do not always do the same thing.
      2. Previously, backtracking verbs were confined within assertions. This is no longer the case for positive assertions, except for (*ACCEPT). Again, this sometimes improves Perl compatibility, and sometimes does not.
    • Allow an explicit callout to be inserted before an assertion that is the condition for a conditional group, for compatibility with automatic callouts, which always insert a callout at this point.
  • The value of capture_last that is passed to callouts was incorrect in some cases when there was a capture on one path that was subsequently abandoned after a backtrack. Also, the capture_last value is now reset after a recursion, since all captures are also reset in this case.
  • The interpreter no longer returns the “too many substrings” error in the case when an overflowing capture is in a branch that is subsequently abandoned after a backtrack.
  • Partial matches now set offsets[2] to the “bumpalong” value, that is, the offset of the starting point of the matching process, provided the offsets vector is large enough.
  • The \A escape now records a lookbehind value of 1, though its execution does not actually inspect the previous character. This is to ensure that, in partial multi-segment matching, at least one character from the old segment is retained when a new segment is processed. Otherwise, if there are no lookbehinds in the pattern, \A might match incorrectly at the start of a new segment.
  • Unicode validation has been updated in the light of Unicode Corrigendum #9, which points out that “non characters” are not “characters that may not appear in Unicode strings” but rather “characters that are reserved for internal use and have only local meaning”.
  • When a pattern was compiled with automatic callouts (PCRE_AUTO_CALLOUT) and there was a conditional group that depended on an assertion, if the assertion was false, the callout that immediately followed the alternation in the condition was skipped when pcre_exec was used for matching.
  • Fix infinite loop when (?<=(*SKIP)ac)a is matched against aa.
  • Fix the case where there are two or more SKIPs with arguments that may be ignored.
  • Fix: An opening parenthesis in a MARK/PRUNE/SKIP/THEN name in a pattern that contained a forward subroutine reference caused a compile error.
  • Fix: Segfault when pcre_dfa_exec is called with an output vector of length less than 2.
  • In the interpreter, maximizing pattern repetitions for characters and character types now use tail recursion, which reduces stack usage.
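
For the Unicode Corrigendum #9 item above, it may help to see which code points count as "non characters". The sketch below is plain Python and independent of DIRegEx; the function name `is_noncharacter` is hypothetical. It encodes the Unicode definition: U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is a Unicode noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, or a code point ending in FFFE or FFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count all noncharacters in the Unicode code space.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

Under the reading given in the corrigendum, these code points are legal in Unicode strings and merely reserved for internal use, so a validity check need not reject them.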

DIRegEx 7.1.1 – 24 Jan 2013

  • Speed improvements to regular expression matching.
  • Fix forward search bug in JIT.

DIRegEx 7.1.0 – 30 Nov 2012

New Features:

  • New support for 32-bit character strings, and UTF-32.
  • Improved Unicode support for \X so that it now matches a Unicode extended grapheme cluster.
  • Improved Unicode support for characters that have more than one “other case”.
    • Codepoints less than 256 whose other case is greater than 256 are now correctly matched caselessly. Previously, the high codepoint matched the low one, but not vice versa.
    • Caseless back references now work with characters that have more than one other case.
    • General caseless matching of characters with more than one other case is supported.
  • Unicode character properties were updated to Unicode 6.2.0.
  • Improved matching speed of capturing brackets.
  • Added support for PCRE_STUDY_EXTRA_NEEDED.
  • JIT compiler improvements: Many patterns run 20-40% faster.
    • Improved JIT compiler optimizations for first character search, single character iterators and character ranges.
  • Add a native interface for JIT. Through this interface, the compiled machine code can be directly executed. The purpose of this interface is to provide fast pattern matching, so several sanity checks are not performed. However, feature tests are still performed. The new interface provides a 1.4x speedup compared to the old one.
  • (*UTF) can now be used to start a pattern in any of the three libraries.

Bug Fixes:

  • A match that can occur only at the start of a line was incorrectly detected in cases where .* appeared inside atomic brackets at the start of a pattern, or where there was a subsequent *PRUNE or *SKIP.
  • If pcre_exec or pcre_dfa_exec was called with a negative value for the subject string length, the error given was PCRE_ERROR_BADOFFSET, which was confusing. There is now a new error PCRE_ERROR_BADLENGTH for this case.

DIRegEx 7.0.0 – 4 Oct 2012

  • Support Delphi XE3 Win32 and Win64.
  • Improve the matching speed of capturing brackets.
  • Improve the first n character searches.
  • Changed the meaning of \X so that it now matches a Unicode extended grapheme cluster.
  • Added support for PCRE_STUDY_EXTRA_NEEDED.

DIRegEx 6.4.0 – 12 Jul 2012

  • \s*\R was auto-possessifying the \s* when it should not, whereas \S*\R was not doing so when it should.
  • When PCRE_UCP was not set, \w+\x{c4} was incorrectly auto-possessifying the \w+ when the character tables indicated that \x{c4} was a word character. There were several related cases, all because the tests for doing a table lookup were testing for characters less than 127 instead of 255.
  • If a pattern contains capturing parentheses that are not used in a match, their slots in the ovector are set to -1. For those that are higher than any matched groups, this happens at the end of processing. In the case when there were back references that the ovector was too small to contain (causing temporary malloc'd memory to be used during matching), and the highest capturing number was not used, memory off the end of the ovector was incorrectly being set to -1.
  • Check for an overlong MARK name and give an error at compile time. The limit is 255 for the 8-bit library and 65535 for the 16-bit library.
  • When ((?:a?)*)*c or ((?>a?)*)*c was matched against “aac”, it set group 1 to “aa” instead of to an empty string. The bug affected repeated groups that could potentially match an empty string.
  • Wide characters specified with \uxxxx in JavaScript mode are now subject to the same checks as \x{…} characters in non-JavaScript mode. Specifically, codepoints that are too big for the mode are faulted, and in a UTF mode, disallowed codepoints are also faulted.
  • TDIDfaRegEx could cause incorrect processing when bytes with values greater than 127 were present. For TDIDfaRegEx16, the bug would be provoked by values in the range 0xfc00 to 0xdc00. In both cases the values are those that cannot be the first data item in a UTF character. The bug showed with recursions, possessively repeated groups, and atomic groups.
  • In 16-bit mode, studied patterns that started with \h* or \R* might have been incorrectly matched.
  • If .* appeared inside atomic brackets at the start of a pattern, or where there was a subsequent *PRUNE or *SKIP, the start of string (or line, in multiline mode) was determined incorrectly.
  • Improve JIT code generation for greedy plus quantifier, first character search, single character iterations, and character ranges.

DIRegEx 6.3.3 – 14 Apr 2012

  • Fixed a bug for backward assertions in the JIT compiler.
  • Support moNoStartOptimize / PCRE_NO_START_OPTIMIZE in JIT as (*MARK) support requires it.

DIRegEx 6.3.2 – 30 Mar 2012

  • Hotfix for DIRegEx_Workbench_Form.pas: Add missing array elements for coNoUtf16Check and moNoUtf16Check introduced in yesterday's release.
  • Use TDIRegExBase.CompileOptions and TDIRegExBase.MatchOptions for component streaming instead of their CompileOptionBits and MatchOptionBits counterparts. Existing forms are updated automatically.
  • TDIRegExBase.SetCompileOptions and TDIRegExBase.SetMatchOptions are now protected and virtual.
  • TDIRegExSearchStream_Enc and TDIRegExSearchStream_Utf8: compile and match option setters are now overloaded to include UTF-8 options.

DIRegEx 6.3.1 – 29 Mar 2012

  • Partial matching support is added to the JIT compiler.
  • Fixed several bugs concerned with partial matching of items that consist of more than one character:
    • ^(..)\1 did not partially match “aba” because checking references was done on an “all or nothing” basis. This also applied to repeated references.
    • \R did not give a hard partial match if CR was found at the end of the subject.
    • \X did not give a hard partial match after matching one or more characters at the end of the subject.
    • When newline was set to CRLF, a pattern such as a$ did not recognize a partial match for the string CR.
    • When newline was set to CRLF, the metacharacter “.” did not recognize a partial match for a CR character at the end of the subject string.
  • (*MARK) control verb is now supported by the JIT compiler.
  • Add PCRE_INFO_MAXLOOKBEHIND plus TDIRegEx.InfoMaxLookBehind, TDIRegEx16.InfoMaxLookBehind.
  • As documented, (*COMMIT) is now confined to within a recursive subpattern call.
  • As documented, (*COMMIT) is now confined to within a positive assertion.
  • (*COMMIT) control verb is now supported by the JIT compiler.
  • The Unicode data tables have been updated to Unicode 6.1.0.
  • Fix TDIPerlRegEx.SubStrCount, TDIPerlRegEx16.SubStrCount, TDIDfaRegEx.SubStrCount, TDIDfaRegEx16.SubStrCount for partial match results.
  • Add missing compile option coNoUtf16Check and match option moNoUtf16Check.

DIRegEx 6.3.0 – 19 Jan 2012

  • New 16-bit string processing classes TDIPerlRegEx16 and TDIDfaRegEx16. Both work on UnicodeStrings and WideStrings natively with no prior conversion. Full UTF-16 Unicode processing is optional.
  • Fixed a bug in fixed-length calculation for lookbehinds that would show up only in quite long subpatterns.
  • For a non-anchored pattern, if (*SKIP) was given with a name that did not match a (*MARK), and the match failed at the start of the subject, a reference to memory before the start of the subject could occur.
  • A reference to an unset group with zero minimum repetition was giving totally wrong answers (in non-JavaScript-compatibility mode). For example, (another)?(\1?)test matched against “hello world test”.
  • Ovector size of 2 is also supported by JIT based pcre_exec and pcre16_exec (the ovector size rounding is not applied in this particular case).
  • Remove deprecated pcre_info. Use pcre_fullinfo instead.

DIRegEx 6.2.0 – 12 Dec 2011

  • New Just-In-Time Compiler (JIT) optimization, which can greatly speed up pattern matching. Available as auto-option poAutoJit or by passing soJIT to TDIRegEx.Study.
  • A possessively repeated conditional subpattern such as (?(?=c)c|d)++ was being incorrectly compiled and would have given unpredictable results.
  • A possessively repeated subpattern with minimum repeat count greater than one behaved incorrectly. For example, (A){2,}+ behaved as if it was (A)(A)++ which meant that, after a subsequent mismatch, backtracking into the first (A) could occur when it should not.
  • In non-UTF-8 mode, \C is now supported in lookbehinds and DFA matching.
  • Perl does not support \N without a following name in a [] class; DIRegEx now also gives an error.
  • Removed the fixed limit of repeated forward references. Additional workspace is now dynamically allocated and limited to about 200000 repeats for safety. At the same time, the filling in of repeated forward references has been sped up.
  • A repeated forward reference in a pattern such as (a)(?2){2}(.) was incorrectly expecting the subject to contain another “a” after the start.
  • When (*SKIP:name) is activated without a corresponding (*MARK:name) earlier in the match, the SKIP should be ignored. This was not happening; instead the SKIP was being treated as NOMATCH. For patterns such as A(*MARK:A)A+(*SKIP:B)Z|AAC this meant that the AAC branch was never tested.
  • The behaviour of (*MARK), (*PRUNE), and (*THEN) has been reworked and is now much more compatible with Perl, in particular in cases where the result is a non-match for a non-anchored pattern. For example, if b(*:m)f|a(*:n)w is matched against “abc”, the non-match returns the name “m”, where previously it did not return a name. A side effect of this change is that for partial matches, the last encountered mark name is returned, as for non-matches. The refactoring has had the pleasing side effect of reducing stack requirements.
  • Add support for retrieving the executable code size of the JIT compiler, and fix some warnings.
  • A caseless match of a UTF-8 character whose other case uses fewer bytes did not work when the shorter character appeared right at the end of the subject string.
  • Computation of memory usage for the table of capturing group names was giving an unnecessarily large value.

DIRegEx 6.1.2 – 16 Nov 2011

  • Fixed that the following items were rejected as fixed length: (*ACCEPT), (*COMMIT), (*FAIL), (*MARK), (*PRUNE), (*SKIP), (*THEN), \h, \H, \v, \V, and single character negative classes with fixed repetitions, e.g. [^a]{3}, with and without coCaseLess.

DIRegEx 6.1.1 – 15 Nov 2011

  • Support \x, \U, and \u in JavaScript compatibility mode, based on the ECMA-262 standard.
  • Lookbehinds such as (?<=a{2}b) that contained a fixed repetition were erroneously being rejected as “not fixed length” if coCaseLess was set.
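
The lookbehind item above can be illustrated with a minimal sketch of the intended behaviour. It uses Python's built-in `re` module, with `re.IGNORECASE` standing in for coCaseLess; this is a different engine from DIRegEx, so it shows the expected semantics only. The trailing `c` and the subject string are made up for the example.

```python
import re

# A lookbehind with a fixed repetition has a fixed length (here: 3 characters),
# so it should compile and match even when caseless matching is enabled.
pattern = re.compile(r"(?<=a{2}b)c", re.IGNORECASE)

match = pattern.search("xAABc")
match_start = match.start() if match else None
print(match_start)
```

The pattern compiles and finds the final "c", which is what the fixed release is expected to do as well.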

DIRegEx 6.1.0 – 8 Nov 2011

  • Support Delphi XE2 Win64. Caution: DFA matching may cause access violations in 64-bit. Unfortunately, there is no way to locate their cause because *.obj file debugging is not yet available for Delphi XE2 64-bit (confirmed by Embarcadero in https://forums.embarcadero.com/thread.jspa?threadID=62631). Perl matching tests, however, pass without errors.
  • Changed some type names in DIRegEx_Api.pas so that they more closely resemble the PCRE original. The TDIRegEx classes (TDIPerlRegEx, TDIDfaRegEx) are not affected, but applications using the low level PCRE API might need small adjustments.
  • (*MARK) settings inside atomic groups that do not contain any capturing parentheses, for example, (?>a(*:m)), were not being passed out. This bug was introduced in DIRegEx 6.0.0.

DIRegEx 6.0.0 – 15 Oct 2011

  • Support Delphi XE2 Win32.
  • If a pattern such as (a)b|ac is matched against “ac”, there is no captured substring, but while checking the failing first alternative, substring 1 is temporarily captured. If the output vector supplied to pcre_exec was not big enough for this capture, the yield of the function was still zero (“insufficient space for captured substrings”). This cannot be totally fixed without adding another stack variable, which seems a lot of expense for an edge case. However, the situation is now improved in cases such as (a)(b)x|abc matched against “abc”, where the return code indicates that fewer than the maximum number of slots in the ovector have been set.
  • Related to above: when there are more back references in a pattern than slots in the output vector, pcre_exec uses temporary memory during matching, and copies in the captures as far as possible afterwards. It was using the entire output vector, but this conflicts with the specification that only 2/3 is used for passing back captured substrings. Now it uses only the first 2/3, for compatibility. This is, of course, another edge case.
  • When the number of matches in a pcre_dfa_exec run exactly filled the ovector, the return from the function was zero, implying that there were other matches that did not fit. The correct “exactly full” value is now returned.
  • If a subpattern that was called recursively or as a subroutine contained (*PRUNE) or any other control that caused it to give a non-standard return, invalid errors such as PCRE_ERROR_RECURSELOOP or even infinite loops could occur.
  • If a pattern such as a(*SKIP)c|b(*ACCEPT)| was studied, it stopped computing the minimum length on reaching *ACCEPT, and so ended up with the wrong value of 1 rather than 0. Further investigation indicates that computing a minimum subject length in the presence of *ACCEPT is difficult (think back references, subroutine calls), and so the code was changed so that no minimum is registered for a pattern that contains *ACCEPT.
  • If (*THEN) was present in the first (true) branch of a conditional group, it was not handled as intended.
  • A pathological pattern such as (*ACCEPT)a was miscompiled, thinking that the first byte in a match must be “a”.
  • If (*THEN) appeared in a group that was called recursively or as a subroutine, it did not work as intended.
  • Consider the pattern A (B(*THEN)C) | D where A, B, C, and D are complex pattern fragments (but not containing any | characters). If A and B are matched, but there is a failure in C so that it backtracks to (*THEN), PCRE was behaving differently to Perl. PCRE backtracked into A, but Perl goes to D. In other words, Perl considers parentheses that do not contain any | characters to be part of a surrounding alternative, whereas PCRE was treating (B(*THEN)C) the same as (B(*THEN)C|(*FAIL)) – which Perl handles differently. PCRE now behaves in the same way as Perl, except in the case of subroutine/recursion calls such as (?1) which have in any case always been different (but PCRE had them first).
  • Perl does not treat the | in a conditional group as creating alternatives. Such a group is treated in the same way as an ordinary group without any | characters when processing (*THEN). PCRE has been changed to match Perl's behaviour.
  • A change in DIRegEx 5.3.3 caused atomic groups to use more stack. This is inevitable for groups that contain captures, but it can lead to a lot of stack use in large patterns. The old behaviour has been restored for atomic groups that do not contain any capturing parentheses.

DIRegEx 5.3.3 – 29 Aug 2011

  • Fix an offset by 1 error in TDIRegEx.SubStrMatched.
  • Mark pcre_info as deprecated. Use pcre_fullinfo instead.
  • The Unicode data tables have been updated to Unicode 6.0.0.
  • There were a number of related bugs in the code for matching back references caselessly in UTF-8 mode when codes for the characters concerned were different numbers of bytes. For example, U+023A and U+2C65 are an upper and lower case pair, using 2 and 3 bytes, respectively. The main bugs were: (a) A reference to 3 copies of a 2-byte code matched only 2 of a 3-byte code. (b) A reference to 2 copies of a 3-byte code would not match 2 of a 2-byte code at the end of the subject (it thought there wasn't enough data left).
  • Comprehensive information about what went wrong is now returned by pcre_exec and pcre_dfa_exec when the UTF-8 string check fails, as long as the output vector has at least 2 elements. The offset of the start of the failing character and a reason code are placed in the vector.
  • When the UTF-8 string check fails for pcre_compile, the offset that is now returned is for the first byte of the failing character, instead of the last byte inspected. This is an incompatible change, but it should be small enough not to be a problem. It makes the returned offset consistent with pcre_exec and pcre_dfa_exec.
  • When \R was used with a maximizing quantifier it failed to skip backwards over a #13#10 pair if the subsequent match failed. Instead, it just skipped back over a single character (#10). This seems wrong (because it treated the two characters as a single entity when going forwards), conflicts with the documentation that \R is equivalent to (?>\r\n|\n|…etc), and makes the behaviour of \R* different to (\R)*, which also seems wrong. The behaviour has been changed.
  • Extensive internal refactoring has drastically reduced the number of recursive calls and the amount of stack used for possessively repeated groups such as (abc)++ when using pcre_exec.
  • Fix a number of bugs in the handling of groups:
    • (?<=(a)+) was not diagnosed as invalid (non-fixed-length lookbehind).
    • (a|)*(?1) gave a compile-time internal error.
    • ((a|)+)+ did not notice that the outer group could match an empty string.
    • (^a|^)+ was not marked as anchored.
    • (.*a|.*)+ was not marked as matching at start or after a newline.
  • When (*ACCEPT) was used in a subpattern that was called recursively, the restoration of the capturing data to the outer values was not happening correctly.
  • If a recursively called subpattern ended with (*ACCEPT) and matched an empty string, and PCRE_NOTEMPTY was set, pcre_exec thought the whole pattern had matched an empty string, and so incorrectly returned a no match.
  • There was optimizing code for the last branch of non-capturing parentheses, and also for the obeyed branch of a conditional subexpression, which used tail recursion to cut down on stack usage. Unfortunately, now that there is the possibility of (*THEN) occurring in these branches, tail recursion is no longer possible because the return has to be checked for (*THEN). These two optimizations have therefore been removed.
  • If a pattern containing \R was studied, it was assumed that \R always matched two bytes, thus causing the minimum subject length to be incorrectly computed because \R can also match just one byte.
  • If a pattern containing (*ACCEPT) was studied, the minimum subject length was incorrectly computed.
  • When (*ACCEPT) was used in an assertion that matched an empty string and PCRE_NOTEMPTY was set, PCRE applied the non-empty test to the assertion.
  • When an atomic group that contained a capturing parenthesis was successfully matched, but the branch in which it appeared failed, the capturing was not being forgotten if a higher numbered group was later captured. For example, (?>(a))b|(a)c when matching “ac” set capturing group 1 to “a”, when in fact it should be unset. This applied to multi-branched capturing and non-capturing groups, repeated or not, and also to positive assertions (capturing in negative assertions does not happen in PCRE) and also to nested atomic groups.
  • The way atomic groups are processed by pcre_exec has been changed so that if they are repeated, backtracking one repetition now resets captured values correctly. For example, if ((?>(a+)b)+aabab) is matched against “aaaabaaabaabab” the value of captured group 2 is now correctly recorded as “aaa”. Previously, it would have been “a”. As part of this code refactoring, the way recursive calls are handled has also been changed.
  • If an assertion condition captured any substrings, they were not passed back unless some other capturing happened later. For example, if (?(?=(a))a) was matched against “a”, no capturing was returned.
  • When studying a pattern that contained subroutine calls or assertions, the code for finding the minimum length of a possible match was handling direct recursions such as (xxx(?1)|yyy) but not mutual recursions (where group 1 called group 2 while simultaneously a separate group 2 called group 1). A stack overflow occurred in this case. This is now fixed by limiting the recursion depth to 10.
  • An instance of \X with an unlimited repeat could fail if at any point the first character it looked at was a mark character.
  • Some minor code refactoring concerning Unicode properties and scripts should reduce the stack requirement slightly.
  • If \k was not followed by a braced, angle-bracketed, or quoted name, PCRE compiled something random. Now it gives a compile-time error (as does Perl).
  • A *MARK encountered during the processing of a positive assertion is now recorded and passed back (compatible with Perl).
  • Previously, PCRE did not allow quantification of assertions. However, Perl does, and because of capturing effects, quantifying parenthesized assertions may at times be useful. Quantifiers are now allowed for parenthesized assertions.
  • \g was being checked for fancy things in a character class, when it should just be a literal “g”.
  • PCRE was rejecting [:a[:digit:]] whereas Perl was not. It seems that the appearance of a nested POSIX class supersedes an apparent external class. For example, [:a[:digit:]b:] matches “a”, “b”, “:”, or a digit. Also, unescaped square brackets may also appear as part of class names. For example, [:a[:abc]b:] gives unknown class “[:abc]b:]”. PCRE now behaves more like Perl.
  • PCRE was giving an error for \N with a braced quantifier such as {1,} (this was because it thought it was \N{name}, which is not supported).
  • PCRE tries to detect cases of infinite recursion at compile time, but it cannot analyze patterns in sufficient detail to catch mutual recursions such as ((?1))((?2)). There is now a runtime test that gives an error if a subgroup is called recursively as a subpattern for a second time at the same position in the subject string. In previous releases this might have been caught by the recursion limit, or it might have run out of stack.
  • A pattern such as (?(R)a+|(?R)b) is quite safe, as the recursion can happen only once. PCRE was, however, incorrectly giving a compile time error “recursive call could loop indefinitely” because it cannot analyze the pattern in sufficient detail. The compile time test no longer happens when PCRE is compiling a conditional subpattern, but actual runaway loops are now caught at runtime.
  • It seems that Perl allows any characters other than a closing parenthesis to be part of the NAME in (*MARK:NAME) and other backtracking verbs. PCRE has been changed to be the same.
  • Add a pointer to the latest mark to the callout data block.
  • The pattern .(*F), when applied to “abc” with PCRE_PARTIAL_HARD, gave a partial match of an empty string instead of no match. This was specific to the use of “.”.
  • The pattern f.*, if compiled with PCRE_UTF8 and PCRE_DOTALL and applied to “for” with PCRE_PARTIAL_HARD, gave a complete match instead of a partial match. This bug was dependent on both the PCRE_UTF8 and PCRE_DOTALL options being set.
  • For a pattern such as \babc|\bdef pcre_study was failing to set up the starting byte set, because \b was not being ignored.
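
The caseless back reference bugs listed above for U+023A and U+2C65 come down to byte counts. The following sketch checks those counts in plain Python, independent of DIRegEx. The reading that bug (a) stems from comparing byte lengths rather than character counts is an assumption drawn from the numbers, not something the changelog states.

```python
upper = "\u023A"  # LATIN CAPITAL LETTER A WITH STROKE
lower = "\u2C65"  # LATIN SMALL LETTER A WITH STROKE

# The two characters form a case pair but differ in UTF-8 length.
is_case_pair = upper.lower() == lower
upper_len = len(upper.encode("utf-8"))
lower_len = len(lower.encode("utf-8"))

# Three copies of the 2-byte form occupy as many bytes as
# two copies of the 3-byte form.
three_upper_bytes = len((upper * 3).encode("utf-8"))
two_lower_bytes = len((lower * 2).encode("utf-8"))

print(is_case_pair, upper_len, lower_len, three_upper_bytes, two_lower_bytes)
```

Since both runs are 6 bytes long, a comparison that counts bytes instead of characters would plausibly stop after two of the 3-byte characters.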

DIRegEx 5.3.2 – 20 Feb 2011

  • Compatibility update for parallel usage with other Yunqa Delphi products.

DIRegEx 5.3.1 – 30 Dec 2010

  • (*THEN) was not working properly if there were untried alternatives prior to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it backtracked to try for “b” instead of moving to the next alternative branch at the same level (in this case, to look for “c”). The Perl documentation is clear that when (*THEN) is backtracked onto, it goes to the “next alternative in the innermost enclosing group”.
  • (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and (*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides (*THEN).
  • If \s appeared in a character class, it removed the VT character from the class, even if it had been included by some previous item, for example in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part of \s, but is part of the POSIX “space” class.)
  • A partial match never returns an empty string (because you can always match an empty string at the end of the subject); however the checking for an empty string was starting at the “start of match” point. This has been changed to the “earliest inspected character” point, because the returned data for a partial match starts at this character. This means that, for example, /(?<=abc)def/ gives a partial match for the subject “abc” (previously it gave “no match”).
  • Changes have been made to the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and \B. If the match point is at the end of the string, previously a full match would be given. However, setting PCRE_PARTIAL_HARD has an implication that the given string is incomplete (because a partial match is preferred over a full match). For this reason, these items now give a partial match in this situation. [Aside: previously, the one case /t\b/ matched against “cat” with PCRE_PARTIAL_HARD set did return a partial match rather than a full match, which was wrong by the old rules, but is now correct.]
  • There was a bug in the handling of #-introduced comments, recognized when PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set. If a UTF-8 multi-byte character included the byte 0x85 (e.g. U+0445, whose UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when scanning for the end of the comment. (*Character* 0x85 is an “any” newline, but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several places in pcre_compile.
  • When pcre_compile was skipping #-introduced comments when looking ahead for named forward references to subpatterns, the only newline sequence it recognized was NL. It now handles newlines according to the set newline convention.
  • Neither pcre_exec nor pcre_dfa_exec was checking that the value given as a starting offset was within the subject string. There is now a new error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is negative or greater than the length of the string. In order to test this, pcretest is extended to allow the setting of negative starting offsets.
  • Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a bad UTF-8 sequence and one that is incomplete.
  • If \c was followed by a multibyte UTF-8 character, bad things happened. A compile-time error is now given if \c is not followed by an ASCII character, that is, a byte less than 128.
  • Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time – but just passed through to pcre_exec or pcre_dfa_exec. This makes it available to pcregrep and other applications that have no direct access to PCRE options. The new /Y option in pcretest sets this option when calling pcre_compile.
  • Groups containing recursive back references were forced to be atomic, but in the case of named groups, the amount of memory required was incorrectly computed, leading to “Failed: internal error: code overflow”. This has been fixed.
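
The distinction between character 0x85 and byte 0x85 in the comment-scanning item above can be checked with a minimal sketch. It is plain Python and does not model pcre_compile; the "naive" and "aware" checks are illustrative stand-ins for scanning bytes versus scanning decoded characters.

```python
cyrillic_ha = "\u0445"  # CYRILLIC SMALL LETTER HA
next_line = "\u0085"    # NEL, a newline under the "any" convention

ha_bytes = cyrillic_ha.encode("utf-8")
nel_bytes = next_line.encode("utf-8")

# A byte-wise scan sees 0x85 inside the Cyrillic letter ...
naive_hit = 0x85 in ha_bytes
# ... while a scan over decoded characters does not see a NEL there.
aware_hit = next_line in ha_bytes.decode("utf-8")

print(ha_bytes.hex(), nel_bytes.hex(), naive_hit, aware_hit)
```

The real NEL character is encoded as the two bytes 0xc2,0x85, so in UTF-8 mode a lone byte 0x85 is always a continuation byte of some other character.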

DIRegEx 5.3.0 – 28 Sep 2010

  • Delphi XE support.

DIRegEx 5.2.0 – 26 Jun 2010

  • Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and THEN.
  • (*ACCEPT) was not working when inside an atomic group.
  • Inside a character class, \R and \X were always treated as literals, whereas Perl faults them if its -w option is set. Changed so that they fault when coExtra is set.
  • Added support for \N which always matches any character other than newline. (It is the same as “.” when coDotAll is not set.)
  • Added four artificial Unicode properties to help with an option to make \s etc use properties. The new properties are: Xan (alphanumeric), Xsp (Perl space), Xps (POSIX space), and Xwd (word).
  • Added coUCP to make \b, \d, \s, \w, and certain POSIX character classes use Unicode properties. (*UCP) at the start of a pattern can be used to set this option.
  • In coUtf8 mode, if a pattern that was compiled with coCaseLess was studied, and the match started with a letter with a code point greater than 127 whose first byte was different to the first byte of the other case of the letter, the other case of this starting letter was not recognized.
  • TDIRegEx.Study now recognizes \h, \v, and \R when constructing a bit map of possible starting bytes for non-anchored patterns.
  • Extended the “auto-possessify” recognition during pattern compilation. Now \R and a number of cases that involve Unicode properties are recognized, both explicit and implicit when coUCP is set.
  • Fix a Study problem in UTF-8 mode if a pattern starts with certain non-ASCII characters.
  • A pattern such as (?&t)(?#()(?(DEFINE)(?<t>a)) which has a forward reference to a subpattern the other side of a comment that contains an opening parenthesis caused either an internal compiling error, or a reference to the wrong subpattern.

DIRegEx 5.1.7 – 31 May 2010

  • Fix: If a repeated Unicode property match (e.g. \p{Lu}*) was used with non-UTF-8 input, it could crash or give wrong results if characters with values greater than #$C0 were present in the subject string. (Detail: it assumed UTF-8 input when processing these items.)

DIRegEx 5.1.6 – 15 May 2010

  • Correct a memory allocation problem in TDIRegEx.CompileFormatPattern.

DIRegEx 5.1.5 – 25 Apr 2010

  • Work around a weird D4, D5, and D6 memory corruption bug which surfaced in DIRegEx_MaskControls. For two AnsiStrings Str1 and Str2, these old Delphi versions do not compile the assignment Str1 := AnsiString(Str2); in the same way as Str1 := Str2; but instead corrupt memory further down the stack. Delphi 7 and newer are not affected.
  • The above also fixes the DIRegEx Workbench demo and the design-time editor for D4, D5, and D6 which could crash the application or the IDE.
  • Corrections to the OMMIT_PCRE_COMPILE compiler directive for DIRegEx_Api.pas (source code only).

DIRegEx 5.1.4 – 24 Mar 2010

  • The Unicode data tables have been updated to Unicode 5.2.0.
  • A pattern such as (?&t)*+(?(DEFINE)(?<t>.)) which has a possessive quantifier applied to a forward-referencing subroutine call, could compile incorrect code or give the error “internal error: previously-checked referenced subpattern not found”.
  • Fixed possible memory access outside allocated memory.
  • Hold memory texts as one long string to avoid too much relocation at load time.
  • Fix for \K giving a compile-time error if it appeared in a lookbehind assertion.
  • \K was not working if it appeared in an atomic group or in a group that was called as a “subroutine”, or in an assertion. Perl 5.11 documents that \K is “not well defined” if used in an assertion. DIRegEx now accepts it if the assertion is positive, but not if it is negative.
  • A pattern such as (?P<L1>(?P<L2>0)|(?P>L2)(?P>L1)) in which the only other item in a branch that calls a recursion is a subroutine call – as in the second branch in the above example – was incorrectly given the compile-time error “recursive call could loop indefinitely” because pcre_compile was not correctly checking the subroutine for matching a non-empty string.
  • Completely revised the help generator to ease navigation and improve readability. Send your feedback!

DIRegEx 5.1.3 – 19 Jan 2010

  • A pattern such as ^(?!a(*SKIP)b) where a negative assertion contained one of the verbs SKIP, PRUNE, or COMMIT, did not work correctly. When the assertion pattern did not match (meaning that the assertion was true), it was incorrectly treated as false if the SKIP had been reached during the matching. This also applied to assertions used as conditions.
  • If an item that is not supported by pcre_dfa_exec() was encountered in an assertion subpattern, including such a pattern used as a condition, unpredictable results occurred, instead of the error return PCRE_ERROR_DFA_UITEM.
  • A subtle bug concerned with back references has been fixed by a change of specification, with a corresponding code fix. A pattern such as ^(xa|=?\1a)+$ which contains a back reference inside the group to which it refers, was giving matches when it shouldn't. For example, xa=xaaa would match that pattern. Interestingly, Perl (at least up to 5.11.3) has the same bug. Such groups have to be quantified to be useful, or contained inside another quantified group. (If there's no repetition, the reference can never match.) The problem arises because, having left the group and moved on to the rest of the pattern, a later failure that backtracks into the group uses the captured value from the final iteration of the group rather than the correct earlier one. This is now fixed by forcing any group that contains a reference to itself to be an atomic group; that is, there cannot be any backtracking into it once it has completed. This is similar to recursive and subroutine calls.

DIRegEx 5.1.2 – 15 Dec 2009

  • If a pattern contained a conditional subpattern with only one branch (in particular, this includes all (DEFINE) patterns), studying this pattern computed the wrong minimum data length and resulted in matching failures.
  • For patterns such as (?i)a(?-i)b|c where an option setting at the start of the pattern is reset in the first branch, compilation failed with “internal error: code overflow at offset…”. This happened only when the reset was to the original external option setting.

DIRegEx 5.1.1 – 29 Oct 2009

  • Change the published TDIRegEx.MatchPattern property back to AnsiString. This type change was unfortunately required to fix a Delphi 2009 / 2010 RawByteString streaming problem.
  • Add new public TDIRegEx.MatchPatternRaw: RawByteString property to allow Unicode Delphis to set the MatchPattern without automatic codepage conversion. This is now the recommended MatchPattern runtime property.
  • Improve Unicode in DIRegEx_MaskControls.pas. TDIRegExMaskEdit and TDIRegExMaskComboBox now automatically encode text to UTF-8 when their RegEx component is in UTF-8 mode.
  • The maximum size of a compiled regular expression is now 16 MB. This should please users who had hit the old 64 KB limit.
  • A UTF-8 pattern such as \x{123}{2,2}+ was incorrectly compiled; the trigger was a minimum greater than 1 for a wide character in a possessive repetition. The same bug could also affect UTF-8 patterns like (\x{ff}{0,2})* which had an unlimited repeat of a nested, fixed maximum repeat of a wide character. Chaos in the form of incorrect output or a compiling loop could result.
  • The restrictions on what a pattern can contain when partial matching is requested for pcre_exec() have been removed. All patterns can now be partially matched by this function. In addition, if there are at least two slots in the offset vector, the offset of the earliest inspected character for the match and the offset of the end of the subject are set in them when PCRE_ERROR_PARTIAL is returned.
  • Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is synonymous with PCRE_PARTIAL, for backwards compatibility, and PCRE_PARTIAL_HARD, which causes a partial match to supersede a full match, and may be more useful for multi-segment matching.
  • Partial matching with pcre_exec() is now more intuitive. A partial match used to be given if ever the end of the subject was reached; now it is given only if matching could not proceed because another character was needed. This makes a difference in some odd cases such as Z(*FAIL) with the string “Z”, which now yields “no match” instead of “partial match”. In the case of pcre_dfa_exec(), “no match” is given if every matching path for the final character ended with (*FAIL).
  • Restarting a match using pcre_dfa_exec() after a partial match did not work if the pattern had a “must contain” character that was already found in the earlier partial match, unless partial matching was again requested. For example, with the pattern dog.(body)?, the “must contain” character is “g”. If the first part-match was for the string “dog”, restarting with “sbody” failed. This bug has been fixed.
  • The string returned by pcre_dfa_exec() after a partial match has been changed so that it starts at the first inspected character rather than the first character of the match. This makes a difference only if the pattern starts with a lookbehind assertion or \b or \B (\K is not supported by pcre_dfa_exec()). It's an incompatible change, but it was required to make it compatible with pcre_exec().
  • If an odd number of negated classes containing just a single character interposed, within parentheses, between a forward reference to a named subpattern and the definition of the subpattern, compilation crashed with an internal error, complaining that it could not find the referenced subpattern. An example of a crashing pattern is (?&A)(([^m])(?<A>)).
  • Added moNotEmptyAtStart which makes it possible to have an empty string match not at the start, even when the pattern is anchored.
  • If the maximum number of capturing subpatterns in a recursion was greater than the maximum at the outer level, the higher number was returned, but with unset values at the outer level. The correct (outer level) value is now given.
  • If (*ACCEPT) appeared inside capturing parentheses, previous releases did not set those parentheses. The string so far is captured, making this feature compatible with Perl.
  • DIRegEx now allows subroutine calls in lookbehinds, as long as the subroutine pattern matches a fixed length string. Recursion is not allowed.
  • The minimum length of subject string that was needed in order to match a given pattern is now provided. This code has now been added to pcre_study(); it finds a lower bound to the length of subject needed. It is not necessarily the greatest lower bound, but using it to avoid searching strings that are too short does give some useful speed-ups. The value is available to calling programs via pcre_fullinfo().
  • If (?| is used to create subpatterns with duplicate numbers, they are now allowed to have the same name, even if PCRE_DUPNAMES is not set. However, on the other side of the coin, they are no longer allowed to have different names, because these cannot be distinguished.
  • When duplicate subpattern names are present (necessarily with different numbers), and a test is made by name in a conditional pattern, either for a subpattern having been matched, or for recursion in such a pattern, all the associated numbered subpatterns are tested, and the overall condition is true if the condition is true for any one of them. This is the way Perl works, and is also more like the way testing by number works.
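The minimum-length item above describes a lower bound on the subject length that a pattern needs. As a rough illustration of the idea only, here is a Python sketch over a toy pattern tree. The node classes Lit, Seq, Alt, and Repeat and the function min_length are hypothetical names invented for this example; this is not the pcre_study() implementation, which works on compiled patterns and handles many more constructs.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    """A literal string that must appear as-is."""
    text: str


@dataclass(frozen=True)
class Seq:
    """Items matched one after another."""
    items: Tuple["Node", ...]


@dataclass(frozen=True)
class Alt:
    """Alternatives, of which exactly one is matched."""
    branches: Tuple["Node", ...]


@dataclass(frozen=True)
class Repeat:
    """An item repeated at least `minimum` times."""
    item: "Node"
    minimum: int


Node = Union[Lit, Seq, Alt, Repeat]


def min_length(node: Node) -> int:
    """Return a lower bound on the length of any string the node can match."""
    if isinstance(node, Lit):
        return len(node.text)
    if isinstance(node, Seq):
        return sum(min_length(item) for item in node.items)
    if isinstance(node, Alt):
        return min(min_length(branch) for branch in node.branches)
    if isinstance(node, Repeat):
        return node.minimum * min_length(node.item)
    raise TypeError(f"unsupported node: {node!r}")


# Toy equivalent of ab(c|def)(xy){2,}
pattern = Seq((Lit("ab"), Alt((Lit("c"), Lit("def"))), Repeat(Lit("xy"), 2)))
print(min_length(pattern))
```

A caller can skip any subject shorter than this bound without running the matcher, which is where the speed-up mentioned above comes from.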

DIRegEx 5.1.0 – 14 Sep 2009

  • Delphi 2010 support.

DIRegEx 5.0.2 – 18 Apr 2009

  • The pattern (?(?=.*b)b|^) was incorrectly compiled as “match must be at start or after a newline”, because the conditional assertion was not being correctly handled. The rule now is that both the assertion and what follows in the first alternative must satisfy the test.
  • If auto-callout was enabled in a pattern with a conditional group whose condition was an assertion, DIRegEx could crash during matching, both with pcre_exec() and pcre_dfa_exec().
  • The PCRE_DOLLAR_ENDONLY option was not working when pcre_dfa_exec() was used for matching.
  • Unicode property support in character classes was not working for characters (bytes) greater than 127 when not in UTF-8 mode.
  • Added the PCRE_NO_START_OPTIMIZE match-time option.
  • A conditional group that had only one branch was not being correctly recognized as an item that could match an empty string. This meant that an enclosing group might also not be so recognized, causing infinite looping (and probably a segfault) for patterns such as ^"((?(?=[a])[^"])|b)*"$ with the subject “ab”, where knowledge that the repeated group can match nothing is needed in order to break the loop.
  • If a pattern that was compiled with callouts was matched using pcre_dfa_exec(), but without supplying a callout function, matching went wrong.
  • If PCRE_ERROR_MATCHLIMIT occurred during a recursion, there was a memory leak if the size of the offset vector was greater than 30. When the vector is smaller, the saved offsets during recursion go onto a local stack vector, but for larger vectors malloc() is used. It was failing to free when the recursion yielded PCRE_ERROR_MATCHLIMIT (or any other “abnormal” error, in fact).
  • Forward references, both numeric and by name, in patterns that made use of duplicate group numbers, could behave incorrectly or give incorrect errors, because when scanning forward to find the reference group, PCRE was not taking into account the duplicate group numbers. A pattern such as ^X(?3)(a)(?|(b)|(q))(Y) is an example.
  • Added support for (*UTF8) at the start of a pattern.

DIRegEx 5.0.1 – 31 Jan 2009

  • Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.

DIRegEx 5.0.0 – 24 Nov 2008

  • Delphi 2009 support.
  • Fix an expression study bug when a pattern contained a group with a zero quantifier.
  • Optimize Unicode Character Property searching, giving speed ups of 2 to 5 times on some simple patterns.
  • Updated the Unicode data tables to Unicode 5.1.0. This adds yet more scripts.
  • Fix caseless matching for non-ASCII characters in back references.
  • Fix overwriting or crash if the start of a pattern had top-level alternatives.
  • Fix a few cases where matching could read past the end of the subject.
  • Fix lazy quantifiers which were not working in some cases in UTF-8 mode.

DIRegEx 4.7.2 – 2 Jul 2008

  • Correct a unit name typo which caused IDE installation to fail.

DIRegEx 4.7.1 – 1 Jul 2008

  • Improve compatibility for parallel installation with other DI packages.

DIRegEx 4.7 – 8 May 2008

  • Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n', which, however, unlike Perl's \g{…}, are subroutine calls, not back references. DIRegEx supports relative numbers with this syntax.
  • Previously, a group with a zero repeat such as (…){0} was completely omitted from the compiled regex. However, this means that if the group was called as a subroutine from elsewhere in the pattern, things went wrong (an internal error was given). Such groups are now left in the compiled pattern, with a new opcode that causes them to be skipped at execution time.
  • Added the PCRE_JAVASCRIPT_COMPAT option. This makes the following changes to the way DIRegEx behaves:
    • A lone ] character is dis-allowed (Perl treats it as data).
    • A back reference to an unmatched subpattern matches an empty string (Perl fails the current match path).
    • A data ] in a character class must be notated as \] because if the first data character in a class is ], it defines an empty class. (In Perl it is not possible to have an empty class.) The empty class [] never matches; it forces failure and is equivalent to (*FAIL) or (?!). The negative empty class [^] matches any one character, independently of the DOTALL setting.
  • A pattern such as /(?2)[]a()b](abc)/ which had a forward reference to a non-existent subpattern following a character class starting with ']' and containing () gave an internal compiling error instead of “reference to non-existent subpattern”. This is now corrected.
  • Accept (*FAIL) for DFA matching.

DIRegEx 4.6.1 Beta 2 – 6 Feb 2008 (internal)

  • Fix problems with empty FormatPattern introduced in 4.6.1 Beta 1.

DIRegEx 4.6.1 Beta 1 – 6 Feb 2008 (internal)

  • DIRegEx 4.6 failed to update the internal PCRE version number.
  • Fixed a problem with TDIRegEx.Format and duplicate substring names.
  • Removed conditional directives from DIRegEx_Workbench_Form.pas which caused problems for some Delphi versions.

DIRegEx 4.6 – 28 Jan 2008

  • Fix a potential buffer overflow which occurred when compiling, in UTF-8 mode, a pattern containing a character class with a very large number of characters with codepoints greater than 255.

DIRegEx 4.5 – 14 Jan 2008

DIRegEx 4.5 is mainly a bug-fix release:

  • Negative specials like \S did not work in character classes in UTF-8 mode. Characters greater than 255 were excluded from the class instead of being included. The same bug also applied to negated POSIX classes such as [:^space:].
  • The construct (?&) was not diagnosed as a syntax error (it referenced the first named subpattern) and a construct such as (?&a) would reference the first named subpattern whose name started with “a” (in other words, the length check was missing). Both these problems are fixed. “Subpattern name expected” is now given for (?&) (a zero-length name), and this patch also makes it give the same error for \k'' (previously it complained that that was a reference to a non-existent subpattern).
  • The erroneous patterns (?+-a) and (?-+a) give different error messages; this is right because (?- can be followed by option settings as well as by digits. I have, however, made the messages clearer.
  • Patterns such as (?(1)a|b) (a pattern that contains fewer subpatterns than the number used in the conditional) now cause a compile-time error. This is actually not compatible with Perl, which accepts such patterns, but treats the conditional as always being FALSE (as DIRegEx used to), but it seems that giving a diagnostic is better.
  • Correct some Unicode character properties which were in the wrong script.
  • The pattern (?=something)(?R) was not being diagnosed as a potentially infinitely looping recursion. The bug was that positive lookaheads were not being skipped when checking for a possible empty match (negative lookaheads and both kinds of lookbehind were skipped).
  • Specifying a possessive quantifier with a specific limit for a Unicode character property caused pcre_compile() to compile bad code, which led at runtime to PCRE_ERROR_INTERNAL (-14). Examples of patterns that caused this are: '\p{Zl}{2,3}+' and '\p{Cc}{2}+'. It was the possessive “+” that caused the error; without that there was no problem.
  • In UTF-8 mode, with newline set to “any”, a pattern such as .*a.*=.b.* crashed when matching a string such as a\x{2029}b (note that \x{2029} is a UTF-8 newline character). The key issue is that the pattern starts .*; this means that the match must be either at the beginning, or after a newline. The bug was in the code for advancing after a failed match and checking that the new position followed a newline. It was not taking account of UTF-8 characters correctly.
  • DIRegEx was behaving differently from Perl in the way it recognized POSIX character classes. DIRegEx was not treating the sequence [:…:] as a character class unless the … were all letters. Perl, however, seems to allow any characters between [: and :], though of course it rejects as unknown any “names” that contain non-letters, because all the known class names consist only of letters. Thus, Perl gives an error for [[:1234:]], for example, whereas DIRegEx did not – it did not recognize a POSIX character class. This seemed a bit dangerous, so the code has been changed to be closer to Perl. The behaviour is not identical to Perl, because DIRegEx will diagnose an unknown class for, for example, [[:l\ower:]] where Perl will treat it as [[:lower:]]. However, DIRegEx does now give “unknown” errors where Perl does, and where it didn't before.
  • Correct a potential one byte overflow by ansi_mbtowc and oem_mbtowc in DIRegEx_SearchStream.pas.

DIRegEx 4.4 – 21 Sep 2007

  • Extend TDIRegEx.MatchNext to match empty result strings. The new algorithm detects potential infinite loops and advances the search position as necessary.
  • Do not count [\s] as an explicit reference to CR or LF. So now DIRegEx will match single CR and LF only if the pattern contains \r or \n (or a literal CR or LF).
  • The appearance of (?J) was not reflected by the PCRE_INFO_JCHANGED facility.
  • Added options (at compile time and exec time) to change \R from matching any Unicode line ending sequence to just matching CR, LF, or CRLF.
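The MatchNext item above mentions advancing the search position so that empty matches do not cause an infinite loop. The general technique can be sketched as follows. This is a minimal Python illustration using the standard re module, not the DIRegEx algorithm; the function name find_all_spans is hypothetical.

```python
import re
from typing import List, Tuple


def find_all_spans(pattern: str, subject: str) -> List[Tuple[int, int]]:
    """Collect (start, end) spans of successive matches, including empty ones."""
    regex = re.compile(pattern)
    spans: List[Tuple[int, int]] = []
    pos = 0
    while pos <= len(subject):
        match = regex.search(subject, pos)
        if match is None:
            break
        spans.append(match.span())
        if match.end() == match.start():
            # An empty match would be found again at the same position
            # forever, so step past it by one character.
            pos = match.end() + 1
        else:
            pos = match.end()
    return spans


print(find_all_spans(r"a*", "baac"))
```

Without the extra step after an empty match, the loop would report the same zero-length match at position 0 indefinitely.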

DIRegEx 4.3 – 28 Aug 2007

  • The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed when the subject happened to end in the byte 0x85 (e.g. if the last character was \x{1ec5}). *Character* 0x85 is one of the “any” newline characters but of course it shouldn't be taken as a newline when it is part of another character. The bug was that, for an unlimited repeat of . in not-DOTALL UTF-8 mode, DIRegEx was advancing by bytes rather than by characters when looking for a newline.
  • A small performance improvement in the DOTALL UTF-8 mode .* case.
  • Remove the explicit limit on non-capturing parentheses at the expense of using more stack.
  • Remove the artificial limitation on group length – now there is only the limit on the total length of the compiled pattern, which is set at 65535.
  • Because Perl interprets \Q…\E at a high level, and ignores orphan \E instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error, because the ] is interpreted as the first data character and the terminating ] is not found. DIRegEx has been made compatible with Perl in this regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could cause memory overwriting.
  • Like Perl, DIRegEx automatically breaks an unlimited repeat after an empty string has been matched (to stop an infinite loop). It was not recognizing a conditional subpattern that could match an empty string if that subpattern was within another subpattern. For example, it looped when trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the condition was not nested. This bug has been fixed.
  • A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack past the start of the subject in the presence of bytes with the top bit set, for example “\x8aBCD”.
  • Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).
  • Optimized (?!) to (*FAIL).
  • Updated the test for a valid UTF-8 string to conform to the later RFC 3629. This restricts code points to be within the range 0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF. Previously, DIRegEx allowed the full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still does: it's just the validity check that is more restrictive.
  • Inserted checks for integer overflows during escape sequence (backslash) processing, and also fixed erroneous offset values for syntax errors during backslash processing.
  • Fixed another case of looking too far back in non-UTF-8 mode for patterns like [\PPP\x8a]{1,}\x80 with the subject “A\x80”.
  • An unterminated class in a pattern like (?1)\c[ with a “forward reference” caused an overrun.
  • A pattern like (?:[\PPa*]*){8,} which had an “extended class” (one with something other than just ASCII characters) inside a group that had an unlimited repeat caused a loop at compile time (while checking to see whether the group could match an empty string).
  • An orphan \E inside a character class could cause a crash.
  • A repeated capturing bracket such as (A)? could cause a wild memory reference during compilation.
  • There are several functions in pcre_compile() that scan along a compiled expression for various reasons (e.g. to see if it's fixed length for look behind). There were bugs in these functions when a repeated \p or \P was present in the pattern. These operators have additional parameters compared with \d, etc, and these were not being taken into account when moving along the compiled data. Specifically:
    • An item such as \p{Yi}{3} in a lookbehind was not treated as fixed length.
    • An item such as \pL+ within a repeated group could cause crashes or loops.
    • A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect “reference to non-existent subpattern” error.
    • A pattern like (\P{Yi}{2}\277)? could loop at compile time.
  • A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte characters were involved (for example /\S{2}/8g with “A\x{a3}BC”).
  • Patterns such as [\P{Yi}A] which include \p or \P and just one other character were causing crashes (broken optimization).
  • Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing \p or \P) caused a compile-time loop.
  • More problems have arisen in unanchored patterns when CRLF is a valid line break. For example, the unstudied pattern [\r\n]A does not match the string “\r\nA”. However, the pattern \nA *does* match, because it doesn't start till \n, and if [\r\n]A is studied, the same is true. There doesn't seem any very clean way out of this, but to make sense for the common cases DIRegEx now takes note of whether there can be an explicit match for \r or \n anywhere in the pattern, and if so, does not advance CRLF by two bytes. As part of this change, there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled pattern has explicit CR or LF references.
  • Added (*CR) etc for changing newline setting at start of pattern.
  • Fix spelling of DIRegEx_Reg.dcr in the DIRegEx packages which caused a problem during IDE installation.
  • Documentation updates and fixes.
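The RFC 3629 validity rules mentioned in the list above (code points up to 0x10FFFF, surrogates excluded) can be tried out with any strict UTF-8 decoder. The sketch below uses Python's built-in decoder as a stand-in; it is not the DIRegEx check, and the function name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "U+1EC5 (ends in byte 0x85)": b"\xe1\xbb\x85",
    "U+10FFFF (highest allowed)": b"\xf4\x8f\xbf\xbf",
    "U+110000 (beyond the range)": b"\xf4\x90\x80\x80",
    "U+D800 (surrogate)": b"\xed\xa0\x80",
}

for label, raw in samples.items():
    print(label, is_valid_utf8(raw))
```

The first sample is the character from the .*$ item above: it is valid UTF-8 even though its last byte is 0x85.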

DIRegEx 4.2 – 20 Jun 2007

  • Add more features from Perl 5.10:
    • (?-n) (where n is a string of digits) is a relative subroutine or recursion call. It refers to the nth most recently opened parentheses.
    • (?+n) is also a relative subroutine call; it refers to the nth next to be opened parentheses.
    • Conditions that refer to capturing parentheses can be specified relatively, for example, (?(-2)… or (?(+3)…
    • \K resets the start of the current match so that everything before is not part of it.
    • \k{name} is synonymous with \k<name> and \k'name' (.NET compatible).
    • \g{name} is another synonym – part of Perl 5.10's unification of reference syntax.
    • (?| introduces a group in which the numbering of parentheses in each alternative starts with the same number.
    • \h, \H, \v, and \V match horizontal and vertical whitespace.
  • Fix: Matching a pattern such as (.*(.)?)* failed by either not terminating or by crashing.
  • Fix: A pattern with a very large number of alternatives (more than several hundred) was running out of internal workspace during the pre-compile phase. A bit of new cunning has reduced the workspace needed for groups with alternatives. The 1000-alternative test pattern now uses 12 bytes of workspace instead of running out of the 4096 that are available.
  • Fix: If \p or \P was used in non-UTF-8 mode on a character greater than 127 it matched the wrong number of bytes.
  • Added new method TDIRegEx.SubStrPtr.
  • Added two new info methods to TDIRegEx: InfoOkPartial and InfoJChanged.
  • Speed up performance of TDIRegEx.CompiledRegExpArray.
  • Added new menu entry to the DIRegEx Workbench to copy the pattern as a well formatted Pascal string.
  • Added a new demo DIRegEx_PreCompiled_Pattern.dpr which shows how to use precompiled regular expressions.

DIRegEx 4.1.1 – 7 May 2007

  • Delphi 2007 support.
  • Added coNewLineAnyCrLf which is like coNewLineAny, but matches only CR, LF, or CRLF as a newline sequence. The match-option equivalent is moNewLineAnyCrLf. Only a single newline option may be set at the same time. Invalid combinations of newline options will raise an exception.

DIRegEx 4.1 – 3 Apr 2007

  • New classes to search for regular expressions in data / streams / files of arbitrary size by loading only a small portion of data into memory at a single time:
    • TDICustomRegExSearch
    • TDIRegExSearchStream
    • TDIRegExSearchStream_Enc
    • TDIRegExSearchStream_ANSI
    • TDIRegExSearchStream_Binary
    • TDIRegExSearchStream_Binary16BE
    • TDIRegExSearchStream_Binary16LE
    • TDIRegExSearchStream_OEM
    • TDIRegExSearchStream_UTF16BE
    • TDIRegExSearchStream_UTF16LE
  • There is a new example project demonstrating the usage of these new classes.
  • Fixed a fairly obscure bug concerned with quantified caseless matching with Unicode property support: For a maximizing quantifier, if the two different cases of the character were of different lengths in their UTF-8 codings, and the matching function had to back up over a mixture of the two cases, it incorrectly assumed they were both the same length.
  • In multiline mode when the newline sequence was set to “any”, the pattern ^$ would give a match between the CR and LF of a subject such as 'A'#13#10'B'. This doesn't seem right; it now treats the CRLF combination as the line ending, and so does not match in that case. It's only a pattern such as ^$ that would hit this one: something like ^ABC$ would have failed after CR and then tried again after CRLF.
  • SubStrCount returns the actual count of captured substrings, even for descendent classes. Fixed a problem where the wrong value was returned for TDIDfaRegEx. Likewise improved the regular expression workbench.
  • Fixed TDIRegExInspector to handle Windows XP themes.
  • Added XP Theme support to the GUI demo projects. Also increased the demo projects' maximum stack size to {$MAXSTACKSIZE $00200000} in order to reduce the potential of stack overflow when matching very demanding regular expressions.

DIRegEx 4.0 – 18 Jan 2007

  • New and improved List2 and Replace2 functions: They are different from the old List and Replace in that they return the number of matches listed / replaced and also work on empty matches. This can be useful for replacing empty lines, for example.
  • In response to the growing importance of Unicode, the default character set for caseless matching and character classes is now Latin 1, a subset of Unicode. Use the poUserLocale Option if you are matching ANSI strings in the user's default locale.
  • Major re-factoring of the way pcre_compile computes the amount of memory needed for a compiled pattern. It now runs the real compile function in a “fake” mode that enables it to compute how much memory it would need, while actually only ever using a few hundred bytes of working memory and without too many tests of the mode. A side effect of this work is that the limit of 200 on the nesting depth of parentheses has been removed. However, there is a downside: pcre_compile now runs more slowly than before (30% or more, depending on the pattern). There is no effect on runtime performance.
  • Extended pcre_study to be more clever in cases where a branch of a subpattern has no definite first character. For example, (a*|b*)[cd] would previously give no result from pcre_study. Now it recognizes that the first character must be a, b, c, or d.
  • There was an incorrect error “recursive call could loop indefinitely” if a subpattern (or the entire pattern) that was being tested for matching an empty string contained only one non-empty item after a nested subpattern.
  • A new optimization is now able automatically to treat some sequences such as a*b as a*+b. More specifically, if something simple (such as a character or a simple class like \d) has an unlimited quantifier, and is followed by something that cannot possibly match the quantified thing, the quantifier is automatically “possessified”.
  • A recursive reference to a subpattern whose number was greater than 39 went wrong under certain circumstances in UTF-8 mode. This bug could also have affected the operation of pcre_study.
  • Possessive quantifiers such as a++ were previously implemented by turning them into atomic groups such as (?>a+). Now they have their own opcodes, which improves performance. This includes the automatically created ones from above.
  • A pattern such as (?=(\w+))\1: which simulates an atomic group using a lookahead was broken if it was not anchored. DIRegEx was mistakenly expecting the first matched character to be a colon. This applied both to named and numbered groups.
  • Forward references to subpatterns in conditions such as (?(2)…) where subpattern 2 is defined later cause pcre_compile to search forwards in the pattern for the relevant set of parentheses. This search went wrong when there were unescaped parentheses in a character class, parentheses escaped with \Q…\E, or parentheses in a #-comment in /x mode.
  • “Subroutine” calls and backreferences were previously restricted to referencing subpatterns earlier in the regex. This restriction has now been removed.
  • Added a number of extra features that are going to be in Perl 5.10. On the whole, these are just syntactic alternatives for features that DIRegEx had previously implemented using the Python syntax or my own invention. The other formats are all retained for compatibility.
    • Named groups can now be defined as (?<name>…) or (?'name'…) as well as (?P<name>…). The new forms, as well as being in Perl 5.10, are also .NET compatible.
    • A recursion or subroutine call to a named group can now be defined as (?&name) as well as (?P>name).
    • A backreference to a named group can now be defined as \k<name> or \k'name' as well as (?P=name). The new forms, as well as being in Perl 5.10, are also .NET compatible.
    • A conditional reference to a named group can now use the syntax (?(<name>) or (?('name') as well as (?(name).
    • A “conditional group” of the form (?(DEFINE)…) can be used to define groups (named and numbered) that are never evaluated inline, but can be called as “subroutines” from elsewhere. In effect, the DEFINE condition is always false. There may be only one alternative in such a group.
    • A test for recursion can be given as (?(R1)… or (?(R&name)… as well as the simple (?(R). The condition is true only if the most recent recursion is that of the given number or name. It does not search out through the entire recursion stack.
    • The escape \gN or \g{N} has been added, where N is a positive or negative number, specifying an absolute or relative reference.
  • Updated the Unicode property tables to Unicode version 5.0.0. Amongst other things, this adds five new scripts.
  • Perl ignores orphaned \E escapes completely. DIRegEx now does the same. There were also incompatibilities regarding the handling of \Q..\E inside character classes, for example with patterns like [\Qa\E-\Qz\E] where the hyphen was adjacent to \Q or \E. I hope I've cleared all this up now.
  • Like Perl, DIRegEx detects when an indefinitely repeated parenthesized group matches an empty string, and forcibly breaks the loop. There were bugs in this code in non-simple cases. For a pattern such as ^(a()*)* matched against aaaa the result was just “a” rather than “aaaa”, for example. Two separate and independent bugs (that affected different cases) have been fixed.
  • Implemented PCRE_NEWLINE_ANY and coNewLineAny to recognize any of the Unicode newline sequences as “newline” when processing dot, circumflex, or dollar metacharacters, or #-comments in /x mode.
  • Added \R to match any Unicode newline sequence, as suggested in the Unicode report.
  • For an unanchored pattern, if a match attempt fails at the start of a newline sequence, and the newline setting is CRLF or ANY, and the next two characters are CRLF, advance by two characters instead of one.
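
Two of the items above, the lookahead pattern (?=(\w+))\1: and the automatic “possessification” of a*b, can be tried out with a short sketch. It uses Python's standard re module purely as a stand-in and does not call DIRegEx; the function names are hypothetical. Because possessive quantifiers are not available in every Python version, the possessive form is assumed to be expressible with the same lookahead-plus-backreference trick.

```python
import re

# Assumption: Python's re module is used only for illustration.
# This is not DIRegEx and says nothing about its internals.

# Lookahead plus backreference: behaves like an atomic group around \w+.
ATOMIC_WORD_COLON = re.compile(r'(?=(\w+))\1:')


def find_word_before_colon(text):
    """Return the first 'word:' found anywhere in text, or None."""
    m = ATOMIC_WORD_COLON.search(text)
    return m.group(0) if m else None


# a*b and a possessive-style equivalent (a* may not give characters back).
GREEDY = re.compile(r'a*b')
POSSESSIVE_STYLE = re.compile(r'(?=(a*))\1b')


def same_matches(subject):
    """True if both patterns find identical match spans in subject."""
    greedy = [m.span() for m in GREEDY.finditer(subject)]
    possessive = [m.span() for m in POSSESSIVE_STYLE.finditer(subject)]
    return greedy == possessive


if __name__ == '__main__':
    # Unanchored search: the match starts at "bar", not at a colon.
    print(find_word_before_colon('foo bar: baz'))
    # b can never match a, so backtracking into a* cannot help.
    print(same_matches('aaab ab b xaab'))
    # Counter-example: a*a needs backtracking, the possessive style fails.
    print(re.search(r'a*a', 'aaa'), re.search(r'(?=(a*))\1a', 'aaa'))
```

The counter-example shows why the optimization is limited to cases where the following item cannot match the quantified one: with a*a, taking away the ability to backtrack changes the result.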
products/regex/history.txt · Last modified: 2023/11/23 10:18 by 127.0.0.1