|
Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and THEN.
(*ACCEPT) was not working when inside an atomic group.
Inside a character class, \R and \X were always treated as literals, whereas Perl faults them if its -w option is set. Changed so that they fault when coExtra is set.
Added support for \N which always matches any character other than newline. (It is the same as "." when coDotAll is not set.)
Added four artificial Unicode properties to help with an option to make \s etc use properties. The new properties are: Xan (alphanumeric), Xsp (Perl space), Xps (POSIX space), and Xwd (word).
Added coUCP to make \b, \d, \s, \w, and certain POSIX character classes use Unicode properties. (*UCP) at the start of a pattern can be used to set this option.
In coUtf8 mode, if a pattern that was compiled with coCaseLess was studied, and the match started with a letter with a code point greater than 127 whose first byte was different to the first byte of the other case of the letter, the other case of this starting letter was not recognized.
TDIRegEx.Study now recognizes \h, \v, and \R when constructing a bit map of possible starting bytes for non-anchored patterns.
Extended the "auto-possessify" recognition during pattern compilation. Now \R and a number of cases that involve Unicode properties are recognized, both explicit and implicit when coUCP is set.
Fix a Study problem in UTF-8 mode if a pattern starts with certain non-ASCII characters.
A pattern such as (?&t)(?#()(?(DEFINE)(?<t>a)) which has a forward reference to a subpattern the other side of a comment that contains an opening parenthesis caused either an internal compiling error, or a reference to the wrong subpattern.
The Unicode data tables have been updated to Unicode 5.2.0.
A pattern such as (?&t)*+(?(DEFINE)(?<t>.)) which has a possessive quantifier applied to a forward-referencing subroutine call, could compile incorrect code or give the error "internal error: previously-checked referenced subpattern not found".
Fixed possible memory access outside allocated memory.
Hold memory texts as one long string to avoid too much relocation at load time.
Fix for \K giving a compile-time error if it appeared in a lookbehind assertion.
\K was not working if it appeared in an atomic group or in a group that was called as a "subroutine", or in an assertion. Perl 5.11 documents that \K is "not well defined" if used in an assertion. DIRegEx now accepts it if the assertion is positive, but not if it is negative.
A pattern such as (?P<L1>(?P<L2>0)|(?P>L2)(?P>L1)) in which the only other item in the branch that calls a recursion is a subroutine call – as in the second branch in the above example – was incorrectly given the compile-time error "recursive call could loop indefinitely" because pcre_compile was not correctly checking the subroutine for matching a non-empty string.
Completely revised the help generator to ease navigation and improve readability. Send your feedback!
A pattern such as ^(?!a(*SKIP)b) where a negative assertion contained one of the verbs SKIP, PRUNE, or COMMIT, did not work correctly. When the assertion pattern did not match (meaning that the assertion was true), it was incorrectly treated as false if the SKIP had been reached during the matching. This also applied to assertions used as conditions.
If an item that is not supported by pcre_dfa_exec() was encountered in an assertion subpattern, including such a pattern used as a condition, unpredictable results occurred, instead of the error return PCRE_ERROR_DFA_UITEM.
A subtle bug concerned with back references has been fixed by a change of specification, with a corresponding code fix. A pattern such as ^(xa|=?\1a)+$ which contains a back reference inside the group to which it refers, was giving matches when it shouldn't. For example, xa=xaaa would match that pattern. Interestingly, Perl (at least up to 5.11.3) has the same bug. Such groups have to be quantified to be useful, or contained inside another quantified group. (If there's no repetition, the reference can never match.) The problem arises because, having left the group and moved on to the rest of the pattern, a later failure that backtracks into the group uses the captured value from the final iteration of the group rather than the correct earlier one. This is now fixed by forcing any group that contains a reference to itself to be an atomic group; that is, there cannot be any backtracking into it once it has completed. This is similar to recursive and subroutine calls.
If a pattern contained a conditional subpattern with only one branch (in particular, this includes all (DEFINE) patterns), studying this pattern computed the wrong minimum data length and resulted in matching failures.
For patterns such as (?i)a(?-i)b|c where an option setting at the start of the pattern is reset in the first branch, compilation failed with "internal error: code overflow at offset…". This happened only when the reset was to the original external option setting.
Change published TDIRegEx.MatchPattern property back to AnsiString. This type change was unfortunately required to fix a Delphi 2009 / 2010 RawByteString streaming problem.
Add new public TDIRegEx.MatchPatternRaw: RawByteString property to allow Unicode Delphis to set the MatchPattern without automatic codepage conversion. This is now the recommended MatchPattern runtime property.
Improve Unicode support in DIRegEx_MaskControls.pas. TDIRegExMaskEdit and TDIRegExMaskComboBox now automatically encode text to UTF-8 when their RegEx component is in UTF-8 mode.
The maximum size of a compiled regular expression is now 16 MB. This should please users who had hit the old 64 KB limit.
A UTF-8 pattern such as \x{123}{2,2}+ was incorrectly compiled; the trigger was a minimum greater than 1 for a wide character in a possessive repetition. The same bug could also affect UTF-8 patterns like (\x{ff}{0,2})* which had an unlimited repeat of a nested, fixed maximum repeat of a wide character. Chaos in the form of incorrect output or a compiling loop could result.
The restrictions on what a pattern can contain when partial matching is requested for pcre_exec() have been removed. All patterns can now be partially matched by this function. In addition, if there are at least two slots in the offset vector, the offset of the earliest inspected character for the match and the offset of the end of the subject are set in them when PCRE_ERROR_PARTIAL is returned.
Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is synonymous with PCRE_PARTIAL, for backwards compatibility, and PCRE_PARTIAL_HARD, which causes a partial match to supersede a full match, and may be more useful for multi-segment matching.
Partial matching with pcre_exec() is now more intuitive. A partial match used to be given if ever the end of the subject was reached; now it is given only if matching could not proceed because another character was needed. This makes a difference in some odd cases such as Z(*FAIL) with the string "Z", which now yields "no match" instead of "partial match". In the case of pcre_dfa_exec(), "no match" is given if every matching path for the final character ended with (*FAIL).
Restarting a match using pcre_dfa_exec() after a partial match did not work if the pattern had a "must contain" character that was already found in the earlier partial match, unless partial matching was again requested. For example, with the pattern dog.(body)?, the "must contain" character is "g". If the first part-match was for the string "dog", restarting with "sbody" failed. This bug has been fixed.
The string returned by pcre_dfa_exec() after a partial match has been changed so that it starts at the first inspected character rather than the first character of the match. This makes a difference only if the pattern starts with a lookbehind assertion or \b or \B (\K is not supported by pcre_dfa_exec()). It's an incompatible change, but it was required to make it compatible with pcre_exec().
If an odd number of negated classes containing just a single character were interposed, within parentheses, between a forward reference to a named subpattern and the definition of the subpattern, compilation crashed with an internal error, complaining that it could not find the referenced subpattern. An example of a crashing pattern is (?&A)(([^m])(?<A>)).
Added moNotEmptyAtStart which makes it possible to have an empty string match not at the start, even when the pattern is anchored.
If the maximum number of capturing subpatterns in a recursion was greater than the maximum at the outer level, the higher number was returned, but with unset values at the outer level. The correct (outer level) value is now given.
If (*ACCEPT) appeared inside capturing parentheses, previous releases did not set those parentheses. The string so far is now captured, making this feature compatible with Perl.
DIRegEx now allows subroutine calls in lookbehinds, as long as the subroutine pattern matches a fixed length string. Recursion is not allowed.
The minimum length of subject string that was needed in order to match a given pattern is now provided. This code has now been added to pcre_study(); it finds a lower bound to the length of subject needed. It is not necessarily the greatest lower bound, but using it to avoid searching strings that are too short does give some useful speed-ups. The value is available to calling programs via pcre_fullinfo().
If (?| is used to create subpatterns with duplicate numbers, they are now allowed to have the same name, even if PCRE_DUPNAMES is not set. However, on the other side of the coin, they are no longer allowed to have different names, because these cannot be distinguished.
When duplicate subpattern names are present (necessarily with different numbers), and a test is made by name in a conditional pattern, either for a subpattern having been matched, or for recursion in such a pattern, all the associated numbered subpatterns are tested, and the overall condition is true if the condition is true for any one of them. This is the way Perl works, and is also more like the way testing by number works.
The pattern (?(?=.*b)b|^) was incorrectly compiled as "match must be at start or after a newline", because the conditional assertion was not being correctly handled. The rule now is that both the assertion and what follows in the first alternative must satisfy the test.
If auto-callout was enabled in a pattern with a conditional group whose condition was an assertion, DIRegEx could crash during matching, both with pcre_exec() and pcre_dfa_exec().
The PCRE_DOLLAR_ENDONLY option was not working when pcre_dfa_exec() was used for matching.
Unicode property support in character classes was not working for characters (bytes) greater than 127 when not in UTF-8 mode.
Added the PCRE_NO_START_OPTIMIZE match-time option.
A conditional group that had only one branch was not being correctly recognized as an item that could match an empty string. This meant that an enclosing group might also not be so recognized, causing infinite looping (and probably a segfault) for patterns such as ^"((?(?=[a])[^"])|b)*"$ with the subject "ab", where knowledge that the repeated group can match nothing is needed in order to break the loop.
If a pattern that was compiled with callouts was matched using pcre_dfa_exec(), but without supplying a callout function, matching went wrong.
If PCRE_ERROR_MATCHLIMIT occurred during a recursion, there was a memory leak if the size of the offset vector was greater than 30. When the vector is smaller, the saved offsets during recursion go onto a local stack vector, but for larger vectors malloc() is used. It was failing to free when the recursion yielded PCRE_ERROR_MATCHLIMIT (or any other "abnormal" error, in fact).
Forward references, both numeric and by name, in patterns that made use of duplicate group numbers, could behave incorrectly or give incorrect errors, because when scanning forward to find the reference group, PCRE was not taking into account the duplicate group numbers. A pattern such as ^X(?3)(a)(?|(b)|(q))(Y) is an example.
Added support for (*UTF8) at the start of a pattern.
Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
Delphi 2009 support.
Fix an expression study bug when a pattern contained a group with a zero quantifier.
Optimize Unicode Character Property searching, giving speed ups of 2 to 5 times on some simple patterns.
Updated the Unicode data tables to Unicode 5.1.0. This adds yet more scripts.
Fix caseless matching for non-ASCII characters in back references.
Fix overwriting or crash if the start of a pattern had top-level alternatives.
Fix a few cases where matching could read past the end of the subject.
Fix lazy quantifiers which were not working in some cases in UTF-8 mode.
Improve compatibility for parallel installation with other DI packages.
Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n', which, however, unlike Perl's \g{…}, are subroutine calls, not back references. DIRegEx supports relative numbers with this syntax.
Previously, a group with a zero repeat such as (…){0} was completely omitted from the compiled regex. However, this meant that if the group was called as a subroutine from elsewhere in the pattern, things went wrong (an internal error was given). Such groups are now left in the compiled pattern, with a new opcode that causes them to be skipped at execution time.
Added the PCRE_JAVASCRIPT_COMPAT option. This makes the following changes to the way DIRegEx behaves:
A lone ] character is disallowed (Perl treats it as data).
A back reference to an unmatched subpattern matches an empty string (Perl fails the current match path).
A data ] in a character class must be notated as \] because if the first data character in a class is ], it defines an empty class. (In Perl it is not possible to have an empty class.) The empty class [] never matches; it forces failure and is equivalent to (*FAIL) or (?!). The negative empty class [^] matches any one character, independently of the DOTALL setting.
A pattern such as /(?2)[]a()b](abc)/ which had a forward reference to a non-existent subpattern following a character class starting with ']' and containing () gave an internal compiling error instead of "reference to non-existent subpattern". This is now corrected.
Accept (*FAIL) for DFA matching.
DIRegEx 4.6 failed to update the internal PCRE version number.
Fixed a problem with TDIRegEx.Format and duplicate substring names.
Removed conditional directives from DIRegEx_Workbench_Form.pas which caused problems to some Delphi versions.
$(PRODUCT_NAME_VERSION) is mainly a bug-fix release:
Negative specials like \S did not work in character classes in UTF-8 mode. Characters greater than 255 were excluded from the class instead of being included. The same bug also applied to negated POSIX classes such as [:^space:].
The construct (?&) was not diagnosed as a syntax error (it referenced the first named subpattern) and a construct such as (?&a) would reference the first named subpattern whose name started with "a" (in other words, the length check was missing). Both these problems are fixed. "Subpattern name expected" is now given for (?&) (a zero-length name), and this patch also makes it give the same error for \k'' (previously it complained that that was a reference to a non-existent subpattern).
The erroneous patterns (?+-a) and (?-+a) give different error messages; this is right because (?- can be followed by option settings as well as by digits. I have, however, made the messages clearer.
Patterns such as (?(1)a|b) (a pattern that contains fewer subpatterns than the number used in the conditional) now cause a compile-time error. This is actually not compatible with Perl, which accepts such patterns, but treats the conditional as always being FALSE (as DIRegEx used to), but it seems that giving a diagnostic is better.
Correct some Unicode character properties which were in the wrong script.
The pattern (?=something)(?R) was not being diagnosed as a potentially infinitely looping recursion. The bug was that positive lookaheads were not being skipped when checking for a possible empty match (negative lookaheads and both kinds of lookbehind were skipped).
Specifying a possessive quantifier with a specific limit for a Unicode character property caused pcre_compile() to compile bad code, which led at runtime to PCRE_ERROR_INTERNAL (-14). Examples of patterns that caused this are: '\p{Zl}{2,3}+' and '\p{Cc}{2}+'. It was the possessive "+" that caused the error; without that there was no problem.
In UTF-8 mode, with newline set to "any", a pattern such as .*a.*=.b.* crashed when matching a string such as a\x{2029}b (note that \x{2029} is a UTF-8 newline character). The key issue is that the pattern starts .*; this means that the match must be either at the beginning, or after a newline. The bug was in the code for advancing after a failed match and checking that the new position followed a newline. It was not taking account of UTF-8 characters correctly.
DIRegEx was behaving differently from Perl in the way it recognized POSIX character classes. DIRegEx was not treating the sequence [:…:] as a character class unless the … were all letters. Perl, however, seems to allow any characters between [: and :], though of course it rejects as unknown any "names" that contain non-letters, because all the known class names consist only of letters. Thus, Perl gives an error for [[:1234:]], for example, whereas DIRegEx did not - it did not recognize a POSIX character class. This seemed a bit dangerous, so the code has been changed to be closer to Perl. The behaviour is not identical to Perl, because DIRegEx will diagnose an unknown class for, for example, [[:l\ower:]] where Perl will treat it as [[:lower:]]. However, DIRegEx does now give "unknown" errors where Perl does, and where it didn't before.
Correct a potential one byte overflow by ansi_mbtowc and oem_mbtowc in DIRegEx_SearchStream.pas.
Extend TDIRegEx.MatchNext to match empty result strings. The new algorithm detects potential infinite loops and advances the search position as necessary.
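The loop-avoidance idea can be sketched with Python's standard `re` module. This is only an illustration of the general technique, not DIRegEx's actual implementation; the helper name `iter_matches` is hypothetical.

```python
import re

def iter_matches(pattern, subject):
    """Yield (start, end) spans, including empty matches, without looping forever."""
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        yield m.span()
        # An empty match would be found again at the same position,
        # so step one character forward to guarantee progress.
        pos = m.end() + 1 if m.end() == m.start() else m.end()

# a* matches the empty string before "b", then "aa", then empty twice more.
spans = list(iter_matches(r"a*", "baac"))
```

Advancing by one position only after an empty match keeps every non-empty match intact while still guaranteeing that the search terminates.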
Do not count [\s] as an explicit reference to CR or LF. So now DIRegEx will match single CR and LF only if the pattern contains \r or \n (or a literal CR or LF).
The appearance of (?J) was not reflected by the PCRE_INFO_JCHANGED facility.
Added options (at compile time and exec time) to change \R from matching any Unicode line ending sequence to just matching CR, LF, or CRLF.
The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed when the subject happened to end in the byte 0x85 (e.g. if the last character was \x{1ec5}). *Character* 0x85 is one of the "any" newline characters but of course it shouldn't be taken as a newline when it is part of another character. The bug was that, for an unlimited repeat of . in not-DOTALL UTF-8 mode, DIRegEx was advancing by bytes rather than by characters when looking for a newline.
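A small Python sketch of why scanning by bytes goes wrong here. The two helper functions are hypothetical and only contrast the two approaches; they are not taken from the library.

```python
NEL_BYTE = 0x85  # the byte value that caused the false newline

def has_newline_by_bytes(data: bytes) -> bool:
    # Buggy approach: any 0x85 byte is taken for a newline,
    # even when it is a continuation byte of another character.
    return NEL_BYTE in data

def has_newline_by_chars(data: bytes) -> bool:
    # Character-aware approach: decode first, then look for U+0085.
    return "\u0085" in data.decode("utf-8")

# U+1EC5 is encoded as E1 BB 85, so its last byte is 0x85.
subject = "\u1ec5".encode("utf-8")
```

Here `has_newline_by_bytes(subject)` reports a newline that `has_newline_by_chars(subject)` correctly does not.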
A small performance improvement in the DOTALL UTF-8 mode .* case.
Remove the explicit limit of non-capturing parentheses at the expense of using more stack.
Remove the artificial limitation on group length – now there is only the limit on the total length of the compiled pattern, which is set at 65535.
Because Perl interprets \Q…\E at a high level, and ignores orphan \E instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error, because the ] is interpreted as the first data character and the terminating ] is not found. DIRegEx has been made compatible with Perl in this regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could cause memory overwriting.
Like Perl, DIRegEx automatically breaks an unlimited repeat after an empty string has been matched (to stop an infinite loop). It was not recognizing a conditional subpattern that could match an empty string if that subpattern was within another subpattern. For example, it looped when trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the condition was not nested. This bug has been fixed.
A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack past the start of the subject in the presence of bytes with the top bit set, for example "\x8aBCD".
Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).
Optimized (?!) to (*FAIL).
Updated the test for a valid UTF-8 string to conform to the later RFC 3629. This restricts code points to be within the range 0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF. Previously, DIRegEx allowed the full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still does: it's just the validity check that is more restrictive.
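The range rule stated above can be written as a short Python check. The function name is hypothetical, and the sketch covers only the code point range, not the byte-level encoding checks.

```python
def is_valid_code_point(cp: int) -> bool:
    """True if cp may appear in a UTF-8 string under the RFC 3629 rule."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # 0xD800 to 0xDFFF is reserved for UTF-16 surrogates and is excluded.
    return not 0xD800 <= cp <= 0xDFFF
```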
Inserted checks for integer overflows during escape sequence (backslash) processing, and also fixed erroneous offset values for syntax errors during backslash processing.
Fixed another case of looking too far back in non-UTF-8 mode for patterns like [\PPP\x8a]{1,}\x80 with the subject "A\x80".
An unterminated class in a pattern like (?1)\c[ with a "forward reference" caused an overrun.
A pattern like (?:[\PPa*]*){8,} which had an "extended class" (one with something other than just ASCII characters) inside a group that had an unlimited repeat caused a loop at compile time (while checking to see whether the group could match an empty string).
An orphan \E inside a character class could cause a crash.
A repeated capturing bracket such as (A)? could cause a wild memory reference during compilation.
There are several functions in pcre_compile() that scan along a compiled expression for various reasons (e.g. to see if it's fixed length for look behind). There were bugs in these functions when a repeated \p or \P was present in the pattern. These operators have additional parameters compared with \d, etc, and these were not being taken into account when moving along the compiled data. Specifically:
An item such as \p{Yi}{3} in a lookbehind was not treated as fixed length.
An item such as \pL+ within a repeated group could cause crashes or loops.
A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect "reference to non-existent subpattern" error.
A pattern like (\P{Yi}{2}\277)? could loop at compile time.
A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte characters were involved (for example /\S{2}/8g with "A\x{a3}BC").
Patterns such as [\P{Yi}A] which include \p or \P and just one other character were causing crashes (broken optimization).
Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing \p or \P) caused a compile-time loop.
More problems have arisen in unanchored patterns when CRLF is a valid line break. For example, the unstudied pattern [\r\n]A does not match the string "\r\nA". However, the pattern \nA *does* match, because it doesn't start till \n, and if [\r\n]A is studied, the same is true. There doesn't seem any very clean way out of this, but to make sense for the common cases DIRegEx now takes note of whether there can be an explicit match for \r or \n anywhere in the pattern, and if so, does not advance CRLF by two bytes. As part of this change, there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled pattern has explicit CR or LF references.
Added (*CR) etc for changing newline setting at start of pattern.
Fix spelling of DIRegEx_Reg.dcr in the DIRegEx packages which caused a problem during IDE installation.
Documentation updates and fixes.
Add more features from Perl 5.10:
(?-n) (where n is a string of digits) is a relative subroutine or recursion call. It refers to the nth most recently opened parentheses.
(?+n) is also a relative subroutine call; it refers to the nth next to be opened parentheses.
Conditions that refer to capturing parentheses can be specified relatively, for example, (?(-2)… or (?(+3)…
\K resets the start of the current match so that everything before is not part of it.
\k{name} is synonymous with \k<name> and \k'name' (.NET compatible).
\g{name} is another synonym - part of Perl 5.10's unification of reference syntax.
(?| introduces a group in which the numbering of parentheses in each alternative starts with the same number.
\h, \H, \v, and \V match horizontal and vertical whitespace.
Fix: Matching a pattern such as (.*(.)?)* failed by either not terminating or by crashing.
Fix: A pattern with a very large number of alternatives (more than several hundred) was running out of internal workspace during the pre-compile phase. A bit of new cunning has reduced the workspace needed for groups with alternatives. The 1000-alternative test pattern now uses 12 bytes of workspace instead of running out of the 4096 that are available.
Fix: If \p or \P was used in non-UTF-8 mode on a character greater than 127 it matched the wrong number of bytes.
Added new method TDIRegEx.SubStrPtr.
Added two new info methods to TDIRegEx: InfoOkPartial and InfoJChanged.
Speed up performance of TDIRegEx.CompiledRegExpArray.
Added new menu entry to the DIRegEx Workbench to copy the pattern as a well formatted Pascal string.
Added a new demo DIRegEx_PreCompiled_Pattern.dpr which shows how to use precompiled regular expressions.
Delphi 2007 support.
Added coNewLineAnyCrLf which is like coNewLineAny, but matches only CR, LF, or CRLF as a newline sequence. The match-option equivalent is moNewLineAnyCrLf. Only a single newline option may be set at the same time. Invalid combinations of newline options will raise an exception.
There is a new example project demonstrating the usage of these new classes.
Fixed a fairly obscure bug concerned with quantified caseless matching with Unicode property support: For a maximizing quantifier, if the two different cases of the character were of different lengths in their UTF-8 codings, and the matching function had to back up over a mixture of the two cases, it incorrectly assumed they were both the same length.
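As an illustration of cases with different UTF-8 lengths, the Python sketch below uses the Kelvin sign, whose lowercase form is a plain "k". The choice of character and the helper `prev_char_start` are assumptions made for the example; they are not taken from the library's code.

```python
upper = "\u212a"        # KELVIN SIGN, three bytes in UTF-8
lower = upper.lower()   # 'k', one byte in UTF-8
lengths = (len(upper.encode("utf-8")), len(lower.encode("utf-8")))

def prev_char_start(data: bytes, pos: int) -> int:
    """Return the start offset of the character that ends just before pos."""
    pos -= 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

# A subject mixing both cases: backing up must step 1 byte, then 3 bytes.
data = ("K" + upper + "k").encode("utf-8")
```

Backing up one character at a time by inspecting the bytes, rather than assuming a fixed length, handles the mixture correctly.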
In multiline mode when the newline sequence was set to "any", the pattern ^$ would give a match between the CR and LF of a subject such as 'A'#13#10'B'. This doesn't seem right; it now treats the CRLF combination as the line ending, and so does not match in that case. It's only a pattern such as ^$ that would hit this one: something like ^ABC$ would have failed after CR and then tried again after CRLF.
SubStrCount returns the actual count of captured substrings, even for descendent classes. Fixed a problem where the wrong value was returned for TDIDfaRegEx. Likewise improved the regular expression workbench.
Fixed TDIRegExInspector to handle Windows XP themes.
Added XP Theme support to the GUI demo projects. Also increased the demo projects' maximum stack size to {$MAXSTACKSIZE $00200000} in order to reduce the potential of stack overflow when matching very demanding regular expressions.
New and improved List2 and Replace2 functions: They are different from the old List and Replace in that they return the number of matches listed / replaced and also work on empty matches. This can be useful for replacing empty lines, for example.
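The empty-line use case can be shown with Python's `re.subn`, which likewise returns the replacement count and acts on empty matches. This is an analogy only; it does not show the List2 or Replace2 signatures.

```python
import re

text = "a\n\nb\n\nc"
# In multiline mode, ^$ matches the empty string on each empty line.
result, count = re.subn(r"(?m)^$", "<empty>", text)
```

After this runs, `result` has both empty lines filled in and `count` is the number of replacements made.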
In response to the growing importance of Unicode, the default character set for caseless matching and character classes is now Latin 1, a subset of Unicode. Use the poUserLocale Option if you are matching ANSI strings in the user's default locale.
Major re-factoring of the way pcre_compile computes the amount of memory needed for a compiled pattern. It now runs the real compile function in a "fake" mode that enables it to compute how much memory it would need, while actually only ever using a few hundred bytes of working memory and without too many tests of the mode. A side effect of this work is that the limit of 200 on the nesting depth of parentheses has been removed. However, there is a downside: pcre_compile now runs more slowly than before (30% or more, depending on the pattern). There is no effect on runtime performance.
Extended pcre_study to be more clever in cases where a branch of a subpattern has no definite first character. For example, (a*|b*)[cd] would previously give no result from pcre_study. Now it recognizes that the first character must be a, b, c, or d.
There was an incorrect error "recursive call could loop indefinitely" if a subpattern (or the entire pattern) that was being tested for matching an empty string contained only one non-empty item after a nested subpattern.
A new optimization is now able automatically to treat some sequences such as a*b as a*+b. More specifically, if something simple (such as a character or a simple class like \d) has an unlimited quantifier, and is followed by something that cannot possibly match the quantified thing, the quantifier is automatically "possessified".
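Why this is safe for a*b can be checked by brute force in Python. The sketch emulates a possessive a* with a lookahead plus back reference (so it runs on Python versions without possessive quantifiers) and compares it with the plain pattern on every short string over "abc". It illustrates the equivalence only, not how the optimization is implemented.

```python
import re
from itertools import product

PLAIN = re.compile(r"a*b")
# Lookahead plus back reference emulates an atomic (possessive) a*.
ATOMIC = re.compile(r"(?=(a*))\1b")

def span_or_none(regex, s):
    m = regex.search(s)
    return m.span() if m else None

# All strings over {a, b, c} with length 0 to 5.
subjects = ["".join(p) for n in range(6) for p in product("abc", repeat=n)]
mismatches = [s for s in subjects
              if span_or_none(PLAIN, s) != span_or_none(ATOMIC, s)]

# Contrast: when the next item CAN match "a", possessifying changes the result.
plain_aa = re.search(r"a*a", "aa")
atomic_aa = re.search(r"(?=(a*))\1a", "aa")
```

The list `mismatches` stays empty because "b" can never match what a* consumed, whereas a*a needs backtracking and so must not be possessified.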
A recursive reference to a subpattern whose number was greater than 39 went wrong under certain circumstances in UTF-8 mode. This bug could also have affected the operation of pcre_study.
Possessive quantifiers such as a++ were previously implemented by turning them into atomic groups such as (?>a+). Now they have their own opcodes, which improves performance. This includes the automatically created ones from above.
A pattern such as (?=(\w+))\1: which simulates an atomic group using a lookahead was broken if it was not anchored. DIRegEx was mistakenly expecting the first matched character to be a colon. This applied both to named and numbered groups.
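The intended behaviour of this idiom on an unanchored subject can be seen in Python's `re` module, which accepts the same syntax. The sample subject is made up for illustration.

```python
import re

# The lookahead captures a whole word; \1 then consumes it without
# allowing any backtracking into the word.
ATOMIC_WORD = re.compile(r"(?=(\w+))\1:")

m = ATOMIC_WORD.search("foo bar: baz")
```

The match starts at "bar", a word character, not at the colon, which is the case the fix addresses.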
Forward references to subpatterns in conditions such as (?(2)…) where subpattern 2 is defined later cause pcre_compile to search forwards in the pattern for the relevant set of parentheses. This search went wrong when there were unescaped parentheses in a character class, parentheses escaped with \Q…\E, or parentheses in a #-comment in /x mode.
"Subroutine" calls and backreferences were previously restricted to referencing subpatterns earlier in the regex. This restriction has now been removed.
Added a number of extra features that are going to be in Perl 5.10. On the whole, these are just syntactic alternatives for features that DIRegEx had previously implemented using the Python syntax or my own invention. The other formats are all retained for compatibility.
Named groups can now be defined as (?<name>…) or (?'name'…) as well as (?P<name>…). The new forms, as well as being in Perl 5.10, are also .NET compatible.
A recursion or subroutine call to a named group can now be defined as (?&name) as well as (?P>name).
A backreference to a named group can now be defined as \k<name> or \k'name' as well as (?P=name). The new forms, as well as being in Perl 5.10, are also .NET compatible.
A conditional reference to a named group can now use the syntax (?(<name>) or (?('name') as well as (?(name).
A "conditional group" of the form (?(DEFINE)…) can be used to define groups (named and numbered) that are never evaluated inline, but can be called as "subroutines" from elsewhere. In effect, the DEFINE condition is always false. There may be only one alternative in such a group.
A test for recursion can be given as (?(R1)… or (?(R&name)… as well as the simple (?(R). The condition is true only if the most recent recursion is that of the given number or name. It does not search out through the entire recursion stack.
The escape \gN or \g{N} has been added, where N is a positive or negative number, specifying an absolute or relative reference.
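The retained Python-style forms can be tried directly in Python's own `re` module, which supports the (?P…) definition, the (?P=name) back reference, and the (?(name)…) condition. The two patterns below are made-up examples, not taken from the DIRegEx test suite.

```python
import re

# Named group plus named back reference: a quoted string whose
# closing quote must equal its opening quote.
QUOTED = re.compile(r"""(?P<q>['"]).*?(?P=q)""")

# Conditional on a named group: a closing parenthesis is required
# only if an opening one was matched.
BRACKETED = re.compile(r"(?P<open>\()?\w+(?(open)\))")

quoted = QUOTED.search("say 'hi' now").group()
```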
Updated the Unicode property tables to Unicode version 5.0.0. Amongst other things, this adds five new scripts.
Perl ignores orphaned \E escapes completely. DIRegEx now does the same. There were also incompatibilities regarding the handling of \Q..\E inside character classes, for example with patterns like [\Qa\E-\Qz\E] where the hyphen was adjacent to \Q or \E. I hope I've cleared all this up now.
Like Perl, DIRegEx detects when an indefinitely repeated parenthesized group matches an empty string, and forcibly breaks the loop. There were bugs in this code in non-simple cases. For a pattern such as ^(a()*)* matched against aaaa the result was just "a" rather than "aaaa", for example. Two separate and independent bugs (that affected different cases) have been fixed.
Implemented PCRE_NEWLINE_ANY and coNewLineAny to recognize any of the Unicode newline sequences as "newline" when processing dot, circumflex, or dollar metacharacters, or #-comments in /x mode.
Added \R to match any Unicode newline sequence, as suggested in the Unicode report.
For an unanchored pattern, if a match attempt fails at the start of a newline sequence, and the newline setting is CRLF or ANY, and the next two characters are CRLF, advance by two characters instead of one.
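The advance rule can be sketched as a small Python function. The name `next_start` and its boolean parameter are hypothetical, and the sketch ignores the later refinement for patterns that explicitly reference CR or LF.

```python
def next_start(subject: str, pos: int, crlf_is_newline: bool) -> int:
    """Return the next position to try after a failed match attempt at pos."""
    if crlf_is_newline and subject.startswith("\r\n", pos):
        return pos + 2   # skip the whole CRLF pair
    return pos + 1       # otherwise advance by a single character
```

With the newline setting CRLF or ANY, a failed attempt at the CR of "a\r\nb" resumes at "b" rather than at the LF.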
products/regex/history.txt · Last modified: 2010/06/26 17:14 (external edit)
|