Delphi Inspiration

YuPcre2: Changes from DIRegEx

YuPcre2 is a new regular expression library for Delphi with Perl syntax. It directly supports UnicodeString, AnsiString, and UCS4String, as well as UTF-8 and UTF-16.

This document describes the differences and similarities between the new YuPcre2 and the old DIRegEx to help convert existing projects. If you have never used DIRegEx, or are starting a new project with YuPcre2, you can skip this document.

YuPcre2 is a new project, not just a drastic update to DIRegEx. A lot has changed, even though some units, classes, and functions carry familiar names. Unfortunately, it was not possible to keep identical identifiers because Delphi rejects them if both YuPcre2 and DIRegEx are installed into the IDE. Overall, DIRegEx names have changed to DIRegEx2 where possible, which should simplify the transition to YuPcre2.

Unit Name Changes

Unit names had to be changed to allow YuPcre2 to be installed into the IDE in parallel with DIRegEx. Unit names start with the YuPcre2 prefix. The native PCRE2 API is in YuPcre2.pas. DIRegEx units with class wrappers and helper routines have been renamed to YuPcre2_RegEx2…:

DIRegEx                     YuPcre2
DIRegEx_Api.pas             YuPcre2.pas
n/a                         YuPcre2OptInfo.pas
DIRegEx_Reg.pas             YuPcre2Reg.pas
DIRegEx.pas                 YuPcre2_RegEx2.pas
DIRegEx_Consts.pas          YuPcre2_RegEx2_Consts.pas
DIRegEx_MaskControls.pas    YuPcre2_RegEx2_MaskControls.pas
DIRegEx_SearchStream.pas    YuPcre2_RegEx2_SearchStream.pas
DIRegEx_Utils.pas           YuPcre2_RegEx2_Utils.pas

Class and Identifier Name Changes

Class names now contain “RegEx2” – the number 2 is appended to “RegEx”. Most members, helper routines, and identifier names are unchanged. Deprecation warnings are issued where appropriate.

DIRegEx                 YuPcre2
TDIPerlRegEx16          TDIPerlRegEx2_16
TDIDfaRegEx16           TDIDfaRegEx2_16
TDIPerlRegEx            TDIPerlRegEx2_8
TDIDfaRegEx             TDIDfaRegEx2_8
TDIRegExMaskEdit        TDIRegEx2MaskEdit
TDIRegExMaskComboBox    TDIRegEx2MaskComboBox

TDIRegEx2Base.CompileOptions is empty by default. In DIRegEx, coCaseLess and coDotAll were set by default. YuPcre2 excludes them for compatibility with PCRE2. If matching relies on these options, set them like this:

{ Set YuPcre2 CompileOptions to DIRegEx default: }
RegEx.CompileOptions := [coCaseLess, coDotAll];

TDIRegEx2Base.BSR and TDIRegEx2Base.NewLine are now properties of their own. In DIRegEx they were part of CompileOptions and MatchOptions. As a consequence, BSR and NewLine options can no longer be passed to CompileMatchPatternStrOpt but must be set beforehand.

PCRE2 Native API Changes

  • Names of the native API functions start with the “pcre2_” prefix. The “_8”, “_16”, and “_32” suffixes denote the width of the function's string code unit in bits.
  • Many names have been changed; in particular, pcre_exec has become pcre2_match. The PCRE_JAVASCRIPT_COMPAT option has been split into independent functional options PCRE2_ALT_BSUX, PCRE2_ALLOW_EMPTY_CLASS, and PCRE2_MATCH_UNSET_BACKREF.
  • Patterns, subject strings, and replacement strings may all contain binary zeros and for this reason are always passed as a pointer and a length. However, the length may be given as PCRE2_ZERO_TERMINATED for zero-terminated strings.
  • The output vector that holds offsets of matched strings is now a vector of PCRE2_SIZE elements instead of Integers. The special value PCRE2_UNSET is used for unset elements.
  • Error handling has been redesigned and error messages are available in all code unit widths. The error codes have been redesignated.
  • Explicit “studying” of compiled patterns has been abolished – it now always happens automatically. JIT compiling is done by calling a new function, pcre2_jit_compile, after a successful return from pcre2_compile.
  • The capture_last field of the pcre2_callout_block is now an unsigned integer, set to zero if there have been no captures.
  • Saving / restoring a compiled pattern is accomplished by a set of serializing functions.
  • There is a new function called pcre2_substitute that performs “find and replace” operations.
  • The PCRE2_NO_DOTSTAR_ANCHOR, PCRE2_NEVER_BACKSLASH_C, and PCRE2_ALT_CIRCUMFLEX options are new.
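
To illustrate the offset vector layout mentioned in the list above, here is a minimal, self-contained sketch that walks a vector of start/end pairs and reports unset groups. It does not call YuPcre2: PCRE2_SIZE and PCRE2_UNSET are declared locally as stand-ins modelled on the C definitions (size_t and ~(PCRE2_SIZE)0), and the vector contents are filled in by hand rather than produced by a real match. In real code, use the declarations from YuPcre2.pas instead.

```pascal
program OvectorSketch;

{$APPTYPE CONSOLE}

type
  { Local stand-in (assumption): in C, PCRE2_SIZE is size_t. }
  PCRE2_SIZE = NativeUInt;

const
  { Local stand-in (assumption): in C, PCRE2_UNSET is ~(PCRE2_SIZE)0. }
  PCRE2_UNSET = High(PCRE2_SIZE);

{ Prints every group of an offset vector laid out as start/end pairs.
  Pair 0 is the whole match, pair N is capture group N. }
procedure PrintGroups(const OVector: array of PCRE2_SIZE);
var
  G: Integer;
begin
  for G := 0 to Length(OVector) div 2 - 1 do
    if OVector[2 * G] = PCRE2_UNSET then
      WriteLn('group ', G, ': unset')
    else
      WriteLn('group ', G, ': ', OVector[2 * G], '..', OVector[2 * G + 1]);
end;

var
  OVector: array[0..5] of PCRE2_SIZE;
begin
  { Hand-filled values: an empty match at offset 0 where group 1 did not
    participate and group 2 captured an empty string. }
  OVector[0] := 0;           OVector[1] := 0;
  OVector[2] := PCRE2_UNSET; OVector[3] := PCRE2_UNSET;
  OVector[4] := 0;           OVector[5] := 0;
  PrintGroups(OVector);
end.
```

Because the elements are unsigned, an unset group can no longer be detected with a "less than zero" test, which worked with the old Integer vector. Compare against PCRE2_UNSET instead.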

PCRE2 Functionality Changes

  • Patterns may start with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) to set the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART options for every subject line that is matched by that pattern.
  • For the benefit of those who use PCRE2 via some other application, that is, not writing the function calls themselves, it is possible to check the PCRE2 version by matching a pattern such as (?(VERSION>=10)yes|no) against a string such as “yesno”.
  • There are case-equivalent Unicode characters whose encodings use different numbers of code units in UTF-8. U+023A and U+2C65 are one example. (It is theoretically possible for this to happen in UTF-16 too.) If a backreference to a group containing one of these characters was greedily repeated, and during the match a backtrack occurred, the subject might be backtracked by the wrong number of code units. For example, if ^(\x{23a})\1*(.) is matched caselessly (and in UTF-8 mode) against \x{23a}\x{2c65}\x{2c65}\x{2c65}, group 2 should capture the final character, which is the three bytes E2, B1, and A5 in UTF-8. Incorrect backtracking meant that group 2 captured only the last two bytes. This bug has been fixed; the new code is slower, but it is used only when the strings matched by the repetition are not all the same length.
  • Update Unicode to 8.0.0.
  • A pattern such as ()a was not setting the “first character must be 'a'” information. This applied to any pattern with a group that matched no characters, for example: (?:(?=.)|(?<!x))a.
  • When an (*ACCEPT) is triggered inside capturing parentheses, it arranges for those parentheses to be closed with whatever has been captured so far. However, it was failing to mark any other groups between the highest capture so far and the current group as “unset”. Thus, the ovector for those groups contained whatever was previously there. An example is the pattern (x)|((*ACCEPT)) when matched against “abcd”.
  • Add the (*NO_JIT) pattern feature.
  • Add callouts with string arguments.
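
The backtracking fix for case-equivalent characters described in the list above depends on U+023A and U+2C65 having different encoded lengths in UTF-8. The following self-contained sketch makes that difference visible. It does not use YuPcre2, and the helper name Utf8EncodedLength is made up for this illustration.

```pascal
program Utf8LengthSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

{ Returns the number of UTF-8 code units (bytes) needed to encode
  a Unicode code point. Hypothetical helper, not part of YuPcre2. }
function Utf8EncodedLength(CodePoint: Integer): Integer;
begin
  if CodePoint < $80 then
    Result := 1
  else if CodePoint < $800 then
    Result := 2
  else if CodePoint < $10000 then
    Result := 3
  else
    Result := 4;
end;

{ Prints one code point together with its UTF-8 length. }
procedure Report(CodePoint: Integer);
begin
  WriteLn('U+', IntToHex(CodePoint, 4), ': ',
    Utf8EncodedLength(CodePoint), ' byte(s)');
end;

begin
  { The two characters match each other caselessly,
    yet differ in encoded length. }
  Report($023A); // 2 bytes: C8 BA
  Report($2C65); // 3 bytes: E2 B1 A5
end.
```

Since \1 can match either a two-byte or a three-byte sequence here, a backtrack out of \1* cannot step back by one fixed width. That is the case the slower code path handles.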
products/pcre2/changes.txt · Last modified: 2016/01/22 15:08 (external edit)