aboutsummaryrefslogtreecommitdiffstats
path: root/erts/emulator/pcre/README.pcre_update.md
diff options
context:
space:
mode:
authorPatrik Nyblom <[email protected]>2013-08-16 11:37:54 +0200
committerPatrik Nyblom <[email protected]>2013-08-16 11:37:54 +0200
commit5d9a587a8fcc164e02f043959338edec2ff69381 (patch)
tree554f05a944777622b30031724010f6f31707565b /erts/emulator/pcre/README.pcre_update.md
parent23610dbfc1c409f83349e9e293dd3cfc1f74d497 (diff)
parent52cb62b7930d9c7b9e04a210ff6b02946f27ae79 (diff)
downloadotp-5d9a587a8fcc164e02f043959338edec2ff69381.tar.gz
otp-5d9a587a8fcc164e02f043959338edec2ff69381.tar.bz2
otp-5d9a587a8fcc164e02f043959338edec2ff69381.zip
Merge branch 'pan/update_pcre_8.33'
* pan/update_pcre_8.33: Workaround TR gnu/181328, GCC 4.2.1 20070831 on FreeBSD 9.1 Clarify relation between erts_iolist_{size|to_buf} Fix backslash in titles of manpages Correct UTF-8 in stdlib's notes.xml Add more tests for corner error cases in erl_bif_re.c Add documentation of report_errors and match_limit(_recursion) Add match_limit and match_limit_recursion options Add return_errors option to re:run/3 Add README for updating PCRE Add documentation of extensions to re module Add new options to Erlang re interface and mend dupnames Update PCRE doc part of re.xml to PCRE 8.33 state Integrate new PCRE test suites Integrate patch for PCRE bug id 1370 Handle CRLF correctly in global regexp Add erts_prefix to pcre_library and update erl_bif_re Update to PCRE 8.33, w/o the erts_ prefix added OTP-11204 OTP-11205 OTP-10285
Diffstat (limited to 'erts/emulator/pcre/README.pcre_update.md')
-rw-r--r--erts/emulator/pcre/README.pcre_update.md733
1 files changed, 733 insertions, 0 deletions
diff --git a/erts/emulator/pcre/README.pcre_update.md b/erts/emulator/pcre/README.pcre_update.md
new file mode 100644
index 0000000000..5f2414f0d9
--- /dev/null
+++ b/erts/emulator/pcre/README.pcre_update.md
@@ -0,0 +1,733 @@
+# How to update the PCRE version used by Erlang
+
+## The basic changes to the PCRE library
+
+To work with the Erlang VM, PCRE has been changed in two important ways:
+
+1. The main execution machine in pcre\_exec has been modified so that
+matching can be interrupted and restarted. This functionality utilizes
+the code that implements recursion by allocating explicit
+"stack-frames" in heap space, which basically means that all local
+variables in the loop are part of a struct which is kept in "malloced"
+memory on the heap and there are no real stack variables that need to
+be pushed on the C stack in the case of recursive calls. This is a
+technique we also use inside the VM to avoid building large C
+stacks. In PCRE this is enabled by the NO\_RECURSE define, so that is a
+prerequisite for the ERLANG\_INTEGRATION define which also adds labels
+at restart points and counts "reductions".
+
+2. All visible symbols in PCRE gets the erts_ prefix, so that NIF's
+and such using a "real" pcre library does not get confused (or 're'
+gets confused when a "real" pcre library get's loaded into the VM
+process).
+
+3. All irrelevant functionality has been stripped from the library,
+which means for example UTF16 support, jit, DFA execution
+etc. Basically the source files handling this are removed, together
+with any build support from the PCRE project. We have our own
+makefiles etc.
+
+## Setting up an environment for the work
+
+I work with four temporary directories when doing this (the examples
+are from the updating of pcre-7.6 to pcre-8.33);
+
+ ~/tmp/pcre> ls
+ epcre-7.6 epcre-8.33 pcre-7.6 pcre-8.33
+
+I've unpacked the plain pcre sources in pcre-* and will work with our
+patched sources in the epcre-* directories.
+
+Make sure your ERL_TOP contains a *built* version of Erlang (and you have made a branch)
+
+First unpack the pcre libraries (which will create the pcre-*
+directories) and then copy our code to the old epcre directory:
+
+ ~/tmp/pcre> tar jxf $ERL_TOP/erts/emulator/pcre/pcre-7.6.tar.bz2
+ ~/tmp/pcre> tar jxf ~/Downloads/pcre-8.33.tar.bz2
+ ~/tmp/pcre> mkdir epcre-7.6 epcre-8.33
+ ~/tmp/pcre> cd epcre-7.6/
+ ~/tmp/pcre/epcre-7.6> cp -r $ERL_TOP/erts/emulator/pcre/* .
+ ~/tmp/pcre/epcre-7.6> rm pcre-7.6.tar.bz2
+
+Leave the obj directory, you may need the libepcre.a file...
+
+If you find it easier, you can revert the commit in GIT that adds the
+erts_ prefix to the previous version before continuing work, but as
+that is a quite small diff in newer versions of PCRE, it is probably
+not worth it. Still, you will find the erts_ prefix being a separate
+commit when integrating 8.33, so if you're nice, you will do the same
+for the person coming after you...
+
+## Generating a diff for our changes to PCRE
+
+Before you generate a diff (that, in an ideal world, would be used to
+automatically patch the newer version of pcre, which will probably
+only work for minor PCRE updates), we need to configure the old pcre.
+
+ ~/tmp/pcre/epcre-7.6> cd ../pcre-7.6
+ ~/tmp/pcre/pcre-7.6> ./configure --enable-utf8 --enable-unicode-properties --disable-shared --disable-stack-for-recursion
+
+Note that for newer versions, the configure flag '--enable-utf8'
+should be replaced with '--enable-utf'
+
+So we now generate a diff:
+
+ ~/tmp/pcre/pcre-7.6> cd ../epcre-7.6
+ ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done ) > ../epcre-7.6_clean.diff
+
+### What the diff means
+
+Let's now walk through the relevant parts of the diff. Some of the
+differences might come from patches that probably are already in the
+new version, For example in out 7.6, we had a security patch which
+added the define WORK_SIZE_CHECK and used it in some places. Those can
+probably safely be ignored, but to be on the safe side, check what's
+already integrated in the new version.
+
+The interesting part is in pcre_exec.c. You will see things like
+
+ #ifdef ERLANG_INTEGRATION
+ ...
+ #endif
+
+or
+
+ #if defined(ERLANG_INTEGRATION)
+ ...
+ #endif
+
+and a lot of
+
+ COST_CHK(1);
+
+or
+
+ COST(min);
+
+and
+
+ /* LOOP_COUNT: Ok */
+ /* LOOP_COUNT: CHK */
+ /* LOOP_COUNT: COST */
+
+
+scattered over the main loop. Those mean the following:
+
+* COST(int x) - consume reductions proportional to the integer
+ parameter, but no need for interruption here (it's like
+ bump_reductions without trapping). The loop they apply to also has a
+ 'LOOP_COUNT: COST' comment at it's head.
+
+* COST\_CHK(int x) - like COST(x), but also check that the reduction
+counter does not reach zero. If it does, leave the execution loop to
+be restarted at a later point. No real stack variables can be live
+here. Note that variables like 'max' and 'min' are *not* real stack
+variables, the NO\_RECURSION setting has taken care of that. 'i' is a
+stack variable that's explicitly saved when trapping, so that will
+also be correct when returning from a trap. So will 'c', 'rrc' and
+flags like 'utf8', 'minimize' and 'posessive'. Those can also be
+regarded as "non C-stack variables". The loop where they reside also
+has a 'LOOP\_COUNT: CHK' comment.
+
+* /* LOOP_COUNT: Ok */ - means that I have checked the loop and it
+ only runs a deterministic set of iterations regardless of input, or
+ it has a call to RRECURSE in it's body, why we need not add more
+ cost than the normal reduction counting that will occur for each
+ instruction demands.
+
+The thing is that each loop in the function 'match' should be marked
+with one of these comments. If no comment is present after you patched
+the new release (if you successfully manage to do it automatically),
+it may be a new regexp instruction that is added since the last
+release.
+
+You will need to manually go through the main 'match' loop after
+upgrading to verify that there are no unhandled loops in the regexp
+machine loop (!).
+
+The COST\_CHK macro works like this:
+
+1. Add to the loop count.
+2. If loop count > limit:
+ 1. Store the line (+100) in the Xwhere member of the frame structure
+ 2. Goto LOOP\_COUNT\_BREAK, which ultimately returns from the function
+3. Insert a label, which is named L\_LOOP\_COUNT\_&lt;line number&gt;
+
+LOOP\_COUNT\_BREAK code will create an extra "stack frame" on the heap
+allocated stack used if NO\_RECURSION is set, and will store the few
+locals that are not already in the ordinary stack frame there (like
+'c' and 'i').
+
+When we continue execution (after a trap up to the main Erlang
+scheduler), we will jump to LOOP\_COUNT\_RETURN, which will restore
+the local variables and will jump to the labels. The jump code looks
+like this in the C source:
+
+ switch (frame->Xwhere)
+ {
+ #include "pcre_exec_loop_break_cases.inc"
+ default:
+ DPRINTF(("jump error in pcre match: label %d non-existent\n", frame->Xwhere));
+ return PCRE_ERROR_INTERNAL;
+ }
+
+When building, pcre\_exec\_loop\_break\_cases.inc will be generated
+during build by pcre.mk, it will look like:
+
+ case 791: goto L_LOOP_COUNT_691;
+ case 1892: goto L_LOOP_COUNT_1792;
+ case 1999: goto L_LOOP_COUNT_1899;
+
+etc
+
+So, simply put, all C-stack variables are saved when we have consumed
+our reductions, we return from the function and, as there is no real
+recursion we immediately fall out into the re:run BIF, which with the
+help of a magic binary keeps track of the heap allocated stack for the
+regexp machine. When we return from trapping out to the scheduler, all
+vital data is restored and we continue from exactly the same state as
+we left. What's needed is to patch this into the new pcre_exec and
+check all new instructions to determine what might need updating in
+terms of COST, COST\_CHK etc.
+
+Well, that's *almost* everything, because there is of course more...
+
+The actual interface function, 'pcre\_exec', needs the same treatment
+as the actual regexp machine loop, that is we need to store all local
+variables between restarts. Unfortunately the NO\_RECURSE setting does
+not do this, we need to do it ourselves. So there's quite a diff in
+that function too, where a big struct is declared, containing every
+local variable in that function, together with either local copies
+that are swapped in and out, or macros that directly access the heap
+allocated struct. The struct is called `PcreExecContext`.
+
+If a context is present, we are restarting and therefore restore
+everything. If we are restarting we can also skip all initialization
+code in the function and jump more or less directly to the
+RESTART_INTERRUPTED label and the call to 'match', which is the actual
+regexp machine loop.
+
+There are a few places in the pcre_exec we need to do some housekeeping, you will see code like:
+
+ if ((extra_data->flags & PCRE_EXTRA_LOOP_LIMIT) != 0)
+ {
+ *extra_data->loop_counter_return =
+ (extra_data->loop_limit - md->loop_limit);
+ }
+
+Make sure, after updating, that this housekeeping is done whenever we
+do not reach the call to 'match'.
+
+So, now we in theory know what to do, so let's do it:
+
+But...
+
+## File changes in the new version of PCRE
+
+First we need to go through what's changed in the new library
+version. Files may have new names, functions may have moved and so on.
+
+Start by building the new library:
+
+ ~/tmp/pcre> cd pcre-8.33/
+ ~/tmp/pcre/pcre-8.33> ./configure --enable-utf --enable-unicode-properties --disable-shared --disable-stack-for-recursion
+ ~/tmp/pcre/pcre-8.33> make
+
+In the make process, you will probably notice most files that are
+used, but you can bet that's not all not all...
+
+To begin with you will need a default table for Latin-1 characters, so:
+
+ ~/tmp/pcre/pcre-8.33> cc -DHAVE_CONFIG_H -o dftables dftables.c
+ ~/tmp/pcre/pcre-8.33> LANG=sv_SE ./dftables -L ../epcre-8.33/pcre_latin_1_table.c
+
+Compare it to the pcre\_latin\_1\_table.c in the old version, they
+should not differ in any significant way.
+
+A good starting point is then to try to find all files in the new
+version of the library that have (probably) the same names as the
+one's in our distribution:
+
+ ~/tmp/pcre/pcre-8.33> cd ../epcre-7.6/
+ ~/tmp/pcre/epcre-7.6> for x in *.[ch]; do if [ '!' -f ../pcre-8.33/$x ]; then echo $x; else cp ../pcre-8.33/$x ../epcre-8.33/; fi; done
+
+This will output a list of files not found in the new distro. Let's
+look at the list from the example upgrade:
+
+ local_config.h
+ make_latin1_table.c
+ pcre_info.c
+ pcre_latin_1_table.c
+ pcre_make_latin1_default.c
+ pcre_try_flipped.c
+ pcre_ucp_searchfuncs.c
+ ucpinternal.h
+ ucptable.h
+
+* local\_config.h - OK, that's our child, it contains PCRE-specific
+ configure-results (i.e. the #defines that are results from out
+ parameters to configure, like NO\_RECURSE etc). Just copy it and
+ edit it according to what specific settings you can find in the
+ generated config.h from the real library build. In our example case,
+ the #define SUPPORT\_UTF8 should be renamed to #define SUPPORT\_UTF
+ and #define VERSION "7.6" should be changed to #define VERSION
+ "8.33"...
+
+* make\_latin1\_table.c - it was renamed to dftables.c, so we copy
+ that instead.
+
+* pcre\_info.c - It was simply removed from the library. Good, because
+ it was useless... So just ignore.
+
+* pcre\_latin\_1\_table.c - No problem, we generated a new one in the
+ earlier stage.
+
+* pcre\_make\_latin1\_default.c - No longer used, a hack that's not
+ needed with dftables. Ignored
+
+* pcre\_try\_flipped.c - This functionality has been removed from
+ pcre\_exec, you cannot compile on one endianess and execute on
+ another any more :( Ignored.
+
+* pcre\_ucp\_searchfuncs.c, ucpinternal.h, ucptable.h - this
+ functionality is moved to pcre\_ucd.c, copy that one instead.
+
+OK, now go the other way and look at what was actually built for the new version of pcre:
+
+ ~/tmp/pcre/epcre-7.6> cd ../pcre-8.33/
+ ~/tmp/pcre/pcre-8.33> nm ./.libs/libpcre.a | egrep 'lib.*.o:'
+
+The output for this release was:
+
+ libpcre_la-pcre_byte_order.o:
+ libpcre_la-pcre_compile.o:
+ libpcre_la-pcre_config.o:
+ libpcre_la-pcre_dfa_exec.o:
+ libpcre_la-pcre_exec.o:
+ libpcre_la-pcre_fullinfo.o:
+ libpcre_la-pcre_get.o:
+ libpcre_la-pcre_globals.o:
+ libpcre_la-pcre_jit_compile.o:
+ libpcre_la-pcre_maketables.o:
+ libpcre_la-pcre_newline.o:
+ libpcre_la-pcre_ord2utf8.o:
+ libpcre_la-pcre_refcount.o:
+ libpcre_la-pcre_string_utils.o:
+ libpcre_la-pcre_study.o:
+ libpcre_la-pcre_tables.o:
+ libpcre_la-pcre_ucd.o:
+ libpcre_la-pcre_valid_utf8.o:
+ libpcre_la-pcre_version.o:
+ libpcre_la-pcre_xclass.o:
+ libpcre_la-pcre_chartables.o:
+
+Libtool has changed the object names, but we can fix that and see what
+sources we have already decided should exist:
+
+ ~/tmp/pcre/pcre-8.33> NAMES=`nm ./.libs/libpcre.a | egrep 'lib.*.o:'| sed 's,libpcre_la-,,' | sed 's,.o:$,,'`
+ ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then echo $x; fi; done
+
+And the list contained:
+
+ pcre_byte_order
+ pcre_jit_compile
+ pcre_string_utils
+
+pcre\_jit\_compile is actually needed, even though we have not enabled
+jit, and the other two contain functionality needed, so just copy the
+sources...
+
+ ~/tmp/pcre/pcre-8.33> for x in $NAMES; do if [ '!' -f ../epcre-8.33/$x.c ]; then cp $x.c ../epcre-8.33/; fi; done
+
+## Test build of stripped down version of new PCRE
+
+Time to do a test build. Copy and edit the pcre.mk makefile and try to
+get something that builds...
+
+I made a wrapper Makefile, hacked pcre.mk a little and did a few
+changes to a few files, namely added:
+
+ #ifdef ERLANG_INTEGRATION
+ #include "local_config.h"
+ #endif
+
+to pcre\_config.c and pcre\_internal.h. Also pcre.mk needs to get the
+new files added and the old files removed, directory names need to be
+changed and the wrapper can define most. My wrapper Makefile looked
+like this:
+
+ EPCRE_LIB = ./obj/libepcre.a
+ PCRE_GENINC = ./pcre_exec_loop_break_cases.inc
+ PCRE_OBJDIR = ./obj
+ V_AR = ar
+ V_CC = gcc
+ CFLAGS = -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu
+ gen_verbose =
+ PCRE_DIR=.
+ include pcre.mk
+
+And the according variables were removed together with dependencies
+from pcre.mk. Note that you will need to put things back in order in
+pcre.mk after all testing is done. Once a 'make' is successful, you
+can generate new dependencies:
+
+ ~/tmp/pcre/epcre-8.33> gcc -MM -c -g -O2 -DHAVE_CONFIG_H -I/ldisk/pan/git/otp/erts/x86_64-unknown-linux-gnu -DERLANG_INTEGRATION *.c | grep -v $ERL_TOP
+
+Well, then you have to add $(PCRE\_OBJDIR)/ to each object and
+$(PCRE\_DIR)/ to each header. I did it manually, it's just a couple of
+files. Now your pcre.mk is fairly up to date and it's time to start
+patching in the changes...
+
+## Actually patching in the changes to the C code
+
+### Fixing the functionality (interruptable pcre\_run etc)
+
+Begin with only pcre\_exec.c, that's the important part:
+
+ ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
+ ~/tmp/pcre/epcre-7.6> diff -c ../pcre-7.6/pcre_exec.c ./pcre_exec.c > ../epcre_exec.c_7.6.diff
+ ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33
+
+Now - if you are lucky, you can patch the new pcre\_exec with the
+patch command from the diff, but that may not be the case... Even if:
+
+ ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre_exec.c_7.6.diff
+
+works like a charm, you still have to go through the main loop and see
+that all do, while and for loops in the code contains COST\_CHK or at
+least COST, or, if it's a small loop (over, say one UTF character),
+mark it as OK with a comment.
+
+You should also check for other changes, like new local variables in
+the pcre\_exec code etc.
+
+What will probably happen, is that the majority of chunks
+fail. pcre\_exec is the main file for PCRE, one that is constantly
+optimized and where every new feature ends up. You will probably see
+so many failed HUNK's that you feel like giving up, but do not
+despair, it's just a matter of patience and hard work:
+
+* First, fix the 'pcre\_exec' function.
+
+ * Change the struct PcreExecContext to reflect the local variables
+ in this version of the code.
+
+ * Add/update the defines that makes local variables in the code
+ actually stay in an allocated "exec\_context" and be sure to
+ initialize the "pseudo-stack-variables" in the same way as in
+ the declarations for the original version of the code.
+
+ * The macros SWAPIN and SWAPOUT should be for variables that are
+ used a lot and we do not want to always access through the
+ struct. Also a few parameters are saved by SWAPIN and SWAPOUT.
+
+ * What might be tricky is to get things deallocated in a proper
+ way, there is a function that's called from the BIF code to
+ clean up an exec\_context, be especially observant about how the
+ stack in the 'match' function is allocated! The first frame is
+ supposed to be on the C stack, but in our case is allocated in
+ the exec\_context. The rest of the frames are allocated but
+ never freed, not until the match is done.
+
+ The variable 'frame' in the 'match' function is stored in our
+ additional field of the 'md' structure, that is the stack top,
+ but not necessarily the uppermost frame (due to reuse of old
+ frames, which is supposed to be an optimization...).
+
+ * The housekeeping of the "reduction counter" in the extra\_data
+ struct needs to be added to all places where we break out of the
+ main loop of pcre\_exec. Look for 'break' and you will see the
+ places. Make sure to update
+ '*extra\_data->loop\_counter\_return' whenever you leave this
+ function. It all boils down to some code that loops over the
+ call for match and returns PCRE\_ERROR\_LOOP\_LIMIT and get's
+ jumped back to when the BIF is restarted. You will see it in
+ your diff and you will find a similar place in the new version
+ where you put basically the same code.
+
+ * Fixing pcre\_exec takes about an hour of concentrated work, it
+ could be worse...
+
+* Next, go for the match function. It's simpler in some ways but
+ harder in other. The elimination of the C stack is already there,
+ you just need to modify it a little:
+
+ * In the RRETURN macro for NO\_RECURSE, add updating of
+ md->loop\_limit before returning. You can see how it's done in
+ the diff.
+
+ * RMATCH can be left as it is, at least it could in earlier
+ versions. Note however that you should mimic the allocation
+ strategies of RMATCH and RRETURN in the code at another place
+ later... The principle of the labels HEAP\_RECURSE and
+ HEAP\_RETURN are mimicked by our code in LOOP\_COUNT\_BREAK and
+ LOOP\_COUNT\_RETURN. You'll see later...
+
+ * COST and COST\_CHK, together with the jump to
+ LOOP\_COUNT\_RETURN label are in the beginning of the function
+ 'match'. It's a block of macros and declaration of our local
+ variables loop\_count and loop\_limit. We patch in the code for
+ that, but may need to adopt it to new variable names etc. It's
+ important to handle the 'frames' variable correctly, dig it out
+ of the 'md' struct when we are restarting, but initialize it as
+ is done in normal NO\_RECURSE code otherwise. Note that the
+ COST\_CHK macro reuses the Xwhere field of the frame struct, it
+ is not needed when trapping.
+
+ * The LOOP\_COUNT\_BREAK and the LOOP\_COUNT\_RETURN code can now
+ be added. Make sure to check both how a new stack frame should be
+ properly allocated by mimicking the code in RMATCH, and how (if)
+ it should be freed by mimicking RRETURN. Also check which
+ variables need to be saved. They are properly pointed out in
+ 8.33 with the comment 'These variables do not need to be
+ preserved over recursion' and appear in the beginning of the
+ function. Find variables of similar type in the frame structure
+ and reuse them. In 8.33 there are eight such variables. They are
+ placed at the end of the function 'match'. If You are reading
+ the diff, you need to scroll past all the COST\_CHK calls,
+ i.e. past the whole regexp machine loop.
+
+ * Now take the time to add things like debug macros to the top of
+ the file and one single COST\_CHK (preferably the one right
+ after for(;;) in 'match'), and see if you can compile. You will
+ probably need to add some fields in the structures in pcre.h,
+ see from a larger diff what you need there and iterate until you
+ can compile.
+
+ * So, what's left is to add all the COST and COST\_CHK macros,
+ plus marking all harmless loops as OK. There are a few rules
+ here:
+
+ * Mark *every* loop with the comment 'LOOP\_COUNT: xxxx',
+ where xxxx is either 'Ok', 'COST' or 'CHK'. There are 175
+ 'LOOP\_COUNT:' comments in 8.33.
+
+ * Loops marked 'Ok' need no macro, either because they are so
+ short (like over an UTF character) or because they contain
+ an RMATCH macro, in which case they will be accounted for
+ anyway.
+
+ * Loops marked 'COST' will have an associated 'COST(N)' macro,
+ either before, if we know the amount of iterations, or
+ within. Reductions are counted, but we will not
+ interrupt. This is typically in what is expected to be
+ medium long loops or at places where interruption is hard
+ (like where we have local variables that are alive. The
+ selection between 'COST' and 'COST\_CHK' is hard. 'COST' is
+ much cheaper and usually enough, but when in doubt about the
+ loop length, try to use 'COST\_CHK', while making very sure
+ there are no live block-local variables that need to be
+ saved over the trap. There are 49 'COST' macros in 8.33.
+
+ * Loops marked 'CHK' shall contain a 'COST\_CHK(N)'
+ macro. This macro both counts reductions and may result in
+ an interrupt and a return to Erlang space. It is expensive
+ and it is vital to ensure that there are no unexpected local
+ variables that live past the macro. Most variables are in
+ the pseudo stack frame, but some regexp instructions declare
+ temporaries inside blocks. Make sure they are not expected
+ to be alive after a COST\_CHK if they are not in the
+ 'heapframe' structure. If they are, you need to
+ conditionally move them to the 'heapframe' #if
+ defined(ERLANG\_INTEGRATION). in 8.33 the variables 'lgb'
+ and 'rgb' are preserved in this way. There are 54
+ 'COST\_CHK's in 8.33.
+
+ * I've marked a few block-local variables with warnings, but
+ look thoroughly through the main loop to detect any new
+ ones.
+
+ * Be careful when it comes to freeing the context from Erlang
+ (the function erts\_pcre\_free\_restart\_data), Whatever is
+ done there has to work *both* when the context is freed in
+ the middle of an operation (because of trapping) and when
+ some things have been freed by a successful
+ return. Specifically, make sure to set md->offset\_vector to
+ NULL whenever it's freed (in the rest of the code) and
+ construct release\_match\_heapframes so that it can be
+ called multiple times for the same heapframe (set the next
+ pointer in the "static" frame, i.e. the one allocated in the
+ md to NULL after freeing).
+
+ * To add the costs to the main loop takes less than one work day,
+ keep calm and continue...
+
+OK, now you are done with the pcre\_exec (or at least, you think
+so). The rest is simpler. You have probably already handled 'pcre.h'
+and 'pcre\_internal.h' to add fields to the structures etc. Looking at
+a diff from an earlier version, you will see what's left. In upgrading
+to 8.33, the following things was left to do after pcre\_exec was
+fixed, remember you could generate a diff with:
+
+ ~/tmp/pcre/epcre-8.33> cd ../epcre-7.6/
+ ~/tmp/pcre/epcre-7.6> (for x in *.[ch]; do if [ -f ../pcre-7.6/$x ]; then diff -c ../pcre-7.6/$x $x; fi; done) > ../epcre-7.6.diff
+
+Open the diff in your favorite editor and remove whatever changes you
+have already made, like everything that has to do with pcre\_exec.c
+and probably a large part of pcre.h/pcre\_internal.h.
+
+The expected result is a diff that either contains only the
+'%ExternalCopyright%' comments or contains them and the addition of
+the erts\_ prefix, depending on if you reverted the prefix change
+(using 'git revert') before starting to work. With a little luck, the
+patch of the remaining stuff should be possible to apply
+automatically. If anything fails, just add it manually.
+
+### Fixing the erts\_prefix
+
+The erts\_ prefix is mostly implemented by adding '#if
+defined(ERLANG\_INTEGRATION)' to a lot of function headers, inside the
+COMPILE\_UTF8 part. If you then also change the PRIV and PUBL macros
+in pcre\_internal.h. Typical diffs look like:
+
+ #if defined COMPILE_PCRE8
+ + #if defined(ERLANG_INTEGRATION)
+ + #ifndef PUBL
+ + #define PUBL(name) erts_pcre_##name
+ + #endif
+ + #ifndef PRIV
+ + #define PRIV(name) _erts_pcre_##name
+ + #endif
+ + #else
+ #ifndef PUBL
+ #define PUBL(name) pcre_##name
+ #endif
+ #ifndef PRIV
+ #define PRIV(name) _pcre_##name
+ #endif
+ + #endif
+
+and
+
+ #if defined COMPILE_PCRE8
+ + #if defined(ERLANG_INTEGRATION)
+ + PCRE_EXP_DECL int erts_pcre_pattern_to_host_byte_order(pcre *argument_re,
+ + erts_pcre_extra *extra_data, const unsigned char *tables)
+ + #else
+ PCRE_EXP_DECL int pcre_pattern_to_host_byte_order(pcre *argument_re,
+ pcre_extra *extra_data, const unsigned char *tables)
+ + #endif
+
+Note that some data types, like pcre\_extra are accessed with the PUBL
+macro, so they need to explicitly get the prefix added. pcre.h is a
+pig, as it declares prototypes for all functions regardless of
+compilation ode, so there is quite a lot of '#if
+defined(ERLANG\_INTEGRATION)' to add there.
+
+Anyway, now try to patch, using a diff where you have removed the
+changes you made manually (probably to pcre\_exec.c) but make sure to
+save your work (temporary git repository?) before, so you can revert
+any disasters...
+
+ ~/tmp/pcre/epcre-7.6> cd ../epcre-8.33/
+ ~/tmp/pcre/epcre-8.33> patch -p0 < ../epcre-7.6_clean2.diff
+
+Some hunks may certainly still fail, read through the .rej file and fix it.
+
+### ExternalCopyright
+
+Now you should check that the 'ExternalCopyright' comment is present
+in all source files:
+
+ ~/tmp/pcre/epcre-8.33> for x in *.[ch]; do if grep ExternalCopyright $x > /dev/null; then true; else echo $x; fi; done
+
+In this upgrade (from 7.6 to 8.33) we certainly had some new and
+renamed files:
+
+ dftables.c
+ pcre_byte_order.c
+ pcre_chartables.c
+ pcre_jit_compile.c
+ pcre_latin_1_table.c
+ pcre_string_utils.c
+ pcre_ucd.c
+
+Go through them manually and add the 'ExternalCopyright' comment.
+
+## Integrate with Erlang
+
+Now you are done with most of the tedious work. It's time to move this
+into your branch of the Erlang source tree, remove old files and add
+new ones, plus add the tar file with the original pcre dist. Remember
+to fix your hacked version of pcre.mk and then try to build
+Erlang. You might need to update 'erl\_bif\_re.c' to reflect any
+changes in the PCRE library. When it builds, run the test suites.
+
+Make sure to rename any files that has new names and remove any files
+that are no longer present before copying in the new versions from
+your temporary directory. In our example we remove 'pcre\_info.c',
+'pcre\_make\_latin1\_default.c', 'pcre\_try\_flipped.c',
+'ucpinternal.h' and 'ucptable.h'. We rename 'make\_latin1\_table.c' to
+'dftables.c' and 'pcre\_ucp\_searchfuncs.c' to 'pcre\_ucd.c'.
+
+After copying in the sources, we can try to build. Do not forget to
+fix whatever you did in pcre.mk to make it build locally.
+
+## Update test suites
+
+The next step is to integrate the updated PCRE tests into our test suites.
+
+Copy testoutput[1-9] from the testdata directory of your new version
+of pcre, to the re\_SUITE\_data in stdlib's test suites. Run the
+test suites and remove any bugs. Usually the bugs come from the fact
+that the PCRE test suites get better and from our implementation of
+global matching, which may have bugs outside of the PCRE library. The
+test suite 'pcre' is the one that runs these tests. Also copy
+testoutput11-8 to testoutput10, the testoutput10 file in pcre is
+nowadays for the DFA, which we do not use.
+
+The next step is to regenerate re\_testoutput1\_replacement\_test. How
+to do that is in a comment in the beginning of the file. The key
+module is run\_pcre\_tests.erl, which both driver the pcre test and
+generate re\_testoutput1\_replacement\_test.erl. Watch during the
+generation that you do not get to many of the "Fishy character"
+messages, if they are more than, say 20, you will probably need to
+address the UTF8 issues in the Perl execution. As it is now, we skip
+non latin1 characters in this test. You will need to run iconv on the
+generated module to make it UTF-8 before running tests.
+
+The exact same procedure goes for the re\_testoutput1\_split\_test.erl.
+
+Also add copyright headers to the files after converting them to UTF-8.
+
+After ironing out the rest of the bugs, you should be done with the
+code.
+
+## Update documentation
+
+Now it's time for the documentation, which is fairly
+straightforward. Diff the pcrepattern man pages from the old and new
+PCRE distros and update the re.xml file accordingly. It may help to
+have the generated HTML file from the new version to cut and paste
+from, but as you will notice, it's quite a few changes from HTML to
+XML. All lists are reformatted, the &lt;pre&gt; tags are made into
+either &lt;code&gt; or &lt;quite&gt; etc. Also the &lt;P&gt; tags are
+converted to lowercase and all mentioned options and function calls
+are converted to their Erlang counterpart. Really awesome work that
+requires thorough reading of all new text. For the upgrade from 7.6
+to 8.33, the update of the pcrepattern part of our manual page took
+about eight hours.
+
+## Add new relevant options to re
+
+Then, when all this is done, you should add any new relevant options
+from the PCRE library to both the code (erl\_bif\_re.c), the specs and
+the Erlang function 'copt/1' (re.erl) and the manual page
+(re.xml). Make sure the options are really relevant to add to the
+Erlang API, check if they are compile or run-time options (or both) and
+add them to the 'parse\_options' function of erl\_bif\_re.c. Adding an
+option that is just passed through to PCRE is pretty simple, at least
+"code wise".
+
+Now you are done. Run all test suites on all machines and you will be happy.
+
+## Final notes
+
+To avoid the work of a major upgrade, it is probably worth it to keep
+in pace with the changes to PCRE. The upgrade from 7.6 to 8.33,
+including tracking down bugs etc, took me a total of two weeks. If
+smaller diffs from the PCRE development were integrated in a more
+incremental fashion, it will be much easier each time and you will
+have the PCRE library up to date. PCRE should probably be updated for
+each major release, instead of every five years... \ No newline at end of file