re_match/5
and re_bulkmatch/5
perform regular
expression matching. The predicate re_substitute/4
replaces
substrings in a list with strings from another list and returns the
resulting new string.
The re_match/5
predicate has the following calling sequence:
re_match(+Regexp, +InputStr, +Offset, ?IgnoreCase, -MatchList)Regexp is a regular expression, e.g., ``
abc([^;,]*); (dd|ee)*;
''. It can be a Prolog atom or string (i.e., a list of
characters). The above expression matches any substring that has ``abc''
followed by a sequence of characters none of which is a ``;'' or a ``,'',
followed by a ``; '', followed by a sequence that consists of zero or more
of ``dd'' or ``ee'' segments, followed by a ``;''. An example of a string
where such a match can be found is ``123abc&*^; ddeedd;poi
''.
InputStr is the string to be matched against. It can be a Prolog atom or a string (list of characters). Offset is an integer offset into the string. The matching process starts at this offset. IgnoreCase indicates whether the case of the letters is to be ignored. If this argument is an uninstantiated variable, then the case is not ignored. If this argument is bound to a non-variable, then the case is ignored.
The last argument, MatchList, is used to return the results. It must unify with a list of the form:
[match(beg_off0,end_off0), match(beg_off1,end_off1), ...]The term
match(beg_off0,end_off0)
represents the substring
that matches the entire regular expression,
and the terms match(beg_off1,end_off1)
, ..., represent the matches
corresponding to the parenthesized subexpressions of the regular
expression.
The terms beg_off and end_off above are integers that specify
beginning and ending offsets of the various matches. Thus, beg_off0
is the offset into InputStr that points to the start of the maximal
substring that matches the entire regular expression; end_off0 points to the
end of such a substring. In our case, the maximal matching substring is
``abc&*^; ddeedd;
'' and the first term in the list returned by
| ?- re_match('abc([^;,]*); (dd|ee)*;', '123abc&*^; ddeedd;poi', 0, _,L).is match(3,18).
The most powerful feature of POSIX pattern matching is the ability to
remember and return substrings matched by parenthesized subexpressions.
When the above predicate succeeds, the terms 2,3, etc., in the above list
represent the offsets for the matches corresponding to the parenthesized
expressions 1,2,etc.
For instance, our earlier regular expression
``abc([^;,]*); (dd|ee)*;
'' has two parenthetical subexpressions, which
match ``&*^
'' and ``dd, respectively. So, the complete output
from the above call is:
L = [match(3,18),match(6,9),match(15,17)]
The maximal number of parenthetical expressions supported by the Regmatch package is 30. Partial matches to parenthetical expressions 31 and over are discarded.
The match-terms corresponding to parenthetical expressions can sometimes
report ``no-use.'' This is possible when the regular expression
specifies that zero or more occurrences of the parenthesized subexpression
must be matched, and the match was made using zero subexpressions. In this
case, the corresponding match term is match(-1,-1)
. For instance,
| ?- re_match('ab(de)*', 'abcd',0,_,L). L = [match(0,2),match(-1,-1)] yesHere the match that was found is the substring ``ab'' and the parenthesized subexpression ``de'' was not used. This fact is reported using the special match term
match(-1,-1)
.
Here is one more example of the power of POSIX regular expression matching:
| ?- re_match("a(b*|e*)cd\\1",'abbbcdbbbbbo', 0, _, M).Here the result is:
M = [match(0,9),match(1,4)]The interesting features here are the positional parameter \\1 and the alternating parenthetical expression a(b*|e*). The alternating parenthetical expression here can match any sequence of b's or any sequence of e's. Note that if the string to be matched is not known when we write the program, we will not know a priori which sequence will be matched: a sequence of b's or a sequence of e's. Moreover, we do not even know the length of that sequence.
Now, suppose, we want to make sure that the matching substrings look like this:
abbbcdbbb aeeeecdeeee abbbbbbcdbbbbbbHow can we make sure that the suffix that follows ``cd'' is exactly the same string that is stuck between ``a'' and ``cd''? This is what \\1 precisely does: it represents the substring matched by the first parenthetical expression. Similarly, you can use \\2, etc., if the regular expression contains more than one parenthetical expression.
The following example illustrates the use of the offset argument:
| ?- re_match("a(b*|e*)cd\\1",'abbbcdbbbbboabbbcdbbbbbo',2,_,M). M = [match(12,21),match(13,16)]Here, the string to be matched is double the string from the previous example. However, because we said that matching should start at offset 2, the first half of the string is not matched.
The re_match/5
predicate fails if Regexp does not match InputStr or if the term specified in MatchList does not unify with
the result produced by the match. Otherwise, it succeeds.
We should also note that parenthetical expressions can be represented
using the \(...\)
notation. What if you want to match a ``('' then?
You must escape it with a ``\\
'' then:
| ?- re_match("a(b*)cd\\(",'abbbcd(bbo', 0, _, M). M = [match(0,7),match(1,4)]Now, what about matching the backslash itself? Try harder: you need four backslashes:
| ?- re_match("a(b*)cd\\\\",'abbbcd\bbo', 0, _, M). M = [match(0,7),match(1,4)]
The predicate re_bulkmatch/5
has the same calling sequence as
re_match/5
, and the meaning of the arguments is the same, except the
last (output) argument. The difference is that re_bulkmatch/5
ignores parenthesized subexpressions in the regular expression and instead
of returning the matches corresponding to these parenthesized
subexpressions it returns the list of all matches for the top-level regular
expression. For instance,
| ?- re_bulkmatch('[^a-zA-Z0-9]+', '123&*-456 )7890% 123', 0, 1, X). X = [match(3,6),match(9,11),match(15,17)]