Matching.

Next: Extracting the matches. Up: regmatch: Regular Expression Matching Previous: regmatch: Regular Expression Matching Contents Index

Matching.

The predicates re_match/5 and re_bulkmatch/5 perform regular expression matching. The predicate re_substitute/4 replaces substrings in a list with strings from another list and returns the resulting new string.

The re_match/5 predicate has the following calling sequence:

 re_match(+Regexp, +InputStr, +Offset, ?IgnoreCase, -MatchList)

Regexp is a regular expression, e.g., ``abc([^;,]*); (dd|ee)*;''. It can be a Prolog atom or string (i.e., a list of characters). The above expression matches any substring that has ``abc'' followed by a sequence of characters none of which is a ``;'' or a ``,'', followed by a ``; '', followed by a sequence that consists of zero or more of ``dd'' or ``ee'' segments, followed by a ``;''. An example of a string where such a match can be found is ``123abc&*^; ddeedd;poi''.

InputStr is the string to be matched against. It can be a Prolog atom or a string (list of characters). Offset is an integer offset into the string. The matching process starts at this offset. IgnoreCase indicates whether the case of the letters is to be ignored. If this argument is an uninstantiated variable, then the case is not ignored. If this argument is bound to a non-variable, then the case is ignored.

The last argument, MatchList, is used to return the results. It must unify with a list of the form:

    [match(beg_off0,end_off0), match(beg_off1,end_off1), ...]

The term match(beg_off0,end_off0) represents the substring that matches the entire regular expression, and the terms match(beg_off1,end_off1), ..., represent the matches corresponding to the parenthesized subexpressions of the regular expression. The terms beg_off and end_off above are integers that specify beginning and ending offsets of the various matches. Thus, beg_off0 is the offset into InputStr that points to the start of the maximal substring that matches the entire regular expression; end_off0 points to the end of such a substring. In our case, the maximal matching substring is ``abc&*^; ddeedd;'' and the first term in the list returned by

| ?- re_match('abc([^;,]*); (dd|ee)*;', '123abc&*^; ddeedd;poi', 0, _,L).

is match(3,18).

The most powerful feature of POSIX pattern matching is the ability to remember and return substrings matched by parenthesized subexpressions. When the above predicate succeeds, the terms 2,3, etc., in the above list represent the offsets for the matches corresponding to the parenthesized expressions 1,2,etc. For instance, our earlier regular expression ``abc([^;,]*); (dd|ee)*;'' has two parenthetical subexpressions, which match ``&*^'' and ``dd, respectively. So, the complete output from the above call is:

L = [match(3,18),match(6,9),match(15,17)]

The maximal number of parenthetical expressions supported by the Regmatch package is 30. Partial matches to parenthetical expressions 31 and over are discarded.

The match-terms corresponding to parenthetical expressions can sometimes report ``no-use.'' This is possible when the regular expression specifies that zero or more occurrences of the parenthesized subexpression must be matched, and the match was made using zero subexpressions. In this case, the corresponding match term is match(-1,-1). For instance,

| ?- re_match('ab(de)*', 'abcd',0,_,L).
L = [match(0,2),match(-1,-1)]
yes

Here the match that was found is the substring ``ab'' and the parenthesized subexpression ``de'' was not used. This fact is reported using the special match term match(-1,-1).

Here is one more example of the power of POSIX regular expression matching:

| ?- re_match("a(b*|e*)cd\\1",'abbbcdbbbbbo', 0, _, M).

Here the result is:

M = [match(0,9),match(1,4)]

The interesting features here are the positional parameter \\1 and the alternating parenthetical expression a(b*|e*). The alternating parenthetical expression here can match any sequence of b's or any sequence of e's. Note that if the string to be matched is not known when we write the program, we will not know a priori which sequence will be matched: a sequence of b's or a sequence of e's. Moreover, we do not even know the length of that sequence.

Now, suppose, we want to make sure that the matching substrings look like this:

abbbcdbbb
aeeeecdeeee
abbbbbbcdbbbbbb

How can we make sure that the suffix that follows ``cd'' is exactly the same string that is stuck between ``a'' and ``cd''? This is what \\1 precisely does: it represents the substring matched by the first parenthetical expression. Similarly, you can use \\2, etc., if the regular expression contains more than one parenthetical expression.

The following example illustrates the use of the offset argument:

| ?- re_match("a(b*|e*)cd\\1",'abbbcdbbbbboabbbcdbbbbbo',2,_,M).  

M = [match(12,21),match(13,16)]

Here, the string to be matched is double the string from the previous example. However, because we said that matching should start at offset 2, the first half of the string is not matched.

The re_match/5 predicate fails if Regexp does not match InputStr or if the term specified in MatchList does not unify with the result produced by the match. Otherwise, it succeeds.

We should also note that parenthetical expressions can be represented using the \(...\) notation. What if you want to match a ``('' then? You must escape it with a ``\\'' then:

| ?- re_match("a(b*)cd\\(",'abbbcd(bbo', 0, _, M).

M = [match(0,7),match(1,4)]

Now, what about matching the backslash itself? Try harder: you need four backslashes:

| ?- re_match("a(b*)cd\\\\",'abbbcd\bbo', 0, _, M).

M = [match(0,7),match(1,4)]

The predicate re_bulkmatch/5 has the same calling sequence as re_match/5, and the meaning of the arguments is the same, except the last (output) argument. The difference is that re_bulkmatch/5 ignores parenthesized subexpressions in the regular expression and instead of returning the matches corresponding to these parenthesized subexpressions it returns the list of all matches for the top-level regular expression. For instance,

| ?- re_bulkmatch('[^a-zA-Z0-9]+', '123&*-456 )7890% 123', 0, 1, X).

X = [match(3,6),match(9,11),match(15,17)]

Next: Extracting the matches. Up: regmatch: Regular Expression Matching Previous: regmatch: Regular Expression Matching Contents Index

Luis Fernando P. de Castro 2003-06-27