Scanners (sometimes called tokenizers) take an input string, usually in ASCII or a similar format, and produce a sequence of tokens. The requirements that various applications have for scanning differ in small but important ways: a character that is special to one application may be part of a token in another, or some applications may want lower-case text converted to upper-case text. The stdscan.P library provides a simple scanner written in XSB that can be configured in several ways. While useful, this scanner is not intended to be as powerful as general-purpose scanners such as lex or flex.
Given as input a List of character codes, scan/2 scans this list, producing a list of atoms that constitute the lexical tokens. Its parameters are set via set_scan_pars/1.
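For example, assuming scan/2 has been imported from stdscan and the default parameter settings are in effect, a query might look like the sketch below. The result shown in the comment is only illustrative; the tokens actually returned depend on how the scanner has been configured.

    | ?- import scan/2 from stdscan.

    | ?- atom_codes('width = 42 ;', Codes), scan(Codes, Tokens).

    % Under default settings one would expect Tokens to be bound to a list of
    % atoms along the lines of [width,'=','42',';'] (illustrative only).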
Each token produced is either a sequence of letters and/or numbers or a single special character (e.g., ( or )). Whitespace may occur between tokens.
Given as input a List of character codes, along with the character code of a field separator, scan/3 scans this list, producing a list of lists of atoms constituting the lexical tokens in each field. scan/3 can thus be used to scan tabular information. Its parameters are set via set_scan_pars/1.
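As an illustration, a comma-separated line might be scanned field by field as sketched below. The argument order shown (the separator code as the second argument) is an assumption made for this example, as is the result given in the comment.

    | ?- import scan/3 from stdscan.

    | ?- atom_codes(',', [Sep]),              % Sep is the character code of ','
         atom_codes('ann,42,oslo', Codes),
         scan(Codes, Sep, Fields).

    % A plausible binding, purely for illustration:
    % Fields = [[ann],['42'],[oslo]]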
set_scan_pars(+List) is used to configure the tokenizer to a particular need. List is a list of parameters including the following:
| { } [ ] " $ % & ' ( ) * + , - . / : ; < = > ? @ \ ^ _ ~ `
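The parameter terms themselves are defined by stdscan.P; the call below is only a schematic sketch of how the scanner might be configured before scanning, and the parameter shown (case(upper)) is hypothetical, standing in for whatever terms the library actually accepts.

    | ?- import set_scan_pars/1, scan/2 from stdscan.

    | ?- set_scan_pars([case(upper)]),   % case(upper) is hypothetical, for illustration only
         atom_codes('width = 42', Codes),
         scan(Codes, Tokens).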