
9.3.1 Loading Structured Documents

SGML, HTML, and XML documents are parsed by the predicate load_structure/4, which accepts many options. For convenience, shorthand predicates are provided for parsing SGML, XML, HTML, and XHTML documents, respectively.

load_sgml_structure(+Source, -Content, -Warn)
load_xml_structure(+Source, -Content, -Warn)
load_html_structure(+Source, -Content, -Warn)
load_xhtml_structure(+Source, -Content, -Warn)
The parameters of these predicates have the same meaning as those in load_structure/4, and are described below.

The above predicates (in fact, just load_xml_structure/3 and load_html_structure/3) are the most commonly used predicates of the sgml package. The other predicates described in this section are needed only for advanced uses of the package.
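
As an illustration of the shorthand form, the following minimal sketch parses an XML file and prints the result. The helper name show_xml/1 is hypothetical, and the import directive is an assumption about how the sgml package is made available in your XSB installation; adjust it if your setup differs.

```prolog
% Assumption: the predicate is exported by a module named sgml.
:- import load_xml_structure/3 from sgml.

% show_xml(+File): parse File as XML, then print the parsed content
% followed by the (possibly empty) list of warnings.
show_xml(File) :-
    load_xml_structure(file(File), Content, Warn),
    writeln(Content),
    writeln(Warn).
```

A call such as show_xml('note.xml') would parse a file of that name in the current directory; the file name is only a placeholder.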

load_structure(+Source, -Content, +Options, -Warn)

Source can have one of the following forms: url(url), file(file name), string('document as a Prolog atom'). The parsed document is returned in Content. Warn is bound to a (possibly empty) list of warnings generated during the parsing process. Options is a list of parameters that control parsing, which are described later.
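
The call below is a minimal sketch of load_structure/4 applied to a document held in a Prolog atom. The predicate name parse_demo/2 and the sample document are made up for illustration, and the import directive assumes that the package exports the predicate from a module named sgml.

```prolog
% Assumption: the predicate is exported by a module named sgml.
:- import load_structure/4 from sgml.

% parse_demo(-Content, -Warn): parse a small XML document given as an atom.
% dialect(xml) is passed explicitly because the default dialect is sgml.
parse_demo(Content, Warn) :-
    Doc = '<note id="n1"><to>Ann</to><body>Hello</body></note>',
    load_structure(string(Doc), Content, [dialect(xml)], Warn).
```

On success, Content is a list whose members have the forms described next, and Warn holds any warnings produced while parsing.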

The list Content can have the following members:

A Prolog atom

Atoms are used to represent character strings, i.e., CDATA.

element(Name, Attributes, Content )

Name is the name of the element tag. Since SGML is case-insensitive, all element names are returned as lowercase atoms.

Attributes is a list of pairs of the form Name=Value, where Name is the name of an attribute and Value is its value. Values of type CDATA are represented as atoms. The values of multi-valued attributes (NAMES, etc.) are represented as lists of atoms. Handling of attributes of types NUMBER and NUMBERS depends on the setting of the number(+NumberMode) option of set_sgml_parser/2 or load_structure/4 (see later). By default the values of such attributes are represented as atoms, but the number(...) option can also specify that these values must be converted to Prolog integers.

Content is a list that represents the content for the element.

entity(Code)

If a character entity (e.g., &#913;) is encountered that cannot be represented in the Prolog character set, this term is returned. It represents the code of the encountered character (e.g., entity(913)).

entity(Name)

This is a special case of entity(Code), intended to handle special symbols by their name rather than by character code. If an entity reference names a character entity holding a single character, but this character cannot be represented in the Prolog character set, this term is returned. For example, if the content of an element is &Alpha; &lt; &Beta; then it will be represented as follows:
    [ entity('Alpha'), ' < ', entity('Beta') ]
Note that entity names are case sensitive in both SGML and XML.

sdata(Text)

If an entity with declared content-type SDATA is encountered, this term is used. The data of the entity instantiates Text.

ndata(Text)

If an entity with declared content-type NDATA is encountered, this term is used. The data instantiates Text.

pi(Text)

If a processing instruction is encountered (<?...?>), Text holds the text of the processing instruction. Please note that the <?xml ...?> instruction is ignored and is not treated as a processing instruction.
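
Because Content is an ordinary Prolog list, it can be processed with plain recursion. The following sketch shows one way to collect the character data of a parsed document and to look up an attribute of an element. The predicate names content_text/2 and attr_value/3 are hypothetical helpers, not part of the sgml package, and the code uses no library predicates.

```prolog
% content_text(+Content, -Atoms): collect the CDATA atoms of a parsed
% document in document order. element/3 terms are descended into;
% entity/1, sdata/1, ndata/1 and pi/1 members are skipped.
content_text(Content, Atoms) :-
    content_text(Content, Atoms, []).

% content_text(+Content, -Atoms, ?Tail): difference-list worker.
content_text([], Atoms, Atoms).
content_text([Item|Rest], Atoms, Tail) :-
    item_text(Item, Atoms, Mid),
    content_text(Rest, Mid, Tail).

item_text(element(_Name, _Attrs, Children), Atoms, Tail) :- !,
    content_text(Children, Atoms, Tail).
item_text(Item, [Item|Tail], Tail) :- atom(Item), !.
item_text(_Other, Tail, Tail).

% attr_value(+Element, +Name, -Value): Value is the value of the first
% attribute called Name in Element; fails if there is no such attribute.
attr_value(element(_, Attrs, _), Name, Value) :-
    attr_lookup(Attrs, Name, Value).

attr_lookup([N=V|Rest], Name, Value) :-
    (   N == Name
    ->  Value = V
    ;   attr_lookup(Rest, Name, Value)
    ).
```

For instance, given the content list [element(p, [], ['Hello ', element(b, [], [world]), pi(x)])], content_text/2 binds its second argument to ['Hello ', world].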

The Options parameter is a list that controls parsing. Members of that list can be of the following form:

dtd(?DTD)

Reference to a DTD object. If specified, the <!DOCTYPE ...> declaration supplied with the document is ignored and the document is parsed and validated against the provided DTD. If the DTD argument is an unbound variable, it is bound to the DTD object created from the DTD supplied with the document.

dialect(+Dialect)

Specify the parsing dialect. The supported dialects are sgml (default), xml and xmlns.

space(+SpaceMode)

Sets the space handling mode for the initial environment. This mode is inherited by the other environments, which can override the inherited value using the XML reserved attribute xml:space. See Section 9.3.2 for details.

number(+NumberMode)

Determines how attributes of type NUMBER and NUMBERS are handled. If token is specified (the default) they are passed as an atom. If integer is specified the parser attempts to convert the value to an integer. If conversion is successful, the attribute is represented as a Prolog integer. Otherwise the value is represented as an atom. Note that SGML defines a numeric attribute to be a sequence of digits. The - (minus) sign is not allowed and 1 is different from 01. For this reason the default is to handle numeric attributes as tokens. If conversion to integer is enabled, negative values are silently accepted and the minus sign is ignored.

defaults(+Bool)

Determines how default and fixed attributes from the DTD are used. By default, defaults are included in the output if they do not appear in the source. If false, only the attributes occurring in the source are emitted.

file(+Name)

Sets the name of the input file for error reporting. This is useful if the input is a stream that is not coming from a file. In this case, errors and warnings will not have the file name in them, and this option allows one to force inclusion of a file name in such messages.

line(+Line)

Sets the starting line-number for reporting errors. For instance, if line(10) is specified and an error is found at line X then the error message will say that the error occurred at line X+10. This option is used when the input stream does not start with the first line of a file.

max_errors(+Max)

Sets the maximum number of errors. The default is 50. If this number is reached, the following exception is raised:
error(limit_exceeded(max_errors, Max), _)
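
The options above can be combined in a single list. The sketch below passes several of them and turns the max_errors exception into an ordinary result term. The predicate name load_checked/2 and the result terms ok/2 and too_many_errors/1 are hypothetical, the option values are arbitrary, and the import directive assumes that the package exports load_structure/4 from a module named sgml.

```prolog
% Assumption: the predicate is exported by a module named sgml.
:- import load_structure/4 from sgml.

% load_checked(+Source, -Result): parse Source as XML, converting
% NUMBER attributes to integers, omitting DTD defaults, and giving up
% after 10 errors. Result is ok(Content, Warn) on success, or
% too_many_errors(Max) if the error limit was reached.
load_checked(Source, Result) :-
    Options = [dialect(xml), number(integer), defaults(false), max_errors(10)],
    catch(( load_structure(Source, Content, Options, Warn),
            Result = ok(Content, Warn) ),
          error(limit_exceeded(max_errors, Max), _),
          Result = too_many_errors(Max)).
```

Catching only the limit_exceeded term lets any other exception propagate to the caller unchanged.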


Terrance Swift 2007-10-06