
9.2 Overview of the SGML Parser

The sgml package accepts input in the form of files, URLs and Prolog atoms. To load the sgml parser, the user should type

 ?- [sgml].
at the prompt. If test.html is a file with the following contents
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>
<title>Demo</title>
</head>
<body>

<h1 align=center>This is a demo</h1>

<p>Paragraphs in HTML need not be closed.

<p>This is called `omitted-tag' handling.
</body>
</html>
then the following call
?- load_html_structure(file('test.html'), Term, Warn).
will parse the document and bind Term to the following Prolog term:
[ element(html,
          [],
          [ element(head,
                    [],
                    [ element(title,
                              [],
                              [ 'Demo'
                              ])
                    ]),
            element(body,
                    [],
                    [ '\n',
                      element(h1,
                              [ align = center
                              ],
                              [ 'This is a demo'
                              ]),
                      '\n\n',
                      element(p,
                              [],
                              [ 'Paragraphs in HTML need not be closed.\n'
                              ]),
                      element(p,
                              [],
                              [ 'This is called `omitted-tag\' handling.'
                              ])
                    ])
          ])
].
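
Because the result is an ordinary Prolog term, it can be inspected with plain Prolog. The following is a minimal sketch of a depth-first search for an element by name. The predicate find_element/3 is a hypothetical helper written for this illustration; it is not part of the sgml package.

```prolog
% find_element(+Name, +Content, -Element)
% Hypothetical helper, not provided by the sgml package.
% Element is an element/3 term named Name that occurs anywhere in the
% content list Content. Solutions are found depth-first, in document
% order, and further matches are produced on backtracking.
find_element(Name, [element(Name, Attrs, Children)|_],
             element(Name, Attrs, Children)).
find_element(Name, [element(_, _, Children)|_], Element) :-
    find_element(Name, Children, Element).      % descend into children
find_element(Name, [_|Rest], Element) :-
    find_element(Name, Rest, Element).          % continue with siblings
```

With Term bound as above, the call find_element(title, Term, element(title, _, Text)) first binds Text to ['Demo'], and find_element(h1, Term, element(h1, Attrs, _)) first binds Attrs to [align = center].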

The document is converted into a list of Prolog terms of the form element(Name,Attributes,Content). Each term corresponds to one element of the document. Name is the name of the element. Attributes is a list of the element's attributes, each written as AttributeName = Value. Content is a list of child elements and CDATA (character data, returned as atoms). For instance,

    <aaa>fooo<bbb>foo1</bbb></aaa>
will be parsed as
    element(aaa,[],[fooo, element(bbb,[],[foo1])])
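
Since Content mixes child elements with CDATA atoms, a short recursive walk is enough to pull out the text alone. The sketch below assumes that every non-element member of a content list is CDATA; content_text/2 and its helper content_text/3 are hypothetical names chosen for this illustration and are not part of the sgml package.

```prolog
% content_text(+Content, -Atoms)
% Hypothetical helper, not provided by the sgml package.
% Atoms is the list of CDATA atoms found in Content, in document order.
content_text(Content, Atoms) :-
    content_text(Content, Atoms, []).

% content_text(+Content, -Atoms, ?Tail): difference-list worker, so the
% text of nested elements is collected without calling append/3.
content_text([], Atoms, Atoms).
content_text([element(_, _, Children)|Rest], Atoms, Tail) :-
    !,
    content_text(Children, Atoms, Mid),     % text inside this element
    content_text(Rest, Mid, Tail).          % text of following siblings
content_text([Atom|Rest], [Atom|Atoms], Tail) :-
    content_text(Rest, Atoms, Tail).        % Atom is assumed to be CDATA
```

For the term above, content_text([element(aaa,[],[fooo, element(bbb,[],[foo1])])], Atoms) binds Atoms to [fooo, foo1].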

Entities (e.g. &lt;) are returned as part of CDATA, unless they cannot be represented. See load_sgml_structure/3 for details.



Terrance Swift 2007-10-06