1. Toby Inkster
  2. p5-html-html5-microdata-parser

Source

p5-html-html5-microdata-parser / README

NAME
    HTML::HTML5::Microdata::Parser - fairly experimental parser for HTML
    'microdata'

VERSION
    0.02

SYNOPSIS
      use HTML::HTML5::Microdata::Parser;
  
      my $parser = HTML::HTML5::Microdata::Parser->new($html, $baseURI);
      $parser->consume;
      my $graph  = $parser->graph

DESCRIPTION
    This package aims to have a roughly compatible API to RDF::RDFa::Parser.

    Microdata is an experimental metadata format, not in wide use. Use this
    module at your own risk.

    $p = HTML::HTML5::Microdata::Parser->new($html, $baseuri, \%options,
    $storage)
            This method creates a new HTML::HTML5::Microdata::Parser object
            and returns it.

            The $xhtml variable may contain an XHTML/XML string, or a
            XML::LibXML::Document. If a string, the document is parsed using
            HTML::HTML5::Parser and HTML::HTML5::Sanity, which may throw an
            exception. HTML::HTML5::Microdata::Parser does not catch the
            exception.

            The base URI is used to resolve relative URIs found in the
            document.

            Options [default in brackets]:

              * alt_stylesheet  - Magic rel="alternate stylesheet". [1]
              * auto_config     - See section "Auto Config" [0]
              * mhe_lang        - Process <meta http-equiv=Content-Language>.
                                  [1]
              * prefix_empty    - URI prefix for itemprops of untyped items.
                                  [undef]
              * tdb_service     - thing-described-by.org when possible. [0] 
              * xhtml_base      - Process <base href> element. [1]
              * xhtml_lang      - Process @lang. [1]
              * xhtml_meta      - Process <meta>. [1]
              * xhtml_cite      - Process @cite. [1]
              * xhtml_rel       - Process @rel. [1]
              * xhtml_time      - Process <time> element more nicely. [0]
              * xhtml_title     - Process <title> element. [1]
              * xml_lang        - Process @xml:lang. [1]

            $storage is an RDF::Trine::Storage object. If undef, then a new
            temporary store is created.

    $p->xhtml
            Returns the HTML source of the document being parsed.

    $p->uri Returns the base URI of the document being parsed. This will
            usually be the same as the base URI provided to the constructor,
            but may differ if the document contains a <base> HTML element.

            Optionally it may be passed a parameter - an absolute or
            relative URI - in which case it returns the same URI which it
            was passed as a parameter, but as an absolute URI, resolved
            relative to the document's base URI.

            This seems like two unrelated functions, but if you consider the
            consequence of passing a relative URI consisting of a
            zero-length string, it in fact makes sense.

    $p->dom Returns the parsed XML::LibXML::Document.

    $p->set_callbacks(\%callbacks)
            Set callback functions for the parser to call on certain events.
            These are only necessary if you want to do something especially
            unusual.

              $p->set_callbacks({
                'pretriple_resource' => sub { ... } ,
                'pretriple_literal'  => sub { ... } ,
                'ontriple'           => undef ,
                });

            Either of the two pretriple callbacks can be set to the string
            'print' instead of a coderef. This enables built-in callbacks
            for printing Turtle to STDOUT.

            For details of the callback functions, see the section
            CALLBACKS. "set_callbacks" must be used *before* "consume".
            "set_callbacks" itself returns a reference to the parser object
            itself.

    $p->consume
            The document is parsed for Microdata. Nothing of interest is
            returned by this function, but the triples extracted from the
            document are passed to the callbacks as each one is found.

    $p->consume_microdata_item($element)
            You almost certainly do not want to use this method.

            It will consume a single Microdata item, assuming that $element
            is an element that does or should have the @itemscope attribute
            set. Returns the URI or blank node identifier for the item.

            This method is exposed mostly for the benefit of
            HTML::HTML5::Microdata::ToRDFa.

    $p->graph()
            This method will return an RDF::Trine::Model object with all
            statements of the full graph.

            It makes sense to call "consume" before calling "graph".
            Otherwise you'll just get an empty graph.

CALLBACKS
    Several callback functions are provided. These may be set using the
    "set_callbacks" function, which taskes a hashref of keys pointing to
    coderefs. The keys are named for the event to fire the callback on.

  pretriple_resource
    This is called when a triple has been found, but before preparing the
    triple for adding to the model. It is only called for triples with a
    non-literal object value.

    The parameters passed to the callback function are:

    *   A reference to the "HTML::HTML5::Microdata::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   Subject URI or bnode (string)

    *   Predicate URI (string)

    *   Object URI or bnode (string)

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise.

  pretriple_literal
    This is the equivalent of pretriple_resource, but is only called for
    triples with a literal object value.

    The parameters passed to the callback function are:

    *   A reference to the "HTML::HTML5::Microdata::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   Subject URI or bnode (string)

    *   Predicate URI (string)

    *   Object literal (string)

    *   Datatype URI (string or undef)

    *   Language (string or undef)

    Beware: sometimes both a datatype *and* a language will be passed. This
    goes beyond the normal RDF data model.)

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise.

  ontriple
    This is called once a triple is ready to be added to the graph. (After
    the pretriple callbacks.) The parameters passed to the callback function
    are:

    *   A reference to the "HTML::HTML5::Microdata::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   An RDF::Trine::Statement object.

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise. The callback may modify the
    RDF::Trine::Statement object.

AUTO CONFIG
    HTML::HTML5::Microdata::Parser has a lot of different options that can
    be switched on and off. Sometimes it might be useful to allow the page
    being parsed to control some of the options. If you switch on the
    'auto_config' option, pages can do this.

    A page can set options using a specially crafted <meta> tag:

      <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
         content="alt_stylesheet=0&amp;prefix_empty=http://example.net/" />

    Note that the "content" attribute is an
    application/x-www-form-urlencoded string (which must then be
    HTML-escaped of course). Semicolons may be used instead of ampersands,
    as these tend to look nicer:

      <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
         content="alt_stylesheet=0;prefix_empty=http://example.net/" />

    Any option allowed in the constructor may be given using auto config,
    except 'auto_config' itself.

SEE ALSO
    XML::LibXML, RDF::Trine, RDF::RDFa::Parser, HTML::HTML5::Parser,
    HTML::HTML5::Sanity.

    <http://www.perlrdf.org/>.

AUTHOR
    Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENSE
    Copyright (C) 2009-2010 by Toby Inkster

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.1 or, at
    your option, any later version of Perl 5 you may have available.