partial results?

Henry Story created an issue

I have now developed a spec-compliant NTriples parser [1]. NTriples is a very easy format to parse, as each statement is a simple triple on its own line. For parsing something this simple nomo is perhaps overkill, but it does have the advantage of letting me explore nomo quickly.

I now have a test suite showing that I can feed the parser chunks of data [2], which proves that the input side is asynchronous. But I am not sure the result side is: the call to parser.result seems to be final, as it sends an EOF. So currently I am accumulating a lot of results in a buffer. It would be a lot more efficient if I could pass the results along as they are parsed, to another process that could store them in a database, or reserialise them in another format to send on to another agent.

Is this possible?

Along the same lines, I am not quite sure how much backtracking state nomo keeps. Is it possible to reduce the backtracking (if it is a problem)? I think NTriples needs no lookahead, and the next parser I am working on only needs a few characters of lookahead.

[1] https://github.com/betehess/pimp-my-rdf/blob/db2c59ff9768e14af19a887b321101b878c33f22/n-triples-parser/src/main/scala/Parser.scala
[2] https://github.com/betehess/pimp-my-rdf/blob/db2c59ff9768e14af19a887b321101b878c33f22/n-triples-parser-test-suite/src/main/scala/ParserTest.scala

Comments (6)

  1. Henry Story

    Hmm, perhaps that is what the User is for? (I was wondering what the point of it could be.) Could I send the user a triple when finished?

  2. pchiusano

    Regarding lookahead, use commit in conjunction with attempt/block to control backtracking. See the scaladocs for those.

    If you want to use a parser as a stream transformer, so you can stream the output as well, you don't want the parser itself to encode the repetition, since it will construct the result list strictly. Instead, have your parser parse a single element, and write a helper function that takes a stream of chunks and produces a stream of results. This should be just a little helper function; see if you can write it. :) After each chunk you'll need to feed the current parser EOF to see if it has produced a value. If it has, you emit that into the output stream and feed its remainder to a fresh parser. Otherwise, you feed it the next chunk.

    I think the code could be cleaned up a bit to handle this better: I no longer like an explicit EOF token. Instead I prefer to have a finish method on each parser, which must return a result. This avoids issues with a parser deciding it wants more input even after it has been fed EOF, which should not really be allowed.
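
    To make the shape of that helper concrete, here is a rough self-contained sketch in Scala. The Chunked trait, its feed and finish methods, and the stream function are all invented for illustration (using a finish method in the style just suggested); nomo's actual API differs.

    // Hypothetical interfaces, invented for this sketch (not nomo's API):
    trait Chunked[F, A] {
      def feed(chunk: F): Chunked[F, A]   // consume one more chunk of input
      def finish(): Option[(A, F)]        // a completed value plus any leftover input
    }

    // Turn a parser for a single element into a stream transformer:
    // emit each completed value as soon as it is available, restart a
    // fresh parser on the leftover input, and only pull in a new chunk
    // when the current parser cannot make progress without one.
    def stream[F, A](fresh: Chunked[F, A])(chunks: Stream[F]): Stream[A] = {
      def go(p: Chunked[F, A], cs: Stream[F]): Stream[A] =
        p.finish() match {
          case Some((a, rem)) => a #:: go(fresh.feed(rem), cs)
          case None => cs match {
            case c #:: rest => go(p.feed(c), rest)
            case _          => Stream.empty
          }
        }
      go(fresh, chunks)
    }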

  3. Henry Story

    In NTriples it is easy to isolate a single element: it is one line of input. So one could of course write a very efficient parser by first accumulating characters up to a newline and then parsing that line, or by following your suggestion above.

    With Turtle [1], which this is all building up to, things are more complicated, because the whole document could be just one sentence consisting of thousands of triples. E.g.:

     1:  @prefix : <> .
     2:  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
     3:  @prefix doap: <http://usefulinc.com/ns/doap#> .
     4:
     5:  <> a foaf:PersonalProfileDocument;
     6:       foaf:primaryTopic :me .
     7: 
     8: :me a foaf:Person;
     9:    foaf:name "Henry Story";
    10:    foaf:knows [ a foaf:Person;
    11:                 foaf:name "P Chiusano";
    12:                 doap:project <https://bitbucket.org/pchiusano/nomo>  ];
    13:    foaf:knows [ = <http://www.w3.org/People/Berners-Lee/card#i> ;
    14:                  foaf:name "Tim Berners Lee";
    15:                  foaf:knows [ foaf:name "P Chiusano" ]
    16:               ] .
    

    So this says that I know two people, one of whom is Tim Berners-Lee, who knows someone named P Chiusano. Let's say I receive this on a slow connection, and I get a chunk up to line 10. I would like the parser to return the 4 triples it has fully parsed by then, but then be able to continue with the given context, which in this case consists of the 3 namespace prefixes and the current subject, :me .

    So I can't really start over from scratch, because I have the namespaces and a list of antecedent subjects I need to remember. Some triples will be incomplete, but others will be finished. So one thing I was thinking was to send the finished triples to a processing agent, and to have the parser state forget them, so that it is at every point only as heavy as it needs to be.
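
    The context that has to survive across chunks is in fact quite small. As a hypothetical sketch (these names are invented here, not taken from the parser):

    // Invented for illustration: the state a streaming Turtle parser must
    // keep between chunks; finished triples are emitted and then forgotten.
    case class TurtleContext[RDF <: RDFDataType](
      prefixes: Map[String, RDF#IRI], // the @prefix bindings seen so far
      subjects: List[RDF#Node]        // stack of open subjects, e.g. :me plus any [ ... ] nesting
    )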

    I was thinking that whenever I create a triple I could send it to an agent with something like


    ......map { case s ++ r ++ o => agentU.send(Triple(s, r, o)); empty }
    

    or to add that triple to a provisional accumulator.

    I see there is an accumulator trait

    trait Accumulator[A,B,U]
    

    and that it takes a User parameter, but I don't see what that User is for...

    [1] http://www.w3.org/TR/turtle/

  4. Henry Story
    • changed status to open

    I found a way to make use of the User type: I can treat it as an agent that the parser can send results to. This looks like it can fit within nomo's architecture and give us partial results.

    First I added some new methods, manyIgnore, many1Ignore, delimitIgnore and delimit1Ignore, that don't return lists but just ignore the results, in order to reduce the memory consumption of the parser (see https://bitbucket.org/bblfish/nomo/changeset/c19b41ef445d). It is not clear that all of these are needed; I will have a better idea of which methods are important when finishing the more complex Turtle parser.
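
    The idea these methods capture can be shown with a tiny self-contained toy (this is not nomo's implementation): where many conses every result onto a list, the Ignore variant drops each result immediately, so memory use stays flat however many elements are parsed.

    // Toy parser type, invented for illustration: consume a prefix of the
    // input and return the result plus the remaining input, or fail.
    type P[A] = String => Option[(A, String)]

    // Builds the whole result list in memory.
    def many[A](p: P[A]): P[List[A]] = in =>
      p(in) match {
        case Some((a, rest)) => many(p)(rest).map { case (as, r) => (a :: as, r) }
        case None            => Some((Nil, in))
      }

    // Same repetition, but each result is dropped on the spot.
    def manyIgnore[A](p: P[A]): P[Unit] = in =>
      p(in) match {
        case Some((_, rest)) => manyIgnore(p)(rest)
        case None            => Some(((), in))
      }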

    I can then create a ListenerAgent class which takes on the role of the User U (usually Unit), and to which the parser can send results.

    trait ListenerAgent[T] {
      def send(a: T)
    }
    

    And its test implementation:

    import scala.collection.mutable

    case class Listener(queue: mutable.Queue[Any] = new mutable.Queue[Any]()) extends ListenerAgent[Any] {
      def send(a: Any) = queue.enqueue(a)
    }
    

    My NTriples entry-point parser can then be written like this:

    val ntriples = line.mapResult { r =>
      r.get match {
        case Some(t) => r.user.send(t)
        case None    => ()
      }
      Success(r)
    }.delimitIgnore(eoln)
    

    This sends the triples to the user agent (which could be an Akka agent) and ignores the result locally. The triples can then be retrieved with:

      val res = P.ntriples(doc)
      val resSet = res.user.queue
    

    (In my code things are a bit more complex and need to be simplified with some refactoring.)
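
    For the Akka variant mentioned above, a minimal sketch (assuming Akka is on the classpath; the receiving actor would store or forward the triples):

    import akka.actor.ActorRef

    // Hypothetical: the same ListenerAgent backed by an Akka actor rather
    // than an in-memory queue; send is fire-and-forget.
    class ActorListener[T](ref: ActorRef) extends ListenerAgent[T] {
      def send(a: T) = ref ! a
    }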

    The diffs to the parser are here:

    https://github.com/betehess/pimp-my-rdf/commit/a2373567e472a4078a72a553412dd68627753f57#diff-4

    For something as simple as NTriples this may seem overkill, but I think it should make a lot more sense with the Turtle parser (and many others for that matter).

  5. pchiusano

    That is an interesting use of user state... not sure what I think of it. :) Usually user state is used for things like keeping track of a symbol table during parsing. I would much prefer to implement the streaming version of many just as a pure function, like I mentioned above. Here's a possible signature:

    def manyStream: Stream[F] => Stream[Result[X,E,U,A]]

    I'll see about implementing this or something like it.

  6. Henry Story

    I have now developed this in the Turtle parser, so that triples get added as they are created. The user is now the Listener class, which is embarrassingly stateful.

    import scala.collection.mutable

    case class Listener[RDF <: RDFDataType](ops: RDFOperations[RDF]) {

      import ops._

      private val prefixMap = new mutable.HashMap[String, RDF#IRI]
      val queue: mutable.Queue[RDF#Triple] = new mutable.Queue[RDF#Triple]()

      // the current subject and verb, updated as the sentence unfolds
      var subject: RDF#Node = _
      var verb: RDF#IRI = _

      def send(a: RDF#Triple) = queue.enqueue(a)

      def addPrefix(name: String, value: RDF#IRI) { prefixMap.put(name, value) }

      def setSubject(subj: RDF#Node) { subject = subj }

      def setVerb(rel: RDF#IRI) { verb = rel }

      // completing an object completes a triple, which is sent off at once
      def setObject(obj: RDF#Node) { send(Triple(subject, verb, obj)) }

      def resolve(pname: PName): Option[IRI] =
        prefixMap.get(pname.prefix).map { case IRI(pre) => IRI(pre + pname.name) }

      def prefixes = prefixMap.toMap
    }
    

    And it is used by the Turtle parser like this:

    lazy val obj = (IRIref | literal).mapResult { r => r.status.map { node => r.user.setObject(node); r } }
    

    The problem with this, I realise, is that it won't work well with backtracking. So the question of how much backtracking nomo does becomes an important one.

    I noticed this with the number parsers. A number can be something like 12, 12.5 or -12.5e34. If one does not write the parser carefully, it could read 12, then think the . is the end of the sentence, only to discover that the next token cannot be interpreted as a continuation. Interestingly, nomo does not do that much backtracking: I found one has to put the parser for 12.5e6 first, or else it will just return 12 and then fail later, unable to parse what it thinks is the next sentence.
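
    The same ordering issue can be reproduced with Scala's standard parser combinators, where | is also an ordered choice (a runnable illustration, not nomo code):

    import scala.util.parsing.combinator.RegexParsers

    object Num extends RegexParsers {
      val integer: Parser[String] = """-?\d+""".r
      val decimal: Parser[String] = """-?\d+\.\d+([eE]-?\d+)?""".r

      val bad:  Parser[String] = integer | decimal  // consumes "12" from "12.5e6", leaving ".5e6"
      val good: Parser[String] = decimal | integer  // longest alternative first
    }

    // Num.parseAll(Num.bad,  "12.5e6") fails after matching only "12"
    // Num.parseAll(Num.good, "12.5e6") succeeds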
