complex EBNF rules

Issue #6 new
Henry Story
created an issue

There are a couple of times when Turtle has some complex rules that are not easy to translate into nomo. I found a solution that seems a bit like a hack, but perhaps there is a better way to do things.

[[ | The current Turtle BNF ]] has the following rules} {{{ PNAME_NS ::= (PN_PREFIX)? ":"



PN_CHARS_BASE ::= [A-Z] | [a-z] | [ # 00C0 - # 00D6 ]... | UCHAR PN_CHARS_U ::= PN_CHARS_BASE | "_" PN_CHARS ::= PN_CHARS_U | "-" | [0-9] | ... HEX ::= [0-9] | [A-F] | [a-f] }}}

I ended up having a lot of trouble writing out the PN_PREFIX rule in a simple way that follows the grammer. I think that is because in many cases the middle part of the rule {{{ ( PN_CHARS | "." )* }}} ends up eating all the characters, including the last one.

In the end I found something that works, but it seems a bit like a hack: [[ | line 69 of Turtle.scala ]]



lazy val PN_PREFIX : P.Parser[String] = P.takeWhile(_ != ':').mapResult { case Result(Success(pfx), pos, us)=> { val prefx = pfx.toSeq.mkString if (prefx.size == 0) Success("") else if (prefx.last == '.') Failure(P.err.single('.',pos)) else if (!pn_chars_base(prefx.head) && UCHAR(prefx)(us).isFailure) { Failure(P.err.single(prefx.head,pos)) } else { val result = (PN_decode<<P.eof)(prefx)(us) result.status.mapL(e=>P.err.single(':',pos)) } } case other =>>"never gets called, as the result has to be some error") } }}}

Ie, here I just take everything until the point I know it has to end, and then I verify that the token is well formed.

The same issue has now popped up with the other very similar rule



PN_LOCAL ::= ( PN_CHARS_U | [0-9] | PLX ) ( ( PN_CHARS | '.' | PLX )* ( PN_CHARS | PLX ) ) ? }}}

Again I tried a simple implementation of this rule [[ | starting Line 100 ]]



lazy val PN_LOCAL = (PNL_FIRST ++ ( ++ PNL_LAST ).optional) map { case first++Some(body++last)=> first + body + last case first++None => first } }}}

But this has the same problem I had previously. It reads just the first character, and then stops. Anyway, this feels like a well known problem, but it would be useful to have a pointer for guidance.

Comments (5)

  1. pchiusano repo owner

    I responded to this but my comment seems to have entered a black hole... weird.

    It sounds like you basically have p.many ++ p2, where p2 parses a subset of what p parses, so the p.many just consumes it greedily leaving nothing leftover for p2. I would just do something like => (ps.init, p.lastOption)).filter(x => foo(x._2), msg("expected last element to match p2")), or something to that effect.

    Independent of this, you can write a "reluctant" version of many - def manyReluctant[B](p2: Parser[B]): Parser[(A,B)] that tries to run p2 first. That won't quite address your use case here though, since that will switch to p2 too early.

  2. Henry Story reporter

    (/dev/null contains many secrets)

    yes. It is pretty tricky. The issue is that the last character is not allowed to be a '.' (because they need that later for an extension to Turtle named N3), but it can be an encoded dot as a string "\u002e". So that to take the first example, the following would be a bad token "foaf.:" but the following a good one "foaf\u002e:" . The problem is that when I parse \u002e of course I want the effort taken to parse that to also transform the \u002e into a '.'. But then I can't tell if the resulting dot I end up with was encoded or not.

    Now one could argue that this is pretty twisted and that they should be changing the Turtle spec (which is under development), and there may well be reason to argue that way. I think they are doing this because they want to work with some RDF/XML formats. But one could respond that this is not worth the effort...

    I think it so happens that just verifying that the last underlying character is or is not a dot works in this case, which is why my technique worked. In this case something like peek but which would tell you want the current underlying input was may have helped. Otherwise one would have to write a parser that kept track of the underlying state. My answer using takeWhile does that by finding some other simpler delimiter, and then parsing the result; your answer would be to build up a pair of (inputChars,transformed) pairs, or perhaps even (inputChar(s), =>transformer) so that one could then map the transformers on the input later.

  3. Henry Story reporter

    Oh, I just noticed that things are more complicated than I thought.

    One can have N3 such as the following. I am not sure yet if that is ok Turtle, as the spacing requirements seem to be underspecified.

    @prefix : <>.

    So the delimiters are a lot more complicated here, especially as we have tokens that can contain dots (it's just not the last character that can be a dot!

  4. Log in to comment