Issue #9 new

CPU intensive parsing problem

Henry Story
created an issue

Hi, so over at the [[ https://github.com/betehess/pimp-my-rdf | Pimp-my-RDF ]] project the [[ https://github.com/betehess/pimp-my-rdf/blob/760184eb26dd76de651ebf466cee95f9a84f87d5/n3/src/main/scala/Turtle.scala | Turtle ]] and [[ https://github.com/betehess/pimp-my-rdf/blob/760184eb26dd76de651ebf466cee95f9a84f87d5/n3/src/main/scala/NTriples.scala | NTriples ]] parsers are passing all the W3C tests (and in fact we are finding bugs in their tests and specs!).

Running the W3C tests consists of three steps (done by running an implementation of [[ https://github.com/betehess/pimp-my-rdf/blob/10597465f57ff40e268f9f6cc59008bec1fa79ab/n3-test-suite/src/main/scala/TurtleParserTest.scala | TurtleParserTest.scala ]]):

1. reading the input Turtle files with the Turtle parser

2. reading the expected resulting NTriples with some other, trusted parser

3. comparing the graphs produced by 1. and 2. for isomorphism.
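The three steps above could be sketched like this (plain Scala with hypothetical names, not the actual test-suite code; the graph comparison here is simplified to set equality on ground triples, whereas real isomorphism testing must also match up blank nodes):

```scala
// Rough sketch of the W3C test procedure described above. All names are
// hypothetical stand-ins, not the real pimp-my-rdf API.
object W3CTestSketch {
  type Triple = (String, String, String)

  // Naive reader for ground NTriples lines of the form `<s> <p> <o> .`
  // (stands in for the "trusted parser" of step 2).
  def parseNTriples(src: String): Set[Triple] =
    src.split("\n").map(_.trim).filter(_.nonEmpty).map { line =>
      val terms = line.stripSuffix(".").trim.split("\\s+")
      (terms(0), terms(1), terms(2))
    }.toSet

  // Step 3: compare the graph from the parser under test (step 1)
  // against the expected graph (step 2). Simplified: no blank nodes.
  def sameGraph(actual: Set[Triple], expected: Set[Triple]): Boolean =
    actual == expected
}
```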

When I use the [[ https://github.com/betehess/pimp-my-rdf/blob/248c8a13567e589308d1b7999570a14d6b530b20/n3-test-suite/src/test/scala/TurtleParserTests.scala | Sesame Parsers or the Jena Parsers for reading the NTriples suite ]], everything goes well and quickly (10 seconds for Sesame). In {{{sbt}}} this is done with:

{{{
!scala
test-only org.w3.rdf.n3.JenaTurtleParserStringTest
test-only org.w3.rdf.n3.SesameTurtleParserStringTest
}}}

But when using either the [[ https://github.com/betehess/pimp-my-rdf/blob/d64ae11514f4bd8402c0857cb29c203ec821bd67/simple-rdf/src/test/scala/SimpleModuleTest.scala#L61 | nomo based NTriples parser or the Turtle parser to read the simpler-to-parse NTriples output files ]], it takes much longer (about 100 seconds) to do the same job, and it pegs one of my CPUs at over 100%.

{{{
!scala
test-only org.w3.rdf.simple.NomoTurtleParserSeqTest_1
}}}

There is admittedly more data to read in the expected-output files, since NTriples is more verbose, but I think that can't be the whole reason for the problem. There are about 3MB in total to parse, and I am sure one can do better.

So I first added {{{ .commit }}} statements to the parsers, using the latest {{{ >>! }}} and {{{ ++! }}} notation, to reduce backtracking as much as possible, but that does not seem to solve the problem (though I failed to time the tests before making the changes, so I don't know whether it helped).
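For illustration, here is a toy model (plain Scala, not nomo's actual API) of what a commit does in an alternation: once the parser has consumed past a committed point, a later failure is no longer retried by `|`, so the input buffered for backtracking can be discarded:

```scala
// Toy commit/cut semantics. `~!` is a hypothetical committing sequence
// operator, loosely inspired by the `>>!` notation mentioned above.
object CommitSketch {
  sealed trait Result
  case class Ok(rest: String) extends Result
  case class Fail(committed: Boolean) extends Result

  case class P(run: String => Result) {
    def |(other: P): P = P { s =>
      run(s) match {
        case Fail(false) => other.run(s) // plain failure: backtrack and retry
        case r           => r            // success, or committed failure: no retry
      }
    }
    // plain sequencing: a failure in `next` is still backtrackable
    def ~(next: P): P = P { s =>
      run(s) match {
        case Ok(rest) => next.run(rest)
        case fail     => fail
      }
    }
    // committing sequencing: once this parser succeeds, a failure in `next`
    // becomes committed and `|` will not try any alternative
    def ~!(next: P): P = P { s =>
      run(s) match {
        case Ok(rest) =>
          next.run(rest) match {
            case Fail(_) => Fail(committed = true)
            case ok      => ok
          }
        case fail => fail
      }
    }
  }

  def lit(c: Char): P = P {
    case s if s.nonEmpty && s.head == c => Ok(s.tail)
    case _                              => Fail(committed = false)
  }
}
```

On input "ac", `(lit('a') ~ lit('b')) | lit('a')` backtracks and succeeds, while the committed `(lit('a') ~! lit('b')) | lit('a')` fails without retrying.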

I took a thread dump, which I will attach to this report, using the [[ http://visualvm.java.net/gettingstarted.html | pretty neat Java VisualVM that ships with the JDK ]].

Anyway, at present I am a bit at a loss as to where the problem lies, or how I could go about optimising this.

Comments (7)

  1. pchiusano repo owner

    It could be a couple of things:

    • some performance bug in nomo
    • a bug in your grammar (too much backtracking)
    • some micro-optimizations missing in your grammar
    • no performance bug, nomo is just slow given its current design / implementation

    I would try to do more profiling to see if you can figure out what your issues are. The backtracking should be easy to measure: you could just add some printlns in the implementation of `|`. How often does it switch to the second branch after a failure on the first branch, and when it does, how much input did it have to buffer? You'll hopefully see very little backtracking (well, there may be a lot of it, but each instance should only buffer a token's worth of input or so). At some point it might be nice to build this into nomo, so there'd be a magic function that would tell you if you had excessive backtracking and could compute other statistics on your grammar.
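A minimal sketch of that instrumentation idea, using a toy parser type rather than nomo's real combinators: the `|` here counts every fall-back to its second branch (the names `P`, `lit`, and `fallbacks` are all hypothetical):

```scala
// Toy instrumented alternation: count how often `|` has to rewind and
// retry its right branch. Not nomo's API, just an illustration.
object BacktrackProbe {
  var fallbacks = 0 // incremented each time `|` retries its second branch

  // A parser consumes a String from an offset and returns the new offset,
  // or -1 on failure.
  case class P(run: (String, Int) => Int) {
    def |(other: P): P = P { (s, i) =>
      val r = run(s, i)
      if (r >= 0) r
      else { fallbacks += 1; other.run(s, i) } // backtrack to i, try alternative
    }
    def ~(next: P): P = P { (s, i) =>
      val r = run(s, i)
      if (r < 0) -1 else next.run(s, r)
    }
  }

  def lit(c: Char): P =
    P((s, i) => if (i < s.length && s(i) == c) i + 1 else -1)
}
```

Running `(lit('a') ~ lit('b')) | (lit('a') ~ lit('c'))` over "ac" records exactly one fall-back, and one character of rewound input.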

    Figuring out which micro-optimizations are important will depend on what the profiler tells you. I would dig more into the profiler output to develop ideas that you can then test. As a guess, I would probably start with the micro-optimizations, since they are easy to do: don't use .many and its variants for what are basically the tokens of your grammar - those get called the most and you want as little overhead there as possible. Use takeWhile, etc. Don't take my word for it, though; I would do some microbenchmarking to see if / how much faster they are. If that works out well, I have some further ideas (nomo could use a specialized regex engine for creating token parsers).
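The .many-versus-takeWhile point can be illustrated outside of nomo: both functions below extract the same token, but the first makes a recursive call and allocates a cons cell per character, while the second does a single scan and one substring (hypothetical stand-ins, not nomo's combinators):

```scala
// Two ways to tokenise a run of letters. `manyLetters` mimics a
// char-by-char `.many` combinator; `takeLetters` mimics `takeWhile`.
object TokenScan {
  // `.many`-style: one call and one cons cell per character consumed.
  def manyLetters(s: String, i: Int): (String, Int) = {
    def go(j: Int, acc: List[Char]): (String, Int) =
      if (j < s.length && s(j).isLetter) go(j + 1, s(j) :: acc)
      else (acc.reverse.mkString, j)
    go(i, Nil)
  }

  // `takeWhile`-style: find the end index in one pass, slice once.
  def takeLetters(s: String, i: Int): (String, Int) = {
    var j = i
    while (j < s.length && s(j).isLetter) j += 1
    (s.substring(i, j), j)
  }
}
```

Both return the token and the offset just past it; the bulk version simply avoids the per-character allocation that dominates when tokens are parsed millions of times.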

    I have not really tried to optimize nomo itself too much. An idea I had was to try to get things to specialize to primitives, so you aren't paying for the generics and it performs as if everything were hardcoded to String and Char.

  2. Henry Story reporter

    Yes, I can think of a lot of cool long-term optimisations, like being able to use the same kind of structures java.nio deals with, perhaps rope structures such as Akka.io's ByteString. But I thought that the way your system is built, nicely and generically, should allow one to switch over to such structures relatively easily... But yes, perhaps more use of ropes and array-type structures; I have heard that modern computers are much better optimised for arrays than for lists.

    I doubt the NTriples grammar has too much backtracking, as it is extremely simple. I have put commits in the few key places, including at the end of line, so it should never backtrack very far. I think I have also used takeWhile wherever I could (as that works on arrays, it was obviously going to be a lot faster).

    Did you look at the gif I attached of the CPU meter? It seems to show that a lot of time is spent in MonotypicW.span, which calls immutable.Range.foreach... not sure what to make of that...

    PS. Perhaps if you get a bit of time you could look at it directly yourself. The good news is that there are boatloads of RDF data around: it would be easy to create GB-sized Turtle files to help tune your engine with real data. (But you may already have loads of data :-) Well, in any case, you can put any data into Turtle format. :-)

  3. Henry Story reporter

    Ah well, a few changes did make some huge improvements. The last commit I checked in took me from 120 seconds down to 30 seconds for the test. In the NTriples parser I did in fact have a .many method being used where a takeWhile was better, and I had placed some less likely tokens in front of more likely ones. Now if I can halve the time again, I'll be in the ballpark of the other parsers :-)
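The effect of reordering alternatives is easy to see in a toy model (hypothetical code, not the real parser): count how often the first alternative fails before the second one matches. Putting the common case first keeps that count near zero:

```scala
// Counting fall-backs for `first | second` over a stream of characters,
// where each alternative is modelled as a character predicate.
object AltOrder {
  // A fall-back happens whenever the first alternative fails but the
  // second succeeds (i.e. `|` had to rewind and retry).
  def countFallbacks(input: String,
                     first: Char => Boolean,
                     second: Char => Boolean): Int =
    input.count(c => !first(c) && second(c))

  val isDigit: Char => Boolean  = _.isDigit  // rare in the sample input below
  val isLetter: Char => Boolean = _.isLetter // common
}
```

On the input "abcdefgh1", trying digits first causes eight fall-backs; trying letters first causes only one.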

  4. Henry Story reporter

    Mhh, well, my previous statement was probably over-optimistic. I just showed how much I could improve my parser, but that doesn't really show how big the difference is between it and Jena or Sesame, because the calculation in both cases involved (Sesame + nomo) vs (nomo + nomo); i.e., all the difference could be in the nomo Turtle parser... I'll need to take some equivalent files and compare the parsing of those...

  5. Henry Story reporter

    So in order to get a clearer idea of how these parsers compare, I refactored the test suite somewhat this evening to allow it to test any pair of parsers. So we can now test Sesame against Jena, Nomo against Sesame, Nomo against Nomo, etc. This is very useful in a number of ways, not least because it will help us all debug our parsers. The good news is that Nomo is the only one that passes all the tests; the bad news is that it is a lot slower.

    Testing Sesame against Sesame takes 2 seconds

    > test-only org.w3.rdf.n3.SesameOnSesameParserTest
    [...]
    [error] Total time: 2 s, completed Mar 3, 2012 2:05:53 AM
    

    Testing Jena against Jena takes 3 seconds

    > test-only org.w3.rdf.n3.JenaOnJenaParserTest
    [...]
    [error] Total time: 3 s, completed Mar 3, 2012 2:12:31 AM
    

    Testing the Nomo Turtle Parser against Sesame takes 15 seconds

    > test-only org.w3.rdf.n3.NomoOnSesameParserSeqTest
    [...]
    [success] Total time: 15 s, completed Mar 3, 2012 2:14:21 AM
    

    This means that the Nomo parser I wrote is close to 10 times slower at present, if not more. I suppose the 2 seconds it took for Sesame-on-Sesame means roughly 1 second for each half, so we are perhaps comparing 1 second against fifteen.

    I spent a lot of time today putting commits everywhere, and even though that helped the NTriples parser, it has not yet had the same effect on the Turtle parser.
