Scoping report fails for PDF file

Issue #691 new
Volodymyr Duduladenko created an issue

When the Rainbow tool tries to execute the Scoping Report step, it fails with the following error:

net.sf.okapi.steps.leveraging.LeveragingStep initialize
INFO: Database: C:\okapi\TM_Base\EN_BG.pentm
java.lang.NullPointerException
    at java.io.File.<init>(Unknown Source)
    at net.sf.okapi.steps.scopingreport.ScopingReportStep.handleStartDocument(ScopingReportStep.java:880)
    at net.sf.okapi.lib.extra.steps.AbstractPipelineStep.handleEvent(AbstractPipelineStep.java:107)
    at net.sf.okapi.lib.extra.steps.CompoundStep.handleEvent(CompoundStep.java:101)
    at net.sf.okapi.common.pipeline.Pipeline.execute(Pipeline.java:119)
    at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:231)
    at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:201)
    at net.sf.okapi.common.pipelinedriver.PipelineDriver.processBatch(PipelineDriver.java:182)
    at net.sf.okapi.applications.rainbow.pipeline.PipelineWrapper.execute(PipelineWrapper.java:472)
    at net.sf.okapi.applications.rainbow.pipeline.PipelineWrapper.execute(PipelineWrapper.java:411)
    at net.sf.okapi.applications.rainbow.CommandLine.launchPipeline(CommandLine.java:374)
    at net.sf.okapi.applications.rainbow.CommandLine.execute(CommandLine.java:99)
    at net.sf.okapi.applications.rainbow.Main.main(Main.java:44)

Steps to reproduce:

  1. Create new Settings
  2. Add a PDF document
  3. Open Edit/Execute Pipeline and add the steps listed below (use default options)
  4. Execute

Pipeline steps

  • Raw Document to Filter Events
  • Segmentation
  • Leveraging
  • Scoping Report
  • Rainbow Translation Kit Creation

More info

PdfFilter returns an event whose resource has a null inputURI property (PdfFilter.java, lines 164-167).
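
To illustrate why a null inputURI ends in `java.io.File.<init>`, here is a minimal sketch. It assumes the step builds a `File` from the document's URI; the actual code at ScopingReportStep.java:880 is not shown in this report, so `NullUriDemo` and both methods are hypothetical stand-ins, not Okapi code.

```java
import java.io.File;
import java.net.URI;

// Hypothetical stand-in for a step that derives a File from the document URI.
final class NullUriDemo {
    private NullUriDemo() {}

    // Mirrors the failing pattern: File(URI) throws NullPointerException
    // when the URI is null.
    static File fileFor(URI inputUri) {
        return new File(inputUri);
    }

    // Defensive variant: returns null instead of throwing, so the caller
    // can fall back to another name or skip the file-based logic.
    static File fileForOrNull(URI inputUri) {
        return inputUri == null ? null : new File(inputUri);
    }
}
```

A guard like `fileForOrNull` would only hide the symptom in one step; other steps that read the URI would still need the same check.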

Comments (8)

  1. ysavourel

    The issue comes from the PDFFilter not the ScopingReport step.

    The PDF Filter is a read-only filter (one cannot merge back the translation). It works by first extracting the text of the original input PDF file using Apache's pdfbox. Then the output of pdfbox (which is a big string) is passed on to a ParaPlainTextFilter. When that secondary filter is set up, the inputURI field of its RawDocument is not set, because the input is a string and not a physical file. The next() calls get the events from the ParaPlainTextFilter, which has no inputURI, hence the null pointer one gets with pretty much any step.

    Possible work-around: set the inputURI of the RawDocument for the ParaPlainTextFilter to the inputURI of the original PDF. I think that could cause problems, though, as the RawDocument has a priority order for how it tries to open the stream. We would have to see if the inputCharSequence input has priority over the inputURI. Even if it does, there may be some side effects of having an inputURI set for the incorrect input...

    Note: I checked, and yes, inputCharSequence is checked before inputURI in getStream().
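
The priority order described above can be sketched as follows. This is a simplified model, not the real RawDocument: `SketchRawDocument` and its `readAll()` helper are hypothetical, and the only behavior taken from this thread is that the character sequence is consulted before the URI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of a raw document holding both inputs.
final class SketchRawDocument {
    private final CharSequence inputCharSequence;
    private final URI inputURI;

    SketchRawDocument(CharSequence inputCharSequence, URI inputURI) {
        this.inputCharSequence = inputCharSequence;
        this.inputURI = inputURI;
    }

    URI getInputURI() {
        return inputURI;
    }

    // The character sequence wins; the URI is opened only as a fallback.
    InputStream getStream() {
        if (inputCharSequence != null) {
            byte[] bytes = inputCharSequence.toString().getBytes(StandardCharsets.UTF_8);
            return new ByteArrayInputStream(bytes);
        }
        if (inputURI != null) {
            try {
                return inputURI.toURL().openStream();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        throw new IllegalStateException("No input set");
    }

    // Convenience helper: reads the whole stream as UTF-8 text.
    String readAll() {
        try (InputStream in = getStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this ordering, a document built from the extracted text plus the original PDF's URI would still read the text, while getInputURI() would return a location that does not actually hold that text. That mismatch is the side effect mentioned above.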

    To be discussed.

  2. Volodymyr Duduladenko reporter

    Hello Yves. Thank you for the quick response and details. I agree with you that the problem comes from the PdfFilter. Could you please explain why PdfFilter uses RawDocument and OpenXmlFilter uses StartDocument?

  3. ysavourel

    All IFilter implementations use RawDocument for the input. The idea is to allow different types of input to be processed (e.g. from a string or a physical file, etc.). StartDocument is an event that is generated by IFilter implementations. OpenXmlFilter has extra methods for opening with either a string or a URI. That probably would not change the issue with the PDFFilter.

  4. Chase Tingley

    This is one of those cases I ran into when I was trying to clean up the RawDocument abstraction a while back. There are a bunch of places like this that make assumptions that depend on a particular flavor of RD.

  5. Volodymyr Duduladenko reporter

    As far as I understand, there are three options for resolving the issue:

      • Change RawDocument
      • Change PdfFilter to use something else instead of RawDocument
      • Change ScopingReport to make it work with RawDocument

    @ysavourel, @tingley, what would you suggest?

  6. Chase Tingley

    Fixing RawDocument so that all types of RawDocument behave consistently is a thing I'd like to do, but it's not a big task.

    This seems like we should fix it in the filter. The PDF filter seems like it's violating the contract of the event system: StartDocument is supposed to have a name.

  7. Chase Tingley

    The nested text filter is derived from AbstractLineFilter, which contains this code:

            if ( input.getInputURI() != null ) {
                docName = input.getInputURI().getPath();
            }
    

    The docName is then used to provide the name for the StartDocument event.

    This seems like the crux of the problem. inputURI in a RawDocument is meant to provide the location of the data, but it's also being used to generate metadata here. Speaking conceptually, we should always be able to name a RawDocument: either from the inputURI, or from the name of a File, or (if we're working with a character sequence) either via some externally-provided label or an anonymous label generated internally.
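
The naming fallback described above could look roughly like this. It is a sketch only: `DocumentNames`, `nameFor`, and the `anonymous-document-N` format are hypothetical and not part of Okapi.

```java
import java.net.URI;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: always yields a document name, whatever the input type.
final class DocumentNames {
    private static final AtomicInteger ANONYMOUS_COUNTER = new AtomicInteger();

    private DocumentNames() {}

    static String nameFor(URI inputUri, String externalLabel) {
        // 1. Prefer the path of the input URI (opaque URIs have no path).
        if (inputUri != null && inputUri.getPath() != null) {
            return inputUri.getPath();
        }
        // 2. Otherwise use a label supplied by the caller, e.g. the name of
        //    the original PDF when the input is the extracted text.
        if (externalLabel != null && !externalLabel.isEmpty()) {
            return externalLabel;
        }
        // 3. Last resort: an internally generated anonymous label.
        return "anonymous-document-" + ANONYMOUS_COUNTER.incrementAndGet();
    }
}
```

In the PDF case, the outer filter could pass the original file's path as the external label, so the StartDocument produced by the nested text filter would still carry a name without inputURI pointing at the wrong data.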
