Scoping report fails for PDF file
When the Rainbow tool executes the Scoping Report step, it fails with the following error:
net.sf.okapi.steps.leveraging.LeveragingStep initialize
INFO: Database: C:\okapi\TM_Base\EN_BG.pentm
java.lang.NullPointerException
at java.io.File.<init>(Unknown Source)
at net.sf.okapi.steps.scopingreport.ScopingReportStep.handleStartDocument(ScopingReportStep.java:880)
at net.sf.okapi.lib.extra.steps.AbstractPipelineStep.handleEvent(AbstractPipelineStep.java:107)
at net.sf.okapi.lib.extra.steps.CompoundStep.handleEvent(CompoundStep.java:101)
at net.sf.okapi.common.pipeline.Pipeline.execute(Pipeline.java:119)
at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:231)
at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:201)
at net.sf.okapi.common.pipelinedriver.PipelineDriver.processBatch(PipelineDriver.java:182)
at net.sf.okapi.applications.rainbow.pipeline.PipelineWrapper.execute(PipelineWrapper.java:472)
at net.sf.okapi.applications.rainbow.pipeline.PipelineWrapper.execute(PipelineWrapper.java:411)
at net.sf.okapi.applications.rainbow.CommandLine.launchPipeline(CommandLine.java:374)
at net.sf.okapi.applications.rainbow.CommandLine.execute(CommandLine.java:99)
at net.sf.okapi.applications.rainbow.Main.main(Main.java:44)
Steps to reproduce:
- Create new Settings
- Add pdf document
- Edit/Execute Pipeline with the steps below (use default options)
- Execute
Pipeline steps
- Raw Document to Filter Events
- Segmentation
- Leveraging
- Scoping Report
- Rainbow Translation Kit Creation
More info
PdfFilter returns an event whose resource has a null inputURI property (PdfFilter.java, lines 164-167).
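The top frame of the trace (java.io.File.<init>) can be reproduced outside Okapi: the File constructor throws a NullPointerException when it is given a null path. The sketch below only illustrates that behaviour and one possible guard. It assumes the step builds a File from the document's URI path, which the report does not confirm. NullUriDemo and safeFileFor are hypothetical names, not part of the Okapi API.

```java
import java.io.File;
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of Okapi: shows why a null input URI
// can end in an NPE inside java.io.File.<init>, and one way to guard it.
final class NullUriDemo {

    // Returns the file for a URI, or empty when no URI was provided.
    static Optional<File> safeFileFor(URI inputUri) {
        if (inputUri == null) {
            return Optional.empty();
        }
        return Optional.of(new File(inputUri.getPath()));
    }

    // True when constructing a File from a null path throws an NPE.
    static boolean nullPathThrows() {
        try {
            new File((String) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(nullPathThrows());
        System.out.println(safeFileFor(null).isPresent());
        System.out.println(safeFileFor(URI.create("file:///docs/sample.pdf")).isPresent());
    }
}
```

A guard like this only hides the symptom in one step; the comments below discuss where the missing name should really be supplied.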
Comments (8)
-
reporter -
The issue comes from the PDFFilter, not the ScopingReport step.
The PDF Filter is a read-only filter (one cannot merge the translation back). It works by first extracting the text of the original input PDF file using Apache's pdfbox. The output of pdfbox (one big string) is then passed on to a ParaPlainTextFilter. Because the input to that secondary filter is a string and not a physical file, the inputURI field of its RawDocument is not set. The next() calls get the events from the ParaPlainTextFilter, which has no inputURI, hence the null pointer one gets with pretty much any step.
Possible workaround: set the inputURI of the RawDocument for the ParaPlainTextFilter to the inputURI of the original PDF. I think that could cause problems, though, because the RawDocument has a priority order for how it tries to open the stream. We would have to see if the inputCharSequence input has priority over the inputURI. Even if it does, there may be side effects from having an inputURI set for the incorrect input...
Note: checked: yes, inputCharSequence is checked before inputURI in getStream(). To be discussed.
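To make the priority question concrete, here is a minimal sketch of a holder that checks its character sequence before its URI. It is not Okapi's RawDocument: SketchRawDocument, openReader and readAll are hypothetical names, and the real getStream() may differ in detail. The point is only that, with this ordering, a URI set purely for naming is never opened when text is present.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a raw-document holder; NOT Okapi's RawDocument.
final class SketchRawDocument {
    private final CharSequence inputCharSequence; // may be null
    private final URI inputUri;                   // may be null

    SketchRawDocument(CharSequence inputCharSequence, URI inputUri) {
        this.inputCharSequence = inputCharSequence;
        this.inputUri = inputUri;
    }

    URI getInputUri() {
        return inputUri;
    }

    // The character sequence is checked first, so the URI is only
    // opened when no in-memory text was supplied.
    Reader openReader() throws IOException {
        if (inputCharSequence != null) {
            return new StringReader(inputCharSequence.toString());
        }
        if (inputUri != null) {
            return new InputStreamReader(inputUri.toURL().openStream(), StandardCharsets.UTF_8);
        }
        throw new IOException("No input available");
    }

    // Reads everything from whichever source openReader() selected.
    static String readAll(SketchRawDocument doc) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = doc.openReader()) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The URI points at a file that does not exist; it is kept for naming only.
        URI pdf = URI.create("file:///nonexistent/original.pdf");
        SketchRawDocument nested = new SketchRawDocument("text extracted by pdfbox", pdf);
        System.out.println(readAll(nested));
        System.out.println(nested.getInputUri());
    }
}
```

The possible side effect mentioned above remains: any code that treats the URI as the location of the data, rather than as a name, would be pointed at the PDF instead of the extracted text.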
-
reporter Hello Yves. Thank you for the quick response and the details. I agree with you that the problem comes from the PdfFilter. Could you please explain why PdfFilter uses RawDocument and OpenXmlFilter uses StartDocument?
-
All IFilter implementations use RawDocument for the input. The idea is to allow different types of input to be processed (e.g. from a string or a physical file). StartDocument is an event that is generated by IFilter implementations. OpenXmlFilter has extra methods for opening with either a string or a URI. That probably would not change the issue with the PDFFilter.
-
This is one of those cases I ran into when I was trying to clean up the RawDocument abstraction a while back. There are a bunch of places like this that make assumptions that depend on a particular flavor of RD.
-
reporter As far as I understand, there are three options for resolving the issue:
* Change RawDocument
* Change PdfFilter to use something other than RawDocument
* Change ScopingReport to make it work with RawDocument
@ysavourel, @tingley, what would you suggest?
-
Fixing RawDocument so that all types of RawDocument behave consistently is a thing I'd like to do, but it's not a small task.
This seems like something we should fix in the filter. The PDF filter appears to violate the contract of the event system: StartDocument is supposed to have a name.
-
The nested text filter is derived from AbstractLineFilter, which contains this code:
if ( input.getInputURI() != null ) { docName = input.getInputURI().getPath(); }
The docName is then used to provide the name for the StartDocument event.
This seems like the crux of the problem. inputURI in a RawDocument is meant to provide the location of the data, but it's also being used to generate metadata here. Speaking conceptually, we should always be able to name a RawDocument: either from the inputURI, or from the name of a File, or (if we're working with a character sequence) either via some externally-provided label or an anonymous label generated internally.
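The naming fallback described above could look roughly like the sketch below. It is an illustration only: DocNames, resolve and the "anonymous-N" label format are hypothetical and do not exist in Okapi, and the File-name case is left out for brevity.

```java
import java.net.URI;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical naming helper; none of these names exist in Okapi.
final class DocNames {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Picks a name in order: URI path, caller-supplied label, generated label.
    static String resolve(URI inputUri, String label) {
        // getPath() is null for opaque URIs, so both checks are needed.
        if (inputUri != null && inputUri.getPath() != null) {
            return inputUri.getPath();
        }
        if (label != null && !label.isEmpty()) {
            return label;
        }
        return "anonymous-" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(resolve(URI.create("file:///docs/sample.pdf"), null));
        System.out.println(resolve(null, "pdfbox-extracted-text"));
        System.out.println(resolve(null, null));
    }
}
```

With something like this in place, a StartDocument built from a character sequence would always get a non-null name, without the inputURI having to double as metadata.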