Splitting long TextUnit contents into sentences?

Issue #625 resolved
Former user created an issue

Hello, we are trying to parse some CSV files into Zanata.

What we are trying to achieve is:

if TextUnit content has multiple sentences:
    return multiple TextUnits, one per sentence
else:
    return a single TextUnit (since it is only one sentence)

The configuration file we are sending (generated with Rainbow):

#v1
unescapeSource.b=true
trimLeading.b=true
trimTrailing.b=true
preserveWS.b=false
useCodeFinder.b=false
codeFinderRules=#v1$0a$count.i=2$0a$rule0=%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn]$0a$rule1=(\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v$0a$sample=$0a$useAllRulesWhenTesting.b=false
wrapMode.i=0
columnNamesLineNum.i=1
valuesStartLineNum.i=2
detectColumnsMode.i=0
numColumns.i=1
sendHeaderMode.i=0
trimMode.i=1
sendColumnsMode.i=1
sourceIdColumns=
sourceColumns=1
targetColumns=2,3,4
commentColumns=5
commentSourceRefs=
recordIdColumn.i=0
sourceIdSourceRefs=
sourceIdSuffixes=
targetLanguages=fr,es,it
targetSourceRefs=1,1,1
fieldDelimiter=,
textQualifier="
removeQualifiers.b=true
parametersClass=net.sf.okapi.filters.table.csv.Parameters

The attached file has two source cells that together contain four sentences. We would like to get four TUs from Okapi (one for the first row, three for the second row). Is it possible to do this through the filter configuration?

Please also refer to the Comment column in the attachment.

Zanata's OkapiFilterAdapter implementation: https://github.com/zanata/zanata-platform/blob/master/server/zanata-war/src/main/java/org/zanata/adapter/OkapiFilterAdapter.java

Thanks,

Kaan Demirel.

Comments (7)

  1. Chase Tingley

    The filter on its own won't do it. However, if you run the filter as part of a pipeline you can use the Segmentation Step to perform the sentence breaking.

    Normally, however, that doesn't produce multiple Text Units. It breaks the TU up into multiple Segment objects inside a single TU. (When serialized to XLIFF, this is represented using <seg-source>.)

    If the Segments absolutely need to be promoted to separate text units, you can use the Segments To Text Units Converter step to do so.
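
To make the two stages described above concrete, here is a minimal, self-contained sketch of the idea: first break each cell into sentence segments, then promote each segment to its own unit. It deliberately does not use Okapi. The JDK's `java.text.BreakIterator` stands in for the Segmentation Step, and the class and method names (`SentenceSplitSketch`, `segment`, `toTextUnits`) are hypothetical names chosen for illustration, not Okapi or Zanata API. The sample strings are invented as well, and real results depend on the segmentation rules in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: BreakIterator is a stand-in for Okapi's Segmentation
// Step. None of the names in this class are Okapi or Zanata API.
class SentenceSplitSketch {

    // Stage 1: break one cell's content into sentence segments.
    // This mirrors "multiple Segment objects inside a single TU".
    static List<String> segment(String content, Locale locale) {
        List<String> segments = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(content);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Boundaries include trailing whitespace, so trim each piece.
            String sentence = content.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                segments.add(sentence);
            }
        }
        return segments;
    }

    // Stage 2: promote every segment to its own unit, which mirrors what
    // a segments-to-text-units conversion is meant to achieve.
    static List<String> toTextUnits(List<String> cells, Locale locale) {
        List<String> units = new ArrayList<>();
        for (String cell : cells) {
            units.addAll(segment(cell, locale));
        }
        return units;
    }

    public static void main(String[] args) {
        // Hypothetical source cells: one sentence, then three sentences.
        List<String> cells = List.of(
                "Only one sentence here.",
                "This is the first. Is this the second? This is the third!");
        toTextUnits(cells, Locale.ENGLISH).forEach(System.out::println);
    }
}
```

A cell with a single sentence comes back as one unit, so the "else" branch of the original request needs no special handling.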
