Splitting long TextUnit contents into sentences?
Hello, we are trying to parse some CSV files into Zanata.
What we are trying to achieve is:
if TextUnit content has multiple sentences:
    return multiple TextUnits, each containing one sentence
else:
    return a single TextUnit (since it is only one sentence)
The configuration file we are sending (generated with Rainbow):
#v1
unescapeSource.b=true
trimLeading.b=true
trimTrailing.b=true
preserveWS.b=false
useCodeFinder.b=false
codeFinderRules=#v1$0a$count.i=2$0a$rule0=%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn]$0a$rule1=(\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v$0a$sample=$0a$useAllRulesWhenTesting.b=false
wrapMode.i=0
columnNamesLineNum.i=1
valuesStartLineNum.i=2
detectColumnsMode.i=0
numColumns.i=1
sendHeaderMode.i=0
trimMode.i=1
sendColumnsMode.i=1
sourceIdColumns=
sourceColumns=1
targetColumns=2,3,4
commentColumns=5
commentSourceRefs=
recordIdColumn.i=0
sourceIdSourceRefs=
sourceIdSuffixes=
targetLanguages=fr,es,it
targetSourceRefs=1,1,1
fieldDelimiter=,
textQualifier="
removeQualifiers.b=true
parametersClass=net.sf.okapi.filters.table.csv.Parameters
The attached file has two source cells that together contain 4 sentences. We would like to get 4 TUs from Okapi (1 for the first row, 3 for the second row). Is this possible with filter configuration alone?
Please also refer to the Comment column in the attachment.
Zanata's OkapiFilterAdapter impl: https://github.com/zanata/zanata-platform/blob/master/server/zanata-war/src/main/java/org/zanata/adapter/OkapiFilterAdapter.java
Thanks,
Kaan Demirel.
Comments (7)
-
The filter on its own won't do it. However, if you run the filter as part of a pipeline you can use the Segmentation Step to perform the sentence breaking.
Note, however, that this normally doesn't produce multiple Text Units: it breaks the TU up into multiple Segment objects inside a single TU. (When serialized to XLIFF, this is represented using <seg-source>.)
If the Segments absolutely need to be promoted to separate text units, you can use the Segments To Text Units Converter step to do so.
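To make the difference between "segments inside one TU" and "one TU per sentence" concrete, here is a minimal sketch in plain Java. It does not use the Okapi API: sentence breaking is done with the JDK's java.text.BreakIterator rather than Okapi's segmentation rules, and the Unit class, the method names, and the "_1", "_2" id suffixes are all hypothetical, chosen only for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: models the idea, not Okapi's classes or steps.
class SegmentSketch {

    /** Hypothetical stand-in for a text unit: an id plus its content. */
    static final class Unit {
        final String id;
        final String text;

        Unit(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Stage 1: break the content into sentences (the unit itself stays whole). */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Stage 2: promote each sentence to its own unit; single-sentence units pass through. */
    static List<Unit> promote(Unit unit) {
        List<String> sentences = splitSentences(unit.text);
        List<Unit> result = new ArrayList<>();
        if (sentences.size() <= 1) {
            result.add(unit);
            return result;
        }
        for (int i = 0; i < sentences.size(); i++) {
            // Hypothetical id scheme: original id plus a 1-based suffix.
            result.add(new Unit(unit.id + "_" + (i + 1), sentences.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Unit> rows = new ArrayList<>();
        rows.add(new Unit("row1", "This row has only one sentence."));
        rows.add(new Unit("row2",
                "This is the first sentence. Here is the second one. And this is the third."));
        for (Unit row : rows) {
            for (Unit unit : promote(row)) {
                System.out.println(unit.id + ": " + unit.text);
            }
        }
    }
}
```

In an actual pipeline the two stages correspond to the Segmentation Step and the Segments To Text Units Converter step; the sketch only shows why the second stage is needed if downstream code expects one TU per sentence.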
-
Account Deleted reporter Thanks, I think we can figure it out from there.
-
- changed status to resolved
OK. Closing this for now; we can reopen it (or open a new one) if this continues to give you trouble.