Customization of Segmentation (all tools)
Thread poster: Ella Luz
Do you happen to know whether it is possible to make a CAT tool ignore pilcrows (paragraph marks) and tabs within sentences in Word documents? I am translating documents containing sentences which span several lines but are interrupted by paragraph marks and tabs. How can I tell a tool to only accept a full stop (= actual end of sentence) as the end of a segment?
There are instances where there is a placeholder within sentences indicated by several full stops (…). How can I add this exception to my segmentation rules, once I have customized the tool? I don’t want the segment to end after this set of full stops either, only after a single full stop.
Is such a customization possible at all and if so, is it possible with every CAT-tool? This question might affect our decision regarding which tool to go for. We need a collaborative tool and will most likely go for a cloud-based one. Do internet-based tools have this feature? How about Memsource?
Thank you very much in advance!
Re: Customization of Segmentation (all tools) (Nov 23, 2016)
With tab characters, it is pretty easy to configure your CAT tool so that there is no segmentation on tab characters. Some CAT tools, such as memoQ, do not segment on tab characters by default; others, like Memsource, do.
Paragraph breaks (hard returns) are more difficult. A paragraph is a distinct structural unit, so when a CAT tool segments a document, it first breaks the text into structural units (in Word these are paragraphs; in HTML they are div, p and similar structural tags). Only after that is each unit segmented according to the segmentation rules. For this reason it is impossible to avoid segmentation on paragraph breaks: you can create a segmentation rule for this, but it will never take effect, because the rules are only applied inside a paragraph and never see the paragraph break itself. Note, however, that it is possible to avoid segmentation on line breaks (soft returns). In Word you can use Unbreaker for Word (part of a Word add-in) to remove hard returns semi-automatically before the document is imported into the CAT tool, which avoids the issue.
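To illustrate why a rule cannot "undo" a paragraph break, here is a minimal Python sketch of the two-stage process described above. It is not any CAT tool's actual code: the function names, the use of "\n" for a paragraph mark and "\v" for a soft return, and the single sentence rule are all assumptions made for the example.

```python
import re

# Illustrative sentence rule (an assumption, not a real tool's rule set):
# break after a full stop that is followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=\.)\s+")

def split_structural_units(text):
    """Stage 1: split on paragraph marks, modelled here as '\n'."""
    return [p for p in text.split("\n") if p.strip()]

def segment(text, rule=SENTENCE_BREAK):
    """Stage 2: apply the sentence rule inside each paragraph separately."""
    segments = []
    for paragraph in split_structural_units(text):
        # The rule only ever sees one paragraph at a time, so it has no
        # way to join text across a paragraph mark.
        segments.extend(s for s in rule.split(paragraph) if s)
    return segments

doc = "This sentence is interrupted\nby a paragraph mark. Second sentence."
print(segment(doc))

# A soft return (modelled as '\v') stays inside the paragraph, so the
# sentence rule decides what happens there, and this rule does not break on it.
soft = "This sentence is interrupted\vby a line break. Second sentence."
print(segment(soft))
```

In the first call the sentence is cut at the paragraph mark before any rule runs; in the second it survives as one segment, which is why converting hard returns to soft returns (or removing them) before import helps.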
Regarding the ellipsis (…), it is easy enough to prevent segmentation on this character.
Memsource uses segmentation rules in SRX format. For Word files, it segments on tab characters by default, but it is easy to set up a default rule for a specific source language that avoids segmenting on tabs. By default, Memsource does not segment on the ellipsis character (Unicode U+2026). You can customize Memsource and other CAT tools so that they segment only on specific punctuation marks, e.g. only on a full stop (.), as in your case.
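SRX itself is an XML format, but the way its rules interact can be sketched in Python. The snippet below is only an emulation under stated assumptions: the rule list is hypothetical (it is not Memsource's default rule set), and it mimics the SRX convention that each rule has a "before break" and an "after break" pattern and that the first matching rule at a position wins.

```python
import re

# (is_break, before-break pattern, after-break pattern).
# Hypothetical rules for illustration; the first matching rule wins.
RULES = [
    (False, r"\.\.+", r"\s"),   # run of two or more full stops: no break
    (False, "\u2026", r"\s"),   # ellipsis character: no break (explicit, for clarity)
    (True, r"\.", r"\s"),       # single full stop followed by whitespace: break
]

def segment(text):
    """Split text at every position where the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, before_pat, after_pat in RULES:
            if re.search(f"(?:{before_pat})$", before) and re.match(after_pat, after):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # later rules are ignored once one has matched
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

print(segment("Please enter ... here\tand continue. Next sentence."))
```

Because no rule mentions the tab character, the tab never triggers a break, and the exception for several full stops is listed before the general full-stop rule so that it takes precedence.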
TransTools – Useful tools for every translator
esperantisto
English to Russian
OmegaT and Anaphraseus (Nov 23, 2016)
OmegaT: yes for tabs and no for paragraph breaks (paragraphs are the largest possible translation units).
Anaphraseus: segments can be expanded both over tabs and paragraph breaks (in the latter case, a warning is issued). However, Anaphraseus has no collaborative features.
[Edited at 2016-11-23 14:52 GMT]
French to English
memoQ can usually join segments and replace the paragraph marks with a tag. Tabs can also be replaced with a tag. Segmentation can be adjusted with rules. memoQ also has a cloud server version, although I don't know much about it.
Samuel Murray
English to Afrikaans
"How can I tell a tool to only accept a full stop (= actual end of sentence) as the end of a segment?"
The only tool that I know of that can ignore hard line breaks is OmegaT, and then only for TXT files.
In Word's Find and Replace, try replacing all hard line breaks (^p) with soft line breaks (^l), and then optionally change all double soft line breaks (^l^l) back to hard line breaks (^p^p).
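For anyone who prefers to script this clean-up on extracted text rather than do it by hand in Word, the two replacements can be sketched in Python. This is a rough sketch, not a Word macro: the function name is made up, and "\n" and "\v" are stand-ins chosen here for Word's ^p and ^l.

```python
PARAGRAPH_MARK = "\n"  # stands in for Word's ^p (assumption for this sketch)
LINE_BREAK = "\v"      # stands in for Word's ^l (assumption for this sketch)

def soften_breaks(text):
    """Turn hard breaks into soft ones, then restore real paragraph boundaries."""
    # Step 1: every paragraph mark becomes a soft line break.
    text = text.replace(PARAGRAPH_MARK, LINE_BREAK)
    # Step 2 (optional): a doubled soft break was an empty line between
    # paragraphs, so turn it back into two paragraph marks.
    return text.replace(LINE_BREAK * 2, PARAGRAPH_MARK * 2)

sample = "A sentence\nsplit in two.\n\nNew paragraph."
print(repr(soften_breaks(sample)))
```

The second step only makes sense if real paragraph boundaries in the source are marked by an empty line; if they are not, there is nothing left to tell a genuine paragraph end from a mid-sentence break.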
"There are instances where there is a placeholder within sentences indicated by several full stops (…)."
Try replacing the three dots (...) with a single ellipsis character (…). Then find every ellipsis followed by a space and replace it with just the ellipsis.
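The same two replacements, sketched in Python for text that has already been extracted from the document (the function name is hypothetical, and only exactly three consecutive dots are handled):

```python
ELLIPSIS = "\u2026"  # the single ellipsis character

def normalise_ellipsis(text):
    """Replace three dots with one ellipsis and drop the space that follows it."""
    # Step 1: three full stops become a single ellipsis character.
    text = text.replace("...", ELLIPSIS)
    # Step 2: remove the space after the ellipsis, so the tool does not
    # see "punctuation followed by a space" at that position.
    return text.replace(ELLIPSIS + " ", ELLIPSIS)

print(normalise_ellipsis("Enter ... here. Done."))
```

Note that the second step changes the spacing of the source text, so the space may have to be put back in the translation.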
"Is such a customization possible at all and if so, is it possible with every CAT-tool? This question might affect our decision regarding which tool to go for."
Most CAT tools require the user to fix the source files first (or on the fly) to deal with the limitations of the CAT tool.
Dear Stanislav, esperantisto, John and Samuel,
Thank you very much for your comments, they are really helpful!