Beyond paragraph in OmegaT - question about segmentation
Thread poster: Marcos Zattar
Marcos Zattar
Germany
Local time: 06:32
German to Portuguese
Aug 15, 2008

Hello,

A lot has been discussed about segmentation in OmegaT, including the fact that *past versions* could not segment at sentence level and instead delimited segments by the paragraph mark.

Well, I have exactly the opposite problem: I need a segmentation method that ignores paragraph marks, because my file format has them in the middle of the sentences.

I checked the following site: http://www.omegat.org/en/howtos/new_filter.html

It explains how to create new filters for "exotic" file types such as mine. My question: is it possible to create a filter that ignores paragraph marks and does not treat them as the end of a segment?

Please note that I cannot just erase those marks within my source language text, because that would mess up the formatting.

Thanks for any hints!

Kind regards,
Marcos



Samuel Murray  Identity Verified
Netherlands
Local time: 06:32
Member (2006)
English to Afrikaans
Soft returns instead of hard returns Aug 15, 2008

Marcos de Miranda Zattar wrote:
Well, I have exactly the opposite problem: I need a segmentation method that ignores paragraph marks, because my file format has them in the middle of the sentences.


1. Open your file in MS Word (because OpenOffice.org is rubbish)
2. Do a find/replace that finds ^p and replaces it with ^l
3. Save, and reopen in OpenOffice.org

This changes hard returns into soft returns, which OmegaT regards as inline formatting. Remember to use paragraph segmentation in OmegaT for this.

Oh, I assumed your document is ODT. If it is simply a plaintext file with empty lines between the paragraphs, ignore the above advice and simply go to Options -> File Filters, click Text, click Options, and try one of the other options.


xxxMarc P  Identity Verified
Local time: 06:32
German to English
Probably not practical Aug 15, 2008

I'm happy to be corrected, but I don't think that it is practical. The reason is that "paragraph segmenting" in OmegaT does not simply mean that OmegaT uses the paragraph marker as the point at which it segments. This is true only of plain-text files. In formatted file formats, paragraph segmenting instructs OmegaT to present the content between the opening and closing paragraph-level tags as the segment, and to ignore the markup between paragraphs. Removing these tags as markers for segmentation would mean that the entire text would be presented as one segment (which is what you want, segmentation then being assured by the regex syntax for sentence-level segmenting), but it would also mean that paragraph-level formatting would be displayed.

To correct this, I think (I'm not absolutely sure) that you would have to modify the filter such that paragraph-level segmenting tags were ignored (except the paragraph tag itself, which you would have to treat as an inline tag). I think that is unlikely to be practical, though it may be theoretically possible.

This is also the reason why Samuel's suggestion won't work (Sorry, Samuel). Replacing the hard paragraph break with a line break merges the paragraphs either side of the break, meaning that they must then have the same paragraph-level formatting. As a result, they both assume the formatting of either the paragraph before or the paragraph after (in Word, the paragraph after, it seems, which is logical, since the paragraph marker in Word "contains" the paragraph-level formatting).

You could try experimenting with the primitive Abiword filter provided as an example in the HowTo: create a simple file in Abiword, then successively define the non-inline XML tags as non-translatable. If you do, you'll realize just how primitive the existing Abiword filter is. For one job, the effort is unlikely to be worthwhile; it might be interesting to have such a filter for the future, although the only time I've encountered such texts is in files converted from PDF, and I usually find it easier to delete the unwanted breaks manually and reformat as necessary.

Marc


Marcos Zattar
Germany
Local time: 06:32
German to Portuguese
TOPIC STARTER
Sample - the use of formatting marks to delimit segments Aug 15, 2008

Thank you Samuel and Mark for the good ideas.

I experimented with Samuel's suggestion. It is indeed not practical, because chunks of text that are too big are regarded as segments.

Mark: before I try your idea out, maybe you could have a look at my text. Below is a sample of it.

Please note that at the end of *every single line* there is a paragraph mark. The file below counts 74 paragraph marks.

The codes B1, AS, AL and others that appear at the beginning of a line are formatting info. I intended to use them in the filter for segmenting.

So, what do you think?




/HTEXT
/:OBJECT TERM
/:NAME CONSCHECK_GEOMETRIC_INSTANCE
/:ID T01
/:LANGUAGE P
/:FORM S_DOCU_PRINT
/:STYLE S_DOCUS1
/:FIRST-USER
/:FIRST-DATE 00 00 0000
/:FIRST-TIME 00 00 00
/:LAST-USER
/:LAST-DATE 00 00 0000
/:LAST-TIME 00 00 00
/:TITLE ' '
/:TITLE1 ' '
/:TITLE2 ' '
/MTEXT
U1Konsistenzprüfung für Lageinstanzen
ASMit dieser Funktion überprüfen Sie, ob die Menge einer Positionsvariante
oder einer Baukastenposition (bzw. des jeweiligen Änderungsstands) mit
der Anzahl von Lageinstanzen, die dem Objekt zugeordnet sind und
bestimmte Filterkriterien erfüllen, übereinstimmt.
AL¬e&
ALFür die Konsistenzprüfung muss die Menge am übergeordneten Objekt in der
Mengeneinheit ST (Stück) oder einer anderen zählbaren Einheit angegeben
sein.
ASUm die Konsistenzprüfung zu implementieren, verwenden Sie den
Funktionsbaustein PPEHI_PVINS_CHECK_CONSISTENCY.
AL¬e&
ALDie Eingabeparameter sind in diesem Funktionsbaustein ähnlich wie im
Funktionsbaustein PPEHI_PVINS_GET_INST_BY_OBJECT.
ASEingabeparameter
/:INCLUDE IV_MSG_HANDLING OBJECT DOKU ID TX
IV_MSG_HANDLING
/:INCLUDE IV_COMPONENT_VARIANT_ID OBJECT DOKU ID TX
/:INCLUDE IV_ASSEMBLY_RELATION_ID OBJECT DOKU ID TX
B1IV_CHANGE_NUMBER
ALWenn Sie den Änderungsstand des übergeordneten Objekts kennen, geben Sie
diesen hier an. Andernfalls lassen Sie diesen Parameter leer und geben
das Gültigkeitsdatum im folgenden Parameter an.
B1IV_VALIDITY_DATE


xxxMarc P  Identity Verified
Local time: 06:32
German to English
Sample - the use of formatting marks to delimit segments Aug 15, 2008

Marcos,

I'm not sure whether your sample has reproduced properly by being pasted here (quoting it reveals some additional codes), but I would treat this particular case as plain text. It should be quite easy to write a script (or regex S&R) to delete paragraph breaks except where they occur before B1, AS and AL (and any other paragraph-level formatting codes). Put the breaks back in again after translating, either manually or again by using a script. Certainly much easier than writing a dedicated OmegaT filter.

Marc (also with a c, by the way)
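
The script approach suggested above could be sketched in Python roughly as follows. This is only an illustration under assumptions, not a tested tool for the real file: the names FORMAT_CODES, join_wrapped_lines and rewrap are made up for this sketch, the list of codes (U1, AS, AL, B1) is just what appears in the sample, lines starting with "/" are assumed to be control lines that must stay untouched, and the 72-character width is inferred from the sample lines rather than from any format documentation.

```python
import textwrap

# Paragraph-level codes seen in the sample (assumption: extend for the real file).
FORMAT_CODES = ("U1", "AS", "AL", "B1")


def _starts_paragraph(line, codes):
    """True for blank lines, "/" control lines, and lines opening with a code."""
    return not line.strip() or line.startswith("/") or line.startswith(tuple(codes))


def join_wrapped_lines(text, codes=FORMAT_CODES):
    """Delete line breaks except before a format code or a control line."""
    paragraphs = []
    for line in text.splitlines():
        prev = paragraphs[-1] if paragraphs else None
        keep_break = (
            prev is None
            or _starts_paragraph(line, codes)
            or prev.startswith("/")   # never glue text onto a control line
            or not prev.strip()       # nor onto a blank line
        )
        if keep_break:
            paragraphs.append(line)
        else:
            paragraphs[-1] = prev.rstrip() + " " + line.lstrip()
    return "\n".join(paragraphs)


def rewrap(text, width=72, codes=FORMAT_CODES):
    """Put line breaks back: wrap each paragraph body to `width` characters."""
    out = []
    for line in text.splitlines():
        if line.startswith("/") or not line.strip():
            out.append(line)  # control and blank lines pass through unchanged
            continue
        code = next((c for c in codes if line.startswith(c)), "")
        body = textwrap.wrap(
            line[len(code):], width=width,
            break_long_words=False, break_on_hyphens=False,
        ) or [""]
        body[0] = code + body[0]  # the code sits in front of the first line only
        out.extend(body)
    return "\n".join(out)
```

The idea would be to run join_wrapped_lines on the source file, translate the result as plain text, then run rewrap on the translated file. One known weakness: a continuation line that happens to begin with one of the codes (say, a word starting with "AS") would be mistaken for a new paragraph, so the output should be checked against the original.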

