How to get rid of a problem encountered while splitting a paragraph into sentences? Thread poster: Rajan Chopra
|
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ...
Hello experts, I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that is time-consuming. Another method uses the Find and Replace function: put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example: 1. abbreviations (W.H.O.) 2. honorific titles (Mr., Mrs.) 3. decimals in amounts (Rs. 483.97) 4. email addresses ([email protected]) The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this? Thanks and regards, Chopra
[Edited at 2015-10-25 03:24 GMT] | | |
..... (X) Local time: 08:46 Sentence Boundary Detection | Oct 25, 2015 |
The problem is called sentence boundary detection (or segmentation). There are many different libraries for segmentation (and most CAT tools come equipped to perform segmentation). I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter. There is a live demo of Pragmatic Segmenter here. Kevin | | |
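As a rough illustration of what checking a segmenter against a set of edge cases might look like, here is a minimal Python sketch. The three cases and the names `CASES`, `naive_segmenter` and `passed_cases` are made up for this example and are not taken from Pragmatic Segmenter's actual Golden Rules. The naive segmenter simply mimics replacing every full stop with a full stop plus a paragraph mark.

```python
# Hypothetical edge cases: (input text, expected list of sentences).
CASES = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    ("Mr. Rao left. He was late.", ["Mr. Rao left.", "He was late."]),
    ("It costs Rs. 483.97 today.", ["It costs Rs. 483.97 today."]),
]


def naive_segmenter(text):
    """Break after every full stop, like a blunt Find and Replace."""
    pieces = text.replace(".", ".\n").split("\n")
    return [p.strip() for p in pieces if p.strip()]


def passed_cases(segmenter, cases=CASES):
    """Return the inputs for which the segmenter gives the expected result."""
    return [text for text, expected in cases if segmenter(text) == expected]


if __name__ == "__main__":
    ok = passed_cases(naive_segmenter)
    print(f"{len(ok)} of {len(CASES)} cases passed")
```

With these cases the naive segmenter passes only the first input, because it also breaks after "Mr." and inside "Rs. 483.97". Any other segmenter function with the same signature could be passed to `passed_cases` for comparison.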
2nl (X) Netherlands Local time: 01:46 Use an advanced wildcard search | Oct 25, 2015 |
chopra_2002 wrote: 1. abbreviations (W.H.O.) 2. honorific titles (Mr., Mrs.) 3. decimals in amounts (Rs. 483.97) 4. email addresses ([email protected]) You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then by a new line/paragraph mark or an uppercase letter. The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list, which will require a macro. Or, since all the letters are uppercase, define a wildcard search to hide the full stops temporarily. Case 2 can also be handled by means of a macro. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro. | | |
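For readers who prefer a script to a Word macro, the idea described above (break only where a full stop is followed by whitespace and an uppercase letter, and temporarily hide the full stops in titles and uppercase acronyms) can be sketched in Python roughly as follows. This is only an illustration under assumptions: the `TITLES` list, the placeholder character and the function names are hypothetical, the email address in the example is a stand-in, and the regular expressions are not equivalent to any particular Word wildcard pattern.

```python
import re

# Hypothetical, deliberately short list; extend it for real texts.
TITLES = ("Mr", "Mrs", "Ms", "Dr", "Rs")
# Stand-in character for a hidden full stop (assumed not to occur in the text).
PLACEHOLDER = "\u2024"


def hide_full_stops(text):
    """Temporarily replace full stops that must not end a sentence."""
    titles = "|".join(TITLES)
    # Hide the stop after listed titles, e.g. "Mr." or "Rs.".
    text = re.sub(rf"\b({titles})\.", rf"\1{PLACEHOLDER}", text)
    # Hide every stop in all-uppercase acronyms such as "W.H.O.".
    return re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group().replace(".", PLACEHOLDER),
        text,
    )


def split_sentences(text):
    """Split after a full stop followed by whitespace and an uppercase letter."""
    hidden = hide_full_stops(text)
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", hidden)
    # Restore the hidden full stops in each sentence.
    return [p.replace(PLACEHOLDER, ".").strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = (
        "Mr. Sharma paid Rs. 483.97 to the W.H.O. office. "
        "He then wrote to info@example.com. The reply came quickly."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

One known limitation of this sketch: because the last full stop of an acronym is hidden as well, a sentence that ends with "W.H.O." is not separated from the sentence that follows it. Handling that case needs extra rules or a proper segmentation library.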
Kevin is right, of course. Just about every CAT tool can do it, and you can use Okapi for the purpose, but I think the easiest solution is a LibreOffice extension: LibreOffice is free (as in beer and, err, libre), and it's as close as you can get to a Word file. Cheers, Hans | |
|
|
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ... TOPIC STARTER Thanks for your quick reply | Oct 25, 2015 |
Kevin Dias wrote: I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter. Kevin Thank you for your informative reply. Is there any software that takes care of these issues? If so, please share the link. Thanks and regards, Chopra | | |
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ... TOPIC STARTER Thanks for suggesting the solutions | Oct 25, 2015 |
2nl wrote: You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then by a new line/paragraph mark or an uppercase letter. The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list, which will require a macro. Or, since all the letters are uppercase, define a wildcard search to hide the full stops temporarily. Case 2 can also be handled by means of a macro. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro. Thanks so much for your valuable suggestions, but one needs technical knowledge to apply them. Is there any link where the procedures for wildcard searches, making macros, etc. are explained in detail? Thanks and regards, Chopra
[Edited at 2015-10-25 04:05 GMT] | | |
2nl (X) Netherlands Local time: 01:46 Use a Word-based CAT tool to pre-segment | Oct 25, 2015 |
My advice would be to use a CAT tool like Wordfast or Metatexis to divide the document into segments. Demo versions will probably do the job. May I ask: How are you translating your document? | | |
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ... TOPIC STARTER I have the licensed (full) version of Wordfast Pro | Oct 25, 2015 |
2nl wrote: My advice would be to use a CAT tool like Wordfast or Metatexis to divide the document into segments. Demo versions will probably do the job. May I ask: How are you translating your document? I have the latest version of WF Pro (3.4.5), but even it cannot fully solve this problem. Sentences are broken unnecessarily, which not only causes inconvenience but also spoils the translation memory, because a different translation is entered in the broken segments (remember: the syntax of English and Hindi is different). I use this software (WF Pro) to translate my documents. In addition, I also have Trados Freelance 7.0, which is also unable to help in this respect. Thanks and regards, Chopra | |
|
|
Tom in London United Kingdom Local time: 00:46 Member (2008) Italian to English Only a paragraph | Oct 25, 2015 |
chopra_2002 wrote: Hello experts, I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that is time-consuming. Another method uses the Find and Replace function: put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example: 1. abbreviations (W.H.O.) 2. honorific titles (Mr., Mrs.) 3. decimals in amounts (Rs. 483.97) 4. email addresses ([email protected]) The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this? Thanks and regards, Chopra [Edited at 2015-10-25 03:24 GMT] It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem. | | |
Heinrich Pesch Finland Local time: 02:46 Member (2003) Finnish to German + ... WFP segmentation settings | Oct 25, 2015 |
Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of the abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item. The same is true for WFC. | | |
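To show why adding entries to such an abbreviation list helps, here is a minimal Python sketch of list-driven segmentation. It is not how Wordfast works internally; the comma-separated format (including a trailing comma), the example entries and the function names `parse_abbreviations` and `segment` are assumptions made for this illustration.

```python
def parse_abbreviations(setting):
    """Turn a comma-separated setting such as "Mr., Mrs.," into a set."""
    return {
        item.strip().rstrip(".").lower()
        for item in setting.split(",")
        if item.strip()  # ignore the empty item after a trailing comma
    }


def segment(text, abbreviations):
    """End a sentence at a word ending in "." unless it is a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token[:-1].lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without a final full stop
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    text = "Dr. Rao charged approx. Rs. 483.97 for the visit. Mrs. Rao paid."
    short_list = parse_abbreviations("Dr., Mrs., Rs.,")
    longer_list = parse_abbreviations("Dr., Mrs., Rs., approx.,")
    print(segment(text, short_list))   # breaks wrongly after "approx."
    print(segment(text, longer_list))  # the added entry prevents that break
```

The sketch only looks at full stops and collapses runs of whitespace, so it is far cruder than a real CAT tool, but it shows the effect of one missing list entry on the resulting segments.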
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ... TOPIC STARTER It is not about just one paragraph | Oct 25, 2015 |
Tom in London wrote: It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem. Thanks for your reply. I am not talking about just one paragraph because I get small as well as big projects for translation. If a procedure can be applied to one paragraph, the same can be applied to a text consisting of thousands of words. Thanks and regards, Chopra | | |
Heinrich Pesch wrote: Go to Edit - Preferences and select Segmentation settings. ...makes sense. A lot of it. If you translate from Hindi to English, paragraph segmentation makes sense. However, I think the only thing the OP will have to do if he switches to English to Hindi is to change the segmentation settings from paragraph segmentation to sentence segmentation. Cheers, Hans | |
|
|
Rajan Chopra India Local time: 05:16 Member (2008) English to Hindi + ... TOPIC STARTER Thanks for the suggestion | Oct 25, 2015 |
Heinrich Pesch wrote: Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of the abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item. The same is true for WFC. Yes, it can solve the problem for abbreviations, but one will have to add new abbreviations manually. However, it is still useful, because it will also benefit future projects in which the same abbreviations appear. Thanks and regards, Chopra | | |
Rolf Keller Germany Local time: 01:46 English to German How to write macros --> Google | Oct 26, 2015 |
chopra_2002 wrote: Is there any link where the procedures for wildcard searches, making macros, etc. are explained in detail? There are hundreds or thousands of explanations & how-tos, at least one for any level of knowledge & grasp of things. Just throw something like this at Google: macro tutorial "word 2010" | | |