How to get rid of a problem encountered while splitting a paragraph into sentences?
Thread poster: chopra_2002

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
Oct 25, 2015

Hello experts,

I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that takes time. Another method is to use the Find and Replace function, i.e. put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses (xyz@abc.com)

The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this?
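
To make the problem concrete, here is a minimal Python sketch that mimics the Replace All step by breaking after every full stop. The sample sentence is made up for illustration; it simply combines the four cases listed above.

```python
# Made-up sample text combining the four problem cases from the list above.
text = "Mr. Rao paid Rs. 483.97 to the W.H.O. office. Write to xyz@abc.com for a receipt."

# Naive approach: break after every full stop, like replacing "." with ".^p".
naive = [piece.strip() for piece in text.replace(".", ".\n").split("\n") if piece.strip()]

for piece in naive:
    print(piece)
```

The two real sentences come out as nine fragments, including "Mr.", "483." and "com for a receipt.".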

Thanks and regards,

Chopra





[Edited at 2015-10-25 03:24 GMT]


 

Kevin Dias
Local time: 07:04
SITE STAFF
Sentence Boundary Detection Oct 25, 2015

The problem is called sentence boundary detection (or segmentation). There are many different libraries for segmentation (and most CAT tools come equipped to perform segmentation).

I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter.

There is a live demo of Pragmatic Segmenter here.

Kevin


 

xxx2nl  Identity Verified
Netherlands
Local time: 00:04
Use an advanced wildcard search Oct 25, 2015

chopra_2002 wrote:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses (xyz@abc.com)



You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then a new line/paragraph mark or an uppercase letter.

The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list. That will require a macro. Or, since all letters are uppercase, define a wildcard search to hide the full stops temporarily.

Also by means of a macro: case 2. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro.
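
The "hide the full stops temporarily" idea can be sketched outside Word as well. The following is a rough Python illustration, not a Word macro and not the exact wildcard expressions meant above. The names `HIDDEN`, `TITLES` and `split_sentences` are hypothetical, the title list is deliberately incomplete, and the split rule assumes at least one space after the full stop (with zero spaces allowed, "W.H.O." would itself be split).

```python
import re

# Hypothetical placeholder for a "hidden" full stop (U+2024 ONE DOT LEADER),
# assumed not to occur in the text being split.
HIDDEN = "\u2024"

# Assumed, incomplete list of titles; extend it for your own texts.
TITLES = ["Mr", "Mrs", "Ms", "Dr"]


def split_sentences(paragraph):
    # Step 1: hide the full stop after known titles (case 2).
    title_pattern = r"\b(" + "|".join(TITLES) + r")\."
    protected = re.sub(title_pattern, lambda m: m.group(1) + HIDDEN, paragraph)

    # Step 2: hide full stops in runs of single uppercase letters such as W.H.O. (case 1).
    protected = re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group(0).replace(".", HIDDEN),
        protected,
    )

    # Step 3: split only where a full stop is followed by whitespace and an
    # uppercase letter; decimals and email addresses (cases 3 and 4) never match.
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", protected)

    # Step 4: restore the hidden full stops.
    return [part.replace(HIDDEN, ".").strip() for part in parts if part.strip()]


sample = (
    "Mr. Rao paid Rs. 483.97 to the W.H.O. office in Delhi. "
    "Write to xyz@abc.com for a receipt. Mrs. Rao agreed."
)
for sentence in split_sentences(sample):
    print(sentence)
```

One limitation of this sketch: because the last full stop of an acronym is hidden too, a sentence that ends with "W.H.O." is not split from the next one. Handling that properly needs an abbreviation list or a manual check.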


 

Meta Arkadia
Local time: 05:04
English to Indonesian
+ ...
LibreOffice Oct 25, 2015

Kevin is right, of course. Just about every CAT tool can do it, you can use Okapi for the purpose, but I think the easiest solution is a LibreOffice extension:

[Screenshot: LO segments.png]

LibreOffice is free (beer and err, libre), and it's as close as you can get to a Word file.

Cheers,

Hans


 

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
Thanks for your quick reply Oct 25, 2015

Kevin Dias wrote:


I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter.


Kevin


Thank you for your informative reply. Is there any software that takes care of these issues? If so, please share the link.

Thanks and regards,

Chopra


 

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
Thanks for suggesting the solutions Oct 25, 2015

2nl wrote:

You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then a new line/paragraph mark or an uppercase letter.

The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list. That will require a macro. Or, since all letters are uppercase, define a wildcard search to hide the full stops temporarily.

Also by means of a macro: case 2. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro.

Thanks so much for your valuable suggestions but one should have technical knowledge to apply it. Is there any link in which the procedures for wildcard search, making macros etc. have been explained in detail?

Thanks and regards,

Chopra





[Edited at 2015-10-25 04:05 GMT]


 

xxx2nl  Identity Verified
Netherlands
Local time: 00:04
Use a Word-based CAT tool to pre-segment Oct 25, 2015

My advice would be to use a CAT tool like Wordfast or Metatexis to divide the document into segments. Demo versions will probably do the job.

May I ask: How are you translating your document?


 

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
I have the licensed (full version) of Wordfast Pro Oct 25, 2015

2nl wrote:

My advice would be to use a CAT tool like Wordfast or Metatexis to divide the document into segments. Demo versions will probably do the job.

May I ask: How are you translating your document?


I have the latest version of WF Pro (3.4.5) but even it is not quite capable of solving this problem and the sentences are broken unnecessarily which not only causes inconvenience but also spoils the translation memory because a different translation is entered in the broken segments (Remember: The syntax of English and Hindi is different).

I use this software (WF Pro) to translate the documents. In addition to it, I also have Trados Freelance 7.0 which is also unable to help in this respect.

Thanks and regards,

Chopra


 

Tom in London
United Kingdom
Local time: 23:04
Member (2008)
Italian to English
Only a paragraph Oct 25, 2015

chopra_2002 wrote:

Hello experts,

I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that takes time. Another method is to use the Find and Replace function, i.e. put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses (xyz@abc.com)

The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this?

Thanks and regards,

Chopra





[Edited at 2015-10-25 03:24 GMT]


It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem.


 

Heinrich Pesch  Identity Verified
Finland
Local time: 01:04
Member (2003)
Finnish to German
+ ...
WFP segmentation settings Oct 25, 2015

Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item.
The same is true for WFC.
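
For readers curious how an abbreviation list prevents false breaks, here is a minimal Python sketch of the general idea. It is not how Wordfast implements segmentation; the comma-separated `ABBREVIATIONS` string and the functions `load_abbreviations` and `segment` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical abbreviation list, comma-separated like a settings field,
# including a trailing comma after the last item.
ABBREVIATIONS = "Mr., Mrs., Dr., Rs., W.H.O.,"


def load_abbreviations(raw):
    # Ignore empty items such as the one produced by the trailing comma.
    return {item.strip() for item in raw.split(",") if item.strip()}


def segment(text, abbreviations):
    sentences = []
    start = 0
    # Candidate boundaries: end punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in abbreviations:
            continue  # the full stop belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    "Mr. Rao paid Rs. 483.97 to the W.H.O. office. "
    "Write to xyz@abc.com today. Dr. Mehta agreed."
)
for sentence in segment(sample, load_abbreviations(ABBREVIATIONS)):
    print(sentence)
```

As with the settings list, any abbreviation missing from `ABBREVIATIONS` still causes a false break, so the list has to grow with the texts you handle.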


 

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
It is not about just one paragraph Oct 25, 2015

Tom in London wrote:


It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem.


Thanks for your reply. I am not talking about just one paragraph because I get small as well as big projects for translation. If a procedure can be applied to one paragraph, the same can be applied to a text consisting of thousands of words.

Thanks and regards,

Chopra


 

Meta Arkadia
Local time: 05:04
English to Indonesian
+ ...
This... Oct 25, 2015

Heinrich Pesch wrote:
Go to Edit - Preferences and select Segmentation settings.


...makes sense. A lot of it. If you translate from Hindi to English, paragraph segmentation makes sense. However, I think the only thing the OP will have to do if he switches to English to Hindi, is change the segmentation settings from paragraph segmentation to sentence segmentation.

Cheers,

Hans


 

chopra_2002  Identity Verified
India
Local time: 03:34
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
Thanks for the suggestion Oct 25, 2015

Heinrich Pesch wrote:

Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item.
The same is true for WFC.


Yes, it can solve the problem for abbreviations but one will have to add the new abbreviations manually. However, it is still useful because it will be beneficial for future projects also in which the earlier abbreviations figure.

Thanks and regards,

Chopra


 

Rolf Keller
Germany
Local time: 00:04
English to German
How to write macros --> Google Oct 26, 2015

chopra_2002 wrote:

Is there any link in which the procedures for wildcard search, making macros etc. have been explained in detail?


There are hundreds or thousands of explanations & How-to's, at least one for any level of knowledge & grasp of things.

Just throw something like this to Google: macro tutorial "word 2010"


 

