https://www.proz.com/forum/office_applications/293735-how_to_get_rid_of_a_problem_encountered_while_splitting_a_paragraph_into_sentences.html

How to get rid of a problem encountered while splitting a paragraph into sentences?
Thread poster: Rajan Chopra
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
Oct 25, 2015

Hello experts,

I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that is time-consuming. Another method is to use the Find and Replace function: put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses ([email protected])

The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this?

Thanks and regards,

Chopra





[Edited at 2015-10-25 03:24 GMT]


 
..... (X)
Local time: 08:46
Sentence Boundary Detection Oct 25, 2015

The problem is called sentence boundary detection (or segmentation). There are many different libraries for segmentation (and most CAT tools come equipped to perform segmentation).

I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter.

There is a live demo of Pragmatic Segmenter here.

Kevin


 
2nl (X)
Netherlands
Local time: 01:46
Use an advanced wildcard search Oct 25, 2015

chopra_2002 wrote:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses ([email protected])



You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then a new line/paragraph mark or an uppercase letter.

The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list, which will require a macro. Or, since all the letters are uppercase, define a wildcard search to hide the full stops temporarily.

Case 2 can also be handled by means of a macro. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro.
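
For readers who would rather script this than build Word wildcards, the idea above can be sketched in Python. This is only an illustration under stated assumptions, not a Word macro: the function name `split_sentences`, the `TITLES` list and the `MASK` placeholder are hypothetical, and the sketch requires at least one whitespace character after the full stop (rather than zero or more) so that the inner full stops of an acronym are never split points.

```python
import re
from typing import List

# Hypothetical placeholder (a private-use character) that stands in for a
# full stop while it is "hidden". Assumed not to occur in the input text.
MASK = "\uE000"

# Assumed, non-exhaustive list of titles and similar abbreviations.
TITLES = ["Mr.", "Mrs.", "Ms.", "Dr.", "Rs."]


def split_sentences(paragraph: str) -> List[str]:
    text = paragraph

    # Step 1: temporarily hide the full stop in known titles.
    for title in TITLES:
        text = re.sub(r"\b" + re.escape(title), title.replace(".", MASK), text)

    # Step 2: temporarily hide the full stops in all-uppercase acronyms
    # such as W.H.O. (two or more "letter + full stop" pairs in a row).
    text = re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group(0).replace(".", MASK),
        text,
    )

    # Step 3: split only where a full stop is followed by whitespace and
    # then an uppercase letter. Decimals (483.97) and the dots inside an
    # email address are not followed by whitespace, so they are left alone.
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", text)

    # Step 4: restore the hidden full stops.
    return [p.replace(MASK, ".").strip() for p in parts if p.strip()]


sample = ("Mr. Smith of the W.H.O. paid Rs. 483.97 today. "
          "He wrote to the office. It replied.")
for sentence in split_sentences(sample):
    print(sentence)
```

The trade-off is the one described above: because every full stop in an acronym or title is hidden, a sentence that really ends with "W.H.O." or "Mr." will not be split there. Handling that case needs an abbreviation list plus a manual check.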


 
Meta Arkadia
Local time: 06:46
English to Indonesian
LibreOffice Oct 25, 2015

Kevin is right, of course. Just about every CAT tool can do it, you can use Okapi for the purpose, but I think the easiest solution is a LibreOffice extension:



LibreOffice is free (beer and err, libre), and it's as close as you can get to a Word file.

Cheers,

Hans


 
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
TOPIC STARTER
Thanks for your quick reply Oct 25, 2015

Kevin Dias wrote:


I did a comparison of all of the popular sentence segmentation libraries based on a set of Golden Rules (common edge case scenarios) when I developed TM-Town's open source segmentation library Pragmatic Segmenter.


Kevin


Thank you for your informative reply. Is there any software that takes care of these issues? If so, please share a link to it.

Thanks and regards,

Chopra


 
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
TOPIC STARTER
Thanks for suggesting the solutions Oct 25, 2015

2nl wrote:

You can exclude cases 3 and 4 with a wildcard search that requires the full stop to be followed by zero or more spaces and then a new line/paragraph mark or an uppercase letter.

The first two full stops in case 1 are covered by the same expression. To avoid splitting after the last full stop, you have to use an acronym or abbreviation list, which will require a macro. Or, since all the letters are uppercase, define a wildcard search to hide the full stops temporarily.

Case 2 can also be handled by means of a macro. Or you can temporarily hide the full stop in these titles. Since the number of titles is limited, you could use a wildcard search and even record it as a macro.


Thanks so much for your valuable suggestions, but one needs technical knowledge to apply them. Is there a link where the procedures for wildcard searches, making macros, etc. are explained in detail?

Thanks and regards,

Chopra





[Edited at 2015-10-25 04:05 GMT]


 
2nl (X)
Netherlands
Local time: 01:46
Use a Word-based CAT tool to pre-segment Oct 25, 2015

My advice would be to use a CAT tool like Wordfast or MetaTexis to divide the document into segments. Demo versions will probably do the job.

May I ask: How are you translating your document?


 
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
TOPIC STARTER
I have the licensed (full) version of Wordfast Pro Oct 25, 2015

2nl wrote:

My advice would be to use a CAT tool like Wordfast or Metatexis to divide the document into segments. Demo versions will probably do the job.

May I ask: How are you translating your document?


I have the latest version of WF Pro (3.4.5), but even it is not quite capable of solving this problem. Sentences are broken unnecessarily, which not only causes inconvenience but also spoils the translation memory, because a different translation is entered in the broken segments (remember: the syntax of English and Hindi is different).

I use this software (WF Pro) to translate documents. In addition, I have Trados Freelance 7.0, which is also unable to help in this respect.

Thanks and regards,

Chopra


 
Tom in London
United Kingdom
Local time: 00:46
Member (2008)
Italian to English
Only a paragraph Oct 25, 2015

chopra_2002 wrote:

Hello experts,

I sometimes need to split a paragraph into sentences. The standard method is to go to the end of each sentence, i.e. the full stop, and hit Enter, but that is time-consuming. Another method is to use the Find and Replace function: put (.) in the Find box and (.^p) in the Replace box and hit Replace All (or enter a period in the Replace With field and then select the “Manual Line Break” option twice from the Special drop-down menu). The actual problem is that this breaks sentences unnecessarily, because the full stop (.) is also used for other purposes in English. For example:

1. abbreviations (W.H.O.)
2. honorific titles (Mr., Mrs.)
3. decimal in amounts (Rs. 483.97)
4. email addresses ([email protected])

The procedure breaks the sentences in all such situations. Is there any method or trick to avoid this?

Thanks and regards,

Chopra





[Edited at 2015-10-25 03:24 GMT]


It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem.


 
Heinrich Pesch
Finland
Local time: 02:46
Member (2003)
Finnish to German
WFP segmentation settings Oct 25, 2015

Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item.
The same is true for WFC.
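
To illustrate why such an abbreviation list helps, here is a rough Python sketch of list-based segmentation. It is not how Wordfast works internally; the comma-separated format of the setting and the names `parse_abbreviations` and `segment` are assumptions made for the example.

```python
from typing import List, Set


def parse_abbreviations(setting: str) -> Set[str]:
    """Parse a comma-separated abbreviation list (assumed format).

    A trailing comma after the last item is tolerated, because empty
    items are skipped.
    """
    return {item.strip().lower() for item in setting.split(",") if item.strip()}


def segment(text: str, abbreviations: Set[str]) -> List[str]:
    """End a segment at every word that ends in a full stop,
    unless that word is in the abbreviation list."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that has no final full stop
        sentences.append(" ".join(current))
    return sentences


abbreviations = parse_abbreviations("Mr., Mrs., Rs., W.H.O.,")
text = "Mrs. Rao paid Rs. 483.97 to the W.H.O. office. She left."
for sentence in segment(text, abbreviations):
    print(sentence)
```

Amounts such as 483.97 are never split because the word does not end in a full stop. The sketch collapses runs of whitespace and ignores closing quotes or brackets after a full stop, so treat it as a demonstration of the principle only.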


 
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
TOPIC STARTER
It is not about just one paragraph Oct 25, 2015

Tom in London wrote:


It would be different for a long document but since you're only talking about a paragraph, why can't you just do a find/replace on a one-by-one basis? I don't see the problem.


Thanks for your reply. I am not talking about just one paragraph because I get small as well as big projects for translation. If a procedure can be applied to one paragraph, the same can be applied to a text consisting of thousands of words.

Thanks and regards,

Chopra


 
Meta Arkadia
Local time: 06:46
English to Indonesian
This... Oct 25, 2015

Heinrich Pesch wrote:
Go to Edit - Preferences and select Segmentation settings.


...makes sense. A lot of it. If you translate from Hindi to English, paragraph segmentation makes sense. However, I think the only thing the OP has to do when he switches to English to Hindi is change the segmentation settings from paragraph segmentation to sentence segmentation.

Cheers,

Hans


 
Rajan Chopra
India
Local time: 05:16
Member (2008)
English to Hindi
TOPIC STARTER
Thanks for the suggestion Oct 25, 2015

Heinrich Pesch wrote:

Go to Edit - Preferences and select Segmentation settings. Further down the page you will find a list of abbreviations for different languages. You can add any cases you think are missing; don't forget to put a comma after the last item.
The same is true for WFC.


Yes, it can solve the problem for abbreviations, but one has to add new abbreviations manually. However, it is still useful because it will also benefit future projects in which the same abbreviations appear.

Thanks and regards,

Chopra


 
Rolf Keller
Germany
Local time: 01:46
English to German
How to write macros --> Google Oct 26, 2015

chopra_2002 wrote:

Is there any link in which the procedures for wildcard search, making macros etc. have been explained in detail?


There are hundreds or thousands of explanations & How-to's, at least one for any level of knowledge & grasp of things.

Just throw something like this to Google: macro tutorial "word 2010"


 

