Is there any method or software to extract same and similar sentences (segments) from a file?
Thread poster: chopra_2002

chopra_2002  Identity Verified
India
Local time: 15:22
Member (2008)
English to Hindi
+ ...
Jul 14, 2018

Hi experts,

I want to know whether there is any method or software to extract identical and similar sentences (segments) from a file.

Wordfast Pro keeps one copy of each set of identical segments and removes the other copies in a newly generated file. But I could not find software or a method to extract similar segments by entering a desired percentage (e.g. removing segments that are 85% similar). I am looking for software or a procedure with which one can remove the similar sentences.

Does any software or method exist for doing so? I would appreciate it if you could guide me in this respect.

Thanks and regards,

Chopra


 

Samuel Murray  Identity Verified
Netherlands
Local time: 10:52
Member (2006)
English to Afrikaans
+ ...
Only in very cumbersome ways Jul 14, 2018

chopra_2002 wrote:
I want to know whether there is any method or software to extract identical and similar sentences (segments) from a file.


I know of no automated method, but this is truly something that I would have expected CAT tools to have, i.e. to analyze a file against a TM and then export a TM that contains all matches from that TM up to a certain percentage. Some tools do have a similar feature, but it only exports the single highest match for any segment.

To do what you want, then, you would create a dummy TM that contains the source text as both source and target text, and then analyze that same file against that TM. You'll get all 100% matches, of course, but a CAT tool should also show you non-100% matches. Now all you need is a way to export all those matches (100% and fuzzy) to a separate TM.
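The dummy-TM idea above amounts to comparing every segment of the file with every other segment and keeping all hits above a threshold. Here is a minimal Python sketch of that idea. It is not how any CAT tool scores matches: `difflib.SequenceMatcher` is a stand-in similarity measure, and the names `similarity` and `self_matches` as well as the 75% threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (difflib's ratio, not a CAT tool's formula)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def self_matches(segments, threshold=75.0):
    """Map each segment index to [(score, other_index), ...] for all hits >= threshold."""
    matches = {}
    for i, seg in enumerate(segments):
        hits = [
            (round(similarity(seg, other)), j)
            for j, other in enumerate(segments)
            if j != i and similarity(seg, other) >= threshold
        ]
        if hits:
            matches[i] = sorted(hits, reverse=True)
    return matches


segments = [
    "Press the red button to start the machine.",
    "Press the green button to start the machine.",
    "Press the red button to start the machine.",
    "Store the device in a dry place.",
]

# Print each segment that has matches, followed by its exact and fuzzy hits.
for i, hits in self_matches(segments).items():
    print(segments[i])
    for score, j in hits:
        print(f"    {score}%  {segments[j]}")
```

Every pair is compared, so the run time grows with the square of the number of segments; that is probably acceptable for a single file but not for a large TM.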

FWIW... from a theoretical point of view:

I have a way to do it in WFP 3, but it is cumbersome and time-consuming. I use an AutoIt script that allows me to visit each segment and then automatically copy a certain number of matches from the match pane. I use it to extract segments from a TM server, but you can use it to extract segments from a dummy TM created from the current file. I offer no support: wftmserver_extract.zip.

I've just discovered that doing something like this would be quite simple in OmegaT (again, using a script, unfortunately, which I don't have time to write right now). You can customise the way fuzzy matches are displayed in the match pane, and if you put the cursor in the fuzzy match pane, you can copy all text from it (i.e. all matches at once). When you press Ctrl+U (go to next segment), the cursor remains in the fuzzy match pane, so you can then easily copy the next set of matches without having to worry about automating the clicking between panes.


I think I recall that an earlier version of WFC had the ability to create something that was then known as a "project TM", which was something like what you're looking for (except that the extraction would be in TM format), but I could not find any mention of it now.


[Edited at 2018-07-14 08:41 GMT]



Philippe Etienne  Identity Verified
Spain
Local time: 10:52
Member
English to French
Back to the past Jul 14, 2018

chopra_2002 wrote:
...But I could not find software or a method to extract similar segments by entering a desired percentage (e.g. removing segments that are 85% similar).

Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files.
It could also generate a file with all the segments below a user-defined ceiling (like anything below an 85% match): the "unknown segments". But I understand you want xx% matches only, so it would be what's "left out".

MemoQ 2013 R2
Using views, it can extract all repetitions in a set of files.
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files, then create a view containing only the lines pretranslated up to that specific match value (or its complement), then export the view to a bilingual Word file/memoQ file/bilingual old-Trados doc. file.

Surely all modern CAT Tools also have a workaround for this.

Philippe

[Edited at 2018-07-14 12:50 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 10:52
Member (2006)
English to Afrikaans
+ ...
@Philippe (and myself) Jul 14, 2018

Philippe Etienne wrote:
Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files. ... It could also generate a file with all the segments below a user-defined ceiling (like anything below an 85% match): the "unknown segments". But I understand you want xx% matches only, so it would be what's "left out".


In addition, this feature works only if you have an existing TM, and it extracts only segments that do not match that TM above the threshold. This means that if you try this with an empty TM, it will extract all segments, and if you try this with an exact-match TM, it will extract no segments.
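The two edge cases described above can be reproduced with a small Python sketch. It assumes a simplified model in which a segment is "unknown" when its best score against the TM falls below the ceiling; `best_match`, `unknown_segments` and the use of `difflib.SequenceMatcher` as the scoring function are illustrative assumptions, not Trados behaviour.

```python
from difflib import SequenceMatcher


def best_match(segment, tm):
    """Highest 0-100 score of a segment against any TM entry (0 for an empty TM)."""
    return max(
        (100.0 * SequenceMatcher(None, segment, entry).ratio() for entry in tm),
        default=0.0,
    )


def unknown_segments(segments, tm, ceiling=85.0):
    """Return the segments whose best TM match is below the ceiling."""
    return [seg for seg in segments if best_match(seg, tm) < ceiling]


segments = ["Open the file.", "Close the file.", "Save the file as a copy."]

print(unknown_segments(segments, tm=[]))        # empty TM: every segment is "unknown"
print(unknown_segments(segments, tm=segments))  # exact-match TM: nothing is "unknown"
```

In neither case does the result say anything about how similar the segments are to each other, which is why this kind of export does not answer the original question.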

I just checked, and Trados pre-2009 does have the "Create Project TM" feature that I had remembered, but it is/was only available in the professional (i.e. agency) version, and I can't tell from the user guide whether this project TM would have had only one match per segment (in which case it would be useless for the OP's purposes) or multiple matches.

MemoQ 2013 R2
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files...


Yes, but again this only works if you're analysing your file against a TM, and I suspect the same will apply: if the TM is empty, all segments will be extracted; if the TM is an exact-match TM, no segments will be extracted.

According to this help page, MemoQ 2014 could also create a "project TM" in the same way as Trados pre-2009, but in the case of MemoQ it claims that the TM would contain all matches, not just the highest matches, which means that creating a project TM against an exact-match TM would result in something similar to what the OP is looking for. This feature is also available in MemoQ 2015 (which I have). Unfortunately the exported TM does not contain any indication of what the match percentages were, so the resulting TM will contain all segments and there would be no way to find and delete segments that fell below a certain threshold.
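The missing piece in such an export is the score itself. As a sketch of what a more useful export could look like, the following Python snippet writes every pair of segments together with its score to tab-separated text, so that pairs below a threshold can be filtered out afterwards. The column names, the `export_tsv` helper and the `difflib` scoring are all assumptions for illustration; no CAT tool is known to produce this format.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import combinations


def scored_pairs(segments):
    """Yield (score, a, b) for every unordered pair of segments."""
    for a, b in combinations(segments, 2):
        yield round(100 * SequenceMatcher(None, a, b).ratio()), a, b


def export_tsv(segments, min_score=0):
    """Return tab-separated text with one scored pair per line, best matches first."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["score", "segment_a", "segment_b"])
    for row in sorted(scored_pairs(segments), reverse=True):
        if row[0] >= min_score:
            writer.writerow(row)
    return out.getvalue()


print(export_tsv(["Open the file.", "Open the files.", "Quit."], min_score=50))
```

Because the score is a column of its own, the file can be sorted or filtered in a spreadsheet before anything is imported into a TM.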

OmegaT
I just found that OmegaT does have a "Create Project TM" feature that exports all segments above a certain threshold (not just the highest match), although this doesn't help the OP because it also exports exact matches and it does not indicate the match percentage in the export TM.
https://gist.github.com/yu-tang/6526991



[Edited at 2018-07-15 16:24 GMT]



Olaf Schutze  Identity Verified
Vietnam
English to German
+ ...
How will you define your percentage "matches"? Feb 12

There are almost always spaces, symbols, formatting characters, typos and so on in every document. You would have to assign a value to each character used (or possible) and then do the maths.
All this is done by the TM tools (e.g. MemoQ) to produce some sort of statistics, and those are roughly the best results you can get automatically. MemoQ even uses two TMs for creating statistics, as they can differ a lot.
Those differences can vary again between language pairs: fortnight = fourteen (days) = 14 (days)?
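To make the point concrete, here is a small Python sketch of one possible definition of a "match percentage": normalise away case, punctuation, symbols and extra spaces, then compare word by word. The choices made in `normalize` and the use of `difflib.SequenceMatcher` are assumptions for illustration; every CAT tool uses its own rules and weights.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text, turn punctuation and symbols into spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    return " ".join(text.split())


def word_similarity(a, b):
    """Score two sentences from 0 to 100 by comparing their normalised word lists."""
    words_a, words_b = normalize(a).split(), normalize(b).split()
    return 100.0 * SequenceMatcher(None, words_a, words_b).ratio()


# Differences in spacing, case and punctuation no longer count against the match.
print(word_similarity("The  file, please!", "the file please"))

# Same meaning, different words: a surface comparison cannot see that.
print(word_similarity("Due in a fortnight", "Due in 14 days"))
```

The second pair shares only two of its four words, so it scores far below a typical fuzzy threshold even though a translator would treat the two sentences as equivalent.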

Here is some Linux bash code as a rough way to approach this:
1) Copy all the textual content of your document.
2) Paste it into a plain text file and save the file as "in.txt".
3) Open a terminal in the working directory and paste the following few lines:

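# Take every word in in.txt, most frequent first (\s below assumes GNU grep).
tr -s '[:space:]' '\n' < in.txt | sort | uniq -c | sort -rn | awk 'NF > 1 { print $2 }' |
while read -r x; do
  # Grab each stretch of text (containing no full stop) that ends with this word.
  grep -oh -- "[^.]*\s${x}\s" in.txt >> out.tmp
done
sort out.tmp > out.txt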


4) The resulting strings look like:

"The
The resulting
The resulting strings
The resulting strings look
The resulting strings look like" and the after the matches, one still has to do

* All single occurrences BEFORE any REPEATING match are gone. The LONGEST lines among similar ones are the most valuable strings/segments.

5) Save/copy/import out.txt into your TM project.
This is usually my FIRST file in a project, so that the primary project TM can start building up. The usual process follows: the TM 'learns' and consumes a bit of the computer's processing power/time.

6) In your terminal, run the following lines:

rm out.tmp in.txt
exit

7) Again, it's a messy approach that I use to identify fixed terms more easily and for quoting.

Have fun ;)

[Edited at 2019-02-13 04:51 GMT]

[Edited at 2019-02-13 04:52 GMT]

[Edited at 2019-02-13 04:55 GMT]

[Edited at 2019-02-13 04:57 GMT]


 

Samuel Murray  Identity Verified
Netherlands
Local time: 10:52
Member (2006)
English to Afrikaans
+ ...
Old thread Feb 13

chopra_2002 wrote:
I want to know whether there is any method or software to extract identical and similar sentences (segments) from a file.
...
I am looking for software or a procedure with which one can remove the similar sentences.


Olaf's response to this old thread made me consider the possibility that I may have misinterpreted Chopra's original message. I had assumed that Chopra meant "retain in a separate file" when he said "extract", but upon rereading his request, it occurred to me that he may have meant "remove" when he said "extract". What do you all think?



chopra_2002  Identity Verified
India
Local time: 15:22
Member (2008)
English to Hindi
+ ...
TOPIC STARTER
You are right Feb 13

I made a mistake. I should have written "remove" instead of "extract". Suppose a file has 40 sentences, of which 8 are similar to one another. A CAT tool can remove identical sentences, but it can't remove similar ones. So, in the example above, it would be great if I could find a feature that keeps one of those 8 sentences and removes the other 7 in the newly created file. Later, the TM entry for that one sentence may help in translating the remaining 7 sentences.

Is it possible by some software or application?

Regards,

Chopra






[Edited at 2019-02-13 07:13 GMT]


 

Olaf Schutze  Identity Verified
Vietnam
English to German
+ ...
It's a cumbersome task Feb 13

especially for people who depend fully on ready-made lines.
Yes, it's also possible to extract ONLY the plain longest matches. Doing half-page sentences is very painful for me, whilst virtually any TM adds the next word for free, so the TM grows naturally without my doing a thing. But adding another 10-20 words when one only has a 5% match, that is the point of growing.
I have been using MemoQ 2015 for about a year or so; before that I was still on 4.x something.
As I work in IT, apart from translation with MemoQ I use nothing but Linux.
I actually did run those lines on some 45,000 segments as well, and that was prepared within a few seconds.
My TMs hold about 5.5 million strings, plus those connected/exchanged online. That sounds like a lot, but that is the way I work: lazily mastering the TMs to have my work ready in a timely manner. And yes, they are still growing; there is barely a string with fewer than ten choices, and even less need to add new words, only to improve phrasing.


 

Olaf Schutze  Identity Verified
Vietnam
English to German
+ ...
.... Feb 13

chopra_2002 wrote:

I made a mistake. I should have written "remove" instead of "extract". Suppose a file has 40 sentences, of which 8 are similar to one another. A CAT tool can remove identical sentences, but it can't remove similar ones. So, in the example above, it would be great if I could find a feature that keeps one of those 8 sentences and removes the other 7 in the newly created file. Later, the TM entry for that one sentence may help in translating the remaining 7 sentences.

Is it possible by some software or application?

Regards,

Chopra






I would keep them all (and do), because only a small bit at the end changes or is missing. What separates them from 100% matches is often the overhang at the end, which is too much or wrong.


 

Olaf Schutze  Identity Verified
Vietnam
English to German
+ ...
"Yes, but again this only works if you're analysing your file against a TM, and I suspect the same " Feb 13

Samuel Murray wrote:

Yes, but again this only works if you're analysing your file against a TM, and I suspect the same will apply: if the TM is empty, all segments will be extracted; if the TM is an exact-match TM, no segments will be extracted.




Usually, you should be able to connect more than one TM to a translation/project. I always start off with an empty one, whilst older/bigger/merged ones provide what the empty one doesn't have. I always have about 10 in use.
You should number/version your TMs and set the older ones as secondary/minor ...
In my oldest ones, I now often see things that make me think "oooooh my goodness". If that happens too often, I just let a whole TM (the lowest-numbered one) die, regardless of its matches, because when working as I do, there is quickly a huge new source to replace it.

[Edited at 2019-02-13 11:50 GMT]


 

Samuel Murray  Identity Verified
Netherlands
Local time: 10:52
Member (2006)
English to Afrikaans
+ ...
I know of no such tool Feb 13

chopra_2002 wrote:
Suppose a file has 40 sentences, of which 8 are similar to one another. It would be great if I could find a feature that keeps one of those 8 sentences and removes the other 7 in the newly created file.


I know of no tool that can do this automatically, no. I can see the usefulness, though.
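For plain text files, at least, the behaviour Chopra describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one segment per list item, `difflib.SequenceMatcher` as the similarity measure (not a CAT tool's match formula), an 85% threshold, and a hypothetical helper named `drop_similar`. The first segment of each group of similar segments is kept and the later ones are dropped.

```python
from difflib import SequenceMatcher


def drop_similar(segments, threshold=85.0):
    """Keep the first segment of each similar group; return (kept, removed)."""
    kept, removed = [], []
    for seg in segments:
        for rep in kept:
            score = 100.0 * SequenceMatcher(None, seg, rep).ratio()
            if score >= threshold:
                # Close enough to a segment we already kept: drop it.
                removed.append((seg, rep))
                break
        else:
            # No kept segment is similar enough, so this one starts a new group.
            kept.append(seg)
    return kept, removed


segments = [
    "Press the red button to start the machine.",
    "Press the green button to start the machine.",
    "Store the device in a dry place.",
    "Press the red button to start the machine.",
]

kept, removed = drop_similar(segments)
print("\n".join(kept))
```

The `kept` list is what would go into the newly created file, and `removed` records which kept segment each dropped segment resembled. The result depends on the order of the segments, and every segment is compared with all kept ones, so this suits single files better than large collections.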


 

