Is there any method or software to extract same and similar sentences (segments) from a file?
Thread poster: chopra_2002

chopra_2002  Identity Verified
India
Local time: 23:44
Member (2008)
English to Hindi
Jul 14

Hi experts,

I would like to know whether there is any method or software for extracting identical and similar sentences (segments) from a file.

Wordfast Pro can keep one copy of each set of identical segments and remove the rest in a newly generated file. However, I could not find any software or method that extracts similar segments based on a chosen percentage (e.g. removing the segments that are 85% similar). I am looking for software or a procedure for removing such similar sentences.

Does any such software or method exist at all? I would appreciate any guidance on this.

Thanks and regards,

Chopra


 

Samuel Murray  Identity Verified
Netherlands
Local time: 20:14
Member (2006)
English to Afrikaans
Only in very cumbersome ways Jul 14

chopra_2002 wrote:
I want to know is there any method or a software to extract same and similar sentences (segments) from a file?


I know of no automated method, but this is truly something that I would have expected CAT tools to have, i.e. to analyze a file against a TM and then export a TM that contains all matches from that TM down to a certain match percentage. Some tools do have a similar feature, but it only exports the single highest match for any segment.

To do what you want, you would create a dummy TM that contains the source text as both source and target text, and then analyze that same file against that TM. You'll get all 100% matches, of course, but a CAT tool should also show you non-100% matches. Now all you need is a way to export all those matches (100% and fuzzy) to a separate TM.
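
For anyone comfortable with a little scripting, the idea of matching a file against itself can be approximated outside a CAT tool. What follows is only a minimal Python sketch, assuming the segments have already been extracted into a list of strings. It uses the standard-library difflib.SequenceMatcher, whose ratio is not the same as any CAT tool's fuzzy-match percentage, and the function names and the 85 threshold are purely illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two segments (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_pairs(segments, threshold=85.0):
    """Yield (score, i, j) for every pair of segments scoring at or above threshold.

    i and j are positions in the input list, so identical segments that
    occur more than once are reported as 100% pairs.
    """
    for (i, a), (j, b) in combinations(enumerate(segments), 2):
        score = similarity(a, b)
        if score >= threshold:
            yield round(score, 1), i, j


if __name__ == "__main__":
    # Hypothetical sample segments, only for demonstration.
    sample = [
        "Press the red button to start.",
        "Press the green button to start.",
        "Close the lid.",
        "Press the red button to start.",
    ]
    for score, i, j in similar_pairs(sample, threshold=85.0):
        print(f"{score}%  [{i}] {sample[i]!r}  ~  [{j}] {sample[j]!r}")
```

Because it compares every segment with every other segment, this approach gets slow for files with many thousands of segments.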

FWIW... from a theoretical point of view:

I have a way to do it in WFP 3, but it is cumbersome and time-consuming. I use an AutoIt script that allows me to visit each segment and then automatically copy a certain number of matches from the match pane. I use it to extract segments from a TM server, but you can use it to extract segments from a dummy TM created from the current file. I offer no support: wftmserver_extract.zip.

I've just discovered that doing something like this would be quite simple in OmegaT (again using a script, unfortunately, which I don't have time to write right now). You can customise the way fuzzy matches are displayed in the match pane, and if you put the cursor in the fuzzy match pane, you can copy all text from it (i.e. all matches at once). When you press Ctrl+U (go to next segment), the cursor remains in the fuzzy match pane, so you can easily copy the next set of matches without having to automate the clicking between panes.


I think I recall that an earlier version of WFC had the ability to create something that was then known as a "project TM", which was something like what you're looking for (except that the extraction would be in TM format), but I could not find any mention of it now.


[Edited at 2018-07-14 08:41 GMT]


 

Philippe Etienne  Identity Verified
Spain
Local time: 20:14
Member
English to French
Back to the past Jul 14

chopra_2002 wrote:
...But I could not find a software or a method to extract the similar segments by entering the desired percentage (e.g. removing the segments which are 85% similar).

Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files.
It could also generate a file with all the segments below a user-defined ceiling (e.g. anything below an 85% match): the "unknown segments". But I understand you want only the xx% matches, so it would be what's "left out".

MemoQ 2013 R2
Using views, it can extract all repetitions in a set of files.
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files, then create a view containing only the lines pretranslated up to that specific match value (or its complement), then export the view to a bilingual Word file/memoQ file/bilingual old-Trados doc. file.

Surely all modern CAT Tools also have a workaround for this.

Philippe

[Edited at 2018-07-14 12:50 GMT]


 

Samuel Murray  Identity Verified
Netherlands
Local time: 20:14
Member (2006)
English to Afrikaans
@Philippe (and myself) Jul 14

Philippe Etienne wrote:
Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files. ... It could also generate a file with all the segments below a user-defined ceiling (like anything below a 85% match): the "unknown segments". But I understand you want xx%matches only, so it would be what's "left out".


In addition, this feature works only if you have an existing TM, and it extracts only segments that do not match that TM above the threshold. This means that if you try this with an empty TM, it will extract all segments, and if you try this with an exact-match TM, it will extract no segments.
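
The two edge cases are easy to see in a small Python sketch. This is not how Trados implements the feature. The scoring uses difflib from the standard library as a stand-in for a real fuzzy-match algorithm, and the names best_match and unknown_segments are made up for the example.

```python
from difflib import SequenceMatcher


def best_match(segment, tm_sources):
    """Return the highest 0-100 score of segment against any TM source.

    An empty TM gives 0, because there is nothing to match against.
    """
    return max(
        (100.0 * SequenceMatcher(None, segment, src).ratio() for src in tm_sources),
        default=0.0,
    )


def unknown_segments(segments, tm_sources, ceiling=85.0):
    """Keep only the segments whose best TM match is below the ceiling."""
    return [s for s in segments if best_match(s, tm_sources) < ceiling]


if __name__ == "__main__":
    # Hypothetical sample data, only for demonstration.
    segs = ["Open the door.", "Close the window."]
    print(unknown_segments(segs, []))    # empty TM: every segment is "unknown"
    print(unknown_segments(segs, segs))  # exact-match TM: nothing is "unknown"
```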

I just checked, and Trados pre-2009 does have the "Create Project TM" feature that I had remembered, but it is/was only available in the professional (i.e. agency) version, and I can't tell from the user guide whether this project TM would have had only one match per segment (in which case it would be useless for the OP's purposes) or multiple matches.

MemoQ 2013 R2
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files...


Yes, but again this only works if you're analysing your file against a TM, and I suspect the same will apply: if the TM is empty, all segments will be extracted; if the TM is an exact-match TM, no segments will be extracted.

According to this help page, MemoQ 2014 could also create a "project TM" in the same way as Trados pre-2009, but in the case of MemoQ it claims that the TM would contain all matches, not just the highest matches, which means that creating a project TM against an exact-match TM would result in something similar to what the OP is looking for. This feature is also available in MemoQ 2015 (which I have). Unfortunately the exported TM does not contain any indication of what the match percentages were, so the resulting TM will contain all segments and there would be no way to find and delete segments that fell below a certain threshold.
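
If the match percentages matter, one workaround outside the CAT tool would be to compute and store a score alongside each match, so that a threshold can be applied afterwards. Here is a rough Python sketch, assuming plain lists of segments and TM source strings and using difflib from the standard library as a stand-in scorer. The function names and the column names are hypothetical, and the scores will not agree with MemoQ's own percentages.

```python
import csv
import io
from difflib import SequenceMatcher


def matches_with_scores(segments, tm_sources, floor=0.0):
    """Return (score, segment, tm_source) rows for all matches at or above floor."""
    rows = []
    for seg in segments:
        for src in tm_sources:
            score = round(100.0 * SequenceMatcher(None, seg, src).ratio(), 1)
            if score >= floor:
                rows.append((score, seg, src))
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["score", "segment", "tm_source"])
    writer.writerows(rows)
    return buf.getvalue()
```

With the score kept in its own column, the result can be sorted or filtered in a spreadsheet to drop everything below a chosen threshold.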

OmegaT
I just found that OmegaT does have a "Create Project TM" feature that exports all segments above a certain threshold (not just the highest match), although this doesn't help the OP because it also exports exact matches and it does not indicate the match percentage in the export TM.
https://gist.github.com/yu-tang/6526991



[Edited at 2018-07-15 16:24 GMT]


 

