Automatic process to add tab/paragraph marker after/before english or asian (chinese) words? Thread poster: 855649 (X)
855649 (X) Local time: 04:03 English to Chinese + ...
Stupid question, I'm pretty sure. I wouldn't have a clue how to do this personally, but I thought maybe some of the pros here would, IF it's actually possible. I want to turn some glossaries I found on the internet into a Trados termbase. In order to do that, I gotta turn...

Thermal efficiency 热效率
Thermal equivalent of work 热功当量
Thermal expansion 热膨胀

...into...

Thermal efficiency --> 热效率
Thermal equivalent of work --> 热功当量
Thermal expansion --> 热膨胀

With tabs (-->) in between (to put in Excel columns). The only way I can see a "replace all" working, or some sort of macro/template, is if it can somehow detect the space before Asian characters (or even English would be OK) and then, like replace, insert a tab in its place. When you have a list of 1000 words, using the mouse to click the space between each one and then hitting tab becomes really tedious! Does anyone have any suggestions? Thanks so much for any help.
[Edited at 2008-08-02 13:05]
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER A little trick... | Aug 2, 2008 |
Yes, there is a little trick that allows that, but only if the language attribute is different for both entries... In Word open the "Find and replace" dialog, in the "Find" field select language "English" (from the drop-down button below). In the "Replace" field enter ^&^t (caret followed by ampersand, followed by caret followed by a letter "t"). Or you can select "Found text" and "Tab" from the "Special" menu. If the entries are all-text, without the language attribute (as you say you copied them from the Internet), it would be harder... Basically, you need a text editor that works with regular expressions, and you have to match the character codes of the Asian text. However, this depends on the encoding of the text, so I cannot give you precise instructions.
It can be done with regular expressions | Aug 2, 2008 |
Reading the last suggestion on regular expressions, I realized that it can be done without knowledge of the encoding scheme. The idea is to insert a separator character before the first occurrence of something that is NOT an ASCII-symbol and space (and maybe digits and punctuation marks as well). This can be done with 'find and replace' based on regular expressions, depending on the versatility of the regex engine in the text editor. Or it can be done with a relatively simple (albeit slow) macro in MS Word or another programmable text editor that looks sequentially at all the characters in the line and inserts a separator before the first non-Latin character. By the way, for a list of 1000 entries it would be faster to do it manually than to write a macro: use the text editor's find and replace function to search for spaces and replace them with tabs at the appropriate positions. One just needs to be careful with answering 'Yes' or 'No'. The time it takes to write and test a macro would pay off only if the macro is going to be used repeatedly.
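For anyone who would rather script it than click through a find-and-replace dialog, here is a minimal Python sketch of the idea above. It assumes one entry per line, with the English part in plain ASCII and the Chinese part starting at the first non-ASCII character. The function name `split_at_first_non_ascii` is made up for illustration.

```python
import re

# One or more spaces/tabs that are immediately followed by a non-ASCII character.
_BOUNDARY = re.compile(r"[ \t]+(?=[^\x00-\x7f])")


def split_at_first_non_ascii(line: str) -> str:
    """Replace the whitespace before the first non-ASCII character with a tab.

    Only the first such boundary is replaced (count=1), so any spaces
    further along in the line are left alone.
    """
    return _BOUNDARY.sub("\t", line, count=1)


if __name__ == "__main__":
    for entry in ["Thermal efficiency 热效率", "Thermal equivalent of work 热功当量"]:
        print(split_at_first_non_ascii(entry))
```

Lines with no non-ASCII text come back unchanged, so running it over a mixed file should be harmless.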
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER Use the spaces to your advantage... | Aug 2, 2008 |
bsb's answer gave me another idea: replace all the spaces between regular ASCII characters with a special sequence, then replace all the remaining spaces with tabs, then replace the special sequence with spaces again. I.e. in Word:

1. Search (with regular expressions): ([a-z]) ([a-z]) Replace: \1###\2
2. Search: " " (space) Replace: ^t
3. Search: ### Replace: " " (space)

Of course, this assumes that in the English phrases there are no special symbols (like apostrophes, quotes, etc.) at the word boundary. If there are, you have to include them in the first search expression (e.g. [a-z@#$%^&*]).
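The same three steps can be sketched outside Word too. The following is a rough Python version, not the Word procedure itself: it uses lookarounds instead of the \1 and \2 groups, and it assumes the placeholder ### never occurs in the glossary. The name `spaces_to_tab` is hypothetical.

```python
import re

PLACEHOLDER = "###"  # assumption: this sequence does not appear in the source text


def spaces_to_tab(line: str) -> str:
    """Turn the space between the English and Chinese parts into a tab."""
    # Step 1: protect every space that sits between two ASCII letters.
    # Lookbehind/lookahead leave the letters themselves untouched.
    protected = re.sub(r"(?<=[A-Za-z]) (?=[A-Za-z])", PLACEHOLDER, line)
    # Step 2: any space still left is a boundary space, so make it a tab.
    tabbed = protected.replace(" ", "\t")
    # Step 3: restore the protected spaces inside the English phrase.
    return tabbed.replace(PLACEHOLDER, " ")


if __name__ == "__main__":
    print(spaces_to_tab("Thermal equivalent of work 热功当量"))
```

As with the Word version, a symbol at a word boundary (an apostrophe, a bracket) would make the neighbouring space look like a boundary space, so the letter class would need extending for such entries.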
855649 (X) Local time: 04:03 English to Chinese + ... TOPIC STARTER Got it working for the most part, minus a few errors | Aug 3, 2008 |
Amazing suggestions. Thanks bsb_2, though I'm not quite sure what you mean by "The idea is to insert a separator character before the first occurrence of something that is NOT an ASCII-symbol and space". I don't know how to specify non-ASCII symbols. Can MS Word do it? Jabberwock, thanks a ton as well, those steps 1, 2, and 3 helped me a ton. Although the problem, like you said, is when there's a lot of () or [], and A / C for example. It's not picking up uppercase letters (a simple way around that is to convert all uppercase to lowercase, though), but the main problems are the () and [] and / . I tried using the [a-z@#$%^&*] code, but one, those characters aren't included in it, and two, it gives me an error saying "^& is not a valid character". However, for anything that doesn't contain those special characters and consists of basically just pure English text (in the English translation of course), it's a success, thanks so much!
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER
The code I gave in the comment was simply an example... To match round brackets and / (and uppercase letters) use: [A-Za-z/)(] The square brackets cannot be matched (and I think they cannot be escaped?), but there is an easy workaround: before the described operation just find and replace them with a placeholder (that is, with a character, or a sequence of characters, which does not appear in the original text). For example, replace [ with % and ] with $ (provided that % and $ do not appear in the text).
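Outside Word the bracket problem is easier to sidestep, because most regex engines let you escape brackets inside a character class. A small Python sketch of the protect-then-tab idea with extra symbols allowed in the English part follows; the function name `protect_and_tab` and its `extra_chars` parameter are invented for this example, and ### is again assumed to be absent from the text.

```python
import re

PLACEHOLDER = "###"  # assumption: not present in the glossary


def protect_and_tab(line: str, extra_chars: str = "/()[]") -> str:
    """Tab-separate English and Chinese, allowing extra symbols in the English part."""
    # re.escape makes brackets and other special characters safe inside [...].
    char_class = "[A-Za-z" + re.escape(extra_chars) + "]"
    inner_space = re.compile(rf"(?<={char_class}) (?={char_class})")
    # Protect spaces inside the English phrase, tab the rest, then restore.
    protected = inner_space.sub(PLACEHOLDER, line)
    return protected.replace(" ", "\t").replace(PLACEHOLDER, " ")


if __name__ == "__main__":
    print(protect_and_tab("A / C 空调"))
```

Any other symbol that shows up next to a space in the English terms can be added to `extra_chars` in the same way.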
855649 (X) Local time: 04:03 English to Chinese + ... TOPIC STARTER
Ah ha, that's what I was looking for. I'm not familiar with the internal codes of Word, and personally, I had no idea that it could even support codes like [a-z]. I then tried [A-Z] to see if that worked, but it didn't, and when I tried [a-z@#$%^&*] and added a () in there, it gave an error, so I figured guessing wouldn't work and gave up. Thanks again (a ton) for that bit of extra info.
855649 (X) Local time: 04:03 English to Chinese + ... TOPIC STARTER problems with the 'replace with' code now | Aug 4, 2008 |
Jabberwock wrote: The code I gave in the comment was simply an example... To match round brackets and / (and uppercase letters) use: [A-Za-z/)(]

Hi Jabberwock, I tried using the ([A-Za-z/)(] [A-Za-z/)(]) code today, and it found them just fine, but when I wanted to replace with \1###\2, this is what happened... it turned this...

rack link 齿条联接杆
rack nut 齿条螺母
rack shaft 齿条

...into...

rack l###ink 齿条联接杆
rack n###ut 齿条螺母
rack s###haft 齿条

Nothing changed at all as far as I can tell. Should the \1###\2 code be altered too perhaps?
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER Sorry again... | Aug 4, 2008 |
When I tested the expression I forgot you need to reuse the matched portions... In that case you cannot match parentheses either, you have to replace them with a placeholder just like in case of square brackets...
855649 (X) Local time: 04:03 English to Chinese + ... TOPIC STARTER
Jabberwock wrote: When I tested the expression I forgot you need to reuse the matched portions... In that case you cannot match parentheses either, you have to replace them with a placeholder just like in case of square brackets...

OK, that's easy enough. Thanks again Jabberwock, you've been a great help to me on this.
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER No problem... | Aug 5, 2008 |
I'm glad that I could help. Maybe even someone else will benefit from the discussion...
855649 (X) Local time: 04:03 English to Chinese + ... TOPIC STARTER one more question | Aug 5, 2008 |
Jabberwock, do you think you can help me out once more? I searched on the internet for more of the MS codes so I can stop bothering you and figure them out for myself, but I just don't get it for my next situation. I'm taking Chinese to English now (instead of English to Chinese). Here's an example.

同步器式变速器 synchromesh transmission
直接档变速器 direct drive transmission
超速档变速器 over drive transmission

I'm searching for:

Search: ([!a-z] [a-z])

Which finds the space between the Chinese and the English. But adding the tab is my problem. Replacing with a "^t" cuts off the last Chinese character and the first English letter. If I try something like \1^t\2, it says "The Replace With text contains a group number which is out of range". Any variation of that, like ^t\2, \1###\2, etc. gives the same error. Do you know what I'm doing wrong?
[Edited at 2008-08-05 15:45]
Jaroslaw Michalak Poland Local time: 22:03 Member (2004) English to Polish SITE LOCALIZER Try the same... | Aug 5, 2008 |
Hmm... Actually, you should try the same procedure as in the previous case. This allows you to find spaces within the English term and replace them with the placeholder ### (step 1), then the remaining spaces are (or should be) only the spaces between Chinese and English, so you replace them with tabs (step 2), and finally you restore the English spaces (step 3). For your interest, you should not write ([!a-z] [a-z]), as this means "group 1 is a character that is not a lowercase ASCII letter, followed by a space, followed by a lowercase ASCII letter", and there is no group 2 at all, which is why Word complains about the group number. Write ([!a-z]) ([a-z]) instead, which means "group 1 is a character that is not a lowercase ASCII letter, then follows a space, then group 2 is a lowercase ASCII letter". I know that regexp (regular expressions) can get confusing, especially in Word, which uses a non-standard flavor of regexp. If you are really interested, you might look at Perl. It is a programming language which allows doing such conversions with two or three lines of code which can then be reused over and over. Of course, first you have to write a few dozen non-working programs (I'm not there yet myself...).
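To give a feel for the "two or three reusable lines" idea, here is a short sketch for the Chinese-to-English direction, written in Python rather than Perl. It assumes the Chinese term ends in a character from the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that the English term starts with an ASCII letter; rarer characters outside that block would need a wider range. The function name `tab_between_chinese_and_english` is made up for this example.

```python
import re

# Spaces that sit between a CJK ideograph and an ASCII letter.
# The lookarounds mean neither neighbouring character is consumed,
# so nothing gets "cut off" by the replacement.
_CJK_TO_LATIN = re.compile(r"(?<=[\u4e00-\u9fff]) +(?=[A-Za-z])")


def tab_between_chinese_and_english(line: str) -> str:
    """Replace the space(s) between the Chinese and English terms with one tab."""
    return _CJK_TO_LATIN.sub("\t", line)


if __name__ == "__main__":
    for entry in ["同步器式变速器 synchromesh transmission", "超速档变速器 over drive transmission"]:
        print(tab_between_chinese_and_english(entry))
```

Because only the space at the Chinese/English boundary matches, spaces inside the English term need no placeholder step here.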