Automatic process to add tab/paragraph marker after/before English or Asian (Chinese) words?
Thread poster: Prof Projex

Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
Aug 2, 2008

This is probably a stupid question, and I wouldn't have a clue how to do it myself, but I thought some of the pros here might, IF it's actually possible.

I want to turn some glossaries I found on the internet into a Trados termbase. To do that, I need to turn...
Thermal efficiency 热效率
Thermal equivalent of work 热功当量
Thermal expansion 热膨胀
...into...
Thermal efficiency --> 热效率
Thermal equivalent of work --> 热功当量
Thermal expansion --> 热膨胀

With tabs (shown as -->) in between, so the list can go into Excel columns. The only way I can see a "replace all" or some sort of macro/template working is if it can somehow detect the space before the Asian characters (or even after the English would be OK) and then insert a tab in its place. With a list of 1000 words, clicking the space between each pair with the mouse and then hitting Tab gets really tedious! Does anyone have any suggestions? Thanks so much for any help.


[Edited at 2008-08-02 13:05]



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
A little trick... Aug 2, 2008

Yes, there is a little trick that allows that, but only if the language attribute is different for both entries...

In Word, open the "Find and replace" dialog and, with the cursor in the "Find" field, select the language "English" (from the drop-down button below). In the "Replace" field enter ^&^t (a caret followed by an ampersand, then a caret followed by the letter "t"). Alternatively, you can select "Found text" and "Tab" from the "Special" menu.

If the entries are plain text, without the language attribute (as you say you copied them from the Internet), it would be harder... Basically, you need a text editor that works with regular expressions, and you have to match the character codes of the Asian text. However, this depends on the encoding of the text, so I cannot give you precise instructions.


Boyan Brezinsky  Identity Verified
Bulgaria
Local time: 18:37
English to Bulgarian
+ ...
It can be done with regular expressions Aug 2, 2008

Reading the last suggestion on regular expressions, I realized that it can be done without knowing the encoding scheme. The idea is to insert a separator character before the first occurrence of something that is NOT an ASCII symbol or a space (and maybe not a digit or punctuation mark either).
This can be done with 'find and replace' based on regular expressions, depending on how versatile the regex engine in the text editor is.
Or it can be done with a relatively simple (albeit slow) macro in MS Word or another programmable text editor that looks at the characters in each line one by one and inserts a separator before the first non-Latin character.
By the way, for a list of 1000 entries it would be faster to do it manually than to write a macro: use the text editor's find and replace function to search for spaces and replace them with tabs at the appropriate positions. One just needs to be careful when answering 'Yes' or 'No'. The time it takes to write and test a macro pays off only if the macro is going to be used repeatedly.
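
For anyone who would rather script this than use an editor, the idea above (put a separator before the first thing that is not ASCII) might look roughly like the following in Python. This is only a sketch: Python is not mentioned in this thread, the function name is made up for illustration, and it assumes each line is "English term, space, Chinese term" with the text already decoded to a Unicode string.

```python
import re

# One or more spaces/tabs that are immediately followed by a non-ASCII character.
# (Assumption: the English side contains ASCII characters only.)
BOUNDARY = re.compile(r"[ \t]+(?=[^\x00-\x7F])")

def tab_before_first_non_ascii(line: str) -> str:
    """Replace the whitespace in front of the first non-ASCII character with a tab."""
    # count=1 so that only the first boundary in the line is touched.
    return BOUNDARY.sub("\t", line, count=1)

glossary = [
    "Thermal efficiency 热效率",
    "Thermal equivalent of work 热功当量",
    "Thermal expansion 热膨胀",
]
converted = [tab_before_first_non_ascii(entry) for entry in glossary]
```

Because the test is "ASCII or not" rather than a list of letters, English terms containing brackets, slashes or capitals should not need special handling. Lines with no non-ASCII character are left unchanged.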



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
Use the spaces to your advantage... Aug 2, 2008

bsb's answer gave me another idea: replace all the spaces between regular ASCII characters with a special sequence, then replace all the remaining spaces with tabs, then turn the special sequence back into spaces. That is, in Word:

1.
Search (with regular expressions, i.e. the "Use wildcards" option): ([a-z]) ([a-z])
Replace: \1###\2

2.
Search: " " (space)
Replace: ^t

3.
Search: ###
Replace: " " (space)

Of course, this assumes that the English phrases have no special symbols (like apostrophes, quotes, etc.) at the word boundaries. If they do, you have to include them in the first search expression (e.g. [a-z@#$%^&*]).
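
The same three steps can also be sketched outside Word. The Python below is only an illustration (Python and the function name are not from this thread). It uses lookarounds instead of the two groups, so the letters on either side of the space are checked but not consumed, and it treats uppercase letters as English too.

```python
import re

PLACEHOLDER = "###"  # assumed not to occur anywhere in the glossary

def spaces_to_tab_via_placeholder(line: str) -> str:
    # Step 1: protect every space that sits between two ASCII letters.
    protected = re.sub(r"(?<=[A-Za-z]) (?=[A-Za-z])", PLACEHOLDER, line)
    # Step 2: any space still left is taken to be the English/Chinese boundary.
    tabbed = protected.replace(" ", "\t")
    # Step 3: restore the protected spaces.
    return tabbed.replace(PLACEHOLDER, " ")

result = spaces_to_tab_via_placeholder("Thermal equivalent of work 热功当量")
```

The limitation is the same as in Word: a space next to anything that is not a letter is not protected, so an entry such as "A / C 空调" ends up with a tab at every space unless those symbols are added to the character classes.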



Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
TOPIC STARTER
Got it working for the most part, minus a few errors Aug 3, 2008

Amazing suggestions. Thanks bsb_2, though I'm not quite sure what you mean by "The idea is to insert a separator character before the first occurrence of something that is NOT an ASCII-symbol and space". I don't know how to specify non-ASCII symbols. Can MS Word do it?

Jabberwock, thanks a ton as well; steps 1, 2, and 3 helped me a lot. The problem, like you said, is when there are a lot of () or [], or something like A / C. It doesn't pick up uppercase letters (a simple way around that is to convert everything to lowercase first), but the main problems are the (), [] and /. I tried using the [a-z@#$%^&*] code, but first, those characters aren't included in it, and second, it gives me an error saying "^& is not a valid character". However, for anything that doesn't contain those special characters and is basically pure English text (on the English side, of course), it's a success. Thanks so much!



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
Sorry... Aug 3, 2008

The code I gave in the comment was simply an example... To match round brackets and / (and uppercase letters) use: [A-Za-z/)(]

The square brackets cannot be matched (and I think they cannot be escaped?), but there is an easy workaround: before the described operation just find and replace them with a placeholder (that is, with a character, or a sequence of characters, which does not appear in the original text). For example, replace [ with % and ] with $ (provided that % and $ do not appear in the text).



Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
TOPIC STARTER
Ah ha Aug 3, 2008

Ah ha, that's what I was looking for. I'm not familiar with Word's internal codes, and personally I had no idea it even supported codes like [a-z]. I then tried [A-Z] to see if that worked, but it didn't, and when I tried [a-z@#$%^&*] and added a () in there, it gave an error, so I figured guessing wouldn't work and gave up. Thanks again (a ton) for that bit of extra info.


Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
TOPIC STARTER
problems with the 'replace with' code now Aug 4, 2008

Jabberwock wrote:

The code I gave in the comment was simply an example... To match round brackets and / (and uppercase letters) use: [A-Za-z/)(]


Hi Jabberwock, I tried using the ([A-Za-z/)(] [A-Za-z/)(]) code today, and it found them just fine, but when I wanted to replace with \1###\2, this is what happened...

it turned this...

rack link 齿条联接杆
rack nut 齿条螺母
rack shaft 齿条

...into...

rack l###ink 齿条联接杆
rack n###ut 齿条螺母
rack s###haft 齿条

Nothing changed at all as far as I can tell. Should the \1###\2 code be altered too perhaps?



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
Sorry again... Aug 4, 2008

When I tested the expression I forgot you need to reuse the matched portions... In that case you cannot match parentheses either, you have to replace them with a placeholder just like in case of square brackets...


Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
TOPIC STARTER
Easy enough Aug 5, 2008

Jabberwock wrote:

When I tested the expression I forgot you need to reuse the matched portions... In that case you cannot match parentheses either, you have to replace them with a placeholder just like in case of square brackets...


Ok, that's easy enough. Thanks again Jabberwock, you've been a great help to me on this.



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
No problem... Aug 5, 2008

I'm glad that I could help. Maybe even someone else will benefit from the discussion...


Prof Projex  Identity Verified
China
Local time: 23:37
English to Chinese
+ ...
TOPIC STARTER
one more question Aug 5, 2008

Jabberwock, do you think you can help me out once more? I searched the internet for more of the MS Word codes so that I could stop bothering you and figure them out for myself, but I just don't get it for my next situation. I'm going from Chinese to English now (instead of English to Chinese).

Here's an example.
同步器式变速器 synchromesh transmission
直接档变速器 direct drive transmission
超速档变速器 over drive transmission

I'm searching for: ([!a-z] [a-z])
This finds the space between the Chinese and the English, but adding the tab is my problem. Replacing with "^t" cuts off the last Chinese character and the first English letter. If I try something like \1^t\2, it says "The Replace With text contains a group number which is out of range". Any variation of that, like ^t\2 or \1###\2, gives the same error. Do you know what I'm doing wrong?

[Edited at 2008-08-05 15:45]



Jabberwock  Identity Verified
Poland
Local time: 17:37
Member (2004)
English to Polish
Try the same... Aug 5, 2008

Hmm... Actually, you should try the same procedure as in the previous case. It finds the spaces within the English term and replaces them with the placeholder ### (step 1); the remaining spaces are then (or should be) only the ones between the Chinese and the English, so you replace them with tabs (step 2); finally, you restore the English spaces (step 3).

For your information, you should not write ([!a-z] [a-z]), as this means "group 1 is a character that is not a lowercase ASCII letter, followed by a space, followed by a lowercase ASCII letter", but ([!a-z]) ([a-z]), which means "group 1 is a character that is not a lowercase ASCII letter, then comes a space, then group 2 is a lowercase ASCII letter".

I know that regexp (regular expressions) can get confusing, especially in Word, which uses a non-standard flavor of regexp.

If you are really interested, you might look at Perl. It is a programming language which allows doing such conversion with two or three lines of code which can be then reused over and over. Of course, first you have to write a few dozen non-working programs (I'm not there yet myself...).
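
In the same spirit, here is a rough Python sketch of the one-group versus two-group difference (Python rather than Perl, and the function names are invented for illustration). Python writes "not a-z" as [^a-z], where Word writes [!a-z].

```python
import re

def tab_with_two_groups(line: str) -> str:
    # Group 1 = the character before the space, group 2 = the lowercase letter after it.
    # Both are put back, so only the space itself is swapped for a tab.
    return re.sub(r"([^a-z]) ([a-z])", r"\1\t\2", line)

def tab_with_one_group(line: str) -> str:
    # Only one pair of parentheses, so there is no group 2 to refer to.
    return re.sub(r"([^a-z] [a-z])", r"\1\t\2", line)

entry = "同步器式变速器 synchromesh transmission"
fixed = tab_with_two_groups(entry)

try:
    tab_with_one_group(entry)
    one_group_failed = False
except re.error:
    # Python's counterpart of Word's "group number which is out of range" message.
    one_group_failed = True
```

The space inside "synchromesh transmission" is left alone because the character before it is a lowercase letter, so [^a-z] does not match there. An English term starting with a capital letter would need [A-Za-z] in the second group.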



