Pages in topic:   [1 2] >
Counting words in a txt file within quotation marks
Thread poster: Afew

Afew  Identity Verified
Kazakhstan
Local time: 08:46
English to Kazakh
Feb 9, 2012

Hello fellow translators,

I have a txt file with software strings in it to be localized. It looks like:

#command some text "text to be localized" // comment

I want to count the words within quotation marks. Is there any way to do it, except manual counting?

I tried importing the txt file into MS Excel, but it seems the file is not consistently delimited, so the words I need may end up in different columns.

Any help will be much appreciated.


 

Philip Lees  Identity Verified
Greece
Local time: 05:46
Member (2008)
Greek to English
A job for Perl Feb 9, 2012

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.
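For readers without Perl to hand, the same idea (keep only the quoted part, then count the words) can be sketched in Python. This is a minimal sketch, not a tested tool: the function name `count_quoted_words` and the sample lines are made up for illustration, and it assumes at most one quoted string per line and no quotation marks inside comments.

```python
import re

def count_quoted_words(lines):
    """Count whitespace-separated words inside the first "..." pair on each line."""
    total = 0
    for line in lines:
        # Non-greedy match: text between the first quote and the next quote.
        match = re.search(r'"(.*?)"', line)
        if match:  # lines without a quoted string are skipped
            total += len(match.group(1).split())
    return total

# Hypothetical sample modelled on the line format shown in the first post.
sample = [
    '#command some text "text to be localized" // comment',
    '// a comment-only line',
    '#command other "Open file"',
]
print(count_quoted_words(sample))  # 6
```

To run it on a real file, pass an open file object instead of the list, for example `count_quoted_words(open("yourfilename", encoding="utf-8"))`. The file name and the encoding are assumptions here.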


 

Afew  Identity Verified
Kazakhstan
Local time: 08:46
English to Kazakh
TOPIC STARTER
Some strings are different Feb 9, 2012

Thanks Philip,

Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.

I was able to count the number of quotation marks in Excel using the COUNTIF function, but that was useless, since some lines have whole sentences within the quotation marks.


 

Amit Evron  Identity Verified
Vietnam
Local time: 09:46
Spanish to English
+ ...
Send it over Feb 9, 2012

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.

 

Tony M  Identity Verified
France
Local time: 04:46
Member
French to English
+ ...
Paste into Word Feb 9, 2012

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " (careful to get the right character!), replacing with (say) Tab

Select all and convert text to table, using the character you replaced above (e.g. Tab) as the delimiter.

This should enable you to get a column that just has your text to be translated in, and you can take it from there

If you have any lines with no " " at all, they should just appear all in the first column.

Theoretically at least, you ought to be able to reverse the process at the end...

One proviso: one has to assume that each line does end with a Return character or similar; if necessary, you might need to go through and replace whatever the end-of-line delimiter is with something that will work in Word for the conversion to table.
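The "use the quotation mark as a column delimiter" idea can also be tried outside Word. The Python sketch below splits each line on the " character and keeps every second piece, which is the text that sat between an opening and a closing quote. The function name `quoted_fields` is hypothetical, and the sketch assumes that quotes are never escaped inside a string.

```python
def quoted_fields(line):
    """Return the pieces of a line that sit between pairs of double quotes."""
    parts = line.rstrip("\n").split('"')
    inside = parts[1::2]  # pieces at odd positions follow an opening quote
    if len(parts) % 2 == 0:
        # Odd number of quotes: the last piece was never closed, so drop it.
        inside = inside[:-1]
    return inside

print(quoted_fields('#command some text "text to be localized" // comment'))
print(quoted_fields('a line with no quotes at all'))
```

A line with no quotation marks simply yields an empty list, which matches the "everything stays in the first column" behaviour described above.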


 

Philip Lees  Identity Verified
Greece
Local time: 05:46
Member (2008)
Greek to English
Should still work Feb 9, 2012

Nurzhan Nagashbekov wrote:
Unfortunately, some lines contain only comments and there are lines that contain #command... and "text" but no comments.


I think my script should still work with a small modification (for the comment only lines), but as Amit has kindly offered to take it on I'm happy to hand over to him.


 

Afew  Identity Verified
Kazakhstan
Local time: 08:46
English to Kazakh
TOPIC STARTER
Thanks for suggestions! Feb 9, 2012

Amit Evron wrote:

If it's not confidential and if the file isn't too big, feel free to send it over and I'll write a quick perl script. Shouldn't take more than 5 minutes. Just send me a message through Proz and I'll reply with my e-mail address.


:( It is confidential.


 

Afew  Identity Verified
Kazakhstan
Local time: 08:46
English to Kazakh
TOPIC STARTER
This may work... Feb 9, 2012

Tony M wrote:

Haven't tested it, but why not try this:

Select all your text and paste it into Word (etc.)

Do a 'replace all' on the " ....



Thanks Tony, I will try your method.


 

Jaroslaw Michalak  Identity Verified
Poland
Local time: 04:46
Member (2004)
English to Polish
Okapi Rainbow Feb 9, 2012

I think the best option would be to use Okapi Rainbow, especially if you expect more such work from the client. Basically, it would allow you to extract the text you require (using regular expressions) and then calculate the word count.

Trados 2007 also has an option to import text based on regular expressions. You have to use a separate application Filter Settings for this. After the import you just analyze the resulting ttx file as usual.

I realize that having to learn regular expressions might seem daunting, but if you plan to translate such texts it will be a sensible investment of your time...


 

FarkasAndras
Local time: 04:46
English to Hungarian
+ ...
CAT Feb 9, 2012

I fervently hope that you'll be using a CAT for this job. The localization of SW strings requires strict formatting consistency and there are a lot of repetitions etc., so it's really not the job you'd want to do by typing over the original.
Now, if you do use a CAT, just do the word count there.
Studio has the required capabilities (i.e. you can specify regex rules that separate the translatable text from the rest), and the Studio package also comes with a specialized sw localization tool (Passolo). Of course there are lots of other tools that'll work, too.

The more interesting question is: who is in charge of this project? Isn't there a PM/client who sorts these things out before you get involved?


 

Philip Lees  Identity Verified
Greece
Local time: 05:46
Member (2008)
Greek to English
Try this Feb 9, 2012

I had a few minutes to spare, so I set up this:

http://quote.writewords.eu/

If you paste your text in the box and click Submit, it should return you only the stuff that's between quotes.


 

FarkasAndras
Local time: 04:46
English to Hungarian
+ ...
perl regex Feb 9, 2012

Philip Lees wrote:

Give the file to somebody you know who uses the Perl programming language, and ask them to run this:

perl -i.bak -pe "s/^.+?\"//; s/\".+$//" yourfilename

That will remove everything from your file except the parts in quotes (the original file will be renamed as yourfilename.bak). You can then count the words in the new file.

This assumes that all the lines have the same format.


It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"
And it doesn't skip lines that have no translatable content at all.

Also, .+? is better written as .* and the " may very well be the last character on the line so .+$// should be .*$//.

So, I'd rewrite your one-liner as:
perl -i.bak -pe "s/^.*\"(.*)\".*$/$1/" yourfilename

...but this still doesn't handle the problem cases I mentioned above.
You could do this (untested) to delete lines that don't contain any quoted string:

perl -i.bak -ne "print if s/^.*\"(.*)\".*$/$1/" yourfilename

... but the bottom line is, it's still only usable if the input file is "simple". You could add negative lookahead/lookbehind to cater for escaped quotes inside the quoted strings etc. to make it work and then somehow adapt it for multiple strings per line, but it starts to get tricky there, and you need to see the input file (or know its spec) to take a reasonable stab at solving the problem.

[Edited at 2012-02-09 10:54 GMT]
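As a rough illustration of the two problem cases named above (escaped quotes and several strings on one line), here is a Python sketch. It assumes that a quote inside a string is always escaped with a single backslash, which may not hold for the real file. The names `QUOTED` and `extract_strings` are made up for the example.

```python
import re

# A quoted string: an opening quote, then any run of "backslash + any character"
# or "anything that is neither a quote nor a backslash", then a closing quote.
QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')

def extract_strings(line):
    """Return every quoted string on the line, with \\" turned back into "."""
    return [s.replace('\\"', '"') for s in QUOTED.findall(line)]

line = r'StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"'
strings = extract_strings(line)
print(strings)
print(sum(len(s.split()) for s in strings))  # 9
```

Because `findall` returns all matches, each line can contribute any number of strings, and lines without quoted text contribute none.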


 

Afew  Identity Verified
Kazakhstan
Local time: 08:46
English to Kazakh
TOPIC STARTER
Initial stage of the project Feb 9, 2012

I am at the very beginning of the project and just wanted to know what the word count is for now. I will definitely try regex. Thanks!

 

Philip Lees  Identity Verified
Greece
Local time: 05:46
Member (2008)
Greek to English
Nobody's perfect Feb 9, 2012

FarkasAndras wrote:

It also assumes that there is only one pair of quotes in one line and that there are no escaped quotes inside quoted strings. It'll fail with lines like this:
StringID:4567267; text:"Press the \"Browse\" button to pick a file"; Button:"Browse"



Oh, sure, it breaks in lots of cases, as does the simpler match I used on the web version:

/"(.+?)"/

I am well aware of the pitfalls of text parsing, which is why I added the caveat about all lines having the same format as the example provided.

As this is not a Perl or a regex forum, I'll leave it at that.


 

Ambrose Li  Identity Verified
Canada
Local time: 22:46
Chinese to English
+ ...
simpler perl code Feb 9, 2012

I think this single line of perl should suffice:

perl -nle 'print $1 if /#command\s+"([^"]*)"/'

This assumes that double quotation marks can’t occur inside the pair of double quotation marks that delimits the string to be translated. Usually this is not the case, and (assuming that " is escaped with a single backslash) the perl needed will more likely be

perl -nle 'print $1 if /#command\s+"((?:\\"|[^"])*)"/'

Of course, if escaping of quotation marks occurs but is not signalled by backslashes then the perl code needed will be different.

ETA: The above assumes that continuations don’t occur. If continuations do occur the above won’t work and one-liner solutions might not be sufficient…

[Edited at 2012-02-09 19:03 GMT]
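If continuations do occur, a short script is easier to adapt than a one-liner. The Python sketch below rests on two assumptions that are not confirmed anywhere in this thread: a continued line ends with a backslash, and arbitrary text without quotes may sit between #command and the string (as in the sample line in the first post). All function and variable names are hypothetical.

```python
import re

# "#command", then any text without quotes, then one quoted string in which
# a quote may appear only when escaped with a backslash (assumed format).
COMMAND_STRING = re.compile(r'#command[^"]*"((?:\\.|[^"\\])*)"')

def join_continuations(lines):
    """Merge physical lines that end with a backslash into one logical line."""
    logical, buffer = [], ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.endswith("\\"):
            buffer += line[:-1]  # drop the backslash and keep collecting
        else:
            logical.append(buffer + line)
            buffer = ""
    if buffer:  # the file ended in the middle of a continuation
        logical.append(buffer)
    return logical

def count_command_words(lines):
    """Count words in the quoted string of every #command line."""
    total = 0
    for line in join_continuations(lines):
        match = COMMAND_STRING.search(line)
        if match:
            total += len(match.group(1).replace('\\"', '"').split())
    return total

sample = [
    '#command greet "Hello \\',
    'world" // split over two lines',
    '// comment only',
    '#command quit "Quit"',
]
print(count_command_words(sample))  # 3
```

If the real file marks continuations differently, only `join_continuations` would need to change.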


 