Patents in pdf images: how to convert them for use with CAT tools?
Thread poster: Silvia Barra

Silvia Barra  Identity Verified
Italy
Local time: 05:01
English to Italian
+ ...
Jan 7, 2009

Hi all, I've put this question to many colleagues but never received a satisfactory answer, in the sense that no one knew the solution. Maybe there is some news...
I often translate patents, which I receive as PDF images (often from a fax). I'd like to translate them using Trados or similar programs, but I don't know whether it is possible to convert them into a usable form, or how to do that. Some documents can be extracted quite successfully by OCR, but the vast majority cannot. Programs like Infix that allow editing of PDF files are not useful with PDF images.
Do you have any trick, program or solution for this problem?
Thanks in advance
Silvia


Lorenzo Cordini
Local time: 05:01
English
only OCR Jan 7, 2009

I am afraid that in the case of scanned images the only way is OCR. Of course the quality of the result depends on the quality of the input image.


Erik Freitag  Identity Verified
Germany
Local time: 05:01
Member (2006)
Dutch to German
+ ...
Why not OCR? Jan 7, 2009

Dear Silvia,

the obvious solution used by many translators is of course OCR. You say that the "vast majority" of your texts can't be OCRed. Why is that? If OCR doesn't help, then I can't think of another technical solution.

Maybe outsourcing to a typist is a valid option for you?

Regards,
Erik

PS: I have had very good experiences with ABBYY FineReader. OCR of patent texts received by fax: no problem.

[Edited at 2009-01-07 14:23 GMT]



Bogdan Burghelea  Identity Verified
Romania
Local time: 06:01
English to German
+ ...
Two tools Jan 7, 2009

Silvia Barra wrote:

Hi all, I've put this question to many colleagues but never received a satisfactory answer, in the sense that no one knew the solution. Maybe there is some news...
I often translate patents, which I receive as PDF images (often from a fax). I'd like to translate them using Trados or similar programs, but I don't know whether it is possible to convert them into a usable form, or how to do that. Some documents can be extracted quite successfully by OCR, but the vast majority cannot. Programs like Infix that allow editing of PDF files are not useful with PDF images.
Do you have any trick, program or solution for this problem?
Silvia


Dear Silvia,

for the past 8 years now I have been using ABBYY FineReader Professional as my main OCR program and have been and still am very pleased with it. I started with version 5 and now I am using version 9 and it has kept improving over time.

FineReader has the wonderful capacity to transform pdfs into editable format (Word, Excel, rtf or txt files). Therefore, FineReader would have been my main tool for transforming pdfs into something directly translatable.

Nevertheless, six weeks ago I came across another program, called Able2Extract which, at least for some pdfs, works better for me than FineReader.

For instance, FineReader can't or won't read protected pdfs, whereas Able2Extract can.

Able2Extract is also more accurate in preserving the initial layout (together with background images) of the original pdf.

Accuracy in text recognition is FineReader's strongest advantage over Able2Extract; otherwise the latter would have replaced it as my main PDF converter.

If both programs should fail in transforming the pdf into something editable, then I would use this workaround (being aware that it is rather cumbersome): I would print out the pdf and then I would scan the printed text with FineReader to turn it into editable text.

I hope this answers your questions, and I wish you the best of luck for 2009.

Bogdan

[Edited at 2009-01-07 14:23 GMT]



Heinrich Pesch  Identity Verified
Finland
Local time: 06:01
Member (2003)
Finnish to German
+ ...
I use ABBYY FineReader Jan 7, 2009

I scan page by page, selecting only the text and leaving out the line numbers. After scanning I remove the line breaks.
If you are lucky, the OCR will produce a file that needs no changes before translation. Often, though, the quality of the PDF is very poor (it consists of images of faxed documents). In that case it is better to translate without a CAT tool, just typing the translation into a new document.
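
The clean-up described above (dropping margin line numbers and removing hard line breaks) can also be done on exported OCR text with a small script. The following is only a sketch under stated assumptions, not a FineReader feature: the function name `clean_ocr_text` is hypothetical, paragraphs are assumed to be separated by blank lines, and margin line numbers are assumed to be 1 to 3 digits at the start of a line.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Strip margin line numbers and unwrap hard line breaks in OCR output."""
    paragraphs = []
    # Assumption: a blank line separates paragraphs in the OCR export.
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = []
        for line in block.splitlines():
            # Assumption: a margin line number is 1-3 digits at the start of
            # the line, followed by whitespace or the end of the line.
            line = re.sub(r"^\s*\d{1,3}(?:\s+|$)", "", line).strip()
            if line:
                lines.append(line)
        text = ""
        for line in lines:
            if text.endswith("-"):
                # Rejoin a word that was hyphenated across a line break.
                text = text[:-1] + line
            elif text:
                text += " " + line
            else:
                text = line
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

Two caveats: a line of real text that happens to begin with a number (for example "3 mm") loses that number, and a genuine compound split at its hyphen is joined without the hyphen, so the result still needs proofreading against the PDF.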

Don't forget that the PDF remains the original document, so check your text against it before delivery to the customer, especially all numbers.
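
The number check itself has to be done by eye against the PDF, but a script can prepare a checklist. The sketch below is an illustration only: the names `numbers_with_context` and `suspect_tokens` are hypothetical, and the set of letters treated as digit look-alikes (O, l, I, S, B) is an assumption about typical OCR confusions, not a property of any particular OCR program.

```python
import re

# A number, optionally with "." or "," as decimal or thousands separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

# A whole token built only from digits and look-alike letters, containing at
# least one digit and at least one look-alike letter (e.g. "1O5" for "105").
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[OlISB])[\dOlISB]+\b")


def numbers_with_context(text: str, width: int = 20) -> list[tuple[str, str]]:
    """Return each number with a snippet of surrounding text for manual checking."""
    result = []
    for match in NUMBER.finditer(text):
        start = max(0, match.start() - width)
        snippet = text[start:match.end() + width]
        result.append((match.group(), snippet))
    return result


def suspect_tokens(text: str) -> list[str]:
    """Return tokens that mix digits with letters OCR often mistakes for digits."""
    return SUSPECT.findall(text)
```

Legitimate designations such as "B12" are flagged as well; the list is meant as a prompt to look at the PDF, not as a verdict.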

And don't forget to charge for the time you spend on conversion.

Regards
Heinrich


Boyan Brezinsky  Identity Verified
Bulgaria
Local time: 06:01
English to Bulgarian
+ ...
Why not get the source text directly? Jan 7, 2009

If the patents originate from the US, you could get the text directly from the US Patent and Trademark Office: http://www.uspto.gov/main/patents.htm
Of course, this approach won't work if the patent has only just been applied for and has therefore not yet been published.


xxxPeter Manda
Local time: 23:01
German to English
+ ...
patent texts Jan 7, 2009

At least for a more recent patent, if the patent is a US patent and it has been filed with the USPTO, you can obtain the text of the patent from the website.
I suspect that the same may be true of patents filed with WIPO and at the patent office in Munich (and probably other patent agencies). I would suggest that if you really want to get hold of a non-PDF copy, you (a) check with the client; and then (b) check with the issuing office. There may be a fee for obtaining the non-PDF copy, and I think that fee is certainly something your agency or client should cover.



Heinrich Pesch  Identity Verified
Finland
Local time: 06:01
Member (2003)
Finnish to German
+ ...
Be careful Jan 7, 2009

Peter Manda wrote:

At least for a more recent patent, if the patent is a US patent and it has been filed with the USPTO, you can obtain the text of the patent from the website.
I suspect that the same may be true of patents filed with WIPO and at the patent office in Munich (and probably other patent agencies). I would suggest that if you really want to get hold of a non-PDF copy, you (a) check with the client; and then (b) check with the issuing office. There may be a fee for obtaining the non-PDF copy, and I think that fee is certainly something your agency or client should cover.


I once used a copy of a patent from the patent office site instead of the faxed patent from the agency, but found out too late that there were changes and additions by the author. Bad mistake!



Silvia Barra  Identity Verified
Italy
Local time: 05:01
English to Italian
+ ...
TOPIC STARTER
Did not think about it! Jan 8, 2009

Thank you: I had not thought about going to the patent's source! I tried with the patent I'm currently translating and I found it, so I can use my CAT tool (it's faster). I agree with Heinrich: in fact this very patent has some pages that differ from the version on the website, but I am carefully checking the two versions against each other, so no problem.
As for OCR, I'll look into ABBYY FineReader. Until now I have used other OCR programs, but some documents I received were very difficult to read even for human eyes, so with OCR it was a disaster.
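
Where both versions are available as plain text (for instance the text from the patent office and an OCR export of the faxed copy, which is an assumption here), the comparison can be partly automated with Python's standard `difflib` module. This is a minimal sketch; the function name `changed_lines` is hypothetical.

```python
import difflib


def changed_lines(office_text: str, client_text: str) -> list[str]:
    """List lines that differ between the patent office text and the client's copy.

    Lines prefixed "- " appear only in the patent office text,
    lines prefixed "+ " appear only in the client's copy.
    """
    office = office_text.splitlines()
    client = client_text.splitlines()
    return [
        line
        for line in difflib.ndiff(office, client)
        if line.startswith(("- ", "+ "))
    ]
```

The comparison is line by line, so it is only useful if both texts are broken into lines the same way, for example one sentence or one claim per line. OCR errors in the faxed copy will also show up as differences, so every hit still has to be checked against the PDF.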
Anyway, thank you for your always precious suggestions and experiences
Good night.
Silvia



Silvia Barra  Identity Verified
Italy
Local time: 05:01
English to Italian
+ ...
TOPIC STARTER
Great OCR Jan 9, 2009

Silvia Barra wrote:

As for OCR, I'll look into ABBYY FineReader.


I've tried ABBYY FineReader: it's great software indeed! It recognised almost all of a PDF image file that had poor resolution and that other OCR programs handled badly.
Thank you for the suggestion!
Have a nice day (and a nice weekend)!
Silvia



Silvia Barra  Identity Verified
Italy
Local time: 05:01
English to Italian
+ ...
TOPIC STARTER
Also for French documents? Feb 2, 2009

Silvia Barra wrote:

I've tried ABBYY FineReader: it's great software indeed! It recognised almost all of a PDF image file that had poor resolution and that other OCR programs handled badly.
Thank you for the suggestion!
Have a nice day (and a nice weekend)!
Silvia


Unfortunately, the trial period expired before I could test how well the software handles French. Does anyone use it for that language?
Thanks
Silvia

