Text "not found" in PDF, but it is present
Thread poster: Oliver Walter

Oliver Walter  Identity Verified
United Kingdom
Local time: 15:55
Member (2005)
German to English
+ ...
Jun 13, 2007

Hello, can somebody explain this?
In the PDF file
http://www.pthydraulics.com.au/rescue/lukas/service/lukasspares.pdf
(found in a search for terminology), the "Find" function of Acrobat Reader fails to find any of the text that is actually in the file.
To answer your expected questions:

1. Yes, it looks like real text, that can be selected with the text-select tool;
2. No, the document is not encrypted or protected (according to File > Document Security in Acrobat Reader).
3. If I use Google's "view as HTML", the text is correctly present.

In another sense, it does not appear to be text, yet I can read it and select it while viewing the file with Acrobat Reader (V5.1).

If I select the word "program" (top of page 1), then use the clipboard (^C and ^V as usual) to copy it into a text editor (similar to Notepad), save the file and look at it with a hex viewer, the result is a file that is seven bytes long, but the bytes are (hex values):
02 04 0c 11 04 03 12
This looks as if 02 is p, 04 is r, 11 is g and 12 is m.
So, as we say when we're mystified: what's going on?
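The guess above can be checked mechanically. The following is a minimal Python sketch, not anything taken from the PDF itself: it assumes the seven copied bytes correspond position by position to the letters of "program", and the helper name build_table is made up for illustration.

```python
# Bytes copied from the PDF for the word "program" (hex values from the post).
copied = bytes.fromhex("02 04 0c 11 04 03 12")
visible = "program"


def build_table(codes: bytes, text: str) -> dict:
    """Pair each copied byte with the visible character at the same position."""
    if len(codes) != len(text):
        raise ValueError("byte count and character count differ")
    table = {}
    for code, char in zip(codes, text):
        # A simple substitution must never map one code to two characters.
        if table.setdefault(code, char) != char:
            raise ValueError(f"code {code:#04x} maps to two characters")
    return table


table = build_table(copied, visible)
decoded = "".join(table[b] for b in copied)

print({f"{code:02x}": char for code, char in sorted(table.items())})
print(decoded)
```

The letter "r" occurs twice in the word and both times arrives as 04, which is at least consistent with a fixed one-to-one substitution of character codes.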

Any ideas?
Oliver



Ivo Steinhoff
Brazil
Local time: 11:55
German to Portuguese
+ ...
It's probably not pure text Jun 14, 2007

I followed the link and believe it is a scanned spare parts list that was saved as an image file, pasted into a word processor (like Word) and subsequently converted into a PDF file. That has happened to me before.


Lesley Clarke  Identity Verified
Mexico
Local time: 09:55
Spanish to English
it is an image I imagine Jun 14, 2007

I hope I am not being obvious, but if it was originally an image rather than a Word document, you cannot search for words.


Oliver Walter  Identity Verified
United Kingdom
Local time: 15:55
Member (2005)
German to English
+ ...
TOPIC STARTER
It's not an image Jun 14, 2007

Paladivo wrote:
It's probably not pure text I followed the link and believe it is a scanned spare parts list that was saved as an image file

Lesley Clarke wrote:
I hope I am not being obvious, but if it was originally an image rather than a Word document, you cannot search for words.

It is obvious and I considered that. As I wrote in my original posting (point 1):
It is real text that can be selected with the text-select tool of Acrobat Reader.
The word "program" (as I also wrote) contains 7 (seven) bytes. If it were an image it would contain at least one or two hundred bytes.
My testing leads me to believe that
(a) it is text, not an image
(b) the text is encoded in a way that doesn't look like text in a text editor: neither ASCII nor Unicode.
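Point (b) can be illustrated with the seven bytes from the original posting. This is only a sketch in Python; it assumes the hex values quoted earlier in the thread are exactly what the clipboard delivered.

```python
# The seven bytes obtained by copying the word "program" out of the PDF.
copied = bytes.fromhex("02 04 0c 11 04 03 12")

# Printable ASCII occupies 0x20..0x7e; everything below 0x20 is a control code.
all_control = all(b < 0x20 for b in copied)
any_printable = any(0x20 <= b <= 0x7E for b in copied)

# The bytes decode without error, but only to control characters, not letters.
as_text = copied.decode("ascii")
looks_like_text = as_text.isprintable()

print(all_control, any_printable, looks_like_text)
```

Every byte falls in the control-code range, so a text editor has nothing readable to display, even though each value fits in 7 bits.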
Oliver



Uldis Liepkalns  Identity Verified
Latvia
Local time: 17:55
Member (2003)
English to Latvian
+ ...
I have encountered the like Jun 14, 2007

but have no solution. What happens when you open it in your browser and try to copy it from the HTML view?

Uldis

Oliver Walter wrote:
3. If I use Google's "view as HTML", the text is correctly present.



PAS  Identity Verified
Local time: 16:55
English to Polish
+ ...
Type 1 Fonts Jun 14, 2007

In Adobe Reader, if you click on the "Document Properties" item in the File menu and then pick the "Fonts" tab, you will see that the document uses Adobe Type 1 fonts (i.e. not standard TrueType or OpenType fonts).
They may use 7 bit encoding, instead of 8 bit.
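A possible explanation of why "Find" comes up empty, sketched in Python under stated assumptions: that the viewer compares the typed characters with the stored codes without translating them, and that the font uses the codes seen in the "program" sample from the first post. The table code_for and both function names are hypothetical, and Acrobat's real search logic is not known here.

```python
# Character codes as stored for the word "program" (from the original post).
stored = bytes.fromhex("02 04 0c 11 04 03 12")

# Hypothetical code table for this font, inferred only from that one sample.
code_for = {"p": 0x02, "r": 0x04, "o": 0x0C, "g": 0x11, "a": 0x03, "m": 0x12}


def naive_find(haystack: bytes, query: str) -> int:
    """Search for the query's ASCII bytes, ignoring the font's own codes."""
    return haystack.find(query.encode("ascii"))


def mapped_find(haystack: bytes, query: str) -> int:
    """Translate the query into the font's codes first, then search."""
    try:
        needle = bytes(code_for[ch] for ch in query)
    except KeyError:
        return -1  # the sample table does not cover this character
    return haystack.find(needle)


print(naive_find(stored, "program"))   # -1: the ASCII bytes never occur
print(mapped_find(stored, "program"))  # 0: found once the codes are translated
```

If something like this is what happens, the text is "there" for display and selection but invisible to any search that expects standard character codes.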

That is as far as I can help you with this, but maybe it can help you find a solution...

Is scanning the document (OCR) an option?

What I can't do in Bill's shiny sparklin' new Word 2007 is change the coding of the fonts by using "paste special". There simply are no options available that make any difference.

Pawel Skalinski

[Edited at 2007-06-14 09:05]



Marie-Céline GEORG  Identity Verified
France
Local time: 16:55
English to French
+ ...
It's not text Jun 14, 2007

Hi Oliver,
First, it's not really text: you can't select a word by double-clicking it, which should be possible if it were text. That's why the terminology search in Acrobat doesn't give any result.
I tried converting it using Solid Converter PDF and got an error message telling me that the file contains non-standard coding and cannot be processed normally - and indeed I got a page full of weird symbols.
Then I had a look at a complete page and found in the footer that the source file is an Adobe PageMaker 6.5 file containing data from tif files. Maybe that's why the encoding is strange - it was created 10 years ago, which is around the Middle Ages as far as computers are concerned.

Unfortunately, knowing this doesn't tell you how you can look for terminology in such a file. Does Google's HTML view function allow you to search text?

Marie-Céline



tectranslate ITS GmbH
Local time: 16:55
German
+ ...
It wasn't a searchable PDF Jun 14, 2007

...but I took the liberty of converting it into one for you.
You can download it here.

I used ABBYY PDF Transformer, in case anyone's interested.

HTH,
Benjamin



Oliver Walter  Identity Verified
United Kingdom
Local time: 15:55
Member (2005)
German to English
+ ...
TOPIC STARTER
Yes, that's it! Jun 14, 2007

PAS wrote:
In Adobe Reader, if you click on the "Document Properties" item in the File menu and then pick the "Fonts" tab, you will see that the document uses Adobe Type 1 fonts (i.e. not standard TrueType or OpenType fonts).
They may use 7 bit encoding, instead of 8 bit.
Pawel Skalinski

Thank you Pawel, that is evidently the reason: it's an Adobe-special character set.
tectranslate wrote:
It wasn't a searchable PDF
...but I took the liberty of converting it into one for you.
You can download it here.

Thank you very much Benjamin. Your version has TrueType fonts, which makes the document searchable. That's a point in favour of ABBYY's product (and in favour of ProZ as a mutual help place).
Oliver

[Edited at 2007-06-14 21:35]



Oliver Walter  Identity Verified
United Kingdom
Local time: 15:55
Member (2005)
German to English
+ ...
TOPIC STARTER
Yes, but... Jun 14, 2007

Marie-Céline GEORG wrote:
Does Google's HTML view function allow you to search text?

Yes, I was able to search in that, but the displayed font was very small and the layout was not at all like that in the original doc. If I really wanted to pursue that line, I could have increased the displayed font size of the browser, or saved it to disk, opened the text in Word and changed the font size there, but in this case Benjamin/tectranslate has provided the best solution.

Thanks to all who have contributed to this.

Oliver



