Trouble with line breaks copypasting from PDF
Thread poster: Paolo Valenti

Paolo Valenti  Identity Verified
Switzerland
Local time: 02:54
German to Italian
+ ...
Dec 14, 2004

Hi everyone, when I copy and paste text from a PDF into Word, there are always those bloody line breaks corresponding to the PDF lines, and they screw up the whole CAT process. I can manually reformat short texts, but it is suicidal for long ones. Do you know a way to eliminate line breaks in whole Word documents, and are you able to explain it to an idiot?

Thank you, Paolo Valenti



Luciano Monteiro  Identity Verified
Brazil
Local time: 21:54
English to Portuguese
+ ...
OCR Dec 14, 2004

Hi,

Using OCR software (such as ABBYY FineReader or OmniPage) might solve your problem. It should recognize the text in your PDF document and save it as a TXT or DOC file. Typically, the line breaks will be removed. You should give it a try.

Regards,

Luciano



Silvia Vallejo  Identity Verified
Spain
Local time: 02:54
English to Spanish
+ ...
Acrobat 6.0 Dec 14, 2004

In most cases, Acrobat 6.0 will let you get rid of unwanted carriage returns by saving the file as RTF.


RWSTranslation
Germany
Local time: 02:54
Member (2007)
German to English
+ ...
Source files Dec 14, 2004

Hello,

Ask your client for the source files.

With kind regards

Hans



Textklick  Identity Verified
Local time: 01:54
German to English
+ ...
No problem :-) Dec 14, 2004

If the PDF is non-extractable, then try OCR, although it doesn't work well if the PDF is made from a fax or some other messy source ;-(.

Otherwise, there are many tools (some of which are freeware) which can do this for you. Check some of the sites at: http://www.google.de/search?hl=de&q=pdf%20conversion&btnG=Google-Suche&meta=

I don't have any of these tools, but I am sure other colleagues will follow up with their specific recommendations.

Asking the client for the source files is a good idea, but in my experience they are seldom available.

HTH

Chris
+++


xxxLia Fail  Identity Verified
Spain
Local time: 02:54
Spanish to English
+ ...
Find and replace? Dec 14, 2004

Paolo Valenti wrote:

...a system to eliminate line breaks in whole word documents and are able to explain it to a idiot...




I am not sure whether you are looking for another piece of software or simply for a quicker way to deal manually with the paragraph marks at the end of each line, i.e. the ones that appear within sentences.

SO, at the risk of spelling out something you already know:-)

Go to the FIND & REPLACE function
- in the FIND box, insert 2 paragraph marks (^p^p in Word)
- in the REPLACE box, insert XXX (or any other string that does not occur in the text)

This replaces all double paragraph marks, which represent the real paragraph breaks. (If the text also contains runs of 3 paragraph marks, depending on how it is formatted, replace those with XXX first.)

Now do a FIND & REPLACE substituting all the remaining paragraph marks with a space.

Finally, do a FIND & REPLACE substituting a paragraph mark back in for every XXX.

You still might have to do some reformatting, but at least from a CAT perspective, you'll have correct sentence segments.
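
For anyone who would rather script this than click through Word, the three replacements above can be sketched in Python on plain text. This is only a rough sketch under some assumptions: the text has been saved or pasted as plain text, a blank line marks a real paragraph break, and the function name `unwrap_lines` and the placeholder string are made up for illustration.

```python
import re

# Hypothetical placeholder standing in for "XXX"; any string that
# cannot occur in the text will do.
PLACEHOLDER = "\x00PARA\x00"


def unwrap_lines(text: str) -> str:
    """Remove line-end breaks inside paragraphs, keep real paragraph breaks."""
    # Normalise Windows and old Mac line endings so that one "\n"
    # corresponds to one paragraph mark.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: protect real paragraph breaks (two or more marks in a row).
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: every remaining mark is a PDF line end; replace it, together
    # with any spaces or tabs around it, with a single space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # Step 3: put one paragraph mark back for each placeholder.
    return text.replace(PLACEHOLDER, "\n").strip()


if __name__ == "__main__":
    sample = "First line\nof a sentence.\n\nSecond\nparagraph.\n"
    print(unwrap_lines(sample))
```

Because step 1 matches two or more marks at once, runs of 3 paragraph marks need no separate pass here. Words hyphenated across a line end are not rejoined, so some clean-up may still be needed.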



[Edited at 2004-12-14 18:48]



Robert Tucker
United Kingdom
Local time: 01:54
German to English
+ ...
pdftotext Dec 15, 2004

You will probably find that pdftotext tools (there's a free one at: www.foolabs.com/xpdf/download.html) will pull the text out of the PDF without too many paragraph marks, though it will probably be necessary to insert some.

(Columns may become paragraphs etc., but it's cheaper than buying a full-blown PDF-to-DOC converter, though you may want to check those out, of course.)



Paolo Valenti  Identity Verified
Switzerland
Local time: 02:54
German to Italian
+ ...
TOPIC STARTER
Thanks to everyone! Dec 17, 2004

I really appreciate your help. Now it's time for some experimentation...

Paolo Valenti



Herbert Fipke  Identity Verified
Germany
Local time: 02:54
English to German
+ ...
Have also a look here: Dec 17, 2004

We are also discussing this issue here:

http://www.proz.com/topic/27682

Maybe you'll find a program that suits your needs.



Dragomir Kovacevic  Identity Verified
Italy
Local time: 02:54
Member
Italian to Serbian
+ ...
why not with Wordfast's PlusTools? Dec 18, 2004

Paolo Valenti wrote:

Hi everyone, when I copy and paste texts from PDF into word there are always those bloody line breaks in correspondence to the PDF lines and they screw up all the CAT process.


Quite simply, I have found that the best, cleanest, and safest method is to use PlusTools by Wordfast. Wordfast can be downloaded and used for small projects for free, along with PlusTools, which is always free. Like Wordfast, PlusTools is a macro.

First, extract the text from the PDF with Wordfast (or PlusTools, I can't remember now). Then have PlusTools clean up the text: all those CRs and the like. The process is flawless and smooth.

D

