Trouble with line breaks when copy-pasting from PDF
Thread poster: Paolo Valenti

Paolo Valenti  Identity Verified
Switzerland
Local time: 05:26
German to Italian
Dec 14, 2004

Hi everyone, when I copy and paste text from a PDF into Word, there are always those bloody line breaks corresponding to the PDF lines, and they screw up the whole CAT process. I can manually reformat short texts, but it is suicidal for long ones. Do you know a way to eliminate line breaks in whole Word documents, and are you able to explain it to an idiot?

Thank you, Paolo Valenti


 

Luciano Monteiro  Identity Verified
Brazil
Local time: 00:26
English to Portuguese
OCR Dec 14, 2004

Hi,

Using OCR software (such as ABBYY or OmniPage) might solve your problem. It should recognize the text in your PDF document and save it as a TXT or DOC file, and it will typically remove the line breaks in the process. Give it a try.

Regards,

Luciano


 

Silvia Vallejo  Identity Verified
Spain
Local time: 05:26
English to Spanish
Acrobat 6.0 Dec 14, 2004

In most cases, Acrobat 6.0 will let you get rid of unwanted carriage returns if you save the file as RTF.

 

RWSTranslation
Germany
Local time: 05:26
Member (2007)
German to English
Source files Dec 14, 2004

Hello,

Ask your client for the source files.

With kind regards

Hans


 

Textklick  Identity Verified
Local time: 04:26
German to English
No problem :-) Dec 14, 2004

If the PDF is non-extractable, then try OCR, although it doesn't work well if the PDF is made from a fax or some other messy source ;-(.

Otherwise, there are many tools (some of which are freeware) which can do this for you. Check some of the sites at: http://www.google.de/search?hl=de&q=pdf%20conversion&btnG=Google-Suche&meta=

I don't have any of these tools, but I am sure other colleagues will follow on with their specific recommendations.

Asking the client for the source files is a good idea, but in my experience they are seldom available.

HTH

Chris
+++


 

xxxLia Fail  Identity Verified
Spain
Local time: 05:26
Spanish to English
Find and replace? Dec 14, 2004

Paolo Valenti wrote:

...a system to eliminate line breaks in whole Word documents and are able to explain it to an idiot...




I am not sure whether you are asking about another piece of software or simply about a quicker way to deal manually with the paragraph marks at the end of each line, i.e. the ones that appear within sentences.

So, at the risk of spelling out something you already know :-)

Go to the FIND & REPLACE function
- in the FIND box, insert 2 paragraph mark symbols
- in the REPLACE box, insert XXX (or any other string that does not occur in the text)

This will replace all the double paragraph marks, which represent the real paragraph breaks. (You might also have to do this for instances of 3 paragraph marks, depending on how the text is formatted.)

Now do a FIND & REPLACE substituting all the remaining paragraph marks with a space.

Finally, do a FIND & REPLACE re-substituting all the XXX with a paragraph mark.

You still might have to do some reformatting, but at least from a CAT perspective, you'll have correct sentence segments.
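For anyone who would rather script it, the three FIND & REPLACE passes above can be sketched in Python on plain text saved from the PDF. This is only a rough sketch under a few assumptions: the function name and the placeholder string are made up for illustration, a blank line is taken to mean a real paragraph break, and every single line break is assumed to be an unwanted one. (In Word's own Find and Replace dialog, a paragraph mark is typed as ^p.)

```python
import re

# Hypothetical placeholder; it plays the role of "XXX" in the manual steps
# and only needs to be a string that never occurs in the real text.
PLACEHOLDER = "\x00PARA\x00"


def remove_pdf_line_breaks(text: str) -> str:
    """Join lines broken by PDF copy-paste while keeping real paragraphs."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Pass 1: protect real paragraph breaks (two or more breaks in a row,
    # possibly with stray spaces on the blank lines) with the placeholder.
    text = re.sub(r"\n(?:[ \t]*\n)+", PLACEHOLDER, text)
    # Pass 2: every remaining break sits inside a sentence; make it a space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # Pass 3: turn the placeholders back into paragraph breaks.
    return text.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    sample = "First line of\nparagraph one.\n\nSecond paragraph\ncontinues here."
    print(remove_pdf_line_breaks(sample))
```

As with the manual method, headings, lists and tables that were separated by single breaks only will be merged into the surrounding text, so some reformatting may still be needed afterwards.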



[Edited at 2004-12-14 18:48]


 

Robert Tucker
United Kingdom
Local time: 04:26
German to English
pdftotext Dec 15, 2004

You will probably find that a pdftotext tool (there's a free one at: www.foolabs.com/xpdf/download.html) will pull the text out of the PDF without too many paragraph marks. In fact, you will probably need to insert some yourself.

(Columns may become paragraphs, etc., but it's cheaper than buying a full-blown PDF to .doc converter, though you may want to check those out too, of course.)
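If you want to run pdftotext over several files, a small Python wrapper is one option. This is a sketch only: it assumes the pdftotext program is already installed and on your PATH, and the function names are made up for illustration. The basic call is pdftotext followed by the PDF file and the output text file; the -layout option asks the tool to keep the original physical layout, which can help or hurt depending on the document.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, txt_path: str, keep_layout: bool = False) -> list:
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path: str, txt_path: str, keep_layout: bool = False) -> None:
    """Run pdftotext, failing early if the program is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

A call such as extract_text("manual.pdf", "manual.txt") would then write the extracted text next to the PDF; the file names here are just examples.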


 

Paolo Valenti  Identity Verified
Switzerland
Local time: 05:26
German to Italian
TOPIC STARTER
Thanks to everyone! Dec 17, 2004

I really appreciate your help. Now it's time for some experimentation...

Paolo Valenti


 

Herbert Fipke  Identity Verified
Germany
Local time: 05:26
English to German
Also have a look here: Dec 17, 2004

We are also discussing this issue here:

http://www.proz.com/topic/27682

Maybe you'll find a program that suits your needs.


 

Dragomir Kovacevic  Identity Verified
Italy
Local time: 05:26
Italian to Serbian
why not with Wordfast's PlusTools? Dec 18, 2004

Paolo Valenti wrote:

Hi everyone, when I copy and paste text from a PDF into Word, there are always those bloody line breaks corresponding to the PDF lines, and they screw up the whole CAT process.


Quite simply, I have found that the best, cleanest and safest method is to use PlusTools by Wordfast. Wordfast can be downloaded and used for small projects for free, along with PlusTools, which is always free. Like Wordfast, PlusTools is a macro.

First, extract the text from the PDF using Wordfast (or PlusTools, I can't remember which). Then have PlusTools clean up the text: all those carriage returns and the like. The process is flawless and smooth.

D


 

