Remove duplicate lines from text file
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 21:41
Member (2006)
English to Afrikaans
+ ...
Aug 27, 2012

G'day everyone

Does anyone know of a simple way (on Windows 7) to remove duplicate lines from a text file? My files are between 200 000 and 500 000 lines long each (though no more than 10 MB in size). Even a Perl script or something similar would help. I seem to recall that I once knew how to do this (using the UnxUtils tools), but I've forgotten how.

Thanks
Samuel


 

Heinrich Pesch  Identity Verified
Finland
Local time: 22:41
Member (2003)
Finnish to German
+ ...
Did you try Excel already? Aug 27, 2012

You can sort the rows so that the duplicates end up underneath each other. Then you can use a comparison formula to eliminate those that are identical to the line above.

 

Samuel Murray  Identity Verified
Netherlands
Local time: 21:41
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Excel limitations Aug 27, 2012

Heinrich Pesch wrote:
You can sort the rows so that the duplicates end up underneath each other. Then you can use a comparison formula to eliminate those that are identical to the line above.


I tried both Excel 2003 and Excel 2007, but if I copy the 200 000 lines from my text editor and paste them into Excel, only 25 155 lines get pasted in both versions. The same limitation applies to LibreOffice Calc. However, if I paste into an empty text editor, all 200 000 lines come through, so the limitation lies with Excel/Calc, not with my clipboard.

Do you know of an Excel comparison function that I might use on files smaller than 25 000 lines?

Samuel


 

Joakim Braun  Identity Verified
Sweden
Local time: 21:41
German to Swedish
+ ...
Something like Aug 27, 2012

Paste text in col B.
Put row indexes in col A.
Sort on col B.

Then the formula in col C could be something like:

=IF(B1=B2;"";B2)

This compares current row's col B value to preceding row's col B value. If they're identical, an empty string results; otherwise, the current B value results.

(This doesn't work in the very first row, of course.)

Fill col C with the formula.
Copy col C.
Paste values (not formulae) into col D.
Sort on col D.
Remove empty rows.
Sort on col A.
You should now have all the rows containing unique text in the original order.
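For command-line readers, the same number–sort–deduplicate–restore sequence can be sketched with standard Unix tools (available on Windows via UnxUtils or Cygwin). This is an illustration of the spreadsheet steps above, not something from the original post, and it assumes the input has no leading blanks that would upset field splitting:

```shell
# 1. nl numbers every line (the "col A" row indexes)
# 2. sort -k2 sorts on the text, bringing duplicates together
# 3. uniq -f1 drops adjacent duplicates, ignoring the index field
# 4. sort -n restores the original order; cut strips the indexes
nl -ba input.txt | sort -k2 | uniq -f1 | sort -n | cut -f2-
```

Like the spreadsheet procedure, this keeps the first occurrence of each line and preserves the original order of the survivors.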

***

(By the way, the row count limitation is much improved in newer Excel versions:
http://office.microsoft.com/en-us/excel-help/excel-specifications-and-limits-HP010342495.aspx?CTT=5&origin=HP005199291)

[Bearbeitet am 2012-08-27 13:06 GMT]


 

Endre Both  Identity Verified
Germany
Local time: 21:41
Member (2002)
English to German
Regex Aug 27, 2012

You didn't define exactly what duplicates you are after. If they are in consecutive lines (as opposed to occurring anywhere in the file), you can go the regex route, for instance in Notepad++:
Search for:
^(.*\r?\n)\1
Replace with:
\1
(source: stackoverflow)

The "Match case" option toggles the regex between case-sensitive and case-insensitive matching. To make it insensitive to trailing spaces, you'll have to adapt the regex (or simply delete all trailing spaces beforehand in a separate operation).
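For what it's worth, removing consecutive duplicates only is exactly what the classic `uniq` tool does (shipped with Cygwin and UnxUtils). A minimal sketch, not from the original post:

```shell
# uniq keeps the first line of each run of identical adjacent lines;
# non-adjacent duplicates survive, just as with the regex approach
uniq input.txt output.txt

# -i makes the comparison case-insensitive, like unticking "Match case"
uniq -i input.txt caseless.txt
```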

If the duplicates can occur anywhere in the file, you either have to accept that the lines are sorted before the duplicates are removed, or you have to use a database (such as Access) and set up a table that doesn't accept duplicates (but then Access imposes a 255-character field limit, I believe).

You can also take a look at:
http://winfoes.co.uk/how-to-windows/item/11-how-to-delete-duplicate-lines-from-txt-csv-xls

Not sure, but I suspect this one also sorts the lines before removing the duplicates.

Endre


 

Samuel Murray  Identity Verified
Netherlands
Local time: 21:41
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Perl Aug 27, 2012

Samuel Murray wrote:
Does anyone know of a simple way (on Windows 7) to remove duplicate lines from a text file?


Some googling turned up the following line of Perl, and it appears to work (though one can't really tell with 200 000 lines, and I don't know any Perl):

perl -ne "print unless $a{$_}++" inputfile.txt > outputfile.txt

Are there any perl hackers here who can confirm that that would indeed remove duplicates from the file?
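For comparison, awk has a well-known one-liner built on the same idea: the counter (`$a{$_}++` in Perl, `seen[$0]++` in awk) is zero the first time a line is encountered, so the first copy prints and later copies are suppressed, whether or not they are adjacent. A sketch for a Unix-style shell (on cmd.exe the quoting differs), added here as an illustration rather than taken from the thread:

```shell
# print a line only the first time it is seen;
# removes duplicates anywhere in the file and keeps the original order
awk '!seen[$0]++' inputfile.txt > outputfile.txt
```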

Endre Both wrote:
You didn't define exactly what duplicates you are after. If they are in consecutive lines (as opposed to occurring anywhere in the file), you can...


I need to remove duplicate lines. A line is defined as something from one line break (CRLF or CR or LF) to the next. A duplicate line is a line that has a perfect copy elsewhere in the file. I want to keep one instance of the duplicate (i.e. if there are 100 instances of a line, I want to delete 99 and keep just 1).

It would be good to know how to choose between case-sensitive and case-insensitive matching (so that other readers can use it too), but for my own purposes the lines contain only numbers, spaces, colons and slashes (no letters).

I don't mind if the lines end up in a different order than in the original file.
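On the case-sensitivity question: one variation on the same hash-counting idea is to deduplicate on an uppercased copy of each line while printing the line unchanged. This is a sketch (assuming awk from UnxUtils or Cygwin), not something posted in the thread:

```shell
# compare lines case-insensitively but output them as-is;
# the first variant of each line (by case-folded content) is kept
awk '!seen[toupper($0)]++' inputfile.txt > outputfile.txt
```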

I'll have a look at your solutions, thanks.


 

Jaroslaw Michalak  Identity Verified
Poland
Local time: 21:41
Member (2004)
English to Polish
EditPad Lite Aug 27, 2012

Try the EditPad Lite text editor; it has a configurable function for removing duplicates. It is located under the "Extra" menu.

EDIT: I don't use it regularly, as I prefer Notepad++, but it seems that to use EditPad Lite for commercial purposes, you need to purchase a license. However, they have a three month money-back guarantee.

[Edited at 2012-08-27 13:15 GMT]


 

Endre Both  Identity Verified
Germany
Local time: 21:41
Member (2002)
English to German
You have to sort first then Aug 27, 2012

Samuel Murray wrote:
perl -ne "print unless $a{$_}++" inputfile.txt > outputfile.txt


I'm unfamiliar with Perl, but any one-liner will only get rid of consecutive duplicates, so you'll have to sort first if you want to get rid of non-consecutive ones in a simple way (and if line order is not an issue), and I don't see any sorting happen in that Perl directive.

In Notepad++, there's an option to sort and remove duplicates in one go (TextFX/TextFX Tools/Sort lines; with option Sort outputs only UNIQUE active), but it only works on the active file, so you'll have to open all files.

In the stackoverflow question referenced above, Pablo Santa Cruz also mentions a simple method to sort and output via Cygwin command line:
$ cat yourfile | sort | uniq > yourfile_nodups

With a wrapper that includes all files you need to change, this method is obviously quicker than opening every file in an editor. (Unfortunately I'm not familiar enough with the Unix command line to provide that wrapper, but a bit of googling should help.)
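Such a wrapper could be as simple as a shell loop. The sketch below assumes a Cygwin-style shell and files ending in .txt (these assumptions are mine, not from the post); `sort -u` sorts and removes duplicates in one step, so the `cat | sort | uniq` pipeline collapses into a single command:

```shell
# deduplicate every .txt file in the current directory,
# writing the results to matching *_nodups.txt files
for f in *.txt; do
  sort -u "$f" > "${f%.txt}_nodups.txt"
done
```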

Endre


 

Heinrich Pesch  Identity Verified
Finland
Local time: 22:41
Member (2003)
Finnish to German
+ ...
How about wordfast tm Aug 27, 2012

If you make a TM out of your text file, Wordfast would reorganise it and show only one instance of each duplicate line (I believe).

 

István Hirsch  Identity Verified
Local time: 21:41
English to Hungarian
Editpad Aug 27, 2012

You can find the „Delete duplicate lines” option under „Extra” (with two settings: adjacent or anywhere). I have EditPad Lite (a free download) and do not know whether there are any restrictions on file size.
Sorry, I can see now that Editpad has been mentioned in the meantime.

[Módosítva: 2012-08-27 13:45 GMT]


 

Samuel Murray  Identity Verified
Netherlands
Local time: 21:41
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Endre Aug 27, 2012

Endre Both wrote:
Samuel Murray wrote:
perl -ne "print unless $a{$_}++" inputfile.txt > outputfile.txt

I'm unfamiliar with Perl, but any one-liner will only get rid of consecutive duplicates, so you'll have to sort first if you want to get rid of non-consecutive ones in a simple way (and if line order is not an issue), and I don't see any sorting happen in that Perl directive.


My guess (based on what happened when I got unexpectedly few results, and on some comparisons) is that this line of Perl may delete all duplicates (i.e. all 100 of 100 instances, and not 99 of 100 as I need).

In Notepad++, there's an option to sort and remove duplicates in one go (TextFX/TextFX Tools/Sort lines; with option Sort outputs only UNIQUE active), but it only works on the active file, so you'll have to open all files.


That works, thanks (you have to install the TextFX plugin, via Notepad++'s UI). It complained about NUL characters, but if you do a find/replace in the search mode that recognises special characters and replace \0 with a space, you get rid of those characters.
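The same NUL clean-up can also be done from the command line before deduplicating; a sketch using `tr` (UnxUtils/Cygwin), which deletes the bytes outright rather than replacing them with spaces:

```shell
# strip NUL bytes from the file before running the deduplication step
tr -d '\0' < inputfile.txt > clean.txt
```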

In the stackoverflow question referenced above, Pablo Santa Cruz also mentions a simple method to sort and output via Cygwin command line:
$ cat yourfile | sort | uniq > yourfile_nodups


I actually tried that exact line using the UnxUtils versions of the programs, but it wouldn't work (though from the error message I think it's just the line syntax that needs fixing).

With a wrapper that includes all files you need to change, this method is obviously quicker than opening every file in an editor.


Fortunately, in my case, the files are already merged (so I'm dealing with a small number of very large files instead of a large number of relatively small files). I have a little AutoIt script that merges files of a type across subdirectories, so that's not a problem.


 

Samuel Murray  Identity Verified
Netherlands
Local time: 21:41
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
EditPad Lite works Aug 27, 2012

István Hirsch wrote:
You can find the „Delete duplicate lines” option under „Extra” (with two settings: adjacent or anywhere). I have EditPad Lite (a free download) and do not know whether there are any restrictions on file size.


Thanks. I discovered that Notepad++ deletes more lines than it should (from a file of 1 million lines I was expecting about 2 000 unique lines, but Notepad++ yielded only about 200; I also spot-checked and found that some lines present in the big file were missing from the output). EditPad Lite yielded a more believable number of uniques, and it took only about 10 seconds to find all the duplicates (not sure whether my 6 GB of RAM has anything to do with that, though).


 

