I've got a ~35GB text file full of data, and I want to process it so that the final file contains only unique records. In the past I've had no problem with cat FILE | sort | uniq, but I've never worked with anything of this magnitude. Even running a line count takes an extraordinary amount of time:
time cat FILE | wc -l
2608560847
real 11m18.148s
user 1m35.667s
sys 1m33.820s
[root@server src]#
Any suggestions on how I can go about getting unique records from this type of file?
Unix sort is built to handle files far larger than RAM: it spills sorted chunks to temporary files and then merges them (an external merge sort). This post goes into the algorithmic details: http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...
Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need enough extra disk space to hold all of the temporary files; GNU sort's -T option lets you point them at a partition with room to spare.)
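For a file this size the main knobs are where sort spills its temp files, how much RAM it may use, and how many threads it runs. A minimal sketch, assuming GNU coreutils; the scratch path, buffer size, and thread count are placeholders to adjust for your machine:

    # LC_ALL=C: byte-wise comparison, much faster than locale-aware collation
    # -T: put temp files on a partition with enough free space (tens of GB here)
    # -S: in-memory sort buffer before spilling to disk
    # --parallel: number of sort threads (GNU coreutils >= 8.6)
    LC_ALL=C sort -u -T /data/tmp -S 8G --parallel=4 FILE > FILE.uniq

Setting LC_ALL=C alone often cuts the run time dramatically on plain ASCII data.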
If you need to do something more sophisticated (e.g. joining lines from a web server log into sessions, then sorting the sessions), you can still use divide-and-conquer, but you have to be smarter about it: divide the file into N parts based on some key (e.g. split into files by session ID), sort each part individually, then merge the results back together.
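At the shell level that could look something like this rough sketch; it assumes the session ID is the first whitespace-separated field and that bucketing on its first character spreads the data reasonably evenly, both of which you'd adapt to your own data:

    # 1. Partition: every line for a given session ID lands in the same bucket file
    awk '{ print > ("bucket." substr($1, 1, 1)) }' FILE

    # 2. Sort each bucket on its own; each piece is far smaller than the whole file
    for f in bucket.*; do
        LC_ALL=C sort -o "sorted.$f" "$f"
    done

    # 3. Merge the already-sorted pieces back into one globally sorted file
    LC_ALL=C sort -m sorted.bucket.* > FILE.sorted

The same pattern covers the dedup case: partition so identical lines always share a bucket, run sort -u per bucket, then merge.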
This is what map/reduce frameworks are made to do, of course, but something like Hadoop may be overkill unless you plan to do this type of thing often.