Hacker News

Strange days we live in. Python and C++? What about a line of bash:

tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn

I’d like to know the memory profile of this. The bottleneck is obviously sort which buffers everything in memory. So if we replace sort | uniq -c with awk, using a hash map to count each unique word, only the unique words and their counts are held in memory:

tr -s '[:space:]' '\n' < file.txt | awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn

I’m guessing this will beat Python and C++?
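
Before timing anything, it is worth checking that the two pipelines agree. This is a rough sketch, not a benchmark: the sample text and the function names (count_sort, count_awk, normalise) are made up for illustration.

```shell
# Hypothetical sample input; any text file would do.
sample=$(mktemp)
printf 'the cat and the dog and the bird\n' > "$sample"

# Pipeline 1: sort the full word list, then count adjacent duplicates.
count_sort() {
  tr -s '[:space:]' '\n' < "$1" | sort | uniq -c | sort -rn
}

# Pipeline 2: count in an awk associative array, then sort only the unique words.
count_awk() {
  tr -s '[:space:]' '\n' < "$1" |
    awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn
}

# uniq -c pads its counts with leading spaces, so normalise before comparing.
normalise() { awk '{print $1, $2}'; }

count_sort "$sample" | normalise
count_awk "$sample" | normalise
```

Where GNU time is installed, wrapping each pipeline as /usr/bin/time -v sh -c '...' should report a peak resident set size, which is probably the closest thing to a memory profile here.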

pjscott 11 minutes ago

> I’d like to know the memory profile of this. The bottleneck is obviously sort which buffers everything in memory.

That's not obvious to me. I checked the manuals for sort(1) in GNU and FreeBSD, and neither of them buffers everything in memory by default. Instead, they read chunks into an in-memory buffer, sort each chunk, and (if there are multiple chunks) use the filesystem as temporary storage for an external merge sort.

This sorting program was originally developed with memory-starved computers in mind, and the legacy shows.
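
The chunk, sort, merge idea can be imitated by hand with split and sort -m. This is only an illustration of the general technique, not how either sort implementation is actually written; the file names and the 3-line chunk size are arbitrary.

```shell
# Byte-wise ordering so the result does not depend on the locale.
export LC_ALL=C

# Hypothetical working directory and input.
workdir=$(mktemp -d)
printf '%s\n' pear apple fig apple kiwi banana fig apple > "$workdir/words.txt"

# 1. Cut the input into chunks small enough to sort in memory (3 lines each here).
split -l 3 "$workdir/words.txt" "$workdir/chunk."

# 2. Sort each chunk on its own and write it back out as a sorted run.
for chunk in "$workdir"/chunk.*; do
  sort "$chunk" > "$chunk.sorted"
done

# 3. Merge the sorted runs; -m only merges and assumes each input is already sorted.
sort -m "$workdir"/chunk.*.sorted > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

Only one chunk has to fit in memory during step 2, and the merge in step 3 reads the runs sequentially, which is why the approach suits small-memory machines.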

knome 7 minutes ago

>which buffers everything in memory

GNU sort can spill to disk. It has a --buffer-size option if you want to manually control the RAM buffer size, and a --temporary-directory option for telling it where to put the spilled data during the sort if need be.
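
Applied to the original one-liner, that might look like the following. It assumes GNU sort; the 64M cap, the sample input, and the spill directory are placeholders picked for the example.

```shell
# Hypothetical input file and spill directory, created here so the sketch runs as-is.
input=$(mktemp)
spill=$(mktemp -d)
printf 'b a c a b a\n' > "$input"

# Cap the in-memory buffer at 64M (an arbitrary figure) and send temp files to $spill.
counts=$(tr -s '[:space:]' '\n' < "$input" |
  sort --buffer-size=64M --temporary-directory="$spill" |
  uniq -c | sort -rn)

printf '%s\n' "$counts"
```

With an input this small nothing is actually written to $spill; the options only matter once the data outgrows the buffer.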