Sensei_S_
Beginner

Reducing memory footprint

Dear all,

I need to reduce the memory footprint of my program while retaining some speed, and I might also switch data structures. The problem: I need to count the occurrences of all integers of a fixed size in bits, let's say 32 bits, read from a file.

Right now I am using a concurrent hashmap, but after some time it spikes to 10 GB of RAM, and since computation basically stops, I think I am just thrashing my memory (laptop with 8 GB). Another option is to use a concurrent vector.

Also, I am using a pipeline to read the integers; I don't know whether that part can be optimized.

My impression is that I could save memory with a vector, but I don't know if I can preallocate counters for all possible integers, in this case 4 GB.

Do you have any suggestions?

Thanks!

7 Replies
Alex_H_4
Beginner

A compressing data structure like a trie or Judy array ( http://judy.sourceforge.net ) might be appropriate. They are not multi-thread-friendly, so you would need a lock around the large store, and if contention is high, a small per-thread structure that is periodically merged into the large store.

 

Or use a concurrent vector, and once it grows to a certain size, move its contents to a compressing data structure.

You've got a lot of options.

I've stored what would have been 18 GB of 32 bit ints as 3.5 GB in a Judy array. They're fast too.

RafSchietekat
Black Belt

Quick fix to try first before deciding whether a faster solution is needed: while (input not empty) { count elements using a limited map and write others to an overflow file; input = overflow; }.

Alex, just out of curiosity, have you looked into a possible cause for that thread-unfriendliness? Is it intrinsic, or is there room for improvement here?

I don't see what a concurrent_vector would bring, though.

jimdempseyatthecove
Black Belt

If you are counting 32-bit integers from a large file, the time to count the entries in a read buffer is likely less than the time for the I/O to read the buffer. I suggest you use a 2-stage parallel_pipeline where 1 stage reads the file and the second stage performs the counts.

If the above scenario demonstrates that the bottleneck is the counting stage, then add an additional counting stage, where the first counting stage counts the low numbers and the second counts the high numbers (split at what you assume will be a balanced point). This method may have an advantage when you have more memory channels (depending on whether your system has 1, 2, 4, or more of them).

Your counters must be sized appropriately (your file may have a worst case of all numbers being the same).

Jim Dempsey

Alexei_K_Intel
Employee

Possibly you have already solved your problem, but nobody answered your question about "if I can preallocate all possible integers, in this case, 4GB".

On Windows you can allocate as much memory as you have RAM+swap (minus the memory already allocated by other processes). However, in your case, if you allocate 4 GB it means that you have only one byte for each possible integer. Is it guaranteed that you have no more than 255 occurrences of the same integer? For this solution with a preallocated array, you do not need a concurrent_vector; you can use std::vector<std::atomic<unsigned char>>.

Sensei_S_
Beginner

Sorry for the delay, I was out for business.

@Alex H, I'll look into Judy; it seems really promising. I don't know how it will behave with threads, though. I'll take a look at that.

@Raf, from what I am seeing, the unfriendliness is quite intrinsic. The large file may contain the same values over and over, so as far as I can see, all threads start to contend for access (and so they're slow). My gut tells me that a hashtable in this case brings too much overhead, while a simple approach with a (preallocated) vector could save me some memory.

I don't know how it is implemented in TBB, but an internal linked list or similar structure is probably too much in this case. Is there a way to debug this and see how things are going inside the hashtable?

What I see when profiling is this: it starts very fast, 100% CPU occupancy per thread, good, but as things progress, the throughput of integers per second (read from the disk) drops to almost zero and memory starts growing. It does progress, of course, and it terminates, but how it progresses is really bothering me.

@Jim, I wanted to investigate this, but I don't know how to see how things progress inside a pipeline. My current algorithm uses a serial reader filter, and a parallel counter filter, in a pipeline. Can I do something to take a look at this?

@Alex K, I need to count integers as uint64, unfortunately. Is a vector of atomics that much less expensive than a concurrent vector?

Thanks for all of you, you're invaluable!

RafSchietekat
Black Belt

#6 "@Raf, for what I am seeing the unfriendliness is quite intrinsic."

That was a question to "Alex H." about Judy arrays. :-)

RafSchietekat
Black Belt

#6 "@Alex K, I need to count integers, as uint64, unfortunately. Is a vector of atomics that less expensive than a concurrent vector?"

A concurrent_vector is only meant to allow concurrent growth, whereas a vector might move existing elements to a different location and invalidate existing iterators. Both vector and concurrent_vector allow concurrent element access, but neither protects against data races (more than one thread accessing a particular element and at least one of them writing), so you would still need an atomic element type if that is a possibility. If the container is pre-allocated, it is always cheaper to use a plain vector. In short, there is no use for a concurrent_vector here (unless perhaps if you want to bet that most integers in that file are small and you don't know the actual maximum in advance, but...).

There are many solutions. I suggested multiple rounds because it might be the simplest addition to your current code, or maybe not. Another solution is to use 4 GB worth of counters that go only to 255, with a concurrent_hash_map for overflow values. Because the number of map entries would be bounded by 1/256 of the number of possible 32-bit integers, both containers should comfortably fit within 6 GB or less, and your computer has 8 GB, so...

It's your choice: atomics in that vector, or Jim's pipeline-based idea (which also has a cost, though).

Try both, and let us know!

(2015-11-23 Added) To detect overflow with atomics, you might want to make the threshold 255 minus the possible number of threads, using a test-and-test-and-set strategy if the distribution is quite sparse, i.e., if 255-N<=v you go directly to overflow, and otherwise you then check whether 255-N<=v++ (don't forget to decrement again if overflow is detected). This might even work satisfactorily with nibbles instead of bytes, halving the size of the required array, but you would also need CAS operations instead of simple additive operations.

(Added) After the counting phase, don't assume that the array will contain 255-N for all entries in the map! This will probably be the case for most or perhaps even all entries, depending on the input, but occasionally multiple threads might collide, retreat below the threshold, go to overflow, and then never encounter the integer again.
