Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

concurrent_queue - can never get back virtual address SPACE

turks
Beginner
593 Views
I am using a concurrent_queue together with scalable_malloc/free to queue up multiple lines that get processed in parallel using a TBB pipeline. This works great, and it is gratifying to see 8 CPUs near 100% usage on the tough jobs. (Thank you, Intel!)

In order to handle really tough jobs, I monitor available virtual memory (using GlobalMemoryStatus), since each of the parallel transform tasks might use upwards of 120 MB. If I get low on virtual memory, I end the pipeline prematurely and gracefully, and then start a new pipeline with one fewer thread, thus letting the full VM space be distributed among fewer tasks.
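Roughly, the check looks something like this (a sketch only; the 200 MB threshold and the restart call are placeholders, not my real code):

#include <windows.h>

// Keep a safety margin of address space; 200 MB here is only an illustrative threshold.
const SIZE_T kMinAvailVirtualBytes = 200u * 1024 * 1024;

bool LowOnVirtualAddressSpace()
{
    MEMORYSTATUS ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatus(&ms);       // dwAvailVirtual = unreserved address space in this process
    return ms.dwAvailVirtual < kMinAvailVirtualBytes;
}

// In the pipeline driver, roughly:
//   if (LowOnVirtualAddressSpace())
//       RestartPipelineWithFewerThreads();   // placeholder: end the pipeline gracefully, restart with N-1 threads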

The problem is that even though all allocated memory is freed, GlobalMemoryStatus never reports an increase in dwAvailVirtual.
This is because, in the process of handling thousands of lines, even when I ensure that no more than 50 lines are in the concurrent_queue, each with a 1 MB malloc, the addresses of those 1 MB buffers leapfrog all across the virtual address space (3 GB).

At the end of the processing, after giving back all memory, deleting the pipeline, and deleting the concurrent_queue, I'd like to end up in a memory state close to when my process started. But, no. If I don't end my system process, even though I've given back all allocated memory, the system keeps my peak VM usage subtracted from the dwAvailVirtual.

I tried doing a VirtualFree on the same address and length right after I do a scalable_free of the buffer just processed by the pipeline. I think that only works if paired with a VirtualAlloc.

I'd rather not write my own thread-safe memory manager, since, except for scribbling all over the 32-bit address space, the scalable memory allocators do exactly what I want, especially with 8 cores pumping away.

Is there a way to de-commit or de-reserve the address space of memory allocated and deallocated by scalable_malloc?

I'm thinking of making the queue itself operate on items that include a fixed maximum-size buffer. Each push would use that much memory, and each pop would return it. Still, though, since the items would effectively use scalable_malloc, I would probably have the same problem.
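A rough sketch of that idea (buffer size and pool depth picked only for illustration) would be to allocate the fixed-size buffers once and recycle them through a free list, so the same addresses get reused instead of new ones being handed out each time:

#include <windows.h>
#include "tbb/concurrent_queue.h"
#include "tbb/scalable_allocator.h"

const size_t kBufferBytes = 1024 * 1024;   // fixed maximum line buffer (illustrative)
const size_t kPoolDepth   = 50;            // at most 50 lines in flight (illustrative)

tbb::concurrent_queue<char*> free_buffers; // pool of recycled buffers

void InitPool()
{
    for (size_t i = 0; i < kPoolDepth; ++i)
        free_buffers.push(static_cast<char*>(scalable_malloc(kBufferBytes)));
}

char* AcquireBuffer()                      // called by the input filter
{
    char* p = 0;
    while (!free_buffers.try_pop(p))       // pool exhausted: wait for the pipeline to drain
        Sleep(25);
    return p;
}

void ReleaseBuffer(char* p)                // called by the output filter once the line is done
{
    free_buffers.push(p);
}

If the buffers are allocated once up front and only recycled (never freed mid-run), the touched address range should stay bounded at roughly kPoolDepth * kBufferBytes.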

Added info: For speed, instead of using a bounded concurrent_queue I am, as recommended, periodically checking the size and, if it is above a certain limit (100), doing a Sleep(25) to allow the pipeline to do more processing.
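Something like the following, where unsafe_size() only gives an approximate count and the queue shown is just a stand-in for whatever holds the pending lines:

#include <windows.h>
#include "tbb/concurrent_queue.h"

extern tbb::concurrent_queue<char*> line_queue;   // stand-in for the queue of pending lines

void ThrottleProducer()
{
    while (line_queue.unsafe_size() > 100)        // soft limit of 100 queued lines
        Sleep(25);                                // let the pipeline drain before pushing more
}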

Is there another way to accomplish this that would not use up so much virtual address space?

Note that it is address "space", not memory, that is the problem.
The question thus reduces to:
Is there ANY way for a process to give back used address space?
This may only be a problem for 32-bit operating systems, and the Windows family in particular.

I think the problem can also be seen with a loop instead of a queue:

for (int i = 0; i < 1024; i++)
{
    char *p1 = (char *) scalable_malloc(1024 * 1024 + i);
    scalable_free(p1);
}
// At this point over 1 GB of virtual memory has been allocated and de-allocated.
// Yet, as reported by GlobalMemoryStatus, available virtual address space has decreased
// by a huge amount, irreversibly it appears.

Thanks,
Mitch
0 Kudos
15 Replies
RafSchietekat
Valued Contributor III
593 Views
scalable_malloc() will pass requests of nearly 8 kB or more on to malloc() (with some overhead), so you might as well call that directly. Does that change the situation? For smaller blocks, which it quite efficiently allocates by itself from 1-MB chunks, scalable_malloc() indeed never returns the 1-MB chunks, but that does not seem to be relevant here.
0 Kudos
turks
Beginner
593 Views
Quoting - Raf Schietekat
scalable_malloc() will pass requests of nearly 8 kB or more on to malloc() (with some overhead), so you might as well call that directly. Does that change the situation? For smaller blocks, which it quite efficiently allocates by itself from 1-MB chunks, scalable_malloc() indeed never returns the 1-MB chunks, but that does not seem to be relevant here.
Thanks, Raf. I did try using regular malloc/free and the same thing happens. Memory is returned. The next malloc returns memory at a new and different address. This too gets returned. The net effect of allocating and freeing memory over a long time is that the running process has "seen" a wide variety of memory addresses. Thus, even though all memory has been given back, if one checks how much virgin address "space" is available (with GlobalMemoryStatus()), it has been seriously reduced; a large range of address space has been used.

Why does my application care about available address space rather than memory really used?
It may not matter that 2 GB of memory is available if there is only 1 GB of available address space to put it in.
If my app starts getting low on available virtual memory, I reduce the number of parallel cores in the pipeline in order to share the limited virtual memory/address space among fewer cores.

I might add that the address space does eventually get re-used. When I first start the process there is 1.8 GB of virtual space available (according to GlobalMemoryStatus), and the app is shown in the Windows Task Manager to be actually using only 150 MB of virtual memory. After running a few executions, virtual memory usage, after temporarily peaking as high as 2 GB or more (in Task Manager), returns to 150 MB once all memory is freed. At that point, for that run and all subsequent runs within this same running process, only 1.4 GB of available virtual memory is reported by GlobalMemoryStatus, never any more, until the running process is ended and restarted. It seems to settle on this value.

Mitch
0 Kudos
RafSchietekat
Valued Contributor III
593 Views

I'm glad to hear that this has nothing to do with TBB, then?

P.S.: Are you just worried by the diagnostic (which seems harmless enough if malloc() merely postpones some administration), or are you actually getting allocation errors (you don't explicitly state what you want to do, like allocate a 2-GB block, or map a file...)? Anyway, there are probably better resources for you to consult about this problem.

P.S.: Also, use of virtual memory does not necessarily imply that storage in the swap file is used.

0 Kudos
turks
Beginner
593 Views
Quoting - Raf Schietekat

I'm glad to hear that this has nothing to do with TBB, then?

P.S.: Are you just worried by the diagnostic (which seems harmless enough if malloc() merely postpones some administration), or are you actually getting allocation errors (you don't explicitly state what you want to do, like allocate a 2-GB block, or map a file...)? Anyway, there are probably better resources for you to consult about this problem.

P.S.: Also, use of virtual memory does not necessarily imply that storage in the swap file is used.

The way it is relevant to TBB is that when running the pipeline with multiple parallel cores, each task requests its own memory for processing the most complex line. I do need to check that there is enough VM for, say, 8 cores.

A snippet from my processing log:

Starting Multi-threading with 8 CPU cores. Current VM: 1811.5 MB.
Pipeline started. Current VM avail now: 1740.6 Mb.
Pipeline Finished.
Core 0: 81 Mb max on line: 13144
Core 1: 81 Mb max on line: 13137
Core 2: 81 Mb max on line: 13138
Core 3: 81 Mb max on line: 13139
Core 4: 81 Mb max on line: 13140
Core 5: 81 Mb max on line: 13141
Core 6: 81 Mb max on line: 13142
Core 7: 81 Mb max on line: 13143

The next job (repeat of same) that comes in results in the following, no longer starting at 1811.5 MB free:

Starting Multi-threading with 8 CPU cores. Current VM: 1658.4 MB.
Pipeline started. Current VM avail now: 1587.6 Mb.
Pipeline Finished.
Core 0: 81 Mb max on line: 13144
Core 1: 81 Mb max on line: 13137
Core 2: 81 Mb max on line: 13138
Core 3: 81 Mb max on line: 13139
Core 4: 81 Mb max on line: 13140
Core 5: 81 Mb max on line: 13141
Core 6: 81 Mb max on line: 13142
Core 7: 81 Mb max on line: 13143

Other routines also need temporary virtual memory, even up to 1 GB or more.
Some killer jobs may require 130 MB max for each line to be processed, i.e. per core.
Therefore, to degrade gracefully, I periodically check the amount of available virtual memory
and, if there is too little left for other functions, I (re)start the pipeline specifying fewer cores.

If I'm unable to give back address space after it's used, then it's possible that I'll have to downshift to fewer cores due to low available VM space, because the concurrent queue temporarily touched a lot of memory even though all that memory was freed up.

This whole concept of using up address space is weird to me. I ask for memory and do give it back. I should be able to do that any number of times and, with the exception of fragmentation concerns, the system should have as much memory and be left in the same state as before.

This is not a direct problem of TBB, though. My pipeline solution runs fast at the expense of needing VM per core.
I'm checking whether there are other resources regarding this one-way using-up of address space.

Thanks for responding.
Mitch




0 Kudos
RafSchietekat
Valued Contributor III
593 Views
I guess I still don't understand your concern about reported available VM, because immediately returning all memory may not be beneficial for performance. Is there a more relevant diagnostic, like available VM at maximum allocation? Have you actually observed premature failure to allocate enough memory (try to provoke it)?
0 Kudos
turks
Beginner
593 Views
Quoting - Raf Schietekat
I guess I still don't understand your concern about reported available VM, because immediately returning all memory may not be beneficial for performance. Is there a more relevant diagnostic, like available VM at maximum allocation? Have you actually observed premature failure to allocate enough memory (try to provoke it)?
I have observed premature failure, though only by forcing the issue: allowing the concurrent queue to get 10,000 lines ahead of processing, thereby using up much more VM address space.

I think I may have hit on something that appears to work well.

Using VirtualAlloc/Free instead of scalable_malloc/free does appear to de-commit the virtual address space. Hooray.

The only thing is that the VirtualFree calls are done in parallel by many tasks; they happen in the transform filter of my pipeline, which is a parallel stage. For that reason I would prefer to use Intel's scalable_malloc. Also, scalable_malloc deals with small sizes much better than VirtualAlloc, which has a minimum allocation of the page size, or 4 KB.
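In outline, the VirtualAlloc path amounts to a pair of wrappers like this (a sketch only; error handling omitted):

#include <windows.h>

void* AllocLineBuffer(SIZE_T bytes)
{
    // Reserve and commit in one step; the committed size is rounded up to the page size
    // and the reservation to the allocation granularity.
    return VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

void FreeLineBuffer(void* p)
{
    // With MEM_RELEASE the size must be 0 and p must be the base address returned by
    // VirtualAlloc; this returns the whole range to the available address space.
    VirtualFree(p, 0, MEM_RELEASE);
}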

I hope (and will try to find out) that VirtualAlloc/Free is thread-safe!
0 Kudos
RafSchietekat
Valued Contributor III
593 Views
I still don't see the problem, but I'm happy that you're happy.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
593 Views
Quoting - turks
Thanks, Raf. I did try using regular malloc/free and the same thing happens. Memory is returned. The next malloc returns memory at a new and different address.

Are you running the debug or release version of the program? You have to check such things in a release build! Some memory allocators 'conserve' freed memory in order to simplify debugging.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
593 Views
Quoting - turks
Also, scalable_malloc deals with small sizes much better than VirtualAlloc, which has a minimum allocation of the page size, or 4 KB.

I would bet that it's actually 64 KB. Check the system allocation granularity parameter (GetSystemInfo()). If you are allocating 4 KB blocks, you are just wasting 60 KB of your process's virtual address space.
You have to first reserve blocks of AllocationGranularity size, and then commit memory from them in blocks of PageSize size.
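In outline it looks roughly like this (a sketch only; error checking omitted):

#include <windows.h>

void Example()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);   // dwAllocationGranularity is typically 64 KB, dwPageSize typically 4 KB

    // Reserve one granularity-sized region of address space (no physical memory committed yet).
    char* region = (char*) VirtualAlloc(NULL, si.dwAllocationGranularity,
                                        MEM_RESERVE, PAGE_NOACCESS);

    // Commit just the first page when it is actually needed...
    VirtualAlloc(region, si.dwPageSize, MEM_COMMIT, PAGE_READWRITE);

    // ...decommit it when done (the address space stays reserved)...
    VirtualFree(region, si.dwPageSize, MEM_DECOMMIT);

    // ...and finally release the whole reservation to give the address space back.
    VirtualFree(region, 0, MEM_RELEASE);
}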

0 Kudos
turks
Beginner
593 Views
Quoting - Raf Schietekat
I still don't see the problem, but I'm happy that you're happy.
Test1:
Start my process.
Call GlobalMemoryStatus to get available VM. (returns 1700MB)
Run my process using scalable_malloc and finally scalable_free.
Call GlobalMemoryStatus to get available VM. (returns 1400MB)
After freeing all memory via scalable_free I have 300MB less virtual memory space.

Test2:
Start my process.
Call GlobalMemoryStatus to get available VM. (returns 1700MB)
Run my process using VirtualAlloc and finally VirtualFree.
Call GlobalMemoryStatus at the end of the run to get available VM. (returns 1700MB)
After freeing all memory via VirtualFree I have not lost ANY virtual memory space.
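For what it's worth, a stripped-down version of the comparison looks something like this (a single allocate/free loop standing in for the real job; sizes and counts are only illustrative):

#include <windows.h>
#include <stdio.h>
#include "tbb/scalable_allocator.h"

static unsigned AvailVirtualMB()
{
    MEMORYSTATUS ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatus(&ms);
    return (unsigned)(ms.dwAvailVirtual / (1024 * 1024));
}

int main()
{
    printf("Before: %u MB of address space available\n", AvailVirtualMB());

    for (int i = 0; i < 300; ++i) {                  // "Test 1": scalable_malloc / scalable_free
        void* p = scalable_malloc(1024 * 1024 + i);
        scalable_free(p);
    }
    printf("After scalable_malloc loop: %u MB available\n", AvailVirtualMB());

    for (int i = 0; i < 300; ++i) {                  // "Test 2": VirtualAlloc / VirtualFree
        void* p = VirtualAlloc(NULL, 1024 * 1024 + i, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        VirtualFree(p, 0, MEM_RELEASE);
    }
    printf("After VirtualAlloc loop:    %u MB available\n", AvailVirtualMB());
    return 0;
}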

0 Kudos
RafSchietekat
Valued Contributor III
593 Views
"After freeing all memory via scalable_free I have 300MB less virtual memory space."
Don't worry, be happy, as long as that 300 MB is tapped again during the next round (test 3: do test 1 repeatedly, at least 3 times, and probe both at minimum and maximum allocation)? Also, you haven't addressed Dmitriy Vyukov's very relevant point in #8. But you should probably move this discussion to a Microsoft Windows-related forum (a short debriefing afterwards is still welcome), because I don't see anything here related to TBB (in test 1, the requests are simply passed on to malloc() with negligible relative overhead at 81 MB, so you might as well go straight to malloc()).
0 Kudos
turks
Beginner
593 Views
Quoting - Dmitriy Vyukov

Are you running the debug or release version of the program? You have to check such things in a release build! Some memory allocators 'conserve' freed memory in order to simplify debugging.

Release version. I've observed different initialization in debug versions, but I didn't know some freed memory might be secretly saved.
0 Kudos
turks
Beginner
593 Views
Quoting - Raf Schietekat
"After freeing all memory via scalable_free I have 300MB less virtual memory space."
Don't worry, be happy, as long as that 300 MB is tapped again during the next round (test 3: do test 1 repeatedly, at least 3 times, and probe both at minimum and maximum allocation)? Also, you haven't addressed Dmitriy Vyukov's very relevant point in #8. But you should probably move this discussion to a Microsoft Windows-related forum (a short debriefing afterwards is still welcome), because I don't see anything here related to TBB (in test 1, the requests are simply passed on to malloc() with negligible relative overhead at 81 MB, so you might as well go straight to malloc()).
Putting the same job in a loop shows that the memory does get re-used; i.e. there is no continued address space loss for each pass of the loop.

The latest is that using VirtualAlloc/Free exclusively does solve this problem, though it is slightly slower for small allocations, and I really do not know if it is thread-safe.

I will take this to a MS Win forum and report back if there's an answer.
Thanks to all.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
593 Views
Quoting - turks
I really do not know if it is thread-safe.



It is definitely thread-safe! You can't crash a critical component of the OS (the memory manager) by calling an API function from several threads. :)

0 Kudos
turks
Beginner
593 Views
Quoting - Dmitriy Vyukov


It is definitely thread-safe! You can't crash a critical component of the OS (the memory manager) by calling an API function from several threads. :)

Dmitriy,
Thanks. Glad to hear it stated with such certainty. We are testing with VirtualAlloc in place. Tnx.
0 Kudos