Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Find cores of thread

Abdulsalam__Babu
Beginner

Hi

I'm new to TBB. I have a file-reading function: it reads a file and fills in the structure passed to it as a parameter. Reading one file X takes ~15 secs. I'm trying to read it 4 times in a parallel_for loop. I was expecting each read operation to run in a separate thread, with each thread on a different core [my system has 12 physical cores and 24 logical cores], so the total should still be ~15 secs.

If I read it 4 times in a normal for loop, it takes ~57 secs. With parallel_for, it takes ~41 secs.

How can I check that the threads are running on different cores?

Is there a tool for this? How about Intel Concurrency Checker? I could not find a download for it either.

Looking for your feedback.
8 Replies
Aleksei_F_Intel
Employee

Hi Babu,

TBB is a library that emphasizes working with algorithms rather than low-level details such as thread management, work-to-thread mapping, etc. Usually, the user tells the library that there is a piece of work that can be done in parallel, and the library extracts as much parallelism from it as possible, limited by the number of cores the workstation has, the number of threads the library uses (by default, equal to the number of logical CPUs), work-item dependencies, etc.

In your particular case, the task seems to be completely I/O-bound. It means that the CPU is mostly idle while waiting for I/O operations to complete. Even after a file has been read, the TBB worker thread does not compute anything (i.e. does not utilize the CPU), but only copies/moves the read data from one memory location to another (fills in the structure), so the task becomes memory-bound. Therefore, I would not expect the CPU to be used much in this case.

Usually, in I/O-bound scenarios the user provides dedicated thread(s) that play(s) a producer role by reading input data and providing it for crunching to other threads, the consumers.

As for a tool, you can try Intel(R) VTune(TM) Amplifier with its "Analyze User Events API" feature (available as a check-box option during analysis setup). After the analysis completes, a "Platform" tab with a per-thread timeline should be shown.

Regards, Aleksei

Abdulsalam__Babu
Beginner

Thanks Aleksei for your response.

I have a few questions about your response:

You mentioned: "the task seems to be completely I/O-bound. It means that the CPU is mostly idle while waiting for I/O operations to complete."

What is meant by the CPU being mostly idle? Why is it idle?

You also wrote: "Usually, in I/O-bound scenarios the user provides dedicated thread(s) that play(s) a producer role by reading input data and providing it for crunching to other threads, the consumers."

How do you do this? Do you have an example? If you can provide a link, it would be very helpful.

I'll try Intel(R) VTune(TM) Amplifier.

Regards

Babu

Abdulsalam__Babu
Beginner

I would like to know the impact of doing file I/O from multiple threads.

Say I have multiple files to read/write. If I do the reading/writing in a parallel_for (each thread reading/writing an individual file), does the file I/O have any performance impact when done from multiple threads?

Vladimir_P_1234567890

Hello,

It would be good to know which drive you use, since you mention I/O from multiple threads:

If you use a SATA spinning HDD, you might just lock your system up in I/O, depending on the number and sizes of the files.

If you use a SATA SSD, you might see serialization on the serial SATA I/O queue, and your 1.4x speed-up might not be bad, depending on the number and sizes of the files.

If you use an NVMe SSD drive with multiple parallel I/O queues, scalability might be much better than 1.4x.

If you use Windows, you can open Resource Monitor to check how the I/O is bounded.

cheers,
--Vladimir

Abdulsalam__Babu
Beginner

Thanks Vladimir, but I could not understand completely.

"If you use a SATA spinning HDD, you might just lock your system up in I/O, depending on the number and sizes of the files."

I didn't get this.

"If you use a SATA SSD, you might see serialization on the serial SATA I/O queue"

Do you mean parallel read/write is not possible here?

Let me ask it this way: if I do file reads/writes in multiple threads (each thread reading/writing a separate file), then while one thread is doing (is scheduled for) a job, will the other threads be blocked by I/O? Or are the other threads free to do their reads/writes?

Regards

Babu
Vladimir_P_1234567890

Hello, let me rephrase the question: which interface does your program target, SATA or NVMe?

If you target SATA, which stands for "Serial ATA", then you should expect all your I/O threads to be blocked by serial I/O operations. There might be TBB design patterns for such I/O, but I'll let the TBB folks respond to that.

If you target NVMe (https://en.wikipedia.org/wiki/NVM_Express), it is much more friendly to parallel operations.

Regarding SATA HDD vs. SATA SSD: there is a big difference in response time (https://en.wikipedia.org/wiki/Solid-state_drive#Hard_disk_drives). If you fill the disk queue with several different files concurrently, the response time for a SATA HDD will grow more than for a SATA SSD, and the increased response time will slow down your I/O operations. If you want to play with the data, you can start several disk benchmarks in parallel, for example several instances of CrystalDiskMark.

--Vladimir

Abdulsalam__Babu
Beginner

My hard drive is a SATA HDD. So you mean that if one thread is doing file I/O, then all the other threads doing file I/O will be blocked, even if they are running on different cores?

And a SATA HDD's response time is slower than a SATA SSD's when there are multiple file I/O operations. Am I correct?

Vladimir_P_1234567890

Abdulsalam, Babu wrote:

My hard drive is a SATA HDD. So you mean that if one thread is doing file I/O, then all the other threads doing file I/O will be blocked, even if they are running on different cores?

There might be OS caching mechanisms and other ways to improve throughput, so you may see some scaling. But "blocking I/O" plus "Serial ATA" mean exactly this: if you do I/O from different threads on different cores, the cores might stay busy with other tasks, but your threads will wait until the I/O is available to them.

Abdulsalam, Babu wrote:

And a SATA HDD's response time is slower than a SATA SSD's when there are multiple file I/O operations. Am I correct?

Right; an SSD does not have the overhead of platter rotation and head positioning.

But again, for example, my home HDD has a 128 MB RAM buffer, so if "multiple I/O" fits within that buffer size, the I/O should be fast enough. Other drives might have an 8 MB RAM buffer, or an 8 GB SSD cache, so you will see different I/O behavior there.
