Software Archive
Read-only legacy content

Very slow read from file

high_end_c_
Beginner
Hi, comparing code on i7 boxes with HDDs against two different KNL systems (at least one of which has an SSD), I see that the time for the former to read two 2.6 GB files line by line is <20 seconds, but for the KNL systems it is 70-90 seconds. Can anybody explain, e.g. is it that KNL reads from disk over internal PCIe? Yours, M
gaston-hillar
Valued Contributor I

Hi High end c.,

Can you show/share the code that you are using to read these files?

high_end_c_
Beginner

Hi, essentially we mimic 2d arrays throughout (to best represent the physics)... Firstly, we malloc in a helper function:

  int *data = (int *) malloc(rows*cols*sizeof(int));
  int **array = (int **) malloc(rows*sizeof(int *));
  int i;
  for (i=0; i<rows; i++) {
    array[i] = &(data[cols*i]);
  }
  return array;

Then we read:

  for (i = 0; i < rows; i++) {
    for (j = 0; j < cols; j++) {
      fscanf(file, " %d ", &a);
    }
  }

This is repeated for both files and the time in question covers malloc as well as the read.

jimdempseyatthecove
Honored Contributor III

>> the time in question covers malloc as well as the read

There is a 3rd time in there as well. This is the "first touch" time that it takes for the O/S to map an address range of a page of virtual memory to physical memory (and page file) the first time you write (or read) addresses within the page.

The KNL is relatively slow as a scalar CPU (~1.3GHz), on the order of 1/2 to 1/3 the speed of your Core i7. To improve performance (if this is really an issue for you), I suggest you parallelize the array read using a parallel pipeline technique. One thread reads the text data in binary chunks, locates the last line terminator in the chunk buffer, then passes the front portion of the buffer into the pipeline. The reader thread then reads the next chunk following the first, again locating the last line terminator, but this time passing the span from the prior line terminator to the current last terminator into the pipeline. After the last of the # chunks you configured for, you copy the residual data (that following the last line terminator in the last chunk buffer) to a reserved area that precedes the first chunk buffer (permitting the span sent into the pipeline to include an entire input section).

The degree of parallelization depends on the number of chunk buffers (and available threads). Your chunk buffer sizes should be in multiples of the disk sector size (512, 2048, or ???); you determine this, as it will vary depending on the physical disk. Make the chunk buffer size a sufficient number of sector sizes to accommodate one line of input. This assumes that you wish to partition at the line breaks.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

BTW I forgot to mention, when you get to the point of reusing a chunk buffer you must assure that the chunk buffer is no longer in use (IOW you need a flag or other means of control to avoid this).

Jim Dempsey

high_end_c_
Beginner

Thanks for the feedback!

If I parallelise for Phi then my Xeon would also go faster, which is generally welcome. But it does mean there is still some relative difference in read-from-file times between the two architectures :(

 

yours, M

jimdempseyatthecove
Honored Contributor III

High...

The problem isn't simply the (disk) read time, your problem is compounded by your choice of implementation: fscanf(file," %d ",&a);

What this does is:

loop:
    read raw (binary) data into an internal buffer; the buffer size may have been specified on fopen, but more likely takes the default, possibly 1 sector (512 bytes)
    copy character by character from this buffer to an up-level buffer, through to the token delimiter (space, comma, newline, ...), copying ~3 to 10 bytes into the up-level buffer
    convert the token to numeric (%d)
    insert into a; when "first touching" the next page, incur a page fault into the OS to map a physical RAM & page-file page into process virtual memory
end loop

The pipelining method reads at least one line's worth of data into the (next) buffer (an entire line is either in this buffer, or spans from the tail end of the prior buffer through some point in this buffer).
Data is not copied to an up-level buffer; rather, it is parsed in place: repeat (scan to digit, use atoi or atoll, insert into a, advance over the digits).

Note, the above will still experience the "first touch" issue, but this can be resolved by (after allocation) launching an independent thread (OpenMP task) to walk the allocation in page-size steps, plopping something (e.g. a null character) into the buffer. A progress indicator should be used to assure that the pipeline tasks do not overtake the touch-memory task. (This task is only to be used at the first allocation of that range of addresses from your heap.) The touch-memory task can run concurrently with the conversion tasks as long as the output from the conversion tasks does not overtake the touch-memory task.

The point of all this is to get (parallel) overlap between the raw disk reads and the processing that converts the text to numeric form.

You purchased the KNL to get the benefit of highly parallel programming; this is a good candidate to start with. Once you get your first parallel pipeline written, you can adapt it for other purposes (different text-to-numeric conversions, output from binary to text). For this specific case, 6 or 10 threads may be optimal (1 reader, 1 touch, 4 or 8 convert). Your number of buffers should exceed the number of conversion threads by some number (for you to determine). Do not take the lazy way and read the entire input file at once, as you then get no advantage from overlapping the reads with the processing (you also waste memory).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

What are your values for nrows and ncols? (And representative data values?)

Jim Dempsey

high_end_c_
Beginner

ncols is 1000s or 10s-of-1000s whereas ncols is ~100. Data vals are all positive ints, below about 300.
 

I'm aware there's also sparsity so that's another level of opt to come.

Currently it's the ratio of KNL/i7 that is of interest. I can see that for a reasonably deep input pipeline (i.e. more steps (threads) than on these i7s, which have 12 since one is a hexacore with HT switched on) gains can be had :)

 

yours, M

jimdempseyatthecove
Honored Contributor III

>>ncols is 1000s or 10s-of-1000s whereas ncols is ~100

Which one of those is rows?

If the first is rows, each input line would approximate 400 characters. Using a pipeline buffer size of 4096 might be effective. Though for testing you would have 4 tuning knobs:

1) buffer size (multiple of 512)
2) Number of processing threads (exclusive of reader and first touch threads)
3) Number of buffers in excess of number of processing threads
4) Selector as to where test data resides (SSD or HD)

Jim Dempsey

SergeyKostrov
Valued Contributor II
It is not clear in what MCDRAM and Cluster modes the KNL system is set.

>>...read two 2.6 GB files line by line is <20 seconds but for the KNL systems it is 70-90 seconds. Can anybody explain eg is it that KNL reads from disk over internal PCIe?

My next comment is not related to the problem reported but I'd like to inform you that:
- If in a Flat MCDRAM mode a buffer size is greater than 16GB, or
- If in a Hybrid-50-50 MCDRAM mode a buffer size is greater than 8GB,
a performance impact could happen when doing memory-bound operations.
SergeyKostrov
Valued Contributor II
>>...The KNL is relatively slow as a scalar CPU (~1.3GHz). On the order of 1/2 to 1/3rd that of your Core i7...

That is true only for cases when the number of OpenMP threads used in processing on the KNL system is equal to or less than the number of OpenMP threads used on a Core i7 system. For example:
- If OpenMP processing ( 4 OpenMP threads ) is done on an Ivy Bridge system ( 2.7 GHz ) with 4 hardware threads and processing times are compared to OpenMP processing ( also 4 OpenMP threads ) on a KNL system ( 1.3 GHz ) with 64 hardware threads, then the Ivy Bridge system will outperform the KNL system by ~2.1x ( 2.1 = 2.7 GHz / 1.3 GHz ).
- If all 64 hardware threads of the KNL system are used in the processing, and KMP_AFFINITY is set to scatter, then the KNL system outperforms the Ivy Bridge system by at least 16x.
high_end_c_
Beginner

Sergey Kostrov wrote:

It is not clear in what MCDRAM and Cluster modes the KNL system is set.

 

Thanks. I did have MCDRAM as cache and as flat - no noticeable difference in time for the above (for 2 files, not just the 1 given). Cluster is quadrant in all cases.

 

Yours, M
