Hi High end c.,
Can you show/share the code that you are using to read these files?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, essentially we mimic 2D arrays throughout (to best represent the physics). First, we malloc in a helper function:

int *data = (int *) malloc(rows*cols*sizeof(int));
int **array = (int **) malloc(rows*sizeof(int *));
int i;
for (i = 0; i < rows; i++) {
    array[i] = &(data[cols*i]);   /* point each row at its slice of the contiguous block */
}
return array;
Then we read:
for (i = 0; i < rows; i++) {
    for (j = 0; j < cols; j++) {
        fscanf(file, " %d ", &array[i][j]);
    }
}
This is repeated for both files, and the time in question covers the malloc as well as the read.
>> the time in question covers malloc as well as the read
There is a 3rd time in there as well. This is the "first touch" time that it takes for the O/S to map an address range of a page of virtual memory to physical memory (and page file) the first time you write (or read) addresses within the page.
The KNL is relatively slow as a scalar CPU (~1.3GHz), on the order of 1/2 to 1/3 the clock of your Core i7. To improve performance (if this really is an issue for you), I suggest you parallelize the array read using a parallel pipeline technique. One thread reads the text data in binary chunks, locates the last line terminator in the chunk buffer, and passes the front portion of the buffer into the pipeline. The reader thread then reads the next chunk following the first, again locating the last line terminator, but this time passing the span from the prior line terminator to the current last terminator into the pipeline. After the last of the # chunks you configured for, you copy the residual data (that following the last line terminator in the last chunk buffer) to a reserved area that precedes the first chunk buffer (permitting the span sent into the pipeline to include an entire input section).
The degree of parallelization depends on the number of chunk buffers (and available threads). Your chunk buffer sizes should be multiples of the disk sector size (512, or 2048, or ???); you determine this, as it will vary depending on the physical disk. Make the chunk buffer size a sufficient number of sectors to accommodate at least one line of input. This assumes that you wish to partition at the line breaks.
Jim Dempsey
BTW, I forgot to mention: when you get to the point of reusing a chunk buffer, you must assure that the chunk buffer is no longer in use (IOW you need a flag or other means of control to avoid this).
Jim Dempsey
Thanks for the feedback!
If I parallelise for the Phi then my Xeon would also go faster, which is generally welcome. But it does mean there will still be some relative difference in read-from-file times between the two architectures :(
yours, M
High...
The problem isn't simply the (disk) read time; your problem is compounded by your choice of implementation: fscanf(file, " %d ", &a)
What this does is:
loop:
read raw (binary) data into an internal buffer; the buffer size may have been specified on fopen, but more likely takes the default, possibly 1 sector (512 bytes)
character-by-character copy from this buffer to an up-level buffer, through to the next token delimiter (space, comma, newline, ...), copying ~3-10 bytes into the up-level buffer
convert the token to numeric (%d)
insert into a
end loop
The pipelining method reads at least one line's worth of data into the (next) buffer (an entire line is either in this buffer, or spans from the tail end of the prior buffer through some point in this buffer).
Data is not copied to an up-level buffer; rather, it is parsed in place, repeating: scan to the next digit, convert with atoi or atoll, insert into a.
Note, the above will still experience the "first touch" issue, but this can be resolved by (after allocation) launching an independent thread (OpenMP task) to walk the allocation in page-size steps, plopping something (e.g. a null character) into the buffer. A progress indicator should be used to assure that the pipeline tasks do not overtake the touch-memory task (this task is only to be used at the first allocation of that range of addresses from your heap). The touch-memory task can run concurrently with the conversion tasks as long as the output from the conversion tasks does not overtake the touch-memory task.
The point being of all this is to get (parallel) overlap between raw disk reads and the processing to convert the text to numeric form.
You purchased the KNL to get the benefit from highly parallel programming, this is a good candidate to start with. Once you get your first parallel pipeline written, you can adapt it for other purposes (different text to numeric conversions, output from binary to text). For this specific case, 6 or 10 threads may be optimal (1 reader, 1 touch, 4 or 8 convert). Your number of buffers should exceed the number of conversion threads by some number (for you to determine). Do not take the lazy way and read the entire input file as you get no advantage of overlapping the reads with the processing (you also waste memory).
Jim Dempsey
What are your values for nrows and ncols? (and representative data values)
Jim Dempsey
ncols is 1000s or 10s-of-1000s whereas ncols is ~100. Data vals are all positive ints, below about 300.
I'm aware there's also sparsity so that's another level of opt to come.
Currently it's the ratio of KNL/i7 that is of interest. I can see that for a reasonably deep input pipeline (i.e. more steps (threads) than on this i7 (12, since one is a hexacore with HT switched on)) gains can be had :)
yours, M
>>ncols is 1000s or 10s-of-1000s whereas ncols is ~100
Which one of those is rows?
If the first is rows, each input line would be approximately 400 characters, so a pipeline buffer size of 4096 might be effective. For testing, though, you would have 4 tuning knobs:
1) buffer size (multiple of 512)
2) Number of processing threads (exclusive of reader and first touch threads)
3) Number of buffers in excess of number of processing threads
4) Selector as to where test data resides (SSD or HD)
Jim Dempsey
Sergey Kostrov wrote:
It is not clear in what MCDRAM and Cluster modes the KNL system is set.
Thanks. I tried MCDRAM both as cache and as flat - no noticeable difference in time for the above (for 2 files, not just the 1 given). Cluster mode is quadrant in all cases.
Yours, M