Please post comments with your #xpublog thoughts, feedback, and suggestions

James_Reinders · 08-10-2021

Together we can make this a great place to discuss my blogs and videos, share feedback, and offer suggestions for future blogs and articles (I love input!). I will generally write about software development for XPUs (heterogeneous computing), including oneAPI, SYCL, DPC++, C++, performance, and parallel programming. Don't be surprised to see Python and Fortran in the mix; they are both great tools.

Please do not request support here. Please post support-related questions at https://community.intel.com/t5/Intel-oneAPI-Toolkits/ct-p/oneapi?emcs_t=S2h8ZW1haWx8dG9waWNfc3Vic2Ny... where our support engineers will review them and help you. Of course, if you do not get enough attention there - I do want to know about it (after I get over my disappointment, I will help figure things out if the support forum isn't working).

In my first xpublog post, I discuss three things:

what the heck is an XPU - and why should I care?
what is SYCL, and how can it be used to make our neurotic kitten lighten up (you won't want to miss that)?
how can I try it out for myself using the Intel DevCloud for oneAPI (all free, easy, and very cool)?

I appreciate respectful technical discussions in this community a great deal. You won't find me thrilled if any discussion gets off track in that respect. Technical debate is great - it makes us all stronger. Let's have fun in our discussions while respecting each other.

I look forward to your thoughts, feedback, and a good discussion.

James Reinders

Twitter JamesReinders
LinkedIn https://www.linkedin.com/in/jamesreinders/
Community Forum https://community.intel.com/t5/James-Reinders-Blog/bg-p/james-reinders-blog?emcs_t=S2h8ZW1haWx8dG9wa...
DPC++ Book download https://link.springer.com/book/10.1007%2F978-1-4842-5574-2

James_Reinders · 08-26-2021

This is embarrassing - comments have been auto-deleted by a spam-deleting-bot gone wild.

Some technical glitches annoy me more than others, and this one really annoys me.

I apologize to all of you who have posted comments, many of which I replied to. Of course, our team is looking to restore them - but it appears they might not be able to, because they were not 'hiding' suspected spam, they were deleting it. (Yes - I've given some feedback on the brilliance of that - I hope I wasn't too harsh, we all make mistakes.) Their algorithm went crazy, and you can guess the rest.

I will update this post when conditions change - and I look forward to your comments.

BTW - we do get slammed with SPAM, dozens a day on a blog like this... so aside from being very annoyed this spam-deleting-bot went crazy, I hate spammers and what they do to our lives.

Let me share this interesting post and question, from @philiprbrenan, that I happened to keep a copy of before the bot got it:

This alphabetic references sound like marketing to me. Please could you back up your statements with more and better details? A smattering of entertaining mathematics that might be enjoyable? Avoiding the manufacture and promulgation of yet more acronyms? Perhaps you could either simplify and generalize or focus in and expand rather than occupying the less interesting alphabetic middle ground? The future is difficult to predict: what will you do if your predictions fail to materialize? Make another prediction? The: πpu ? You heard it here first.

First of all - I love the πPU. You definitely said it first!

Yes - XPU can be used as a marketing term, but I prefer two things:

(1) XPU as a technical term... to talk about the technical challenge of abstractions where there is a cut line between "what the developer writes/programs" and "what the runtime and hardware handle." When we start thinking about how much we can support such abstractions - we get into really interesting conversations. I believe we need to have abstractions that allow the freedom to use any XPU, and allow full performance, and allow confidence in a stable/portable approach. We can do a lot to get this today, but ultimately a lot hinges on how similar various XPUs are. Abstracting various GPUs can be done at a lower level (like C++) than abstracting GPUs and FPGAs (which probably needs a domain-specific language to avoid refactoring). However, even in the latter case a lot can be done at the C++ level to make offloading to both GPUs and FPGAs easier. Lots more to discuss. I'm planning a blog (think October) about this abstraction - with concrete advice about what we know are the best practices today.

(2) XPU is a metaphor. It's a metaphor for anything we can offload computation to.

Cheers!

- james

JNorw · 09-09-2021

I attended a webinar today on advanced dpc++, and ran across a surprise.

There are several forms of wait() associated with queue submits. Surprisingly, they all do the same thing.

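queue Q;
Q.wait();                                          // form 1: wait on the queue
auto ev1 = Q.submit([&](handler &h) { /* .. */ });
ev1.wait();                                        // form 2: wait on the returned event
Q.submit([&](handler &h) { /* .. */ }).wait();     // form 3: wait chained onto the submit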

All of these wait for all prior submits to the same Q to complete before continuing.

I would have guessed that one or two of those forms would wait only for the completion of the current associated kernel.

It appears you would need a separate queue to wait for a specific single kernel submit to complete.

James_Reinders · 09-09-2021

@JNorw

That is an astute observation - a wait operation applies to a queue, and waits for all submissions to a queue to have completed.

An individual submit operation, to a queue, returns an event object for which there is no wait operation.

You are correct that using multiple queues to a single device is the solution to achieve what you want.

I have a blog coming out next week - where I cover 'why you might use multiple queues to a single device.'

I will admit this was not on my list, until now. Thank you for pointing this out!

I'll add a link to the blog here, when I post it.

JNorw · 09-10-2021

I had asked a question about limits on the number of queues at the recent advanced dpc++ webinar ... whether it was tied to any hardware limit, for example.

I went ahead and created a test case to create a number of queues. While I successfully created 100 and used them, I found that adding queues one at a time is apparently taking a long time, relative to the execution time of the kernel. That's just my guess for what is happening.

Is there some more efficient way to initialize the task pool so that adding queues doesn't cause this overhead?

Now that I think about this some more ... is this perhaps the JIT compilation time times 100 for the kernels?

After looking for JIT delay comments in the dpc++ site, I see that 140ms per kernel is not unusual, with the proposed solution being to use AOT compilation. I don't recall seeing an AOT compilation option in the cmake options for the book. It looks like this issue could be discussed along with the separate queues discussion, since it appears that each kernel in a separate queue could result in another 140ms of JIT time.

There is apparently also some longer application exit time associated with more separate queues, although I didn't look into it more.

I don't see a way to upload code here, so I uploaded the question on the dpc++ site, https://community.intel.com/t5/Intel-oneAPI-Data-Parallel-C/Long-overhead-for-initial-use-of-multiple-queues/m-p/1313720#M1546

JNorw · 10-01-2021

Nice queues video, and the info about host_device type going away is helpful.

I'm interested to see how Intel's recently announced IPU ideas will be integrated into oneAPI and dpc++.

I see Sapphire Rapids' support for user-level interrupts described in the latest architecture manual. Will this enable new features in oneAPI?