To efficiently compute the intersection (common elements) of two vectors of integers, I found that only one of the two output masks returned by the AVX512-VP2INTERSECT instructions is needed. I wrote an emulation of the VP2INTERSECT instructions and, with some work, found that it could be made faster than the native version. I thought this could be useful to others, so I wrote it up in the following preprint:
I wanted to ask whether this (computing the intersection of vectors of integers) is the intended, or at least the most common, use of these instructions. (Note that if the vectors are not sorted, you still need only the first output mask.) If it is, then this could be useful for two reasons:
1) developers could use the faster emulated version,
2) Intel could either save the silicon area used for VP2INTERSECT, or possibly optimize its microcode based on the above preprint.
I would appreciate feedback in case I am mistaken and should revise the paper.
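To make the single-mask point concrete, here is a minimal scalar sketch in C. It is not the emulation from the preprint and uses no intrinsics; the names `LANES`, `first_mask` and `intersect_block` are hypothetical, and the model assumes that bit i of the first mask is set when element i of the first vector equals some element of the second. Under that assumption, the first mask alone is enough to extract the common elements:

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumed: one 512-bit vector of 32-bit integers */

/* Scalar model of the first output mask only: bit i is set when a[i]
   equals some element of b. Hypothetical helper, not an intrinsic. */
static uint16_t first_mask(const int32_t a[LANES], const int32_t b[LANES]) {
    uint16_t k1 = 0;
    for (int i = 0; i < LANES; i++) {
        for (int j = 0; j < LANES; j++) {
            if (a[i] == b[j]) {
                k1 |= (uint16_t)(1u << i);
                break; /* one match is enough to set the bit */
            }
        }
    }
    return k1;
}

/* Copy the elements of a that also occur in b to out, preserving their
   order (a scalar stand-in for a masked compress). Returns the count. */
static size_t intersect_block(const int32_t a[LANES], const int32_t b[LANES],
                              int32_t out[LANES]) {
    uint16_t k1 = first_mask(a, b);
    size_t n = 0;
    for (int i = 0; i < LANES; i++) {
        if (k1 & (1u << i)) {
            out[n++] = a[i];
        }
    }
    return n;
}
```

The second mask never appears: the common values are read out of the first vector, so their positions in the second vector are not needed.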
Note that I *can* think of one case where both output masks returned by the VP2INTERSECT instructions are needed: removing the common elements from two vectors of integers. In this case one needs both masks to locate the common elements in each of the two vectors. I have personally not had a need to do this, but it may have been an intended use case.
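The two-mask case can be sketched the same way. This is again a scalar model with hypothetical names (`both_masks`, `keep_unmatched`), assuming that the first mask flags matched positions in the first vector and the second mask flags matched positions in the second vector:

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumed: one 512-bit vector of 32-bit integers */

/* Scalar model of both output masks: bit i of *k1 is set when a[i]
   occurs in b, and bit j of *k2 is set when b[j] occurs in a.
   Hypothetical helper, not an intrinsic. */
static void both_masks(const int32_t a[LANES], const int32_t b[LANES],
                       uint16_t *k1, uint16_t *k2) {
    *k1 = 0;
    *k2 = 0;
    for (int i = 0; i < LANES; i++) {
        for (int j = 0; j < LANES; j++) {
            if (a[i] == b[j]) {
                *k1 |= (uint16_t)(1u << i);
                *k2 |= (uint16_t)(1u << j);
            }
        }
    }
}

/* Copy the elements of v whose mask bit is clear (the unmatched ones)
   to out, preserving their order. Returns the count. */
static size_t keep_unmatched(const int32_t v[LANES], uint16_t matched,
                             int32_t out[LANES]) {
    size_t n = 0;
    for (int i = 0; i < LANES; i++) {
        if (!(matched & (1u << i))) {
            out[n++] = v[i];
        }
    }
    return n;
}
```

Calling `keep_unmatched` once with the first mask and once with the second strips the common elements from each vector, which is why this use case cannot drop either mask.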
Thank you for the assistance.