New instruction proposal: memcmp-like int compare

Daniel_F_ · ‎08-19-2017

One of most used functions in our app is one which works like C memcmp function for various data types. For int types up to 32 bits it is enough to cast them to wider type and subtract; for int64 we have to use expression "a < b ? -1 : a > b". Recently I looked for some dedicated instruction which provide such functionality, but did not find any. Please add new instruction which will compare two ints and return negative/zero/positive value.

jimdempseyatthecove · ‎08-20-2017

Consult the Intel Intrinsics Guide https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&cats=Compare&expand=651

Check the Compare box. You have vector compares, returning a mask, for 8, 16, 32 and 64 bit lanes.

With those, you can write your own vmemcmp-like int compar function.

Jim Dempsey

Daniel_F_ · ‎08-20-2017

jimdempseyatthecove wrote:

Consult the Intel Intrinsics Guide https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&cats=Compare&expand=651

Check the Compare box. You have vector compares, returning a mask, for 8, 16, 32 and 64 bit lanes.

With those, you can write your own vmemcmp-like int compar function.

Jim Dempsey

Thanks for reply. Function mentioned by me is used by our custom hashtable implementation, so vector instruction would not help there. We probably would have to replace our hashtables with something else to take advantage of them. With current implementation dedicated non-vector compare instruction would be better.

andysem · ‎08-21-2017

for int64 we have to use expression "a < b ? -1 : a > b"

Why can't you use subtraction for int64?

Daniel_F_ · ‎08-21-2017

andysem wrote:

for int64 we have to use expression "a < b ? -1 : a > b"

Why can't you use subtraction for int64?

It may cause int range overflow, e.g. INT64_MIN - 1 == INT64_MAX.

James_C_Intel2 · ‎08-21-2017

Why can't you use subtraction for int64?

I think he is after some way of doing this without using branches (which will be badly predicted). It should be relatively simple to use cmov, though, something like

static inline int compare(int a, int b)
{
    int one = 1;
    int minusone = -1;
    int res;

    __asm__ volatile ("cmpl %1,%2\n"
                      "movl $0,%0\n"
                      "cmovll %3,%0\n"
                      "cmovgl %4,%0":
                      "=r"(res): "r"(a),"r"(b),"r"(one),"r"(minusone)
                      );
    return res;
}

Whether that is better than the code the compiler generates by default is unclear, of course, and I haven't looked at IACA to investigate.

Daniel_F_ · ‎08-21-2017

Cownie, James H (Intel) wrote:

Why can't you use subtraction for int64?

I think he is after some way of doing this without using branches (which will be badly predicted). It should be relatively simple to use cmov, though, something like
static inline int compare(int a, int b)
{
    int one = 1;
    int minusone = -1;
    int res;

    __asm__ volatile ("cmpl %1,%2\n"
                      "movl $0,%0\n"
                      "cmovll %3,%0\n"
                      "cmovgl %4,%0":
                      "=r"(res): "r"(a),"r"(b),"r"(one),"r"(minusone)
                      );
    return res;
}
Whether that is better than the code the compiler generates by default is unclear, of course, and I haven't looked at IACA to investigate.

Yes, branchless code is one of my goals, and gcc actually generates such code (see below). However dedicated instruction would be faster here. Additionally it could allow compilers to perform some additional optimizations - ability to get -1 as a result of comparison may be handy in some cases.

cmp(long, long):
  xor eax, eax
  cmp rdi, rsi
  mov edx, -1
  setg al
  cmovl eax, edx
  ret

jimdempseyatthecove · ‎08-21-2017

>>Function mentioned by me is used by our custom hashtable implementation, so vector instruction would not help there.

Why do you say that.

Your original thread post requests a memcmp-like integer compare. This implies you have a list of integers that you wish to compare. Hash tables, other than for collision lists do not have "lists" in need of searching. So, we need to ask: what is your real requirement (be specific).

If your problem is that you have a list of hash keys that are to be searched for a match (as opposed to having a key being used to construct an index), then the vector compare, producing a bitmask, is the correct choice to use.

Jim Dempsey

Daniel_F_ · ‎08-21-2017

jimdempseyatthecove wrote:

>>Function mentioned by me is used by our custom hashtable implementation, so vector instruction would not help there.

Why do you say that.

Your original thread post requests a memcmp-like integer compare. This implies you have a list of integers that you wish to compare. Hash tables, other than for collision lists do not have "lists" in need of searching. So, we need to ask: what is your real requirement (be specific).

If your problem is that you have a list of hash keys that are to be searched for a match (as opposed to having a key being used to construct an index), then the vector compare, producing a bitmask, is the correct choice to use.

Jim Dempsey

Part of our hashtable implementation are AVL trees, and this compare function is used to traverse them.

Everything is written in C so we cannot use templates, code uses callback functions.

jimdempseyatthecove · ‎08-21-2017

I must be misunderstanding something. From my experience, a hash table is used to avoid trees and traversals. You perform a hash from an item being stored (e.g. name), and that hash is the location of the data (or an indication of a collision). IOW there is no searching.

If you are using a binary tree structure (and using the hash code as "plain key" for storage), then for each node (bucket) of keys you would do a binary search or sequential search depending on if the data (keys) stored in the node/bucket were sorted or not (design decision).

If your design uses a binary search in the node/bucket, then you do not need a memcmp-like instruction/function.

Jim Dempsey

Daniel_F_ · ‎08-21-2017

Our implementation is a mix of these two approaches - our "hashtable" is a set of AVL trees. Code first uses hash function to select one of trees, then it traverses it with help of compare function.

andysem · ‎08-22-2017

jimdempseyatthecove wrote:

I must be misunderstanding something. From my experience, a hash table is used to avoid trees and traversals. You perform a hash from an item being stored (e.g. name), and that hash is the location of the data (or an indication of a collision). IOW there is no searching.

A typical hash table-based container consists of an array of buckets, and each of the buckets is also a data structure, such as a list or a tree. You first map the hash value onto the array of buckets (which is fast) and then you search for the element in the bucket (which can be slow if you have lots of collisions). Having a bucket that simply stores one element is possible, but only works if you don't have collisions at all, which is hard to guarantee most of the time.

I believe, Daniel says that in his case buckets are implemented as AVL trees.

andysem · ‎08-22-2017

Daniel F. wrote:

Quote:

andysem wrote:

for int64 we have to use expression "a < b ? -1 : a > b"

Why can't you use subtraction for int64?

It may cause int range overflow, e.g. INT64_MIN - 1 == INT64_MAX.

1. While signed integer overflow is UB in C, it is not in x86, so you can use assembler or cast the values to unsigned integers. Overflows still need to be handled though to produce the correct result. It should be possible with some bit twiddling but I'm not sure the code will be faster than the naive approach.

2. memcmp operates on unsigned bytes, so it doesn't apply to your case. An instruction would have to provide at least two variants - for signed and unsigned comparison.

3. Although I admit that three way comparison (https://en.wikipedia.org/wiki/Three-way_comparison) is fairly common in high level programming languages, I'm not sure the hardware implementation would be beneficial in terms of performance or power consumption. Note that with the branch instruction the CPU is able to exploit branch prediction to hide the comparison latency while with a dedicated instruction the following instructions will just form a dependency chain. I don't know how cmov instructions are implemented, but there is opinion (http://yarchive.net/comp/linux/cmov.html) that these instructions are not very useful because of this.

Daniel_F_ · ‎08-22-2017

Yes, this is my structure - buckets implemented as AVL trees.

Ad.2. In my case I need to compare two ints in order to determine place for them in AVL tree, so I can use either signed or unsigned comparison or any other strict ordering. But in other cases this may be important. Good catch.

Ad.3. In my case code calls comparison function as a callback, so in fact there are two comparison: first one to determine ration between two numbers (inside compare function), and then calling code performs 2nd one to interpret result of 1st one. Unfortunately C language is less inline-friendly, C++ is far more flexible there. I do not know how much this affects branch prediction.