The following reproducer exposes an inefficiency with string comparisons. The compiler generates a call to for_cpstr() that consumes substantial compute time for a case I have. Tried with versions 11.1 and 13.1 and get the same assembly code. What is that call good for ? The compiler should be able to do much better with such a simple piece of code.
subroutine scmp(a,b,l, flag)
character(*), intent(in) :: a
character(*), intent(in) :: b
integer, intent(in) :: l
logical, intent(out) :: flag
if (a(1:l) == b(1:l)) then
flag = .true.
flag = .false.
end subroutine scmp
for_cpstr() compares two Fortran CHARACTER objects dealing with "strings" of different length and the different relational operators (EQ here). Is this the call itself you complain about (the overhead of the call) or the efficiency of the code implementing for_cpstr() ? I don't see anything, the compiler could do better.
O.k. I remember that the world of multi-byte-characters and multi-language strings can be rather complex. (Fortran programmers don't do such stuff.)
In my case the string comparison deals with keywords of less than a dozen plain ASCII characters, so the subroutine call overhead gets significant. It seems like for_cpstr() then calls intel_memcmp() so the overhead is even worse. One workaround I found is to start comparing just the first character and immediately return in case of inequality. That single character operation does not invoke for_cpstr().
Specific code generation for equality / inequality would be desired here. I think it could be done with less than 10 assembly instructions.
I think the suggestion is that the compiler special-case comparison of short strings of known same constant length and just do integer compares. That seems like a worthwhile idea to me, as such comparisons are frequently found in Fortran code.
yes - in general it might be helpful to have special implementations of short strings. But here the length is not a compile-time constant and intel_memcmp() does the optimizations like returning as soon as the first character pair is different, using 4-byte/8-byte integer compares and even using the vector-extensions.
the only optimization I see here would be a special version for strings of equal lenght ( saves a very, very few cycles) or in-lining - but in-lining is not free either ( code size increase, additional register pressure).
It would be interesting to see benchmark data on how many cycles could be gained really.
See the Oprofile data to show the relevance of this issue in the complete program context. The workaround I mentioned cuts 5.7% of total runtime.
samples % symbol name
522549 20.2889 m_tdt512_mp_psi_sweep_cycl_sf_
264716 10.2781 afpktr_
129155 5.0147 __svml_floorf4.R
. . .
121155 4.7041 for_cpstr
78103 3.0325 _intel_fast_memcmp
samples % symbol name
519767 25.7614 m_tdt512_mp_psi_sweep_cycl_sf_
162412 8.0497 afpktr_
128690 6.3783 __svml_floorf4.R
. . .
21856 1.2935 for_cpstr
. . .
11758 0.6959 _intel_fast_memcmp
Oh, my eyes were not good enough to see that the end position was a variable and not a constant. I agree with Heinz that an optimization here is more difficult.
In the above case, the segment (1:l) is the same, in this situation (lengths equal and == operator) you could treat the "arrays" as integer(sizeof character), dimension(l). I do not think the optimization check would be hard to perform and impliment. N.B. this may introduce issues with respect to "l" indexing out of bounds.