Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Countable loops in OpenMP

Livshin__David
Beginner

Some OpenMP-related documents state that in order for a loop to be handled by OpenMP it must be “countable”, giving different definitions of what “countable” means:

  • the number of iterations in the loop must be countable with an integer, and the loop must use a fixed increment.
  • the loop count can be “determined” ( what does “determined” mean? )

Is this indeed a requirement of OpenMP? Or is it a requirement of a specific compiler's implementation of OpenMP?

Can the following code ( which doesn't seem to be countable ) be parallelized by OpenMP? ( Note that the question is whether this code can be parallelized, not whether there is a way to create a parallel equivalent of it. )
 

for ( i = 0; i < cnt; )
{
    x1 = 2.0 * x - 1.0;
    if ( x1 < 1.0 )
    {
        i = i + 3;
        x = x * 2.0;
    }
    else // if ( x1 >= 1.0 )
    {
        i = i + 2;
        x = x / 2.0;
    }
}
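
( For contrast, a countable loop in the OpenMP canonical form - loop-invariant bounds and a fixed increment, so that the trip count is computable before the first iteration - would look like the sketch below; the arrays a and b are hypothetical, for illustration only: )

/* A countable loop: cnt is loop-invariant and the increment is fixed,
   so the trip count ( here (cnt + 1) / 2 ) can be computed up front
   and split among the threads. */
#pragma omp parallel for
for ( i = 0; i < cnt; i += 2 )
{
    a[i] = 2.0 * b[i] - 1.0;
}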

Thank you,


David

www.dalsoft.com

 

 

jimdempseyatthecove
Honored Contributor III

For a more accurate correctness test, change the tally to:

tally += (a * ((i & 15) + 1)) + (b * ((i & 15) + 1));

The idea is to produce a hash that includes the location as well as the value... without having too many of the result bits truncated in the process. The earlier use of i directly would have caused this. For complete correctness, two results (4 arrays) would be required for a direct comparison of the results (placement as well as value).
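
( In context, this is the sketch of the checksum; the loop body that computes a and b is from the earlier posts and is not reproduced in this thread: )

/* Fold each iteration's results a and b into a position-sensitive
   checksum.  The (i & 15) + 1 weight makes the sum depend on where a
   value was produced as well as on the value itself, while keeping
   the multiplier small (1..16) so that few result bits are truncated. */
for ( i = 0; i < cnt; i++ )
{
    /* ... compute a and b for this iteration ... */
    tally += (a * ((i & 15) + 1)) + (b * ((i & 15) + 1));
}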

Jim Dempsey

Livshin__David
Beginner

Fixed the sources, and the checksums now fully agree. The timings are:

loop - sequential code: 19.94

loop_dco - code generated by autoparallelizer: 17.78

loop_Jim: 13.62

 

Livshin__David
Beginner

The last post shows that your manually created code is 24% faster than the auto-parallelized version of the loop. The explanation is quite simple: the auto-parallelizer creates code to ensure memory disambiguation ( which introduces about 20% overhead ), something that is missing from the manually created version.

I created code that adds some weight to the real calculation of the loop, replacing the assignments of a and b with:

b = exp(x1) * cos(x1) + cosh( x1 );
a = sqrt(x1) * log(1 / x1) - sinh( x1 );

( not that it makes sense mathematically; it just uses transcendentals, which seem to be expensive ).

I also ran the code for 5 iterations instead of the 10 in the previous try. The execution times are now:

 

loop - sequential code: 20.05

loop_dco - code generated by auto-parallelizer: 12.96

loop_Jim: 11.48

 

This time your manually created code is 11% faster than the auto-parallelized version of the loop.

I don't know how much time it took you to create the code; running the auto-parallelizer took a few seconds. The 11% difference is the price to pay for convenience ( no need to modify the source, no time spent creating the code ) and protection/precaution ( all necessary checks performed ).

 

 

jimdempseyatthecove
Honored Contributor III

I agree that 11% might not be worth the effort of manual work distribution, but please note that your test platform has only 2 cores. It is as yet unclear how the underlying thread teaming is performed, and whether your auto-parallelizer scales with more cores. What is important is the technique by which you determine how to partition this particular example.

Did your verification assure that the placement of the results was the same as with the serial version? IOW, simply testing the end result of x is insufficient.

Jim

Livshin__David
Beginner

 

 

Not sure I understand what you mean by “teaming”, but currently every thread gets an equal number of iterations ( that is why it is so important to determine the loop count; see the sketch after the link below ). As for “scaling with more cores”, the auto-parallelizer doesn't care about the number of cores; however, not every algorithm scales nicely, see

 

www.dalsoft.com/Increasing_number_of_cores_is_not_always_beneficial.pdf
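
( A minimal sketch of the equal-split distribution mentioned above; count, nthreads, tid, and work() are placeholders - this illustrates the idea and is not dco's generated code: )

/* Divide `count` iterations evenly among `nthreads` threads; the last
   thread absorbs the remainder.  This is why the loop count must be
   known before the work can be handed out. */
long chunk = count / nthreads;
long begin = tid * chunk;
long end   = ( tid == nthreads - 1 ) ? count : begin + chunk;
for ( long i = begin; i < end; i++ )
    work( i );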

 

As to the verification: all the tests I ran - the one in our conversation and many, many more - were checked to assure that the results generated by the parallel and serial codes are reasonably identical ( in the loop we discussed, the checksums were verified ).

Being able to run verification on only one 2-core machine is something I consider a major flaw and drawback. I would like to evaluate the tool on ( many ) multicore x86 systems. There are samples of the results of parallelization, along with instructions on how to execute them and how to report the results, see

 

www.dalsoft.com/documentation_auto_parallelization.html#howToRunProvidedExecutables

 

These executables run on an x86 Linux system, and some ( one ) need Fortran shared libraries to be installed. I am desperately looking for people to run these samples and report the generated data ( as described in the above link ).

 

jimdempseyatthecove
Honored Contributor III

Thanks for the links. When I have time, I will see if I can run your executables on two of my systems:

Intel Xeon E5-2620 6-core/12-Thread
Intel Xeon Phi 5200 64-core/256-Thread

I will compile the source code for the non-dco samples as written.
Then I will edit the source code to reflect how I think a High Performance Computing programmer would address the same problem using OpenMP.
It would only be fair to return the modified sources back to you, to remove the OpenMP directives and dco-ify the same code.

Jim Dempsey

 

Livshin__David
Beginner

Thank you.

Impressive hardware!

Note that there is no need for any development effort; the executables and the scripts to run them are provided at the link I specified. It is only necessary to run these scripts and report the results, so that I will be able to publish them on my site.

Having said that, I would like to add: I am not against another round of the "manual vs. automatic parallel code generation" duel between us.

 

 

Livshin__David
Beginner

The comparison of your manually created code to the code generated by the auto-parallelizer felt incomplete: the auto-parallelizer creates code to ensure memory disambiguation, which the manually created code doesn't. I modified the assembly file created by dco ( the auto-parallelizer ) and bypassed this code. The execution times are now:

 

loop - sequential code: 20.05

loop_dco - code generated by auto-parallelizer: 12.96

loop_dco_fake - code generated by auto-parallelizer with memory disambiguation being disabled: 11.92

loop_Jim: 11.48


This time your manually created code is less than 4% faster.

Actually, I estimated the overhead of memory disambiguation here to be about 10% ( it was 20% for the original code - I reported that ). The difference that still exists ( less than 4% ) is due to the overhead introduced by the calculation of the loop count ( done by dco ) as well as other checks that I didn't touch.

 


 

jimdempseyatthecove
Honored Contributor III

Did your verification check that the store indexes i were juxtaposed correctly (IOW, not merely that the final x was correct)?

Also, a test with 2 cores is not necessarily indicative of what will happen with 4, 8, ... cores.

Jim Dempsey

Livshin__David
Beginner

Did your verification check that the store indexes i were juxtaposed correctly (IOW, not merely that the final x was correct)?

Yes of course, we discussed this already.

Also, a test with 2 cores is not necessarily indicative of what will happen with 4, 8, ... cores.

More than that - with more cores the overhead introduced by dco will become more pronounced, thus widening the gap between our implementations ( most of the code that introduces the overhead is executed by the master process and doesn't benefit from the additional cores ).
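
( This is just Amdahl's law: if a fraction s of the run stays serial - here, the master-only checks - then the speedup on n cores is bounded as below, so a fixed overhead looms larger as n grows: )

$$ S(n) = \frac{1}{s + (1 - s)/n} \le \frac{1}{s} $$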

However, what seems to be important is:

  1. the price/benefit ratio between the time spent manually developing code ( price ) and the improvement it provides over code generated by the auto-parallelizer ( benefit ).
  2. how safe the manually developed code really is - for example, no memory disambiguation was attempted in the code you developed, which I don't know how to justify: you may have no control over how and where your code is executed, which makes the relation between memory access addresses unknown.

David Livshin

www.dalsoft.com

 

jimdempseyatthecove
Honored Contributor III

Your #1 is incomplete for evaluation purposes. One must look at the incremental cost amortized across the reduction in runtime over the expected lifetime * number of instances of the application. If the application were only to be run once, for a short duration (days), then your worth-it calculation is correct. However, for research and/or business applications where the runs are many and the run times are long, 5% - 10% faster is a significant benefit.

For #2, for a properly constructed piece of code it will be known in advance that either:

a) disambiguation isn't necessary (as conflicts cannot occur)
b) disambiguation is required (as conflicts can occur)

In case b), generally a simple test can be performed (e.g. for overlapping array sections), and then an alternate form of a) can be selected (or one can resort to safe reduction methods).
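
( A sketch of such a test; a, b, n, and the two process_* routines are placeholders: )

#include <stdint.h>

/* Pointer-range test: if the two n-element sections cannot overlap,
   take the unguarded parallel path; otherwise fall back to a safe
   (serial or reduction-based) version. */
int disjoint = ( (uintptr_t)(a + n) <= (uintptr_t)b ) ||
               ( (uintptr_t)(b + n) <= (uintptr_t)a );
if ( disjoint )
    process_parallel( a, b, n );   /* conflicts cannot occur */
else
    process_serial( a, b, n );     /* conflicts possible     */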

In the sample code I provided, the construction was such that a conflict could not occur, barring a meaningless mashing of pointers such as:

	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double;
	b = &a[1];

Jim Dempsey

Livshin__David
Beginner

then 5% - 10% faster is a significant benefit.

I have been writing code optimizers for many, many years and have never seen a customer that was interested in a tool that gives a 5% improvement. I think such a person hasn't been born yet.

In my evaluations I always consider a 5% difference in execution time ( either way - a 5% improvement or a 5% degradation ) to be nothing more than "noise".

For #2, for a properly constructed piece of code it will be known in advance that either:

a) disambiguation isn't necessary (as conflicts cannot occur)
b) disambiguation is required (as conflicts can occur)

In case b), generally a simple test can be performed (e.g. for overlapping array sections), and then an alternate form of a) can be selected (or one can resort to safe reduction methods).

Disregarding the fact that such a crude way of performing disambiguation is, in my opinion, unacceptable, you wouldn't even be able to do it here - remember that the loop we are discussing is not countable, so the loop count cannot easily be determined, and therefore the array regions utilized by the loop cannot be calculated.

In your case you allocated the memory yourself, thus eliminating the need for disambiguation, but why do you think that this represents the real-life situation? In real life, as I know it, routines are often called without any means of recreating the values of their parameters, thus making dynamic memory disambiguation necessary.

 

jimdempseyatthecove
Honored Contributor III

If you bothered to look at my code, you would notice that the partitioning is done after the output index is generated. IOW the input index is irrelevant; the output passes through a write-permitted filter based on the store index. Each thread has a specific output index range, and thus the threads cannot conflict.
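
( A minimal sketch of the filtering idea; out, x0, and cnt are placeholders, and the actual code from post #15 differs in detail: )

#include <omp.h>

/* Every thread replays the full (cheap) serial recurrence, but each
   store passes through a write-permitted filter on the output index k:
   a thread only writes the elements of its own slice, so no two
   threads can ever store to the same location. */
#pragma omp parallel
{
    int tid  = omp_get_thread_num();
    int nthr = omp_get_num_threads();
    double x = x0;              /* private copy of the recurrence state */
    int k = 0;                  /* store (output) index */
    for ( int i = 0; i < cnt; )
    {
        double x1 = 2.0 * x - 1.0;
        if ( x1 < 1.0 ) { i = i + 3; x = x * 2.0; }
        else            { i = i + 2; x = x / 2.0; }
        if ( k % nthr == tid )  /* the write-permitted filter */
            out[k] = x1;
        k++;
    }
}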

The author of the code is responsible for defining any requirements on the input arguments (e.g. not permitted to overlap, can overlap where "B" follows "A" by at least vector width, can overlap where "A" follows "B" by at least vector width, or permitted to overlap, not allocated, allocated, not initialized, initialized, ...). Your position seems to be "don't know/care" and the code will test the situation and take appropriate action. Your position is valid for undisciplined programmers, which is alright if that is who you are dealing with.

My preference is, if there is a possibility of conflicting usage, that I (the programmer) .not. rely on auto dependency testing with code path generation, especially with multithreading. But this is my preference, as I want to know, and control, what the code is doing.

Jim Dempsey

 

Livshin__David
Beginner

I did look at your code - I even had to fix it for it to run ( see posts #19, #23 ) - but I fail to understand how the way you arranged the calculation eliminates memory aliasing. While on the subject, what would you say if the line 'i = i + 2' were replaced by 'i = i - 2'?

Livshin__David
Beginner

The following is my presentation at the "Compiler, Architecture, And Tools Conference" ( see post #9 ):

the full text of the presentation:

http://www.dalsoft.com/Autoparallelizer_for_a_multi_core_x86_architecture.pdf

the slide show:

http://www.dalsoft.com/presentation_Autoparallelizer_for_a_multi_core_x86_architecture.pdf

 

 

 

jimdempseyatthecove
Honored Contributor III

David,

What you have done for auto-parallelism is quite interesting, and once mature (bugs shaken out), very impressive. As to whether this technology can be incorporated into existing compilers (IP acquired) or sold as a stand-alone post-processor, I have some prior experience here. I suspect that either nothing will happen or you might get some interest in IP acquisition. Selling, or consulting for, post-processing may attract but a few interested parties.

Back in the stone age (1990-1992) my minicomputer business faded away due to PCs disrupting the minicomputer market. Back in those days, a large portion of the market still ran MS-DOS in Real Mode or used a DOS Extender in a transitional Real/Protected mode. Windows (Protected Mode) was still coming of age. Looking to form a new PC-based software company, I asked myself what I could offer.

The first product was to port the Digital Equipment Corporation text editor TECO to the PC. This text editor is more capable than AWK, Brief, and others at pattern matching and MUNGing up text. MUNG is a self-referencing acronym: Mung Until No Good. Due to the steep learning curve (too many features) there was little interest in this product.

Then I thought, well, how can I get interest in this... ah... incorporate it into a tool for programmers. At that time I had gained considerable programming experience using Borland Turbo C, and then Borland C++. During debugging, I'd often looked at the disassembly and noticed that although the Borland compilers produced optimized code, in places the code was non-optimal. I guess it was good enough for Borland, as their compiler was better than Microsoft's.

I thought to myself, hmm..., if I used the Borland compiler to produce assembler output, I could use my TECO editor to search out the non-optimal code sections and tidy them up. While the code patterns tended to be the same, the register assignments and branch labels would differ. TECO is capable of fuzzy matching, so I could search out a large fuzzy pattern (multiple lines) containing smaller fuzzy match patterns (registers and labels), then massage the fuzzy stuff and produce optimized replacement code. I called this the PeepHole Optimizer.

The above procedure was performed again for different non-optimal code sequences. After I thought I had found all, or enough of, the low-hanging fruit, I re-examined my output files, only to discover that my tweaked-up code produced some sections that could benefit from further optimization using the same fuzzy-match-and-rewrite macros. IOW, I could pass my output back through the same editing macros to yield further improvements (and add new macros to handle exceptional cases). A typical example of MUNG (without the NG).

The end process produced code that was typically 15% faster. I thought this would sell as a post-processor; I found little interest (maybe 100 sales). Some of my users asked if it would work on MS C/C++. So I gave it a try, and to my surprise, although the assembler code looked quite different between Borland and Microsoft, the fuzzy pattern matching found and corrected much of the code produced by the MS compiler and yielded similar performance improvements.

At this time the Intel 80386 CPU was quite popular. You had 16-bit MS-DOS, MS-DOS with a DOS Extender (protected mode), and Windows (protected mode). IOW we had the MS-DOS/DOS-Extender market and the Windows market. MS-DOS could live in the low 1MB of RAM but was capable of using the additional RAM as a RAM-DISK via a driver. The DOS extenders could use the extra memory, but only in the 16-bit selector model. The CPU, though, was capable of a 32-bit Flat Model mode, but DOS could not run in that mode. Some of us programmers figured out that the CPU DS and ES selectors could be pre-conditioned to have the Huge Granularity bit set, permitting addressing indexed/based by 32-bit registers. The only way to perform the indexing was by modifying assembly code to instruct the assembler to emit the 32-bit address prefix byte. The Borland TASM assembler would do this. At that time users could write assembler code that would run as 16-bit MS-DOS code with 32-bit addressing capability for data. The only problem was that this required assembly coding.

Now then, I thought, since I could already optimize the assembly output of Borland and Microsoft C/C++ code, with an enhancement to the PeepHole Optimizer I could add Flat Model programming. Side bar: at that time a 16-bit program (C/C++) could have a pointer (a 16-bit address off of the DS, the Data Segment); a "far pointer" (two 16-bit words, one to be loaded into the ES, Extra Segment, register and the other to be used as a 16-bit index, from which you could byte-offset index +/- 32KB); and finally a "huge pointer", which has the same two 16-bit words as the far pointer, but for which the compiler could generate code in which the index manipulates both the segment and the offset, thus permitting arrays larger than 32KB (but still confined within the 1MB box).

What I did was to pattern match for huge pointer manipulation and convert it to the 32-bit Flat Model, inclusive of using the CPU's SIB (Scale Index Base) capability. This made for a vast reduction in the code that manipulated huge pointers, and it permitted 32-bit indexing. The next problem, though, was that the C Runtime Library heap manager was locked in the 1MB box (640KB or 720KB) as presented by MS-DOS. Because I could manipulate huge pointers, the only remaining issue was with the allocation. While it was easy to locate the malloc/free calls, identifying whether the pointer receiving the malloc, or supplied to free, was a huge pointer or a far pointer was problematic. I knew the debugger had this information, so I required the C/C++ compiler to add the debug symbol table to the assembler output, and wrote text macro code to search the debug symbol table to find out if the pointer was attributed as huge. If so, the allocation/free was redirected to my own memory manager, whose heap could expand to all of physical RAM.

Now I had a product that could run in MS-DOS whose code, though restricted to the low 640KB, had access to all of physical RAM as data (4MB, 8MB, 16MB, ...) and executed 2x to 4x faster. Great, or so I thought. There were two marketing problems with this:

a) I was a little company and few businesses and/or research institute would deal with a small outfit (non-major vendor)
b) The product couldn't work inside Windows

The problem in Windows was the result of a poor decision on Intel's part when they implemented the Virtual 8086 Machine mode (V86) to run MS-DOS applications (a quasi Real Mode). While the memory management hardware supports mapping a process (via selectors, and via the Page Tables) onto arbitrary pages, someone at Intel thought it wise that any index reaching beyond 1MB from the base of the Selector (in V86 mode) should generate a GP fault... regardless of whether that selector was sized larger than 1MB. IMHO there was absolutely no reason for this, seeing that for a typical MS-DOS session the Selector could be allocated with a size of 1MB, and any attempt to address outside that box would generate a Page Fault. As such, my atypical MS-DOS session, while it could manipulate the granularity and potentially acquire the additional RAM for the Selector, was prohibited from addressing that memory.

What this long story amounts to is a cautionary tale: sometimes superior products fail.

Jim Dempsey

Livshin__David
Beginner

Jim,

 

Thank you very much for taking the time to share your story with me and all the readers of this forum - and what a story it is!

For me it might serve as a rude awakening, but I never really was asleep. I understand that failure is a very possible outcome of my efforts, but instead of fretting about a likely flop I would like to focus on the possible success and how to make it happen. I don't know how much I can rely on your help and assistance, but any guidance on marketing my tool will be greatly appreciated.

In case this turns out to be the last post, I would like to thank you for the most exciting and interesting dialog we have had, and wish you Happy Holidays!

And, I can't resist one more request - wish me luck.

 

David Livshin

www.dalsoft.com

 

 

jimdempseyatthecove
Honored Contributor III

David,

I wish you all the luck in attaining some measure of reward for all your hard work. Personally, I think your best possibility will be to see if one of the compiler vendors (read: those that sell compilers) could be interested in acquiring your intellectual property. The Open Source community will want you to donate all your effort, and then provide support. I am not generally against Open Source, but someone has to pay your bills.

RE your text (.pdf)

I think it would help if you included some better examples and some charts, IOW a little more in-depth presentation (study) comparing OpenMP and dco. I think the major strength you have over OpenMP (which at times can be a weakness) is that with your technique the programmer need only specify that the loop is a candidate for parallelization, whereas with OpenMP the programmer may be required to supply clauses: private, shared, reduction, simd, loop count, etc. IOW dco is somewhat the Parallel Programming for Dummies approach to yielding good results. dco might benefit from additional directives that could eliminate some of your preamble code (e.g. determining if the loop count is sufficient for parallelization).

In your pdf you showed:

do 140 i = 1, nk
  x1 = 2.d0 * x(2*i-1) - 1.d0
  x2 = 2.d0 * x(2*i) - 1.d0
  t1 = x1 ** 2 + x2 ** 2
  if (t1 .le. 1.d0) then
    t2 = sqrt(-2.d0 * log(t1) / t1)
    t3 = (x1 * t2)
    t4 = (x2 * t2)
    l = max(abs(t3), abs(t4))
    q(l) = q(l) + 1.d0
    sx = sx + t3
    sy = sy + t4
  endif
140 continue

You showed this as an example, and stated that the q(l) update has dependencies that make it hard or inefficient to parallelize using OpenMP. This is not true, and it should not be used as an exemplar. To wit:

!$omp parallel do default(private) shared(x, nk) reduction(+:q,sx,sy)
do i = 1, nk
  x1 = 2.d0 * x(2*i-1) - 1.d0
  x2 = 2.d0 * x(2*i) - 1.d0
  t1 = x1 ** 2 + x2 ** 2
  if (t1 .le. 1.d0) then
    t2 = sqrt(-2.d0 * log(t1) / t1)
    t3 = (x1 * t2)
    t4 = (x2 * t2)
    l = max(abs(t3), abs(t4))
    q(l) = q(l) + 1.d0
    sx = sx + t3
    sy = sy + t4
  endif
end do

The only change to the loop code was to modernize it by replacing the "do 140.../140 continue" with "do.../end do", and then adding the single OpenMP directive to the modernized code. The reduction clause eliminates the race conditions on the shared objects without requiring atomic operations (or critical sections). It does add the expense, at the beginning of the loop, of zeroing out the to-be-reduced data, and, at the end of the loop, of the combination code that performs the reduction. As to how this compares with dco, that would be up to a test.

The example in the Utilized technology section was covered above in post #15. While I did have to modify the code, the modifications were slight and quite easy to introduce without bunging up the results. dco did not win on performance but it did win on ease of coding.

The cmppth example would be harder to manually code for use with OpenMP. Something like:

#include <limits.h>   // INT_MAX
#include <atomic>     // std::atomic
#include <omp.h>
using std::atomic;

int cmppth (BIT *p, BIT *q, int ninputs )
{
  atomic<int> ret = INT_MAX;
  #pragma omp parallel reduction(min:ret)
  {
    int my_code = omp_get_thread_num() << 1; // 0, 2, 4, 6, ...
    #pragma omp for
    for (int i = 0; i < ninputs; i++)
    {
      if (p[i] != q[i])
      {
        if (p[i] < q[i])
        {
	  ret = my_code + 1; // the +1 (low bit) indicates return -1; the thread id is my_code >> 1
          break;
        }
        else
        {
          ret = my_code;
          break;
        }
      } // !=
      if(ret < my_code)
        break;
    } // for
  } // parallel
  if(ret == INT_MAX)
     return (0);
  if(ret & 1)
     return(-1);
  return (1);
}

Jim Dempsey

Livshin__David
Beginner

Jim,

Thank you for your reply.

For the Fortran code, introducing a reduction on the array ‘q’ ( reduction(+:q) ) is unacceptable as a general solution - reduction creates a private copy of the array inside every thread, which, when ‘q’ is a “large” array, may be problematic, not to say fatal ( for example, a q of 10^7 double-precision elements replicated across 64 threads needs roughly 5 GB for the private copies alone ). dco came up with another solution that doesn't require memory allocation and is very efficient.

For ‘cmppth’, the code that you created has a ‘break’ inside the ‘omp for’ region, which, as I understand it, is not allowed. However, the problem is not that. I can easily generate a parallel version of this loop - actually I did, but couldn't verify it, because the test cases for the “eqntott” benchmark I have contain a few tens of items, which is not enough to benefit from parallelization. The problem is to prevent iterations after the exit point of the loop from being executed. It is not clear that the code you proposed guarantees that. Also, ‘atomic ret’ may be expensive to implement - you set it once inside the loop ( which is fine ), but you test it at every iteration ( which is something that I would try to avoid ).
Being on the subject, do you know a “real” benchmark that has a critical loop with a break/return/goto inside it?
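
( As an aside: the conforming replacement for a break inside an ‘omp for’ is, as far as I know, OpenMP cancellation ( OpenMP 4.0 ); a sketch follows, with ‘found’ as a placeholder shared flag. Note that it exhibits exactly the problem I describe - cancellation eventually stops the remaining iterations, but it does not guarantee that the difference reported is the one at the lowest index: )

/* Early exit via OpenMP cancellation; requires OMP_CANCELLATION=true
   in the environment at run time, otherwise the cancel directives are
   ignored and the loop simply runs to completion. */
int found = 0;
#pragma omp parallel for
for (int i = 0; i < ninputs; i++)
{
    if (p[i] != q[i])
    {
        #pragma omp atomic write
        found = (p[i] < q[i]) ? -1 : 1;
        #pragma omp cancel for
    }
    #pragma omp cancellation point for
}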

 

David Livshin

 

www.dalsoft.com

 

jimdempseyatthecove
Honored Contributor III

I agree that the cost of reduction for arrays may be too high for lightly loaded loops. I did/do not see any issue with break (C/C++) or exit (Fortran), but there would be an objection to a goto outside of the scope of the loop. The array reduction can be eliminated using the % filter as in #15, but this adds code and overhead. If dco performs better, this would be a good example to include.

The cmppth example is another bad example, due to not being applicable to a real-world test case (unless you can find an in-use example with a very large ninputs).

Jim Dempsey
