Software Archive
Read-only legacy content

Xeon Phi not using more than 1 or 2 cores

Nick_W_
Beginner
2,381 Views

Hi,

We have an engineering sample of the Xeon Phi (60-core version). We have installed mpss_gold_update_3-2.1.6720-16-rhel-6.4

 

1) We followed the instructions to upgrade flash but micinfo is reporting this:

Host: Linux

OS version: 2.6.32-358.el6.x86_64

Driver version: 6720-16

MPSS version: 2.1.6720-16

BUT: flash version, SMC, UOS and device serial number are all reported as "NotAvailable"...

 

2) micsmc produces error messages every few seconds such as "Warning: mic0 device connection lost!" and "Information: mic0: Device connection restored"

 

3) We have a C++ native application running on the Phi. It can be configured to run multithreaded (pthreads), and we have taken benchmarks. It runs fractionally more slowly with 240 threads (60 cores x 4 HW threads per core) than with a single thread... (we have accounted for the overhead of starting new threads)

The application runs against a fixed size sample data as follows:

- when compiled for the host system: 0.2 seconds

- when run on the Phi in a single thread: 0.59 seconds

- when run on the Phi in 240 threads: 0.6 seconds

Repeated runs while micsmc is running show that no more than 2 cores are being used at any one time...

 

Any help with this would be much appreciated

0 Kudos
25 Replies
Nick_W_
Beginner
1,831 Views

Update:

We greatly increased the run time of our sample code (using 240 threads) - and watched performance using top.

All cores are being utilised, but top reports less than 1% utilisation on each core.

(We simplified the code so it reads through a 1.6 MB memory buffer thousands of times)

So we are thinking the problem is related to memory bandwidth...

Let us know what you think - perhaps you have standard benchmarking software we could try on our card?
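For reference, the kind of loop we are timing looks roughly like this (a minimal sketch, not our production code; the buffer size, repeat count and thread count are placeholders):

#include <pthread.h>
#include <stddef.h>

static const size_t kBufBytes = 1600 * 1024;   // ~1.6 MB shared read-only buffer
static const int    kRepeats  = 1000;          // read through it many times
static const int    kThreads  = 240;

static char g_buf[kBufBytes];
static long g_total = 0;                       // sink so the compiler keeps the loop

static void* reader(void*)
{
    long sum = 0;
    for (int r = 0; r < kRepeats; ++r)
        for (size_t i = 0; i < kBufBytes; ++i)
            sum += g_buf[i];
    __sync_fetch_and_add(&g_total, sum);       // publish the result so the loop is not optimised away
    return NULL;
}

int main()
{
    pthread_t tid[kThreads];
    for (int t = 0; t < kThreads; ++t) pthread_create(&tid[t], NULL, reader, NULL);
    for (int t = 0; t < kThreads; ++t) pthread_join(tid[t], NULL);
    return 0;
}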

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

Running only a single thread on the coprocessor will gain you access to only half the cycles available on a single core, so performance in that scenario SHOULD be pretty poor.  And running 240 threads on 60 cores may have pushed you past any performance elbow that exists because of the activity (or lack of it) in your test application.  Have you tried collecting data with thread counts anywhere between these extremes?

It also wouldn't hurt if you could show us what your test program is doing.

0 Kudos
Charles_C_Intel1
Employee
1,831 Views

A program that runs for only a fraction of a second is also troublesome.  Remember, we're running a Linux kernel here, and you're asking the OS to create 240 threads and all the data structures associated with each thread.  Even on a full-blown Xeon, that is a time-consuming *serial* process.  A workload with a longer runtime will make it much easier to work out where you are running into an application issue, and where you are just running into the OS having a lot of work to do (a longer program should minimize the overall effect of the OS).

0 Kudos
Charles_C_Intel1
Employee
1,831 Views

Oops, sorry, didn't read clearly.  How much longer is your runtime in the threaded code now?

I agree with Robert - seeing your code would be useful.

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Aren't you oversubscribing with such a large number of threads? As Charles said, for a short-running piece of code you can accumulate an overhead related to the creation of 240 threads and their data structures, so the cores' hardware threads will partly spend their time creating OS threads for your program.

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Another question relates to your application and how efficiently it can be threaded.

0 Kudos
Nick_W_
Beginner
1,831 Views

There was a bug in the code that divides the workload amongst threads... so that explains a lot. Thanks for all your comments though.

Out of interest, we started all threads, halting them at a thread barrier until they were all ready, then released them and started a timer... The timer was stopped only after all threads had completed their workload and joined. The only synchronisation point between threads was a __sync_add_and_fetch. (The timing scheme is sketched below.)

With the bug fixed, we are seeing better performance for much larger inputs, pushing runtimes up to several seconds (60% CPU showing in top) - but there is still some kind of overhead that makes the threaded version look slower at shorter runtimes. Perhaps the join? This is for curiosity only, since our target application will involve streaming very large data volumes.
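For clarity, the timing scheme is roughly the following (a minimal sketch, assuming a pthread_barrier_t as the release point; the worker body is elided and the thread count is the 240 used above):

#include <pthread.h>
#include <sys/time.h>
#include <stdio.h>

static pthread_barrier_t g_ready;          // all workers wait here until released

static void* worker(void*)
{
    pthread_barrier_wait(&g_ready);        // hold until every thread has been created
    // ... per-thread share of the workload runs here ...
    return NULL;
}

int main()
{
    const int nThreads = 240;
    pthread_t tid[nThreads];
    pthread_barrier_init(&g_ready, NULL, nThreads + 1);   // +1 for the main/timing thread

    for (int t = 0; t < nThreads; ++t)
        pthread_create(&tid[t], NULL, worker, NULL);

    struct timeval t0, t1;
    pthread_barrier_wait(&g_ready);        // release the workers...
    gettimeofday(&t0, NULL);               // ...and start the clock

    for (int t = 0; t < nThreads; ++t)     // stop the clock only after every join
        pthread_join(tid[t], NULL);
    gettimeofday(&t1, NULL);

    printf("elapsed: %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}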

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Can you profile your application with VTune?

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

>>The only synchronisation point between threads was a __sync_add_and_fetch.

I am assuming the __sync_add_and_fetch was used for a thread completion count barrier.

Try replacing the counter with an array of done flags (initialized to 0). This replaces the atomic read/modify/write with a simple atomic store. Your barrier code becomes:

volatile int DoneFlags[nThreads];
...
for(int i=0;i<nThreads;++i) DoneFlags[i] = 0;   // clear all flags before the workers start
...

DoneFlags[iThread] = 1; // each thread sets only its own flag (iThread == omp_get_thread_num())
for(int i=0;i<nThreads;++i) { while(!DoneFlags[i]) _mm_pause(); }   // then spins until all flags are set

If the barrier is reached repeatedly within a parallel region, consider changing the done flag to a trip count (and your test for done too).

DoneFlags[iThread] = iTrip; // iTrip is in the local scope of each thread, initialized to 1
for(int i=0;i<nThreads;++i) { while(DoneFlags[i] != iTrip) _mm_pause(); }
++iTrip;

I don't have my Xeon Phi yet, so I cannot confirm whether you can use byte-sized flags; a 32-bit flag should work, though. On Sandy Bridge, the write mask is capable of working in byte-sized units. Flag/trip-count checking can use wider units.

Jim Dempsey

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

Caution: packing the "done" flags at integer widths across an array indexed by HW thread will likely expose your code to significant "false sharing": manipulating an individual flag modifies and evicts the cache line holding its neighbours' flags, causing thrashing. (A padded layout is sketched below.)
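For illustration, one way to give each flag its own cache line (a sketch only, assuming the 64-byte line size; the spin-wait reuses the _mm_pause pattern from Jim's snippet):

#include <immintrin.h>   // _mm_pause

// Each flag is forced onto its own 64-byte cache line, so a store by one
// thread does not invalidate the line holding its neighbours' flags.
struct __attribute__((aligned(64))) PaddedFlag {   // sizeof == 64, one line per flag
    volatile int flag;
};

static PaddedFlag DoneFlags[240];   // one entry per thread

static void barrier_wait(int iThread, int nThreads)
{
    DoneFlags[iThread].flag = 1;                    // touch only my own line
    for (int i = 0; i < nThreads; ++i)
        while (!DoneFlags[i].flag) _mm_pause();     // scan: one line read per thread
}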

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

Robert,

The Xeon Phi has latency problems with XCHGADD (and other interlocked instructions). The direct write was suggested as a means to reduce the latency (though it is not a perfect solution). The user is free to experiment as to whether the flags are packed into adjacent locations or spread across individual cache lines. For the barrier I suggest packed, because the scan of all done flags can then be made with fewer cache-line loads. Though the writes are slower, they are made at skewed time intervals, so for all but the last thread the extra write latency is absorbed by spin-wait time.

One must make the pudding though.

Jim Dempsey

0 Kudos
Nick_W_
Beginner
1,831 Views

I can't post any code at the moment but this information might help pinpoint some problems:

- A large reference data file is loaded (161 MB). The code accesses this data in a fairly random order...

- A global input buffer of 1.8 MB is loaded

- A global results buffer of 8 MB is allocated

- The input is divided evenly amongst the threads.

- The threads work independently apart from when writing to the output buffer. The outbuf index is controlled by the __sync_add_and_fetch (a sketch of this pattern follows below). Apart from the thread barrier (which only exists for timing purposes), this is the only synchronisation point.

It's not easy to get CPU utilisation figures due to the short runtime, but it only seems to hit around 40% max.
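For concreteness, the output handoff is essentially this pattern (a simplified sketch; Result and the function name are illustrative placeholders, not our actual code):

struct Result { /* fields elided */ };

static volatile long g_outIndex = 0;        // next free slot in the shared results buffer

// Each thread reserves a contiguous range of output slots by atomically
// advancing the global index, then fills its range with no further locking.
static void emit_results(const Result* local, long count, Result* outbuf)
{
    long end   = __sync_add_and_fetch(&g_outIndex, count);   // returns the new index
    long first = end - count;                                 // reserved range is [first, end)
    for (long i = 0; i < count; ++i)
        outbuf[first + i] = local[i];
}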

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

If the amount of output per segment of input is the same, then you do not need a __sync_add_and_fetch to acquire a slot in the output buffer. (outOffset = (8/1.8)*inOffset*iThread; a small sketch follows.)

Your requirements may differ.
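Something like this (a sketch only; it applies when the output per input segment is fixed, with the 1.8 MB / 8 MB sizes quoted above standing in for the real values):

#include <stddef.h>

// With a fixed input-to-output ratio, each thread's output offset can be
// computed up front, so no atomic is needed to claim output space.
static void thread_ranges(int iThread, int nThreads,
                          size_t inputBytes, size_t outputBytes,
                          size_t* inOffset, size_t* outOffset)
{
    size_t inPerThread  = inputBytes  / nThreads;   // equal share of the input
    size_t outPerThread = outputBytes / nThreads;   // matching share of the output
    *inOffset  = (size_t)iThread * inPerThread;
    *outOffset = (size_t)iThread * outPerThread;
}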

Jim Dempsey

0 Kudos
Nick_W_
Beginner
1,831 Views

Thanks but the output size is not known beforehand.

We noticed that when timing individual threads, the thread times were averaging 0.004s (the longest being about 0.006s), so there is some large overhead associated with joining the threads, since the total runtime is currently 0.012 seconds. (I made recent improvements; the main change was forcing CPU affinity per thread.)
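For reference, forcing per-thread affinity was done with something like this (a sketch using pthread_setaffinity_np; the mapping from thread index to logical CPU is illustrative, not our exact scheme):

#define _GNU_SOURCE       // for pthread_setaffinity_np / CPU_SET
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU.  The coprocessor exposes 4
// logical CPUs per core, so the caller's thread-index-to-CPU mapping decides
// how HW threads share a core.
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}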

0 Kudos
Nick_W_
Beginner
1,831 Views

Trying to use the VTune 2013 GUI...

I get an error "Error: Problem accessing the sampling driver"

I have followed the instructions to install the driver but it didn't help - I noticed that I got a file not found error when running sep_micboot_create.sh (I also tried the older version under the vtune_amplifier_xe directory and got the same problem).

It could not find "sep3_8-k1om-2.6.38.8-*.ko"

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

sep_micboot_create.sh has been a ghost for at least several releases--at one point it just echoed a message indicating its deprecation.  My usual advice for dealing with this error: uninstall/ream/reinstall.  That is, run the uninstall.sh script in the VTune Amplifier installed location (nominally /opt/intel/vtune_amplifier_xe is a symlink to the current install).  Then rmdir /opt/intel/vtune_amplifier_xe_2013 and any "sep" directories in /opt/intel/mic (to ensure no old .ko files are left around to confuse things).  Then reinstall VTune; the installer should ask whether you want it to install the coprocessor driver--let it do so if possible.  The VTune Amplifier installation should automatically do a service mpss restart (thus the message it spits out that this may take a little time).  If that sequence doesn't clear the message you reported, yours would be the first case.

If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings

0 Kudos
Nick_W_
Beginner
1,831 Views

Thanks Robert.

I tried this but the install complains that we don't have a supported version of Linux (uname is just showing GNU/Linux, but I think it's CentOS)

Everything was reinstalled but no mpss restart was done (I did one manually)

I disabled the NMI watchdog timer on the host as requested (the mic doesn't have one)

No joy... could this unsupported OS problem also cause performance issues when running native MIC programs?

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

>>If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings.

Good advice - don't use __sync_add_and_fetch on the local (core) team, or else this will post ring traffic too. Use the multiple flags (4) and have one team member be the one performing the __sync_add_and_fetch on the global barrier.
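A sketch of that two-level arrangement (one-shot; resetting the flags for reuse is omitted, and the team layout assumes threads are pinned 4 per core so that iThread/4 identifies the core):

#include <immintrin.h>   // _mm_pause, same spin-wait as in the snippets above

enum { kThreads = 240, kPerCore = 4, kCores = kThreads / kPerCore };

static volatile int  g_teamFlag[kCores][kPerCore];   // core-local check-in flags
static volatile long g_coresDone = 0;                 // how many cores have fully arrived

static void two_level_barrier(int iThread)
{
    int core = iThread / kPerCore;
    int lane = iThread % kPerCore;

    g_teamFlag[core][lane] = 1;                       // plain store, local check-in

    if (lane == 0) {                                  // one representative per core
        for (int j = 0; j < kPerCore; ++j)
            while (!g_teamFlag[core][j]) _mm_pause();
        __sync_add_and_fetch(&g_coresDone, 1);        // single atomic per core, not per thread
    }
    while (g_coresDone != kCores) _mm_pause();        // everyone waits until all cores are in
}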

Jim Dempsey

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

If it's just CentOS, that should in itself not be a problem.  We have Intel Xeon Phi coprocessors on CentOS machines in our lab that work fine.  Is the warning you get just that, a warning, or does it actually abort the install?  And after the service mpss restart, are you then able to run a micinfo that reveals details from the coprocessor side (rather than getting a bunch of fields marked "NotAvailable")?  With mpss service running, what happens when you run "miccheck"?

0 Kudos
Nick_W_
Beginner
1,768 Views

The install will proceed despite the 'unsupported OS' and the drivers are reported as installed. I then run the script to set up the environment, start amplxe-gui, and set it up to run my native mic code over ssh.

When I run a hotspot analysis, I can see my program run to completion in the terminal window, but then a message appears in the bottom box of the GUI saying that no results were collected and that the sampling driver may need to be restarted.

Micinfo is still reporting NotAvailable for all the "version" section fields

Miccheck reports OK for everything.

I know CentOS is just rebranded Red Hat - I was wondering if somehow fooling the installer into thinking it was Red Hat might be a good idea?

BTW - we are getting excellent performance now - the solution in our case was to greatly increase the number of threads (to mask slow memory accesses), but there are just one or two threads spoiling the party by running 5x slower - so it would be great to get VTune working to find out why...

0 Kudos