Software Archive
Read-only legacy content

Xeon Phi not using more than 1 or 2 cores

Nick_W_
Beginner
2,381 Views

Hi,

We have an engineering sample of the Xeon Phi (60-core version). We have installed mpss_gold_update_3-2.1.6720-16-rhel-6.4

 

1) We followed the instructions to upgrade flash but micinfo is reporting this:

Host: Linux

OS version: 2.6.32-358.el6.x86_64

Driver version: 6720-16

MPSS version: 2.1.6720-16

BUT: flash version, SMC, UOS and device serial number are all reported as "NotAvailable"...

 

2) micsmc produces error messages every few seconds such as "Warning: mic0 device connection lost!" and "Information: mic0: Device connection restored"

 

3) We have a C++ native application running on the Phi. It can be configured to run multithreaded (pthreads), and we have taken benchmarks. It runs fractionally more slowly with 240 threads (60 cores x 4 HW threads per core) than with a single thread... (we have accounted for the overhead of starting new threads)

The application runs against a fixed size sample data as follows:

- when compiled for the host system: 0.2 seconds

- when run on the Phi in a single thread: 0.59 seconds

- when run on the Phi in 240 threads: 0.6 seconds

Repeated runs while micsmc is running show that no more than 2 cores are being used at any one time...

 

Any help with this would be much appreciated

0 Kudos
25 Replies
Nick_W_
Beginner
1,831 Views

Update:

We greatly increased the run time of our sample code (using 240 threads) - and watched performance using top.

All cores are being utilised, but top reports less than 1% utilisation on each core.

(We simplified the code so it reads through a 1.6 MB memory buffer thousands of times)

So we are thinking the problem is related to memory bandwidth...

Let us know what you think - perhaps you have standard benchmarking software we could try on our card?
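For reference, the kind of loop we are timing looks roughly like this (a minimal sketch, not our production code; the buffer size, repeat count and thread count are placeholders):

#include <pthread.h>
#include <stddef.h>

static const size_t kBufBytes = 1600 * 1024;   // ~1.6 MB shared read-only buffer
static const int    kRepeats  = 1000;          // read through it many times
static const int    kThreads  = 240;

static char g_buf[kBufBytes];
static long g_total = 0;                       // sink so the compiler keeps the loop

static void* reader(void*)
{
    long sum = 0;
    for (int r = 0; r < kRepeats; ++r)
        for (size_t i = 0; i < kBufBytes; ++i)
            sum += g_buf[i];
    __sync_fetch_and_add(&g_total, sum);       // publish the result so the loop is not optimised away
    return NULL;
}

int main()
{
    pthread_t tid[kThreads];
    for (int t = 0; t < kThreads; ++t) pthread_create(&tid[t], NULL, reader, NULL);
    for (int t = 0; t < kThreads; ++t) pthread_join(tid[t], NULL);
    return 0;
}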

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

Running only a single thread on the coprocessor will gain you access to only half the cycles available on a single core, so performance in that scenario SHOULD be pretty poor.  And running 240 threads on 60 cores may have pushed you past any performance elbow that exists because of the activity (or lack of it) in your test application.  Have you tried collecting data with thread counts anywhere between these extremes?

It also wouldn't hurt if you could show us what your test program is doing.

0 Kudos
Charles_C_Intel1
Employee
1,831 Views

A program that runs for only a fraction of a second is also troublesome.  Remember, we're running a Linux kernel here, and you're asking the OS to create 240 threads and all the data structures associated with each thread.  Even on a full-blown Xeon, that is a time-consuming *serial* process.  A workload with a longer runtime will make it much easier to work out where you are running into an application issue, and where you are just running into the OS having a lot of work to do (a longer program should minimize the overall effect of the OS).

0 Kudos
Charles_C_Intel1
Employee
1,831 Views

Oops, sorry, didn't read clearly.  How much longer is your runtime in the threaded code now?

I agree with Robert - seeing your code would be useful.

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Aren't you oversubscribing with such a large number of threads? As Charles said, for a short-running piece of code you can accumulate an overhead related to the creation of 240 threads and their data structures, so the cores' hardware threads will partly spend their time creating OS threads for your program.

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Another question relates to your application and how efficiently it can be threaded.

0 Kudos
Nick_W_
Beginner
1,831 Views

There was a bug in the code that divides the workload amongst threads... so that explains a lot. Thanks for all your comments though.

Out of interest, we started all threads, halting them at a thread barrier until they were all ready, then released them and started a timer... The timer was stopped only after all threads had completed their workload and joined. The only synchronisation point between threads was a __sync_add_and_fetch. (The timing scheme is sketched below.)

With the bug fixed, we are seeing better performance for much larger inputs, pushing runtimes up to several seconds (60% CPU showing in top) - but there is still some kind of overhead that makes the threaded version look slower at shorter runtimes. Perhaps the join? This is for curiosity only, since our target application will involve streaming very large data volumes.
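For clarity, the timing scheme is roughly the following (a minimal sketch, assuming a pthread_barrier_t as the release point; the worker body is elided and the thread count is the 240 used above):

#include <pthread.h>
#include <sys/time.h>
#include <stdio.h>

static pthread_barrier_t g_ready;          // all workers wait here until released

static void* worker(void*)
{
    pthread_barrier_wait(&g_ready);        // hold until every thread has been created
    // ... per-thread share of the workload runs here ...
    return NULL;
}

int main()
{
    const int nThreads = 240;
    pthread_t tid[nThreads];
    pthread_barrier_init(&g_ready, NULL, nThreads + 1);   // +1 for the main/timing thread

    for (int t = 0; t < nThreads; ++t)
        pthread_create(&tid[t], NULL, worker, NULL);

    struct timeval t0, t1;
    pthread_barrier_wait(&g_ready);        // release the workers...
    gettimeofday(&t0, NULL);               // ...and start the clock

    for (int t = 0; t < nThreads; ++t)     // stop the clock only after every join
        pthread_join(tid[t], NULL);
    gettimeofday(&t1, NULL);

    printf("elapsed: %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}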

0 Kudos
Bernard
Valued Contributor I
1,831 Views

Can you profile your application with VTune?

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

>>The only synchronisation point between threads was a __sync_add_and_fetch.

I am assuming the __sync_add_and_fetch was used for a thread completion count barrier.

Try replacing the counter with an array of done flags (initialized to 0). This replaces the atomic read/modify/write with a simple atomic store. Your barrier code becomes:

volatile int DoneFlags[nThreads];
...
for(int i=0;i<nThreads;++i) DoneFlags[i] = 0;   // clear all flags before the workers start
...

DoneFlags[iThread] = 1; // each thread sets only its own flag (iThread == omp_get_thread_num())
for(int i=0;i<nThreads;++i) { while(!DoneFlags[i]) _mm_pause(); }   // then spins until all flags are set

If the barrier is reached repeatedly within a parallel region, consider changing the done flag to a trip count (and your test for done too).

DoneFlags[iThread] = iTrip; // iTrip is in the local scope of each thread, initialized to 1
for(int i=0;i<nThreads;++i) { while(DoneFlags[i] != iTrip) _mm_pause(); }
++iTrip;

I don't have my Xeon Phi yet, so I cannot confirm whether you can use byte-sized flags; a 32-bit flag should work, though. On Sandy Bridge, the write mask is capable of working in byte-sized units. Flag/trip-count checking can use wider units.

Jim Dempsey

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

Caution: packing the "done" flags at integer widths across an array indexed by HW thread will likely expose your code to significant "false sharing": manipulating an individual flag modifies and evicts the cache line holding its neighbours' flags, causing thrashing. (A padded layout is sketched below.)
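For illustration, one way to give each flag its own cache line (a sketch only, assuming the 64-byte line size; the spin-wait reuses the _mm_pause pattern from Jim's snippet):

#include <immintrin.h>   // _mm_pause

// Each flag is forced onto its own 64-byte cache line, so a store by one
// thread does not invalidate the line holding its neighbours' flags.
struct __attribute__((aligned(64))) PaddedFlag {   // sizeof == 64, one line per flag
    volatile int flag;
};

static PaddedFlag DoneFlags[240];   // one entry per thread

static void barrier_wait(int iThread, int nThreads)
{
    DoneFlags[iThread].flag = 1;                    // touch only my own line
    for (int i = 0; i < nThreads; ++i)
        while (!DoneFlags[i].flag) _mm_pause();     // scan: one line read per thread
}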

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

Robert,

The Xeon Phi has latency problems with XCHGADD (and other interlocked instructions). The direct write was suggested as a means to reduce the latency (though it is not a perfect solution). The user is free to experiment as to whether the flags are packed into adjacent locations or spread across individual cache lines. For the barrier I suggest packed, because the scan of all done flags can then be made with fewer cache-line loads. Though the writes are slower, they are made at skewed time intervals, so for all but the last thread the extra write latency is absorbed by spin-wait time.

One must make the pudding though.

Jim Dempsey

0 Kudos
Nick_W_
Beginner
1,831 Views

I can't post any code at the moment but this information might help pinpoint some problems:

- A large reference data file is loaded (161 MB). The code accesses this data in a fairly random order...

- A global input buffer of 1.8 MB is loaded

- A global results buffer of 8 MB is allocated

- The input is divided evenly amongst the threads.

- The threads work independently apart from when writing to the output buffer. The outbuf index is controlled by the __sync_add_and_fetch (a sketch of this pattern follows below). Apart from the thread barrier (which only exists for timing purposes), this is the only synchronisation point.

It's not easy to get CPU utilisation figures due to the short runtime, but it only seems to hit around 40% max.
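For concreteness, the output handoff is essentially this pattern (a simplified sketch; Result and the function name are illustrative placeholders, not our actual code):

struct Result { /* fields elided */ };

static volatile long g_outIndex = 0;        // next free slot in the shared results buffer

// Each thread reserves a contiguous range of output slots by atomically
// advancing the global index, then fills its range with no further locking.
static void emit_results(const Result* local, long count, Result* outbuf)
{
    long end   = __sync_add_and_fetch(&g_outIndex, count);   // returns the new index
    long first = end - count;                                 // reserved range is [first, end)
    for (long i = 0; i < count; ++i)
        outbuf[first + i] = local[i];
}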

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

If the amount of output per segment of input is the same, then you do not need a __sync_add_and_fetch to acquire a slot in the output buffer. (outOffset = (8/1.8)*inOffset*iThread; a small sketch follows.)

Your requirements may differ.
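Something like this (a sketch only; it applies when the output per input segment is fixed, with the 1.8 MB / 8 MB sizes quoted above standing in for the real values):

#include <stddef.h>

// With a fixed input-to-output ratio, each thread's output offset can be
// computed up front, so no atomic is needed to claim output space.
static void thread_ranges(int iThread, int nThreads,
                          size_t inputBytes, size_t outputBytes,
                          size_t* inOffset, size_t* outOffset)
{
    size_t inPerThread  = inputBytes  / nThreads;   // equal share of the input
    size_t outPerThread = outputBytes / nThreads;   // matching share of the output
    *inOffset  = (size_t)iThread * inPerThread;
    *outOffset = (size_t)iThread * outPerThread;
}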

Jim Dempsey

0 Kudos
Nick_W_
Beginner
1,831 Views

Thanks but the output size is not known beforehand.

We noticed that when timing individual threads, the thread times were averaging 0.004s (the longest being about 0.006s), so there is some large overhead associated with joining the threads, since the total runtime is currently 0.012 seconds. (I made recent improvements; the main change was forcing CPU affinity per thread.)
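For reference, forcing per-thread affinity was done with something like this (a sketch using pthread_setaffinity_np; the mapping from thread index to logical CPU is illustrative, not our exact scheme):

#define _GNU_SOURCE       // for pthread_setaffinity_np / CPU_SET
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU.  The coprocessor exposes 4
// logical CPUs per core, so the caller's thread-index-to-CPU mapping decides
// how HW threads share a core.
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}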

0 Kudos
Nick_W_
Beginner
1,831 Views

Trying to use the VTune 2013 GUI...

I get an error "Error: Problem accessing the sampling driver"

I have followed the instructions to install the driver but it didn't help - I noticed that I got a file not found error when running sep_micboot_create.sh (I also tried the older version under the vtune_amplifier_xe directory and got the same problem).

It could not find "sep3_8-k1om-2.6.38.8-*.ko"

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

sep_micboot_create.sh has been a ghost for at least several releases--at one point it just echoed a message indicating its deprecation.  My usual advice for dealing with this error: uninstall/ream/reinstall.  That is, run the uninstall.sh script in the VTune Amplifier installed location (nominally /opt/intel/vtune_amplifier_xe is a symlink to the current install).  Then rmdir /opt/intel/vtune_amplifier_xe_2013 and any "sep" directories in /opt/intel/mic (to ensure no old .ko files are left around to confuse things).  Then reinstall VTune; the installer should ask whether you want it to install the coprocessor driver--let it do so if possible.  The VTune Amplifier installation should automatically do a service mpss restart (thus the message it spits out that this may take a little time).  If that sequence doesn't clear the message you reported, yours would be the first case.

If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings

0 Kudos
Nick_W_
Beginner
1,831 Views

Thanks Robert.

I tried this but the install complains that we don't have a supported version of Linux (uname is just showing GNU/Linux, but I think it's CentOS)

Everything was reinstalled but no mpss restart was done (I did one manually)

I disabled the NMI watchdog timer on the host as requested (the mic doesn't have one)

No joy... could this unsupported OS problem also cause performance issues when running native MIC programs?

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,831 Views

>>If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings.

Good advice - don't use __sync_add_and_fetch on the local (core) team, or else this will post ring traffic too. Use the multiple flags (4) and have one team member be the one performing the __sync_add_and_fetch on the global barrier.
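A sketch of that two-level arrangement (one-shot; resetting the flags for reuse is omitted, and the team layout assumes threads are pinned 4 per core so that iThread/4 identifies the core):

#include <immintrin.h>   // _mm_pause, same spin-wait as in the snippets above

enum { kThreads = 240, kPerCore = 4, kCores = kThreads / kPerCore };

static volatile int  g_teamFlag[kCores][kPerCore];   // core-local check-in flags
static volatile long g_coresDone = 0;                 // how many cores have fully arrived

static void two_level_barrier(int iThread)
{
    int core = iThread / kPerCore;
    int lane = iThread % kPerCore;

    g_teamFlag[core][lane] = 1;                       // plain store, local check-in

    if (lane == 0) {                                  // one representative per core
        for (int j = 0; j < kPerCore; ++j)
            while (!g_teamFlag[core][j]) _mm_pause();
        __sync_add_and_fetch(&g_coresDone, 1);        // single atomic per core, not per thread
    }
    while (g_coresDone != kCores) _mm_pause();        // everyone waits until all cores are in
}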

Jim Dempsey

0 Kudos
robert-reed
Valued Contributor II
1,831 Views

If it's just CentOS, that should in itself not be a problem.  We have Intel Xeon Phi coprocessors on CentOS machines in our lab that work fine.  Is the warning you get just that, a warning, or does it actually abort the install?  And after the service mpss restart, are you then able to run a micinfo that reveals details from the coprocessor side (rather than getting a bunch of fields marked "NotAvailable")?  With mpss service running, what happens when you run "miccheck"?

0 Kudos
Nick_W_
Beginner
1,768 Views

The install will proceed despite the 'unsupported OS' and the drivers are reported as installed. I then run the script to set up the environment, start amplxe-gui, and set it up to run my native mic code over ssh.

When I run a hotspot analysis, I can see my program run to completion in the terminal window, but then a message appears in the bottom box of the GUI saying that no results were collected and that the sampling driver may need to be restarted.

Micinfo is still reporting NotAvailable for all the "version" section fields

Miccheck reports OK for everything.

I know CentOS is just rebranded Red Hat - I was wondering if somehow fooling the installer into thinking it was Red Hat might be a good idea?

BTW - we are getting excellent performance now - the solution in our case was to greatly increase the number of threads (to mask slow memory accesses), but there are just one or two threads spoiling the party by running 5x slower - so it would be great to get VTune working to find out why...

0 Kudos