Hi Tim P.
Never crossed my mind that it would not. Do you know something, or have something to share?
Disclaimer: I am squeaky new to KNL, so I don't know beans and may very well be embarrassing myself.
Also, I had no idea that a Windows Subsystem for Linux existed. Is this a real thing, or are you joking with me?
That would be harsh ;-)
Hi Loc N
Went to the page you pointed to, but it seems the page is not properly updated (or I missed something).
It lists three Linux variants as supported OSes. Then, further down, there is a link for getting Xeon Phi processor software for Windows.
But I have no idea what this really means. Can you please offer some pointers so I can get informed on what I might be getting myself into?
(The last thing I need is an offloading operational model)
Thank you both for your help,
From my understanding of the web pages, you have two requirements:
a) Have an operating system that supports KNL (IOW supports context save/restore of 512-bit vectors and the KNL APIC/CPUID)
b) Have a compiler that can (optionally) generate AVX-512 instructions
The literature states that Windows 10 will support the KNL. Note, I use KNL with Linux CentOS 7.2, so I cannot say if there are any quirks.
I do know that Windows affinity management is different from that on Linux. Windows has stuck with a 64-bit affinity mask: on systems with more than 64 logical processors, the logical processors are partitioned into groups containing no more than 64 logical processors each. This shouldn't affect your programming unless you are directly manipulating a thread's affinity (e.g., you have to specify the group and then the affinity bit position within that group).
The offload programming model for the Intel® Xeon Phi™ processor is also supported. From what I know, this feature, Offload over Fabric, is only available on Linux for now.
To use Offload over Fabric, you need a host machine (Intel® Xeon® processor-based) and a target machine (Intel® Xeon Phi™ processor). You then connect the two machines using a fast fabric interconnect such as the Intel® Omni-Path Architecture. You run your code on the host machine, and the parallel portion of the code is offloaded to the target machine. More information can be found in the “Intel® Xeon Phi™ Processor Offload Over Fabric User’s Guide” document.
This is slightly off topic, but related. If you permit offloading from one KNL (as host) to another KNL (as target), then why not permit offloading from one Xeon (as host) to another Xeon (as target)? IOW, offloading as an alternative to MPI. Perhaps this is already supported but not documented or commonly known.
thank you both for trying to respond.
What I am looking for is a genuine Windows machine (one that would run Excel and Visual Studio) that uses KNL as the main processor (not a co-processor).
I do not want to offload anything; I do not even want to know offloading exists. As far as I am concerned, this cannot possibly be the mandate of a library developer. For starters, it would result in a non-portable code base.
I do appreciate the effort, but I am still lost, very likely because of my total ignorance on the matter.
To be concrete, I read in here:
(pls, this is not an attempt to place an ad; I have absolutely no affinity with the shop - pun intended ;-) )
"The Xeon Phi x200 chips in LGA3647 form-factor can run as a host, directly with an operating system on board (including Windows Server 2016) which is an upgrade over the older Xeon Phi parts which only ran as co-processors on the PCIe bus. "
Also, what is important to know is: what would be the impact on Excel and Visual Studio of the much lower speed of these processors compared to what is customarily used in a system? Meaning, is it something one can live with, or is it excruciatingly slow?
TIA for your help
KNL is only available as a host processor at this time.
The KNL, when running single-threaded scalar applications (like significant portions of the compilers and the VS IDE), is relatively slow. Depending on the model of KNL, it runs at about 1 GHz, as opposed to the 2.5-3.5 GHz of a Xeon host. The relative speed difference will vary depending on the application (threadedness as well as vector utilization).
What is your intended use of the system?
What I am interested in is a C++ code base with Excel as a front end, developed using VS2015 and the Intel compilers.
I am not sure the VS compiler is single-threaded as you say; there are flags to do parallel builds, which I typically use to delegate different compilation units to different processors.
I also do not know that the navigation and find utilities are single-threaded either. Do you know this to be the case?
Finally, I was hoping more for some type of hands-on experience, rather than an (albeit educated) guess.
Thank you for trying to help, P-
From my experience, having both a KNL system and a Xeon E5-2620v2 available, both running CentOS 7.2: if I am in the early phase of software development of a large project that involves Edit, Build, Edit, Build, ..., Debug, Edit, Build, Debug, ..., Edit, Build, Profile (e.g. VTune), Edit, Build, Profile, ..., that sequence is best performed on the E5-2620v2. At the point in development where you need to test large numbers of threads and/or highly vectorized code, it is best to transition over to the KNL system.
>>What I am interested in is c++ code base with excel as a front end developed using vs2015 and intel compilers.
This does not tell me (us) the computational requirements of your C++ code. Is it highly parallelizable? Is it highly vectorizable? If not, then to what extent? Please explain the computational requirements of the C++ code: what is the runtime of the computation section versus the review time in Excel? (Provide information on the system you used to obtain this data.)
Thank you for your reply.
My code has to do with numerical analysis and is vectorized (in the sense of using BLAS/LAPACK with appropriately aligned containers), which covers more than 80% of my needs (the only reason the remaining 20% is not is that I was waiting for some portable way of doing these things).
It consists of Monte Carlo simulations and PDE solvers.
Excel is used as a front end, in the sense of (manually) recalculating various scenarios. I do expect a penalty because of the broker between Excel and my library, but that is a small fraction of the job. However, if on the other hand it takes forever to navigate within it, it is not going to work, in the sense that no potential client will buy into it; actually, this is something the KNL architects might want to think about.
Unless the focus is totally elsewhere, in which case I would greatly appreciate it if someone told me that KNL is not meant for Windows.
However, even on Linux, I cannot imagine that people are expected to work at 1/3 of the speed they are accustomed to.
All the best, P-
PS: a work model where development happens on one machine and execution on another is far from ideal.
Have you reached the point in development of your software to run it on a desktop or workstation (4 to 16 cores)?
If so, have you run VTune to profile?
Have you run a scaling test using a desktop or workstation?
Have you profiled the scaling test?
From my little understanding of Monte-Carlo simulation, it relies heavily on a random number generator. This means it is imperative that you use a highly parallel and vectorizable random number generator.
When you run the scaling test, keep an eye out for how (or if) the random number generator bottlenecks the program.
From the scaling test (assuming you have a sufficient number of logical processors), you might be able to extrapolate and then estimate the curve for the 64-core/(128/192/256)-thread scenario, bearing in mind the uncertainty of when you may hit a memory bandwidth wall and/or barrier issues.
>>My code has to do with numerical analysis and is vectorized (in the sense of using blas/lapack with appropriately aligned containers)
I assume you will be using MKL to provide these functions. You should be aware that if your program is designed to use the parallel version of MKL, this also means that your program is serialized between calls to MKL. When this serialization causes a bottleneck, see if your code can be restructured to use the serial version of MKL... in parallel. IOW, move the parallelization out a level. This can improve parallelism. As to which technique works best, you will have to experiment (this does mean you will need two implementations of your outer-level code).
I appreciate the hints, though I am not sure about the tone.
Since you are asking: I do use MKL, parallel for the big tasks and serial when dispatching to TBB threads, and yes, I do experiment a lot.
As far as the RNG goes, I use MKL's SFMT, which I call once for every step (I do cross-sectional MC, not path-wise).
However -and I say this in a sincere fashion- I fail to see the relevance of these points to my question.
Thank you for the info, P-
PS: Btw, it really depends on what you do with the Monte Carlo simulation, but generation of random numbers is a small fraction of your time.
Some people ortho-normalize the sample, which is a time consumer over many steps of multi-dimensional problems.
However, the bulk of the time is spent propagating vector random processes and working with the results (e.g., performing pricing tasks, which very often have a lot of complications).