No doubt using MPI on larrabee isn't "optimal", but given the opportunity to acquire a decent speedup with little effort and at low cost, it would still be interesting.
For example, the approach by NVIDIA - CUDA - is fairly disastrous. Everything would need to be rewritten, it's very complicated, and it's not very stable. They promise a great speedup, but it's simply not worth it.
We have code here that we are happy with - maintainable, extensible. We just need to be able to run it as it is. Having a C++ compiler, OpenMP and MPI would make many people here happy.
MPI on Larrabee is certainly an interesting idea -- it opens up, in principle, access to a portfolio of existing code, and gets us experimenting with a message-passing model (in contrast to shared-state threading, such as OpenMP or pthreads).
Analogous work is already out there: MPI runs just fine on multi-core CPU systems, typically with one MPI process per core. While modern MPI implementations recognize and exploit the shared memory - and are thus reasonably efficient at message passing in that context - there are reports of scaling challenges for typical applications. In particular: MPI codes moving in from clusters are usually designed with large node memories in mind; downsizing those memory expectations to a single "node" presents obvious design constraints. Taking things still further, down to the local memory present on a single Larrabee processor, will amplify those constraints.
I wonder if this was the issue wateenellende had in mind when anticipating what might or might not be "optimal"?
If we assume that the program's data fits into the memory of one node containing a Larrabee, then the Larrabee certainly has computational power in abundance. The question is whether we can get the data there fast enough to keep all those resources occupied.
On a cluster it is already problematic to keep one CPU busy because of memory bandwidth; I would expect it to be 24 or 48 times worse on a Larrabee.
I guess this question is equally valid for OpenMP or pthreads.
Agreed, managing data locality is going to be a challenge for certain workloads, and I also agree there are likely lessons to be harvested from the cluster community.
One interesting place to start -- "interesting" in the sense that we'll have a better chance of succeeding -- would be codes which were redesigned to accommodate cluster interconnect B/W limitations. Any suggestions?
One interesting place to start -- "interesting" in the sense that we'll have a better chance of succeeding -- would be codes which were redesigned to accommodate cluster interconnect B/W limitations. Any suggestions?
This application may be interesting: http://openlb.org/
This is the so-called "Open Lattice Boltzmann" package. Lattice Boltzmann is a relatively new method of doing fluid dynamics. From a computational standpoint, it is very closely related to cellular automata, except that the calculations are on real numbers, not integers. Each iteration of the "machine" requires two steps: some internal computations per cell, and then exchanging information with the neighbouring cells.
These cellular automata/lattice Boltzmann applications are almost embarrassingly parallel, so Larrabee would probably be the perfect platform for them. Please do note that I am not involved in this project, so I might have some of the details wrong, but anyway, here goes:
- It is open-source and has an active community, so it's easily available
- There are 3 versions: serial, OpenMP and MPI, so it's usable as a benchmark
In the serial and MPI versions, the two steps of each iteration can be done in one pass over the dataset; with OpenMP, each iteration requires two passes over the dataset. To my knowledge (i.e. according to what I've heard), the OpenMP version is hardly any faster than the serial version, because any computational gain is offset by the increased memory-bandwidth usage.
The question is then: how would the 3 versions compare on a Larrabee?
Thanks! OpenLB is indeed an interesting case; I've finally got some bandwidth to look into this, and am compiling the codes on some manycore systems here.