Re: mkl_sparse_?_ev vs dfeast_scsrev "guess" eigenvectors

Gagan · ‎03-31-2024

hi all,

i'm currently trying to tackle a very large matrix (~250k x 250k), which is symmetric with a non-unit diagonal.

for this matrix, i am certain the smallest eigenvector will always be zero with the unit eigenvector.

with dfeast_scsrev, i have the ability to encode this knowledge into the solver by setting the last ~250k values of the output array to 1, (250k-dimensional eigenvector [lol, still carrying this conversation over ten years later]).

my question is whether i can do something similar with mkl_sparse_?_ev; my guess is that i cannot, but i was hoping this could be added in future releases since it would help speed up the solution.

we know that:

if fpm[2] = 2, then this behaviour MUST be enabled (even though, i assume it is not because we are calling dfeast_scsrev through the mkl_sparse_?_ev wrapper)

so, in my opinion this is a worthwhile, arguably compulsory, feature.

thank you,

former black belt and MKL pioneer since 2013

Gagan · ‎03-31-2024

just as i had done about a decade ago, posting a test case for the team to fix a critical bug in their MKL solver, i am now providing another cutting-edge test case for them so they can set the tone for the mathematical optimisation subdiscipline.

this bad boy is 229938 x 229938

i have included the M, iM, and jM just like last time.

what's the goal? BEAT MATLAB.

for smaller matrices, sparse_?_ev can be very fast. however, i think the team doesn't get many cases where someone is using ~10^5 square matrix and trying to extract >10^2 eigenvectors. you know--being cutting edge for over a decade, that.

give it a go on one of your test clusters and let me know how it performs. even with 128GB memory and 2x6cores, my 2012 mac pro is getting long in the tooth.

set the number of eigenvectors to 163 (162 plus the bottom eigenvector which is unitary with eigenvalue 0)

Mark_L_Intel · ‎04-04-2024

Hello @Gagan,

can you attach the sample code that you used for testing?

Have you used Intel oneAPI compiler? How have you compiled your code?

How to run an executable (after I built a reproducer)?

What sparse matrix format do you use (e.g., CSR)?

Thank you.

Mark

Gagan · ‎04-04-2024

hi mark,

test case is in my original thread, here is the direct link. nothing has changed in terms of the program used to compute. i note that, surprisingly, the team has decided to fiddle with the fpm input parameters.
- at the time i posted this test case, fpm[4]=1 reflected the decision to use the "guess" of all ones as the smallest eigenvector (eigenvalue zero).
- the idea still applies, but there is no parameter in 2023.2, unless it is not documented.
- the parameter still exists in parallel studio's 2020.2.899 MKL as fpm[5]=1
  - you guys should stop screwing with these input parameters.
    - please.
      - this isn't what intel is known for--"changing things that work".
      - stop doing it in the future.
        you have 128 parameters and many are reserved. screw with those if you want to test functionality. don't change existing ones.
i use the "big guns" aka ICPC/ICC from parallel studio XE's 2020.2.899.
please consult that test case. it should still work and run for enough time that you get the same behaviour.
i can only use CSR man-- feast makes you.

this behaviour may only exist on macintosh platforms, but i can confirm it has nothing to do with the dual processor since both the macbook and mac pro are showing exactly 50% CPU load as maximum.

if i use SPARSE_EV with the krylov-schur decomposition, i get 100% CPU load. only the feast eigensolver gets 50% load.

also as a feature request: could you please add printout switch to fpm input parameters so we can see what is going on? it is helpful for people observing optimisation behaviour when producing solution.

let me know if you have any other questions.

Mark_L_Intel · ‎04-05-2024

Hello @Gagan,

First, thanks for pointing to the test case and answering my questions. I'd like to file an internal ticket on your behalf, but I may need to make the information more "crisp." In other words, please help define a specific reproducer if you could. I think you suggested a feature request (not a bug), so please specify what parameters (and in what functions) you would like to change. This should be specific to oneMKL version number -- please see next.
We don't support Parallel Studio anymore. Could you use the latest oneMKL 2024.1? If not, it would be nice to understand why you can't. And what latest oneMKL version you might consider using? Regarding compilers, the classic compilers (icpc and icc) have been deprecated. The recommendation is to use icx and icpx compilers supplied with oneAPI. I'd suggest downloading Intel oneAPI Base Toolkit and giving it a try. However, please see next regarding macOS support.
Per 2024.0 oneMKL Release Notes, "Intel® oneAPI Math Kernel Library (oneMKL) for macOS deprecated in release 2023.0 and will now be discontinued as of Intel ® oneMKL release version 2024.0 and later releases." If you use macOS exclusively, please take it into consideration.
In your last communication, you described a specific performance issue (50% load). Would it be possible to define a reproducer more specifically so we can reproduce it on our side?

Gagan · ‎04-05-2024

hi mark, no problem.

1. in terms of feature requests, i would like two things for 'mkl_sparse_d_ev'

the ability to print runtime status to screen, like dfeast_scsrev already does via input parameter fpm[0]=1
the ability to input a "subspace guess", like dfeast_scsrev already does via input parameter fpm[4]=1

2. i cannot, for obvious reasons, use the latest oneMKL 2024.1 at this time. i do plan to move away from the mac platform (to windows) in the near future, but the cost associated with the hardware setup i would want is high, and therefore it has to wait. i was hoping to speak with someone at intel about sponsoring my work, which i would acknowledge in my future activities. i opened a csr request about this, but haven't heard back. is it possible you could put me in touch with the appropriate department? my need for a robust commercial solution is an incentive to leave this platform sooner, rather than later.

3. i understand and am aware that macos is no longer supported, and i completely support the decision. intel is not in the business of providing welfare to companies who foolishly decide to become a competitor without any comparable experience.

4. if you run that test case as i described in my 10-year old post (make sure you download iM, jM and M.txt) with test.c, you should observe the behaviour i'm speaking of. to be precise, the behaviour is as follows:

when using dfeast_scsrev with the test case, you will observe only a 50% overall CPU load (at least in macos). the version of MKL does not matter. it has been this way for over ten years.
when using mkl_sparse_d_ev:
- pm[2]=1 (krylov-schur method), the algorithm properly uses nearly 100% CPU
- pm[2]=2 (FEAST), the algorithm only uses about 50% CPU.

so i have isolated the CPU load issue to the FEAST eigensolver in particular, but i'm not sure if it's an issue only on mac or it affects other platforms as well. i am surprised no one has raised this question otherwise, so it may be helpful for your team to use the test case on the other operating systems to observe whether the FEAST eigensolver uses 100% CPU load.

i understand mac is no longer supported, but i do recommend you run the test case on any mac machine you may have kicking around to see what i'm talking about. i'm positive it's an issue on mac, but i don't see how it wouldn't effect other platforms since, as i've noted above, the newer krylov-schur implementation uses 100% CPU load.

i hope my aims, requests, and desires are clearer. thanks for taking the time, i am more than happy to engage further.

x86 WILL take over the world, AGAIN!

Gagan · ‎04-06-2024

hi mark,

could i also make another request, with respect to the FEAST eigensolver?

i see a lot of the input parameters (or switches) are not implemented. i'm not sure how much more work this would be for the team, but i think it would be nice to have more of the features intended by the author.

for example, if we consult page 5 source document referenced in the bibliography, it would be nice to add (using one-indexing, by the way):

fpm[14]=2, i.e. the stochastic estimates of the number of eigenvalues on the interval. the implementation only offers fpm[14]=1, which stops after a single contour integration.
fpm[16]=1, that enables trapezoidal integration for a symmetric eigenproblem, which is very important to me. i am a big fan of gauss, but gaussian integration is not what i need. right now processed images show gaussian quadrature (see attached image)

i've attached typical output from FEAST, in this situation it was 6 contours and fpm[14]=1, as i wanted to visually inspect it. maybe the team will get a shot of inspiration looking at it, who knows.

in reality the activations having a gaussian pattern is likely irrelevant, since "accuracy is accuracy", and it becomes much more subtle as time elapses. some practitioners make a habit of discarding the first handful of volumes to ensure the measurements are attuned.

when i use MATLAB, which seems to typically default to krylov schur or the-likes, it never shows such patterns because of the rigorous nature of the input data being decomposed; that is, the input represents the correct abstract theory of lebesgue integration, and i therefore don't need gaussian quadrature to bail me out.

pretty cool the world is gonna see the true power of an x86, huh! lol i just pop stitches imagining trying to do this on an ARM lol. VFP, NEON, etc. christ.

i also love how this forum doesn't have smileys, which i endorse because "x86 ain't no game"

Gagan · ‎04-12-2024

hi @Mark_L_Intel ,

sorry to bug you but i've made a new test case that should work on all platforms.

this case should allow you to test any of the three eigensolvers, using "feast", "ev" [krylov schur], and "svd" as the first argument, and then the integer-value d for the second argument.

this test case is simply an updated version of the old one, where the only difference is the input files M.txt and jM.txt are now binary files instead of text files, to ensure the precision of the matrix is captured.

i have spent the past week inspecting my main program and i'm very satisfied with the result, even though mkl_sparse_d_ev may complete a little faster than i'd like when using fpm[7]=1 and fpm[8]=1, but it looks right so who am i to argue?

i've attached the updated test.c file, and also an updated test case (iM.txt, M.txt, jM.txt) that i've verified works well with mkl_sparse_d_ev (and therefore should work with feast too).

i compiled it using:

icpc -stdlib=libc++ -std=c++20 -DMKL_ILP64 -o testCase test.c -I/opt/intel/oneapi/mkl/latest/include -L/opt/intel/oneapi/mkl/latest/lib -lmkl_intel_ilp64 -lmkl_core  -lpthread -lm -lmkl_intel_thread

i thought i could update the files posted, but for some reason i can't.

anyways, the NEW files are attached. i would like it if you could replace the files i've uploaded in this thread with these.

i don't want people getting confused.

Mark_L_Intel · ‎04-15-2024

Hi @Gagan,

I modified your last sample and built it with the latest 2024.1 oneAPI:

changed "int" parameters of dfeast_scsrev to MKL_INT to compile with ILP64. All my changes have a "//mil" comment at the end of the lines for easy reference/search. The sample test.cpp is attached.
changed test.c extension to test.cpp to compile with oneAPI "icpx" compiler (otherwise compiler complained about <thread> header)
The flags needed can be picked up by using oneMKL link Line Advisor on Ubuntu (with OpenMP):
```
icpx test.cpp -DMKL_ILP64 -qmkl-ilp64=parallel -O2 -o test.x
```
the build was done with the latest oneAPI compiler and oneMKL 2024.1.
I experimented with different numbers of threads. So, I commented out the NUM_THREADS definition at the beginning of the sample and defined it later manually. I assume you divide by 2 below because you work on a hyperthreaded system and want to run with a number of physical cores:
```
mkl_set_num_threads(NUM_THREADS/2);
```
Ran executable on Ubuntu-based system (output out-th16.txt is attached):
```
./test.x 900
```
You could use Intel Developer Cloud, https://devcloud.intel.com/oneapi/get_started/, to repeat the above experiment (or you could use any Ubuntu system with enough memory after installing oneAPI).
I see some idle threads while running the above test. Let me consult internally.
fpm[5] is described here: fmp parameters. You can also download documentation: onemkl_docs_2024.1.0.zip
All above uses dfeast_scsrev. Can you supply a sample with mkl_sparse_d_ev?
I discussed your post with developers. I might be able to file feature request(s).

Mark

Mark_L_Intel · ‎04-17-2024

Hi @Gagan,

I realized that I made a mistake – while moving files around, I ended up with the previous version of your sample and modified it. Let me work on your last version of test.c. I will update my last post above after that.

Mark

Gagan · ‎04-18-2024

hi mark,

thanks for being so prompt on this. let me know your findings.

1. acknowledged

2. acknowledged

3. what does qmkl-ilp64=parallel do exactly? the manpage doesn't seem very helpful. if i am linking all of the libraries already, and using the DMKL_ILP64, then this flag shouldn't make any difference, right? i got strange errors using this flag with icpc 2023.2, like:

internal error: 04010002_12240

internal error: 04010002_12241

i assume this is for non-mac systems and icpx users.

4. acknowledged

5. that NUM_THREADS/2 is only there because i was investigating whether the dfeast_scsrev solver was effected by the number of threads. it is not. i have a hyperthreaded system.

6. acknowledged

7. eh, not much of a cloud guy. thanks though. i have enough memory on my system, but i am curious as to how much AVX instructions would help my computations.

8. yes, this is what i was saying. it seems the feast solver knows the "real" number of cores and ignores hyperthreading, unlike mkl_sparse_d_ev when fpm[2]=1, which uses all of the threads.

9. the feature is fpm[4] =1, as is documented here

it would be nice to have this feature in mkl_d_sparse_ev. as it stands, we cannot use mkl_sparse_d_ev with a subspace guess, even though feast supports it (krylov-schur does too, but not in MKL, see MATLAB docs here describing 'starting vector' v0) because mkl_sparse_d_ev is presumably a wrapper to dfeast_scsrev

10. you're "already working on it"

11. thanks

i think the important thing for mkl_sparse_?_ev is to not lose the feast subspace "guess" functionality offered by the ?feast_* interface.

it would be nice if the team would consider providing a direct interface to the feast extremal eigenvalue problem outside of mkl_sparse_?_ev with fpm[1]=2.

lastly (correct me if i'm wrong): i think it's important to tell users that FEAST assumes the subspace guess is provided in ROW MAJOR form because the output is in COLUMN MAJOR (first eigenvector/column is index 0->length-1), which may lead users to believe the subspace guess must be input as column major (which is not correct, and feast would complain "subspace too small" if i did that).

for example in my situation:

i want 'd' eigenvectors, so i have to solve for d+1 eigenvectors (where the d+1 eigenvector is the unit eigenvector with eigenvalue 0).

assume each eigenvector has nrRows elements, and we want d+1 eigenvectors.

by setting every (i+1)*d element to 1, i am telling the solver that the *smallest* eigenvector should have value 1.

any other guess will not work:
- i.e. 0->nrRows [i.e. first eigenvector in column major format]:
  - for(int i=0; i<nrRows;i++)
    - rawOutput[i] = 1.0 //if the first nrRows elements are 0, and the input subspace is column-major, then this should be correct
- or [1,...,d+1 element of row 1; ...; 1, ..., d+1 element of row nrRows]:
  - for(int i=0;i<nrRows;i++)
    - rawOutput[i*(d+1)] = 1.0 //if the first element of every row in a matrix is stored in row-major with eigenvalues output in ascending order, then this would be the way to set the subspace guess to 1.0 for the smallest eigenvector
- or [0,...,nrRows*d,1,...,nrRows*(d+1)] (column-major, assuming eigenvectors are returned in descending order of eigenvalue)
  - for(int i=0;i<nrRows;i++)
    - rawOutput[nrVoxels*d + i ]= 1.0; //setting the last eigenvector to 1.0, assuming column-major order with eigenvectors in order of descending eigenvalue
- all of these attempts result in FEAST outputting far less than what i am asking for.
ONLY this works:
- for(int i=0; i< nrRows; i++)
  - rawOutput[(i+1)*d] = 1.0
- but this may not be right, unless we expect the input subspace to be in row-major order, with eigenvectors in descending order. this is completely opposite of the output format.

if i use fpm[4] =0, it will say M0 is too small and stop iterating.

is expecting to output an infinite number of eigenvectors, since that's what the underlying theory is stating (these d+1 eigenvectors are points on a smooth manifold).
that shouldn't stop it from iterating further, but it does.

this is probably a little complicated, and i'm sorry about that, but i want to really see if FEAST is better than krylov-schur.

for real mathematics, it doesn't seem like FEAST is useful at all.
must be why they call the alleged (key word here) use of mathematics to explain natural phenomena as "physics" and not mathematics xD hahahaha

all of the stuff i'm describing can be tested with the program i've provided. so take your time. i'm happy i got krylov-schur working better than matlab so please don't feel rushed

Mark_L_Intel · ‎09-05-2024

Hello @Gagan

As I mentioned in your other post, I combined your inputs from this current post and your other post comments into the internal feature request ticket, and I had conversations with the development team regarding your suggestions. In case of any updates from the development team, I will post them here.

Mark_L_Intel · ‎09-06-2024

Hello @Gagan,

I received a message from our developers that it is in their plans to work on your feature requests starting this fall.