I'm running my program on a single Xeon Phi in offload mode, but I see no performance improvement once the number of threads goes above 180. I'd like to try running my kernel function as two separate instances, each handling half of the data with 120 threads. Is there a way to run kernels concurrently on a single device? Thank you.
You need at least 2 threads per core to saturate the memory bandwidth on KNC (I assume that's what you're using). If the plateau you're seeing means the kernel is already bandwidth-bound, it's likely that you won't see better overall performance by running two kernels at once.
To answer your question: yes, it's possible using OpenMP tasks.
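Here is a minimal sketch of the task approach, not a tuned implementation. The names `kernel` and `run_two_kernels` are made up for illustration, and the kernel body (scaling an array) is only a stand-in for your real one. The Intel offload pragmas are left out, so this assumes the function is called from code that already runs on the coprocessor. It also assumes nested parallelism is allowed; without it each inner region runs on a single thread.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical kernel: doubles every element of a slice in place.
   It stands in for the real kernel from the question. */
static void kernel(float *data, size_t n, int threads)
{
    /* Inner thread team. It only receives `threads` threads when
       nested parallelism is enabled (see run_two_kernels). */
    #pragma omp parallel for num_threads(threads)
    for (long i = 0; i < (long)n; ++i)
        data[i] *= 2.0f;
}

/* Runs the kernel on both halves of `data` as two OpenMP tasks,
   so the two halves can be processed concurrently. */
void run_two_kernels(float *data, size_t n, int threads_per_kernel)
{
    assert(data != NULL && threads_per_kernel > 0);
    size_t half = n / 2;

#ifdef _OPENMP
    /* Allow two active levels: the outer 2-thread team and the inner teams. */
    omp_set_max_active_levels(2);
#endif

    /* Outer team of two threads; one of them creates both tasks. */
    #pragma omp parallel num_threads(2)
    #pragma omp single
    {
        #pragma omp task
        kernel(data, half, threads_per_kernel);

        #pragma omp task
        kernel(data + half, n - half, threads_per_kernel);

        /* Wait until both halves are done before leaving the region. */
        #pragma omp taskwait
    }
}
```

With the numbers from the question, the call would be `run_two_kernels(data, n, 120)`. Whether this beats a single 240-thread kernel is something you'd have to measure.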