Directly calling mkl from python, and try to use more than one thread.

Roger_X_ · ‎06-14-2016

I am doing some sparse matrix calculation, and called mkl directly from python.

That worked, but only a single thread is used. When I use the top command, one of the cpu core has 100% usage, other cpu cores has about 0% usage.

How to make the mkl function use multiple threads?

I have tried setting the OMP_NUM_THREADS, MKL_NUM_THREADS, MKL_DOMAIN_NUM_THREADS environmental variables to 12.

The code also try to set number of mkl threads to 12 by mkl.mkl_set_num_threads(byref(c_int(num_cpu)))

Does the sparse matrix routines of mkl support multithreading calculation?

The mkl is the 2016 version.

Thank you.

The code is below:

from ctypes import *
import scipy.sparse as spsp
import numpy as np
import multiprocessing as mp

# Load the share library
mkl = cdll.LoadLibrary("libmkl_rt.so")


def get_csr_handle2(data, indices, indptr, shape):
	a_pointer   = data.ctypes.data_as(POINTER(c_float))
	ja_pointer  = indices.ctypes.data_as(POINTER(c_int))
	ia_pointer  = indptr.ctypes.data_as(POINTER(c_int))
	return (a_pointer, ja_pointer, ia_pointer, shape)


def get_csr_handle(A,clear=False):
	if clear == True:
		A.indptr[:] = 0
		A.indices[:] = 0
		A.data[:] = 0
	return get_csr_handle2(A.data, A.indices, A.indptr, A.shape)


def csr_t_dot_csr(A_handle, C_handle, nz=None):
	# Calculate (A.T).dot(A) and put result into C
	#
	# This uses one-based indexing
	#
	# Both C.data and A.data must be in np.float32 type.
	#
	# Number of nonzero elements in C must be greater than
	#     or equal to the size of C.data
	#
	# size of C.indptr must be greater than or equal to
	#     1 + (num rows of A).
	#
	# C_data    = np.zeros((nz), dtype=np.single)
	# C_indices = np.zeros((nz), dtype=np.int32)
	# C_indptr  = np.zeros((m+1),dtype=np.int32)

	(a_pointer, ja_pointer, ia_pointer, A_shape) = A_handle
	(c_pointer, jc_pointer, ic_pointer, C_shape) = C_handle

	trans_pointer   = byref(c_char('T'))
	sort_pointer    = byref(c_int(0))

	(m, n)          = A_shape
	sort_pointer        = byref(c_int(0))
	m_pointer           = byref(c_int(m))     # Number of rows of matrix A
	n_pointer           = byref(c_int(n))     # Number of columns of matrix A
	k_pointer           = byref(c_int(n))     # Number of columns of matrix B
	                                          # should be n when trans='T'
						  # Otherwise, I guess should be m
	###
	b_pointer   = a_pointer
	jb_pointer  = ja_pointer
	ib_pointer  = ia_pointer
	###
	if nz == None:
		nz = n*n #*n # m*m # Number of nonzero elements expected
			 # probably can use lower value for sparse
			 # matrices.
	nzmax_pointer   = byref(c_int(nz))
	 # length of arrays c and jc. (which are data and
	 # indices of csr_matrix). So this is the number of
	 # nonzero elements of matrix C
	 #
	 # This parameter is used only if request=0.
	 # The routine stops calculation if the number of
	 # elements in the result matrix C exceeds the
	 # specified value of nzmax.

	info = c_int(-3)
	info_pointer = byref(info)
	request_pointer_list = [byref(c_int(0)), byref(c_int(1)), byref(c_int(2))]
	return_list = []
	for ii in [0]:
		request_pointer = request_pointer_list[ii]
		ret = mkl.mkl_scsrmultcsr(trans_pointer, request_pointer, sort_pointer,
				    m_pointer, n_pointer, k_pointer,
				    a_pointer, ja_pointer, ia_pointer,
				    b_pointer, jb_pointer, ib_pointer,
				    c_pointer, jc_pointer, ic_pointer,
				    nzmax_pointer, info_pointer)
		info_val = info.value
		return_list += [ (ret,info_val) ]
	return return_list

def test():
	num_cpu = 12
	mkl.mkl_set_num_threads(byref(c_int(num_cpu))) # try to set number of mkl threads
	print "mkl get max thread:", mkl.mkl_get_max_threads()
	test_csr_t_dot_csr()

def test_csr_t_dot_csr():
	AA = np.random.choice([0,1], size=(12,750000), replace=True, p=[0.99,0.01])
	A_original = spsp.csr_matrix(AA)
	A = A_original.astype(np.float32).tocsc()
	A = spsp.csr_matrix( (A.data, A.indices, A.indptr) )

	A.indptr  += 1 # convert to 1-based indexing
	A.indices += 1 # convert to 1-based indexing
	A_ptrs = get_csr_handle(A)

	C = spsp.csr_matrix( np.ones((12,12)), dtype=np.float32)
	C_ptrs = get_csr_handle(C, clear=True)

	print "=call mkl function="	
	
	while (True):
		return_list = csr_t_dot_csr(A_ptrs, C_ptrs)

if __name__ == "__main__":
	test()

So far, numpy linked with mkl can use multiple threads in the following code without setting any environment variables.

import ctypes
mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
print mkl.mkl_get_max_threads()
import numpy as np

a = np.random.normal( 0,1, (100,1000))

while True:
        a.dot(a.T)

Ying_H_Intel · ‎06-15-2016

Hi Roger,

The sparse matrix should be multithreaded as MKL user guide mentioned:

OpenMP* Threaded Functions and Problems

All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.

How about your try bigger C matrix, like 10000 and see if there are openmp thread in it?

Best Regards,

Ying

Roger_X_ · ‎06-15-2016

I tried a bigger C matrix. Like 120x120. That use 9% of the 8Gb memory and still only use one cpu.

After that, I try 1200x1200, that nearly crashed the node. I ran top before the node froze, it gave the following

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

25462 rxu 18 0 13.7g 7.6g 648 R 57 98.2 0:25.04 python

It seems it is still using one cpu.

Ying_H_Intel · ‎06-17-2016

Hi Roger,

No sure if you know that . Intel provide Python package which have numpy and scripy, and mkl integration. You may refer to

https://software.intel.com/zh-cn/forums/intel-distribution-for-python

for the issue about scsrmultcsr, it’s threaded but for small sizes most of time could be spent on sequential part – I will ask related developer to check this..

Best Regards,

Ying

Roger_X_ · ‎06-17-2016

I manually compiled scipy and numpy linked with mkl on the machine before knowing that intel provides python package with numpy and scipy. Scipy's sparse matrix multiplication routine doesn't use mkl, and always allocates new memory for the result. Does the scipy package in intel's python package have better implementation of sparse matrix multiplication?

I would like to learn more about scsrmultcsr. Why the routine spends most of the on the sequential part? Is the calculation limited by memory bandwidth, implementation of parallelization, the cpu cache, or the data structure? For a sufficiently large problem size, what is the column and row dimension for the A and B sparse matrices input of scsrmultcsr? When does scsrmultcsr decide to use multiple threads?

Does the multiplication between coo matrices or sparse matrices in other formats have better parallization and/or use less memory for the multiplication?

Is there a better way to calculate A dot A.T where dot is matrix product and .T is transpose? Currently, I transpose A first, then calculate ((A.T).T) dot (A.T), which is same as A dot A.T. A in my actual problem has b rows and 750,000 columns. Every 750,000 entries have about 2000 entries of one and the remaining entries are all zeros. The sparsity of the matrix is 748/750. b is 9800 now and the larger the b the better. Each node only has 8G memory. The result is b rows x b columns.

Thank you.