<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I manually compiled scipy and in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096102#M23573</link>
    <description>&lt;P&gt;I manually compiled scipy and numpy linked with mkl on the machine before knowing that intel provides python package with numpy and scipy.&amp;nbsp;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Scipy's sparse matrix multiplication routine doesn't use mkl, and always allocates new memory for the result.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Does the scipy package in intel's python package have better implementation of sparse matrix multiplication?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I would like to learn more about scsrmultcsr. W&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;hy the routine spends most of the on the sequential part?&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Is the calculation limited by memory bandwidth, implementation of parallelization, the cpu cache, or the data structure? For a sufficiently large problem size, what is the column and row dimension for the A and B sparse matrices input of scsrmultcsr? When does scsrmultcsr decide to use multiple threads?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Does the multiplication between coo matrices or sparse matrices in other formats have better parallization and/or use less memory for the multiplication?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Is there a better way to calculate A dot A.T where dot is matrix product and .T is transpose? Currently, I transpose A first, then calculate ((A.T).T) dot (A.T), which is same as A dot A.T. A in my actual problem has b rows and 750,000 columns. Every 750,000 entries have about 2000 entries of one &amp;nbsp;and the remaining entries are all zeros. The sparsity of the matrix is 748/750. b is 9800 now and the larger the b the better. Each node only has 8G memory. The result is b rows x b columns.&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 17 Jun 2016 13:11:00 GMT</pubDate>
    <dc:creator>Roger_X_</dc:creator>
    <dc:date>2016-06-17T13:11:00Z</dc:date>
    <item>
      <title>Directly calling mkl from python, and try to use more than one thread.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096098#M23569</link>
      <description>&lt;P&gt;I am doing some sparse matrix calculation, and called mkl directly from python.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;That worked, but only a single thread is used. When I use the top command, one of the cpu core has 100% usage, other cpu cores has about 0% usage.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;How to make the mkl function use multiple threads?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I have tried setting the&amp;nbsp;&lt;/SPAN&gt;OMP_NUM_THREADS,&amp;nbsp;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; line-height: 16.8px;"&gt;MKL_NUM_THREADS,&amp;nbsp;MKL_DOMAIN_NUM_THREADS environmental variables to 12.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;FONT color="#666666" face="Arial, Tahoma, Helvetica, sans-serif"&gt;&lt;SPAN style="font-size: 14px; line-height: 16.8px;"&gt;The code also try to set number of mkl threads to 12 by&amp;nbsp;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;SPAN style="font-family: Consolas, &amp;quot;Lucida Console&amp;quot;, Menlo, Monaco, &amp;quot;DejaVu Sans Mono&amp;quot;, monospace, sans-serif; font-size: 13.008px; line-height: 19.512px;"&gt;mkl.mkl_set_num_threads(byref(c_int(num_cpu)))&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Does the sparse matrix routines of mkl support multithreading calculation?&lt;/P&gt;

&lt;P&gt;The mkl is the 2016 version.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Thank you.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;The code is below:&lt;/P&gt;

&lt;PRE class="brush:python;"&gt;from ctypes import *
import scipy.sparse as spsp
import numpy as np
import multiprocessing as mp

# Load the share library
mkl = cdll.LoadLibrary("libmkl_rt.so")


def get_csr_handle2(data, indices, indptr, shape):
	a_pointer   = data.ctypes.data_as(POINTER(c_float))
	ja_pointer  = indices.ctypes.data_as(POINTER(c_int))
	ia_pointer  = indptr.ctypes.data_as(POINTER(c_int))
	return (a_pointer, ja_pointer, ia_pointer, shape)


def get_csr_handle(A,clear=False):
	if clear == True:
		A.indptr[:] = 0
		A.indices[:] = 0
		A.data[:] = 0
	return get_csr_handle2(A.data, A.indices, A.indptr, A.shape)


def csr_t_dot_csr(A_handle, C_handle, nz=None):
	# Calculate (A.T).dot(A) and put result into C
	#
	# This uses one-based indexing
	#
	# Both C.data and A.data must be in np.float32 type.
	#
	# Number of nonzero elements in C must be greater than
	#     or equal to the size of C.data
	#
	# size of C.indptr must be greater than or equal to
	#     1 + (num rows of A).
	#
	# C_data    = np.zeros((nz), dtype=np.single)
	# C_indices = np.zeros((nz), dtype=np.int32)
	# C_indptr  = np.zeros((m+1),dtype=np.int32)

	(a_pointer, ja_pointer, ia_pointer, A_shape) = A_handle
	(c_pointer, jc_pointer, ic_pointer, C_shape) = C_handle

	trans_pointer   = byref(c_char('T'))
	sort_pointer    = byref(c_int(0))

	(m, n)          = A_shape
	sort_pointer        = byref(c_int(0))
	m_pointer           = byref(c_int(m))     # Number of rows of matrix A
	n_pointer           = byref(c_int(n))     # Number of columns of matrix A
	k_pointer           = byref(c_int(n))     # Number of columns of matrix B
	                                          # should be n when trans='T'
						  # Otherwise, I guess should be m
	###
	b_pointer   = a_pointer
	jb_pointer  = ja_pointer
	ib_pointer  = ia_pointer
	###
	if nz == None:
		nz = n*n #*n # m*m # Number of nonzero elements expected
			 # probably can use lower value for sparse
			 # matrices.
	nzmax_pointer   = byref(c_int(nz))
	 # length of arrays c and jc. (which are data and
	 # indices of csr_matrix). So this is the number of
	 # nonzero elements of matrix C
	 #
	 # This parameter is used only if request=0.
	 # The routine stops calculation if the number of
	 # elements in the result matrix C exceeds the
	 # specified value of nzmax.

	info = c_int(-3)
	info_pointer = byref(info)
	request_pointer_list = [byref(c_int(0)), byref(c_int(1)), byref(c_int(2))]
	return_list = []
	for ii in [0]:
		request_pointer = request_pointer_list[ii]
		ret = mkl.mkl_scsrmultcsr(trans_pointer, request_pointer, sort_pointer,
				    m_pointer, n_pointer, k_pointer,
				    a_pointer, ja_pointer, ia_pointer,
				    b_pointer, jb_pointer, ib_pointer,
				    c_pointer, jc_pointer, ic_pointer,
				    nzmax_pointer, info_pointer)
		info_val = info.value
		return_list += [ (ret,info_val) ]
	return return_list

def test():
	num_cpu = 12
	mkl.mkl_set_num_threads(byref(c_int(num_cpu))) # try to set number of mkl threads
	print "mkl get max thread:", mkl.mkl_get_max_threads()
	test_csr_t_dot_csr()

def test_csr_t_dot_csr():
	AA = np.random.choice([0,1], size=(12,750000), replace=True, p=[0.99,0.01])
	A_original = spsp.csr_matrix(AA)
	A = A_original.astype(np.float32).tocsc()
	A = spsp.csr_matrix( (A.data, A.indices, A.indptr) )

	A.indptr  += 1 # convert to 1-based indexing
	A.indices += 1 # convert to 1-based indexing
	A_ptrs = get_csr_handle(A)

	C = spsp.csr_matrix( np.ones((12,12)), dtype=np.float32)
	C_ptrs = get_csr_handle(C, clear=True)

	print "=call mkl function="	
	
	while (True):
		return_list = csr_t_dot_csr(A_ptrs, C_ptrs)

if __name__ == "__main__":
	test()&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;So far, numpy linked with mkl can use multiple threads in the following code without setting any environment variables.&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:python;"&gt;import ctypes
mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
print mkl.mkl_get_max_threads()
import numpy as np

a = np.random.normal( 0,1, (100,1000))

while True:
        a.dot(a.T)
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Jun 2016 19:32:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096098#M23569</guid>
      <dc:creator>Roger_X_</dc:creator>
      <dc:date>2016-06-14T19:32:46Z</dc:date>
    </item>
    <item>
      <title>Hi Roger, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096099#M23570</link>
      <description>&lt;P&gt;Hi Roger,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;The sparse matrix should be multithreaded as MKL user guide mentioned:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;OpenMP* Threaded Functions and Problems&lt;/P&gt;

&lt;P&gt;All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;How about your try bigger C matrix, like 10000 and see if there are openmp thread in it?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 08:48:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096099#M23570</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2016-06-15T08:48:27Z</dc:date>
    </item>
    <item>
      <title>I tried a bigger C matrix.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096100#M23571</link>
      <description>&lt;P&gt;I tried a bigger C matrix. Like 120x120. That use 9% of the 8Gb memory and still only use one cpu.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;After that, I try 1200x1200, that nearly crashed the node. I ran top before the node froze, it gave the following&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; PID USER &amp;nbsp; &amp;nbsp; &amp;nbsp;PR &amp;nbsp;NI &amp;nbsp;VIRT &amp;nbsp;RES &amp;nbsp;SHR S %CPU %MEM &amp;nbsp; &amp;nbsp;TIME+ &amp;nbsp;COMMAND &amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;25462 rxu &amp;nbsp; &amp;nbsp; &amp;nbsp; 18 &amp;nbsp; 0 13.7g 7.6g &amp;nbsp;648 R &amp;nbsp; 57 98.2 &amp;nbsp; 0:25.04 python &amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;It seems it is still using one cpu.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 15:15:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096100#M23571</guid>
      <dc:creator>Roger_X_</dc:creator>
      <dc:date>2016-06-15T15:15:00Z</dc:date>
    </item>
    <item>
      <title>Hi Roger, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096101#M23572</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;Hi Roger,&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;No sure if you know that . Intel provide Python package which have numpy and scripy, and mkl integration. You may refer to&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;&lt;A href="https://software.intel.com/zh-cn/forums/intel-distribution-for-python" target="_blank"&gt;https://software.intel.com/zh-cn/forums/intel-distribution-for-python&lt;/A&gt;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;for the issue about&amp;nbsp;&lt;SPAN style="color: rgb(0, 0, 0); font-family: Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace; font-size: 13.008px; line-height: 14.3088px; background-color: rgb(248, 248, 248);"&gt;scsrmultcsr, &amp;nbsp;i&lt;/SPAN&gt;t’s threaded but for small sizes most of time could be spent on sequential part – &amp;nbsp;I will ask related developer to check this..&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;Best Regards,&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;Ying&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2016 08:11:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096101#M23572</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2016-06-17T08:11:48Z</dc:date>
    </item>
    <item>
      <title>I manually compiled scipy and</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096102#M23573</link>
      <description>&lt;P&gt;I manually compiled scipy and numpy linked with mkl on the machine before knowing that intel provides python package with numpy and scipy.&amp;nbsp;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Scipy's sparse matrix multiplication routine doesn't use mkl, and always allocates new memory for the result.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Does the scipy package in intel's python package have better implementation of sparse matrix multiplication?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I would like to learn more about scsrmultcsr. W&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;hy the routine spends most of the on the sequential part?&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Is the calculation limited by memory bandwidth, implementation of parallelization, the cpu cache, or the data structure? For a sufficiently large problem size, what is the column and row dimension for the A and B sparse matrices input of scsrmultcsr? When does scsrmultcsr decide to use multiple threads?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Does the multiplication between coo matrices or sparse matrices in other formats have better parallization and/or use less memory for the multiplication?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Is there a better way to calculate A dot A.T where dot is matrix product and .T is transpose? Currently, I transpose A first, then calculate ((A.T).T) dot (A.T), which is same as A dot A.T. A in my actual problem has b rows and 750,000 columns. Every 750,000 entries have about 2000 entries of one &amp;nbsp;and the remaining entries are all zeros. The sparsity of the matrix is 748/750. b is 9800 now and the larger the b the better. Each node only has 8G memory. The result is b rows x b columns.&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2016 13:11:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Directly-calling-mkl-from-python-and-try-to-use-more-than-one/m-p/1096102#M23573</guid>
      <dc:creator>Roger_X_</dc:creator>
      <dc:date>2016-06-17T13:11:00Z</dc:date>
    </item>
  </channel>
</rss>

