Solved: Mixed-programming with CUDA C to create DLL for Excel

Sampath_Vanimisetti · ‎11-14-2014

Hello All,

In the past, I have successfully created Fortran DLLs with OpenMP for use with Excel VBA. I would now like to integrate some CUDA C GPU code, and I am trying to use the Fortran 2003 C interoperability features to make Intel Fortran talk to CUDA C. I have been able to build an executable that shows the expected behavior. However, when I compile the same code as a DLL and call it from Excel, Excel crashes without warning and gives no diagnostic information whatsoever. If anyone has seen this behavior and found a workaround, I would be glad of any help. My development configuration and test code are as follows.

Thanks in advance,

Sam V

Build setup: Win 6 x64; Microsoft Excel 2010 VBA; Intel Composer XE 2013 IA-32 with Visual Studio 2008; NVIDIA CUDA C v5.5

Example code:

Fortran code (excelcuda.f90)
(uncomment/comment the relevant lines to compile as an executable)

!program main
!implicit none
!real*4::xx(4),yy(4)
!xx=1.D0
!yy=2.D0
!write(*,*) xx, yy
!call myarrtest(xx,yy,4)
!write(*,*) xx, yy
!end program


subroutine myarrtest(arrin,arrout,sz1)

!DEC$ ATTRIBUTES DLLEXPORT,STDCALL,REFERENCE,DECORATE,ALIAS:'myarrtest'::myarrtest
!DEC$ ATTRIBUTES REFERENCE::arrin,arrout,sz1

USE, INTRINSIC :: ISO_C_BINDING
implicit none

INTERFACE
    SUBROUTINE kernel_wrapper (flt_a, flt_b, int_n) BIND(C)
    IMPORT
    INTEGER(C_INT), INTENT(IN) :: int_n
    REAL(C_FLOAT), INTENT(INOUT) :: flt_a(int_n)   ! read and overwritten with flt_a + flt_b
    REAL(C_FLOAT), INTENT(IN) :: flt_b(int_n)
    END SUBROUTINE kernel_wrapper
END INTERFACE

integer*4::i
integer*4,intent(in)::sz1
real*4,dimension(sz1),intent(in)::arrin
real*4,dimension(sz1),intent(inout)::arrout   ! inout: the incoming values are read before being overwritten

!do i=1,sz1
!arrout(i)=arrin(i)+arrout(i)
!end do

CALL kernel_wrapper(arrout, arrin, sz1)

end subroutine

CUDA C kernel (cudakernel.cu)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>


// simple kernel function that adds two vectors
__global__ void vect_add(float *a, float *b, int N)
{
   int idx = threadIdx.x;
   if (idx<N) a[idx] = a[idx] + b[idx];
}

// function called from main fortran program
extern "C" void kernel_wrapper(float *a, float *b, int *Np)
{
   float  *a_d, *b_d;  // declare GPU vector copies
   
   int blocks = 1;     // uses 1 block of
   int N = *Np;        // N threads on GPU

   // Allocate memory on GPU
   cudaMalloc( (void **)&a_d, sizeof(float) * N );
   cudaMalloc( (void **)&b_d, sizeof(float) * N );

   // copy vectors from CPU to GPU
   cudaMemcpy( a_d, a, sizeof(float) * N, cudaMemcpyHostToDevice );
   cudaMemcpy( b_d, b, sizeof(float) * N, cudaMemcpyHostToDevice );

   // call function on GPU
   vect_add<<< blocks, N >>>( a_d, b_d, N);

   // copy the result back from GPU to CPU (b is not modified by the kernel)
   cudaMemcpy( a, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost );

   // free GPU memory
   cudaFree(a_d);
   cudaFree(b_d);
   return;
}
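
One way to separate calling-convention problems from GPU problems is to swap the CUDA wrapper for a CPU-only stand-in with the same C signature. The sketch below is plain C and is not part of the original code; the name kernel_wrapper_cpu is hypothetical, and it assumes the same by-reference size argument that the Fortran interface passes.

```c
#include <stddef.h>

/* Hypothetical CPU-only stand-in for kernel_wrapper: a[i] = a[i] + b[i].
 * Np is a pointer because Fortran passes the size by reference. */
void kernel_wrapper_cpu(float *a, const float *b, const int *Np)
{
    if (a == NULL || b == NULL || Np == NULL)
        return;                 /* nothing sensible to do with null arguments */

    int N = *Np;
    for (int i = 0; i < N; ++i)
        a[i] = a[i] + b[i];
}
```

Renaming it to kernel_wrapper and linking it in place of the CUDA object file should let the same Fortran source run with no CUDA dependency. This is only a diagnostic aid, not a replacement for the GPU code.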

The above pieces of code were compiled using the following commands

nvcc -c -m32 -O3 cudakernel.cu
ifort -dll -libs:dll -iface:stdcall excelcuda.f90 cudakernel.obj cuda.lib cudart.lib

The resulting DLL is used within Excel VBA using the following statements

Declare Sub myarrtest Lib "excelcuda.dll" (ByRef x As Single, ByRef y As Single, ByRef n As Long)
...
...
Call myarrtest(vbarr(1), fortarr(1), n1)
...
...

Steven_L_Intel1 · ‎11-14-2014

You don't want to use -iface stdcall - you have the ATTRIBUTES for the DLL routine and that is sufficient.

You can debug this by specifying Excel as the program to run for your DLL project under Debugging and set a breakpoint at your DLL routine. This may give you a clue as to where the problem occurs. You may also want to create an executable that links to the DLL (specifying STDCALL for the DLL routine) and see how that works.
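
The last suggestion can be sketched as a small test harness. The example below is in C rather than Fortran purely for illustration and is not from the original thread. USE_REAL_DLL, DLL_CALL and run_harness are hypothetical names, and it assumes the DLL's import library is supplied at link time when USE_REAL_DLL is defined. Without that macro, a local stand-in with the same arithmetic is used so the harness can run without the DLL.

```c
/* STDCALL only applies to 32-bit Windows builds; expand to nothing elsewhere. */
#if defined(_WIN32) && !defined(_WIN64)
#  define DLL_CALL __stdcall
#else
#  define DLL_CALL
#endif

#ifdef USE_REAL_DLL
/* Assumption: resolved at link time against the import library of the DLL. */
extern void DLL_CALL myarrtest(float *arrin, float *arrout, int *sz1);
#else
/* Hypothetical stand-in with the same result: arrout = arrout + arrin. */
static void DLL_CALL myarrtest(float *arrin, float *arrout, int *sz1)
{
    for (int i = 0; i < *sz1; ++i)
        arrout[i] += arrin[i];
}
#endif

/* Mirrors the VBA call: every argument is passed by reference. */
int run_harness(float *x, float *y, int n)
{
    myarrtest(x, y, &n);
    return n;
}
```

If the harness works against the real DLL while Excel still crashes, the problem is more likely on the Excel side of the call than in the Fortran or CUDA code.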

Sampath_Vanimisetti1 · ‎11-15-2014

Hello Steve,

Thanks for your prompt response. Removing the -iface:stdcall option fixed the issue. This comes as a bit of a surprise, as I was able to compile a DLL using Fortran+C (without CUDA) with no issue at all. Also, when I compiled an EXE, there was no problem. Excel crashes only when I compile the DLL with -iface:stdcall (in addition to the attribute). Nonetheless, the issue is solved. I am now able to successfully integrate the CUDA routines with Excel, thanks to you.

On a different note, I tried debugging by launching Excel as the executable (devenv /debugexe <path-to-excel.exe> <workbook.xlsm>). However, it appears that one cannot debug the CUDA side without the NVIDIA Nsight plug-in for Visual Studio. But I got the general idea. Thanks for the tip.

Regards,

Sam