Intel® oneAPI Data Analytics Library
Learn from community members how to build compute-intensive applications that run efficiently on Intel® architecture.

Questions about PCA size limit.

Hanxi_F_
Beginner

Hi,

I wrote a program that generates files of random numbers in (0, 1), and used it to create three files.

The sizes of these files are: 5000 * 1000, 10000 * 3000, 10000 * 5000.

I ran the two PCA examples with these files.

I used cout << eigenvectors->getNumberOfRows() << " " << eigenvectors->getNumberOfColumns() << endl; to check whether the size of the eigenvectors table is correct.

The first two files produce eigenvectors normally, but both example programs crash when running PCA on the 10000 * 5000 file.

Here's the error message.

Unhandled exception at 0x7588C54F in pca_cor_dense_batch.exe: Microsoft C++ exception: daal::services::interface1::Exception at memory location 0x0035F8E8.

I wonder if there is a limit on the number of features. What are the limits on the numbers of samples and features?
Or do I need to use distributed computation instead of batch computation for large files?

I really appreciate your help.

9 Replies
Ying_H_Intel
Employee

Hi Hanxi,

I'm afraid there is a 2 GB limitation, either on a memory buffer or on the stack. What OS are you on, is your program 32-bit or Intel 64-bit, and on which source line does the exception happen?

Best Regards,
Ying

Hanxi_F_
Beginner

Thanks. The file is about 336 MB.

My OS is Windows 7 64-bit and I'm using 32-bit library. 

Exception happened here:  algorithm.compute();

Ying_H_Intel
Employee

Then how about using the x64 platform and the 64-bit library?

Best Regards,

Ying

 

Hanxi_F_
Beginner

Wow, I've tested again with 10000 * 5000, 20000 * 5000, and 10000 * 10000;
all of them work just fine with the 64-bit library.

Could you tell me the reason? Thanks.

Ying_H_Intel
Employee

Hi Hanxi

Thanks for the reply. I just tried pca_svd_dense_batch.cpp with a 10000 x 5000 data set on a Linux machine; both ia32 and intel64 run fine.

I don't have a 32-bit install under Windows; I only tried x64, which runs fine there too. So the problem is specific to 32-bit on Windows. We need to check with the developer team.

Do you build in the MSVC environment? You may try increasing the Heap Reserve Size and Stack Reserve Size under Configuration Properties > Linker > System to see whether that works around the problem.
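For reference, the same IDE settings correspond to link.exe command-line options; a sketch with placeholder sizes in bytes, not recommended values:

```
link.exe <objects...> /STACK:16777216 /HEAP:268435456
```

Large-address-aware builds, discussed later in this thread, add /LARGEADDRESSAWARE to the same line.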

Best Regards,

Ying

 

 

Ying_H_Intel
Employee
Hi Hanxi,

Please see the answer from our developer: in the case of 32-bit applications on Windows, the default limit on the process address space is 2 GB, while a 32-bit application on Linux has a 4 GB limit for user space. This can explain why the problem occurs on Windows only. There are options for enabling a 4 GB address space for a 32-bit application on Windows; please refer to https://msdn.microsoft.com/en-us/library/windows/desktop/aa366778(v=vs.85).aspx#memory_limits

Kind Regards,

Ying
Hanxi_F_
Beginner

Hi Ying,

I've followed the steps from your comment. I use Visual Studio 2015, and here are my steps:
Configuration Properties -> Linker -> System -> Enable Large Addresses (Yes)

But the error message is still the same. 
Unhandled exception at 0x76E1C54F in pca_cor_dense_batch.exe: Microsoft C++ exception: daal::services::interface1::Exception at memory location 0x003BF790.

Thanks for your help.

Ying_H_Intel
Employee

Hi Hanxi,

I found a Win7 64-bit machine (2 cores, 4 HT) and installed 32-bit DAAL there. I tested both PCA_Cor and PCA_SVD using the Win32 Debug.dynamic.threaded configuration. As the total memory used by the program is less than 2 GB (about 3.1x the data set), the code runs even without the large-address-aware option.

With Debug.static.threaded, the total memory used by PCA_SVD goes beyond 2 GB (about 4x the 0.55 GB data set), so the large-address-aware option must be switched on. Both of them then run OK on this machine. I attached a screenshot for your reference. Could you check how many threads and how much memory the task uses in Task Manager?

The test code (only the lines changed from the shipped example) is as below:

/* Input data set parameters */
//const string dataFileName = "../data/batch/pca_normalized.csv";
const string dataFileName = "pca_dataset_5000_10000_0.csv";
const size_t nVectors = 10000;

services::SharedPtr<pca::Result> result = algorithm.getResult();
printNumericTable(result->get(pca::eigenvalues), "Eigenvalues:", 1, 1);
// printNumericTable(result->get(pca::eigenvectors), "Eigenvectors:");

return 0;

As you can see, the memory an application can use on Win32 is limited by the address space the Windows platform allows. So yes, for large files please consider distributed computation.

Best Regards

Ying

Hanxi_F_
Beginner

Hi Ying,

Thanks a lot.

Now I understand the problem.
