Evgeniy_i_
Beginner
50 Views

Parallel run of LinRegQRDenseBatch performance

Hello,

I'm experimenting with the Intel DAAL library, specifically with the LinRegQRDenseBatch example (https://github.com/01org/daal/blob/daal_2017_update1/examples/java/com/intel/daal/examples/linear_regression/LinRegQRDenseBatch.java ) 

The only difference is that I'm using a bigger file for input. 

While running this example I see that only one core of my PC is utilized, and linearRegressionTrain.compute() takes ~4 seconds for my input data. My expectation was that if I started two such processes (independent Linux processes), two cores would be utilized and the time would not be affected much, i.e. it would remain approximately the same 4 seconds, or increase slightly. But the time actually increased to 7.5 seconds, almost twice that of a one-process run, even though only two cores (out of eight) are utilized according to the Linux htop command.

I thought it might be connected with some shared state between the processes, such as a lock or semaphore, so I wrapped the processes in Docker containers for complete isolation. But the result remains the same: 7.5 seconds for two processes, even when they run in separate Docker containers.

Since I have a multicore server (Intel Core i7-4770) and my use case involves running multiple independent trainings, I thought it reasonable to run them using the batch version of the algorithm in multiple processes. However, they affect each other's performance significantly.

Could you please offer some guesses as to what this behavior is connected with? That is, how can two processes running on two cores in two independent Docker containers affect each other? Maybe there are some settings I'm missing?

Regards,

Evgenii Ismailov

Andrey_N_Intel
Employee

Hi Evgenii, 

Can you please provide additional details on the experiment you ran, including the OS and the input matrix dimensions/number of dependent variables? If you could share the dataset, it would help us reproduce and analyze this behavior.

Thanks,

Andrey

 

 

Evgeniy_i_
Beginner

Hi Andrey,

Thank you for your reply.

My OS is Ubuntu 16.04. The PC has 32 GB of memory (4x8 GB) and an Intel Core i7-4770. 

The input file is the same as the one used in the original LinRegQRDenseBatch example, but replicated multiple times:

https://raw.githubusercontent.com/01org/daal/daal_2017_update1/examples/data/batch/linear_regression...

Command to replicate (bash): $ for i in {1..10000}; do cat linear_regression_train.csv >> lr_big.csv; done

The resulting lr_big.csv is 1.2 GB.
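As a sanity check before committing to the full 10000x replication, the same approach can be tried at small scale (file names below are illustrative stand-ins, not the actual dataset):

```shell
# Hedged sketch: replicate a tiny CSV and verify the copy count.
# sample.csv stands in for linear_regression_train.csv.
printf '1,2\n3,4\n' > sample.csv            # tiny stand-in CSV (2 lines)
: > sample_big.csv                          # start from an empty output file
for i in {1..100}; do cat sample.csv >> sample_big.csv; done
wc -l < sample_big.csv                      # 2 lines x 100 copies = 200
```

Note that `>>` appends, so re-running the loop without clearing the output file first would double the result.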

Attaching a slightly modified version of the LinReg example (with the input changed to lr_big and timings added).

530983

Steps to reproduce:

1. Run a single LinRegQRDenseBatch process. Time per train should be X ms (depending on your machine). Only one core is utilized while running (see the htop or top command on Linux).

2. Run two processes at the same time. Time per train for each process becomes ~2X ms. Two cores are utilized.
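The two steps above can be timed generically with a shell sketch; the `workload` function below is a placeholder for however you launch the Java example, not the actual command line:

```shell
# Hedged sketch: compare wall time of one run vs. two concurrent runs.
# 'workload' is a placeholder for starting the LinRegQRDenseBatch process;
# sleep is used here only so the sketch is self-contained.
workload() { sleep 0.2; }

start=$(date +%s%N)
workload
t1=$(( ( $(date +%s%N) - start ) / 1000000 ))   # single-run wall time, ms

start=$(date +%s%N)
workload & workload &
wait                                            # wait for both background runs
t2=$(( ( $(date +%s%N) - start ) / 1000000 ))   # two concurrent runs, ms

echo "one process: ${t1} ms, two concurrent: ${t2} ms"
```

If the two-concurrent time is close to double the single-run time for supposedly independent processes, they are contending for something: cores, memory bandwidth, or a shared lock.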


Andrey_N_Intel
Employee

Thank you, Evgeniy, for the test case and the detailed instructions! We will have a look. Andrey

Daria_K_Intel
Employee

Hi Evgeniy,

I used your dataset in our environment but failed to reproduce the behavior you described: all cores on the server used for the experiment are fully utilized by the test case. The HW & SW I used: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (4 cores), Ubuntu 16.04.1, Java 1.8.0, Intel DAAL 2017 Update 1.

The steps I took to run the test case are below:

cd <dir>/daal/examples/java

Leave only one example in daal.lst, as follows:

Java_example_list=( linear_regression/LinRegQRDenseBatch )

source ./<dir>/daal/bin/daalvars.sh intel64

./launcher.sh intel64 ./<path_to_javac>

I noticed that LinRegQRDenseBatch example loads your dataset for ~12 seconds and utilizes only one core at this stage.

When the example trains the model, the ‘top’ utility shows ~400% of CPU utilization. Run time of the linearRegressionTrain.compute() method is ~0.94 seconds.

The attached screenshot contains the output of the example. I also ran two processes, each training a linear regression model. The training time is 1.9 seconds for each process, which is expected.

To help us reproduce the behavior you observe in your environment, can you please provide additional details, such as how you set up the environment, how you build and run the example, the Java version you use, whether hyper-threading is enabled, and any other useful details?

-Daria

Evgeniy_i_
Beginner

Hi Daria,

Thank you very much for your answer!

It seems the problem was the library version I used. It was l_daal_oss_p_2017.0.007, which I downloaded the last time I tried DAAL (June 2016).

Updating to the latest version indeed solved the problem. Now all cores are 100% busy, and computation time improved from 4.2 seconds (previous version) to 600 ms. 

I've also checked that when setting Environment.setNumberOfThreads(1) and running two processes in parallel, they do not affect each other's performance. In this case two cores are used (as expected) and the running time is 3.2 seconds, the same as when running only one process.

Regards,

Evgeniy