Hanxi_F_
Beginner

Some questions about the time cost of converting files into a Tensor

Hi,

Recently I measured the time cost of converting a CSV file into a Tensor.

I looked at the implementation of readTensorFromCSV() and rewrote it as readTensorFromString().

daal::services::SharedPtr<Tensor> readTensorFromString(const string& dataset)
{
	byte *data = (byte *)dataset.c_str();

	StringDataSource<CSVFeatureManager> dataSource(data, DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext);
	dataSource.loadDataBlock();

	daal::services::SharedPtr<HomogenNumericTable<double> > ntPtr =
		daal::services::staticPointerCast<HomogenNumericTable<double>, NumericTable>(dataSource.getNumericTable());

	daal::services::Collection<size_t> dims;
	dims.push_back(ntPtr->getNumberOfRows());
	size_t size = dims[0];
	if (ntPtr->getNumberOfColumns() > 1)
	{
		dims.push_back(ntPtr->getNumberOfColumns());
		size *= dims[1];
	}

	HomogenTensor<float> *tensor = new HomogenTensor<float>(dims, Tensor::doAllocate);
	float *tensorData = tensor->getArray();
	double *ntData = ntPtr->getArray();

	for (size_t i = 0; i < size; i++)
	{
		tensorData[i] = (float)ntData[i];
	}

	daal::services::SharedPtr<Tensor> tensorPtr(tensor);

	return tensorPtr;
}

I used this in your neural_net_dense_batch example and measured the time spent on each step.

void trainModel()
{
    /* Read training data set from a .csv file and create a tensor to store input data */
	clock_t time1 = clock();
    TensorPtr trainingData1 = readTensorFromCSV(trainDatasetFile);
	clock_t time2 = clock();
	TensorPtr trainingGroundTruth1 = readTensorFromCSV(trainGroundTruthFile);
	clock_t time3 = clock();

	ifstream t1(trainDatasetFile);
	stringstream buffer1;
	buffer1 << t1.rdbuf();

	ifstream t2(trainDatasetFile);
	stringstream buffer2;
	buffer2 << t2.rdbuf();

	string string1 = buffer1.str();
	string string2 = buffer2.str();

	clock_t time4 = clock();
	TensorPtr trainingData2 = readTensorFromString(string1);
	clock_t time5 = clock();
	TensorPtr trainingGroundTruth2 = readTensorFromString(string2);
	clock_t time6 = clock();

	cout << "from csv read train dataset: " << time2 - time1 << endl;
	cout << "from csv read train ground truth: " << time3 - time2 << endl;
	cout << "from byte array read train dataset: " << time5 - time4 << endl;
	cout << "from byte array read train ground truth: " << time6 - time5 << endl;
}

The results (in clock() ticks) are below:

from csv read train dataset: 57
from csv read train ground truth: 23
from byte array read train dataset: 80
from byte array read train ground truth: 87
 

I wonder if there's a way to improve the performance of reading data from a byte array.

Thanks a lot.

7 Replies
Ilya_B_Intel
Employee

Hi Hanxi,

We added optimizations for StringDataSource in our latest release, Intel(R) DAAL 2017 Update 1, so you can try that version. With those optimizations, your readTensorFromString() function should become faster.

We are considering adding dedicated DataSources for Tensor data in the future. Do you have feedback on what kind of raw data format might be interesting to you?

Hanxi_F_
Beginner

Hi Ilya,

I've already updated the DAAL library to Update 1, and the timing results are posted above.

Here's my question:

I have a 2D array holding the input dataset, weights, or biases. Is StringDataSource the most suitable way to load it in this situation? Or could you give me some advice on how to convert a 2D array into a Tensor?

Thanks a lot!

Ilya_B_Intel
Employee

Hanxi,

Thank you, we will look for further opportunities to optimize StringDataSource in future releases.

Meanwhile, I would recommend focusing on the format your data is in initially. Try to make as few conversions as possible.

If you have a choice, I would really recommend avoiding storing binary floating-point data in string format, because conversion between decimal strings and the binary representation is an expensive operation. If you have the opportunity to store data in binary format, that might be the fastest way to load it. You may lose a bit of portability, but in most cases that will not be a problem.

If you can load data in binary format, you can construct a Tensor the same way it is shown in this example:
examples/cpp/source/datasource/datastructures_homogentensor.cpp
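
A minimal sketch of the binary route, assuming the data is a dense row-major matrix of floats stored as raw bytes with no header and in the machine's native byte order. The helper names saveFloatsBinary and loadFloatsBinary are hypothetical and not part of DAAL. The DAAL call that wraps the loaded buffer in a tensor is deliberately left out; take it from the example file named above.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a DAAL API): writes rows*cols floats as raw bytes,
// row-major, native byte order, no header.
void saveFloatsBinary(const std::string &path, const std::vector<float> &values,
                      std::size_t rows, std::size_t cols)
{
    assert(values.size() == rows * cols);
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path + " for writing");
    out.write(reinterpret_cast<const char *>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(float)));
    if (!out) throw std::runtime_error("write failed: " + path);
}

// Hypothetical helper (not a DAAL API): reads exactly rows*cols floats into a
// contiguous row-major buffer. No text parsing happens; the bytes are copied as-is.
std::vector<float> loadFloatsBinary(const std::string &path,
                                    std::size_t rows, std::size_t cols)
{
    std::vector<float> values(rows * cols);
    const std::size_t wantedBytes = values.size() * sizeof(float);
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path + " for reading");
    in.read(reinterpret_cast<char *>(values.data()),
            static_cast<std::streamsize>(wantedBytes));
    if (static_cast<std::size_t>(in.gcount()) != wantedBytes)
        throw std::runtime_error("file too short: " + path);
    return values;
}
```

The file has to be produced once, for example by converting the CSV offline. After that, each load is a single bulk read with no per-value decimal parsing, which is the conversion cost described above.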

Hanxi_F_
Beginner

Ilya,

Thanks for your advice! I'm working on it, and I'll give you feedback on the performance once I've finished.

123 Views

Hi Hanxi,

We built an example based on your code and our dataset, and got the following results:

from csv read train dataset: 16.14 ms
from csv read train ground truth: 1.79 ms
from byte array read train dataset: 13.65 ms
from byte array read train ground truth: 1.69 ms

We used gettimeofday instead of clock to get more precise results.

Also, your example reads trainDatasetFile twice through the string data source (instead of reading trainDatasetFile and trainGroundTruthFile).

Can you share more details on OS, HW (CPU and HD/SSD) used in your measurements?
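
One way to avoid this kind of copy-paste slip is to factor the file-to-string step of the original trainModel() into a small helper, so that each file name appears exactly once. The sketch below assumes only the standard library; readFileToString is a hypothetical name, not a DAAL function.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a DAAL API): returns the full contents of `path`.
// Opens the file in text mode, like the ifstream calls in the original post.
std::string readFileToString(const std::string &path)
{
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Intended use inside trainModel(), with the file names from the original post:
//   string string1 = readFileToString(trainDatasetFile);
//   string string2 = readFileToString(trainGroundTruthFile);
```

Throwing on a missing file also means a mistyped path fails loudly, rather than silently producing an empty string that is then passed to the data source.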

Hanxi_F_
Beginner

Hi Vladislav,

Thanks for your comment.

Here are the details of my running environment:

  • OS:        Windows 7 Enterprise 64bit
  • CPU:      Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  • RAM:     16.0GB
  • Storage: SATA 6 Gb/s ST31000524AS

Since my OS is Windows 7, there is no gettimeofday() function, so I found an implementation of it. Here are my results:

from csv read train dataset: 53 ms
from csv read train ground truth: 24 ms
from byte array read train dataset: 63 ms
from byte array read train ground truth: 28 ms
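
For timing that behaves the same on Windows and Linux, std::chrono::steady_clock from C++11 is a possible alternative to both clock() and gettimeofday(). gettimeofday() is POSIX-only, and clock() measures CPU time on POSIX systems but elapsed wall-clock time in Microsoft's C runtime. The following is a sketch only; averageMs is a hypothetical helper name, and averaging over several runs is an added assumption intended to smooth out file-cache and scheduling noise.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `repeats` times and returns the mean
// wall-clock time per run in milliseconds, using a monotonic clock.
template <typename Work>
double averageMs(Work work, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}

// Intended use with the functions from the original post:
//   double ms = averageMs([&] { readTensorFromString(string1); }, 10);
```

Because the first read of a file is often slower than later reads once the OS has cached it, comparing the CSV path and the string path is fairer when both are measured after a warm-up run.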

 

Gennady_F_Intel
Moderator

Hanxi, thanks for the update. We have already investigated the cause of this performance issue and are planning to provide a fix in the next update of DAAL. We will keep you updated on the status.  --Gennady
