
CPU hotspot - TRIM

DavidWhite
Valued Contributor II

I use a tab-delimited text format to store data for an application.  This format is useful since the data can also be opened easily in Excel.

My datasets have around 5000 lines of data, but a varying number of columns, up to 1000.  Consequently, the string variables I need to use are very large.  However, when I write back to the dataset, I use TRIM(STRING) in the write statement, to stop the file size being controlled by the maximum number of columns.
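The write is essentially (a simplified sketch, not my exact code; the names are placeholders):

character(len=31000) :: record   ! sized for the widest possible line
integer :: unit

open(newunit=unit, file='data.txt', status='replace')
record = 'field1' // char(9) // 'field2'   ! tab-delimited content
write(unit, '(a)') trim(record)            ! TRIM keeps the blank tail out of the file
close(unit)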

What I find is that the TRIM call is consuming 15% of the total CPU time for my application (admittedly, this includes rewriting the 5000 lines for each of the 25 cases in the file, so around 125,000 writes).

Is there a faster way of working with this data? 

Thanks,

David

mecej4
Honored Contributor III

If you have kept track of the number of active columns in each row, and all the active columns precede all the inactive columns in each row, you could use a substring expression instead of TRIM. 
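For example (a sketch only, assuming fixed-width fields; lastpos and the other names are illustrative, not from your code):

character(len=31000) :: record
character(len=12)    :: field
real    :: values(1000)
integer :: j, ncols, lastpos, unit

lastpos = 0
do j = 1, ncols
  write(field, '(es12.5)') values(j)
  record(lastpos+1:lastpos+13) = field // char(9)   ! field plus its tab delimiter
  lastpos = lastpos + 13
end do
write(unit, '(a)') record(1:lastpos-1)   ! drop the trailing tab; no TRIM needed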

andrew_4619
Honored Contributor II

When you write trim(string), maybe it makes a temporary copy of the result before writing. Does L = len_trim(string) followed by writing string(1:L) work any faster?

Steven_L_Intel1
Employee

I think Andrew has it right. You could simplify it to string(1:len_trim(string))
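That is:

write(unit, '(a)') trim(string)                 ! may create a temporary copy of the trimmed string
write(unit, '(a)') string(1:len_trim(string))   ! substring: just a backward scan for the length, no temporary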

DavidWhite
Valued Contributor II

The TRIM version reports a hotspot of 2.629 seconds.

Using STRING(1:LEN_TRIM(STRING)) reports a hotspot in LEN_TRIM of 2.007 seconds.

So there is an improvement, but this still represents about 10% of the run time.

Regards,

David

jimdempseyatthecove
Honored Contributor III

If your STRING is much longer than its average contents (the worst-case scenario), and if you know that the expected data has no more than one space (or n spaces) separating non-space text, then you could selectively probe the STRING for two (or n+1) consecutive spaces.

split = LEN(string) / 2
split2 = split / 2
do while(split2 .gt. 0)                      ! binary search for the data/blank transition
  if(string(split:split+1) .eq. '  ') then
    split = split - split2                   ! in the blank tail: probe further left
  else
    split = split + split2                   ! still in data: probe further right
  endif
  split2 = split2 / 2
end do
do while(string(split:split) .eq. ' ')       ! back up over any remaining blanks
  split = split - 1
  if(split .eq. 0) exit
end do
if(split .eq. 0) then
  ! empty line
  ...
else
  ! string(1:split) has data
endif

If your string buffer is quite long and your contents are on average quite short, then the above may yield better performance. (It also requires a known maximum number of interior spaces between tokens.)

Jim Dempsey

IanH
Honored Contributor II

It is probably worthwhile trying to verify that the CPU consumption is actually associated with the TRIM/substring operation, and not the closely associated output operation.
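For example, something along these lines (a rough sketch) would time the length computation alone, with no output statement in the loop:

character(len=31000) :: string
integer :: t0, t1, rate, i, l, total

string = 'sample data'          ! short data in a long, blank-padded buffer
call system_clock(t0, rate)
total = 0
do i = 1, 125000
  l = len_trim(string)
  total = total + l             ! keep the result live so the loop is not optimized away
end do
call system_clock(t1)
print *, 'len_trim alone:', real(t1 - t0) / rate, 'seconds (checksum', total, ')'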
 

DavidWhite
Valued Contributor II

My database is tab-delimited, and I aim to handle up to 999 cases, which results in 1003 output fields per line, with a maximum string length of around 31000 characters.

For the test case, I am only working with 25 cases, so only 29 output fields and a maximum string length of 850 characters. (Since the file can be written from Excel, I have no control over the actual length of each record.)

My code originally defined the string length based on the maximum permitted, but all my testing using VTune Amplifier has been on the same file with the same data.

Ian, further to your comment - the actual data being written in all these cases is identical, so the write and file operations should be identical.  The resultant file sizes and contents are identical.

When STRING is defined as 31000 characters, I get TRIM 2.660 seconds; LEN_TRIM 2.007 seconds.

When STRING is defined as 850 characters, I get TRIM 0.031 seconds; LEN_TRIM 0.031 seconds.

Does TRIM or LEN_TRIM do something different for very large strings?  Is there a length above which their performance degrades?  Or is it simply a case of having to search backwards through some 30000 characters to find the end of the string?

David

 

IanH
Honored Contributor II

Are you assembling an entire output record before you write it to the file, or are you writing each field in turn?

DavidWhite
Valued Contributor II

Ian,

The number of cases in the file is not known until it is opened.  Once opened, any individual case (column) can be read in, processed and then written back to the file.  In my test case, there are 25 cases, which I process sequentially (this involves reading the file and updating it 25 times).

At the file save stage, I open the file, read each record in turn, find the column that needs updating, replace that section of the string (which could be a different length from what was read in), and then write the whole string back out to a temporary copy of the file.  As noted earlier, there are more than 5000 records in a file.
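In outline, the save loop looks something like this (simplified; find_case_field stands in for the column-locating logic and is not the name of a real routine):

open(newunit=inu,  file='data.txt', status='old')
open(newunit=outu, file='data.tmp', status='replace')
do rec = 1, nrecords
  read(inu, '(a)') record
  call find_case_field(record, icase, i1, i2)   ! hypothetical helper: finds the bounds of the field to replace
  record = record(1:i1-1) // newfield(1:newlen) // record(i2+1:len_trim(record))
  write(outu, '(a)') trim(record)               ! this TRIM is the hotspot
end do
close(inu)
close(outu)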

Regards,

David

jimdempseyatthecove
Honored Contributor III

>>tab-delimited

split = YourPragmaBufferLength / 2
split2 = split / 2

do while(split2 .gt. 0)
  if(string(split:split) .eq. ' ') split2 = -split2   ! blank: step left; otherwise step right
  split = split + split2
  split2 = abs(split2 / 2)
end do
if(string(split:split) .eq. ' ') split = split - 1
if(split .eq. 0) then
  ! empty line
  ...
else
  ! string(1:split) has data
endif

Then the search code need only look for a single space (which can only occur on the right-hand side of the string, unless you null-filled the string). Essentially, you are performing a binary search of the string to locate the transition from non-space (data) to space (the blank tail at the right-hand end of the string).

The search loop iterates about log2 of the buffer length: 14 or 15 times for a 31000-character buffer, regardless of the average record length.

Knowing the buffer size in advance, you could unroll the loop.

Jim Dempsey

Roman1
New Contributor I

Have you tried using the SCAN function instead of TRIM or LEN_TRIM?

ii = SCAN( buffer, ' ')
write(*,*)  buffer(1:ii-1)

I could be wrong on this, but SCAN searches forward instead of backward.  So, if the buffer is very large compared with the actual data length, it might be faster.

 

jimdempseyatthecove
Honored Contributor III

Roman,

Thanks for posting an alternative.

For relatively short strings, SCAN would be faster than the binary search; at some size, the binary search will take over in performance. David would have to test both methods on his typical data.

Jim Dempsey

DavidWhite
Valued Contributor II

Thanks for your suggestions.

Since the file is tab-delimited, I can readily determine the number of cases by counting the number of tabs in the first record of the file.

I am then able to determine a reasonable maximum string length based on the number of actual cases, rather than the maximum possible number of cases.  This produced an 80-fold increase in speed for me (twice as much as expected, since the string length was reduced by a factor of around 40).  Trying anything more elaborate is unlikely to improve the efficiency much more.
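The sizing step amounts to something like this (a sketch; max_field_width is just an illustrative per-field allowance):

character(len=31000) :: line
integer, parameter :: max_field_width = 32
integer :: i, ntabs, reclen

read(unit, '(a)') line   ! first record of the file
ntabs = 0
do i = 1, len_trim(line)
  if (line(i:i) == char(9)) ntabs = ntabs + 1
end do
reclen = (ntabs + 1) * max_field_width
rewind(unit)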

Incidentally, I always use INDEX rather than SCAN - how do these compare?  Is one preferred over the other?

Thanks,

David

andrew_4619
Honored Contributor II

Sounds like you have a reasonable solution. Just an additional thought: if you can calculate the maximum string length, then maybe you could also estimate the minimum length, as you only need to look for the string end between the two. That might be faster still, or maybe not....

ilen = iminl + index(string(iminl:imaxl), ' ') - 2   ! -2 so ilen is the last data character, just before the space

 

jimdempseyatthecove
Honored Contributor III

Andrew,

You do not need imaxl; you can run to the end of the string, as INDEX will stop at the first ' '.
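(For a single-character search string, INDEX and SCAN behave the same. They differ when the second argument is longer than one character: INDEX looks for it as a whole substring, SCAN for any one of its characters. A quick illustration:)

print *, index('a,b;c', ',;')   ! prints 0 : the substring ',;' never occurs
print *, scan('a,b;c', ',;')    ! prints 2 : position of the first ',' or ';'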

David, you did not state in your description whether, for each file, each line has the same number of tokens (cases).

minimumLineSize = (minimumTokenSize + 1) * numberOfTokens ! Andrew's iminl

Also, because you are processing each line and identifying the token positions in each, you already know the position of the last token in each line you process. That position can be the starting point for the search for ' '.

Jim Dempsey
