I use a tab-delimited text format to store data for an application. This format is useful since the data can also be opened easily in Excel.
My datasets have around 5000 lines of data, but a varying number of columns, up to 1000. Consequently, the string variables I need to use are very large. However, when I write back to the dataset, I use TRIM(STRING) in the write statement, to stop the file size being controlled by the maximum number of columns.
What I find is that TRIM statement is consuming 15% of the total CPU time for my application (admittedly, this includes rewriting the 5000 lines for each of the 25 cases in the file, so around 125,000 writes.
Is there a faster way of working with this data?
If you have kept track of the number of active columns in each row, and all the active columns precede all the inactive columns in each row, you could use a substring expression instead of TRIM.
The TRIM version reports 2.629 seconds
Using STRING(1:LEN_TRIM(STRING)) reports a hotspot in LEN_TRIM of 2.007 seconds.
So there is an improvement, but this still represents about 10% of the run time.
If your STRING is much longer than your average contents (for worst case scenario), and if you know that the "expected data" has no more than one space (or n spaces) separating non-space text, Then you could selectively probe the STRING for two (or n+1) spaces.
split = LEN(string) / 2 split2 = split / 2 do while(split2 .gt 0) if(string(split:split+1) .eq. ' ') then split = split - split2 else split = split + split2 endif split2 = split2 / 2 end do do while(string(split:split) .eq. ' ') split = split - 1 if(split .eq. 0) exit end do if(split .eq. 0) then ! empty line ... else ! string(1:split) has data endif
If your string buffer is quite long, and your contents on average is quite short. then the above may yield better performance. (also requires known number of interior spaces between tokens).
My database is tab-delimited, and I aim to handle up to 999 cases, which results in 1003 output fields per line, with a maximum string length of around 31000 char.
For the test case, I am only working with 25 cases, so only 29 output fields, and a maximum string length of 850 char. (Since the file can be written from Excel, I have no control over the actual length of each record).
My code originally defined the string length based on the maximum permitted, but all my testing using VTune Amplifier has been on the same file with the same data.
Ian, further to your comment - the actual data being written in all these cases is identical, so the write and file operations should be identical. The resultant file sizes and contents are identical.
When STRING is defined as 31000 characters, I get TRIM 2.660 seconds; LEN_TRIM 2.007 seconds
When STRING is defined as 850 characters, I get TRIM 0.031 seconds; LEN_TRIM 0.031 seconds.
Does TRIM or LEN_TRIM do something different for very large strings? Is there a length above which their performance degrades? Or is it simply a case of having to search backwards through some 30000 characters to find the end of the string?
The number of cases in the file is not known until it is opened. Once opened, any individual case(column) can be read in, processed and then written back to the file. In my test case, there are 25 cases, which I sequentially process (this involves reading the file, and updating it 25 times).
At the file save stage, I open the file, read each record in turn, find the column that needs updating, replace that section of the string (which could be a different length to what was read in), and then write the whole string back out to a temporary copy of the file. As noted earlier, there are more than 5000 records in a file.
split = YourPragmaBufferLength / 2 split2 = split / 2 do while(split2 .gt 0) if(string(split:split) .eq. ' ') split2 = -split2 split = split + split2 split2 = abs(split2 / 2) end do if(string(split:split) .eq. ' ') split = split - 1 if(split .eq. 0) then ! empty line ... else ! string(1:split) has data endif
Then the split code need only look for single space (which would reside on the rhs of the string, unless you null-filled the string).Essentially you are performing a binary search of the string to locate the transition from non-space (data) and space (end of data inside rhs of string).
Assume for example that the average record length was 500 bytes. The search loop would iterate 14 or 15 times.
Knowing the buffer size in advance, you could unroll the loop
Have you tried using a function SCAN instead of TRIM or LEN_TRIM?
ii = SCAN( buffer, ' ') write(*,*) buffer(1:ii-1)
I could be wrong on this, but SCAN would search forward instead of backward. So, if the buffer is very large compared with string size, it might be faster.
Thanks for posting an alternative.
For relatively short strings, scan would be faster than the binary search. At some point in size, the binary search will take over in performance. David would have to test both methods on his typical data.
Thanks for your suggestions.
Since the file is tab-delimited, I can readily determine the number of cases by counting the number of tabs in the first record of the file.
I am then able to determine a reasonable maximum string length based on the number of actual cases, rather than the maximum possible number of cases. This produced for me an 80 times increase in speed (which is twice as much as expected, since the string length was reduced by a factor of around 40). Trying anything more elaborate is unlikely to improve the efficiency much more.
Incidentally, I always use INDEX rather than SCAN - how do these compare? Is one preferred over the other.
Sound like you have reasonable solution. Just an additional thought, if you can calculate the max string length then maybe you could also estimate the minimum length as you only need to look for the string end between the two which might be faster still, or maybe not....
ilen = iminl + index(string(iminl:imaxl),' ') - 1
You do not need an imaxl, you can run to the end of the string as the index will stop on the first ' '.
David, you did not state in your description that for each file, that each line has the same number of tokens (cases).
minimumLineSize = (minimumTokenSize + 1) * numberOfTokens ! Andrews iminl
Also, because you are processing each line, and identifying token positions of each line, you also know the position of the last token (of each line processed). This can be the start search point for the ' ' (of each line processed).