reading and buffering a text file efficiently

andrew_4619 · ‎11-16-2023

I had a bug recently in some of my code that was reading STEP CAD exchange files. STEP is a structured text file and we read the file line by line appending the lines to a large allocatable text buffer for later processing.

This has worked just fine but we recently discovered that the line buffer size was not big enough and that indeed in some (not that common) instances lines in a step file can be many thousands of characters long.

Attempt 2 was to open as stream access file and read in unformatted chunks e.g. 20K bytes. This does not work as the last chunk is usually a partial chunk which generates an EOF error so that last chunk is not read and the file is now at EOF so the only recovery option would be to start again but do it differently the second time.

The third and so far successful attempt gets the file size, works out how many whole chunks that is, reads (as stream) that number of chunks and buffers them, and then reads the rest of the file one byte at a time until EOF is reached.

My question really is whether there is any neater way to do it, and maybe read the whole file into a buffer in one hit. There was a thread back in 2015 that discussed a similar topic but it doesn't resolve this issue. I suspect someone here has an "oven ready" definitive answer to this question!
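For reference, a minimal sketch of the "one hit" read asked about here. It assumes that INQUIRE reports the exact size in bytes, which is the very point in doubt in this thread, so the read status is checked. The file name and the test data are made up for the example.

```fortran
! slurp.f90 - read a whole file into one character buffer with a single READ
program slurp
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    character(len=*), parameter :: fname = "slurp_test.txt"   ! hypothetical file name
    character(len=:), allocatable :: buffer
    integer(int64) :: nbytes
    integer :: u, ios

    ! Create a small test file so the example is self-contained
    open(newunit=u, file=fname, access="stream", form="unformatted", &
         action="write", status="replace")
    write(u) "#10=ORGANIZATION('O0001','LKSoft','company');" // new_line("a")
    write(u) "#12=APPLICATION_CONTEXT('mechanical design');" // new_line("a")
    close(u)

    ! Ask for the size, allocate exactly that much, then read once.
    ! Assumption: the reported size is the true byte count of the file.
    inquire(file=fname, size=nbytes)
    if (nbytes < 0) error stop "file size could not be determined"
    allocate(character(len=nbytes) :: buffer)

    open(newunit=u, file=fname, access="stream", form="unformatted", &
         action="read", status="old")
    read(u, iostat=ios) buffer
    close(u)
    if (ios /= 0) error stop "read failed; reported size may not match the data"

    print *, "bytes read: ", len(buffer)
end program slurp
```

If the reported size is larger than the data, the single READ fails with an end-of-file status and nothing useful is left in the buffer, so a fallback is still needed in that case.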

JohnNichols · ‎11-16-2023

No matter how big you make a buffer, someone will make a larger file. That is why I have VEDIT, as often it is the only program that will open these large files. I use DXF and STRAND formats; STEP is not one I have used.

With these large files I would set up a MySQL database and push the data into that structure, so your data is not part of your program and when you need it you read it back. All you need is an array to say ITEM1 went to LINE X in the database. You want a small program working on a large single data structure.

Or you do the work on the line of data, use it and then discard it, the STEP stuff will likely be discrete units of information.

Your only problem is that Fortran is lousy with MySQL and that code is better in C#.

andrew_4619 · ‎11-16-2023

Buffer size and file size are not the problem. Handling the step file is not a problem. If it is too big to handle don't do it or get a better computer! Given in most CAD the exported step can be limited to entities of interest that is all under user control if the hardware is not man enough. Once the required data is extracted from step the buffer is discarded.

I am specifically interested in the narrow topic of my question.

On another note I find DXF quite awful to work with and haven't worked on that for a long time. It is very limited in what it supports, documentation was always poor to non-existent, and those AutoCAD chaps have a bad habit of changing things from AutoCAD version to version, thus breaking your code. STEP (and IGES) are proper standards and are fully documented.

JohnNichols · ‎11-16-2023

"we read the file line by line appending the lines to a large allocatable text buffer for later processing."

I would not do this step, you have to pull the buffer apart, a wasted step and a pain in the neck.

Whatever format you use for the geometric data, it all comes down to the same geometric stuff invented by the Greeks and other properties. I agree DXF is a beast, but in 1988 when I was designing gravity pipelines, DXF was well documented and simple, VEDIT could take it apart and my drafting people understood AUTOCAD. The last decent AUTOCAD dates from 1995 and the first Windows version; after that it became a program designed to make money, but up until the end of the books it was good.

Plus the first client asked for such a program and I had a weekend to write it. I was just looking for it, but it is a bit hard to find in all the dross.

Use an allocated data structure to hold each line, actually take each line apart and just store the data and not all the dividers.

But I would avoid the buffer.

andrew_4619 · 11-16-2023

"I would not do this step, you have to pull the buffer apart, a wasted step and a pain in the neck." Then you would not be processing an IGES or STEP file, which are hierarchical. You parse a high level entity which references lower level entities, you locate and parse the lower level entities which reference more lower entities etc. This can be several levels deep and the form of the entities can depend on top level things such as the SCHEMA type. If you worked with files you would be scanning the file many many times as the entities can occur in any order. I recall years back some quite top end CAD systems taking forever to import data for this reason. By far the best way for me is to read once and buffer it. You can then search to find a thing you are interested in, parse it into a data structure, then find and recursively parse referenced entities until the entity is fully defined.

DXF is very simplistic: it doesn't support many entity types, and if for example you want an entity such as a polyline you will find the polyline as a single block of data between start and end tokens. It is geometrically fully defined by what lies between these tokens. That is why you can extract something meaningful on a single pass of the file and don't need to buffer it. Quite a different beast, but also why as a format it is rubbish for higher level geometric entities. It was designed for a 2D drafting package.
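The "read once, buffer, then search and follow references" approach described above can be sketched roughly as follows. This is only an illustration, not the code used in the real application: it assumes the buffer holds records with no white space between them, that ';' never occurs inside a quoted string, and that entity ids are small enough to index a fixed array (max_id is a made-up limit). The sample records come from the Wikipedia STEP example quoted later in the thread.

```fortran
program step_index
    implicit none
    integer, parameter :: max_id = 100        ! assumed upper bound on entity ids
    character(len=:), allocatable :: buf
    integer :: first(max_id), last(max_id)    ! record extent per id, 0 = not present

    ! Entities deliberately out of order, as a STEP file allows
    buf = "#14=PRODUCT_DEFINITION('0',$,#15,#11);" // &
          "#11=PRODUCT_DEFINITION_CONTEXT('part definition',#12,'manufacturing');" // &
          "#12=APPLICATION_CONTEXT('mechanical design');" // &
          "#15=PRODUCT_DEFINITION_FORMATION('1',$,#16);" // &
          "#16=PRODUCT('A0001','Test Part 1','',(#18));" // &
          "#18=PRODUCT_CONTEXT('',#12,'');"
    call build_index()
    call show(14, 0)

contains

    ! One pass over the buffer: remember where each "#n=...;" record lives
    subroutine build_index()
        integer :: p, e, q, id, ios
        first = 0
        last = 0
        p = 1
        do while (p <= len(buf))
            e = index(buf(p:), ";")
            if (e == 0) exit
            e = p + e - 1                      ! absolute position of the ';'
            q = index(buf(p:e), "=")           ! offset of '=' within the record
            if (buf(p:p) == "#" .and. q > 2) then
                read(buf(p+1:p+q-2), *, iostat=ios) id
                if (ios == 0 .and. id >= 1 .and. id <= max_id) then
                    first(id) = p
                    last(id) = e
                end if
            end if
            p = e + 1
        end do
    end subroutine build_index

    ! Print an entity, then recursively print every entity it references
    recursive subroutine show(id, depth)
        integer, intent(in) :: id, depth
        integer :: i, j, ref
        logical :: in_string
        if (depth > 20) return                 ! guard against runaway recursion
        if (id < 1 .or. id > max_id) return
        if (first(id) == 0) return
        print "(a,a)", repeat("  ", depth), buf(first(id):last(id))
        in_string = .false.
        i = first(id) + 1                      ! skip the entity's own leading '#'
        do while (i <= last(id))
            if (buf(i:i) == "'") then
                in_string = .not. in_string
            else if (buf(i:i) == "#" .and. .not. in_string) then
                j = i + 1
                do while (j <= last(id))       ! collect the digits after '#'
                    if (verify(buf(j:j), "0123456789") /= 0) exit
                    j = j + 1
                end do
                if (j > i + 1) then
                    read(buf(i+1:j-1), *) ref
                    call show(ref, depth + 1)
                end if
                i = j - 1
            end if
            i = i + 1
        end do
    end subroutine show

end program step_index
```

The file is scanned once to build the index; after that each reference is a direct lookup into the buffer rather than another pass over the file.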

JohnNichols · ‎11-16-2023

Possibly the only advantage of STEP files is that they are widely adopted in many CAD software. On the other hand, its format, and specially the EXPRESS data modelling language has a few disadvantages:

the specification is not freely available (you have to pay for it)
it is not possible to sequentially read a STEP file. Entities can be in any order and can reference other entities forwards and backwards in the file (see entity #14 in the example above). Therefore the entire file has to be read into memory and tokenized before parsing.
the format is not storage-efficient. For example assigning an RGB color code to an edge requires at least 6 other entities, and specifying a transformation requires at least 5 additional entities (PLANE, AXIS2_PLACEMENT_3D, a CARTESIAN_POINT, and 2 DIRECTION entities)
the format is not well-defined. For example the same triangle can be encoded in a STEP file in many different ways (with FACET_BREP, ADVANCED_FACE, POLY_LOOP, EDGE_LOOP, as a MANIFOLD_SOLID_REPRESENTATION or as a SHELL_BASED_REPRESENTATION, etc.). An importer needs to recognize all variants in order to read a STEP file consistently. Most CAD software does not support the full set of STEP entries, and as such, are limited to a specific subset of STEP entities. For example Autodesk Knowledge Base, list of supported STEP entities.
As a result, most CAD software have some sort of "repair geometry data after import" feature, which may or may not work.

I have been creating million element FE models for years,

Classic example. There can be no errors in the geometrical formation, which means I use a file format that is consistent and can be developed in a simple format. If I use STEP or IGES or other formats to import from say RHINO to FEM then by the time the model gets to a moderate size there will be data errors in the FE model, like duplicate faces that need to be cleaned, but most cleaners cannot get it all, so your analysis is stopped, usually when developing the stiffness matrix.

I understand your issue with the STEP file, but I would still develop a data structure to hold all the required info for the model, and then in a single pass load the data, and then resweep the data structure to look for errors, you cannot write a STEP file without errors. The STEP file had to be written by a forward running program.

The advantage of my method is everything is self contained and in a STRAND7 model format. But can I say I feel your pain and I am glad it is not me.

jimdempseyatthecove · ‎11-16-2023

Try this:

!  TestStream.f90
program TestStream
    implicit none
    integer, parameter :: BigBufferSize = 100   ! Your number will vary
    character(len=BigBufferSize) :: BigBuffer
    integer :: i
    integer(8) :: fileSize, position, charactersRead
    ! Create a test file whose length is not a multiple of BigBufferSize
    ! (status="replace" so a longer file from an earlier run cannot linger)
    open(10,file="testData", access="stream", action="write", status="replace")
    do i = 1, 123
        write(10) "x"
    end do
    close(10)
    open(10,file="testData", access="stream", action="read", status="old")
    inquire(10, size=fileSize)
    position = 0
    do while(position < fileSize)
        ! Never ask for more than what is left, so the last read does not hit EOF.
        ! Both arguments of min must have the same kind.
        charactersRead = min(fileSize - position, int(BigBufferSize, 8))
        read(10, end=100) BigBuffer(1:charactersRead)
        ! Blank-pad the unused tail of a partial last chunk
100     if(charactersRead < BigBufferSize) BigBuffer(charactersRead+1:) = " "
        position = position + charactersRead
        print *, BigBuffer
    end do
    close(10)
end program TestStream

Jim Dempsey

JohnNichols · ‎11-16-2023

Jim:

Results from your program

I put it into a solution with an input file called a.in, which is the standard STEP file from Wikipedia. The elements are separated by ;, so why would you destroy this structure and then recreate it? Read until you get a ; and then deal with the blob you created first. It is LISP-like, so do a tree structure on it.

ISO-10303-21;
HEADER;
FILE_DESCRIPTION(
/* description */ ('A minimal AP214 example with a single part'),
/* implementation_level */ '2;1');
FILE_NAME(
/* name */ 'demo',
/* time_stamp */ '2003-12-27T11:57:53',
/* author */ ('Lothar Klein'),
/* organization */ ('LKSoft'),
/* preprocessor_version */ ' ',
/* originating_system */ 'IDA-STEP',
/* authorization */ ' ');
FILE_SCHEMA (('AUTOMOTIVE_DESIGN { 1 0 10303 214 2 1 1}'));
ENDSEC;
DATA;
#10=ORGANIZATION('O0001','LKSoft','company');
#11=PRODUCT_DEFINITION_CONTEXT('part definition',#12,'manufacturing');
#12=APPLICATION_CONTEXT('mechanical design');
#13=APPLICATION_PROTOCOL_DEFINITION('','automotive_design',2003,#12);
#14=PRODUCT_DEFINITION('0',$,#15,#11);
#15=PRODUCT_DEFINITION_FORMATION('1',$,#16);
#16=PRODUCT('A0001','Test Part 1','',(#18));
#17=PRODUCT_RELATED_PRODUCT_CATEGORY('part',$,(#16));
#18=PRODUCT_CONTEXT('',#12,'');
#19=APPLIED_ORGANIZATION_ASSIGNMENT(#10,#20,(#16));
#20=ORGANIZATION_ROLE('id owner');
ENDSEC;
END-ISO-10303-21;

It is a clumsy Euler graph - tree.

#20=ORGANIZATION_ROLE('id owner'); the only important thing in the first bit is either the 20 or the ORGANIZATION_ROLE, if it is the 20 use that and get rid of the 20 wasted letters or if it is ORGANIZATION_ROLE, create your own look up table.

We had terrible problems with land surveyors using different names, BM, bm, TBM, BM20 etc for a bench mark, we told them from then on a BM of all forms was coded 20, minimized our errors.

If the surveyor had two 20's you can compare the locations to see if they are the same one. It is a tree: recreate the mental tree and get rid of all the wasted text. I do these sorts of structures that are 700MB in size on an average day.

JohnNichols · ‎11-16-2023

You could never write a generic STEP reader, you have to have the context built into your program.

JohnNichols · ‎11-16-2023

Standard bridge method as a case in point:

If you calculate all of the points as (X,Y) the original will contain errors, guaranteed, the people who build this by hand coped with the errors, we used to do this all the time, but it is simpler to send the critical information

2 parabolas, a shape with 14 hangers and the hanger spacing == on a 121.2 m length. In creating the parabolic data, which is six reals, you can do an error analysis, so for one bridge the maximum error was 45 mm, but the parabola meant it is smooth when built which is what I want, if the parabola gives 16825 and that is a better fit to make a smooth differential then who cares, in this instance, as long as you check all the other dimensions.

JohnNichols · ‎11-16-2023

I used to get paid a lot of money to sort out these errors.

jimdempseyatthecove · ‎11-17-2023

>>why would you destroy this structure and then recreate it, read until you get a ; and then deal with the blob you created - first.

@andrew_4619 listed his requirement to read a large chunk of his input file, which can be of size less than the size of his input buffer, without receiving an I/O error (and with receiving all the data).

I provided that solution (he can adjust the buffer size), although Andrew has some suspicions about whether INQUIRE always returns the correct file size.

@andrew_4619 When file is opened for stream, the units of file size are bytes. Other modes may return units of sizeof(real).

If you do encounter problems, then you can modify the code I provided to attempt to read 1 byte at a time after it reaches the inquire file size.

@JohnNichols The purpose of reading a large chunk is that it is much faster than reading one byte at a time. Then use INDEX(BigBuffer(pos:),';') to find (or not find) the next ';'.

You wouldn't destroy the structure, instead, you would find the structure within the big buffer.

.OR.

deal with structures larger than your big buffer in a different way.
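A minimal sketch of finding the records inside the big buffer with INDEX. The buffer contents are made up, and the sketch assumes that ';' does not occur inside a quoted string. A record cut off by the end of the chunk is reported as a remainder to be carried over to the next chunk.

```fortran
program split_records
    implicit none
    ! Pretend this is one chunk read from the file; the last record is incomplete
    character(len=*), parameter :: BigBuffer = &
        "HEADER;#10=ORGANIZATION('O0001','LKSoft','company');#12=APPLICATION_CONT"
    integer :: pos, n, nrec

    pos = 1
    nrec = 0
    do
        if (pos > len(BigBuffer)) exit
        n = index(BigBuffer(pos:), ";")      ! offset of the next ';' relative to pos
        if (n == 0) exit                     ! no terminator: a partial record remains
        nrec = nrec + 1
        print "(i0,': ',a)", nrec, BigBuffer(pos:pos+n-1)
        pos = pos + n
    end do

    if (pos <= len(BigBuffer)) then
        print "('carry over to next chunk: ',a)", BigBuffer(pos:)
    end if
end program split_records
```

Note that INDEX returns an offset relative to the start of the substring, so the absolute position of the ';' is pos+n-1.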

Jim Dempsey

andrew_4619 · ‎11-16-2023

Thanks Jim, I considered that approach and adopted something similar. One problem is that I ask for the file size, but you are at the mercy of the OS and file system and I am not 100% convinced that the number can be relied on. I think in some instances you may get something that is an integer multiple of the data block size. Anyway, to guard against that, the last chunk gets read a byte at a time; that works and may be a tad slower, but the difference is not a real world problem so I will close this and move on. It seems strange that there is no simple (i.e. one read) way to suck a whole file into a data buffer, which was the crux of the question I suppose, but enough time has been spent.

John, I have no pain! This is not a new project; I have been happily parsing data from STEP files for a couple of years and it isn't a big problem. I have a system that works, but as with all things, from time to time you find an issue that needs fixing. You are also assuming many things that are not in fact relevant to me:
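A rough sketch of that guard: read the whole chunks as stream, then read the remainder one byte at a time until end of file is signalled. This is an illustration under assumptions, not the actual production code. The chunk size, file name and test data are made up, and if the reported size were overstated by a full chunk or more, a whole-chunk read would still fail.

```fortran
program chunk_then_bytes
    use, intrinsic :: iso_fortran_env, only: int64, iostat_end
    implicit none
    integer, parameter :: chunk = 100                        ! assumed chunk size
    character(len=*), parameter :: fname = "chunk_test.dat"  ! hypothetical file name
    character(len=:), allocatable :: buffer
    character(len=1) :: c
    integer(int64) :: fsize
    integer :: u, ios, i, nwhole, used

    ! Test file of 123 bytes, not a multiple of the chunk size
    open(newunit=u, file=fname, access="stream", form="unformatted", &
         action="write", status="replace")
    do i = 1, 123
        write(u) "x"
    end do
    close(u)

    inquire(file=fname, size=fsize)
    if (fsize < 0) error stop "file size not available"
    nwhole = int(fsize / chunk)
    ! One spare chunk of room in case the reported size is slightly off
    allocate(character(len=fsize + chunk) :: buffer)

    open(newunit=u, file=fname, access="stream", form="unformatted", &
         action="read", status="old")
    used = 0
    do i = 1, nwhole                          ! fast path: whole chunks
        read(u, iostat=ios) buffer(used+1:used+chunk)
        if (ios /= 0) error stop "whole-chunk read failed"
        used = used + chunk
    end do
    do while (used < len(buffer))             ! slow path: the tail, byte by byte
        read(u, iostat=ios) c
        if (ios == iostat_end) exit
        if (ios /= 0) error stop "read error in tail"
        used = used + 1
        buffer(used:used) = c
    end do
    close(u)

    print *, "reported size:", fsize, " bytes buffered:", used
    if (used /= 123) error stop "unexpected byte count"
end program chunk_then_bytes
```

The valid data is buffer(1:used); the byte count comes from what was actually read, not from the size INQUIRE reported.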

1) I do not in general want ALL the geometry from the file just specific pieces of geometry so line by line parsing each entity into a data structure could be quite wasteful and would involve implementing every possible data type. A LOT more work most of it unnecessary.

2) Reading it line by line is also not as easy as you might think because the line length is variable and a "line" can in fact be thousands of characters long. How big do you make it? Also, depending on the originating system you may or may not have the CRLF record delimiters you get on Windows. Yes, you can work around it in many ways but that is also work....

3) I am not destroying the data by reading it, I am just taking it from disk to memory; the structure is still there and can be searched in fast ways. We can also at the same time weed out the non-significant white space that can exist between tokens and delimiters. Simple queries can be used on the data buffer to locate specific pieces of data that can then be parsed.

4) I think your problems are different to the problems in my world. STEP is the most accurate and complete method of exchanging CAD data between systems that cannot read each other's native binary data. For complex freeform curves and surfaces it beats the pants off every other system. Compared with the pain of importing automotive body data or complex mouldings via IGES, with gaps and cracks where surfaces do not meet due to tolerance, or thin micro sliver surfaces that fill a gap, with high end CAD systems such as Catia or NX those problems are really quite small in STEP.

Anyway enough, far too much digression.
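As an illustration of point 3, a small sketch of weeding out non-significant white space (spaces, tabs, CR and LF) while keeping everything inside quoted strings intact. The function name compact is made up for the example, and /* ... */ comments are not handled here.

```fortran
program strip_ws
    implicit none
    character(len=:), allocatable :: raw, clean

    ! A record spread over two lines with CRLF and padding between tokens
    raw = "#16 = PRODUCT( 'A0001'," // achar(13) // achar(10) // &
          "   'Test Part 1', '', (#18) ) ;" // achar(10)
    clean = compact(raw)
    print "(a)", clean
    if (clean /= "#16=PRODUCT('A0001','Test Part 1','',(#18));") &
        error stop "unexpected result"

contains

    ! Copy s, dropping white space that lies outside apostrophe-quoted strings
    function compact(s) result(t)
        character(len=*), intent(in) :: s
        character(len=:), allocatable :: t
        character(len=len(s)) :: work
        integer :: i, n
        logical :: in_string

        n = 0
        in_string = .false.
        do i = 1, len(s)
            if (s(i:i) == "'") in_string = .not. in_string
            if (.not. in_string) then
                select case (iachar(s(i:i)))
                case (9, 10, 13, 32)          ! tab, LF, CR, space
                    cycle
                end select
            end if
            n = n + 1
            work(n:n) = s(i:i)
        end do
        t = work(1:n)
    end function compact

end program strip_ws
```

Because line breaks are dropped, it no longer matters whether the originating system wrote CRLF, LF, or no record delimiters at all.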

JohnNichols · ‎11-16-2023

I agree we have different problems, but the note that you only require certain entities is what is normal for all people, no one ever needs everything.

I look at old bridges and model them to look at measured data.

The only thing of importance is efficiency and accuracy.

The only way to do it is Fortran.

andrew_4619 · 11-17-2023

"The only thing of importance is efficiency and accuracy." Accuracy isn't an issue here, nothing is lost. As for efficiency, on my measures the method you are suggesting is less efficient. I only process the entities that are needed; for the rest of the entities the only action is putting them in a temporary buffer. Which part of that is inefficient, and inefficient in what? Time is the only measure if the resources (e.g. memory) are up to the job.

Looking at my DXF implementation, it does indeed process on a line by line basis. The structure of DXF lends itself to that, as an entity is self contained in a series of consecutive records, so it is a linear process.

JohnNichols · ‎11-17-2023

We are talking about entirely different uses for geometric data. The STEP file is for machining, it is not for exchanging FEM data, which has very specific limits, that do not include splines etc , but include specific limited numbering systems, yours does not.

A STEP file is a translation of a database from one program to another, it is complete, it is not efficient but it is agreed by all which is the main trick.

The idea is a minimum of duplications, but it gets down to point, spline = series of points, face etc... etc.. You need to recreate the original database structure to do anything with it, hence your read and assemble, and it includes all the legal data that has nothing to do with the part manufacture, but is process control.

If I had to read a step file I would use a C# program, it has several useful commands not found in Fortran that will take a STEP file apart quickly and once. I do this sort of stuff day in and day out like you, but with structures. We are in a different world and the structural engineers are still in the 1970s compared to your stuff, but they lack the skill to get to your level.

andrew_4619 · ‎11-17-2023

Indeed quite different applications. If you need to go full Monty on STEP there is a company that specialises in STEP tools and libraries for that purpose, but that is big money stuff for big companies; the likes of Catia, Airbus, Boeing etc. use that stuff for data integration/exchange. Though those businesses like the whole supply chain to be using the same software. The issue comes when you get to the CAM stage in manufacturing and want to direct diverse machine tools, robots and other automation.

I do actually work with FEM data also but only import shell geometry in the form of Bulk data, LS-Dyna keyword and Abaqus inp file but that is all simple stuff to work with. I do create and export complete ready to run data decks in native format for a number of solvers (MSC/NX Nastran, Altair Optistruct, Abaqus, LS-Dyna and a few other things) but that is either shell or solid model with orthotropic materials (composites), BCs, Loads, contact surfaces etc. I dropped doing actual FEA analysis many years ago.

Anyway enjoy your bridges, they are quite useful things. Engineering doesn't seem that highly valued in the UK these days, much better to be a City number shoveler!?!

JohnNichols · ‎11-17-2023

I would not go full Monty on step, one cannot fix errors in other people's programs, better to fix your own.

STEP has the interesting structure based on a unit like

#8=UNCERTAINTY_MEASURE_WITH_UNIT(LENGTH_MEASURE(0.0),#5,'','');

Jim: You need to just read to the end of the line or the ;, if it fits on a line then pull the line apart, somewhat simple in C# and a bit of a pain in FORTRAN, otherwise make a long string from multiple lines, deleting the line breaks and then pull the string apart into an array.

Then deal with the array. The #8 are used as references or pointers in one form.
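A rough Fortran sketch of pulling one record apart into an id, a keyword and an array of top-level arguments, using the record quoted above. The limits (max_args, 64 characters per argument) are arbitrary choices for the example, and the record is assumed to be already joined into a single string.

```fortran
program split_entity
    implicit none
    character(len=*), parameter :: rec = &
        "#8=UNCERTAINTY_MEASURE_WITH_UNIT(LENGTH_MEASURE(0.0),#5,'','');"
    integer, parameter :: max_args = 16          ! assumed limit for this sketch
    character(len=64) :: args(max_args)          ! longer arguments would be truncated
    integer :: id, eq, lp, rp, i, start, depth, nargs
    logical :: in_string

    eq = index(rec, "=")
    lp = index(rec, "(")                         ! first '(' opens the argument list
    rp = index(rec, ")", back=.true.)            ! last ')' closes it
    if (eq < 3 .or. lp < eq .or. rp < lp) error stop "not a simple entity record"

    read(rec(2:eq-1), *) id
    print "('id      = ',i0)", id
    print "('keyword = ',a)", rec(eq+1:lp-1)

    ! Split on commas that are at nesting depth 0 and outside quoted strings
    nargs = 0
    depth = 0
    in_string = .false.
    start = lp + 1
    do i = lp + 1, rp
        select case (rec(i:i))
        case ("'")
            in_string = .not. in_string
        case ("(")
            if (.not. in_string) depth = depth + 1
        case (")", ",")
            if (in_string) cycle
            if (rec(i:i) == ")" .and. i < rp) then
                depth = depth - 1                ! closing a nested list
            else if (depth == 0) then
                if (nargs == max_args) error stop "too many arguments"
                nargs = nargs + 1                ! top-level comma or the final ')'
                args(nargs) = rec(start:i-1)
                start = i + 1
            end if
        end select
    end do

    do i = 1, nargs
        print "('arg ',i0,' = ',a)", i, trim(args(i))
    end do
end program split_entity
```

Arguments that start with '#', such as #5 here, are the references to other entities.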

Andrew: I prefer to use programs I write for structural analysis. The big boys are too complicated and slow.

I do work in England but not allowed to talk about it.

Kjell-Bengtsson-Jotne · ‎04-27-2024

Jotne has a STEP Object Database and there you can store any size of STEP files.

www.jotneconnect.com