Intel® oneAPI Threading Building Blocks

Pipeline and IFC import

giuseppe500
Beginner

Hello.
I have this problem:
I'm implementing an importer for IFC (an open, XML-based architectural format).
I would like to use TBB to speed up the import.
The bottleneck is clearly the file, but maybe I can still make use of my four cores:
I read the data from an XML document, and the data is organized in blocks,
for example column, wall, beam, etc.
Each of these types (column, wall and beam are classes) must be processed differently, with many operations and iterations.
The operations are different for each type, but are the same within a type.
I think parallel_for is not the right fit; perhaps the solution is a pipeline. I read about it here:
http://software.intel.com/en-us/blogs/2011/09/14/how-to-make-a-pipeline-with-an-intel-threading-building-blocks-flow-graph
but I'm a newbie. I want to start studying TBB, and this is my first experiment.
Can you suggest some links, a better explanation of this problem, or books for studying TBB in more depth?

Thanks.

RafSchietekat
Valued Contributor III

A parallel parser for XML would be interesting, but I haven't followed things lately so I'm not aware of any such support. Skipping content and even random access was considered for the compact format (Accelerated Sequential Access, Random Access), and that would be ideal for such a purpose, but the winning proposal Efficient XML Interchange unfortunately does not provide such features, if you're even using that yet. Xerces for C++ doesn't provide concurrent access even to DOM, which is disappointing.

So it seems that you will at least have to translate the content to some intermediate format that does allow concurrent access, and then recursive parallelism could be applied to process it, maybe destructively to avoid having to reallocate data. You could even use parallel_for() or parallel_reduce() with a special Range implementation to avoid dealing directly with tasks yourself.
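To give an idea (just a sketch with made-up names: IfcElement is whatever intermediate record you translate to, process() stands for your per-element work), a minimal custom Range used with parallel_for() could look like this:

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>   // brings in tbb::split
#include <tbb/parallel_for.h>

// Placeholder for one element of the intermediate representation.
struct IfcElement { /* type tag, attributes, geometry, ... */ };

void process(IfcElement& e);     // placeholder for the per-element work

// A minimal Range over a slice of the intermediate vector.
class ElementRange {
    IfcElement* first_;
    std::size_t count_;
    std::size_t grain_;
public:
    ElementRange(IfcElement* first, std::size_t count, std::size_t grain = 64)
        : first_(first), count_(count), grain_(grain) {}
    // Splitting constructor required by the Range concept:
    // takes the second half of r and leaves the first half in r.
    ElementRange(ElementRange& r, tbb::split)
        : first_(r.first_ + r.count_ / 2),
          count_(r.count_ - r.count_ / 2),
          grain_(r.grain_) { r.count_ /= 2; }
    bool empty() const { return count_ == 0; }
    bool is_divisible() const { return count_ > grain_; }
    IfcElement* begin() const { return first_; }
    IfcElement* end() const { return first_ + count_; }
};

void process_all(std::vector<IfcElement>& elements) {
    tbb::parallel_for(ElementRange(elements.data(), elements.size()),
                      [](const ElementRange& r) {
                          for (IfcElement* e = r.begin(); e != r.end(); ++e)
                              process(*e);
                      });
}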

Starting parallel work concurrently with sequential parsing would be a lot more challenging, and as a self-declared "newbie" you probably shouldn't go there until you've built some experience first and have clearly identified any possible gains. I don't see much use for pipeline(), because the restrictions on concurrency would force you to have a rather heavy first stage, and then I don't see how it would allow you to combine events.

If anybody has other insights, I'd also like to hear about them, as I might be tempted to do something in that area again (but probably not as a hobby project).

jiri
New Contributor I

Before you start parallelizing, you should try to evaluate whether parallelization will have real benefits in your situation. In other words, you should try to "guess" whether it will be worth it. For example, you should always consider the effects of Amdahl's law (put shortly: if the serial part takes a considerable portion of the total time, then even a huge speed-up in the parallel part won't improve the total performance much). For example, if just loading the file from disk takes 1 second and it then takes 0.1 seconds to process, there is not much you can do. A few simple timing experiments can help you judge this.
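Something as simple as this already tells you how much there is to gain (just a sketch; loadFile() and processData() are placeholders for the two phases of your importer, tbb::tick_count is TBB's timer):

#include <iostream>
#include <tbb/tick_count.h>

struct Document { /* your intermediate representation */ };
Document loadFile(const char* path);   // placeholder: disk I/O + XML parsing (serial)
void processData(Document& doc);       // placeholder: the part you hope to parallelize

int main() {
    tbb::tick_count t0 = tbb::tick_count::now();
    Document doc = loadFile("model.ifc");
    tbb::tick_count t1 = tbb::tick_count::now();
    processData(doc);
    tbb::tick_count t2 = tbb::tick_count::now();

    std::cout << "load:    " << (t1 - t0).seconds() << " s\n"
              << "process: " << (t2 - t1).seconds() << " s\n";
    return 0;
}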

You should also think about the maximum degree of parallelism inherent to the problem you are trying to solve and potential bottlenecks.

On the other hand, if this is your first such project, then I would suggest that you go ahead and just try out some of your ideas to get a feel for it. I probably wouldn't start by trying to write a parallel XML parser, since that is really hard.

giuseppe500
Beginner

Thanks.

jimdempseyatthecove
Honored Contributor III
RafSchietekat
Valued Contributor III

Thanks for that link!

From a cursory glance at the paper: the document is subdivided into chunks and each is parsed "speculatively" (although I wouldn't call it that if text cannot be mistaken for a tag, and I'm not sure XML has inherited that vulnerability from SGML?), and subtrees are parsed before the full ancestry of a subroot is known (including namespace-related information, of course). The processing API is DOM, at first sight only after parsing is complete (I don't subscribe to the dogma that a DOM should be complete before you start using it, but that doesn't seem to be a widely held view), and apparently the whole document is then held in memory until it is processed (a caveat if the biggest speedup is for the largest documents).

I would still like a format with additional information, so that less time and RAM has to be invested if only part of the information is going to be used etc., but this is certainly a potential improvement over what you get with Xerces. Perhaps also with a setting to allow the parser to be lazier about the lower-level stuff, processing that can run concurrently with parsing, ...

jiri
New Contributor I

I think you can't mistake a tag for text in XML, but you can mistake anything for a comment. The point in an XML document that you choose as the start of a chunk may be in the middle of a comment (scanning for "<" does not help here, obviously). The contents of the comment may look exactly like a normal XML document. As a result, you cannot provide the user with access to the currently parsed chunk, since there is no way to tell what the real content is until all preceding tasks have been processed.

giuseppe500
Beginner

When I think about dividing an XML document into chunks, I think of something like Google's MapReduce pattern (http://research.google.com/archive/mapreduce.html), where one or more processors do the mapping into key/value pairs (type of block / XML block chunk) and one or more pipelines do the post-parsing work.
For example, in IFC, if I find an <ifc:window>...</ifc:window> chunk, I create a pair whose key is ifc::window and whose value is <ifc:window>...</ifc:window>; then a pipeline takes this block, does all the parsing, and creates, for example, an OpenGL representation of the window, in parallel.
When I find <ifc:door>...</ifc:door>, another pipeline does the job, and so on.
It is not that simple, partly because I'm a newbie and partly because many formats have nested types and blocks, but the document and the MapReduce link explain the idea well; a rough sketch of what I mean is below.
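Just a sketch with invented names (nextChunk() would be the serial extraction of one block, buildGeometry() the per-type work), using the parallel_pipeline interface; I don't know if it is the right way:

#include <string>
#include <tbb/parallel_pipeline.h>

// Invented types: Chunk holds one extracted block such as <ifc:window>...</ifc:window>.
struct Chunk { std::string type; std::string xml; };
bool nextChunk(Chunk& out);            // serial extraction step, false at end of input
void buildGeometry(const Chunk& c);    // per-type work: parsing, tessellation, OpenGL data, ...

void import_blocks() {
    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/8,
        // Serial input stage: pulls key/value chunks out of the document in order.
        tbb::make_filter<void, Chunk>(tbb::filter_mode::serial_in_order,
            [](tbb::flow_control& fc) -> Chunk {
                Chunk c;
                if (!nextChunk(c)) fc.stop();
                return c;
            }) &
        // Parallel stage: windows, doors, walls, ... are processed concurrently.
        tbb::make_filter<Chunk, void>(tbb::filter_mode::parallel,
            [](const Chunk& c) { buildGeometry(c); }));
}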
Bye.

PS: Sorry for my flights of fancy; I'm never able to create anything real. Now back to PHP and my sites.

RafSchietekat
Valued Contributor III

jiri wrote:

I think you can't mistake a tag for text in XML, but you can mistake anything for a comment. The point in an XML document that you choose as the start of a chunk may be in the middle of a comment (scanning for "<" does not help here, obviously). The contents of the comment may look exactly like a normal XML document. As a result, you cannot provide the user with access to the currently parsed chunk, since there is no way to tell what the real content is until all preceding tasks have been processed.

If I'm not mistaken, you mean "mistake part of a comment for a tag" rather than the other way around, right (non-native speaker here)?

You're right, of course, so it is indeed "speculative" parsing. Even the CDATA thing is still there, it's not just something from XML's SGML past. I guess I am a bit out of touch!

It seems a pity that the format was not designed with navigation and parallel processing in mind, especially in its EXI reincarnation, when those criteria were being considered, as mentioned earlier.

RafSchietekat
Valued Contributor III

giuseppe500 wrote:

When I think about dividing an XML document into chunks, I think of something like Google's MapReduce pattern (http://research.google.com/archive/mapreduce.html)

I think that's a bit overkill, because it's about an implementation for really big data sets that require many separate computers to work together.

I've already mentioned that you could use parallel_reduce() with a special Range from a DOM API, but if you want to integrate the "map" from the serialised format I think you would have to outsmart that Intel research team (whose product imposes a synchronisation point between parsing and processing), or do your own partitioning at a high level in the tree (separate data streams containing fragments of the full document), or constrain the XML so that you can "synchronise" without risk of being mistaken about context (you would still have to control the producer for that), or...
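To illustrate the partitioning option, a sketch only, assuming the producer can be persuaded to write the model as several self-contained fragments (one per storey, per element group, or similar); parseFragment() stands for an ordinary serial parser run on one fragment:

#include <string>
#include <vector>
#include <tbb/parallel_for_each.h>

struct Fragment { std::string path; };   // hypothetical: one self-contained piece of the model
void parseFragment(const Fragment& f);   // hypothetical: serial XML parse of that piece

void parse_all(const std::vector<Fragment>& fragments) {
    // Each fragment gets its own parser instance, so no state is shared
    // between tasks and no parallel XML parser is required.
    tbb::parallel_for_each(fragments.begin(), fragments.end(),
                           [](const Fragment& f) { parseFragment(f); });
}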

jiri
New Contributor I

Raf Schietekat wrote:

jiri wrote:

I think you can't mistake a tag for text in XML, but you can mistake anything for a comment. The point in an XML document that you choose as the start of a chunk may be in the middle of a comment (scanning for "<" does not help here, obviously). The contents of the comment may look exactly like a normal XML document. As a result, you cannot provide the user with access to the currently parsed chunk, since there is no way to tell what the real content is until all preceding tasks have been processed.

If I'm not mistaken, you mean "mistake part of a comment for a tag" rather than the other way around, right (non-native speaker here)?

Non-native speaker here as well :-). You got the meaning right, of course.

I've actually seen a parallel XML parser (it was the result of a master's thesis). It parsed the whole document (in parallel, using chunks similar to the link mentioned earlier) and provided a DOM-like interface. Depending on the document, it could get speedups between 8 and 14 on 16 threads. It could beat serial libxml2 and xerces-c by a significant margin, even though its single-thread performance was worse. So it clearly demonstrates that you really can get decent performance from parallel XML processing. Unfortunately, the software was probably abandoned after the thesis was defended and never reached production quality.
