We are attempting to use the rapidjson project (https://code.google.com/p/rapidjson/) to parse a JSON file, but we get a segfault when running it natively in the Xeon Phi Linux environment. It runs fine on the Xeon chip, however. The segfault occurs in the following lines of code in the rapidjson project:
reader.h
[cpp]
d *= internal::Pow10(exp + expFrac);
handler.Double(minus ? -d : d);[/cpp]
document.h
[cpp]
//this is the function that is called above
void Double(double d) { new (stack_.template Push<ValueType>()) ValueType(d); }
[/cpp]
I don't have a clue what the .template Push is doing; in fact, I have never even seen syntax like that. Regardless, I don't believe this to be the source of the issue: when I use gdb to step through it, the error occurs because, in document.h, the "d" being passed in to void Double is some random double (like 3.34155418345e-317) with the address of 0x0. GDB output is below:
Breakpoint 1, rapidjson::GenericReader<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >::ParseNumber<1u, rapidjson::GenericInsituStringStream<rapidjson::UTF8<char> >, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > > (this=0x7fffffffcbe8,
stream=..., handler=...) at /home/jjekeli/workspace/com.src.ewir.keystonert_xeonphi/src/rapidjson/reader.h:636
636 d *= internal::Pow10(exp + expFrac);
(gdb) print d
$1 = 45,000
(gdb) next
637 handler.Double(minus ? -d : d);
(gdb) print d
$2 = 4500
(gdb) print /a d
$3 = 0x1194
(gdb) print minus
$4 = false
(gdb) step
0x000000000044e7e4 in rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >::Double (this=0x83df98,
d=4.2699109613558058e-317) at /home/jjekeli/workspace/com.src.ewir.keystonert_xeonphi/src/rapidjson/document.h:776
776 void Double(double d) { new (stack_.template Push<ValueType>()) ValueType(d); }
(gdb) print d
$4 = 4.2699109613558058e-317
(gdb) print /a d
$5 = 0x0
As you can see, before being passed into the Double function, d has a valid value and a valid memory address, but upon entering the Double function, it no longer has a valid memory address or a valid value, and this causes a segfault.
Any thoughts as to why this may be occurring?
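On the `.template Push<ValueType>()` syntax asked about above: it combines two standard C++ features. The `template` keyword tells the compiler that `Push` is a member template of a type that depends on a template parameter, and the `new (ptr) Type(args)` form is placement new, which constructs an object in storage that has already been reserved. The following is a minimal sketch of the same pattern, not rapidjson's actual code; `RawStack`, `Value`, and `Handler` are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for an internal stack: a fixed buffer that hands out
// raw, uninitialized storage. Not rapidjson's real implementation.
class RawStack {
public:
    // Reserve aligned room for one T and return a pointer to the raw bytes.
    template <typename T>
    T* Push() {
        std::size_t offset = (top_ + alignof(T) - 1) / alignof(T) * alignof(T);
        assert(offset + sizeof(T) <= sizeof(buffer_));  // sketch: fixed capacity
        top_ = offset + sizeof(T);
        return reinterpret_cast<T*>(buffer_ + offset);
    }

private:
    alignas(std::max_align_t) unsigned char buffer_[256];
    std::size_t top_ = 0;
};

// Hypothetical value type holding a parsed number.
struct Value {
    explicit Value(double d) : number(d) {}
    double number;
};

template <typename StackType>
struct Handler {
    StackType stack_;

    Value* Double(double d) {
        // The type of stack_ depends on StackType, so without the `template`
        // keyword the compiler would parse `<` as a less-than operator.
        // Placement new then constructs a Value in the storage Push returned.
        return new (stack_.template Push<Value>()) Value(d);
    }
};
```

With `Handler<RawStack> h;`, a call such as `h.Double(4500.0)` returns a pointer to a `Value` constructed inside the handler's own buffer, with no heap allocation involved.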
Could you print out the assembly code for the function call and entry point and send them for us? In addition, you might want to try compiling with a lower optimization level to see if the code works in that case.
How can I determine the assembly code for the function call and entry point? Also, we had already tried the lowest optimization level, with no luck.
A long shot, but have you checked that you have enough stack space allocated?
The values I see on the card are
% ulimit -a
-f: file size (blocks) unlimited
-t: cpu time (seconds) unlimited
-d: data seg size (kb) unlimited
-s: stack size (kb) 8192
-c: core file size (blocks) 0
-m: resident set size (kb) unlimited
-l: locked memory (kb) 64
-p: processes 61357
-n: file descriptors 10240
-v: address space (kb) unlimited
-w: locks unlimited
It might be worth trying with ulimit -s unlimited.
Somewhat worried about setting the stack size to unlimited... isn't that a good way to smash one's stack and destroy the kernel?
Nothing you do with the stack should be able to cause the kernel to crash.
However, if you are paranoid, by all means just double the size and see if that affects what happens; if it does, that's a strong signal that the amount of stack space may be the problem. (Of course, if your code is threaded, you may need to change the pthread stack size rather than just using ulimit.)
You should also be able to work out how much stack you're using from the debugger. Look at the value of %rsp at the point of the crash, and then go back to the top of the thread's stack and look at the address of a local variable there. The difference is (close to) the stack usage. If it's nearly the stack limit that's a strong hint.
Also, if %rsp at the point of the crash is just below a page boundary, that's suspicious. If you then look at /proc/pid/maps for the process, you should be able to see whether %rsp is pointing at valid memory or not.
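The stack-usage estimate described above can also be taken from inside the program rather than from the debugger. Here is a minimal sketch, assuming a compiler that accepts GCC-style `__attribute__((noinline))` and a stack that grows downward (as on x86-64 Linux). `MarkStackTop`, `StackBytesUsed`, and `DeepFrame` are hypothetical helper names, not part of rapidjson or any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Address of a local variable near the top of the thread's stack.
static std::uintptr_t g_stack_top = 0;

// Call once, early in the thread, with the address of a local variable.
void MarkStackTop(const volatile void* local) {
    g_stack_top = reinterpret_cast<std::uintptr_t>(local);
}

// Approximate bytes of stack between the marked top and `local`.
// Assumes the stack grows toward lower addresses.
std::size_t StackBytesUsed(const volatile void* local) {
    assert(g_stack_top != 0);  // MarkStackTop must have been called first
    std::uintptr_t here = reinterpret_cast<std::uintptr_t>(local);
    return g_stack_top > here ? g_stack_top - here : 0;
}

// Example of a deeper frame with a 4 KiB local buffer. noinline keeps the
// buffer in its own frame so the measurement reflects a real call.
__attribute__((noinline)) std::size_t DeepFrame() {
    volatile char scratch[4096];
    scratch[0] = 1;  // touch the buffer so it is not optimized away
    return StackBytesUsed(&scratch[0]);
}
```

Calling `MarkStackTop(&some_local)` at the start of `main` (or of the thread function) and `StackBytesUsed(&another_local)` near the suspected crash site gives a number that can be compared against the `ulimit -s` value, remembering that the latter is reported in kilobytes.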
Setting the stack size to unlimited did not alleviate the problem. Sorry.
The stack pointer at the beginning of the thread was 0x7fffffffe100. At the point of the crash, the stack pointer was at 0x7fffffffc2f0, a difference of 7696 bytes. That is near-ish the original stack size, but setting ulimit -s unlimited did not help.
I wasn't able to access the maps through /proc/pid/maps. The pid was 7497, but there was no 7497 directory.
It doesn't seem like it is the stack, then, since the limit is in KB, not bytes, so you seem to be a long way away from the limit.
I'll crawl back under my stone :-)
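To make the units comparison concrete, the arithmetic can be checked at compile time. This is only a sketch using the two stack pointer values and the 8192 KB limit quoted in this thread; the constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Values reported in this thread.
constexpr std::uint64_t kStackTop = 0x7fffffffe100ULL;      // %rsp at thread start
constexpr std::uint64_t kStackAtCrash = 0x7fffffffc2f0ULL;  // %rsp at the crash
constexpr std::uint64_t kLimitKiB = 8192;                   // from `ulimit -s`

// The stack grows downward, so usage is top minus current.
constexpr std::uint64_t BytesUsed(std::uint64_t top, std::uint64_t current) {
    return top - current;
}

// ulimit reports the stack limit in units of 1024 bytes.
constexpr std::uint64_t LimitBytes(std::uint64_t kib) {
    return kib * 1024;
}

static_assert(BytesUsed(kStackTop, kStackAtCrash) == 7696,
              "about 7.5 KiB of stack in use at the crash");
static_assert(LimitBytes(kLimitKiB) == 8388608,
              "the limit is 8 MiB");
```

So 7696 bytes are in use against a limit of 8388608 bytes, which is less than 0.1% of the available stack.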