double parameter not copying correctly

Jonathan_J_ · ‎07-02-2013

We are attempting to use the rapidjson project (https://code.google.com/p/rapidjson/) to parse a JSON file, but we get a segfault when running it natively in the Xeon Phi Linux environment. It runs fine on the Xeon chip, however. The segfault occurs in the following lines of code in the rapidjson project:

reader.h

[cpp]

d *= internal::Pow10(exp + expFrac);
handler.Double(minus ? -d : d);

[/cpp]

document.h

[cpp]

//this is the function that is called above

void Double(double d) { new (stack_.template Push<ValueType>()) ValueType(d); }

[/cpp]

I don't have a clue what the .template Push is doing; in fact, I have never even seen syntax like that. Regardless, I don't believe this to be the source of the issue: when I step through it in gdb, the error occurs because, in document.h, the "d" passed in to void Double is some random double (like 3.34155418345e-317) with the address of 0x0. GDB output is below:

Breakpoint 1, rapidjson::GenericReader<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >::ParseNumber<1u, rapidjson::GenericInsituStringStream<rapidjson::UTF8<char> >, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > > (this=0x7fffffffcbe8,
stream=..., handler=...) at /home/jjekeli/workspace/com.src.ewir.keystonert_xeonphi/src/rapidjson/reader.h:636
636 d *= internal::Pow10(exp + expFrac);
(gdb) print d

$1 = 45,000

(gdb) next
637 handler.Double(minus ? -d : d);
(gdb) print d
$2 = 4500
(gdb) print /a d
$3 = 0x1194
(gdb) print minus

$4 = false
(gdb) step
0x000000000044e7e4 in rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >::Double (this=0x83df98,
d=4.2699109613558058e-317) at /home/jjekeli/workspace/com.src.ewir.keystonert_xeonphi/src/rapidjson/document.h:776
776 void Double(double d) { new (stack_.template Push<ValueType>()) ValueType(d); }
(gdb) print d
$4 = 4.2699109613558058e-317
(gdb) print /a d
$5 = 0x0

As you can see, before being passed into the Double function, d has a valid value and a valid memory address, but upon entering the Double function it no longer has either, and the program segfaults.

Any thoughts as to why this may be occurring?
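
For anyone else puzzled by the `stack_.template Push<ValueType>()` syntax quoted above, here is a minimal sketch of what that line appears to do. It assumes `stack_` has a type that depends on a template parameter, which is why the `template` keyword is needed to make the compiler read `<` as the start of a template argument list. The result of `Push` is raw storage, and placement new constructs the value in it. `RawStack`, `Value`, and `Handler` are hypothetical stand-ins, not rapidjson's real classes.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical stand-in for rapidjson's internal stack: it hands out raw,
// suitably aligned storage for one T at a time from a fixed buffer.
class RawStack {
public:
    explicit RawStack(std::size_t capacity) : buffer_(capacity), used_(0) {}

    template <typename T>
    T* Push() {
        // Round the current offset up to a multiple of alignof(T).
        std::size_t aligned = (used_ + alignof(T) - 1) / alignof(T) * alignof(T);
        assert(aligned + sizeof(T) <= buffer_.size());
        used_ = aligned + sizeof(T);
        return reinterpret_cast<T*>(buffer_.data() + aligned);
    }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// Hypothetical stand-in for ValueType.
struct Value {
    explicit Value(double d) : number(d) {}
    double number;
};

template <typename StackType>
struct Handler {
    explicit Handler(StackType& s) : stack_(s) {}

    // stack_'s type depends on StackType, so `.template` tells the compiler
    // that Push is a member template. Placement new then constructs a Value
    // in the storage that Push returned; no heap allocation happens here.
    Value* Double(double d) {
        return new (stack_.template Push<Value>()) Value(d);
    }

    StackType& stack_;
};
```

With this sketch, `Handler<RawStack> h(stack); h.Double(4500.0);` constructs a `Value` holding 4500 inside the stack's buffer. The syntax itself is ordinary C++, which fits the suspicion that it is not the source of the crash.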

Frances_R_Intel · ‎07-08-2013

Could you print out the assembly code for the function call and entry point and send them to us? In addition, you might want to try compiling with a lower optimization level to see if the code works in that case.

Jonathan_J_ · ‎07-09-2013

How can I obtain the assembly code for the function call and entry point? Also, we had already tried the lowest optimization level, with no luck.

James_C_Intel2 · ‎07-09-2013

A long shot, but have you checked that you have enough stack space allocated?

The values I see on the card are

% ulimit -a
-f: file size (blocks) unlimited
-t: cpu time (seconds) unlimited
-d: data seg size (kb) unlimited
-s: stack size (kb) 8192
-c: core file size (blocks) 0
-m: resident set size (kb) unlimited
-l: locked memory (kb) 64
-p: processes 61357
-n: file descriptors 10240
-v: address space (kb) unlimited
-w: locks unlimited

It might be worth trying with ulimit -s unlimited

Jonathan_J_ · ‎07-09-2013

Somewhat worried about setting the stack size to unlimited... isn't that a good way to smash one's stack and destroy the kernel?

James_C_Intel2 · ‎07-09-2013

Nothing you do with the stack should be able to cause the kernel to crash.

However, if you are paranoid, by all means just double the size and see if that affects what happens; if it does, that's a strong signal that the amount of stack space may be the problem. (If your code is threaded, you may of course need to change the pthread stack size rather than just using ulimit.)
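
For the threaded case, a minimal sketch of setting the pthread stack size explicitly might look like the following. It uses the standard POSIX calls `pthread_attr_setstacksize` and `pthread_create`. The helper `run_with_stack_size` and the worker `mark_done` are hypothetical names for illustration, and the worker merely stands in for whatever function runs the parser.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Runs fn(arg) on a new thread whose stack is stack_bytes large and waits
// for it to finish. Returns 0 on success, otherwise the pthread error code.
int run_with_stack_size(std::size_t stack_bytes, void* (*fn)(void*), void* arg) {
    assert(fn != nullptr);

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;

    // Fails with EINVAL if stack_bytes is below the system minimum.
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t thread;
        rc = pthread_create(&thread, &attr, fn, arg);
        if (rc == 0) rc = pthread_join(thread, nullptr);
    }

    pthread_attr_destroy(&attr);
    return rc;
}

// Hypothetical worker: a placeholder for the code that does the parsing.
void* mark_done(void* arg) {
    *static_cast<bool*>(arg) = true;
    return nullptr;
}
```

Calling `run_with_stack_size(16u * 1024u * 1024u, mark_done, &done)` would run the worker on a 16 MB stack. If doubling the size this way changes where or whether the crash happens, that points at stack exhaustion.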

You should also be able to work out how much stack you're using from the debugger. Look at the value of %rsp at the point of the crash, and then go back to the top of the thread's stack and look at the address of a local variable there. The difference is (close to) the stack usage. If it's nearly the stack limit that's a strong hint.

Also, if %rsp at the point of the crash is just below a page boundary, that's suspicious. If you then look at /proc/pid/maps for the process, you should be able to see whether %rsp is pointing at valid memory or not.
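
The check against the maps file can also be scripted. The sketch below assumes the usual Linux layout, where each line of the maps file begins with a `start-end` pair of hexadecimal addresses, and reports whether a given address falls inside any listed range. The function name `address_is_mapped` is hypothetical, and the caller is expected to have read the contents of the maps file into a string first.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

// Returns true if addr lies inside one of the "start-end ..." ranges listed
// in maps_text (the contents of a Linux /proc/<pid>/maps file). Each range
// is treated as half-open: start is included, end is not.
bool address_is_mapped(const std::string& maps_text, std::uint64_t addr) {
    std::istringstream lines(maps_text);
    std::string line;
    while (std::getline(lines, line)) {
        unsigned long long start = 0;
        unsigned long long end = 0;
        // Lines that do not begin with two hex numbers are skipped.
        if (std::sscanf(line.c_str(), "%llx-%llx", &start, &end) != 2) continue;
        if (addr >= start && addr < end) return true;
    }
    return false;
}
```

Feeding it the maps text and the value of %rsp at the crash answers the question directly: a `false` result means the stack pointer has run off the end of mapped memory.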

Jonathan_J_ · ‎07-09-2013

Setting the stack size to unlimited did not alleviate the problem. Sorry.

The stack pointer at the beginning of the thread was 0x7fffffffe100. At the point of the crash, the stack pointer was at 0x7fffffffc2f0, a difference of 7696 bytes. That is near-ish to the original stack size, but setting ulimit -s unlimited did not help.

I wasn't able to access the maps through /proc/pid/maps. The pid was 7497, but there was no 7497 directory.

James_C_Intel2 · ‎07-09-2013

It doesn't seem like it is the stack, then: the limit is in KB, not bytes, so you seem to be a long way from the limit.
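
To make the unit mismatch concrete, here is a small sketch of the arithmetic using the two stack pointer values reported earlier in the thread and the 8192 KB limit from the ulimit output. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of stack in use, given an address near the top of the thread's
// stack and the current stack pointer. The stack grows downward on x86-64,
// so the top address is the larger of the two.
constexpr std::uint64_t stack_bytes_used(std::uint64_t top, std::uint64_t sp) {
    return top - sp;
}

// ulimit -s reports the limit in kilobytes; convert it to bytes.
constexpr std::uint64_t limit_in_bytes(std::uint64_t limit_kb) {
    return limit_kb * 1024;
}
```

With the values above, `stack_bytes_used(0x7fffffffe100, 0x7fffffffc2f0)` is 7696 bytes, while `limit_in_bytes(8192)` is 8388608 bytes, so the usage is under a tenth of a percent of the limit.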

I'll crawl back under my stone :-)