Intel® Distribution for Python*

segmentation fault

Pet222
Beginner
2,541 Views

Hi!

 

I am using version 3.9.12 of the Intel-optimized Python and I am getting a segmentation fault when logging data to TensorBoard. I am not sure what the root cause is, since there is not much information to go on.

I run Python in WSL2 under Ubuntu 20.04.

The CPU is 13700KF.

 

The following is logged when the segfault occurs:

Thread 0x00007fd9317fa700 (most recent call first):
File "/home/pet/miniconda3/envs/idp_py3.9_new/lib/python3.9/threading.py", line 316 in wait
File "/home/pet/miniconda3/envs/idp_py3.9_new/lib/python3.9/queue.py", line 180 in get
File "/home/pet/miniconda3/envs/idp_py3.9_new/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
File "/home/pet/miniconda3/envs/idp_py3.9_new/lib/python3.9/threading.py", line 973 in _bootstrap_inner
File "/home/pet/miniconda3/envs/idp_py3.9_new/lib/python3.9/threading.py", line 930 in _bootstrap

 

The code sometimes runs for quite a long time (a day); at other times it cannot be started at all (always right after a segfault crash), and in those cases the Python error points to other modules that had no issues previously. After a while, Python can be started again. It seems as if something remains in a bad state after the Python crash, which interferes with restarting the Python application.

 

Are there any known issues or hints that could help solve this, or get closer to the root cause?

VS Code is already running with:

            "pythonArgs": ["-v", "-q", "-X", "dev"],
            "PYDEVD_THREAD_DUMP_ON_WARN_EVALUATION_TIMEOUT": "1",
            "PYTHONFAULTHANDLER": "1",

 

ulimit -c has been increased to unlimited, but no core dump seems to be produced when the segfault happens. RAM seems to be sufficient.
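A minimal sketch of how the crash logging described above could be checked from inside the Python process itself. It uses only the standard library (faulthandler, resource). The function name enable_crash_logging and the log file name are hypothetical and chosen for illustration. Resource limits are per process and inherited from the parent, so a "ulimit -c unlimited" set in one shell may not apply to a process launched by VS Code; the sketch reports the limit the running process actually sees, plus the kernel core_pattern if it is readable.

```python
import faulthandler
import resource
from pathlib import Path


def enable_crash_logging(log_path="faulthandler.log"):
    """Write faulthandler tracebacks to a file and report core-dump settings.

    The name and default path are illustrative, not part of any library.
    """
    # The file object must stay open: faulthandler writes to its descriptor.
    log_file = open(log_path, "a", buffering=1)
    faulthandler.enable(file=log_file, all_threads=True)

    # Core-dump size limit as seen by THIS process (not by some other shell).
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    # Where the kernel sends core dumps; a leading "|" means they are piped
    # to a helper program instead of being written to the working directory.
    pattern_file = Path("/proc/sys/kernel/core_pattern")
    core_pattern = (
        pattern_file.read_text().strip() if pattern_file.exists() else "unavailable"
    )

    info = {
        "core_soft_limit": soft,
        "core_hard_limit": hard,
        "core_pattern": core_pattern,
    }
    return log_file, info
```

Calling enable_crash_logging() at the very start of the training script and printing the returned dictionary would show whether the core limit really is unlimited (resource.RLIM_INFINITY) for that process.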

 

Thank you!

 

AayushiR_Intel
Moderator
2,398 Views

Hi,

 

Thanks for posting in Intel Communities.

We tried this on our end and did not get any segmentation fault. We installed WSL on our system (with a supported 10th-generation processor), used Ubuntu 20.04, created a Python environment, and ran some ML models; we were able to generate output in the form of a log file for TensorBoard.

We assume that you are getting the segmentation fault because you are using a recent processor with an older version of the Intel Distribution for Python (3.9.12). We suggest upgrading to the latest version of the Intel Distribution for Python and trying again. If you still face any issues after that, please let us know.

Please see the screenshots below.

[Screenshot: AayushiR_Intel_0-1673255576903.png]

[Screenshot: AayushiR_Intel_1-1673255599525.png]

 

Thanks,

Aayushi

 

Pet222
Beginner
2,387 Views

Thank you for your efforts! 

The issue is quite sporadic, and the software sometimes has to run for a day before it reproduces. I have run some other tests since then: I also ran with a non-Intel-optimized Python, and the segfault occurs there too. I am now using GDB to narrow down the issue, but since I am not really familiar with it, it will take some time. This is what I get currently (I am debugging against the Intel-optimized Python 3.9):

 

Below is the log from running the "bt" command in GDB (I would also like to get the log from python-dbg if possible).

#0 0x0000000000000000 in ?? ()
#1 0x0000555555696024 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x5556845cffa0, throwflag=<optimized out>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3309
#2 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#3 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7fff67eac640)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#4 0x0000555555695b07 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x5557ac019ca0, callable=0x7fff67ebce50, tstate=0x55555a482d30)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
#5 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x5557ac019ca0, callable=0x7fff67ebce50)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#6 call_function (kwnames=0x0, oparg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#7 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x5557ac019b20, throwflag=<optimized out>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3521
#8 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#9 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7fff67eac640)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#10 0x0000555555695b07 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffc08001168, callable=0x7fff67ddda60, tstate=0x55555a482d30)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
#11 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffc08001168, callable=0x7fff67ddda60)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#12 call_function (kwnames=0x0, oparg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#13 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffc08000fe0, throwflag=<optimized out>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3521
#14 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#15 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7fff67eac640)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#16 0x0000555555695b07 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff40af23e0, callable=0x7fff67ebc9d0, tstate=0x55555a482d30)
at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
--Type <RET> for more, q to quit, c to continue without paging--c
#17 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff40af23e0, callable=0x7fff67ebc9d0) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#18 call_function (kwnames=0x0, oparg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#19 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff40af2230, throwflag=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3521
#20 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#21 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7fff6ebf5400) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#22 0x0000555555695dba in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51b50950, callable=0x7fff6ebee940, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
#23 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51b50950, callable=0x7fff6ebee940) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#25 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff51b507c0, throwflag=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3507
#26 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#27 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7fff6ebeccc0) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#28 0x0000555555695dba in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51b389b8, callable=0x7fff6ebf8280, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
#29 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51b389b8, callable=0x7fff6ebf8280) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#30 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#31 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff51b38840, throwflag=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3507
#32 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#33 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7ffff75d3c80) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#34 0x0000555555695dba in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51df56f8, callable=0x7ffff75f0e50, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:118
#35 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff51df56f8, callable=0x7ffff75f0e50) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:127
#36 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555a482d30) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:5078
#37 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff51df5580, throwflag=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/ceval.c:3507
#38 0x00005555556a6d6b in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/internal/pycore_ceval.h:40
#39 function_code_fastcall (tstate=0x55555a482d30, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7ffff75d3c80) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/call.c:330
#40 0x00005555556b5725 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Include/cpython/abstract.h:103
#41 method_vectorcall (method=<optimized out>, args=0x7ffff78b9058, nargsf=<optimized out>, kwnames=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Objects/classobject.c:83
#42 0x00005555557a5c69 in t_bootstrap (boot_raw=0x7fff51b542d0) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Modules/_threadmodule.c:1040
#43 0x00005555557a5b57 in pythread_wrapper (arg=<optimized out>) at /home/sat_bot/base/conda-bld/python-split_1661733805548/work/Python/thread_pthread.h:245
#44 0x00007ffff7f66609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#45 0x00007ffff7d31133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
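
A native backtrace like the one above shows C frames of the interpreter but not which Python functions were running. As a complement, the standard-library faulthandler module can dump the Python-level stacks of all threads on demand. The sketch below is only an illustration: the function name install_stack_dump_hooks and its parameters are hypothetical, and SIGUSR1 is an arbitrary choice of signal.

```python
import faulthandler
import signal
import sys


def install_stack_dump_hooks(watchdog_seconds=None, out=None):
    """Dump Python tracebacks of all threads on SIGUSR1, optionally on a timer.

    `out` must be a file object with a real file descriptor (default: stderr).
    """
    if out is None:
        out = sys.stderr

    # On-demand dump: `kill -USR1 <pid>` from another terminal prints the
    # Python stack of every thread without stopping the process.
    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True, chain=False)

    # Optional periodic dump, useful when threads appear stuck for long spans.
    if watchdog_seconds is not None:
        faulthandler.dump_traceback_later(watchdog_seconds, repeat=True, file=out)
```

With this installed early in the script, a hung or slow run can be inspected from a second WSL terminal by sending SIGUSR1 to the Python process ID.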

Pet222
Beginner
2,385 Views

One remark and one question:

1. I was able to reproduce the issue with "Python 3.9.13 :: Intel Corporation" as well, and also with other, non-Intel-optimized conda environments that previously ran perfectly. I am setting up the proposed Python environment, but since I was able to reproduce the issue with Python environments that are not Intel-optimized, I think the issue is not related to Intel Python itself.

2. Meanwhile, it occurred to me that I recently changed the CPU from a 12700K to a 13700KF (everything else is the same). Since then I have had occasional issues. Can a CPU issue or the XMP mode in theory cause a segmentation fault in Python without the OS reporting any warning or crash? I will run the Intel Processor Diagnostic Tool in loop mode overnight to see if the CPU remains stable.

 

AayushiR_Intel
Moderator
2,370 Views

Hi,


Thanks for letting us know. Please run the diagnostics on your end and share the results with us.


Thanks,

Aayushi


Pet222
Beginner
2,345 Views

 

Hi!

 

I think I found the root cause.

My MSI motherboard has a CPU booster option in the BIOS, which was ON.

After this option was turned OFF, the code ran for more than a day without any issues, and the second run has now been running for 103 hours.

When I turned the CPU booster back ON, threads stopped (remained in a waiting state under WSL) or PyTorch threw other implausible CUDA errors.

 

When I ran the Intel CPU stability test with the booster mode OFF, it was stable. However, I noticed that in this case the CPU is under heavy load and the internal thermal regulation reduces the CPU clock frequency. In my Python "issue" case with the CPU boost ON, the CPU temperature stays constant at around 80 degrees Celsius, so the thermal protection does not reduce the core frequency (it stays at 5.2 GHz). The Intel CPU test is also stable when the CPU boost is ON (but in this case the CPU clock is likewise reduced due to heavy load and thermal protection).

 

So I assume:

a) because of the reduced frequency, no issue occurs when the Intel test is executed

b) because the CPU test runs in Windows and not under WSL, the issue is not triggered.
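
One way to probe assumption (b) would be a small consistency check that runs inside WSL itself. The sketch below is an illustration under stated assumptions, not a replacement for a real stress test: it repeats a deterministic SHA-256 chain and counts runs whose result differs from the first one. The function names, seed, and round counts are hypothetical. It is single-threaded and light, so a clean result does not prove stability, and the reference value is computed on the same machine, so it can only reveal inconsistency between runs.

```python
import hashlib


def workload_digest(seed: bytes, rounds: int) -> str:
    """Hash the seed repeatedly; the result is fully determined by the inputs."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def count_mismatches(repeats=10, rounds=200_000, seed=b"stability-check"):
    """Repeat the same computation and count results that differ from the first.

    On stable hardware this should return 0; any other value suggests
    silent computation errors.
    """
    reference = workload_digest(seed, rounds)
    return sum(workload_digest(seed, rounds) != reference for _ in range(repeats))
```

Running count_mismatches in a loop for several hours with the booster ON and then OFF, ideally in one process per core, would give a rough comparison under a load profile closer to the Python workload than to the Windows stress test.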

 

I will buy a much stronger cooler today, having seen how high the temperature went during the Intel tool's stress test, and will run the stress test again with the new cooler.

Regarding the boost issue under normal load, when the CPU stays under 80 degrees Celsius, what do you suggest? Is this something that should be reported to Intel, or is it rather a motherboard issue that should be reported to MSI?

AayushiR_Intel
Moderator
2,316 Views

Hi,


We are working on this internally. We will get back to you with an update.


Thanks,

Aayushi


Aditya18
Moderator
2,303 Views

Hi,

Thanks for providing all the details. As this is a somewhat unusual issue, let me check where we should look into it, since several components are involved. We will get back to you soon.

Thanks


Jocelyn_Intel
Moderator
2,249 Views

Hello, @Pet222

 

Thank you for your time.  

 

Please be aware that the CPU booster is a feature on some motherboards that can cause overclocking, so we recommend that you load the BIOS defaults and let us know if the issue persists.

 

Also, I would like to let you know that altering clock frequency or voltage may damage or reduce the useful life of the processor and other system components and may reduce system stability and performance. Make sure to use the processor within its supported specifications.

 

Best regards,  

Jocelyn M.   

Intel Customer Support Technician. 


Pet222
Beginner
2,241 Views

Hi Jocelyn!

 

"In case I ran the Intel CPU stability test with the booster mode OFF it works stable."

 

I checked the CPU temperature and core frequency with the MSI Center tool. When the CPU was overclocked, the main parameters seemed to be within spec (80 degrees Celsius, and the core frequency was within spec). Yes, one solution would be to disable the booster. However, I assume the idea behind it is the same as in Intel's XTU: boost performance while the system remains stable.

Jocelyn_Intel
Moderator
2,232 Views

Hello, @Pet222

 

Thank you for your response. 

 

You mentioned that your issue began when the CPU booster was ON, which is why we suggested loading the BIOS defaults. Does the issue persist after this step?

 

Best regards,  

Jocelyn M.   

Intel Customer Support Technician. 


Pet222
Beginner
2,228 Views

Hello Jocelyn M,

 

I ran the script for more than a day with the motherboard booster set to OFF, and during this time I was not able to reproduce the issue. With the booster ON, it occurred every time within a day or even within hours.

It seems MSI updated the BIOS firmware last week with the following:  "Support exFAT file system. - Update CPU Micro code. - Improve memory compatibility." So today I also updated the BIOS.

So far it seems the issue is solved.

Regards,

Péter

Jocelyn_Intel
Moderator
2,203 Views

Hello, @Pet222

 

Thank you for your response. 

 

We are glad that the issue is now solved, and we appreciate you letting us know how you solved it. Since the issue is solved, we will proceed to close this thread.

 

If you need further assistance, please submit a new question as this thread will no longer be monitored. Have a great day. 

 

Best regards,  

Jocelyn M.   

Intel Customer Support Technician. 

