Direct Raspa interfacing - intermittent crashes

Hello,

I have an application that I cross-compile for Elk on a RasPi + HifiBerry DAC ADC+ Pro, which interfaces with Raspa directly (include raspa.h, link with Raspa and Cobalt as provided in the SDK). Largely, I have followed the example set by the raspa/apps code and the Sushi Xenomai raspa frontend.

The app works in the sense that it produces sound, but I get intermittent hard crashes (after 2-15 minutes of running) that I cannot figure out. When this happens I can’t Ctrl+C and I can’t ssh into the machine: the entire OS locks up and I need to power cycle. Running the process through gdb gives the same result.

I have previously monitored mode switches and successfully removed some using gdb according to the instructions, but this doesn’t seem to be the same kind of issue. The same code compiled with sound.io on Mac does not have this issue.

I have also disabled all sound generating code that would be called from

Is there a way I can get data on what actually happened prior to the crash?
Anyone with any ideas I could try?

Thanks!

Gustav

Hi Gustav. My best guess is that you still have mode switches/syscalls from somewhere in the audio thread.

Timing calls in particular can cause this behaviour. We’ve seen that, for instance, clock_gettime() or gettimeofday() can cause hard lockups if called from a realtime thread. It could be that it deadlocks something in the kernel, I don’t know. But you should avoid those calls and use our twine library instead if you need a current timestamp from the audio thread.

Sometimes you find these calls deep down in things like logging libraries, where it’s not immediately obvious that they’re doing something unsafe, and the risk of a single call crashing the system is quite low, which is why it seems quite random.
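
As an illustration of keeping time in the audio thread without any clock syscall, here is a minimal sketch that derives elapsed stream time purely from the number of frames processed. This is not the twine API and not how Raspa or Sushi do it; SampleClock and its methods are hypothetical names made up for this example, and it only gives time relative to stream start, not wall-clock time.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of Raspa or twine): tracks elapsed stream
// time by counting processed frames, so the audio callback never has to
// call clock_gettime(), gettimeofday() or std::chrono clocks.
class SampleClock
{
public:
    explicit SampleClock(double sample_rate) : _sample_rate(sample_rate) {}

    // Call once per audio callback with the number of frames in the buffer.
    void advance(int frames)
    {
        _frames += static_cast<std::uint64_t>(frames);
    }

    // Elapsed time since the stream started, computed from the frame count.
    std::chrono::nanoseconds elapsed() const
    {
        double seconds = static_cast<double>(_frames) / _sample_rate;
        return std::chrono::nanoseconds(static_cast<std::int64_t>(seconds * 1e9));
    }

private:
    double _sample_rate;
    std::uint64_t _frames{0};
};
```

In the process callback you would call advance() with the buffer size and read elapsed() wherever a timestamp is needed; both are plain arithmetic and make no syscalls.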

Thanks for your feedback, you set me off on the right mental track again. :slight_smile:

I had been searching for the wrong thing in the wrong place, and now found a timing call hiding in plain sight. I’ve been running the app over night, seems stable!

Hello,

I am facing the same issue when running onnxruntime session.Run() in the realtime thread. I am running a headless plugin on a Raspberry Pi 4 with a HifiBerry DAC ADC Hat. Since I’m applying an ML model to audio buffers, I use onnxruntime. The plugin works fine for around 1 to 3 minutes, but then it crashes hard, and I have no choice except to restart the Raspberry Pi.

Do you have any idea how I can figure out a specific line or part of the code that causes this problem?

Thanks.

Hi @Alireza ,
do you get crashes or freezes?

If you get a crash, my suggestion would be to enable core dumps (or run under gdb) to see what is causing it. If you instead get a system freeze, the most common cause is the use of timer-related Linux syscalls from an RT callback.
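
For the crash case, one possible way to enable core dumps from inside the program is to raise the core file size limit at startup, before any realtime threads are created. This is only a sketch using the standard POSIX getrlimit/setrlimit calls; enable_core_dumps is a hypothetical helper name, and where the core file ends up depends on the system's kernel.core_pattern setting.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-file limit to the hard limit so
// a crashing process is allowed to write a core dump. Call it early in
// main(), from a normal (non-realtime) context.
bool enable_core_dumps()
{
    struct rlimit lim {};
    if (getrlimit(RLIMIT_CORE, &lim) != 0)
    {
        return false;
    }
    lim.rlim_cur = lim.rlim_max; // soft limit may go up to the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Note that this does not help with a full system freeze, since the process never gets the chance to dump anything.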

Thank you for your reply.

It freezes, and ctrl+c or ctrl+z doesn’t work. I removed the onnxruntime part from my code, and everything works fine without crashes. Do you have any suggestions on how I can avoid any timer-related Linux syscalls from onnxruntime or locate the code line that causes this issue?

Thank you.

Hi @Alireza,
the timer syscalls are typically clock_gettime and similar. They might be hidden behind a few layers of abstraction if you are using a third-party library or the modern C++ equivalents.
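
To make the "modern C++ equivalents" concrete, here is a small sketch showing two ways of reading a monotonic timestamp. The function names are made up for this example. With the common Linux standard libraries, std::chrono::steady_clock::now() is typically implemented on top of clock_gettime(CLOCK_MONOTONIC), so the first function is, as far as I know, just as unsafe in an RT callback as the second, even though no timer call is visible in the source.

```cpp
#include <chrono>
#include <cstdint>
#include <ctime>

// Looks like pure C++, but on typical Linux toolchains this ends up in
// clock_gettime() underneath. Do not call it from the realtime thread.
std::int64_t now_ns_chrono()
{
    auto t = std::chrono::steady_clock::now().time_since_epoch();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t).count();
}

// The explicit POSIX version of the same thing. Equally unsafe in the
// realtime thread; shown only for comparison.
std::int64_t now_ns_posix()
{
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}
```

Searching your own code for clock_gettime alone would miss the first function, which is why calls made through std::chrono or through a library are easy to overlook.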

One way to spot them might be to run the same code in normal Linux and use strace or similar tools that trace syscalls.

As alternatives, you can safely use the equivalent libevl functions; check the examples here:

Hi Stefano,

Thanks for your reply. I have used strace to trace any timer-related syscalls in my program. More specifically, I used this command:

strace -e trace=clock_gettime,timer_create,timer_settime,timer_gettime,timer_getoverrun,timer_delete,alarm,getitimer,setitimer,nanosleep,clock_nanosleep,gettimeofday,settimeofday,adjtimex,clock_adjtime,timerfd_create,timerfd_settime,timerfd_gettime,clock_settime,clock_getres,clock_adjtime,clock_nanosleep,alarm,setitimer,getitimer,timerfd_create,timerfd_settime,timerfd_gettime,adjtimex,nanosleep,times -f ./run

And this is the output:

strace: Process 15372 attached
strace: Process 15373 attached
strace: Process 15374 attached
2023-08-03 18:02:16.675895945 [W:onnxruntime:, inference_session.cc:1591 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.
[pid 15372] +++ exited with 0 +++
[pid 15373] +++ exited with 0 +++
[pid 15374] +++ exited with 0 +++
+++ exited with 0 +++

I also compiled my code with the -g flag to make sure my program includes debugging information. It seems that there are no timer-related syscalls in my program. Is it possible that strace couldn’t capture the syscalls made internally by onnxruntime?