Mode switches on threads

rcohen · 27 2020 14:32

Hi,

We had some code in our product that uses std::thread and other related classes to create and run new real-time threads. Rather than change all of our code to use the Twine worker pool, I changed all of the calls from std::thread to the pthread-compatible calls in Xenomai (__cobalt_pthread_create and related). I’ve got this up and running, but I’m still seeing the MSW field climbing when running sushi.

Here’s an example of the output for the stat:

elk-pi:~$ cat /proc/xenomai/sched/stat
CPU  PID    MSW        CSW        XSC        PF    STAT       %CPU  NAME
  0  0      0          2484812    0          0     00018000   64.3  [ROOT/0]
  1  0      0          2746       0          0     00018000   94.8  [ROOT/1]
  2  0      0          4175       0          0     00018000   94.8  [ROOT/2]
  3  0      0          5043       0          0     00018000  100.0  [ROOT/3]
  0  866    20         20         79         0     000600c0    0.0  sushi_b64
  0  871    2          19577      40717      0     00048042   34.2  sushi_b64
  1  881    150        300        450        0     00068042    5.1  sushi_b64
  2  882    150        300        450        0     00068042    5.2  sushi_b64
  3  883    1          2          3          0     00068042    0.0  sushi_b64
  3  884    1          2          3          0     00068042    0.0  sushi_b64
  1  0      0          3471958    0          0     00000000    0.0  [IRQ17: [timer]]
  2  0      0          9383301    0          0     00000000    0.0  [IRQ17: [timer]]
  3  0      0          1646033    0          0     00000000    0.0  [IRQ17: [timer]]
  2  0      0          0          0          0     00000000    0.0  [IRQ55: DMA TX IRQ]
  3  0      0          0          0          0     00000000    0.0  [IRQ55: DMA TX IRQ]
  1  0      0          0          0          0     00000000    0.0  [IRQ56: DMA TX IRQ]
  2  0      0          0          0          0     00000000    0.0  [IRQ56: DMA TX IRQ]
  3  0      0          0          0          0     00000000    0.0  [IRQ56: DMA TX IRQ]

I am assuming that the “MSW” field is “mode switches” and CSW might be “context switches”, and I don’t know what the XSC field is supposed to represent. On my threads, they stay in lock-step. CSW seems to occur twice for every MSW, and XSC seems to occur 3 times for every MSW. The numbers will continue to climb while the program is running.

I don’t see that we are making any system calls on the additional threads, and I’m assuming that “wait”-ing will not cause a mode switch. Should I be concerned about these mode switches, and do you have any other ideas what may be causing them?

Thanks, and stay well,

Rick

Stefano · 27 2020 15:23

Hi Rick!

Yes, you should always be concerned about mode switches.

However, they are pretty easy to spot if you use the --debug-mode-sw option of SUSHI in combination with gdb. Here are the instructions:
https://elk-audio.github.io/elk-docs/html/documents/building_plugins_for_elk.html#analyzing-and-resolving-xenomai-mode-switches

Basically, the debugger halts exactly when the MSW happens, and from there you can get a backtrace and see the offending line.

rcohen · 27 2020 18:31

Hi @Stefano,

Thanks. Yes, we’ve been using those instructions already. Unfortunately, gdb does not break until I play a note (the application is a virtual instrument). And yet, when I run outside of gdb, the MSW field is incrementing even without playing a note.

Any suggestions as to why we are not hitting the SIGXCPU break during this period?

When I play a note, the program hits some other one-time mode switches, which cause a gdb break. I have not been successful in using the “ignore” command in gdb to skip breaking on the first few mode switches. I did this:

(gdb) catch signal SIGXCPU
Catchpoint 1 (signal SIGXCPU)
(gdb) ignore 1 2
Will ignore next 2 crossings of breakpoint 1.
(gdb) run -r --debug-mode-sw -c ./config_play.json 
Starting program: /usr/bin/sushi_b64 -r --debug-mode-sw -c ./config_play.json

The program runs, and I connect my USB MIDI keyboard by typing “aconnect 16 128” in a separate terminal.

But when I go back to the 1st terminal, where gdb is running, I see this:

Program terminated with signal SIGXCPU, CPU time limit exceeded.
The program no longer exists.

I’m not sure what I’m doing wrong, can you give me any advice?

Thanks,

Rick

rcohen · 30 2020 19:01

Hi,

In https://elk-audio.github.io/elk-docs/html/documents/building_plugins_for_elk.html#analyzing-and-resolving-xenomai-mode-switches

it states: “you are not allowed to call any operating system functions from your real-time processing callback.”
One of the items mentioned as disallowed is:
“Any thread synchronization primitives, like mutexes, semaphores, etc.”

Does this include the cobalt calls for mutexes wait/signal, etc.?

Thanks,

Rick

rcohen · 31 2020 14:10

Hi @Stefano,

Yesterday I tried to go a little deeper into my mode switching issue. I removed all of the “worker” code from my additional real-time thread, leaving only the bare minimum for wait/signal handling. The mode switches continue to accumulate.

I looked at a few other sites to see if I could find out what might be going on. I found this site:

This page gives the advice of adding this call for the current thread:
pthread_set_mode_np(0, PTHREAD_WARNSW, nullptr);

So I figured I would try that. Do you think it would be necessary to add this call for new real-time threads?

Spoiler alert - it made no difference in gdb.

Take care,

Rick

Stefano · 31 2020 14:33

Hi @rcohen,
ah that could be the issue!

That call to set the PTHREAD_WARNSW flag is actually made by SUSHI when you activate the --debug-mode-sw flag, but only on its main thread. So you need to call it on any new RT threads if you want to spot MSWs there.

Stefano · 31 2020 14:34

Regarding this:

Does this include the cobalt calls for mutexes wait/signal, etc.?

No, all the __cobalt_ calls are fine to be called from an RT context.

rcohen · 31 2020 17:18

Hi @Stefano,

Thanks for the confirmation that this code is required for all real-time Xenomai threads, in order to catch mode switches in the debugger.

Unfortunately, when I tried adding this line of code in my threads, and setting catch signal SIGXCPU in gdb, the debugger was still not breaking at the mode switches, which are still accumulating according to the stat file.

So, I guess it is not the “magic bullet” I was looking for.

Maybe there is something else I can try? If you have any other suggestions, please let me know.

Take care,

Rick

Stefano · 31 2020 17:55

Hi @rick,
this is weird, we have never run into a situation where MSWs were not detected by gdb in that way.

A couple of suggestions:

Make a deliberate syscall in one of your threads (e.g. write to a file) so that you can verify that you’re properly setting the PTHREAD_WARNSW flag for each thread.
As an alternative way to spot syscalls, you can run the plugin with the dummy frontend. In that case, you would need to specify MIDI notes in the Events section. With the dummy frontend, you can attach strace to one of the RT threads and monitor the syscalls happening there. You won’t be able to access any Xenomai functions with this frontend, though.

rcohen · 1 2020 18:41

Hello @Stefano,

I resolved an issue with the pthread_setmode_np() call, and now I can use gdb to catch the SIGXCPU signal.

I have found that our product has a few mutexes which are currently used across the audio and the UI threads (to prevent concurrent access to shared data). Is there any approved solution for having a mutex like that? Or is the answer simply “don’t do it”?

Rick

Stefano · 1 2020 19:59

Hi Rick!

Sharing mutexes between non-RT and RT threads is a “don’t do it!” on normal desktop OSes as well, because it invites priority inversion.

Alternatives you might consider are message passing through lock-free FIFOs, atomic variables and spinlocks.
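To make the first alternative concrete, here is a minimal sketch of a single-producer/single-consumer lock-free FIFO that a UI thread could use to send parameter changes to the audio thread instead of taking a shared mutex. The names SpscQueue and ParamChange are hypothetical and not part of SUSHI or TWINE. The sketch assumes exactly one producer thread and one consumer thread, and that std::atomic<std::size_t> is lock-free on the target, which is worth verifying.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical message type, for illustration only.
struct ParamChange {
    int   param_id;
    float value;
};

// Single-producer / single-consumer ring buffer.
// Capacity must be a power of two. One slot is always left empty so that
// "full" and "empty" can be told apart, so Capacity - 1 items fit.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Call only from the producer thread (e.g. the UI thread).
    // Returns false instead of blocking when the queue is full.
    bool push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (Capacity - 1);
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Call only from the consumer thread (e.g. the audio thread).
    // Returns false instead of blocking when the queue is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buffer_[tail];
        tail_.store((tail + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // written by the producer only
    std::atomic<std::size_t> tail_{0};  // written by the consumer only
};
```

The audio callback would drain the queue with pop() at the start of each block and never wait on it. If push() returns false, the UI side can retry later or drop the update.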

Stefano · 1 2020 20:14

I recommend checking these two excellent videos from the last ADC + the companion repository for a bag of tricks & solutions possible:

rcohen · 1 2020 21:06

Hello @Stefano,

I see now that my previous success was not real. The breakpoint that was caught earlier (mutex between audio and ui threads) with SIGXCPU was on the main sushi thread.

In my own cobalt threads I have tried adding code with the following, and none of them trigger a break:
malloc/free
fopen/fprintf/fclose
std::cout << writing something to standard out.

And so it seems to me that the SIGXCPU mechanism is not working for my own cobalt threads. Do you think that maybe I need to add more than simply a call to pthread_set_mode_np(0, PTHREAD_WARNSW, nullptr); on each thread?

Also, do you know whether std::atomic might trigger a mode switch?

Thanks, and sorry for all of the questions.

Rick

Stefano · 2 2020 09:04

Hi @rcohen,

That one should be enough. Of course, it has to be called after the thread has started, from within that thread’s own context. Can you check the return status (verify it’s 0) and, if needed, pass a pointer to an int variable as the last argument instead of nullptr, and verify that the mode bit is set in it?

That depends on the type the std::atomic is wrapping and on the architecture. You can check it at runtime with std::atomic_is_lock_free(), or at compile time with std::atomic<T>::is_always_lock_free (C++17). Typically, primitive types such as int and float, and sometimes even structs holding two of them, are lock-free. More complex types are implemented using locks internally, and those will cause mode switches.
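As a rough illustration of that check, the sketch below shows a runtime query and two compile-time queries using only the standard library. The helper names lock_free_at_runtime and int_always_lock_free are made up for this example. The results depend on the compiler and target, so it should be run on the actual board rather than trusted from a desktop build.

```cpp
#include <atomic>

// Runtime check on a concrete object, using the free function
// mentioned above. The member function a.is_lock_free() is equivalent.
template <typename T>
bool lock_free_at_runtime(const std::atomic<T>& a) {
    return std::atomic_is_lock_free(&a);
}

// Compile-time check available since C++11: the ATOMIC_*_LOCK_FREE macros
// are 2 when the type is always lock-free, 1 when it is sometimes
// lock-free, and 0 when it never is.
constexpr bool int_always_lock_free() {
    return ATOMIC_INT_LOCK_FREE == 2;
}

#if __cplusplus >= 201703L
// C++17 compile-time check for an arbitrary type.
constexpr bool kFloatAlwaysLockFree = std::atomic<float>::is_always_lock_free;
#endif
```

For larger user-defined types, some toolchains need the atomic support library linked in (for example -latomic with GCC) before the runtime query will even link, which is itself a hint that the type is not handled with plain atomic instructions.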

No worries! This is an interesting deep discussion, happy to find out what is the culprit.

rcohen · 2 2020 19:11

Hello @Stefano,

That one should be enough. Of course, it has to be called after the thread has started, from within that thread’s own context. Can you check the return status (verify it’s 0) and, if needed, pass a pointer to an int variable as the last argument instead of nullptr, and verify that the mode bit is set in it?

The return status is always 0. I printed out the value of the mode and it is 0x0000. That’s because the mode returned is the previous value of the mode. If I run the setmode command again, then the value returned is 0x4000, which is PTHREAD_WARNSW.

I had tried to add my deliberate syscall test code after the setmode() but before the typical wait() loop in my thread. If, instead, I add my code after the wait loop which contains __cobalt_pthread_cond_wait() [meaning that the thread has been signaled and is finished waiting], then gdb gets the SIGXCPU signal and breaks properly. So, that is a small victory.

I also verified that this signal is hit in gdb if I only call setmode() once, rather than twice.

The next step, of course, is to remove the test code altogether and run again. In this case, the signal is never hit, and if I watch /proc/xenomai/sched/stat I can see the MSW numbers incrementing.

The mystery continues…

Take care,

Rick

rcohen · 6 2020 17:54

Just wondering about this command line switch for sushi:

-m <n>, --multicore-processing=<n>  Process audio multithreaded with n cores [default n=1 (off)].

I have not been setting this, but I’ve been creating my own additional real-time threads. What is the purpose of this option?

Thanks

Rick

Stefano · 6 2020 18:41

Hi Rick,
if you activate this one, SUSHI will schedule parallel tracks on parallel cores automatically. Internally it makes use of the TWINE library.