Debug story
July 6, 2024

You're programming quietly, until suddenly the Window of Death pops up.

You'll probably look through recent code changes and find something suspicious. But if you still don't know what might have caused our problem, we'll be left with good old classic debugging.

Let's forebug, then, all the way to the source of our problem.

Let's pay attention - where we stopped:

It makes sense because, Ozone automatically set breakpoints on faults (Vector Catch).

In Call Stack, if we are lucky, we will see which function was called earlier.

Double click on it and we are probably where the error originated.

That is, we are knocking ourselves out here:

The address of the instruction is 0x205AE. This was reported to us by the Window of Death at the very beginning:

It comes in handy when, for example, for some reason we do not have full information in the Call Stack.

W Dissasembly od razu widać, że ładujemy tu do rejestru R0 wartość spod adresu (R3+R0<<2), która pewnie jest nieodpowiednia. Niewiele to mówi, ale ponieważ wartość rejestrów raczej się nie zmieniła jeszcze, to możemy sprawdzić wartości w kontekście języka C. Local Data mówi, że nasz log_type to zdecydowanie podejrzana wartość.
Tym bardziej, że Quick Watch na tabeli pokazuje rozmiar 4:

Well the question is where do we get this strange value in 'log_type'?

Via Call Stack we jump into the "higher"/previous function. And what we see here:

Well, first by analyzing the code:
our problematic parameter is taken from data derived from ctx which comes from backend->cb.
Well, unfortunately, the backend itself is already out of scope. 

Well, now we have two options (besides using Trace):

  1. You can still jump up like this to find out where this backend is coming from and who is passing it here by analyzing the code;
  2. You can set a breakpoint at this point and let go all over again. It's best to turn on the breakpoint just before the error occurs if the function is used frequently (I knew the shell command was causing the error). Script-wise you can probably also turn on breakpoint when some other one triggers. You can also use breakpoint conditions. Let's only stop if ctx == 0x0000 0000 (because it's a suspicious value);

We put breakpoint, right click and Edit Breakpoint:

We call out the cause of the error and wait.

Now you can actually see that we are passing some abstractly large data->log_format_current.
Probably because our ctx at this point is "0". Could it be that someone/something overwrote our backend->cd->ctx?

Now all you have to do is set the data breakpoint for each write to address 0x2000 0B9C and let go all over again.

Right click on this ctx and Set Data Breakpoint:

For global variables it would probably be enough, but Ozone sometimes gets lost here, so we switch to manual - we could have entered the address manually, but in this clever way Ozone will do it for us:

Well: reset and from scratch.
And the first stop:

You would expect that the structure is initialized at some point, so so far everything is ok.
We continue debugging.
If more often correct values are typed into our cb->ctx, let's call it, and every now and then we get stuck because of typing a "correct" value, we just need to re-edit our data breakpoint so that it catches only when "0" is typed:

After calling the cause of the error, we will stop just after "0" overwrites the previous correct value of cb->ctx:

Searching the Call Stack backwards, we will quickly come to a place where the context is gleefully directly given as NULL.

Voila!
The cause of the crash found... Time to look for the culprit ;) 

And this could be completed, but the same story when using VSC "nRF Connect" is no longer so simple.
Debugging in VSC our program will stop in a completely different place:

Long story short: this is a secondary bug, caused by the handling of our source bug.

The cause of our crash can no longer be found in CALL STACK, or even by analyzing the baktrace (-exec bt in DEBUG CONSOLE).
The reason is that by default the Zephyr environment sets breakpoints in the part of the code already after Zephyr error handling.
This works great in most cases, but here the error is in the logging mechanism that is used to handle the original error, so by trying to display the error description in the log we cause another error.

So let's include the early warning mechanism that was so useful with Ozone.

We add a breakpoint on the z_arm_fault function call in the BREAKPOINTS section:

You can also via DEBUG CONSOLE: -exec break z_arm_fault

And after restarting, we will stop right after the first, source problem. 

From here, via CALL STACK, we've already reached the first symptom: a large log_type value in the log_format_func_t_get() function.

Next, you already know the story.

But, honestly, it's more convenient to set traps in Ozone.