Embedded RTOS performance testing
March 26, 2024

It's time to put the RTOSes head to head and see how they perform. We'll measure elementary operations such as context switching under no load, task periodicity and return from interrupt. At the end, each of them will face a workload that keeps the CPU busy: a real application.

After reviewing the theoretical side of RTOS selection, it is time for the more practical part. All tests were performed on the STM32F429-DISC1 with the main clock set to 84 MHz. Most of the RTOSes were integrated into CubeIDE using the appropriate middleware provided by ST. Zephyr, due to its peculiarities and lack of support in CubeIDE, was run separately.
Measurements were made using applications designed to analyze the performance of such systems: SystemView for FreeRTOS, embOS and Zephyr, and TraceX for ThreadX.

Test #1 - Context switch

According to the literature on RTOSes, the first important benchmark in analyzing the performance of a system is the time it takes to switch context. This directly answers the question: how long does the scheduler of a given RTOS have to run to do its core job? It will also be interesting to see whether the number of tasks makes a difference.

To start with, a simple test:
+ tasks are empty;
+ preemption enabled;
+ SysTick set to generate interrupts every 1 ms;

Switching times were measured, with the above configuration, for:
+ 2 tasks;
+ 5 tasks;
+ 10 tasks;

Measurement results

Observations
There are surprisingly large differences between ThreadX/Zephyr and embOS, which takes almost 6x longer to make a context switch. That is a warning sign. Time to see whether this problem grows with the number of tasks and the system load.

Test #2 - Periodicity

In hard real-time applications, starting tasks and responding on time are the most important aspects of the system.
In a "soft" real-time system, we can only count on the system to put a task into the ready queue as soon as possible.
In this test, we are looking for the answer to the question "How close to the set deadline does the task start executing?".

Settings:
+ task is empty;
+ preemption enabled;
+ SysTick set to generate 100 ms cycles;

Results:

Periodicity of task calls

Observations

As you can see, the RTOS error oscillates around 0.1%, which is small enough for the inaccuracy to be considered negligible. Interestingly, embOS again deviates; the value is not large, but it leaves a bad impression. When SysTick is changed to 10 ms, this error increases proportionally (by almost an order of magnitude). This suggests that the overhead added by the RTOS is approximately constant in absolute terms, so its share grows as the period shrinks.
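
To see why a constant overhead turns into a proportionally larger error at shorter periods, here is a minimal sketch of the arithmetic. The 100 µs offset used in the note below is purely illustrative (it is simply 0.1% of 100 ms, not a measured value), and period_error_percent is a hypothetical helper, not part of any RTOS API.

```c
#include <assert.h>
#include <stdint.h>

/* Relative period error in percent: |measured - nominal| / nominal * 100.
 * Both arguments are in microseconds. Hypothetical helper for illustration. */
static double period_error_percent(uint32_t nominal_us, uint32_t measured_us)
{
    assert(nominal_us != 0u);

    /* Absolute difference without relying on signed arithmetic. */
    uint32_t diff_us = (measured_us >= nominal_us)
                           ? (measured_us - nominal_us)
                           : (nominal_us - measured_us);

    return 100.0 * (double)diff_us / (double)nominal_us;
}
```

With a nominal period of 100 000 µs and a measured one of 100 100 µs this gives 0.1%; the same 100 µs offset on a 10 000 µs period gives 1%, i.e. ten times more.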

Test #3 - Context switch from ISR

Another thing worth checking is the time spent leaving an interrupt. This is important because interrupts are not part of the RTOS kernel; they are handled outside its control, and each of the tested systems handles ISR entry slightly differently.
To create test conditions, you can use a semaphore and an external interrupt as follows:

  • The RTOS initializes the semaphore as taken;
  • The main task starts and blocks waiting for the semaphore to be released;
  • An external interrupt (in our case a button press) releases the semaphore;
  • The task takes the semaphore;

We measure the time it takes to release the semaphore and change the context. Below is the idea:

FreeRTOS

The FreeRTOS API provides separate interrupt-safe versions of frequently used functions. They can be recognized by the suffix "FromISR". The suffix tells the kernel that the function was called from an interrupt, allowing it to react accordingly. Using the standard version of a function in an interrupt generally causes errors, most often a HardFault. When a semaphore is given from an interrupt, the API also requires an explicit call that tells the kernel the interrupt is being left and requests a context switch if a higher-priority task has been unblocked. The code is attached below, as is the visualization of the result in SystemView (markers are handy for measurement).

void HAL_GPIO_EXTI_Callback( uint16_t GPIO_Pin )
{
    SEGGER_SYSVIEW_MarkStart( 1 );
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xSemaphoreGiveFromISR( LED_semaphore, &xHigherPriorityTaskWoken );
    SEGGER_SYSVIEW_MarkStop( 1 );
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

And this is how the results look in SystemView:

When releasing a semaphore in an interrupt, FreeRTOS actually calls the function xQueueGiveFromISR, which checks such things as:
+ whether the queue is not full,
+ whether a task is blocked waiting on it,
+ whether the context will have to be switched.
FreeRTOS introduces the concept of a priority limit: its functions must not be called from interrupts with a priority above that limit, and this condition is checked at the beginning of the function. Before exiting, xQueueGiveFromISR also checks whether a task with a higher priority than the current one has been unblocked in the meantime, which would force a context switch. In addition, any kernel function called during an interrupt must also be the -FromISR variant.
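
The decision logic described above can be sketched as a small host-side model. This is a simplified illustration, not FreeRTOS source code: fr_model_sem_t and fr_model_give_from_isr are hypothetical names, and the model assumes that a give is rejected when the semaphore already holds its maximum count and that a larger number means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a semaphore; not FreeRTOS code. */
typedef struct {
    uint32_t count;        /* tokens currently available */
    uint32_t max_count;    /* capacity (1 for a binary semaphore) */
    bool     has_waiter;   /* a task is blocked waiting for a token */
    uint32_t waiter_prio;  /* priority of that blocked task */
} fr_model_sem_t;

/* Models a "give from ISR". Returns true if the give succeeded.
 * *switch_needed is only ever set to true, never cleared, so the caller
 * initializes it to false (like xHigherPriorityTaskWoken = pdFALSE). */
static bool fr_model_give_from_isr(fr_model_sem_t *s, uint32_t running_prio,
                                   bool *switch_needed)
{
    assert(s != NULL && switch_needed != NULL);

    if (s->count >= s->max_count) {
        return false;                  /* already full: nothing to give */
    }
    s->count++;                        /* token becomes available */

    if (s->has_waiter) {
        s->has_waiter = false;         /* blocked task becomes ready */
        if (s->waiter_prio > running_prio) {
            *switch_needed = true;     /* woken task outranks interrupted one */
        }
    }
    return true;
}
```

In the article's test the semaphore starts taken (count 0) with the main task blocked on it, so the first give succeeds and reports that a switch is needed; a second give without an intervening take is rejected.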

ThreadX

Microsoft's RTOS (more recently under the Eclipse Foundation) offers a much simpler approach to handling interrupts. The API doesn't require informing the kernel of anything; it is a simple function call, just as during normal RTOS operation. TraceX didn't require any markers either, as it directly reported the time spent handling the interrupt.

void HAL_GPIO_EXTI_Callback( uint16_t GPIO_Pin )
{
    tx_semaphore_put( &LED_semaphore );
}

And this is how the results look in TraceX:

At the start of the function, interrupts are disabled using a macro. This is followed by conventional semaphore handling: incrementing the count and checking whether a task is blocked. Before the function exits, interrupts are turned back on.
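
The disable, update, restore pattern can be illustrated with a small host-side model. It is only a sketch under stated assumptions, not ThreadX source: tx_model_irq_enabled stands in for the CPU interrupt-enable state, all tx_model_* names are hypothetical, and the model assumes that a token goes straight to a blocked task when there is one and is otherwise added to the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the CPU interrupt-enable state (hypothetical). */
static bool tx_model_irq_enabled = true;

/* Disable interrupts and return the previous state so it can be restored. */
static bool tx_model_irq_save(void)
{
    bool previous = tx_model_irq_enabled;
    tx_model_irq_enabled = false;
    return previous;
}

/* Restore the interrupt state captured by tx_model_irq_save(). */
static void tx_model_irq_restore(bool previous)
{
    tx_model_irq_enabled = previous;
}

/* Semaphore "put" wrapped in a critical section.
 * Returns true when a blocked task was made ready. */
static bool tx_model_sem_put(uint32_t *count, bool *task_blocked)
{
    assert(count != NULL && task_blocked != NULL);

    bool woke_task = false;
    bool state = tx_model_irq_save();   /* interrupts off */

    if (*task_blocked) {
        *task_blocked = false;          /* hand the token to the waiter */
        woke_task = true;
    } else {
        (*count)++;                     /* nobody waiting: bank the token */
    }

    tx_model_irq_restore(state);        /* back to the caller's state */
    return woke_task;
}
```

Restoring the saved state, rather than unconditionally enabling interrupts, keeps the function safe to call from code that already runs with interrupts masked.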

embOS

The embOS approach is relatively similar to that used by FreeRTOS. The kernel needs to know that a function is called from an interrupt. The API provides special functions for this purpose, which must surround the RTOS function calls. Omitting them causes problems similar to those in FreeRTOS.

void HAL_GPIO_EXTI_Callback( uint16_t GPIO_Pin )
{
    SEGGER_SYSVIEW_MarkStart( 1 );
    OS_INT_Enter();
    OS_SEMAPHORE_Give( &LED_semaphore );
    SEGGER_SYSVIEW_MarkStop( 1 );
    OS_INT_Leave();
}

And this is how the results look in SystemView:

embOS uses OS_INT_Enter, which invokes a series of macros that:
+ check if the kernel is initialized,
+ set a flag indicating that control has passed to an interrupt,
+ disable low-level interrupts and enable high-level ones.

Zephyr

Zephyr approaches the ISR issue in a very similar way to ThreadX - kernel functions can be called directly in an interrupt.

void ISR_callback( const struct device *dev, struct gpio_callback *cb, uint32_t pins )
{
    k_sem_give( &LED_semaphore );
}

Zephyr and SystemView:

In the graphic above, the end of the darker rectangle in ISR22 is the moment k_sem_give is called; the time is counted from that moment until thread_a starts. As you can see, Zephyr "lost" a few microseconds by switching to Idle for a moment.

Zephyr gives functions the isr-ok attribute if they can be used inside an interrupt. The k_sem_give implementation itself is based on a so-called spinlock, so on entry to the function it is guaranteed that the caller will not be preempted or interrupted.

Context switch from ISR - summary

Context switch from ISR

The trend, as you can see, is in line with the results of the first test: ThreadX leads, followed by Zephyr, with FreeRTOS far behind and embOS last. On top of the time required for the basic context switch, i.e. scheduler handling, there is now the time spent in the interrupt releasing the semaphore. Comparing with the earlier results, we find that for FreeRTOS the time increased by about 80 µs, and for Zephyr, ThreadX and embOS by about 30 µs.

As for RTOS vs interrupts, the smallest changes relative to "normal" code occur in ThreadX. This has a positive effect on speed, but it can also have negative effects, because disabling interrupts prevents nested higher-priority interrupts from being handled. Zephyr behaves similarly by taking a spinlock. FreeRTOS tries not to miss any interrupts, but the fact that every function called has to be a -FromISR variant can cause delays to add up. embOS seems to sit in between, disabling low-level interrupts and making sure the high-level ones are still serviced.

Test #4 - Real life application

After the previous three tests, which covered heavily isolated cases, it's interesting to see how the RTOSes behave in a situation closer to a real-world workload.

Having examined the performance of four popular RTOSes in terms of context switching, task periodicity and return time from interrupts, it's time to test with an application. Here we will use a fast Fourier transform (FFT) with a computational complexity of O(n log n), which puts a significant load on the microcontroller (even with the DSP module). We send the result of the transform from a second task to the UART and preview it on a logic analyzer.

Configuration details:
+ library for FFT calculations - ARM CMSIS-DSP;
+ 12-bit ADC with a sampling rate of 1 kHz (the sampled signal comes from a generator);
+ UART at 115200 b/s, classic setting;

The task responsible for UART waits for the queue to receive the frequency calculated by FFT_task. When this happens, the value is sent raw to the UART. 

void FFT_task_entry()
{
    uint32_t received_from_ADC;
    uint32_t maxFFTValueIndex = 0;
    float32_t maxFFTValue = 0;
    uint16_t freqBufferIndex = 0;

    //sampling frequency as set in .ioc file - 84MHz / 8400 / 10 = 1 kHz
    float32_t f_s = APB1_CLOCK / ( htim3.Init.Prescaler + 1 ) / ( htim3.Init.Period + 1 );

    //prepare frequency value (y axis) for each sample
    FFT_init( f_s );

    //initialization function for the floating-point real FFT; FFTHandler is provided by library
    //number of samples must be 2^n
    arm_rfft_fast_init_f32( &FFTHandler, FFT_SAMPLES );

    while (1)
    {
        queue_get( ADC_queue, &received_from_ADC );
        HAL_GPIO_TogglePin( LD3_GPIO_Port, LD3_Pin );

        FFT_input[ freqBufferIndex ] = (float32_t)received_from_ADC;
        freqBufferIndex++;

        if( freqBufferIndex >= FFT_SAMPLES )
        {
            SEGGER_SYSVIEW_MarkStart( 0 );
            HAL_GPIO_WritePin( FFT_pin_GPIO_Port, FFT_pin_Pin, GPIO_PIN_SET );
            freqBufferIndex = 0;

            //processing function for the floating-point real FFT; outputs complex values
            arm_rfft_fast_f32( &FFTHandler, FFT_input, FFT_output, 0 );

            //calculate complex magnitude
            arm_cmplx_mag_f32( FFT_output, freqTable, FFT_SIZE );

            //retrieve max value to obtain base frequency
            arm_max_f32( freqTable + 1, FFT_SIZE - 1, &maxFFTValue, &maxFFTValueIndex );
            baseFreq = freqOrder[ maxFFTValueIndex ];

            queue_put( UART_queue, (void*)&baseFreq );
            HAL_GPIO_WritePin( FFT_pin_GPIO_Port, FFT_pin_Pin, GPIO_PIN_RESET );
            SEGGER_SYSVIEW_MarkStop( 0 );
        }
        delay(1);
    }
}

FFT_task waits for the DMA to queue a ready measurement of an external signal and then places it in a buffer. When the buffer holds enough samples for the calculation, functions from the ARM library calculate the frequency of the signal. This, in turn, is placed in the queue on which UART_task blocks.
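
The step from a magnitude spectrum to a frequency can be sketched in plain C. This is an illustration only, assuming the usual relation f = k * f_s / N between bin number k, sampling rate f_s and transform length N; bin_to_hz and peak_bin are hypothetical helpers and do not reproduce the article's FFT_init or the CMSIS-DSP calls.

```c
#include <assert.h>
#include <stddef.h>

/* Frequency in Hz of FFT bin k for an n-point transform sampled at fs_hz. */
static float bin_to_hz(size_t k, float fs_hz, size_t n)
{
    assert(n != 0u);
    return (float)k * fs_hz / (float)n;
}

/* Absolute bin number of the largest magnitude, skipping bin 0 (the DC
 * component). Requires at least two bins. */
static size_t peak_bin(const float *mag, size_t len)
{
    assert(mag != NULL && len >= 2u);

    size_t best = 1u;
    for (size_t k = 2u; k < len; k++) {
        if (mag[k] > mag[best]) {
            best = k;
        }
    }
    return best;
}
```

For example, with an 8-point transform sampled at 1 kHz, a peak in bin 3 corresponds to 375 Hz. One detail to keep in mind: arm_max_f32 reports an index relative to the pointer it is given, so when the search starts at element 1 (as in the task code), whether 1 has to be added depends on how the frequency lookup table is laid out, which the listing does not show.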

void HAL_ADC_ConvCpltCallback( ADC_HandleTypeDef * AdcHandle )

The way the RTOS API is used in interrupt handling differs between systems.
In addition, we added a pin toggle in the code to check individual results on the logic analyzer (Saleae Logic).

Measurement results.

RTOSes vs FFTs

Observations

Analysis of the results from SystemView compared to the data from the logic analyzer shows a discrepancy of no more than 2%.
We take the analyzer as a benchmark because of its ability to accurately determine the start and end times of measurements.

ThreadX and Zephyr again stand out as leaders, although their advantage is small.
It seems that in a real application, when the system is under load, RTOS kernel operations have a marginal impact on the performance of the entire system.