How should you do concurrency in an embedded no_std
application? There's no built-in support for time-sliced threads in core
; that abstraction is only available in std
(see std::thread
). The latest stable release brought the async/await feature to no_std
thanks to our compiler work. Should you always use that instead? Is cooperative scheduling appropriate for all kinds of embedded applications? What about applications with soft or hard real-time requirements?
If you have asked yourself these questions before then this post is for you! We'll look into a few commonly used embedded concurrency patterns and then into some less common ones. We'll compare their runtime characteristics in practical terms so you get a better feel for which one is appropriate for your application.
Time-sliced threads
This is by far the most widely known and used concurrency pattern. If you have written a concurrent hosted application, an application that runs on top of an Operating System (OS), then you have directly or indirectly used time-sliced threads, as they are a concurrency / multitasking construct that virtually all OSes provide.
In embedded applications threads are commonly used for multitasking with one task mapped to one thread. Embedded devices usually have only one CPU so adding threads does not result in parallelism (parallel execution of threads). Example usage below:
// thread / task 1 uses the radio interface
let (mut rtx, mut rrx) = radio::claim();
thread::spawn(move || loop {
    let mut buffer = [0; 64];
    let n = rrx.recv(&mut buffer);
    // omitted: prepare a response
    rtx.send(&buffer[..n]);
});

// thread / task 2 uses the serial interface
let (mut stx, mut srx) = serial::claim();
thread::spawn(move || loop {
    let mut buffer = [0; 64];
    let n = srx.read(&mut buffer);
    // omitted: prepare a response
    stx.write(&buffer[..n]);
});
In this example two independent tasks are executed concurrently by running each one in a separate thread. In the example the threads do not need to communicate but inter-thread communication is possible via atomics, mutexes or channels.
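For example, the two threads could exchange data over a fixed-capacity channel. The sketch below assumes a hypothetical channel::bounded constructor as well as made-up Request, parse_radio_frame and handle items; it is only meant to illustrate the shape of such code.

// Sketch only: `channel::bounded`, `Request`, `parse_radio_frame` and
// `handle` are made-up names used purely for illustration.
let (tx, rx) = channel::bounded::<Request, 8>();

// thread 1: produces requests
thread::spawn(move || loop {
    let request = parse_radio_frame(); // assumed helper
    tx.send(request); // blocks while the channel is full
});

// thread 2: consumes requests
thread::spawn(move || loop {
    let request = rx.recv(); // blocks until a request arrives
    handle(request); // assumed helper
});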
Apart from familiarity the other big advantage of threads is that you can use existing blocking APIs in threads without modification and still end up with a concurrent / multitasking application.
Currently, most of the existing Hardware Abstraction Layers (HALs) in the embedded crate ecosystem have only blocking APIs. One would then think that threads are the preferred way to do concurrency, as those blocking HALs can be used "as they are" in threaded applications. However, that's not the case; there are no pure-Rust thread runtimes, and although there are Rust bindings to embedded OSes written in C that do provide a thread abstraction, they are not that widely used.
This lack of use of threads in embedded Rust may have to do with their main disadvantage: inefficient stack usage due to each thread having its own separate call stack. A chunk of RAM must be allocated to each thread as call stack space; the chunk must be large enough so that the call stack does not grow beyond its reserved space and end up colliding with the call stack of a different thread.
Memory layout of a multi-threaded program
The above image depicts the memory layout of a multi-threaded program that uses the whole RAM available on an embedded device. At the bottom (smallest memory address) you'll see the .bss+.data section; this section contains statically allocated variables. Right above that section you'll see the heap. The memory allocator uses the heap space to dynamically allocate data like vectors (growable arrays) and (growable) strings. Dynamic memory allocation is optional in embedded applications so the heap may not always be present; when it is present, it is usually assigned a fixed amount of memory. The rest of the memory is used for the call stacks of the threads. In this example the call stacks grow downwards, towards smaller addresses. As you can imagine the stack of T1 could grow large enough to collide with the stack of T2; likewise the stack of T3 could grow too large and collide with the heap. Some architectures provide runtime mechanisms to detect these collisions, but when collisions are not checked at runtime memory corruption can occur.
Embedded devices have limited amounts of RAM, usually in the order of tens of KBs. This greatly limits the number of threads that can be spawned if the stack space allocated to each thread is fixed to a large value like 4 KB. If, alternatively, all the RAM is split equally among N threads, each thread may end up with too little stack space, which increases the risk of stack overflows.
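The thread API used in this post is illustrative; a runtime that exposes the stack reservation might offer a builder-style API along these lines (a sketch, not an existing crate API):

// Hypothetical sketch: a builder-style spawn API that makes the per-thread
// stack reservation explicit; the actual API depends on the thread runtime.
thread::Builder::new()
    .stack_size(4 * 1024) // reserve 4 KB of RAM for this thread's call stack
    .spawn(move || loop {
        // omitted: task body
    });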
In summary:
Pros of threads:
- Popular concurrency model with a familiar API (std::thread)
- Existing blocking code can be made to run concurrently without modification
Cons of threads:
- Inefficient stack usage: one separate call stack per thread. Having more threads results in greater risk of a stack overflow (collision between call stacks)
Interrupt handlers
Another approach to concurrency is to directly use the interrupt handlers. Interrupt handlers are callback functions that are executed by the hardware when some event, internal or external, occurs. Due to their callback nature – the handler runs from start to finish on each new event – interrupt handling code usually ends up containing explicit state machines.
// these functions are interrupt handlers

fn on_usb_events(state: &mut UsbHandlerState) {
    for event in pending_usb_events() {
        // advance the state machine according to current state and
        // the incoming event
        match (state, event) {
            // ..
        }
    }
}

fn on_radio_events(state: &mut RadioHandlerState) {
    // ..
}
The main advantage of using interrupt handlers is fast response times. This is because interrupt handlers are prioritized over code that's not running from interrupt context. This results in preemption: the lower priority code gets suspended while the interrupt handler code runs; once the handler returns the execution of the lower priority code is resumed.
With interrupt handlers external events like USB events can be handled within a few clock cycles of the event occurring. On the other hand, using a thread to handle USB events can result in higher processing latency if the thread needs to compete with other threads for CPU time.
Some architectures support prioritization of interrupt handlers; this means that higher priority interrupt handlers can preempt lower priority ones. This is particularly useful in real-time applications where multiple events need to be handled within a certain amount of time, a deadline, but some of them have tighter deadlines – those event handlers with tighter deadlines need to be prioritized.
After preemption there's no context switching from the higher priority context to the lower priority one. This means that the call stack of the lower priority context remains frozen during the whole execution of the higher priority interrupt handler. Due to this fact, the call stacks of interrupt handlers are laid out next to each other in memory – this happens automatically at the hardware level. Compared to the call stack layout of threads, this results in a much more efficient use of RAM as there are no gaps between the call stacks of the interrupt handlers. The next image illustrates the idea.
Memory layout of a program that uses prioritized interrupts
This is the memory layout of a program using prioritized interrupt handlers. The .bss+.data
and heap sections remain the same as in the multi-threaded program. Call stacks also grow downwards in this case. main
is the non-interrupt context; its call stack starts at the top (highest address) of the RAM region. In the snapshot we observe that interrupt handler I1 has preempted main
and that interrupt handler I2 has preempted I1. In this state the call stack of I2 is allowed to grow downwards but the stacks of I1 and main
remain frozen in size until interrupt handler I2 returns and interrupt handler I1 is resumed.
Frameworks like RTIC (Real-Time Interrupt-driven Concurrency) – previously known as RTFM (Real Time For the Masses) – are built on the idea that applications can be made completely reactive if they consist only of interrupt handling code. RTIC maps its tasks to interrupt handlers, allowing prioritization of tasks with efficient stack usage. RTIC is a multitasking alternative to threads that features task prioritization and low latency handling of events.
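For illustration, an RTIC application with two tasks at different priorities might be declared roughly as follows. The attribute syntax differs between RTIC versions, and some_pac, USBD and RADIO are device-specific placeholders, not real names.

// Sketch of prioritized tasks in an RTIC application; `some_pac`, `USBD` and
// `RADIO` are placeholders and the attribute syntax differs between RTIC
// versions.
#[rtic::app(device = some_pac)]
mod app {
    // omitted: `init` and the resource declarations RTIC requires

    // higher priority task bound to the USB interrupt; it can preempt
    // `on_radio_events`
    #[task(binds = USBD, priority = 2)]
    fn on_usb_events(cx: on_usb_events::Context) {
        // ..
    }

    // lower priority task bound to the radio interrupt
    #[task(binds = RADIO, priority = 1)]
    fn on_radio_events(cx: on_radio_events::Context) {
        // ..
    }
}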
In summary:
Pros of interrupt handlers:
- Minimal processing delay between the occurrence of an event and the execution of its handler, thanks to prioritization
- Hardware level support for prioritization (the number of priority levels is device dependent)
- Efficient stack usage: the call stacks of interrupt handlers are laid out next to each other ("compacted")
Cons of interrupt handlers:
- Non-linear code: explicit state machines
Cooperative tasks (async/await)
The newest concurrency model available in no_std
code is async/await. The async/await feature enables the construction of cooperative tasks: tasks that yield control back to the scheduler at determined points; it is only at these points where context switching between tasks occurs. This is quite different from threads; in threaded systems the scheduler can context switch from one thread to another at any time.
async code looks quite similar to threaded code. Task executors, the schedulers that run async tasks, will usually provide an API that resembles the std::thread
API. This eases the effort required to switch from working with threads to working with async tasks.
Here's the multi-threaded example we saw before ported to async tasks:
// async task 1 uses the radio interface
let (mut rtx, mut rrx) = radio::claim();
executor::spawn(async move {
    //          ^^^^^ async block
    // the `.await` operator can be used within this block
    loop {
        let mut buffer = [0; 64];
        let n = rrx.recv(&mut buffer).await;
        //                           ^^^^^^ potential yield point
        // omitted: prepare a response
        rtx.send(&buffer[..n]).await;
    }
});

// async task 2 uses the serial interface
let (mut stx, mut srx) = serial::claim();
executor::spawn(async move {
    loop {
        let mut buffer = [0; 64];
        let n = srx.read(&mut buffer).await;
        // omitted: prepare a response
        stx.write(&buffer[..n]).await;
    }
});

// runs all previously spawned tasks
executor::run()
In this async variant, operations like Radio.recv
will suspend the calling task until data becomes available on the radio peripheral. A task that suspends itself yields control back to the async scheduler; the scheduler then resumes the next task that can make progress. If none of the tasks managed by the executor can make progress then the processor is put in a low power (sleep) mode until any task can make progress.
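To make the sleep-on-idle behavior concrete, the core loop of such an executor on a Cortex-M device might look roughly like the sketch below; the ready queue and waker plumbing are assumed internals and are omitted.

// Sketch of an executor's main loop; `ready_queue` and `task.poll()` are
// assumed internals, the sleep-on-idle idea is the point here.
loop {
    // poll every task that has been woken since the last iteration
    while let Some(task) = ready_queue.pop() {
        task.poll();
    }

    // no task can make progress: sleep until an interrupt fires
    // (WFI = Wait For Interrupt, Cortex-M specific)
    cortex_m::asm::wfi();
}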
Cooperative tasks have two disadvantages:
Tasks are (usually) executed at the same priority; this means that a single task can use a disproportionate amount of CPU time before yielding; this can lead to delayed processing of events which is undesirable in systems with real-time requirements.
The other disadvantage is that async executors can only spawn asynchronous functions (async fn) as tasks. Although it is possible to call blocking functions (plain fn) from async functions, this can lead to the problem described in the previous paragraph. Thus cooperative multitasking works best when all the code has been designed to be async.
The overall stack usage of cooperative tasks can be hard to visualize so let's work through it in steps. First, to be able to resume a suspended task from where it left off, some state needs to be preserved somewhere in memory. Consider these two async tasks, written as generators:
let t1 = || { // state A
    yield; // state B
    // ^^^^^ explicit yield point
    yield; // state C
    // implicit return
};

let t2 = || { // state A
    let mut x = 0u16;
    yield; // state B
    let y = 42;
    if some_condition(y) {
        x += 1; // <- new value must be preserved across yields
    }
    yield; // state C
    foo(x);
};
t1 has three resume states: A, B and C. Therefore, a 3-variant enum is required to keep track of the task / generator state; that's 1 byte of state. t2 also has three states but additionally holds a u16 integer as part of its state; that's a total of 3 bytes of state.
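Conceptually, the state kept for t2 resembles the following hand-written enum; this is a simplification and the layout the compiler actually generates may differ (e.g. due to alignment or layout optimizations).

// Simplified sketch of the state kept for `t2`; the actual generated layout
// may differ.
enum T2State {
    A,            // not yet started
    B { x: u16 }, // suspended at the first `yield`; `x` lives across it
    C { x: u16 }, // suspended at the second `yield`; `x` is later used by `foo`
}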
Now consider the following memory layout:
Memory layout of a program that runs async tasks
In this example the task states are stored on the stack (but this is not the only option). Those states will remain pinned on the stack for as long as the tasks are alive and managed by the executor. Here the executor manages three tasks and you can see that their states stay on the stack regardless of which task is currently active – the image shows two potential states of the program: T1 active (left) and T3 active (right). Once a task is resumed it may use more stack space due to local variables (like y
in t2
) and function calls (like some_condition
and foo
in t2
).
It is possible to allocate task state in .bss+.data
or the heap. This does not have much effect on overall RAM usage (for example the stack may shrink by X bytes but then .bss+.data
will grow by X bytes); it simply changes the location of the task state.
In summary:
Pros of async tasks:
- Linear code like with threads
- Efficient stack usage: all tasks take turns using the same stack space
Cons of async tasks:
- "Colored" functions:
fn
vsasync fn
- No task prioritization: long-running tasks with no explicit suspension points (e.g. crypto operations) can greatly delay the handling of events
Mixing the patterns
We have discussed three common concurrency models. Each one has its strengths and weaknesses. In this section we'll see how one may mix them to compensate for each model's individual weaknesses.
Kernel-like
Say you want to use Serial emulation over USB (TTY ACM interface) for logging purposes. This requires running a USB event loop that responds to host requests. The USB protocol, in practice at least, has soft real-time requirements: devices need to respond to the host within a few milliseconds or the device may be reset, or even power cycled, and re-enumerated by the host.
Here running the USB event loop as an async cooperative task would be a bad fit because other running tasks may delay the handling of host requests and lead to re-enumeration.
Running the USB event loop in a separate thread is also not a great fit because the time slices need to be short enough to keep the event loop responsive in the presence of other competing threads. Choosing shorter time slices leads to more context switches, which usually increases CPU load (overhead).
If you want to avoid using threads in this use case but still be able to write linear code one option is to run all the USB code from interrupt context as shown below.
The TTY ACM interface supports bidirectional communication between the USB host and the device. For simplicity we only support device-to-host communication in the example API.
// ttyacm.rs (part of the Hardware Abstraction Layer)

/// A handle to the TTY ACM interface
pub struct TtyAcm { /* .. */ }

impl TtyAcm {
    /// Returns a handle to the TTY ACM interface
    /// Panics if called more than once
    pub fn claim() -> Self {
        // ..
    }

    /// Sends `data` to the host
    pub fn write(&mut self, data: &[u8]) {
        // this could either overwrite old data when the buffer is
        // full or could "block" until the buffer gets drained by
        // the interrupt handler
        unsafe { CIRC_BUFFER.write(data) }
    }
}

/// A lock-free circular buffer
static mut CIRC_BUFFER: CircBuffer = CircBuffer::empty();

// this interrupt handler contains the USB event loop
fn on_usb_events(state: &mut UsbHandlerState) {
    // handles enumeration as a TTY ACM device
    // drains CIRC_BUFFER and sends the data to the USB host
}
Here on_usb_events runs from interrupt context and is prioritized over TtyAcm, which runs from non-interrupt context. TtyAcm.write pushes data into a lock-free circular buffer; on_usb_events drains the buffer when the USB host asks for data.
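Below is a heavily simplified sketch of such a circular buffer. It glosses over the buffer-full case and the soundness details of sharing the buffer between the two contexts, but it shows why no lock is needed: there is exactly one producer (TtyAcm.write) and exactly one consumer (on_usb_events), each advancing only its own index.

use core::sync::atomic::{AtomicUsize, Ordering};

// Heavily simplified sketch; not a sound, production-ready implementation.
pub struct CircBuffer {
    buffer: [u8; 256],
    read: AtomicUsize,  // only advanced by `on_usb_events`
    write: AtomicUsize, // only advanced by `TtyAcm::write`
}

impl CircBuffer {
    // called from `TtyAcm::write` (non-interrupt context)
    pub fn write(&mut self, data: &[u8]) {
        let mut write = self.write.load(Ordering::Relaxed);
        for &byte in data {
            // omitted: handle the buffer-full case (overwrite or "block")
            self.buffer[write % 256] = byte;
            write += 1;
        }
        // publish the new write index to the interrupt handler
        self.write.store(write, Ordering::Release);
    }

    // called from `on_usb_events`: returns the next byte to send, if any
    pub fn pop(&mut self) -> Option<u8> {
        let read = self.read.load(Ordering::Relaxed);
        if read == self.write.load(Ordering::Acquire) {
            None // buffer is empty
        } else {
            let byte = self.buffer[read % 256];
            self.read.store(read + 1, Ordering::Release);
            Some(byte)
        }
    }
}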
Example application:
// app.rs
fn main() {
    let mut logger = TtyAcm::claim();

    logger.write(b"Hello, world!\n");

    // ..
}
The application code remains linear even though there's a complex USB state machine running from interrupt context under the hood. Running the USB event loop from interrupt context makes it easier to meet the soft real time requirements of the USB protocol.
This API is architected like a traditional OS where the low level USB handling code runs in kernel space, where it's prioritized over userspace processes, and the high level USB application code runs in userspace. In OS implementations, it is common for the userspace USB API to do a context switch into kernel space, through a system call, on every USB operation. In this particular case TtyAcm.write
does not need to context switch into interrupt context or disable interrupts ("claim a lock on the circular buffer") to work correctly (be memory safe).
async flavored
In the previous USB example TtyAcm.write
was a blocking function, but it is also possible to use this pattern to write async APIs. One good example of an async variation of this pattern is an async USB HID API.
A minimal HID API consists of two operations: sending and receiving HID reports. An HID report is a packet of data which may or may not include a report ID (a single byte) as its header. These reports are limited in size to what the USB device reported in its endpoint descriptor.
The async API may look like this:
// hid.rs (part of the Hardware Abstraction Layer)

/// Returns a handle to the HID interface
/// Panics if called more than once
pub fn claim() -> (HidIn, HidOut) {
    // ..
}

/// Buffer (array) that holds an HID report
pub struct Report { /* .. */ }

// omitted: `Report` implements `Deref<Target = [u8]>`

/// HID handle for sending IN (Device -> Host) reports
pub struct HidIn { /* .. */ }

impl HidIn {
    /// Sends the specified `report` to the host
    pub async fn send(&mut self, report: &Report) {
        // communicates with `on_usb_events` using atomics, MMIO
        // registers, locks, etc.
    }
}

/// HID handle for receiving OUT (Host -> Device) reports
pub struct HidOut { /* .. */ }

impl HidOut {
    /// Receives a report from the host and writes its contents into
    /// the given `buffer`
    pub async fn recv(&mut self, buffer: &mut Report) {
        // communicates with `on_usb_events` using atomics, MMIO
        // registers, locks, etc.
    }
}

// interrupt handler
fn on_usb_events(state: &mut UsbHandlerState) {
    // handles enumeration as a TTY ACM + HID device;
    // drains CIRC_BUFFER; communicates with Hid{In, Out}
}
As before, we do the bulk of the USB protocol handling in the on_usb_events interrupt handler. The handler can continue to support the TTY ACM interface, so you'll be able to use both TtyAcm and Hid{In,Out} from the same application, even from different async tasks. Example usage below:
let (mut tx, mut rx) = hid::claim();
let mut logger = TtyAcm::claim();

executor::block_on(async {
    let mut report = Report::empty();
    loop {
        rx.recv(&mut report).await;
        // omitted: format the report length into the log message
        logger.write(b"new HID report\n");
        // omitted: perform some action based on the request
        // omitted: then prepare a response in the `report` buffer
        tx.send(&report).await;
    }
})
The implementation of the recv method is much less complex than on_usb_events. This method will notify on_usb_events ("the kernel") that it expects data in buffer and then yield to suspend the calling task. Eventually on_usb_events notifies back that host data was copied into buffer; this wakes up the task that was awaiting on HidOut.recv.
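To give a rough idea of how this works under the hood, recv could be built on a hand-written future along the lines of the sketch below. The RecvFuture type, the out_report_ready and register_recv_request helpers, and their exact synchronization with on_usb_events are assumptions made purely for illustration.

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

// future returned by `HidOut::recv` (sketch)
struct RecvFuture<'a> {
    buffer: &'a mut Report,
}

impl Future for RecvFuture<'_> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // `out_report_ready` and `register_recv_request` are assumed helpers
        // that synchronize with `on_usb_events` (atomics, critical sections)
        if out_report_ready() {
            // the interrupt handler has already copied the OUT report into
            // the buffer; the task can resume
            Poll::Ready(())
        } else {
            // tell "the kernel" where the report should be written and how
            // to wake this task once the data has arrived
            register_recv_request(&mut *self.buffer, cx.waker().clone());
            Poll::Pending
        }
    }
}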
async & long running operations
Let's say you'd like to write the main logic of your application as a few cooperative tasks using async / await, but your app logic involves expensive cryptographic operations and you would like to avoid those "hogging" the executor's time, as that would degrade the overall responsiveness of your application. How can you avoid this issue?
// omitted: spawn some other tasks

let (mut tx, mut rx) = hid::claim();
executor::block_on(async {
    loop {
        let mut packet = Packet::new();
        rx.recv(&mut packet).await;
        if is_certain_request(&packet) {
            // `crypto` takes several milliseconds to complete
            let resp = crypto(packet);
            tx.send(&resp).await;
        } else {
            // ..
        }
    }
});
Chunk it
One option is to split your long crypto operation into chunks of roughly the same size and yield at the end of each chunk. So if your crypto function looked like this:
fn crypto(mut input: Packet) -> Packet {
    for _ in 0..N_ITERATIONS {
        // do math
    }
    input
}
Its chunked version may look like this:
async fn chunked_crypto(mut input: Packet) -> Packet {
    for i in 0..N_ITERATIONS {
        // do math
        // but, say, every 10 loops ..
        if i % 10 == 0 {
            // .. we force it to yield
            Yield::new().await
        }
    }
    input
}
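The Yield helper used above is not part of core; a minimal version of it could look like this:

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

struct Yield {
    yielded: bool,
}

impl Yield {
    fn new() -> Self {
        Yield { yielded: false }
    }
}

impl Future for Yield {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // ask to be polled again right away so the task resumes after
            // other pending tasks have had a chance to run
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}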
And you would need to update the application code to use the chunked async version:
-let resp = crypto(packet);
+let resp = chunked_crypto(packet).await;
Although this would work, it has the huge disadvantage that you need to write an async version of an existing function; if that function comes from a third-party crate you'll end up duplicating third-party code. It may also not be obvious how to chunk the function into equally sized pieces or how to make sure that each chunk completes within, say, one millisecond.
Thread it
Another option is to execute long running code in a time-sliced thread. Time-sliced threads are designed to context switch at a certain time interval so they already do the chunking we want. We would need the executor to provide the functionality to run an arbitrary closure in a separate thread.
// executor.rs

/// runs `f` in a separate thread; returns a future that will yield at
/// the specified `interval`
pub async fn blocking<T, F>(interval: Duration, f: F) -> T
// NOTE these bounds have not been audited and may not be
// sufficient for memory safety
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // ..
}
But once that's in place you can use this blocking
function to transform non-async, long running functions into time-sliced async ones.
-let resp = crypto(packet);
+let resp = executor::blocking(Duration::from_millis(1), move || {
+ crypto(packet)
+}).await;
What the blocking function does is context switch into thread mode (a separate call stack) and start executing the crypto function. After running crypto for 1 millisecond (the specified interval) the thread is suspended and a context switch back to the async executor occurs. Back in the executor context the blocking(/* .. */).await expression yields, its task gets suspended and the executor proceeds to execute any other pending task. After processing all pending tasks the executor resumes blocking(/* .. */).await, which context switches back into thread mode to continue the execution of the crypto function. This process repeats until the crypto function is complete; at that moment the blocking(/* .. */).await expression returns the output of the crypto function.
This approach does bring back the issue of requiring separate call stacks for the time-sliced threads, which increases the risk of stack overflows. However, since the time-sliced threads will be used for a specific purpose rather than as a general-purpose concurrency construct, you could limit their number to a single thread and have all blocking operations be queued for execution onto that single time-sliced thread. This way you only need two stacks: one for the async tasks and one for the time-sliced thread.
Furthermore, since you are likely only going to run a few functions on this separate thread it becomes more feasible to perform stack usage analysis on those few functions. Doing the analysis lets you make an informed decision about how much space to allocate for the call stack of the thread.
Deprioritize it
Yet another option is to support at least two priority levels in the async executor. With this approach long running operations can be executed as lower priority tasks; this way they won't block normal priority tasks. Let's see what this means with an example.
Let's say the executor gains the following function to run a task at lower priority.
// executor.rs

/// runs `f` as a lower priority task
pub async fn lowpriority<T, F>(f: F) -> T
// NOTE these bounds have not been audited and may not be
// sufficient for memory safety
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // ..
}
Now we can update the example to run the crypto
function as a low priority task.
-let resp = crypto(packet);
+let resp = executor::lowpriority(move || crypto(packet)).await;
This is what will happen: lowpriority(/* .. */).await will immediately yield and suspend the current task. Then the executor will resume any pending normal priority task. When no normal priority task can make any more progress the executor will start executing the low priority crypto task. If at any point during the execution of the crypto task a normal priority task becomes pending, say due to an external event or because a DMA transfer finished, then the crypto task will be suspended to prioritize the execution of the normal priority tasks; when all those are done the crypto task resumes execution. This process continues until the crypto task finishes; at that moment the lowpriority(/* .. */).await expression returns the output of the crypto task.
With this approach no time-sliced thread or second call stack is needed; this reduces the risk of stack overflows. It does, however, require an async executor that supports task prioritization.
Conclusion
We have seen many alternatives for building concurrent applications. Each pattern has its pros and cons. There are many tradeoffs in terms of code complexity, real-time suitability (timeliness) and stack usage that you would need to evaluate against the requirements of your application to be able to make an informed decision.
Given that this is just a short blog post we have not covered all concurrency patterns out there; for instance RTIC lends itself quite well to the actor / CSP (Communicating Sequential Processes) model but we have not covered that in this post. Also, we have only very superficially covered stack usage and timeliness. Those two non-functional properties require full program analysis to be able to draw confident conclusions.
If you're looking for support building the (async) HAL or async executor that your application needs to meet its multitasking and power savings requirements, contact us at Ferrous Systems. We're a Rust consultancy based in Berlin and will be happy to help. We also offer remote trainings on general Rust and embedded Rust, as well as rust-experts, a lightweight consultation service where you can ask us questions about (embedded) Rust. Contact us for a quote and consider subscribing to our Trainings newsletter!