[PART 2/3] A Rusty Module

In the first part we configured and built a Rust-enabled kernel, and had a look at how rustc and related tools are integrated into the kernel build system. Now that we’re all set up we can finally hack on some in-kernel Rust code.

To illustrate the path to writing a basic module in Rust we will implement a simple sysctl that can be used to set or flip a bit.

Starting Point

Even at this point, before Rust support has been merged, there is no shortage of code examples. The Rust-for-Linux tree itself contains numerous modules under samples/rust/ that demonstrate how Rust code interfaces with a wide range of kernel facilities, from firewalling to filesystems.

Upstream also provides a small out-of-tree module which doesn’t do much aside from allocating a Vec and emitting a printk from the module’s Drop implementation. This will serve as the starting point for our further exploration of kernel-side Rust.

Modules

A module is declared using the proc-macro module!() or one of its special-purpose variants, e. g. module_fs!() for filesystems.

module! {
    type: OxiMod,
    name: "oximod",
    author: "The Man from Ox",
    description: "Oxidize the Kernel",
    license: "GPL",
}

struct OxiMod;

Here type is the name of a struct that implements the trait kernel::Module, which requires just one function, init(), similar to how module_init() defines the setup function for modules written in C. Shouldn’t there be a corresponding exit() function? Not necessarily: the Rust for Linux authors chose the more “rustic” approach of using the Drop trait as the canonical way of executing code when the module is about to be removed. This maps nicely to the requirement that the function passed to module_exit() be infallible.

impl kernel::Module for OxiMod {
    fn init(name: &'static CStr, _module: &'static ThisModule) -> Result<Self> {
        pr_info!("Hello world, I am {}!\n", name);
        Ok(OxiMod)
    }
}

impl Drop for OxiMod {
    fn drop(&mut self) { pr_info!("Bye!\n"); }
}
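
To make the init()/Drop pairing more tangible, here is a minimal userspace sketch of the same lifecycle. The Module trait below is a hypothetical stand-in for kernel::Module, the i32 error type and the EXITS counter are assumptions made for illustration, and println!() takes the place of pr_info!().

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how often the "exit" path ran; stands in for the printk in Drop.
static EXITS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical userspace analogue of `kernel::Module`: a single fallible
/// constructor, no dedicated exit function.
trait Module: Sized {
    fn init(name: &'static str) -> Result<Self, i32>;
}

struct OxiMod {
    name: &'static str,
}

impl Module for OxiMod {
    fn init(name: &'static str) -> Result<Self, i32> {
        println!("Hello world, I am {}!", name);
        Ok(OxiMod { name })
    }
}

impl Drop for OxiMod {
    // `drop()` returns `()`, so the cleanup path has no way to report failure.
    fn drop(&mut self) {
        println!("Bye from {}!", self.name);
        EXITS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Models insmod followed by rmmod and returns how many times `drop()` ran.
fn load_and_unload() -> usize {
    let before = EXITS.load(Ordering::SeqCst);
    let module = OxiMod::init("oximod").expect("init failed");
    drop(module); // the moral equivalent of rmmod
    EXITS.load(Ordering::SeqCst) - before
}
```

Because drop() cannot return an error, the type system enforces the same infallibility that module_exit() merely expects by convention.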

At the Rust end, kernel modules behave like crates. The code forming part of the module is compiled as one translation unit even if it comprises multiple files. This has consequences for symbol visibility as well: pub(crate) implies module scope while an unqualified pub … results in a compile error:

error: unreachable `pub` item
 --> /root/src/linux/demo-sysctl/other.rs:6:5
  |
6 |     pub fn new() -> Self {
  |     ---^^^^^^^^^^^^^^^^^
  |     |
  |     help: consider restricting its visibility: `pub(crate)`

IOW, the function isn’t being exported so it is useless to declare it with global visibility. (pub use other::Foo; in the module root would export the symbol, but to whom?)
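
A small sketch of what the compiler accepts instead, assuming a file-level module named other with a type Foo as in the error message above. The field bit and the helper make_foo_bit() are made up for the example, and the lint is enabled explicitly here because a plain userspace build does not turn it on by default.

```rust
// `unreachable_pub` is what produces the error shown above; deny it for this
// module so the sketch behaves like the kernel build.
#[deny(unreachable_pub)]
mod other {
    // `pub(crate)` is as far as visibility can usefully go inside a module
    // crate: nothing outside of it can name these items anyway.
    pub(crate) struct Foo {
        pub(crate) bit: bool,
    }

    impl Foo {
        pub(crate) fn new() -> Self {
            Foo { bit: false }
        }
    }
}

/// Hypothetical caller elsewhere in the same crate.
fn make_foo_bit() -> bool {
    other::Foo::new().bit
}
```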

Fallibility

The APIs in std tend to treat allocation failure as something that can’t happen. E. g. Vec::push() will not inform the caller if a heap allocation fails while resizing the underlying element storage – it will simply panic!(). This “pseudo-infallibility” is baked into the function signature of push() which returns the unit value () and thus cannot communicate that something went wrong. As Rust lacks exceptions (ignoring std::panic::catch_unwind() for the sake of argument), it also provides no means of reintroducing fallibility by bypassing the type system like C++ does for constructors with std::bad_alloc.

Given overcommit, this nonchalance towards allocation failure works reasonably well for userspace applications. In kernel space, however, out of memory conditions need to be handled, and panicking on OOM would be, in Linus’ words, “fundamentally wrong”. Consequently, the panicking APIs are disabled in kernel-side Rust by virtue of the --cfg no_global_oom_handling flag that the Makefile passes to rustc. Their fallible analogues like try_new(), try_push(), try_reserve(), try_with_capacity() serve as replacements, allowing Err(ENOMEM) to propagate up the stack. These APIs were added to the standard library as part of the larger fallible collection allocation effort outlined in RFC 2116. Although stabilization is not yet complete, these methods are the bread and butter when dealing with dynamic allocation in the kernel.
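
The pattern can be tried out in userspace with the one member of that family that is already stable in std, Vec::try_reserve(). The sketch below builds a try_push() helper on top of it; the helper and the two demo functions are assumptions for illustration and not the kernel's implementation, which returns kernel error codes rather than TryReserveError.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper mirroring the idea behind `try_push()`: reserve
/// fallibly first, then push into the space that is known to exist.
fn try_push<T>(v: &mut Vec<T>, item: T) -> Result<(), TryReserveError> {
    v.try_reserve(1)?;
    v.push(item); // cannot reallocate, capacity was reserved above
    Ok(())
}

/// Allocation failure propagates to the caller via `?` instead of panicking.
fn collect_squares(n: u32) -> Result<Vec<u32>, TryReserveError> {
    let mut out = Vec::new();
    for i in 0..n {
        try_push(&mut out, i * i)?;
    }
    Ok(out)
}

/// A request that can never be satisfied is reported as `Err`, not as a panic.
fn absurd_request() -> Result<Vec<u64>, TryReserveError> {
    let mut v: Vec<u64> = Vec::new();
    v.try_reserve(usize::MAX)?;
    Ok(v)
}
```

Every call site has to decide what to do with the error, which is exactly the property the kernel wants.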

Digression: Rust and the Kernel Allocator

Regarding standard library support, it came up earlier that not only do we have core at our disposal but also most of alloc. That is quite an accomplishment by the Rust-for-Linux devs as it provides most of your favorite containers from alloc::collections.

It was made feasible by hooking kernel allocation functions into the GlobalAlloc trait, which happens in rust/kernel/allocator.rs. kmalloc and friends are invoked with the GFP_KERNEL (“get free pages”) flag, which means that the kernel may put the caller to sleep until it can satisfy the allocation. This imposes the requirement that the caller be reentrant, but in Rust we’re not usually mutating shared state anyway. It also implies that alloc APIs which actually perform allocations must not be called from inside interrupt handlers or timers. (For an overview of the different GFP flags see Linux Device Drivers, chapter 8: Allocating Memory.)

Hopefully at some point in the future more allocation modes like GFP_ATOMIC become usable with the stabilization of the allocator_api.
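
For readers who haven’t used GlobalAlloc before, the hook looks roughly like the following userspace sketch. It forwards to the System allocator where the kernel version would call into krealloc() and kfree(); the CountingAllocator type and the ALLOCATIONS counter are invented for the example and only serve to show that every Box or Vec allocation passes through the hook.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator that counts requests and forwards them.
struct CountingAllocator;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // A kernel implementation would request memory with GFP_KERNEL here.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// From here on, every allocation made through `alloc` types uses the hook.
#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

/// Returns how many allocations were observed while boxing a value.
fn allocations_for_boxed_value() -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    let boxed = Box::new(42u64);
    std::hint::black_box(&boxed); // keep the allocation from being optimized out
    ALLOCATIONS.load(Ordering::Relaxed) - before
}
```

Note that the trait offers no place to pass per-call flags, which is why a single allocation mode is baked into the implementation.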

Clean BUGs

So what does happen when Rust code panics? Let’s add this timebomb to our module init() function:

let boom = [2u8, 1u8, 0u8];
pr_info!("countdown ...\n");
for step in 0..4 {
    kernel::delay::coarse_sleep(core::time::Duration::from_millis(500));
    pr_info!(".. {}\n", boom[step]);
}

The unidiomatic iteration code is needed to trigger the out of bounds array access resulting in this crash log:

[207794.130518] oximod: Hello world, I am oximod!
[207794.131632] oximod: countdown ...
[207794.638197] oximod: .. 2
[207795.150174] oximod: .. 1
[207795.654220] oximod: .. 0
[207796.163462] rust_kernel: panicked at 'index out of bounds: the len is 3 but the index is 3', /root/src/linux/demo-sysctl/oximod.rs:31:31
[207796.171632] ------------[ cut here ]------------
[207796.171656] kernel BUG at rust/helpers.c:45!

This is rust_helper_BUG(), a function exported for Rust to invoke the BUG() macro, which bindgen then maps to the function BUG on the Rust end:

extern "C" {
    #[link_name="rust_helper_BUG"]
    pub fn BUG();
}

and hooked into the panic!() handler:

#[cfg(not(any(testlib, test)))]
#[panic_handler]
fn panic(info: &core::panic::PanicInfo<'_>) -> ! {
    pr_emerg!("{}\n", info);
    unsafe { bindings::BUG() };
    loop {}
}

This triggers the unwinding process:

[207796.172629] invalid opcode: 0000 [#1] PREEMPT SMP
[207796.172881] CPU: 0 PID: 340812 Comm: insmod Tainted: G           O      5.19.0+ #13
[207796.172982] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[207796.173001] RIP: 0010:rust_helper_BUG+0x5/0x10
[207796.174022] Code: 24 30 00 00 00 00 48 8d 7c 24 08 48 c7 c6 88 fa 48 ae e8 0e 46 41 00 0f 0b 00 00 cc cc 00 00 cc cc 00 00 cc cc 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 89 fb e8 a2
[207796.174022] RSP: 0018:ffff9c7040803918 EFLAGS: 00010246
[207796.174022] RAX: 000000000000007c RBX: ffff9c7040803a18 RCX: a617124871ab2700
[207796.175937] RDX: ffff89e33ec294c0 RSI: ffffffffae66f12e RDI: 00000000ffffffff
<snip />
[207796.179916] Call Trace:
[207796.181595]  <TASK>
[207796.181595]  rust_begin_unwind+0x66/0x80

The Rust unwind handler is invoked and what follows is a bunch of v0-style mangled names until we hit init_module.

[207796.183038]  ? _RNvXsP_NtCs3yuwAp0waWO_4core3fmtRhNtB5_5Debug3fmtCsfATHBUcknU9_6kernel+0x50/0x50
[207796.185124]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[207796.185365]  ? _RNvNtCs3yuwAp0waWO_4core9panicking9panic_fmt+0x2c/0x30
[207796.185365]  ? _RNvNtCs3yuwAp0waWO_4core9panicking18panic_bounds_check+0x6f/0x80

There is the failed bounds check. Thanks to Rust the module enters this “controlled deorbit” procedure instead of triggering an invalid memory access as would be the case in C.

[207796.185365]  ? _RNvXs4_NtNtNtCs3yuwAp0waWO_4core3fmt3num3impxNtB9_7Display3fmt+0x20/0x20
[207796.185365]  ? _RNvXs4_NtNtNtCs3yuwAp0waWO_4core3fmt3num3impxNtB9_7Display3fmt+0x20/0x20
[207796.185365]  ? _RNvXCsf5oFb5M32wh_6oximodNtB2_6OxiModNtCsfATHBUcknU9_6kernel6Module4init+0x263/0x270 [oximod]
[207796.185365]  ? radix_tree_node_alloc+0x73/0xc0
[207796.189969]  ? _RNvXNtNtNtCs3yuwAp0waWO_4core3fmt3num3impaNtB6_7Display3fmt+0x30/0x30
[207796.189969]  ? _RNvXs_Csf5oFb5M32wh_6oximodNtB4_6OxiModNtNtNtCs3yuwAp0waWO_4core3ops4drop4Drop4drop+0x60/0x60 [oximod]
[207796.189969]  ? init_module+0x15/0x80 [oximod]

How is the Rust unwinding apparatus hooked into the kernel? Citing these compiler flags from the Makefile once more:

KBUILD_RUSTFLAGS := $(rust_common_flags) \
                --target=$(objtree)/rust/target.json \
                -Cpanic=abort -Cembed-bitcode=n -Clto=n \
===>            ^^^^^^^^^^^^^
                -Cforce-unwind-tables=n -Ccodegen-units=1 \
===>            ^^^^^^^^^^^^^^^^^^^^^^^
                -Csymbol-mangling-version=v0 \
                -Crelocation-model=static \
                -Zfunction-sections=n \
                -Dclippy::float_arithmetic

it would appear that rustc is instructed to neither insert unwinding code nor generate the unwind tables. IOW we shouldn’t a) see any unwinding and b) be getting any backtraces at all!

But we do get a backtrace. The reason, it appears, lies in how objtool generates the information for the ORC unwinder. According to the kernel docs this happens without the involvement of DWARF debug info, which would explain the -Cforce-unwind-tables=n. Instead, it relies on objtool “reverse engineering” the flow of the generated code, emitting ORC data derived from the stack metadata validation mechanism. No magic involved, let’s leave it at that.

[207796.189969]  ? selinux_kernfs_init_security+0x54/0x1c0
<snip />
[207796.194966]  ? load_module+0x1332/0x14a0
<snip />
[207796.194966]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[207796.199943]  </TASK>
[207796.199943] Modules linked in: oximod(O+) kvtcol(O) intel_rapl_msr intel_rapl_common <snip />
[207796.203378] ---[ end trace 0000000000000000 ]---

And the module hangs.
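
For contrast, here is a userspace sketch of the same off-by-one countdown next to a checked variant that turns the bad index into an ordinary error value. The function names are made up for the example, and catch_unwind() is used only to observe the panic from userspace; with -Cpanic=abort as in the kernel build there is nothing to catch.

```rust
use std::panic;

const BOOM: [u8; 3] = [2, 1, 0];

/// Same loop shape as the timebomb: indexing panics once `step` reaches 3.
fn countdown_indexed(steps: usize) -> Vec<u8> {
    let mut seen = Vec::new();
    for step in 0..steps {
        seen.push(BOOM[step]);
    }
    seen
}

/// Checked variant: an out-of-range step is reported to the caller instead.
fn countdown_checked(steps: usize) -> Result<Vec<u8>, usize> {
    let mut seen = Vec::new();
    for step in 0..steps {
        match BOOM.get(step) {
            Some(value) => seen.push(*value),
            None => return Err(step), // the offending index
        }
    }
    Ok(seen)
}

/// Userspace-only helper that reports whether the indexed variant panicked.
fn indexed_countdown_panics(steps: usize) -> bool {
    panic::catch_unwind(|| countdown_indexed(steps)).is_err()
}
```

In module code the Err branch would typically be mapped to an error code and returned from init(), so the insmod fails cleanly instead of leaving a wedged module behind.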

Accessing Kernel Internals

One can feel right at home hacking on the kernel in Rust as the environment is already surprisingly complete. Pre-made bindings to raw kernel APIs are available in the bindings:: namespace. In addition, there are idiomatic Rust wrappers for numerous core concepts like ioctls, struct file, struct sk_buff etc. Finally, symbols exported using EXPORT_SYMBOL{_GPL} are available. Functions declared inline on the C end can be a bit tricky to interact with, as is the case for example with kmalloc() which is worked around by using krealloc() in the allocator.

Thanks to its expressiveness Rust allows for abstractions that make the developer experience vastly more ergonomic than C, and that is true of kernel code as well. To give an example, user pointers are represented as struct UserSlicePtr on a type level instead of the ad-hoc __user tagging convention that is used in C. Its implementation defines just a single unsafe member function, the constructor whose invariants cannot be enforced by the compiler. The remainder of operations defined on user slices can be called in safe code – which is the rule rather than the exception as most of the time the kernel APIs one interacts with provide their data already wrapped in UserSlicePtr.
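
The “one unsafe constructor, safe everything else” shape can be sketched in userspace as follows. ForeignSlice is a hypothetical stand-in and not the kernel’s UserSlicePtr: it wraps an ordinary pointer and reads it directly, whereas the real type goes through the user copy routines.

```rust
/// Hypothetical wrapper around a pointer whose validity the compiler cannot
/// check. The type itself documents where the memory comes from.
struct ForeignSlice {
    ptr: *const u8,
    len: usize,
}

impl ForeignSlice {
    /// # Safety
    ///
    /// `ptr` must be valid for reads of `len` bytes for as long as the
    /// returned value is alive.
    unsafe fn new(ptr: *const u8, len: usize) -> Self {
        ForeignSlice { ptr, len }
    }

    /// Safe to call: copies the data into an owned buffer.
    fn read_all(&self) -> Vec<u8> {
        // SAFETY: guaranteed by the contract of `new()`.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }.to_vec()
    }
}

/// Only the construction site needs `unsafe`; all later use is safe code.
fn roundtrip(data: &[u8]) -> Vec<u8> {
    // SAFETY: `data` outlives the wrapper created from it.
    let slice = unsafe { ForeignSlice::new(data.as_ptr(), data.len()) };
    slice.read_all()
}
```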

On the topic of ergonomics, the c_str!() macro deserves mention. It allows creating ffi::CStrs at compile time, making them almost as convenient to use as &str - it would make a worthy addition to std.
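
A userspace approximation of such a macro is sketched below. It is not the kernel’s implementation and targets std::ffi::CStr; it assumes a toolchain recent enough for CStr::from_bytes_with_nul() to be usable in const context, and sysctl_name() is a made-up caller.

```rust
use std::ffi::CStr;

/// Sketch of a compile-time C string macro: appends the NUL terminator and
/// validates the result while the constant is being evaluated.
macro_rules! c_str {
    ($s:literal) => {{
        const S: &::std::ffi::CStr =
            match ::std::ffi::CStr::from_bytes_with_nul(concat!($s, "\0").as_bytes()) {
                Ok(s) => s,
                // Evaluated at compile time, so this surfaces as a build error.
                Err(_) => panic!("string literal contains an interior NUL byte"),
            };
        S
    }};
}

/// Hypothetical caller: no allocation and no runtime check involved.
fn sysctl_name() -> &'static CStr {
    c_str!("atom")
}
```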

Sysctl

The Rust wrapper which we will be using in this module is kernel::sysctl::Sysctl with its companion trait SysctlStorage. The trait is used to define how read and write accesses are handled for this sysctl: on store_value() it receives a shared reference to the slice with the data written from userspace, on read_value() it receives a wrapped, mutable __user pointer to write to. At this stage SysctlStorage is only implemented for one type, core::sync::atomic::AtomicBool which is sufficient for this chapter.

pub trait SysctlStorage: Sync {
    fn store_value(&self, data: &[u8]) -> (usize, Result);
    fn read_value(&self, data: &mut UserSlicePtrWriter) -> (usize, Result);
}

pub struct Sysctl<T: SysctlStorage> {
    inner: Box<T>,
    _table: Box<[bindings::ctl_table]>,
    header: *mut bindings::ctl_table_header,
}
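
To get a feeling for what an implementation of this trait has to do, here is a userspace model for AtomicBool. The Storage trait, the Einval type and the Vec<u8> output buffer are simplifications chosen for the sketch and differ from the kernel’s definitions; the accepted inputs ("0" and "1" with an optional trailing newline) are likewise an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the kernel's EINVAL error.
#[derive(Debug, PartialEq)]
struct Einval;

/// Simplified model of `SysctlStorage`: both methods take `&self`, so the
/// implementor needs interior mutability that is safe to share (`Sync`).
trait Storage: Sync {
    fn store_value(&self, data: &[u8]) -> (usize, Result<(), Einval>);
    fn read_value(&self, out: &mut Vec<u8>) -> (usize, Result<(), Einval>);
}

impl Storage for AtomicBool {
    fn store_value(&self, data: &[u8]) -> (usize, Result<(), Einval>) {
        // Tolerate the trailing newline that `echo` appends.
        let trimmed = match data.split_last() {
            Some((&b'\n', rest)) => rest,
            _ => data,
        };
        let result = match trimmed {
            b"0" => {
                self.store(false, Ordering::Relaxed);
                Ok(())
            }
            b"1" => {
                self.store(true, Ordering::Relaxed);
                Ok(())
            }
            _ => Err(Einval),
        };
        (data.len(), result)
    }

    fn read_value(&self, out: &mut Vec<u8>) -> (usize, Result<(), Einval>) {
        let value: &[u8] = if self.load(Ordering::Relaxed) { b"1\n" } else { b"0\n" };
        out.extend_from_slice(value);
        (value.len(), Ok(()))
    }
}

/// Writes `input` to a fresh flag and reads the flag back.
fn write_then_read(input: &[u8]) -> Result<Vec<u8>, Einval> {
    let flag = AtomicBool::new(false);
    flag.store_value(input).1?;
    let mut out = Vec::new();
    flag.read_value(&mut out).1?;
    Ok(out)
}
```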

Firstly, in the module type constructor the sysctl is registered under the desired path:

struct OxiMod(Sysctl<AtomicBool>);

impl kernel::Module for OxiMod {
    fn init(name: &'static CStr, _module: &'static ThisModule) -> Result<Self> {
        let sysctl = Sysctl::register(
            c_str!("example"),
            c_str!("atom"),
            AtomicBool::new(false),
            Mode::from_int(0o644),
        )?;
        Ok(OxiMod(sysctl))
    }
}

This results in the creation of /proc/sys/example/atom at insmod time. Since AtomicBool already implements SysctlStorage, it can be read from and written to at that path:

# insmod oximod.ko
# sysctl example/atom
example.atom = 0
# ls -l /proc/sys/example/
total 0
-rw-r--r-- 1 root root 0 2022-09-02 21:48 atom
# cat /proc/sys/example/atom
0
# printf 1 >/proc/sys/example/atom
# cat /proc/sys/example/atom
1

That was quick: in just a few lines of code we went from zero to a new sysctl available as example/atom. As for correctness, the module author doesn’t really have a reason to worry about it as that is the compiler’s job. The Sync trait bound on SysctlStorage guarantees that the backing type is safe to access from multiple threads at once.
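
What Sync buys us can be demonstrated in userspace: a shared reference to an AtomicBool may be handed to several threads, and every single operation on it is atomic. The following sketch uses scoped threads; the function names and the flip counts are arbitrary choices for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Flips the flag from several threads through a plain shared reference.
/// This compiles only because `AtomicBool` is `Sync`.
fn flip_concurrently(flag: &AtomicBool, threads: usize, flips_per_thread: usize) {
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..flips_per_thread {
                    // Atomic read-modify-write: no flip is ever lost.
                    flag.fetch_xor(true, Ordering::Relaxed);
                }
            });
        }
    });
}

/// Returns the state of a flag that started out `false` after all flips.
fn final_state(threads: usize, flips_per_thread: usize) -> bool {
    let flag = AtomicBool::new(false);
    flip_concurrently(&flag, threads, flips_per_thread);
    flag.load(Ordering::Relaxed)
}
```

Swapping AtomicBool for a Cell<bool> makes the compiler reject the program, which is the same check that keeps a non-thread-safe type out of a Sysctl.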

Digression: It’s not my `EFAULT!`

In reality getting to the point of a working OxiMod was not as frictionless as the above would make you believe. While playing around with struct Sysctl I soon hit a roadblock:

# cat /proc/sys/example/atom
cat: /proc/sys/example/atom: Bad address

# printf 0 >/proc/sys/example/atom
-bash: printf: write error: Bad address

Confusing. “Bad address” is EFAULT, what could possibly cause that?

Looking at the implementation of SysctlStorage, safe access to the buffer is handled by wrapping it in UserSlicePtr, the tuple struct representing a user pointer. Read / write operations on that type are defined in its implementation of the IoBufferReader trait, which is where the calls to copy_{from,to}_user() happen and EFAULT may be returned on failure. In the case of SysctlStorage the failure originates in access_ok() on the source or destination buffer of copy_from_user() or copy_to_user(), respectively. Something about the pointer received by proc_handler() must not be right, causing access_ok() to reject it.

Why that happens is obvious from proc_sysctl.c where the call to proc_handler() happens: the buffer it passes to the function is not a user pointer but a freshly allocated kernel one! Turns out this was changed in kernel v5.7 with commit 32927393dc1c: "sysctl: pass kernel pointers to ->proc_handler". The fix is straightforward – get rid of the UserSlicePtr wrapper as we can safely operate on the kernel pointer directly. Amazingly, the pull request on the Rust-for-Linux repo received a blazing-fast review that ironed out some obvious flaws within a few hours of submission.
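
The rejection itself is easy to model. The sketch below is a deliberately simplified version of the range check, not the kernel’s code: the TASK_SIZE_MAX value assumes x86_64 with 4-level paging and 4 KiB pages, copy_from_user_model() is a made-up name, and the kernel address used in the usage note is simply the RSP value from the crash log further up.

```rust
/// Assumed upper bound of the user address range (x86_64, 4-level paging).
const TASK_SIZE_MAX: u64 = (1 << 47) - 4096;

/// Errno value for "Bad address".
const EFAULT: i32 = 14;

/// Simplified model of the range check: the whole buffer has to lie below
/// the user/kernel split, and the end address must not wrap around.
fn access_ok(addr: u64, len: u64) -> bool {
    match addr.checked_add(len) {
        Some(end) => end <= TASK_SIZE_MAX,
        None => false,
    }
}

/// Hypothetical model of the copy routine's front door: reject first, and
/// only touch memory (omitted here) if the range looks like user memory.
fn copy_from_user_model(addr: u64, len: u64) -> Result<(), i32> {
    if access_ok(addr, len) {
        Ok(())
    } else {
        Err(EFAULT)
    }
}
```

A pointer into a kernel buffer such as 0xffff9c7040803918 fails this check no matter how valid the memory behind it is, hence the EFAULT for both reads and writes.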

Given the novelty of the project it comes as a surprise that parts of the Rust support haven’t kept up with kernel changes that were merged more than two years ago. To put a positive spin on it, Rust for Linux has reached a point of maturity where it is already subject to bitrot!

Conclusion

This part of the series described the process of creating a trivial out-of-tree module in a few lines of Rust code. Starting from a minimal example provided by upstream, we discussed allocating APIs and how to use them idiomatically in kernel side Rust and then had a look at what happens if the module triggers a BUG(). Finally, the actual sysctl was implemented by combining a few predefined parts.

The resulting module doesn’t do much at all: it only allows storing a bit in the kernel, modifying it and retrieving it. Part three will expand the functionality to actually have an effect on the system.