In the first part we configured and built a Rust-enabled kernel and
had a look at how rustc and related tools are integrated into the
kernel build system. Now that we’re all set up we can finally hack on
some in-kernel Rust code.
To illustrate the path to writing a basic module in Rust we will implement a simple sysctl that can be used to set or flip a bit.
Starting Point
Even at this point, before Rust support has been merged, there is no
shortage of code examples. The Rust-for-Linux tree itself contains
numerous modules under samples/rust/ that demonstrate how Rust code
interfaces with a wide span of kernel facilities, from firewalling to
filesystems.
Upstream also provides a small out-of-tree module which doesn’t do
much aside from allocating a Vec and emitting a printk from the
module’s Drop implementation.
This will serve as the starting point for our further exploration of
kernel side Rust.
Modules
A module is declared using the proc-macro module!() or one of its
special-purpose variants, e. g. module_fs!() for filesystems.
module! {
    type: OxiMod,
    name: "oximod",
    author: "The Man from Ox",
    description: "Oxidize the Kernel",
    license: "GPL",
}

struct OxiMod;
Here type is the name of a struct that implements the trait
kernel::Module, which requires just one function, init(), similar to
how module_init() defines the setup function for modules written in C.
Shouldn’t there be a corresponding exit() function?
Not necessarily: The Rust for Linux authors chose the more “rustic”
approach of using the Drop trait as the canonical way of executing
code when the module is about to be removed.
This maps nicely to the requirement that the function passed to
module_exit() be infallible.
impl kernel::Module for OxiMod {
    // Runs at insmod time; returning Err aborts loading the module.
    fn init(name: &'static CStr, _module: &'static ThisModule) -> Result<Self> {
        pr_info!("Hello world, I am {}!\n", name);
        Ok(OxiMod)
    }
}

impl Drop for OxiMod {
    // Runs when the module is removed; cannot fail.
    fn drop(&mut self) {
        pr_info!("Bye!\n");
    }
}
At the Rust end, kernel modules behave like crates.
The code forming part of the module is compiled as one translation unit
even if it comprises multiple files.
This has consequences for symbol visibility as well: pub(crate)
implies module scope while an unqualified pub
…
results in a compile error:
error: unreachable `pub` item
--> /root/src/linux/demo-sysctl/other.rs:6:5
|
6 | pub fn new() -> Self {
| ---^^^^^^^^^^^^^^^^^
| |
| help: consider restricting its visibility: `pub(crate)`
IOW, the function isn’t being exported so it is useless to declare
it with global visibility. (pub use other::Foo; in the module root
would export the symbol, but to whom?)
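The wording of the error matches rustc's unreachable_pub lint. A minimal sketch of the layout that satisfies it could look as follows; the field and the value() accessor are made up for illustration, only other::Foo and new() appear in the error message above.

```rust
mod other {
    /// Visible everywhere inside this crate, but not exported from it.
    pub(crate) struct Foo {
        value: u32,
    }

    impl Foo {
        /// `pub(crate)` instead of a bare `pub` keeps the lint quiet.
        pub(crate) fn new() -> Self {
            Foo { value: 42 }
        }

        /// Hypothetical accessor, added only so the example does something.
        pub(crate) fn value(&self) -> u32 {
            self.value
        }
    }
}

/// Code elsewhere in the same crate can still reach `Foo`.
fn use_foo() -> u32 {
    other::Foo::new().value()
}
```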
Fallibility
The APIs in std tend to treat allocation failure as something that
can’t happen.
E. g. Vec::push() will not inform the caller if a heap allocation
fails while resizing the underlying element storage – it will simply
panic!().
This “pseudo-infallibility” is baked into the function signature of
push() which returns the unit value () and thus cannot communicate
that something went wrong.
As Rust lacks exceptions (ignoring std::panic::catch_unwind() for
the sake of argument), it also provides no means of reintroducing
fallibility by bypassing the type system, as C++ does for constructors
with std::bad_alloc.
Given overcommit, this nonchalance towards allocation failure works
reasonably well for userspace applications.
In kernel space however, out of memory conditions need to be handled
and panicking on OOM would be, in Linus’ words,
“fundamentally wrong”.
Consequently, the panicking APIs are disabled in kernel-side Rust by
virtue of the --cfg no_global_oom_handling flag that the Makefile
passes to rustc.
Their fallible analogues like try_new(), try_push(), try_reserve(),
try_with_capacity() serve as replacements, allowing Err(ENOMEM) to
propagate up the stack.
These APIs were added to the standard library as part of the larger
fallible collection allocation effort outlined in RFC 2116.
Although stabilization is not yet complete, these methods are the
bread and butter of dynamic allocation in the kernel.
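The pattern can be approximated in userspace with the parts of this effort that are already stable in std, namely Vec::try_reserve() and TryReserveError. The sketch below is an illustration, not the kernel's code: the try_push() helper and collect_bytes() are hypothetical names, and the error type is the std one rather than ENOMEM.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: a fallible push assembled from stable `std` parts.
fn try_push<T>(v: &mut Vec<T>, item: T) -> Result<(), TryReserveError> {
    // Reserve room first; this is the only step that may allocate.
    v.try_reserve(1)?;
    // Capacity is guaranteed now, so this push does not reallocate.
    v.push(item);
    Ok(())
}

/// Copies the input, handing allocation failure back to the caller via `?`.
fn collect_bytes(input: &[u8]) -> Result<Vec<u8>, TryReserveError> {
    let mut out = Vec::new();
    for &b in input {
        try_push(&mut out, b)?;
    }
    Ok(out)
}
```

The point of the design is that failure travels through the return type, so every caller up the stack has to decide what to do with it.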
Digression: Rust and the Kernel Allocator
Regarding standard library support, it came up earlier that not only
do we have core at our disposal but also most of alloc.
That is quite an accomplishment by the Rust-for-Linux devs as it
provides most of your favorite containers from alloc::collections.
It was made feasible by hooking kernel allocation functions into the
GlobalAlloc trait, which happens in rust/kernel/allocator.rs.
kmalloc and friends are invoked with the GFP_KERNEL (“get free
pages”) flag, which means that the kernel may put the caller to sleep
until it can satisfy the allocation.
It imposes the requirement that the caller be reentrant, but in Rust
we’re not usually mutating shared state anyways.
It also implies that alloc APIs which actually perform allocations
must not be called from inside interrupt handlers or timers.
(For an overview of the different GFP flags see Linux Device
Drivers, chapter 8: Allocating Memory.)
Hopefully, at some point in the future, more allocation modes like
GFP_ATOMIC will become usable with the stabilization of the
allocator_api.
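To get a feeling for what hooking into GlobalAlloc involves, here is a userspace sketch. It makes no claim about the contents of rust/kernel/allocator.rs: CountingAlloc is a hypothetical type that forwards to std's System allocator where the kernel would call its own allocation functions.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator: forwards to the system allocator and counts calls.
struct CountingAlloc {
    allocations: AtomicUsize,
}

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.allocations.fetch_add(1, Ordering::Relaxed);
        // SAFETY: the caller upholds the `GlobalAlloc::alloc` contract.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` was returned by `System.alloc` with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

/// Allocates and frees one `u64` through the wrapper and returns the
/// number of allocations seen so far.
fn alloc_once(a: &CountingAlloc) -> usize {
    let layout = Layout::new::<u64>();
    // SAFETY: the layout has non-zero size and the block is freed with it.
    unsafe {
        let p = a.alloc(layout);
        assert!(!p.is_null(), "allocation failed");
        a.dealloc(p, layout);
    }
    a.allocations.load(Ordering::Relaxed)
}
```

Declaring a static of such a type with the #[global_allocator] attribute is what routes every allocation made through alloc to it; the sketch calls the trait methods directly to keep the example deterministic.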
Clean BUGs
So what does happen when Rust code panics? Let’s add this timebomb to
our module init() function:
let boom = [2u8, 1u8, 0u8];
pr_info!("countdown ...\n");
// Four iterations over three elements: the last index is out of bounds.
for step in 0..4 {
    kernel::delay::coarse_sleep(core::time::Duration::from_millis(500));
    pr_info!(".. {}\n", boom[step]);
}
The unidiomatic iteration code is needed to trigger the out of bounds array access resulting in this crash log:
[207794.130518] oximod: Hello world, I am oximod!
[207794.131632] oximod: countdown ...
[207794.638197] oximod: .. 2
[207795.150174] oximod: .. 1
[207795.654220] oximod: .. 0
[207796.163462] rust_kernel: panicked at 'index out of bounds: the len is 3 but the index is 3', /root/src/linux/demo-sysctl/oximod.rs:31:31
[207796.171632] ------------[ cut here ]------------
[207796.171656] kernel BUG at rust/helpers.c:45!
This is rust_helper_BUG(), a function exported for Rust to invoke the
BUG() macro, which is then mapped by bindgen to the function BUG on
the Rust end:
extern "C" {
    #[link_name="rust_helper_BUG"]
    pub fn BUG();
}
and hooked into the panic!() handler:
#[cfg(not(any(testlib, test)))]
#[panic_handler]
fn panic(info: &core::panic::PanicInfo<'_>) -> ! {
    pr_emerg!("{}\n", info);
    unsafe { bindings::BUG() };
    loop {}
}
This triggers the unwinding process:
[207796.172629] invalid opcode: 0000 [#1] PREEMPT SMP
[207796.172881] CPU: 0 PID: 340812 Comm: insmod Tainted: G O 5.19.0+ #13
[207796.172982] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[207796.173001] RIP: 0010:rust_helper_BUG+0x5/0x10
[207796.174022] Code: 24 30 00 00 00 00 48 8d 7c 24 08 48 c7 c6 88 fa 48 ae e8 0e 46 41 00 0f 0b 00 00 cc cc 00 00 cc cc 00 00 cc cc 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 89 fb e8 a2
[207796.174022] RSP: 0018:ffff9c7040803918 EFLAGS: 00010246
[207796.174022] RAX: 000000000000007c RBX: ffff9c7040803a18 RCX: a617124871ab2700
[207796.175937] RDX: ffff89e33ec294c0 RSI: ffffffffae66f12e RDI: 00000000ffffffff
<snip />
[207796.179916] Call Trace:
[207796.181595] <TASK>
[207796.181595] rust_begin_unwind+0x66/0x80
The Rust unwind handler is invoked and what follows is a bunch of
v0-style mangled names until we hit init_module.
[207796.183038] ? _RNvXsP_NtCs3yuwAp0waWO_4core3fmtRhNtB5_5Debug3fmtCsfATHBUcknU9_6kernel+0x50/0x50
[207796.185124] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[207796.185365] ? _RNvNtCs3yuwAp0waWO_4core9panicking9panic_fmt+0x2c/0x30
[207796.185365] ? _RNvNtCs3yuwAp0waWO_4core9panicking18panic_bounds_check+0x6f/0x80
There is the failed bounds check. Thanks to Rust the module enters this “controlled deorbit” procedure instead of triggering an invalid memory access as would be the case in C.
[207796.185365] ? _RNvXs4_NtNtNtCs3yuwAp0waWO_4core3fmt3num3impxNtB9_7Display3fmt+0x20/0x20
[207796.185365] ? _RNvXs4_NtNtNtCs3yuwAp0waWO_4core3fmt3num3impxNtB9_7Display3fmt+0x20/0x20
[207796.185365] ? _RNvXCsf5oFb5M32wh_6oximodNtB2_6OxiModNtCsfATHBUcknU9_6kernel6Module4init+0x263/0x270 [oximod]
[207796.185365] ? radix_tree_node_alloc+0x73/0xc0
[207796.189969] ? _RNvXNtNtNtCs3yuwAp0waWO_4core3fmt3num3impaNtB6_7Display3fmt+0x30/0x30
[207796.189969] ? _RNvXs_Csf5oFb5M32wh_6oximodNtB4_6OxiModNtNtNtCs3yuwAp0waWO_4core3ops4drop4Drop4drop+0x60/0x60 [oximod]
[207796.189969] ? init_module+0x15/0x80 [oximod]
How is the Rust unwinding apparatus hooked into the kernel? Citing
these compiler flags from the Makefile once more:
KBUILD_RUSTFLAGS := $(rust_common_flags) \
--target=$(objtree)/rust/target.json \
-Cpanic=abort -Cembed-bitcode=n -Clto=n \
===> ^^^^^^^^^^^^^
-Cforce-unwind-tables=n -Ccodegen-units=1 \
===> ^^^^^^^^^^^^^^^^^^^^^^^
-Csymbol-mangling-version=v0 \
-Crelocation-model=static \
-Zfunction-sections=n \
-Dclippy::float_arithmetic
it would appear that rustc is instructed to neither insert unwinding
code nor generate the unwind tables.
IOW we shouldn’t a) see any unwinding and b) be getting any backtraces
at all!
But we do get a backtrace.
The reason, it appears, lies in how objtool generates the information
for the ORC unwinder.
According to the kernel docs this happens without the involvement of
DWARF debug info, which would explain the -Cforce-unwind-tables=n.
Instead, it relies on objtool “reverse engineering” the flow of the
generated code, emitting ORC data derived from the stack metadata
validation mechanism.
No magic involved, let’s leave it at that.
[207796.189969] ? selinux_kernfs_init_security+0x54/0x1c0
<snip />
[207796.194966] ? load_module+0x1332/0x14a0
<snip />
[207796.194966] ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[207796.199943] </TASK>
[207796.199943] Modules linked in: oximod(O+) kvtcol(O) intel_rapl_msr intel_rapl_common <snip />
[207796.203378] ---[ end trace 0000000000000000 ]---
And the module hangs.
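For contrast, the same countdown can be written so that the out-of-bounds step never panics. This is a userspace sketch using slice::get(), which returns an Option instead of aborting; the countdown() function is hypothetical, and unlike kernel code it uses the panicking Vec::push() for brevity.

```rust
/// Returns the countdown values that exist, stopping at the first index
/// that lies outside the slice.
fn countdown(boom: &[u8], steps: usize) -> Vec<u8> {
    let mut seen = Vec::new();
    for step in 0..steps {
        // `get()` yields `None` for an out-of-bounds index instead of panicking.
        match boom.get(step) {
            Some(value) => seen.push(*value),
            None => break,
        }
    }
    seen
}
```

With the array from the timebomb and four steps, the function returns the three existing values and simply stops at the fourth.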
Accessing Kernel Internals
One can feel right at home hacking on the kernel in Rust as the
environment is already surprisingly complete.
Pre-made bindings to raw kernel APIs are available in the bindings::
namespace.
In addition, there are idiomatic Rust wrappers for numerous core
concepts like ioctls, struct file, struct sk_buff etc.
Finally, symbols exported using EXPORT_SYMBOL{_GPL} are available.
Functions declared inline on the C end can be a bit tricky to
interact with, as is the case for example with kmalloc() which is
worked around by using krealloc() in the allocator.
Thanks to its expressiveness Rust allows for abstractions that
make the developer experience vastly more ergonomic than C, and that
is true of kernel code as well.
To give an example, user pointers are represented as struct
UserSlicePtr on a type level instead of the ad-hoc __user tagging
convention that is used in C.
Its implementation defines just a single unsafe member function,
the constructor whose invariants cannot be enforced by the compiler.
The remainder of operations defined on user slices can be called
in safe code – which is the rule rather than the exception as most of
the time the kernel APIs one interacts with provide their data already
wrapped in UserSlicePtr.
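The "one unsafe constructor, safe methods afterwards" design can be sketched in userspace. RawSlice below is a hypothetical type that only borrows the idea; it is not the kernel's UserSlicePtr and it reads ordinary process memory instead of going through copy_from_user().

```rust
/// Hypothetical wrapper around a raw pointer/length pair.
struct RawSlice {
    ptr: *const u8,
    len: usize,
}

impl RawSlice {
    /// The only unsafe entry point.
    ///
    /// # Safety
    /// `ptr` must be valid for reads of `len` bytes for as long as the
    /// wrapper is in use.
    unsafe fn new(ptr: *const u8, len: usize) -> Self {
        RawSlice { ptr, len }
    }

    /// Safe to call: the invariant was promised once, at construction.
    fn read_all(&self) -> Vec<u8> {
        // SAFETY: guaranteed by the contract of `new()`.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }.to_vec()
    }
}

/// Wraps a live slice and copies its contents out through the safe API.
fn copy_out(data: &[u8]) -> Vec<u8> {
    // SAFETY: `data` is a live slice, so pointer and length are valid.
    let raw = unsafe { RawSlice::new(data.as_ptr(), data.len()) };
    raw.read_all()
}
```

Whoever constructs the wrapper takes responsibility for the pointer once; every later user works with a type the compiler can check.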
On the topic of ergonomics, the c_str!() macro deserves mention. It
allows creating ffi::CStrs at compile time, making them almost as
convenient to use as &str - it would make a worthy addition to std.
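A rough userspace imitation of the idea, assuming a toolchain recent enough for CStr::from_bytes_with_nul() to be usable in const context (Rust 1.72 or later). The my_c_str!() macro is hypothetical and makes no claim to match the kernel's implementation.

```rust
use std::ffi::CStr;

/// Hypothetical imitation of `c_str!()`: appends the terminating NUL and
/// validates the result while compiling.
macro_rules! my_c_str {
    ($s:literal) => {{
        // Evaluated at compile time: an interior NUL byte fails the build.
        const S: &CStr =
            match CStr::from_bytes_with_nul(concat!($s, "\0").as_bytes()) {
                Ok(s) => s,
                Err(_) => panic!("string contains an interior NUL byte"),
            };
        S
    }};
}

/// Example use: a NUL-terminated name with static lifetime.
fn sysctl_name() -> &'static CStr {
    my_c_str!("atom")
}
```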
Sysctl
The Rust wrapper which we will be using in this module is
kernel::sysctl::Sysctl with its companion trait SysctlStorage.
The trait is used to define how read and write accesses are handled
for this sysctl: on store_value() it receives a shared reference
to the slice with the data written from userspace, on read_value()
it receives a wrapped, mutable __user pointer to write to.
At this stage SysctlStorage is only implemented for one type,
core::sync::atomic::AtomicBool, which is sufficient for this chapter.
pub trait SysctlStorage: Sync {
    // Handles a write from userspace.
    fn store_value(&self, data: &[u8]) -> (usize, Result);
    // Handles a read from userspace.
    fn read_value(&self, data: &mut UserSlicePtrWriter) -> (usize, Result);
}

pub struct Sysctl<T: SysctlStorage> {
    inner: Box<T>,
    _table: Box<[bindings::ctl_table]>,
    header: *mut bindings::ctl_table_header,
}
Firstly, in the module type constructor the sysctl is registered under the desired path:
struct OxiMod(Sysctl<AtomicBool>);

impl kernel::Module for OxiMod {
    fn init(name: &'static CStr, _module: &'static ThisModule) -> Result<Self> {
        let sysctl = Sysctl::register(
            c_str!("example"),
            c_str!("atom"),
            AtomicBool::new(false),
            Mode::from_int(0o644),
        )?;
        Ok(OxiMod(sysctl))
    }
}
This results in the creation of /proc/sys/example/atom at insmod
time. Since AtomicBool already implements SysctlStorage, it can be
read from and written to at that path:
# insmod oximod.ko
# sysctl example/atom
example.atom = 0
# ls -l /proc/sys/example/
total 0
-rw-r--r-- 1 root root 0 2022-09-02 21:48 atom
# cat /proc/sys/example/atom
0
# printf 1 >/proc/sys/example/atom
# cat /proc/sys/example/atom
1
This was quick: in just a few lines of code we went from zero to a new
sysctl available as example/atom.
As for correctness, the module author doesn’t really have a reason to
worry about it as that is the compiler’s job.
The Sync trait bound on SysctlStorage guarantees that the backing
type can be safely shared between threads.
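How a storage type can honor that bound without any locking is sketched below in userspace Rust. Everything here is an assumption made for illustration: the Storage trait mirrors the shape of SysctlStorage but writes into a Vec<u8> and reports errors as strings, and the accepted input format is a guess rather than the kernel implementation's behavior.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical userspace analogue of `SysctlStorage`.
trait Storage: Sync {
    fn store_value(&self, data: &[u8]) -> (usize, Result<(), &'static str>);
    fn read_value(&self, out: &mut Vec<u8>) -> (usize, Result<(), &'static str>);
}

impl Storage for AtomicBool {
    fn store_value(&self, data: &[u8]) -> (usize, Result<(), &'static str>) {
        // Accept "0" or "1", optionally followed by a newline.
        let trimmed = data.strip_suffix(b"\n").unwrap_or(data);
        let value = match trimmed {
            [b'0'] => false,
            [b'1'] => true,
            _ => return (0, Err("invalid value")),
        };
        // `&self` suffices: the atomic makes the shared write sound.
        self.store(value, Ordering::Relaxed);
        (data.len(), Ok(()))
    }

    fn read_value(&self, out: &mut Vec<u8>) -> (usize, Result<(), &'static str>) {
        let byte = if self.load(Ordering::Relaxed) { b'1' } else { b'0' };
        out.extend_from_slice(&[byte, b'\n']);
        (2, Ok(()))
    }
}
```

Both methods take &self, so concurrent readers and writers are possible by construction; the atomic type is what makes that safe.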
Digression: It’s not my EFAULT!
In reality getting to the point of a working OxiMod was not as
frictionless as the above would make you believe.
While playing around with struct Sysctl I soon hit a roadblock:
# cat /proc/sys/example/atom
cat: /proc/sys/example/atom: Bad address
# printf 0 >/proc/sys/example/atom
-bash: printf: write error: Bad address
Confusing. “Bad address” is EFAULT, what could possibly cause that?
In the implementation of SysctlStorage, safe access to the buffer is
handled by wrapping it in UserSlicePtr, the tuple struct representing
a user pointer.
Read / write operations on that type are defined in its implementation
of IoBufferReader, which is where the calls to copy_{from,to}_user()
happen and EFAULT may be returned on failure.
In the case of SysctlStorage the failure originates in access_ok()
on the source or destination buffer of copy_from_user() or
copy_to_user(), respectively.
Something about the pointer received by proc_handler() must not be
right, causing access_ok() to reject it.
Why that happens is obvious from proc_sysctl.c where the call to
proc_handler() happens: the buffer it passes to the function is
not a user pointer but a freshly allocated kernel one!
Turns out this was changed in kernel v5.7 with
commit 32927393dc1c: "sysctl: pass kernel pointers to ->proc_handler".
The fix is straightforward – get rid of the UserSlicePtr wrapper
as we can safely operate on the kernel pointer directly.
Amazingly, the pull request on the Rust-for-Linux repo
received a blazing-fast review that ironed out some obvious flaws
within a few hours of submission.
Given the novelty of the project it comes as a surprise that parts of the Rust support haven’t kept up with kernel changes that were merged more than two years ago. To put a positive spin on it, Rust for Linux has reached a point of maturity where it is already subject to bitrot!
Conclusion
This part of the series described the process of creating a trivial
out-of-tree module in a few lines of Rust code.
Starting from a minimal example provided by upstream, we discussed
allocating APIs and how to use them idiomatically in kernel side Rust
and then had a look at what happens if the module triggers a BUG().
Finally, the actual sysctl was implemented by combining a few
predefined parts.
The resulting module doesn’t do much at all: it only allows storing a bit in the kernel, modifying it and retrieving it. Part three will expand the functionality to actually have an effect on the system.