The basic bare metal code required to boot an AArch64 system is not terribly complicated, however, basic code will not do much. The code below handles most basic setup for a Raspberry 3 or 4 and has a few advanced features not found in other boot code examples. The code assumes RPi 3 or 4 HW, as elements in the BCM 283x or BCM2711 peripheral are initialized in the code. It is commented and the BCM specific code is generally abstracted out, so hopefully it is transparent enough that it can be adapted to different AArch64 platforms and different use cases. Features include:
- Ability to handle entry in EL2 or EL3
- Auto-detects Raspberry Pi version
- Sets up the RPi Physical Timer
- Sets up the General Interrupt Controller (GIC) for the RPi4
- Sets up the Stacks for EL1 and EL0 Exception Processing
- Initializes Environment for C Runtime
- Initializes Environment for C++ Runtime
There are few few limitations at this point:
- Single Core Only
- No Virtual Memory Management
- Semi-specific to RPi 3 & 4, though compatibles *should* work
This is a lengthy post but splitting it into multiple separate posts would probably be distracting. At the 100,000ft level, the idea is that the processor enters the top of this code in either EL3 or EL2, initializes the functions listed above and then exits in EL1 to the C++ code which performs the ‘kernel initialization’ and the kernel itself. All the code discussed in this post for the Raspberry Pi Bare Metal OS project can be found in my Github repository. Not all code in the kernel boot sequence is contained below, particularly the handful of subroutines which initialize different parts of the RPi hardware. Consulting Github for this code will be helpful.
RPi Boot Process Overview
Raspberry Pis have a somewhat unique boot process which works well to prevent bricking of the device. When powered on, it is actually the GPU in the BCM peripheral chip’s video core which starts to run boot code in an internal ROM and eventually starts the ARM processor. The ‘config.txt‘ file and the ‘command_line.txt‘ files are loaded by the video core, parsed and a variety of internal attributes are configured.
Once the two files are parsed and the video core is configured, the GPU loads the ‘armstub‘ file into the right spot in physical memory for the ARM processor to start executing it. On entry to the armstub the ARM core will be running in EL3. It is the armstub file which will eventually jump to the start of the kernel code.
The Raspberry Pi OS ships with ‘armstub8.bin‘ which is the default, but the armstub loaded by the video core can be changed in the ‘config.txt‘ file (consult Github for an example). The default armstub does some initialization before shifting the Exception Level down to EL2 prior to jumping into the kernel.
This project includes a custom ‘armstub’, named ‘armstub_minimal.bin‘. This minimal armstub does nothing more than jump into the kernel code – still at the EL3 Exception Level. This permits the startup code to handle any EL3 initialization that might be required for different use cases. There are a number elements of the HW that need to be configured in EL3, those appear in the boot code below. The boot code below can be used with the default armstub file shipped with the Raspberry Pi OS, as it will detect on entry if the core is running in EL3 or EL2 – and will skip all the EL3 initialization if the core is already running in EL2.
Exception Levels
There are four Exception Levels built into ARM 8 cores. I suspect the term ‘Exception Level’ comes from ARM 8 interrupt processing (interrupts are a subset of more general ‘exceptions’) where different hardware or software ‘exceptions’ (not to be confused with C++ or Java exceptions) are tied to different Exception Levels. Additionally, there are sets of instructions that are restricted to a specific Exception Level.
- EL3 – Highest Exception Level and the only level in which the processor can switch from ‘secure mode’ to ‘insecure mode’. Code running at EL3 is typically called a ‘Secure Monitor’. EL3 is optional in ARM processors.
- EL2 – Hypervisor Exception Level, virtualization code will run at this level and page fault exceptions generated by the memory manager when using 2 Stage address translation will be handled in this level. EL2 is also optional in ARM processors.
- EL1 – What used to be called ‘Ring 0’ in OS development. This is the level the kernel and most interrupt handlers should execute within.
- EL0 – What used to be called ‘Ring 3’ in OS development. This is the level within which application code will execute.
Exception Levels can change as a result of either (1) an exception which is handled at a specific (usually higher) exception level -or- (2) execution of the ‘eret‘ (exception return) call which permits the core (PE or ‘Processing Element’ in ARM documentation) to potentially drop to a lower exception level. Exceptions can leave the PE at the same EL or move to a higher level and conversely the exception return can leave the PE at the same EL or move it down.
The boot code in this post only supports execution in EL1 and EL0. Maybe in the future I will dabble in a lightweight hypervisor which would pull in EL2 but I doubt I will write a Secure Monitor for EL3. It should be noted that either EL3 or EL2 may be used – but not together. If running in Trusted or Secure Mode, EL2 is disabled.
Boot Code
The code below is a pretty complete Raspberry Pi Aarch64 boot up example which has been tested on RPi 3 and 4 but which should be generalizable to any AArch64 system with EL3, processors without EL3 would require more modifications to initialize subsystems correctly in EL1.
#include "cpu_part_nums.h"
#define BIT(x) (1 << (x))
#define SCR_RW BIT(10)
#define SCR_HCE BIT(8)
#define SCR_SMD BIT(7)
#define SCR_RES1_5 BIT(5)
#define SCR_RES1_4 BIT(4)
#define SCR_NS BIT(0)
#define SCR_VAL (SCR_RW | SCR_HCE | SCR_SMD | SCR_RES1_5 | SCR_RES1_4 | SCR_NS)
#define ACTLR_VAL (BIT(0) | BIT(1) | BIT(4) | BIT(5) | BIT(6))
#define CPUECTLR_EL1 S3_1_C15_C2_1
#define CPUECTLR_EL1_SMPEN BIT(6)
#define L2CTLR_EL1 S3_1_C11_C0_2
#define SPSR_EL3_D BIT(9)
#define SPSR_EL3_A BIT(8)
#define SPSR_EL3_I BIT(7)
#define SPSR_EL3_F BIT(6)
#define SPSR_EL3_MODE_EL2H 9
#define SPSR_EL3_VAL (SPSR_EL3_D | SPSR_EL3_A | SPSR_EL3_I | SPSR_EL3_F | SPSR_EL3_MODE_EL2H)
#define SPSR_EL2_MASK_ALL (7 << 6)
#define SPSR_EL2_EL1h (5 << 0)
#define SPSR_EL2_VALUE (SPSR_EL2_MASK_ALL | SPSR_EL2_EL1h)
//
// Register usage:
//
// x0 - return value from function calls and local scratchpad register
// x1 - local scratchpad register
// x2 - local scratchpad register
// x19 - local scratchpad register used for computing addresses
// x20 - local scratchpad register, holds the end address of C++ statics to be initialized
//
.section ".text.boot"
.global _start
_start:
//
// Set move to EL1
//
// This will require some config at EL3 and EL2, which is dependent on the EL level
// when we enter this code.
//
// The default armstub configures EL3 and then jumps to this code in EL2.
// The armstub_base.bin with this project does not change the EL at all,
// so we will enter here at EL3.
//
// Determine if we are running at EL3 - this can occur if the default ARMSTUB is replaced
mrs x0, CurrentEL
and x0, x0, #12 // clear reserved bits
cmp x0, #12
bne running_in_el2 // Not running on EL3, so skip to EL2 entry point
// Running at EL3
// Set L2 read/write cache latency to 3
mrs x0, L2CTLR_EL1
mov x1, #0x22
orr x0, x0, x1
msr L2CTLR_EL1, x0
// Enable Floating point instruction set and SIMD instructions
msr CPTR_EL3, xzr
// Set up SCR
mov x0, #SCR_VAL
msr SCR_EL3, x0
// Set up ACTLR
mov x0, #ACTLR_VAL
msr ACTLR_EL3, x0
// Set SMPEN
mov x0, #CPUECTLR_EL1_SMPEN
msr CPUECTLR_EL1, x0
// Identify the board type and then setup the physical timer
bl IdentifyBoardType
mov x3, x0 // Copy board type into register 3
bl SetupPhysicalTimer
// If we are on a platform that supports the Generic Interrupt Controller, configure it now
// Register 3 contains the board type copied above.
cmp x3, #RPI_BOARD_ENUM_RPI3
b.eq rpi3_skip_gic_config
bl ConfigureGIC400
rpi3_skip_gic_config:
// Set up SCTLR_EL2
// All set bits below are res1. LE, no WXN/I/SA/C/A/M
ldr x0, =0x30c50830
msr SCTLR_EL2, x0
// Switch to EL2
mov x0, #SPSR_EL3_VAL
msr SPSR_EL3, x0
adr x0, running_in_el2
msr ELR_EL3, x0
eret
// Running at EL2 - processor will be in EL2 on entry with the default ARMSTUB
running_in_el2:
// Read CPU id, stop slave cores
mrs x1, MPIDR_EL1
and x1, x1, #3
cbz x1, continue_on_core0
// CPU id > 0, stop
b ParkCore
// Sets stack for EL1 - can only be done in EL2 or EL3
adrp x19, __el1_stack_top
add x19, x19, :lo12:__el1_stack_top
msr SP_EL1, x19
// Enable CNTP for lower exception levels
mrs x0, CNTHCTL_EL2
orr x0, x0, #3
msr CNTHCTL_EL2, x0
msr CNTVOFF_EL2, xzr
// Disable various CPU architectural feature traps
mov x0, #0x33FF
msr CPTR_EL2, x0
msr HSTR_EL2, xzr
mov x0, #(3 << 20)
msr CPACR_EL1, x0
// Enable AArch64 in EL1
mov x0, #(1 << 31) // AArch64
orr x0, x0, #(1 << 1) // SWIO hardwired on Pi3
msr HCR_EL2, x0
mrs x0, HCR_EL2
// Configure CPU for EL1 and EL0
mov x2, #0x0800
movk x2, #0x30d0, lsl #16
msr SCTLR_EL1, x2
// Change execution level to EL1
mov x2, SPSR_EL2_VALUE
msr SPSR_EL2, x2
adr x2, running_in_el1 // Load the EL2 exception return reg with
// the running_in_el1 symbol so we go there on eret
msr ELR_EL2, x2
eret // Exception level changes only on exception return
running_in_el1:
bl IdentifyBoardType // Identify the board again in EL1 so we can store it
adrp x19, __hw_board_type // Store board type in global: __hw_board_type.
// We do this here as EL3 has its own memory
add x19, x19, :lo12:__hw_board_type // which is not accessible from EL1.
str w0, [x19]
// Set up exception handlers
adr x0, _exception_vectors
msr VBAR_EL1, x0 // Put the exception vector table address into VBAR_EL1
// Set top of stack.
// The linker script defines the __stack symbol to be the top of OS RAM.
// Stack grows down toward the code.
//
// When we set the SPSR_EL2 register to SPSR_EL2_VALUE above, the
// SPSR_EL2_EL1h bits are set which means that on the eret above, SPSel set to 1 for
// execution in EL1. This matters as it will impact which
// exception vector table entry is chosen when an exception (or interrupt) occurs.
// If the exception is processed in EL1 with SPSel == 1, then the 't' suffixed
// entries are used.
//
// SPSel == 1 means each exception level will have its own stack pointer (SP).
// I *think* this makes sense from a security standpoint as it eliminates a risk of a
// stack based attack in EL0 getting into EL1. The flip side is that we will have to
// handle the stacks individually and need to insure we do not overrun them.
// Not setting up the SPs correctly should be a catastrophic error on startup.
adrp x19, __el0_stack_top
add x19, x19, :lo12:__el0_stack_top
msr SP_EL0, x19 // Set the stack pointer for EL0
// Clear bss
// TODO - make this more robust for BSS segments not a multiple of 8 bytes in length
adrp x1, __bss_start
add x1, x1, :lo12:__bss_start
ldr w2, =__bss_size_in_double_words
cbz w2, initialize_cpp_statics // Skip the loop if the bss section is zero length
loop_for_bss_clear:
str xzr, [x1], #8 // Set 8 bytes at a time - the section
// in the linker script must match
sub w2, w2, #1
cbnz w2, loop_for_bss_clear
// Initialize the C++ statics
initialize_cpp_statics:
adrp x19, __init_array_start
add x19, x19, :lo12:__init_array_start
adrp x20, __init_array_end
add x20, x20, :lo12:__init_array_end
cmp x19, x20
beq branch_to_kernel_main
loop_for_next_static_initializer:
ldr x0, [x19], 8
blr x0
cmp x19, x20
bne loop_for_next_static_initializer
// Jump to the kernel main function, must have C style linkage
branch_to_kernel_main:
bl kernel_main
// If execution returns from the kernel main, then halt the core
b ParkCore
Linker Directives
Below the #defines but just before the assembly code, there are 3 linker directives. The first:
.section ".text.boot"
instructs the linker to place the code in the file into the ‘text.boot‘ section of the memory map. The ‘text‘ section is referenced in the linker script described in the previous post. The next directive simply tells the linker to expose the symbol _start global.
Determining Current Exception Level
AArch_64 has a dedicated register for holding the current exception level, unsurprisingly named ‘CurrentEL‘. Bits 2 and 3 of this register hold the exception level – which is binary 0 through 3 for exception levels 0 through 3 respectively.
mrs x0, CurrentEL
and x0, x0, #12 // clear reserved bits
cmp x0, #12
bne running_in_el2 // Not running on EL3, so skip to EL2 entry point
Configuration in EL3
There are a number of probably non-obvious initialization steps in the boot code. I found these in the RPi Armstub, though I believe they are also described in the ARM Documentation.
First, the L2 Cache for EL1 is configured with a latency of 3 cycles. This register needs to be configured early in the boot process, before memory access occurs. Next, the floating point and SIMD instruction sets are enabled.
The Secure Configuration Register for EL3 (SCR_EL3) is initialized next. The bits set in the SCR_EL3 register appear in the code, and their meaning can be found in the ARM documentation. After the SCR, the Auxiliary Configuration Register for EL3 (ACTLR_EL3) is initialized. The ACTLR_EL3 register contains implementation defined features – so the reference for them will be with the actual processor documentation. After the ACTLR_EL3 register, the CPU Extended Control Register for EL1 (CPUECTLR_EL1) is initialized. Again, documentation is the best place for more detail.
Identifying the RPI Type and Setting Up the Physical Timer
Next, the boot code jumps to a subroutine which identifies the Raspberry Pi Board Type, currently RPi3 or RPi4. This needs to be done to correctly configure the Physical Timer and to determine how to configure the interrupt controller.
The Physical Timer must be configured in EL3 and configuration is different between BCM 283x and BCM 2711 peripherals. The code for identifying the board type and setting up the timer can be found my Github repo. After initializing the Physical Timer, the code will then initialize the Generic Interrupt Controller (GIC 400 in this case) if the board is an RPi4, otherwise the GIC400 initialization is skipped for the RPi3 family. The RPi3 does not contain a GIC.
Further down the boot code, IdentifyBoardType is called again in the boot code, which may seen odd. This is a bit inefficient but fortunately there is not a lot of code in the identification subroutine. The second call to IdentifyBoardType is needed as it occurs in EL1 and is then stored in a global variable which is then accessible to the kernel code. The value cannot be stored after the first call to IdentifyBoardType, as that call is made in EL3 and EL3 has a separate memory space which is not shared with EL1.
Switching to EL2
Just after configuring the Physical Timer and conditionally configuring the GIC, the boot code initializes the System Control Register for EL2 (SCTLR_EL2). This register and the initialization values are in the ARM Documentation. After SCTLR_EL2 is initialized, we jump to the EL2 Exception Level – if we entered in EL3.
// Switch to EL2
mov x0, #SPSR_EL3_VAL
msr SPSR_EL3, x0
adr x0, running_in_el2
msr ELR_EL3, x0
eret
// Running at EL2 - processor will be in EL2 on entry with the default ARMSTUB
running_in_el2:
In the code snippet above, the Saved Program Status Register for EL3 (SPSR_EL3) is initialized and the address of the ‘running_in_el2‘ symbol is loaded into the Exception Level Return Register for EL3 (ELR_EL3). The meaning of the bits set in the SPSR_EL3 register are in the ARM documentation, but the one worth noting here is the last 4 bits which are initialized with the value 9 which tells the PE to shift to EL2H mode on return from the exception routine. When the ‘eret‘ instruction is executed, then the Exception Level is changed to EL2 and the program counter picks up at the address in ELR_EL3, which is the ‘running_in_el2‘ symbol.
Single-Core Execution
At present, the boot code exits running in single-core mode. PE 0 is used for execution and PE2 1, 2 and 3 are parked in an infinite loop. This is temporary and will be relaxed for SMP execution later – right now, single threaded execution is all that is needed. There are a number of other examples of booting with multiple cores available.
// Read CPU id, stop slave cores
mrs x1, MPIDR_EL1
and x1, x1, #3
cbz x1, continue_on_core0
// CPU id > 0, stop
b ParkCore
In the code above, the bottom 2 bits of the Multiprocessor Affinity Register (MPIDR_EL1) are checked to see if they hold the value of 0. The MPIDR_EL1 register holds information on the multi-processing state of the hardware and the different PEs. The value of zero in the bottom 2 bits of the register indicates PE 0 is running for an RPi quad PE CPU. There is not much documentation on this register, so for other CPU configurations – you will likely have to do some research to find the magic values to check.
If the current PE is not PE 0, then the PE is simply put into an infinite loop with the ‘wfe‘ instruction to let the CPU know the PE can go into a low power state.
Setting the Stack Pointer for EL1
In the code snippet that follows, the stack pointer for EL1 is set to values associated with symbols defined in the linker script. The stack grows down from the indicated location toward the heap which grows up from the end of the program code. The EL1 stack pointer can only be set in EL2 or EL3.
// Sets stack for EL1 - can only be done in EL2 or EL3
adrp x19, __el1_stack_top
add x19, x19, :lo12:__el1_stack_top
msr SP_EL1, x19
Processor Configuration
After setting the stack pointer for EL1, there are a handful of configurations for the counter/timer register, disabling EL2 traps for a variety of architectural features, enabling AArch64 in EL1 and finally configuring the CPU for execution in EL1 and EL0. A number of these settings are rather cryptic, particularly for CPTR_EL2, HSTR_EL2 and CPACR_EL1, so consult the ARM documentation before modifying values. In general, all traps from EL1 or EL0 to EL2 are disabled (as we are not implementing a hypervisor – at least yet) and traps from EL0 to EL1 for various instructions are also disabled. If you change the architectural features enabled, you should double-check the instruction traps.
I am not expert on these settings, they appear to be ‘standard’ for RPi bare-metal code. For other AArch64 implementations with special execution requirements (like Streaming SVE Mode) the configuration will be different.
// Enable CNTP for lower exception levels
mrs x0, CNTHCTL_EL2
orr x0, x0, #3
msr CNTHCTL_EL2, x0
msr CNTVOFF_EL2, xzr
// Disable various CPU architectural feature traps
mov x0, #0x33FF
msr CPTR_EL2, x0
msr HSTR_EL2, xzr
mov x0, #(3 << 20)
msr CPACR_EL1, x0
// Enable AArch64 in EL1
mov x0, #(1 << 31) // AArch64
orr x0, x0, #(1 << 1) // SWIO hardwired on Pi3
msr HCR_EL2, x0
mrs x0, HCR_EL2
// Configure CPU for EL1 and EL0
mov x2, #0x0800
movk x2, #0x30d0, lsl #16
msr SCTLR_EL1, x2
Switching to EL1
Much like moving from EL3 to EL2, to move from EL2 to EL1, it is necessary to setup the ‘EL return address’ register and execute the ‘eret‘ instruction.
// Change execution level to EL1
mov x2, SPSR_EL2_VALUE
msr SPSR_EL2, x2
adr x2, running_in_el1 // Load the EL2 exception return reg with
// the running_in_el1 symbol so we go there on eret
msr ELR_EL2, x2
eret // Exception level may only change on an exception return
running_in_el1:
bl IdentifyBoardType // Identify the board again in EL1 so we can store it
adrp x19, __hw_board_type // Store the board type in the global __hw_board_type.
// We do this here as EL3 has its own memory
add x19, x19, :lo12:__hw_board_type // which is not accessible from EL1.
str w0, [x19]
After switching to EL1, the code gets the board identity again and stores it in a global variable for use from kernel code.
Setting Up Exception Vectors
AArch64 exception vector tables are setup in memory and are used to identify the correct handler for different exceptions. Recall, exceptions are a super-set of just interrupts. The code required for setting up the vectors can be found in my Github repository in the isr_kernel_entry.S file. The key elements in that file are the kernel entry and exit code which just saves the registers on entry and restores them on exit and the exception table itself.
// Set up exception handlers
adr x0, _exception_vectors
msr VBAR_EL1, x0 // Put the exception vector table address into VBAR_EL1
Setting the Stack Pointer for EL0 Exceptions
Code required to setup the stack pointer for EL0 exceptions appears below. The comments in the snippet describe the interaction of exception processing and stack pointers. In short, if SPSel == 1, then h suffixed vectors are used and each exception level will have its own stack pointer. The CPU could be configured to share a stack pointer between EL1 and EL0 and that could be fine for bare metal code executing only in EL1 but for an OS with processes running in EL0, we should have different stacks.
For clarity, each process in EL0 will have a different stack of its own. The EL0 stack here is for exception processing where the exception code is run in EL0. Looking through the documentation, it appears as if SPSel settings are partly driven by support for the Linux exception processing model.
// Set top of stack.
// The linker script defines the __stack symbol to be the top of OS RAM.
// Stack grows down toward the code.
//
// When we set the SPSR_EL2 register to SPSR_EL2_VALUE above, the
// SPSR_EL2_EL1h bits are set which means that on the eret above, SPSel set to 1 for
// execution in EL1. This matters as it will impact which
// exception vector table entry is chosen when an exception (or interrupt) occurs.
// If the exception is processed in EL1 with SPSel == 1, then the 'h' suffixed
// entries are used.
//
// SPSel == 1 means each exception level will have its own stack pointer (SP).
// I *think* this makes sense from a security standpoint as it eliminates a risk of a
// stack based attack in EL0 getting into EL1. The flip side is that we will have to
// handle the stacks individually and need to insure we do not overrun them.
// Not setting up the SPs correctly should be a catastrophic error on startup.
adrp x19, __el0_stack_top
add x19, x19, :lo12:__el0_stack_top
msr SP_EL0, x19 // Set the stack pointer for EL0
As is the case for EL1, the symbol used for the EL0 top of stack is found in the linker script.
Clearing the BSS Segment for C Code
The C Language model prescribes that the BSS segment, which holds uninitialized data, be zeroed prior to execution jumping to the ‘main()‘ function. The code below relies on symbols from the linker script to zero out the BSS. The is done in 8 byte chunks, so the alignment needs to be correct.
// Clear bss
// TODO - make this more robust for BSS segments not a multiple of 8 bytes in length
adrp x1, __bss_start
add x1, x1, :lo12:__bss_start
ldr w2, =__bss_size_in_double_words
cbz w2, initialize_cpp_statics // Skip the loop if the bss section is zero length
loop_for_bss_clear:
str xzr, [x1], #8 // Set 8 bytes at a time - the section
// in the linker script must match
sub w2, w2, #1
cbnz w2, loop_for_bss_clear
There is some discussion of the bss segment in my post on Linker Scripts.
Initializing C++ Static Globals
In the C++ Language Model, global static variables must be initialized prior to execution of the ‘main()‘ function. For static class instances, this will require invoking the class constructor and passing the correct memory location for the class instance.
Fortunately, C++ compilers do the heavy lifting for us. The compiler generates an array of void functions which can be called just prior to jumping to the ‘main()‘ function which will initialize each static variable. All the initialization code must do is walk the array and call the functions. The code below does just that.
// Initialize the C++ statics
initialize_cpp_statics:
adrp x19, __init_array_start
add x19, x19, :lo12:__init_array_start
adrp x20, __init_array_end
add x20, x20, :lo12:__init_array_end
cmp x19, x20
beq branch_to_kernel_main
loop_for_next_static_initializer:
ldr x0, [x19], 8
blr x0
cmp x19, x20
bne loop_for_next_static_initializer
This is the last step performed in the boot code before jumping to the kernel main.
Jumping to Kernel Main
Finally, we have the branch to kernel_main(). One detail here is that I chose to use the symbol name kernel_main() instead of main() specifically to avoid any risk of ‘special handling of main()’ applied by the compiler or linker.
// Jump to the kernel main function, must have C style linkage
branch_to_kernel_main:
bl kernel_main
// If execution returns from the kernel main, then halt the core
b ParkCore
If execution returns from kernel_main(), the PE is just parked.
Conclusion
The post above provides *mostly complete* bare metal boot code for RPi3 or 4 platforms running in AArch64. Code referenced above can be found in the associated Github Repository.