AArch64 Bare Metal Boot Code

The basic bare metal code required to boot an AArch64 system is not terribly complicated, but basic code will not do much. The code below handles most basic setup for a Raspberry Pi 3 or 4 and has a few advanced features not found in other boot code examples. The code assumes RPi 3 or 4 HW, as elements in the BCM283x or BCM2711 peripherals are initialized in the code. It is commented and the BCM-specific code is generally abstracted out, so hopefully it is transparent enough that it can be adapted to different AArch64 platforms and different use cases. Features include:

  • Ability to handle entry in EL2 or EL3
  • Auto-detects Raspberry Pi version
  • Sets up the RPi Physical Timer
  • Sets up the Generic Interrupt Controller (GIC) for the RPi4
  • Sets up the Stacks for EL1 and EL0 Exception Processing
  • Initializes Environment for C Runtime
  • Initializes Environment for C++ Runtime

There are a few limitations at this point:

  • Single Core Only
  • No Virtual Memory Management
  • Semi-specific to RPi 3 & 4, though compatibles *should* work

This is a lengthy post but splitting it into multiple separate posts would probably be distracting. At the 100,000ft level, the idea is that the processor enters the top of this code in either EL3 or EL2, initializes the functions listed above and then exits in EL1 to the C++ code which performs the ‘kernel initialization’ and the kernel itself. All the code discussed in this post for the Raspberry Pi Bare Metal OS project can be found in my Github repository. Not all code in the kernel boot sequence is contained below, particularly the handful of subroutines which initialize different parts of the RPi hardware. Consulting Github for this code will be helpful.

RPi Boot Process Overview

Raspberry Pis have a somewhat unique boot process which works well to prevent bricking of the device. When powered on, it is actually the GPU in the BCM peripheral chip’s video core which starts to run boot code in an internal ROM and eventually starts the ARM processor. The ‘config.txt‘ and ‘cmdline.txt‘ files are loaded and parsed by the video core, and a variety of internal attributes are configured.

Once the two files are parsed and the video core is configured, the GPU loads the ‘armstub‘ file into the right spot in physical memory for the ARM processor to start executing it. On entry to the armstub the ARM core will be running in EL3. It is the armstub file which will eventually jump to the start of the kernel code.

The Raspberry Pi OS ships with ‘armstub8.bin‘ which is the default, but the armstub loaded by the video core can be changed in the ‘config.txt‘ file (consult Github for an example). The default armstub does some initialization before shifting the Exception Level down to EL2 prior to jumping into the kernel.

This project includes a custom ‘armstub’, named ‘armstub_minimal.bin‘. This minimal armstub does nothing more than jump into the kernel code, still at the EL3 Exception Level. This permits the startup code to handle any EL3 initialization that might be required for different use cases. There are a number of elements of the HW that need to be configured in EL3; those appear in the boot code below. The boot code can also be used with the default armstub file shipped with the Raspberry Pi OS, as it detects on entry whether the core is running in EL3 or EL2 and skips all the EL3 initialization if the core is already running in EL2.

Exception Levels

There are four Exception Levels built into ARMv8 cores. I suspect the term ‘Exception Level’ comes from ARMv8 interrupt processing (interrupts are a subset of more general ‘exceptions’) where different hardware or software ‘exceptions’ (not to be confused with C++ or Java exceptions) are tied to different Exception Levels. Additionally, there are sets of instructions that are restricted to a specific Exception Level.

  • EL3 – Highest Exception Level and the only level in which the processor can switch between the Secure and Non-secure states. Code running at EL3 is typically called a ‘Secure Monitor’. EL3 is optional in ARM processors.
  • EL2 – Hypervisor Exception Level. Virtualization code runs at this level, and page fault exceptions generated by the memory manager when using stage 2 address translation are handled here. EL2 is also optional in ARM processors.
  • EL1 – What used to be called ‘Ring 0’ in OS development. This is the level the kernel and most interrupt handlers should execute within.
  • EL0 – What used to be called ‘Ring 3’ in OS development. This is the level within which application code will execute.

Exception Levels can change as a result of either (1) an exception which is handled at a specific (usually higher) exception level, or (2) execution of the ‘eret‘ (exception return) instruction, which permits the core (PE or ‘Processing Element’ in ARM documentation) to potentially drop to a lower exception level. Exceptions can leave the PE at the same EL or move it to a higher level, and conversely the exception return can leave the PE at the same EL or move it down.

The boot code in this post only supports execution in EL1 and EL0. Maybe in the future I will dabble in a lightweight hypervisor which would pull in EL2, but I doubt I will write a Secure Monitor for EL3. It should be noted that on the Cortex-A53 and Cortex-A72 cores used in these boards, EL2 exists only in the Non-secure state, so code that stays in the Secure state cannot use EL2.

Boot Code

The code below is a pretty complete Raspberry Pi AArch64 boot example which has been tested on RPi 3 and 4 but which should be generalizable to any AArch64 system with EL3. Processors without EL3 would require more modifications to initialize subsystems correctly in EL1.

Linker Directives

Below the #defines but just before the assembly code, there are 3 linker directives. The first instructs the linker to place the code in the file into the ‘text.boot‘ section of the memory map. The ‘text.boot‘ section is referenced in the linker script described in the previous post. The next directive simply tells the linker to expose the symbol _start as a global.

Determining Current Exception Level

AArch64 has a dedicated register for holding the current exception level, unsurprisingly named ‘CurrentEL‘. Bits 2 and 3 of this register hold the exception level, encoded as binary 0 through 3 for exception levels 0 through 3 respectively.
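As a rough illustration of that bit layout, the following C++ sketch decodes the exception level from a raw CurrentEL value. The function name `exception_level` is made up for this example, and the value is passed in as a parameter so the logic can be tried on any host; real boot code would read the register with an `mrs` instruction in assembly.

```cpp
#include <cstdint>

// Hypothetical helper: extract the exception level from a raw CurrentEL value.
// The level is held in bits [3:2]; shifting right by 2 and masking with 0x3
// leaves a number from 0 to 3.
constexpr unsigned exception_level(std::uint64_t current_el) {
    return static_cast<unsigned>((current_el >> 2) & 0x3u);
}
```

A raw value of 0xC therefore decodes to EL3 and 0x8 to EL2, which is the distinction the boot code needs in order to decide whether to run or skip the EL3 initialization.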

Configuration in EL3

There are a number of probably non-obvious initialization steps in the boot code. I found these in the RPi Armstub, though I believe they are also described in the ARM Documentation.

First, the L2 Cache for EL1 is configured with a latency of 3 cycles. This register needs to be configured early in the boot process, before memory access occurs. Next, the floating point and SIMD instruction sets are enabled.

The Secure Configuration Register for EL3 (SCR_EL3) is initialized next. The bits set in the SCR_EL3 register appear in the code, and their meaning can be found in the ARM documentation. After the SCR, the Auxiliary Control Register for EL3 (ACTLR_EL3) is initialized. The ACTLR_EL3 register contains implementation-defined features, so the reference for them is the documentation for the actual processor. After the ACTLR_EL3 register, the CPU Extended Control Register for EL1 (CPUECTLR_EL1) is initialized. Again, the processor documentation is the best place for more detail.

Identifying the RPi Type and Setting Up the Physical Timer

Next, the boot code jumps to a subroutine which identifies the Raspberry Pi Board Type, currently RPi3 or RPi4. This needs to be done to correctly configure the Physical Timer and to determine how to configure the interrupt controller.
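The identification subroutine itself lives in the repository and is not reproduced here. One common approach, which may or may not match the repository's, is to read the Main ID Register (MIDR_EL1) and compare its part number field: the RPi 3 uses Cortex-A53 cores (part number 0xD03) and the RPi 4 uses Cortex-A72 cores (part number 0xD08). The C++ sketch below takes the register value as a parameter so it can be exercised on any host; the enum and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical names; the repository's IdentifyBoardType may work differently.
enum class BoardType { Unknown, RPi3, RPi4 };

// MIDR_EL1 holds the primary part number in bits [15:4].
constexpr std::uint32_t midr_part_number(std::uint64_t midr) {
    return static_cast<std::uint32_t>((midr >> 4) & 0xFFFu);
}

// Map the CPU part number to a board family (assumption: A53 means RPi 3,
// A72 means RPi 4; other boards sharing those cores are not distinguished).
inline BoardType identify_board(std::uint64_t midr) {
    switch (midr_part_number(midr)) {
        case 0xD03: return BoardType::RPi3;  // Cortex-A53
        case 0xD08: return BoardType::RPi4;  // Cortex-A72
        default:    return BoardType::Unknown;
    }
}
```

Keying on the CPU part number needs no peripheral access, which is convenient this early in boot because the peripheral base address is itself one of the things that differs between the two boards.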

The Physical Timer must be configured in EL3 and configuration is different between BCM283x and BCM2711 peripherals. The code for identifying the board type and setting up the timer can be found in my Github repo. After initializing the Physical Timer, the code will then initialize the Generic Interrupt Controller (GIC-400 in this case) if the board is an RPi4; the GIC-400 initialization is skipped for the RPi3 family, as the RPi3 does not contain a GIC.

Further down the boot code, IdentifyBoardType is called again, which may seem odd. This is a bit inefficient but fortunately there is not a lot of code in the identification subroutine. The second call to IdentifyBoardType is needed because it occurs in EL1, where the result can be stored in a global variable that is accessible to the kernel code. The value cannot be stored after the first call to IdentifyBoardType, as that call is made in EL3 and EL3 has a separate memory space which is not shared with EL1.

Switching to EL2

Just after configuring the Physical Timer and conditionally configuring the GIC, the boot code initializes the System Control Register for EL2 (SCTLR_EL2). This register and the initialization values are in the ARM Documentation. After SCTLR_EL2 is initialized, we jump to the EL2 Exception Level – if we entered in EL3.

In the code snippet above, the Saved Program Status Register for EL3 (SPSR_EL3) is initialized and the address of the ‘running_in_el2‘ symbol is loaded into the Exception Link Register for EL3 (ELR_EL3). The meaning of the bits set in the SPSR_EL3 register is in the ARM documentation, but the field worth noting here is the lowest 4 bits, which are initialized with the value 9. This tells the PE to shift to EL2h mode (EL2 using its own stack pointer) on return from the exception routine. When the ‘eret‘ instruction is executed, the Exception Level changes to EL2 and the program counter picks up at the address in ELR_EL3, which is the ‘running_in_el2‘ symbol.
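To make the SPSR_EL3 encoding concrete, here is a small C++ sketch that assembles such a value. It assumes all four DAIF exception masks are set alongside the EL2h mode, which is a common choice but not necessarily the exact value used in the repository; the constant and function names are invented for the example.

```cpp
#include <cstdint>

// SPSR_ELx mode field M[3:0] values for AArch64 targets.
constexpr std::uint64_t kModeEl1h = 0x5;  // EL1 using SP_EL1
constexpr std::uint64_t kModeEl2h = 0x9;  // EL2 using SP_EL2

// Exception mask bits: Debug, SError (A), IRQ and FIQ occupy bits 9 down to 6.
constexpr std::uint64_t kMaskD = 1ull << 9;
constexpr std::uint64_t kMaskA = 1ull << 8;
constexpr std::uint64_t kMaskI = 1ull << 7;
constexpr std::uint64_t kMaskF = 1ull << 6;
constexpr std::uint64_t kMaskAll = kMaskD | kMaskA | kMaskI | kMaskF;

// Hypothetical helper: combine the exception masks with the target mode.
constexpr std::uint64_t make_spsr(std::uint64_t mode, std::uint64_t masks) {
    return masks | (mode & 0xFu);
}

// Bits [3:2] of the mode field give the exception level 'eret' returns to.
constexpr unsigned target_el(std::uint64_t spsr) {
    return static_cast<unsigned>((spsr >> 2) & 0x3u);
}
```

Under these assumptions the EL3 to EL2 transition would write 0x3C9 to SPSR_EL3, and the later EL2 to EL1 transition would write the same masks with mode 0x5 (EL1h) to SPSR_EL2.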

Single-Core Execution

At present, the boot code exits running in single-core mode. PE 0 is used for execution and PEs 1, 2 and 3 are parked in an infinite loop. This is temporary and will be relaxed for SMP execution later; right now, single-threaded execution is all that is needed. There are a number of other examples of booting with multiple cores available.

In the code above, the bottom 2 bits of the Multiprocessor Affinity Register (MPIDR_EL1) are checked to see if they hold the value of 0. The MPIDR_EL1 register holds information on the multi-processing state of the hardware and the different PEs. The value of zero in the bottom 2 bits of the register indicates PE 0 is running for an RPi quad PE CPU. There is not much documentation on this register, so for other CPU configurations – you will likely have to do some research to find the magic values to check.

If the current PE is not PE 0, then the PE is simply put into an infinite loop with the ‘wfe‘ instruction to let the CPU know the PE can go into a low power state.
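The core check can be sketched in C++ as follows. The sketch mirrors the post in looking only at the bottom two bits of MPIDR_EL1 (the full Aff0 affinity field is bits [7:0], so other CPUs may need a wider mask); the function names are hypothetical and the register value is passed in as a parameter rather than read with `mrs`.

```cpp
#include <cstdint>

// Hypothetical helper: core number from a raw MPIDR_EL1 value, using only the
// bottom two bits, which is enough for the quad-core RPi CPUs.
constexpr unsigned core_id(std::uint64_t mpidr) {
    return static_cast<unsigned>(mpidr & 0x3u);
}

// Only PE 0 continues with the boot sequence; every other PE is parked.
constexpr bool is_boot_core(std::uint64_t mpidr) {
    return core_id(mpidr) == 0;
}
```

In the boot code itself, a PE for which this test fails would branch to the ‘wfe‘ loop rather than return a value.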

Setting the Stack Pointer for EL1

In the code snippet that follows, the stack pointer for EL1 is set to values associated with symbols defined in the linker script. The stack grows down from the indicated location toward the heap which grows up from the end of the program code. The EL1 stack pointer can only be set in EL2 or EL3.

Processor Configuration

After setting the stack pointer for EL1, there are a handful of configurations for the counter/timer register, disabling EL2 traps for a variety of architectural features, enabling AArch64 in EL1 and finally configuring the CPU for execution in EL1 and EL0. A number of these settings are rather cryptic, particularly for CPTR_EL2, HSTR_EL2 and CPACR_EL1, so consult the ARM documentation before modifying values. In general, all traps from EL1 or EL0 to EL2 are disabled (as we are not implementing a hypervisor – at least yet) and traps from EL0 to EL1 for various instructions are also disabled. If you change the architectural features enabled, you should double-check the instruction traps.

I am not an expert on these settings, but they appear to be ‘standard’ for RPi bare-metal code. For other AArch64 implementations with special execution requirements (like Streaming SVE Mode) the configuration will be different.
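Two of the less cryptic settings can be sketched in C++ as plain bit arithmetic. This assumes the architectural bit positions HCR_EL2.RW (bit 31, which selects AArch64 for EL1) and CPACR_EL1.FPEN (bits [21:20], where 0b11 stops floating point and SIMD instructions from trapping at EL0 and EL1). The constant and function names are invented here, and the values written in the repository may set additional bits.

```cpp
#include <cstdint>

// HCR_EL2.RW: when set, EL1 executes in AArch64 rather than AArch32.
constexpr std::uint64_t kHcrEl2Rw = 1ull << 31;

// CPACR_EL1.FPEN = 0b11: no trapping of FP/SIMD instructions at EL0 or EL1.
constexpr std::uint64_t kCpacrEl1FpenNoTrap = 3ull << 20;

// Hypothetical checks on raw register values.
constexpr bool el1_is_aarch64(std::uint64_t hcr_el2) {
    return (hcr_el2 & kHcrEl2Rw) != 0;
}

constexpr bool fp_simd_untrapped(std::uint64_t cpacr_el1) {
    return (cpacr_el1 & kCpacrEl1FpenNoTrap) == kCpacrEl1FpenNoTrap;
}
```

Naming the fields this way makes it easier to cross-check a magic constant against the ARM documentation before changing it.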

Switching to EL1

Much like moving from EL3 to EL2, to move from EL2 to EL1 it is necessary to set up the ‘EL return address’ register and execute the ‘eret‘ instruction.

After switching to EL1, the code gets the board identity again and stores it in a global variable for use from kernel code.

Setting Up Exception Vectors

AArch64 exception vector tables are set up in memory and are used to identify the correct handler for different exceptions. Recall that exceptions are a superset of interrupts. The code required for setting up the vectors can be found in my Github repository in the isr_kernel_entry.S file. The key elements in that file are the exception table itself and the kernel entry and exit code, which simply saves the registers on entry and restores them on exit.

Setting the Stack Pointer for EL0 Exceptions

Code required to set up the stack pointer for EL0 exceptions appears below. The comments in the snippet describe the interaction of exception processing and stack pointers. In short, if SPSel == 1, then the h-suffixed vectors are used and each exception level has its own stack pointer. The CPU could be configured to share a stack pointer between EL1 and EL0, and that could be fine for bare metal code executing only in EL1, but for an OS with processes running in EL0 we should have different stacks.

For clarity, each process in EL0 will have a different stack of its own. The EL0 stack here is for exception processing where the exception code is run in EL0. Looking through the documentation, it appears as if SPSel settings are partly driven by support for the Linux exception processing model.

As is the case for EL1, the symbol used for the EL0 top of stack is found in the linker script.

Clearing the BSS Segment for C Code

The C Language model prescribes that the BSS segment, which holds uninitialized data, be zeroed prior to execution jumping to the ‘main()‘ function. The code below relies on symbols from the linker script to zero out the BSS. This is done in 8-byte chunks, so the start and end of the segment need to be 8-byte aligned.
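The same loop can be sketched in C++ rather than assembly. In a real image the two pointers would come from linker-script symbols (names such as `__bss_start` and `__bss_end` are an assumption here, not necessarily what the project's script defines); in this sketch they are ordinary parameters so the function can be tried on any host.

```cpp
#include <cstdint>

// Zero the range [begin, end) one 64-bit word at a time. Both pointers must be
// 8-byte aligned, which in a real build is the linker script's responsibility.
inline void zero_bss(std::uint64_t* begin, std::uint64_t* end) {
    for (std::uint64_t* p = begin; p < end; ++p) {
        *p = 0;
    }
}
```

Writing whole 64-bit words matches the 8-byte stores an assembly version would use, and it is why a BSS segment whose size is not a multiple of 8 would either be under-cleared or overrun.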

There is some discussion of the bss segment in my post on Linker Scripts.

Initializing C++ Static Globals

In the C++ Language Model, global static variables must be initialized prior to execution of the ‘main()‘ function. For static class instances, this will require invoking the class constructor and passing the correct memory location for the class instance.

Fortunately, C++ compilers do the heavy lifting for us. The compiler generates an array of void functions which can be called just prior to jumping to the ‘main()‘ function which will initialize each static variable. All the initialization code must do is walk the array and call the functions. The code below does just that.
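A minimal C++ sketch of walking such an array is shown here. With GNU toolchains the array bounds are conventionally exposed by the linker script as `__init_array_start` and `__init_array_end`, but treat those names as an assumption about this project; the sketch takes the bounds as parameters instead, and `InitFn` and `run_init_array` are hypothetical names.

```cpp
// Each entry is a pointer to a compiler-generated function that takes no
// arguments and returns nothing.
using InitFn = void (*)();

// Call every initializer in [begin, end), in array order.
inline void run_init_array(InitFn* begin, InitFn* end) {
    for (InitFn* fn = begin; fn != end; ++fn) {
        (*fn)();
    }
}
```

Calling the entries in array order matters because the compiler and linker arrange them so that initialization order within a translation unit is preserved.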

This is the last step performed in the boot code before jumping to the kernel main.

Jumping to Kernel Main

Finally, we have the branch to kernel_main(). One detail here is that I chose to use the symbol name kernel_main() instead of main() specifically to avoid any risk of ‘special handling of main()’ applied by the compiler or linker.

If execution returns from kernel_main(), the PE is just parked.

Conclusion

The post above provides *mostly complete* bare metal boot code for RPi3 or 4 platforms running in AArch64. Code referenced above can be found in the associated Github Repository.
