
The IBM eBios for NUMA systems

Introduction

The IBM eBios is a revision of the eBios from HAL Computer Systems, Inc. Its purpose is to take a number of 4-way SMP machines and combine them into a single NUMA machine, modifying the memory structure as supported by the MCU. The NUMA machine is presented as a large merged SMP to simplify operating system support.

The complete NUMA system consists of a number of 4-way SMP nodes. Each node is a complete, separate, 4-way SMP computer system. The nodes are connected with MCU cards, which can be directly attached (2 nodes) or connected through a MIC and a switch.

When the system boots, each node boots separately and runs the BIOS. After the BIOS sets up the base machine, the eBios runs. The eBios allows the separate machines to be combined in various ways, and then returns control to the BIOS, which boots the operating system.

The eBios begins by printing a message and waiting for a key press. Three keys are available: F1 (the setup menu), F7 (the F7 path, described below), and ESC (bypass the setup menu).

The eBios waits until a key is pressed or 10 seconds have elapsed. A timeout is treated the same as pressing ESC.

Pressing F1 is the normal path for the eBios.

The F1 Setup Menu

The F1 menu allows the eBios operation to be controlled in many ways. The following options are available. The arrow keys (up and down) can be used to select a particular option. The "enter" key can then be used to cycle among the possible values for that option.

The F1 menu has the following options:

            SETUP MENU

  "Extended BIOS"    Disable/NUMA/Cluster
  "Node ID"          0/1/2/3
  "Domain Members"   0/1/2/3/4/5/6/7/8/9/A/B/C/D/E/F
  "Switch"           No/Yes
  "Configuration"    auto/2/4
  "Debug Mode"       None/Debug/Trace
                     * cpus
  "ESC - Exit, F10 - Save&Exit"

Extended BIOS mode

The "Extended BIOS" option controls the overall processing of the eBios.

The "Disable" value causes the eBios to skip all further processing and exit directly to the BIOS, which boots the operating system. This allows the node to continue as its own separate system.

The "Cluster" value causes the eBios to set up the MCU and the MIC to allow communication, but does not adjust the memory map of the nodes. Each node continues to operate independently, but could use a device driver to use the MCU for inter-node communication. The "Node ID" and "Switch" option values are important, but the "Domain Members" and "Configuration" options are not used for "Cluster" mode operation.

The "NUMA" value causes the eBios to reconfigure the system as a NUMA machine. Only one of the nodes (the Master) will continue to boot the operating system. The others are reconfigured as just extra memory and processors for the "SMP" image of the Master node.

Node ID

The "Node ID" is used to assign a node identifier to each node. The node identifier is used to determine how messages are routed through the switch. For a direct connection, any node IDs can be used. For a system with a switch, the node IDs are determined by the cabling from the node to the switch. The Fujitsu switch has 6 ports, labelled "Port 1", "Port 2", through "Port 6". The routing table is set up so that messages for "Node 0" are sent to "Port 1", messages for "Node 1" are sent to "Port 2", and so on, so that messages for "Node 5" are sent to "Port 6". Thus, the cabling of the system determines the node IDs that should be used for each node.
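The routing rule above is a fixed offset of one between node ID and port number. The following is a minimal sketch of that rule, assuming the 6-port switch described here; the function names are hypothetical and are not part of the eBios.

```python
SWITCH_PORTS = 6  # the switch described above has ports 1 through 6


def port_for_node(node_id: int) -> int:
    """Return the switch port that receives messages for the given node ID."""
    if not 0 <= node_id < SWITCH_PORTS:
        raise ValueError(f"node ID {node_id} has no port on this switch")
    return node_id + 1  # Node 0 -> Port 1, ..., Node 5 -> Port 6


def node_for_port(port: int) -> int:
    """Return the node ID to set on a node that is cabled to the given port."""
    if not 1 <= port <= SWITCH_PORTS:
        raise ValueError(f"port {port} does not exist on this switch")
    return port - 1
```

A node cabled to "Port 3" would therefore be given Node ID 2 in the setup menu.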

Domain Members

Once the Node IDs are defined, we can define those nodes that are to be grouped into a coherence domain -- that is, the nodes of the NUMA system. The nodes of a NUMA system define a set. We represent that set as a bitmask. The least significant bit represents Node 0, the next least significant bit Node 1, and so on. The bitmask is then represented in hex. Thus a NUMA machine with nodes 0 and 1 has a Domain Members value of 3, a NUMA machine with nodes 0, 1, 2, and 3 has a Domain Members value of F, and a NUMA machine with nodes 0, 2, and 3 has a bitmask of 1101, which is a value of D.
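The encoding can be checked with a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical.

```python
def domain_members(nodes):
    """Build the Domain Members bitmask: bit n is set when Node n is a member."""
    mask = 0
    for node_id in nodes:
        mask |= 1 << node_id
    return mask


def as_menu_value(mask):
    """Format a bitmask as the single hex digit shown in the setup menu."""
    return format(mask, "X")


for nodes in ([0, 1], [0, 1, 2, 3], [0, 2, 3]):
    mask = domain_members(nodes)
    print(nodes, format(mask, "04b"), as_menu_value(mask))
```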

The Domain Members option of the eBios can be used to partition the machine into several different NUMA systems at the same time. For example, nodes 0 and 2 can create a 2-node (8 processor) NUMA machine with a Domain Members value of 5 (0101) while at the same time nodes 1 and 3 create a separate 2-node (8 processor) NUMA machine with a Domain Members value of A (1010).
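For a partitioned machine, the masks of the separate NUMA systems should not share any node. A minimal sketch of that check follows, with hypothetical helper names; the eBios itself is not known to perform this check.

```python
def nodes_in(mask):
    """List the node IDs whose bits are set in a Domain Members bitmask."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


def partitions_are_disjoint(masks):
    """Return True when no node appears in more than one coherence domain."""
    seen = 0
    for mask in masks:
        if seen & mask:
            return False
        seen |= mask
    return True
```

With the example above, `nodes_in(0x5)` gives nodes 0 and 2, `nodes_in(0xA)` gives nodes 1 and 3, and the two masks have no bit in common.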

The lowest numbered Node ID in a coherence domain defines the Master of the NUMA machine; the other nodes are clients. The eBios causes the clients to wait, while the Master boots the operating system. The master of a NUMA machine is automatically determined by the Domain Members option.
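Because the Master is the lowest numbered member, it corresponds to the lowest set bit of the bitmask. The sketch below shows one way to compute that; the function names are hypothetical.

```python
def master_of(mask):
    """Return the lowest numbered node in a Domain Members bitmask."""
    if mask == 0:
        raise ValueError("an empty domain has no Master")
    lowest_bit = mask & -mask          # isolates the least significant set bit
    return lowest_bit.bit_length() - 1


def role(node_id, mask):
    """Classify a node as 'master' or 'client' within its own domain."""
    if not (mask >> node_id) & 1:
        raise ValueError(f"node {node_id} is not a member of domain {mask:X}")
    return "master" if node_id == master_of(mask) else "client"
```

For a Domain Members value of A (nodes 1 and 3), node 1 is the Master and node 3 is a client.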

Switch

The Switch option controls which of two sets of code is used to set up communication between the nodes. If a switch is used, this value should be set to "Yes". If a direct connection is used, the value should be set to "No". If a direct connection is used, at most two nodes can be linked into a NUMA system.

Configuration

The "Configuration" option selects one of two memory configurations supported by the MCU. The "2-node" value selects the 2-node memory configuration (one node gets the lower 2 gigabytes of memory and the other gets the upper 2 gigabytes). The "4-node" value selects the 4-node memory configuration (each node gets 1 gigabyte, determined by its order in the coherence domain membership). The "auto" value causes the eBios to select either the 2-node or 4-node memory configuration depending upon the number of nodes in the NUMA system: for 1 or 2 nodes, the 2-node configuration is used; for 3 or 4 nodes, the 4-node configuration is used.
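The selection rule and the resulting address ranges can be sketched as follows. Two things here are assumptions, not statements from the eBios source: that ranges are handed out in ascending node ID order (the text only says one node gets the lower half and the other the upper half), and the function names themselves.

```python
GIB = 1 << 30


def resolve_configuration(setting, mask):
    """Turn the menu value ('auto', '2' or '4') into a 2- or 4-node layout."""
    if setting == "auto":
        members = bin(mask).count("1")
        return 2 if members <= 2 else 4
    if setting in ("2", "4"):
        return int(setting)
    raise ValueError(f"unknown Configuration value {setting!r}")


def memory_range(node_id, mask, config):
    """Return (base, limit) of a node's memory, limit exclusive.

    Assumption: slots are assigned by position in the sorted member list.
    """
    members = [n for n in range(mask.bit_length()) if (mask >> n) & 1]
    index = members.index(node_id)
    if index >= config:
        raise ValueError("more members than the memory configuration supports")
    size = 2 * GIB if config == 2 else 1 * GIB
    return index * size, (index + 1) * size
```

Under these assumptions, node 2 in a domain of D (nodes 0, 2, 3) is the second member and would occupy the second gigabyte.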

Software which can support either memory configuration can use the "auto" value, giving the largest possible memory space to each node. If the operating system software is written explicitly for one configuration or the other, this option can be set to force the use of the desired memory configuration.

Debug Mode

Debug mode may or may not be present on the menu. If present it can be set to "None", creating no debug output, or set to "Debug". The "Debug" value causes a number of internal variables to be printed to the screen, allowing an eBios development programmer to follow the execution of the eBios. An internal Trace mode is also available, but must be used very carefully, and so is currently turned off. The entire Debug structure can be disabled by an assembly time (compile time) switch setting (DEBUG in global.inc), reducing the size of the eBios by about 30 percent.

Number of CPUs

A final line in the setup menu shows the number of processors (cpus) for this node. Normally this will indicate 4 processors. For development and testing purposes, it is sometimes helpful to limit the number of processors. The number of processors that are in use for a particular execution can be limited by adjusting the MP Configuration Table that is presented by the BIOS to the Operating System. The MP Configuration table describes the processors, I/O buses, and memory of an SMP. The BIOS creates this table for each node, and the eBios merges the individual tables for each node into one global table describing the NUMA system. If a processor entry is not included in the MP Configuration Table, that processor will not be used by the operating system -- it will sit idle.

The number of processors per node is normally 4. Pressing key F1 while the Setup Menu is up will cause the operating system to use only one (1) processor from this node. Pressing key F2 will set the number of processors to 2, F3 will use 3 processors, and F4 will use 4 processors. Pressing key F5 will set the number of processors to 0. This option means that all memory and I/O devices for this node will be contributed to the NUMA system, but no processors. It is not clear that this is a viable option, since interrupts for local I/O devices on this node will need to be handled by a processor on this node.
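The effect of these keys on the MP Configuration Table can be illustrated with a simplified model. The table entries below are plain dictionaries invented for the example, not the binary layout of a real MP Configuration Table, and `limit_processors` is a hypothetical name.

```python
# Key pressed in the Setup Menu -> number of processors left in the table.
CPU_LIMIT_KEYS = {"F1": 1, "F2": 2, "F3": 3, "F4": 4, "F5": 0}


def limit_processors(entries, key):
    """Drop processor entries beyond the limit; keep all other entries."""
    limit = CPU_LIMIT_KEYS[key]
    kept = []
    processors = 0
    for entry in entries:
        if entry["type"] == "processor":
            if processors >= limit:
                continue  # omitted from the table, so the OS never starts it
            processors += 1
        kept.append(entry)
    return kept


table = [{"type": "processor", "id": i} for i in range(4)]
table.append({"type": "bus", "id": 0})
print(len(limit_processors(table, "F2")))  # two processors plus the bus entry
```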

Operation and Interaction of Option Settings

Once the values of the options are selected, pressing either ESC or F10 will cause the system to configure as requested and boot the Operating System. The F10 key will write the selected options to flash memory. Future boots will use these values as their default values. The ESC key will use the selected values, but not save them; the old defaults remain the defaults for the next boot.

Values can be selected for options for each node independently, but must be consistently selected for the system to be configured correctly. For example, the Switch setting must reflect the communication hardware that is being used. The Node IDs must reflect the cabling to the switch if a switch is used. All members of a coherence domain must have the same value for Domain Members, so that exactly one Master is selected and the client members of the NUMA system are known. Every node must be a member of its own "Domain Members" bitmask. The Configuration settings for the members of a coherence domain must be the same so that memory is defined correctly on all members of a coherence domain.
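These consistency rules lend themselves to a mechanical check. The sketch below validates the settings of the nodes in one coherence domain; the dictionary layout and the function name are assumptions made for the example, and the limit of two nodes on a direct connection is taken from the Switch section above.

```python
def check_settings(settings):
    """settings maps node ID -> {'members': int, 'switch': bool, 'configuration': str}.

    Returns a list of human-readable problems; an empty list means consistent.
    """
    problems = []
    for node_id, s in sorted(settings.items()):
        if not (s["members"] >> node_id) & 1:
            problems.append(f"node {node_id} is missing from its own Domain Members")
    for option in ("members", "switch", "configuration"):
        if len({s[option] for s in settings.values()}) > 1:
            problems.append(f"{option} differs between nodes")
    for s in settings.values():
        if not s["switch"] and bin(s["members"]).count("1") > 2:
            problems.append("a direct connection links at most two nodes")
            break
    return problems
```

For example, if node 0 is set to a Domain Members value of 3 and node 1 to a value of 1, the check reports that node 1 is missing from its own mask and that the two masks disagree.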

Some option settings are almost the same. For example, a Cluster setting for the Extended BIOS option is almost the same as a NUMA setting with a coherence domain of one node. The difference is that in a Cluster the memory configuration is not changed, while in a NUMA setting the memory configuration is reset to either the 2-node or 4-node memory map (as selected by the Configuration option). Thus, for a one node system, the node can have a full 4 Gig memory range (Cluster), a 2 Gig memory range (NUMA, 2-node), or a 1 Gig memory range (NUMA, 4-node).

Once a setup configuration has been defined, the eBios adjusts the processors, memory, I/O devices, and the MP Configuration Table to look like the selected SMP/NUMA machine. The F7 path is taken if an error occurs during the execution of the eBios, or if it is selected directly by the user instead of the F1 setup menu.

The F7 Setup Path

The F7 path provides 4 options:

  1. Restart Extended BIOS (start over)
  2. Run Extended Diagnostic
  3. Run eBios Debugger
  4. Boot as an SMP machine (exit eBios)

Option 1 can be used to recover from some types of user errors and returns the eBios to the point where it is waiting for F1/F7/ESC/timeout.

Option 2 runs an extended set of diagnostics. This is left over from the HAL code, and I do not think it works.

Option 3 runs a simple eBios debugger that allows memory (and memory mapped I/O) to be read and written.

Option 4 exits the eBios (with no NUMA support), returning control to the BIOS which will boot the operating system.

The eBios debugger has three commands: r (read), w (write) and q (quit). Reading and writing memory requires a 32-bit address (in hex). For writing, you also need to specify the contents to be written into the 32-bit word. The quit command takes you back to the F7 menu above.
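The command loop can be modelled in a few lines. The exact command syntax and output format of the real debugger are not documented here, so the forms `r ADDR`, `w ADDR VALUE` and `q`, and the output layout, are assumptions; memory is simulated with a dictionary rather than real hardware access.

```python
MASK32 = 0xFFFFFFFF


def run_debugger(commands, memory):
    """Process r/w/q commands against a dict that maps address -> 32-bit word."""
    output = []
    for line in commands:
        parts = line.split()
        if not parts:
            continue
        command = parts[0].lower()
        if command == "q":
            break  # the real debugger returns to the F7 menu here
        if command == "r" and len(parts) == 2:
            address = int(parts[1], 16) & MASK32
            output.append(f"{address:08X}: {memory.get(address, 0):08X}")
        elif command == "w" and len(parts) == 3:
            address = int(parts[1], 16) & MASK32
            memory[address] = int(parts[2], 16) & MASK32
        else:
            output.append("?")
    return output


memory = {}
session = run_debugger(["w 000A0000 DEADBEEF", "r 000A0000", "q", "r 0"], memory)
```

The final `r 0` is never executed, since `q` ends the session first.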

Design

The eBios is a distributed program running on multiple nodes at the same time which gets them to cooperate to form a NUMA system.

When it first starts, each node is executing the same code. At a later point each node classifies itself (according to the Domain Membership bitmask) as either the Master or a client. The Master executes one sequence of instructions while the clients execute a different sequence. At the end, the Master returns to the BIOS to boot the Operating System, while the clients all halt, waiting to be awakened by the Operating System when it needs to use the additional processors of its multi-processor system.

All nodes search their PCI busses for an IBM Opium card, and, if found, initialize the TC5s.

Then the nodes each wait for F1 (setup), F7 (exit), or a timeout (or ESC) which bypasses the F1 setup menu.

After the F1 setup menu, a Disable mode causes the eBios to exit. The Cluster mode and NUMA mode continue to initialize the MCU and either the direct connection or the MIC (depending upon the Switch option). At this point, a Cluster mode exits the eBios.
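The way the three modes diverge can be summarized as a small decision function. The step names are descriptive labels chosen for this sketch, not identifiers from the eBios source.

```python
def ebios_steps(mode, switch):
    """Return the high-level steps taken after the setup menu for a given mode."""
    if mode == "Disable":
        return ["exit to BIOS"]
    link = "initialize MIC" if switch else "initialize direct connection"
    steps = ["initialize MCU", link]
    if mode == "Cluster":
        return steps + ["exit to BIOS"]
    if mode == "NUMA":
        return steps + ["merge nodes into NUMA system"]
    raise ValueError(f"unknown Extended BIOS mode {mode!r}")
```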

The NUMA mode begins a sequence of messages between the Master and the clients. Each node can use its Domain Membership bitmask to decide if it is the Master or a Client. The Master sends a message to each client in turn, asking it to agree that it is the Master of the domain. The Master waits as long as necessary, retransmitting the request at intervals, to get a reply. This allows the various nodes to boot at different times and speeds -- nothing continues until all clients respond that the Master is the Master.

During this time, and in all other places where we wait for hardware (like the cable or switch) or other nodes (like the messages), you can break the loop by pressing ESC. Typically this terminates the normal sequencing of the system and takes you to the F7 menu.

If ESC is pressed while we are waiting for a client to acknowledge the Master, that client is removed from the domain membership, and the Master continues with the other members of the domain.
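The acknowledgement loop, including the ESC escape hatch, can be simulated as follows. The callbacks `request_ack` and `esc_pressed` are hypothetical stand-ins for the message hardware and the keyboard, and the delay between retransmissions is omitted.

```python
def collect_acknowledgements(master, mask, request_ack, esc_pressed):
    """Ask each client in turn to accept the Master; return the final bitmask."""
    for client in range(mask.bit_length()):
        if client == master or not (mask >> client) & 1:
            continue
        while not request_ack(client):      # retransmit until the client replies
            if esc_pressed():
                mask &= ~(1 << client)      # drop this client from the domain
                break
    return mask


# Simulation: client 1 answers on the third request, client 2 never answers
# and the operator presses ESC after five requests to it.
attempts = {1: 0, 2: 0}


def request_ack(client):
    attempts[client] += 1
    return client == 1 and attempts[1] >= 3


def esc_pressed():
    return attempts[2] >= 5


remaining = collect_acknowledgements(0, 0x7, request_ack, esc_pressed)
```

In this simulation the domain shrinks from 7 (nodes 0, 1, 2) to 3 (nodes 0, 1).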

Once all clients have acknowledged the Master, we enter a sequence of message communication to merge the separate nodes into one NUMA system. Each message is sent to each of the clients, one after another. Then we go on to the next message.

The first message simply synchronizes all systems -- the master sends a "Client Ready" message, and the client acknowledges this.

The next message sends the highest PCI I/O space address to the client. The client adjusts its PCI I/O space to be above the previous high, and returns an acknowledgement message with the new highest PCI I/O space address (which is passed on to the next client). This separates the range of PCI I/O addresses on each node. At the same time, the client also adjusts its PCI memory address range down to the top of its new memory range (as defined by the 2-node or 4-node memory configuration defined by the MCU).
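The chaining of PCI I/O ranges can be sketched as a running maximum that is passed from client to client. The sizes used below are made-up values, the upper bound is treated as exclusive, and any alignment the hardware may require is ignored; all of these are assumptions of the sketch.

```python
def assign_pci_io(master_top, client_sizes):
    """Place each client's PCI I/O space directly above the previous high.

    client_sizes is a list of (client_id, size) pairs in message order.
    Returns ({client_id: (base, top)}, final_top), with top exclusive.
    """
    top = master_top
    ranges = {}
    for client_id, size in client_sizes:
        base = top             # value carried by the Master's message
        top = base + size      # value returned in the acknowledgement
        ranges[client_id] = (base, top)
    return ranges, top


ranges, final_top = assign_pci_io(0x4000, [(1, 0x1000), (2, 0x2000)])
```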

In addition, the client passes back a memory copy key which allows the master to DMA transfer a block of information from each client. The DMA block is pulled by the Master and contains the MTRR values needed by the client and the MP Configuration Table for the node (modified to include the node number in the processor, bus, and memory entries).

The master merges the MTRR values from each client into a global MTRR table, and merges the MP Configuration Table values from each client into a global MP Configuration Table.
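The merge of the MP Configuration Tables can be pictured with a simplified model in which each table is a list of dictionaries. This is not the real table format, and the helper names are hypothetical; the point is only that every entry carries its node number so that entries from different nodes stay distinguishable in the global table.

```python
def tag_with_node(node_id, entries):
    """Copy a node's entries, adding the node number to each one."""
    return [dict(entry, node=node_id) for entry in entries]


def merge_tables(per_node):
    """Concatenate the tagged tables of all nodes in ascending node order."""
    merged = []
    for node_id in sorted(per_node):
        merged.extend(tag_with_node(node_id, per_node[node_id]))
    return merged


per_node = {
    1: [{"type": "processor", "id": 0}],
    0: [{"type": "processor", "id": 0}, {"type": "bus", "id": 0}],
}
global_table = merge_tables(per_node)
```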

The next step sends a message to each client asking for another Memory Copy key, which is provided in the message acknowledgement. This Memory Copy key is used to push a block of information back to each client, providing it with the merged MTRR table, the merged MP Configuration Table, and the current real-time clock of the Master.

A final message from the Master to the clients tells the clients that the block of information has been DMA'ed to them. They define their own MTRR values to equal the global set and their real-time clock to equal that of the Master. Then each client adjusts the registers of its Memory I/O Controller (the 450NX PCIset MIOC) to place memory at the addresses required by the 2-node or 4-node memory configuration (as selected), and then halts.

The master finishes up and returns to the BIOS to boot the operating system, leaving the merged MP Configuration Table describing the complete NUMA system.