The POSTRISC architecture

This document describes the architecture of a non-existent virtual processor. The instruction set architecture and the processor itself are both referred to as POSTRISC, since «post-RISC» is a common name for hypothetical processor projects intended to succeed the RISC (Reduced Instruction Set Computer) architecture. The POSTRISC virtual processor combines what seem to me the best qualities of existing and past architectures.

Main features of the POSTRISC architecture:

Sources are available at https://github.com/bdpx/postrisc. The repository contains the source code, sample programs for POSTRISC, and this description of the virtual processor. Building the program requires cmake and clang++/g++. Building this documentation from its xml sources requires xsltproc/xmllint.

The postrisc console application covers all basic tasks. It uses the standard streams for input/output, which should be redirected from or to files, and is invoked with different options from the command line (table below).

Table 1: Command-line options
Option Operation description
--scan <src.s
scan and report the individual tokens of a source program in the virtual processor's assembly language
--scan-html <src.s
scan a source program in the virtual processor's assembly language and mark it up as HTML
--assemble <src.s >src.bin
assemble the program src.s into the object file src.bin
--assemble-c
assemble as a C++ array for embedding into a C++ program
--base-address vaddr
set virtual base address for image loading
--device-array paddr vaddr config-size
set device array info: physical address, virtual address, device configuration space size
--disasm <src.bin
disassemble the file src.bin
--dumpbin <src.bin
disassemble the src.bin file with binary representation
--env key=value
add guest app environment variable
--execute <src.bin
execute the raw program src.bin in the emulator
--exeapp <src.bin
execute the ELF program src.bin in the emulator. Static PIE executables only, with a (very limited) syscall emulator for a Linux-like system.
--html >out.html
output HTML text with information about the syntax of assembler instructions, the format of machine instructions, operation codes, and statistics for the instruction set
--llvm
output an LLVM TableGen file with information about the instruction encodings, the format of machine instructions, operation codes, etc. Used as «PostriscInstrEncoding.td» in the LLVM compiler backend.
--export-definitions
list the predefined constants known to the assembler
--dump-file file
dump final emulation state to file
--log-file file
set log file path
--log-level level
set logging level
--log-subsystem list
set logging subsystem mask
--memory size paddr
add memory device with size (in hexadecimal) starting from physical address paddr
--pa-size nbits
set physical address size in bits
--paging offset nlevels
set paging info: page offset in bits and number of indexing levels. The depth of the virtual address space will be offset+nlevels*(offset-3) bits
--profiling
do profiling (per bundle)
--rom path paddr vaddr
add ROM image, map it to corresponding physical and virtual addresses
--timing_info
report timing info
--verbose
verbose logging
--video paddr vaddr width height
add video device (experimental)
--
separator between POSTRISC engine options and emulated guest program options

For example, running POSTRISC ELF static-PIE image:

/path/to/postrisc \
    --exeapp --log-file "test-log.txt" \
    -- \
    postrisc-app -app-option1 -app-options2

The qtpostrisc is a Qt-based graphical application with an assembler editor, debugger, Doom graphical backend, etc. (Qt5 required). It supports the same command-line options as the console application.

On Wayland systems it currently requires switching to X11 via the «QT_QPA_PLATFORM=xcb» environment variable:

QT_QPA_PLATFORM=xcb /path/to/qtpostrisc \
    --exeapp --log-file "doom-log.txt" \
    -- \
    doomgeneric.postrisc -doom -options

If this manual for the instruction architecture and the assembler syntax doesn't exactly match the sample program (not yet updated), then the gen.html file contains a brief instruction set manual automatically generated by the assembler.

The example assembler program program.html does nothing sensible, but it uses all machine instructions and assembler pseudo-instructions, all program sections, and all addressing modes; its only purpose is the joint testing of the assembler, disassembler, and emulator while they are being written. It is a concatenation of separate little tests.

The resulting binary program may be disassembled: out_diz.s and out_dump.s (with the binary representation reported).

The results are in result.txt, and a full system dump in dump.html.

The POSTRISC build is based on cmake. Use the "MSYS Makefiles" (Windows) or "Unix Makefiles" (Linux) generators. Set -DCMAKE_BUILD_TYPE=Release as the default. Set -DCMAKE_CXX_COMPILER to g++ or clang++ (MSVC isn't supported). The «USE_QUADMATH» macro selects the internal long floating-point implementation (quadmath or mpreal). Set -DUSE_QUADMATH=0 for clang (it doesn't support libquadmath); either 0 or 1 works for g++.

The author is grateful in advance for reporting errors and inaccuracies in the virtual processor description and in the source code (which is far from error-free).

The set of tools for working with POSTRISC (assembler, disassembler, and emulator are implemented) is incomplete. Further directions for improvement (the number in brackets, like [2], characterizes the comparative complexity of the task):

  1. Development of a set of sample programs to illustrate the capabilities of the virtual processor and assembler POSTRISC [1].
  2. Development of a macroprocessor for the assembler POSTRISC [3].
  3. Refactoring the formula calculation unit to introduce strict control over the use of object names in formulas (according to sections and segments of the program and the relocation types) [2].
  4. Development of the object file format for a virtual processor compatible with the ELF standard, processing assembler for ELF [2].
  5. Development of separate compilation tools, support for relocation records according to the ELF standard (Executable and Linkable Format) in assembler and the development of the linker [2].
  6. Development of the block of floating-point calculations in the emulator (implemented), and support for real constants in assembler [3].
  7. Development of a visual debugger for the emulator POSTRISC [2].
  8. Development of a text editor with syntax highlighting of the assembler virtual processor [2].
  9. Adding new «hardware» tools to the virtual processor architecture and its emulator: cache management, virtual memory and translation buffer [2].
  10. Development of a simulator (cycle-accurate emulator) of the possible hardware implementation of a virtual processor, with the simulation of hardware such as multi-level instruction/data caches, translation buffers, branch prediction cache, return prediction stack, and other modern hardware. [5]

Content

Chapter 1. Choosing an instruction set

§ 1.1. Bottlenecks issue

§ 1.2. Memory non-uniformity problem

§ 1.3. The technologies of parallel operation execution

§ 1.4. Instruction format budget

Chapter 2. Instruction set architecture (ISA)

§ 2.1. General description of the instruction set

§ 2.2. Register files

§ 2.3. Instructions format

§ 2.4. Instruction addressing modes

§ 2.5. Data addressing modes

§ 2.6. Special registers

Chapter 3. Basic instruction set

§ 3.1. Register-register binary instructions

§ 3.2. Register-immediate instructions

§ 3.3. Immediate shift/bitcount instructions

§ 3.4. Register-register unary instructions

§ 3.5. Fused instructions

§ 3.6. Conditional move instructions

§ 3.7. Load/store instructions

§ 3.8. Branch instructions

§ 3.9. Miscellaneous instructions

Chapter 4. The software exceptions support

§ 4.1. Program state for exception

Chapter 5. The register stack

§ 5.1. Registers rotation

§ 5.2. Call/return instructions

§ 5.3. Register frame allocation

§ 5.4. The function prolog/epilog

§ 5.5. The register stack system management

§ 5.6. Calling convention

Chapter 6. Predication

§ 6.1. Conditional execution of instructions

§ 6.2. Nullification Instructions

§ 6.3. Nullification in assembler

Chapter 7. Physical memory

§ 7.1. Physical addressing

§ 7.2. Data alignment and atomicity

§ 7.3. Byte order

§ 7.4. Memory consistency model

§ 7.5. Atomic/synchronization instructions

§ 7.6. Memory attributes

§ 7.7. Memory map

§ 7.8. Memory-related instructions

Chapter 8. Virtual memory

§ 8.1. Virtual addressing

§ 8.2. Translation lookaside buffers

§ 8.3. Search for translations in memory

§ 8.4. Translation instructions

Chapter 9. The floating-point facility

§ 9.1. Floating-point formats

§ 9.2. Special floating-point values

§ 9.3. Selection for IEEE options

§ 9.4. Representation of floats in registers

§ 9.5. Floating-point computational instructions

§ 9.6. Floating-point branch and nullification instructions

§ 9.7. Logical vector instructions

§ 9.8. Integer vector operations

Chapter 10. Extended instruction set

§ 10.1. Helper Address Calculation Instructions

§ 10.2. Multiprecision arithmetic

§ 10.3. Software interrupts, system calls

§ 10.4. Cipher and hash instructions

§ 10.5. Random number generation instruction

§ 10.6. CPU identification instructions

§ 10.7. Instructions for the emulation support

Chapter 11. Application Model (Application Binary Interface)

§ 11.1. Sections and segments

§ 11.2. Data model

§ 11.3. Reserved registers

§ 11.4. Position independent code and GOT

§ 11.5. Program relocation

§ 11.6. Thread local storage

§ 11.7. Modules and private data

§ 11.8. Examples of assembler code

Chapter 12. Interrupts and hardware exceptions

§ 12.1. Classification of interrupts

§ 12.2. Processor state preservation upon interruption

§ 12.3. Exception Priority

§ 12.4. Interrupt handling

Chapter 13. External interrupts

§ 13.1. Programmable external interrupt controllers

§ 13.2. Built-in interrupt controller

§ 13.3. Handling external interrupts

§ 13.4. Handling local interrupts

§ 13.5. Processor identification and interprocessor messages

Chapter 14. Debugging and monitoring

§ 14.1. Debug Events

§ 14.2. Debug registers

§ 14.3. Monitoring registers

Chapter 15. PAL (Privileged Architecture Library)

§ 15.1. PAL instructions and functions

§ 15.2. PAL replacement

Chapter 16. LLVM backend

§ 16.1. LLVM backend intro

§ 16.2. LLVM backend limitations

§ 16.3. MUSL port

§ 16.4. Code density comparison

§ 16.5. DOOM port

Chapter 1. Choosing an instruction set

When creating instruction set for existing processor architectures in different years, their architects proceeded from various, often mutually exclusive, goals. Among these goals are the following:

There are different architectures: some have gone far in one particular direction, up to explicit conceptualism, while universal ones seek a balance of priorities across directions. The choice made when designing the instruction set architecture may later affect the possibility of developing the architecture in one direction or another. Design errors can cut off the possibility of an effective implementation on new technologies due to an incorrect prediction of technological trends, narrow the scope of the architecture, or reduce the effectiveness of its application.

§ 1.1. Bottlenecks issue

The traditional architecture of a programmable computing device is based on the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory. The execution of an instruction consists of a sequence of steps:

Naturally, processor performance equals the performance of the bottleneck of this system. It makes no sense to increase the capabilities of one pipeline stage while problems at other stages remain unresolved. Accordingly, several processor bottlenecks arise:

It can be said immediately that RAM bandwidth is the fatal bottleneck of the processor, and this problem is only mitigated by increasing the amount of built-in cache.

§ 1.2. Memory non-uniformity problem

At the heart of the traditional architecture are two principles related to its central element, the memory: the principle of random access to any memory element (the uniformity property) and the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory.

However, early computers already had accumulator registers, and later counter and index registers appeared, stored as close as possible to the arithmetic unit. The emergence of architectures with general-purpose registers meant the final division of memory into fast registers and slower RAM. Registers made it possible to explicitly track, analyze, and plan data dependencies in the instruction stream and, in their absence, execute instructions simultaneously.

The further increase in memory size, the miniaturization of computing devices, and the growing gap between memory and processor speed gave rise to cache memory. Caching removed some of the memory-speed problems without changing the programming paradigm. However, more was needed: cache levels of increasing size. The current computer scheme is as follows: a set of specialized computing devices, each with its own register file, relies on a system of logically uniform memory with implicit multi-layer caching.

The architecture of a computer with 16 general-purpose registers is certainly better than one with 8 registers, and an architecture with two floating-point multiply-add pipelines is better than one with a single pipeline. It might seem that an architecture with 1024 registers and 16 multiply-add pipelines would be almost ideal. However, a register file of 1024 registers with 16×4=64 read/write ports would be a technological absurdity. Caching also reached its limit after the advent of four cache levels. Further enhancement of parallel data processing is carried out by creating massively parallel systems with shared memory, which abandoned the uniformity property, keeping it only for the local memory of a single multiprocessor node. But these issues already lie outside the architecture of the processor itself.

The new architecture doesn't abolish the traditional architecture based on logically homogeneous memory and doesn't offer a new programming paradigm. The architecture is still based on logically homogeneous RAM. Architectural changes can only affect the model of a computing device with its state explicitly described by internal registers.

§ 1.3. The technologies of parallel operation execution

In addition to memory non-uniformity, there is another fundamental fact that determines the development of architectures: parallelism of operations. Unlike the traditional strictly sequential architecture, modern architectures, to achieve maximum performance, seek to execute more than one instruction at a time, and more than one operation in one instruction.

The problem of parallel computing ultimately reduces to the problem of organizing consistent simultaneous access of many computing devices to logically homogeneous memory, that is, to the same problem of real memory non-uniformity and insufficient bandwidth. Accordingly, it is the level of the memory hierarchy being shared that determines the parallelization technologies used.

There are several technologies for increasing the degree of parallelism of calculations, depending on the hierarchical level of memory for which they are intended. These technologies are implemented either at the ISA level (instruction set architecture) or at the software level. Here we are more interested in the first case, since we want to evaluate the possibilities of parallel operations due to the correct choice of ISA.

Table 1.1: Technologies for achieving parallel computing
Memory hierarchy level | Data exchange | Technology | Hardware | Acceleration | New ISA | Code density | Implementation
Separate register | inside the pipeline | SIMD: subword parallelism | wide registers and data buses | 4-16 operations in one instruction | 4-16 | 8 | at ISA level, compiler
Pipeline data | inside the pipeline | fused instructions | longer pipeline, additional read port | 2-3 operations in one instruction | 2-3 | 1.25 | at ISA level, compiler
Separate register file | crossbar before register file | OOOE+SS: out-of-order super-scalar execution | increase in the number of register file ports, associative hardware for instruction issue | 2-10 instructions per cycle | 2.5 | 0 | at ISA level, compiler
Many computing units with local register files | inter-file transfer instructions | MIMD+VLIW: very long instruction word | wide instruction fetch, scheduling | 2-8 instructions per cycle | 0 | 0.25 | at ISA level, compiler
Cache | explicit sync memory access instructions | SMT/CMP: simultaneous multi-threading, chip multi-processing | multiport cache, next instruction fetch | 2-4 microkernels (threads) on one chip | 4 | 0 | at the program level
Local shared RAM | explicit sync memory access instructions | SMP: shared memory processing | memory banks, wide crossbar | 2-64 microchips in one node | 64 | 0 | at the program level
Computing network | library network transfer functions | MPP: massively parallel processing | developed network topology (hypercube, torus, mesh, fat tree) | any number of nodes in the array | 4096 | 0 | at the program level

A commercially successful ISA implementation is a compromise between implementation complexity and the benefits of each technology. A successful ISA doesn't give preference to any one technology (it isn't purely conceptual), but organically and in moderate doses combines several technologies.

SIMD (Single Instruction Multiple Data) instructions perform homogeneous vector operations on the elements (8, 16, 32, or 64 bits long) of a wide register (64 or 128 bits long). They allow performing several (2-16) operations in one instruction per cycle. However, software handling of exceptional situations is complicated (where exactly in the vector operation is the error?). The program must contain a sufficient proportion of operations that allow vector execution, and the optimizing compiler must be able to find them. Memory accesses run into data alignment problems. Wide register read and write ports are required.

Fused instructions are three-operand instructions that combine two binary operations, for example a = b × c + d. This reduces the total number of instructions and (ideally) doubles the number of computational operations performed per clock cycle (but not machine instructions). It requires a longer execution pipeline, and hence larger branch delays, and an additional read port for the third operand. Exception handler construction becomes more complicated, since collisions are possible in both the first and the second of the fused operations. Encoding the fourth operand takes space in the instruction. A discrepancy arises between binary and ternary computational instruction formats, which complicates decoding, or all binary formats have to be artificially converted into ternary instructions. The program must contain operations that allow fusing, the optimizing compiler must be able to find them, and the share of fused operations should be large enough. The total number of possible fused instructions is O(N²), where N is the number of basic operations, which is quite large. In practice, it is impossible to fuse all instructions, since the amount of decoding hardware and the space allocated in the instruction for the operation code are limited. Therefore, only some frequently occurring combinations of operations are fused.

Predication is the conditional execution of instructions: any instruction turns into a hardware-executed conditional statement, for example if (a) b = c + d. An additional operand encodes the logical condition register. This technology replaces a control dependency with a data dependency and shifts a possible pipeline stall closer to the pipeline end. Poorly predicted branches in short conditional calculations, and hence pipeline stalls, are eliminated by simultaneously executing instructions from both branches of the conditional statement. However, this is a brute-force method, which boils down to simultaneously issuing instructions from several execution branches under different predicates into the pipeline. Encoding the extra operand, the predicate register, takes space in the instruction.

Superscalar execution of instructions. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: exception handling becomes more complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select N simultaneously executed instructions. Additional read and write ports and additional pipeline stages are needed.

Out-of-order execution (OOOE) is the execution of instructions not in the order prescribed by the program, but as their operands become ready, which allows bypassing structural dependencies and doing useful work while waiting for the completion of previous instructions, such as reads from memory. However, exception handling is complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select the next instruction from N buffered instructions. Additional pipeline stages are needed, as well as a register file of sufficient size and hardware for dynamic register renaming.

VLIW (Very Long Instruction Word) or MIMD (Multiple Instruction Multiple Data) is the execution of instruction packages. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: synchronization is needed, that is, accurate knowledge of the delay times of all pipelines and of memory, and hence program incompatibility across processor models and incompatibility with data caching. Additional read and write ports are needed. The program must contain operations that allow synchronous execution, and the compiler must be able to find them. The program grows due to empty slots for which no useful instructions were found.

§ 1.4. Instruction format budget

The program size should be as small as possible. The code density requirement demands efficient use of the space in instructions. The question arises of the most advantageous distribution of the bit budget between different types of information in the instruction. The following table shows what the instruction bit budget can be spent on:

Table 1.2: Allocation of the instruction bits
Type of information | Advantages | Disadvantages
Operation code | increasing the variety of implemented functions shortens the data path (the number of instructions for the operation) | complicates functional units and the compiler
Wider register numbers | more registers in a uniform register file facilitates variable allocation and data flow organization | statistically useless when procedures with a small number of variables prevail; increases the length of data buses and the number of intersections
Additional operand registers | non-destructive 3-ary instructions and complex fused 4-ary operations reduce the number of data moves and shorten the data path | problems with additional register read ports and register renaming ports (for OOOE)
Explicit predicate description | conditional execution of short conditional statements without branches reduces the branch delay of incorrect dynamic predictions | favorable only for short conditional statements that do not exceed the pipeline length
Longer constants in the instruction code | loading constants becomes easier; special instruction sequences for synthesizing long constants are needed less often | statistically useless when short constants prevail
Templates for an early description of the instruction distribution to functional units | facilitate decoding and distribution of instructions among functional units | problems with porting programs to machines with a different set of functional units
Explicit description of instructions that allow parallel execution | simplifies the scheduling of instructions to functional units | useless with unpredictable execution times
Hints to the processor about the direction and frequency of branches | reduced downtime due to incorrect dynamic predictions | useless if the compiler doesn't have the necessary information, harmful if the prediction is incorrect; may conflict with the hardware branch predictor
Hints to the processor about the frequency and nature of future accesses to the cache line | reduced cache misses, better cache utilization | useless if the compiler doesn't have the necessary information or predicts incorrectly, harmful for differing predictions of access patterns to the same line; may conflict with the microarchitectural hardware prefetcher
Explicit clustering of register files with binding to functional units | the data bus length and the number of intersections are reduced, power consumption is reduced, and less space is needed for register numbers in instructions | requires explicit data transfer between register files in different clusters by separate instructions, which lengthens the data path; doesn't allow reducing or increasing the number of clusters specified by the architecture; doesn't allow redistributing functional units

A successful ISA implementation is a trade-off between the cost of instruction space and the benefits of the coding techniques used. Preference should not be given to any one technique; the instruction set architecture combines different coding techniques at the ISA level.

In a broader sense, there is a question about splitting the workflow into separate instructions. The same sequence of operations can be represented in different ways as a sequence of instructions. The complication of the semantics of instructions makes it possible to increase their length and reduce the number without compromising the overall size of the program, and this, in turn, raises the question of a new redistribution of the bit budget.

However, for high-performance architectures, even more important than code size is the efficiency of the instruction cache. A cache line contains several instructions and begins at a naturally aligned address. The most effective option is to fetch all the instructions in one cache line starting from the first instruction. Fetches that do not start at the line start require aligners and introduce delays, and incomplete fetches use the cache irrationally.

Summarizing the above: a regular instruction format is needed that shortens the data path without significantly complicating the semantics and hardware support of each instruction, is dense enough, but, when choosing between code density and caching efficiency, gives preference to caching. The instruction format should provide scalability of parallel computing devices within a single portable architecture.

It should be noted that the growth in the volume of processed information is significantly ahead of the growth in the program complexity. The relative part of the RAM occupied by the program code is constantly decreasing. Therefore, the problem of minimizing the size of the program is gradually relegated to the background, remaining relevant only for embedded systems.

The pipelined parallel organization of a computing device requires a regular instruction format, that is, a constant length of the portion supplied to the input of the pipelined instruction decoder. This is necessary to start decoding the next portion before decoding of the instructions from the previous portion is completed.

The regularity of the instruction format means that all instructions are finite, their length limited by the length of the decoded portion. However, not all instruction lengths are feasible. It should also be possible to determine the start of the next instruction before decoding of the current instruction completes. Instructions must satisfy the conditions for the natural alignment of data in memory, which keep growing as memory systems evolve.

Table 1.3: Possible regular instruction formats
Format | Instruction lengths | Alignment | Example
Irregular | 1-15 | 1 | Intel X86
Irregular | 2,4,6,8 | 2 | Motorola 68000, IBM S390
Semi-regular | 2,4 | 2 | MIPS-16
4-byte regular | 4 | 4 | Alpha, PowerPC, PA-RISC, Sparc, MIPS
8-byte bundles | 4,8 | 8 | Intel 80960
8-byte instructions | 8 | 8 | Fujitsu VPP
16-byte bundles | 5,10 | 16 | IA-64

The first three rows of the table relate either to legacy instruction architectures or to special architectures for embedded applications, for which the size of the program flashed directly into ROM is more important than performance.

A regular 4-byte format is used by all modern RISC architectures. They are now completing their development cycle, reaching the limits of improving this format. It is hardly worth hoping for significant progress from new architectures based on this format.

The 8-byte instruction format is used only in some vector and graphics processors, where there is generally no access to smaller memory atoms. Its application to a general-purpose architecture would mean more than doubling the size of programs, which is unacceptable.

Table 1.4: Comparison of some architectures
Number of registers | Architecture | Advantages | Disadvantages
8 | Intel X86 | scaled indexed addressing mode, SSE2 double-precision vector instructions | CISC: only 8 non-universal and non-orthogonal registers, lack of uniformity in coding
16 | AMD X86-64 | PC-relative addressing | compatible with old X86, only 16 registers
16 | ARM32 | predication, fused instructions | combination of the program counter with a general register
32 | ARM64 | fused instructions | usually 2 instructions to address global/static data (hi/lo parts of address)
32 | SGI MIPS | first RISC: fixed instruction format, PC-relative addressing (MIPS16) | delayed branches
32 | Intel 80960 | regular but not fixed format with 4- and 8-byte instructions |
32 | HP PA-RISC | instruction nullification, speculative execution, system calls without interruptions, global virtual address space, inverted page hash tables | delayed branches, comparison in each instruction
32 | DEC Alpha | out-of-order execution of instructions, a fixed instruction format, a unified PAL code, the absence of global dependencies outside the registers | insufficient memory access formats, poor code density, lack of good SIMD extensions, imprecise interrupts
32 | IBM PowerPC | out-of-order execution with ordered completion and exact interruptions, fused «multiply-add» instructions, multiple condition register fields, saving or restoring several registers with one instruction, global virtual address space, inverted cluster page tables | optional comparison in each computational instruction, dependencies between global flag instructions, inconvenient ABI
32 | IBM/Motorola PowerPC | AltiVec vector extension | missing double-precision vector instructions (as in SSE2)
32 | Sun UltraSPARC | recursive interrupts, register rotation | register windows of a fixed size, large register files but a small number of registers
128 | Intel IA-64 | predication, register rotation, instruction bundles | in-order execution only, large multi-port register files, sparse code, complex compiler
128 | IBM Cell | unified register file for all types | explicit non-uniform scratchpad memory without cache, explicit DMA for exchange with main memory
256 | Fujitsu SPARC64 IX-FX | vector instructions for paired registers | separate preparation instructions for specifying register numbers from an extended set

Chapter 2. Instruction set architecture (ISA)

This chapter provides a basic description of the POSTRISC virtual processor instruction architecture (instruction set architecture or ISA).

§ 2.1. General description of the instruction set

The architecture prefers security over performance. Exploitation of unplanned program behavior should be prevented by design as far as possible, and ambiguous interpretation of code should be avoided. This is done for security reasons: to prevent return-oriented programming attacks like «return to libc» and to make all binary code available for inspection.

A variable-length instruction encoding allows starting execution from the middle of an instruction and extracting unplanned instruction sequences: alternative interpretations of the program code become possible by decoding from the middle of a variable-length instruction. It should be impossible to continue execution from the middle of an instruction. To ensure this, one can use either a fixed format or a self-synchronizing variable-length format. POSTRISC chose the fixed format, so variable-length instructions are forbidden and only fixed instruction encoding with aligned code chunks is allowed.

Some architectures allow placing data inside code, by design or due to global data addressing limitations. In such architectures, data may be placed near a function that uses it or accumulated into bigger «data islands» shared by several functions. Data in a code section may lead to data execution and to exploitation of unplanned program behavior. So strong separation of code and data should be enforced at the architecture level, and mixing of code and data in the code section should be prohibited. This also improves paging/caching/TLB behavior.

The instruction set architecture aims at maximally parallel fetching and decoding of instructions. The instruction format is regular (the length of the decoded portion of code is constant), not strictly fixed (where all instructions are necessarily the same length), but almost fixed: inside the regular portion, the initial parts of instructions have the same length, and a possible continuation also has a fixed length. The unit of the instruction stream is a 16-byte bundle assembled from three (usually) or two instructions. Bundles are always 16-byte aligned in memory.

Unlike traditional systems like VLIW (very long instruction word), the instruction bundling reflects the parallel fetching and decoding process only, not the process of dispatching, executing, or completing instructions. Instruction bundles do not describe the binding of individual instructions to functional units, the possibility (or necessity) of parallel execution and/or completion, or execution timings. The architecture doesn't expose microarchitectural details to software, such as load data delays, branch delays, other fixed pipeline delays (pipeline hazards), or a fixed set of functional units. This is necessary for program portability within a family of machines with different microarchitectures and performance. It is assumed that a program can be used without recompilation on machines with different sets of functional units and timings.

Wherever possible, the instruction set tends to be uniform, that is, if some part of the instruction with the same meaning (for example, the number of the first register, the number of the second register, immediate value, etc.) is present in many instructions, then in all those instructions this part is placed at the same position.

The instruction set architecture uses the non-destructive instruction format for any calculation over registers, i.e. the result register is always encoded separately from the operand registers, unlike CISC dual-argument architectures, where the result is forcibly combined with one of the operands. Accordingly, two-argument unary instructions, three-argument binary instructions, and four-argument fused (ternary) instructions are valid.

Fighting unpredictable branches or using a vector extension requires the introduction of predicates and conditional execution, but encoding an additional predicate argument costs extra space in each instruction. The POSTRISC architecture uses implicit predication via nullification. Each instruction can be turned into a nop by a previous nullification instruction. Instructions are executed conditionally, and canceled instructions are treated as no-ops. When we don't use predication, we don't pay for it in instruction bits.

In the new architecture, to reduce the data path, a limited number of frequently encountered combinations of operations are fused (combined in one machine instruction): addition (or subtraction) with a shift; multiplication with addition or subtraction; addition with a constant and memory access (base + displacement addressing mode); register addition (with shift) and memory access (indexed scaled addressing mode); comparison with the branch according to the result of the comparison; change of the cycle counter with comparison and branch according to the result of the comparison, etc. The architecture assumes the true hardware support for fused operations, rather than just compiling the code with hardware breakdown into the original operations.

The architecture allows superscalar out-of-order execution hardware to be used effectively. To make this possible, the instruction set has several limitations. There are no implicit or optional instruction results, no global registers or flags. The number of possible side effects of instructions is limited. Most instructions have a single register result; several instructions have two register results. The number of operands is limited to three (and for most instructions, two) registers.

For the POSTRISC instruction set architecture, the underlying technology is parallel (superscalar) out-of-order execution of complex (fused) instructions with implicit predication.

Instruction fetching and decoding occur sequentially in program order. Out-of-order concurrent execution is used to process at least one instruction bundle per cycle. The final completion of instructions, with the analysis of exceptions, occurs sequentially in program order.

All operations on integer data occur in general registers, with two or three source operand registers (an operand may also be an immediate value or an immediate shift amount) and one result register.

All actions on floating-point data occur in general registers, with one, two, or three source operand registers and one result register. Floating-point instructions work on single/double/quadruple-precision numbers in scalar or packed vector form.

Many scalar operation codes are complemented by a wide range of vector operations. A special vector extension is used to process multimedia and numerical data in ordinary registers.

The architecture is of the load/store type. Memory accesses are limited to load or store instructions that move data between registers and memory and do not combine a memory access with computation on the loaded value. A memory access instruction usually performs exactly one memory access with a single virtual address translation. Unaligned memory accesses are possible, but strict data alignment is preferred.

Global flags and dedicated registers prevent efficient parallel execution of instructions; on the other hand, duplicating resources and introducing explicit dependencies between instructions requires extra bits in the instruction encoding. Branch instructions do not use flags but check the values of general registers. The basic operation is the fused «compare and branch» in one instruction.

To speed up subroutine calls, to pass arguments through registers, and to reduce the number of memory accesses, a hardware circular buffer of rotated registers is implemented. It also improves code density by minimizing function prologues and epilogues. The second, protected stack for rotating registers also protects the contents of all register frames from erroneous changes. Register rotation also complicates return-oriented programming: there is no fixed correspondence between physical registers across different function frames.

Optional hints about the frequency and nature of future cache line accesses are provided (if such information is available) by separate instructions.

For immediate encoding there exist different variants: with optional compression, interpreting binary values as signed/unsigned, a separate sign bit plus unsigned value, etc. POSTRISC uses the simple two's-complement binary representation. Each immediate class is defined as signed or unsigned depending on its usage. Base addressing displacements are always signed. Shift amounts are always unsigned. Compare immediates for less/greater are signed or unsigned depending on the type. Compare immediates for equality are chosen to be signed.
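As a sketch of the immediate interpretation described above, the following hypothetical helpers (the names are illustrative, not part of the architecture) show how a decoder might zero- or sign-extend an n-bit immediate field:

```python
def zero_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as an unsigned immediate."""
    return value & ((1 << bits) - 1)

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits as a two's-complement signed immediate."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

# A 21-bit base displacement is always signed: the all-ones pattern is -1.
assert sign_extend(0x1FFFFF, 21) == -1
# A shift amount is always unsigned.
assert zero_extend(0x7F, 7) == 127
```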

§ 2.2. Register files

Processor resources include register files, special registers, associative search structures, interrupts. Some resources are available for user programs, others are necessary for the functioning of the operating system. Each processor core has its own set of registers that contain the current state of the core. All registers are divided into register files. There are no registers that are not included in any register file.

It is known that for ordinary code, increasing the register file size above 32 gives negligible benefit. But using more registers makes sense for high-performance computing, digital signal processing, accelerating 3D graphics, and game physics. IBM uses a 128x128 SIMD register file in its POWER VMX extension and 64x128 in its POWER VSX extension. Fujitsu uses a 256x128 register file in its SPARC FX HPC-ACE extension. Intel Itanium had 128x82 floating-point registers for HPC.

For the POSTRISC architecture, the 128x128 register file is chosen as a compromise between ordinary usage and special computing purposes.

Table 2.1: Register Files
Register file | Number of registers | Register size in bits | Additional info
General Purpose Registers | 128 | 128 | General-purpose registers are intended for manipulations with scalars 1, 2, 4, 8, or 16 bytes long, or vectors of numbers 1, 2, 4, 8 bytes long. General-purpose registers are divided into 120 rotated windowed registers and 8 global registers. In each group, all registers are equal at the architecture level. Registers can be used to manipulate quadruple-precision real numbers, packed vectors of single- and double-precision real numbers, and packed integer vectors with elements 1, 2, 4, 8 bytes long. Exceptions from equality: local: r0; globals: tp, fp, sp, gz.
Special Purpose Registers | up to 128 | 32/64/128 | As the name implies, special-purpose registers have different purposes. Not all of the 128 possible special registers are implemented. The ability to read/write depends on the privilege level, register number, etc.
CPU identification registers | implementation-defined | 64 | Read-only registers for reporting hardware capabilities/features. Available only indirectly.
Instruction TLB translation registers | implementation-defined | 128 | Fixed translations which can't be evicted from the instruction TLB. Available only indirectly.
Data TLB translation registers | implementation-defined | 128 | Fixed translations which can't be evicted from the data TLB. Available only indirectly.
Performance monitor registers | implementation-defined | 64 | Counters for internal processor core statistics like the number of TLB misses, instruction/data cache misses, branch mispredictions, etc. Available only indirectly.
Instruction breakpoint registers | implementation-defined | 64 | The instruction breakpoint registers, when enabled, allow stopping execution at chosen code addresses.
Data breakpoint registers | implementation-defined | 64 | The data breakpoint registers, when enabled, allow stopping execution at chosen data addresses and/or access types like read/write/backstore/etc.

§ 2.3. Instructions format

Existing RISC architectures have exhausted the possibilities of a fixed 32-bit instruction format. Deep loop unrolling, function inlining, and other compiler optimization technologies require more than 32 general-purpose (and floating-point) registers, preferably at least 128. However, increasing the number of registers over 32 within the 32-bit RISC instruction length turned out to be difficult. The three-address format requires at least 3 × log2(128) = 21 bits for register numbers (and a four-address fused instruction as many as 28 bits).

The decision to separate code and data forces support for effective addressing modes to access global/static/const data outside the code section. The approach with several instructions (like high/low offset parts) to access global data seems unfit, but existing 32-bit instructions are not enough to access global data from any code position with one instruction. For the biggest known projects, size is estimated at 150-250 MiB (210 MiB Chromium, 380 MiB Linux kernel «allyesconfig» build, various CADs, etc.), which requires offsets of at least 28-30 bits to allow for future code growth. POSTRISC supports programs up to 256 MiB with direct access to global data in one instruction.

Some vector processors (like NEC SX Aurora) or video cards use a longer fixed 64-bit format. But this doubles the program size and doesn't justify the possible benefits for a general-purpose architecture. There remains the only intermediate format consistent with 2^n-byte alignment: three 42-bit instructions (slots) packed into 128-bit bundles. With the 128-bit format, we can't transfer control to any instruction in the bundle except the first, nor execute only part of the bundle. The bundle is a minimal execution unit. This encoding approach is similar to Intel IA-64 Itanium.

The POSTRISC architecture defines that a 128-bit bundle consists of a 2-bit template and three 42-bit slots. There are two instruction lengths: one or two bundle slots. A bundle may contain three simple one-slot instructions, or a dual-slot instruction and a one-slot one (direct order), or a one-slot instruction and a dual-slot one (reversed order).

All operation codes are placed in the first slot of a double-slot instruction, so the second slot is used only for immediate extensions. If the instruction format allows expansion into the second slot to form a long instruction, then some immediate fields may have different lengths in the short and long formats. For example, simm21(63) means a 21-bit field in the short format, expandable to 63 bits in the long format.

The splitting of a bundle into instructions is completely determined by the 2-bit template, so the primary and extended operation codes for different instruction lengths do not have to overlap. However, they are defined to be always identical: long instructions are always extended versions of short instructions with extended immediates. The following table shows the packaging of the template and instructions into bundles.

Table 2.2: The bundle splitting into slots and template
Slot 3 (bits 86…127) | Slot 2 (bits 44…85) | Slot 1 (bits 2…43) | Template (bits 0…1)
42 bits | 42 bits | 42 bits | 00
84 bits (slots 3+2) | 42 bits | 01
42 bits | 84 bits (slots 2+1) | 10
126 bits (reserved) | 11
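The splitting rules in Table 2.2 can be sketched as follows (a minimal model, assuming slots are consumed from slot 1 upward; real decoding also inspects the opcodes inside each slot):

```python
def split_bundle(bundle: int) -> list[tuple[int, int]]:
    """Split a 128-bit bundle into (slot_value, width) pairs, per Table 2.2.

    Slot 1 occupies bits 2…43, slot 2 bits 44…85, slot 3 bits 86…127;
    the 2-bit template lives in bits 0…1.
    """
    def field(lo: int, width: int) -> tuple[int, int]:
        return (bundle >> lo) & ((1 << width) - 1), width

    template = bundle & 0b11
    if template == 0b00:   # three one-slot instructions
        return [field(2, 42), field(44, 42), field(86, 42)]
    if template == 0b01:   # one-slot instruction, then a dual-slot one
        return [field(2, 42), field(44, 84)]
    if template == 0b10:   # dual-slot instruction, then a one-slot one
        return [field(2, 84), field(86, 42)]
    raise ValueError("template 0b11 is reserved")

# Template 00: three independent 42-bit slots.
b = 0b00 | (1 << 2) | (2 << 44) | (3 << 86)
assert split_bundle(b) == [(1, 42), (2, 42), (3, 42)]
```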

The following table shows the instruction formats and the lengths of the instruction fields in bits for one-slot instructions. The high 7 bits (bits 35…41) always define the primary operation code (or just opcode) of the instruction. Many instructions also have one or two extended opcodes (opx). The remaining bits of the instruction contain one or more fields in various formats.

Table 2.3: Instruction formats
Format name | Field layout (bit positions 41…0, high to low)
r1i * opcode ra simm28 (64)
RaU28 * opcode ra uimm28 (64)
r1b * opcode ra label28 (64)
br * opcode opx label28 (64)
RaU28 * opcode opx uimm28 (64)
alloc opcode opx framesize 0
allocsp * opcode opx framesize uimm21 (63)
raopxUI21 * opcode opx 0 uimm21 (63)
raopx2i * opcode opx rb simm21 (63)
r2si * opcode ra rb simm21 (63)
r2ui * opcode ra rb uimm21 (63)
raopx2b * opcode opx rb 0 label17 (30)
r2b * opcode ra rb opx label17 (30)
bbit * opcode ra shift opx label17 (30)
brcsi * opcode ra simm11 (40) label17 (30)
brcui * opcode ra uimm11 (40) label17 (30)
RaSIN * opcode ra simm11 (40) dist-no dist-yes opx
RaUIN * opcode ra uimm11 (40) dist-no dist-yes opx
RaSbN opcode ra shift opx dist-no dist-yes opx
RabN opcode ra rb opx dist-no dist-yes opx
r4 opcode ra rb rc rd opx
r3s1 opcode ra rb rc pos opx
r2s2 opcode ra rb shift pos opx
r2s3 opcode ra rb shift shift pos
r3s2 opcode ra rb rc shift pos
gmemx * opcode ra rb rc scale sm disp
RbcScale opcode 0 rb rc scale opx
Rbc opcode 0 rb rc 0 opx
mspr opcode ra 0 spr 0 opx
r2 opcode ra rb 0 0 opx
Round opcode ra rb 0 rm opx
r2s1 opcode ra rb shift 0 opx
r3 opcode ra rb rc 0 opx
RabcMo opcode ra rb rc mo opx
RabMo opcode ra rb 0 mo opx
RbcMo opcode 0 rb rc mo opx
fence opcode 0 mo opx
gmemu opcode ra rb simm10 opx
int opcode 0 rb simm10 opx
NoArgs opcode 0 opx
Table 2.4: Used text (and color) notation for instruction fields
Field | Length | Description
opcode | 7 | primary operation code
opx | 4, 7, 11 | extended operation code
ra, rb, rc, rd | 7 | general register number, operand or result
spr | 7 | special register number
uimm, simm | 9, 10, 11, 21, 28 | unsigned/signed immediate
disp | 9, 21, 28 | signed immediate for the address offset
label | 17, 28 | signed immediate for branch/jump/call
stride | 10 | signed immediate for base update
dist-yes, dist-no | 5 | nullification block size
shift, pos | 7 | bit number, shift value, field length
scale | 3 | indexing scale factor
sm | 2 | indexing scaling mode
rm | 3 | floating-point rounding mode
mo | 3 | memory ordering mode
0 | various | unused (reserved, must be zeros)

Formats marked in the table with an asterisk (*) allow the instruction to continue into the next bundle slot, forming a two-slot instruction. The primary codes of single-slot and dual-slot instructions are the same. Assembler code should explicitly request the extension of an instruction to the second slot with the suffix «.l» (long). The assembler adds dummy nop instructions to the code if a long instruction doesn't fit in the rest of the bundle and needs to start a new bundle.

addi    r23, r23, 1234
addi.l  r23, r23, 1234

Note: incidentally, the 42-bit slot format is in line with the «Answer to the Ultimate Question of Life, the Universe, and Everything»!

§ 2.4. Instruction addressing modes

The calculation of effective addresses is performed with wraparound modulo 2^64. Absolute addressing directly in instructions is absent; only position-independent code (PIC) can be used. Target addresses for executable code can only be calculated relative to the address of the current instruction bundle (instruction pointer ip) or relative to base addresses in general registers.

The architecture supports 2 modes for ip-relative code addressing:

EA = ip + 16 × sign_extend(disp)

The call/jump offset takes up 28 bits in the instruction slot and allows encoding a branch up to ±2 GiB in both directions from the current address. If a two-slot instruction is used, the branch distance is at most ±8 EiB on either side of the current address.

The jump offset
(bits 41…0): opcode | other | offset (28 bits)
(bits 83…42): 0 | continued (60 bits instead of 28)
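Under these definitions, the ip-relative branch target computation can be modeled as follows (a sketch only; branch_target is a hypothetical helper, not an architected name):

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Two's-complement interpretation of the low `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def branch_target(ip: int, disp: int, bits: int = 28) -> int:
    """EA = ip + 16 * sign_extend(disp), wrapping modulo 2**64.

    Offsets are counted in 16-byte bundles, hence the factor of 16.
    """
    return (ip + 16 * sign_extend(disp, bits)) & MASK64

# A 28-bit all-ones offset encodes -1: one bundle backwards.
assert branch_target(0x1000, 0xFFFFFFF) == 0x1000 - 16
```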

The branch offset takes 17 bits in the instruction slot and allows encoding a branch up to ±1 MiB in both directions from the current address. If a two-slot instruction is used, the offset takes 30 bits, and the branch distance is ±8 GiB in both directions. The branch condition is encoded by the other parts of the instruction.

The branch offset
(bits 41…0): opcode | other | offset (17 bits)
(bits 83…42): other | continued (30 bits instead of 17)

The linker, when creating the program module image, must replace all symbolic references to procedures and global data with offsets from the place where the symbol is accessed to the location of the symbol itself. Thus, for example, calls to the same static procedure from different places in the program use different relative offsets.

The architecture also supports base-relative instruction addressing. The effective address is computed as the sum of two registers, aligned down to the bundle boundary.

EA = (GR[base] + GR[index]) & mask{63:4}.

base-relative branch
(bits 41…0): opcode | other | base | index | other

§ 2.5. Data addressing modes

Absolute data addressing isn't directly supported. The architecture makes it impossible to put absolute static addresses into the instruction code. Only position-independent code (PIC/PIE) is available. Target absolute addresses can be calculated relative to the address of the current instruction bundle or reserved base registers only. The architecture supports the following data addressing modes:

  1. base ip-relative
  2. base plus displacement addressing mode or later simply base with offset addressing
  3. base plus scaled index addressing mode or later simply scaled indexed addressing
  4. base with base immediate pre or post-update

For ip-relative addressing, the unsigned immediate disp field (28 bits, or 64 bits in a dual-slot instruction) is zero-extended and added to the contents of the instruction pointer to produce a 64-bit effective address. We assume that program data sections like «.data» or «.rodata» are placed strictly after code sections like «.text» in the loaded program. The 28-bit immediate allows addressing 256 MiB forward from the current bundle. Dual-slot instructions allow addressing the full 64-bit address space.

EA = ip + zero_extend(disp)

Relative addressing
(bits 41…0): opcode | target | disp (28 bits)
(bits 83…42): 0 | continued (64 bits instead of 28)

For the base plus displacement addressing mode, the disp offset (21 bits, or 63 bits in a dual-slot instruction) is sign-extended and added to the contents of the base register to produce a 64-bit effective address. The 21-bit immediate disp allows addressing ±1 MiB in both directions from the base address.

EA = GR[base] + sign_extend(disp)

Base with offset addressing
(bits 41…0): opcode | target | base | disp (21 bits)
(bits 83…42): continued (63 bits instead of 21)

For the scaled indexed addressing mode, first the contents of the index register are extended according to the sm instruction modifier, which may be x64 (no extension), u32 (32-bit unsigned), or i32 (32-bit signed). Then the extended index is shifted left by scale, added to the 9-bit signed offset disp (−256…255), and added to the contents of the base register to produce a 64-bit effective address.

EA = GR[base] + (SM(GR[index]) << scale) + sign_extend(disp)

Indexed (scaled) addressing
(bits 41…0): opcode | target | base | index | scale | sm | disp (9 bits)
(bits 83…42): disp continued (51 bits instead of 9)
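The scaled indexed computation described above can be modeled as follows (a sketch only; the function name and string-valued sm modes are illustrative, not architected names):

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Two's-complement interpretation of the low `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def scaled_index_ea(base: int, index: int, scale: int, sm: str, disp: int) -> int:
    """EA = base + (SM(index) << scale) + sign_extend(disp, 9), modulo 2**64."""
    if sm == "u32":                  # 32-bit unsigned index
        index &= 0xFFFFFFFF
    elif sm == "i32":                # 32-bit signed index
        index = sign_extend(index, 32)
    # sm == "x64": the full 64-bit index is used unchanged
    return (base + (index << scale) + sign_extend(disp, 9)) & MASK64

# In i32 mode an all-ones index is -1; with scale 3 it contributes -8 bytes.
assert scaled_index_ea(0x1000, 0xFFFFFFFF, 3, "i32", 0) == 0xFF8
```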

For the base with base immediate post-update addressing mode, the 10-bit stride immediate is added to the base register after the memory access.

EA = GR[base]

GR[base] = EA + sign_extend(stride)

For the base with base immediate pre-update addressing mode, the 10-bit stride immediate is added to the base register before the memory access.

EA = GR[base] + sign_extend(stride)

GR[base] = EA

base with base immediate pre/post-update
(bits 41…0): opcode | target | base | stride (10 bits) | opx
(bits 83…42): stride continued (52 bits instead of 10)
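The pre- and post-update rules can be modeled as follows (hypothetical helpers for illustration; each returns the effective address used for the access and the new base value):

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Two's-complement interpretation of the low `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def post_update(base: int, stride: int) -> tuple[int, int]:
    """Access at the old base, then advance it by the 10-bit signed stride."""
    ea = base
    return ea, (ea + sign_extend(stride, 10)) & MASK64

def pre_update(base: int, stride: int) -> tuple[int, int]:
    """Advance the base by the stride first, then access at the new address."""
    ea = (base + sign_extend(stride, 10)) & MASK64
    return ea, ea

# Walking an array of 8-byte elements with post-update addressing:
ea, new_base = post_update(0x2000, 8)
assert (ea, new_base) == (0x2000, 0x2008)
```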

Other addressing modes can be implemented through the above. Absolute data addressing can be implemented by using any register with the value 0 as the base. Static data should be aligned to a 4-byte boundary; if it is not, it can be addressed by base+displacement addressing after placing the base address in one of the free registers. The special instruction ldar (load address relative) makes this preparation easier.

ldar base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)

Here ip_relative_offset is the label of the loaded object in the immutable data segment; text_hi is a built-in assembler function that calculates the relative address of the instruction bundle (or aligned 16-byte data portion); text_lo is a built-in assembler function that calculates the displacement within a bundle (portion). Using the ldar instruction, you can address 1 GiB on either side of the current position, or the entire address space if you use the two-slot version of ldar:

ldar.l base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)
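Assuming text_hi yields the 16-byte-aligned (bundle) part of the ip-relative offset and text_lo the displacement within the bundle, as described above, the split might be modeled as follows (the exact arithmetic is an assumption for illustration, not taken from the assembler's definition):

```python
def text_hi(offset: int) -> int:
    """Assumed: the bundle-aligned (16-byte) part of an ip-relative offset."""
    return offset & ~0xF

def text_lo(offset: int) -> int:
    """Assumed: the displacement within the 16-byte bundle."""
    return offset & 0xF

# The two parts always recombine into the original offset.
offset = 0x1234
assert text_hi(offset) + text_lo(offset) == offset
```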

Addressing of private data can be implemented by first placing a suitable base address in one of the free registers. The special instruction ldan (load address near) calculates the nearest base address pointing to the middle of the page containing the desired object.

ldan base, gp, data_hi (gp_relative_offset)
ldd dst, base, data_lo (gp_relative_offset)

Here gp_relative_offset is the label of the object in the data segment; data_hi is a built-in assembler function that calculates the high part of the offset (relative to gp) to the middle of the data page where the label is located; data_lo is a built-in assembler function that calculates the offset of the label relative to the middle of that page. Using the ldan instruction, you can address 1 GiB of private data (or the entire address space with the two-slot version of ldan).

You can also directly use the dual-slot memory access instructions, addressing 2^63 bytes in both directions from the base address.

ldd.l dst, gp, gp_relative_offset

§ 2.6. Special registers

There are several special registers, each 64 bits long. Not all special registers are available for direct access; most are available only to privileged software (at the system level). The table provides information on the purpose of the special registers and their availability in protected and privileged mode.

Table 2.5: Special Registers
Group | Register | Description
Available to the program at any privilege level for direct and/or indirect reading and updating | ip | instruction pointer
 | fpcr | floating-point status/control register
 | rsc | register stack control
 | rsp | register stack pointer
 | eip | exception instruction pointer
 | ebs | exception bit stack
 | eca | exception context address
Available for reading/writing only at the system privilege level | bsp | bottom stack pointer
 | peb | process env block
 | teb | thread env block
 | reip | returnable default exception instruction pointer
 | itc | interval time counter
 | itm | interval time match register
 | psr | processor status register
 | pta | page table addresses
Debug facility registers | ibr0…ibr3 | instruction breakpoint registers
 | dbr0…dbr3 | data breakpoint registers
 | mr0…mr8 | monitoring registers
Registers for switching to the kernel and making system calls (available only in the kernel) | kip | kernel instruction pointer
 | ksp | kernel stack pointer
 | krsp | kernel register stack pointer
Registers for interrupt handling (interrupt context descriptors, shadow copies of general registers), available in the handler | iip | interruption instruction pointer
 | iipa | interruption instruction previous address
 | ipsr | interruption processor status register
 | cause | interruption cause register
 | iva | interruption vector address
 | ifa | interruption faulting address
 | iib | interruption instruction bundle
Registers of the built-in interrupt controller for controlling external interrupts and asynchronous interrupts from the processor itself (available only at the system level) | tpr | task priority register
 | iv | interrupt vector
 | lid | local identification register (read only)
 | irr0…irr3 | interrupt request registers (read only)
 | isr0…isr3 | interrupt service registers (read only)
 | itcv | interval time counter vector
 | tsv | thermal sensor vector
 | pmv | performance monitor vector
 | cmcv | corrected machine-check vector

Direct access to special registers can be obtained using instructions mfspr (move from special-purpose register) and mtspr (move to special-purpose register). You can copy the special register to the general register (mfspr), perform the necessary operations, and then put the new value in a special register (mtspr).

The format of the mtspr and mfspr instructions
(bits 41…0): opcode | ra | 0 | spr | 0 | opx

Syntax:

mfspr ra, spr
mtspr ra, spr

The special register instruction pointer (ip) stores the address of the bundle containing the currently executing instruction. The ip register can be read directly via the mfspr instruction, but it is better to obtain an ip-relative address (including one with zero offset) using the ldar/ldafr instruction. The ip register cannot be changed directly (via the mtspr instruction), but it automatically increases at the end of bundle execution and receives a new value as a result of taken branch instructions. ip is also an implicitly implied operand in relative branches. Because the instruction format is regular and instruction bundles have a fixed length of 16 bytes and are aligned on a 16-byte boundary, the lower 4 bits of ip are always zero; writes to them are ignored.

Register format ip
(bits 63…4): bundle address
(bits 3…0): 0

The special floating-point status/control register (fpcr) is designed to control the floating-point unit (FPU).

The special registers rsc and rsp are used to control register rotation and to flush the contents of the circular register buffer into memory.

Special registers eip (exception instruction pointer), reip (returnable default exception instruction pointer), ebs (exception bit stack), eca (exception context address) are used to implement almost zero-cost software exceptions (like C++ try/catch/throw).

The 64-bit special processor status register (psr) controls the current core behavior. It is writable only at the most privileged level, and changing it requires explicit serialization.

Register format psr
(bits 63…32): future
(bits 31…0): 0 | ri | 0 | pl | vm | pp | mc | us | ib | ic | ss | tb | lp | dd | id | pm
Table 2.6: The psr fields
Group | Field | Size | Description
Miscellaneous | pm | 1 | User performance monitor enabled. If 1, the performance monitor is turned on and counts events; otherwise the performance monitor is disabled.
Predication | future | 32 | Nullification mask (described below).

The future field is used to control the nullification of subsequent instructions. A nullification instruction may mark any of the subsequent 10 instructions as non-executing in this field. A bit value of 0 means that the instruction executes; 1 means it is nullified (does not execute).

The field is automatically shifted right as each instruction is executed, with zeros entering for the new farthest instructions. In the case of a branch, the mask is completely cleared, thereby canceling all pending nullifications.
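A possible model of the future-field mechanics (the choice of bit 0 as the position of the next instruction is an assumption for illustration, not an architected detail):

```python
FUTURE_BITS = 32  # size of the psr.future field

def retire_instruction(future: int) -> tuple[bool, int]:
    """Consume one instruction: report whether it is nullified, then shift.

    Bit 0 (assumed position) corresponds to the next instruction to
    execute; 1 means it is treated as a nop. Zeros shift in from the top.
    """
    nullified = bool(future & 1)
    return nullified, (future >> 1) & ((1 << FUTURE_BITS) - 1)

def on_branch(future: int) -> int:
    """A branch clears the whole mask, canceling all pending nullifications."""
    return 0

# Mask 0b0110: the first following instruction executes, the next two do not.
future = 0b0110
results = []
for _ in range(4):
    skip, future = retire_instruction(future)
    results.append(skip)
assert results == [False, True, True, False]
```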

Debugger  id  1  Instruction Debug Breakpoint fault. If psr.id=1, breakpoints for instructions are enabled and may cause an Instruction Debug fault. Otherwise, faults and traps on the address breakpoint are disabled.
dd  1  Data Debug Breakpoint fault. If psr.dd=1, breakpoints for the data are enabled and may cause a Data Debug fault. Otherwise, faults and traps on the address breakpoint are disabled.
lp  1  Lower Privilege transfer trap. If 1, the Lower Privilege Transfer trap occurs when a control transfer lowers the privilege level (the psr.cpl number increases).
tb  1  Taken Branch trap. If 1, any taken branch causes the Taken Branch debug trap. Interruption and return from interruption don't cause this trap.
ss  1  Single Step trap. If 1, the Single Step debug trap occurs after the successful execution of each instruction.
Privileges, restrictions  cpl  1  Current privilege level of the executing thread. Controls the availability of system registers, instructions, and virtual memory pages. The value 0 is the kernel level, the value 1 is the user level. Modified by the instructions syscall, sysret, rfi, trap.
Interrupts  ri  2  Restart Instruction. Stores the size of the executed part of the current instruction bundle. Used to partially restart the bundle after call, syscall, or interruption. Instructions before the ipsr.ri position are not re-executed (instructions are skipped while psr.ri is less than the value saved in ipsr.ri at interruption).
ib  1  Interruption Bit. If 1, unmasked pending external interrupts can interrupt the processor and transfer control to the external interrupt handler. If 0, pending external interrupts cannot interrupt the processor.
ic  1  Interruption Collection. If 1, then upon interruption partial preservation of the context occurs (using the registers iip, iipa, ipsr, ifa, iib).
us  1  Used Shadow registers. If 1, then during the interruption partial preservation of the context uses the shadow registers (iip, ipsr).
mc  1  Machine Check. If 1, machine check aborts are masked.
vm  1  Virtual Machine. If 1, attempting to execute some instructions results in a «Virtualization fault». If there is no virtualization implementation, this bit is not implemented and is reserved. The psr.vm bit is accessible only to the rfi and vmsw instructions.

The special register bsp (bottom stack pointer) stores the bottom limit of the downward-growing stack, whose current position is stored in the general register sp. The architecture assumes that all unused stack pages are premapped as guard pages and may be allocated in any order; it doesn't use pre-touching for allocated stack frames. The bsp should be page-aligned.

The special registers peb (process environment block) and teb (thread environment block) store read-only user-mode addresses of the associated process and thread data blocks, respectively.

The special register interval time counter (itc) is an unsigned 64-bit counter for measuring time intervals and for synchronization at intervals on the order of nanoseconds. itc advances at a fixed ratio to the processor frequency: it increases by one every N cycles, where N is an implementation-defined integer power of two from 1 to 32. Applications can directly read itc for time-based computing and performance measurements. itc can only be written at the most privileged level. The OS must ensure that an interrupt from the system timer occurs before itc overflows. It is not architecturally guaranteed that the interval time counters of other processors in a multiprocessor system are synchronized with each other, nor with the system clock. Software must calibrate itc against a valid calendar time and periodically adjust for possible drift.

Modifications of itc aren't necessarily synchronized with the instruction thread. Explicit synchronization may be required to ensure that modifications to itc are observed by the subsequent program instructions. The software should take into account the possible spread of errors when reading the interval timer due to various machine stops, such as interrupts, etc.

Special interval timer match register (itm) is a 64-bit unsigned number which contains the future value of itc at which an «interval time match» interrupt will occur.

The special register pta (page table address) controls hardware address translation and stores the root address of the page table.

Special registers iip, iipa, ipsr save part of the context (state) of the processor upon interruption.

Special registers iva, cause, ifa, iib manage the interrupt table (iva), as well as recognition and processing of interrupts.

Special registers lid, iv, tpr, irr0 - irr3, isr0 - isr3, itcv, tsv, pmv, cmcv are for the embedded programmable interrupt controller and manage external interrupts.

Special registers ibr0-ibr3, dbr0-dbr3, mr0-mr7 are for the debugging and monitoring facility.

Chapter 3. Basic instruction set

This chapter describes the basic virtual processor instruction set: approximately 300 true machine instructions and 30 pseudo-instructions (assembler instructions that have no exact machine analogs and are replaced by the assembler with other machine instructions, possibly with argument correction). It includes instructions for working with general registers, branch instructions, and instructions for working with special registers. It doesn't include privileged instructions, floating-point instructions, multimedia instructions, or support instructions for the extended (virtual) memory system.

§ 3.1. Register-register binary instructions

The register-register binary instructions have 3 arguments. The first argument is the result register number, the second and third are the numbers of the operand registers.

register-register instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc 0 opx

Syntax:

INSTRUCTION_NAME ra, rb, rc
Instruction  Operation  Description
Arithmetic instructions
add      Ra = Rb + Rc          Addition (64 bits)
addws    Ra = Rb + Rc          Addition (word, sign-extend)
addwz    Ra = Rb + Rc          Addition (word, zero-extend)
sub      Ra = Rb − Rc          Subtraction (64 bits)
subws    Ra = Rb − Rc          Subtraction (word, sign-extend)
subwz    Ra = Rb − Rc          Subtraction (word, zero-extend)
absd     Ra = abs(Rb − Rc)     Absolute difference (64 bits)
absdw    Ra = abs(Rb − Rc)     Absolute difference (32 bits)
mul      Ra = LOPART(Rb × Rc)  Multiplication (the low part of 128 bits)
mulws    Ra = sext(Rb × Rc)    Multiply word, sign-extend
mulwz    Ra = zext(Rb × Rc)    Multiply word, zero-extend
mulhs    Ra = HIPART(Rb × Rc)  Signed multiplication (the high part of 128 bits)
mulhu    Ra = HIPART(Rb × Rc)  Unsigned multiplication (the high part of 128 bits)
div      Ra = Rb / Rc          Signed division
divu     Ra = Rb / Rc          Unsigned division
mod      Ra = Rb % Rc          Remainder of signed division
modu     Ra = Rb % Rc          Remainder of unsigned division
Bitwise instructions
and      Ra = Rb AND Rc        Bitwise AND
andn     Ra = NOT(Rb) AND Rc   Bitwise AND with inversion of the first operand
or       Ra = Rb OR Rc         Bitwise OR
orn      Ra = NOT(Rb) OR Rc    Bitwise OR with inversion of the first operand
nand     Ra = NOT(Rb AND Rc)   Bitwise AND with result inversion
nor      Ra = NOT(Rb OR Rc)    Bitwise OR with result inversion
xor      Ra = Rb XOR Rc        Bitwise XOR
xnor     Ra = NOT(Rb XOR Rc)   Bitwise XOR with result inversion
Compare instructions (64 bit)
cmpdeq   Ra = Rb == Rc         Compare equal
cmpdne   Ra = Rb != Rc         Compare not equal
cmpdlt   Ra = Rb < Rc          Signed compare less
cmpdle   Ra = Rb <= Rc         Signed compare less or equal
cmpdltu  Ra = Rb < Rc          Unsigned compare less
cmpdleu  Ra = Rb <= Rc         Unsigned compare less or equal
cmpdgt   pseudo-instruction    argument swap and cmpdlt
cmpdge   pseudo-instruction    argument swap and cmpdle
cmpdgtu  pseudo-instruction    argument swap and cmpdltu
cmpdgeu  pseudo-instruction    argument swap and cmpdleu
Compare instructions (32 bit)
cmpweq   Ra = Rb == Rc         Compare equal
cmpwne   Ra = Rb != Rc         Compare not equal
cmpwlt   Ra = Rb < Rc          Signed compare less
cmpwle   Ra = Rb <= Rc         Signed compare less or equal
cmpwltu  Ra = Rb < Rc          Unsigned compare less
cmpwleu  Ra = Rb <= Rc         Unsigned compare less or equal
cmpwgt   pseudo-instruction    argument swap and cmpwlt
cmpwge   pseudo-instruction    argument swap and cmpwle
cmpwgtu  pseudo-instruction    argument swap and cmpwltu
cmpwgeu  pseudo-instruction    argument swap and cmpwleu
Min/Max instructions
mins     Ra = MIN(Rb, Rc)      Minimum (signed)
minu     Ra = MIN(Rb, Rc)      Minimum (unsigned)
maxs     Ra = MAX(Rb, Rc)      Maximum (signed)
maxu     Ra = MAX(Rb, Rc)      Maximum (unsigned)
Shift instructions
sll      Ra = Rb << Rc         Shift left, zero fill
srl      Ra = Rb >> Rc         Shift right, zero fill
sra      Ra = Rb >> Rc         Shift right, sign extension
srd      Ra = Rb >> Rc         Shift right as signed division

The architecture doesn't use bit flags to store comparison results and doesn't use them as implicit operands/results, as, for example, do the architectures Intel X86, SPARC, IBM POWER. The comparison result as a value of 0 or 1 is stored in the general register. In this sense, POSTRISC is similar to MIPS or Alpha architectures. Additionally, to reduce the data path, instructions for determining the minimum/maximum are implemented (comparison and selection in one instruction).

These eight bitwise register-register instructions are enough to implement any binary logic function with a single instruction.

The shift value for the register-register shift instructions is defined as the lower bits of the third register: 5 bits (for 32 bit operations) or 6 bits (for 64-bit operations) or 7 bits (for 128 bit operations). High bits are ignored.

The shift-right-as-division instructions perform a right shift according to the rules for dividing signed numbers. First, an arithmetic right shift is performed (with sign-bit extension). If the obtained value is negative and nonzero bits were shifted out to the right, the result is corrected by adding one. The instruction was introduced to quickly divide signed numbers by 2^shift according to the rules of languages like C/C++ for dividing negative numbers. With this division, the result is symmetrical with respect to zero, and the remainder can be negative.
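A minimal Python model of the correction rule described above (a behavioral sketch, not the hardware algorithm; Python's `>>` on ints is already arithmetic):

```python
def sra(x, shift):
    """Arithmetic right shift of a signed value."""
    return x >> shift

def srd(x, shift):
    """Right shift as signed division by 2**shift, rounding toward zero.
    Correction: add one when the value is negative and nonzero bits
    were shifted out to the right."""
    q = x >> shift
    if x < 0 and (x & ((1 << shift) - 1)) != 0:
        q += 1
    return q
```

The model shows the difference from plain sra: sra(-7, 1) yields −4 (rounding toward minus infinity), while srd(-7, 1) yields −3, matching C/C++ truncating division.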

§ 3.2. Register-immediate instructions

The register-immediate arithmetic instructions. The first argument is the result register number, the second is the operand register number, the third is an immediate value of 21 or 63 bits, sign- or zero-extended to 64 bits. Instructions of this group allow continuation of the immediate into the next bundle slot, forming a dual-slot instruction.

register-immediate instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src imm21(63)

Syntax:

INSTRUCTION_NAME ra, rb, simm
INSTRUCTION NAME ra, rb, imm
Instruction  Operation  Description
Arithmetic instructions
addi     Ra = Rb + imm          Addition
subfi    Ra = imm − Rb          Subtraction from the immediate
addiws   Ra = Rb + imm          Addition (32 bit signed)
addiwz   Ra = Rb + imm          Addition (32 bit unsigned)
subfiws  Ra = imm − Rb          Subtract from immediate (32 bit signed)
subfiwz  Ra = imm − Rb          Subtract from immediate (32 bit unsigned)
muli     Ra = LOPART(Rb × imm)  Multiplication (the low part of 128 bits)
mulwsi   Ra = sext(Rb × imm)    Multiply word, sign extension
mulwzi   Ra = zext(Rb × imm)    Multiply word, zero extension
divi     Ra = Rb / imm          Signed division
divui    Ra = Rb / imm          Unsigned division
modi     Ra = Rb % imm          Remainder of signed division
modui    Ra = Rb % imm          Remainder of unsigned division
Bitwise instructions
andi     Ra = Rb & imm          Bitwise AND
andni    Ra = NOT(Rb) & imm     Bitwise AND with register inversion
ori      Ra = Rb | imm          Bitwise OR
orni     Ra = NOT(Rb) | imm     Bitwise OR with register inversion
xori     Ra = Rb XOR imm        Bitwise XOR
Compare instructions (64 bit)
cmpdeqi  Ra = Rb == imm   Compare equal
cmpdnei  Ra = Rb != imm   Compare not equal
cmpdlti  Ra = Rb < imm    Signed compare less
cmpdltui Ra = Rb < imm    Unsigned compare less
cmpdgti  Ra = Rb > imm    Signed compare greater
cmpdgtui Ra = Rb > imm    Unsigned compare greater
cmpdlei  Ra = Rb <= imm   Signed compare less or equal (pseudo)
cmpdleui Ra = Rb <= imm   Unsigned compare less or equal (pseudo)
cmpdgei  Ra = Rb >= imm   Signed compare greater or equal (pseudo)
cmpdgeui Ra = Rb >= imm   Unsigned compare greater or equal (pseudo)
Compare instructions (32 bit)
cmpweqi  Ra = Rb == imm   Compare equal
cmpwnei  Ra = Rb != imm   Compare not equal
cmpwlti  Ra = Rb < imm    Signed compare less
cmpwltui Ra = Rb < imm    Unsigned compare less
cmpwgti  Ra = Rb > imm    Signed compare greater
cmpwgtui Ra = Rb > imm    Unsigned compare greater
cmpwlei  Ra = Rb <= imm   Signed compare less or equal (pseudo)
cmpwleui Ra = Rb <= imm   Unsigned compare less or equal (pseudo)
cmpwgei  Ra = Rb >= imm   Signed compare greater or equal (pseudo)
cmpwgeui Ra = Rb >= imm   Unsigned compare greater or equal (pseudo)
Min/Max instructions
minsi    Ra = smin(Rb, imm)   Minimum (signed)
minui    Ra = umin(Rb, imm)   Minimum (unsigned)
maxsi    Ra = smax(Rb, imm)   Maximum (signed)
maxui    Ra = umax(Rb, imm)   Maximum (unsigned)

For bitwise register-immediate instructions the immediate value is always sign-extended. Since the immediate can be inverted in advance, 5 instructions are enough instead of the 8 needed for two register operands.
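The claim can be checked with a small Python sketch: the three missing immediate forms (nand, nor, xnor) reduce to the five implemented ones by inverting the immediate at assembly time, via De Morgan's laws. The 64-bit masking here is for illustration; the actual immediates are sign-extended as described above:

```python
M = (1 << 64) - 1  # 64-bit mask

# the five implemented register-immediate bitwise instructions
def andi(rb, imm):  return rb & imm & M
def andni(rb, imm): return (~rb & imm) & M
def ori(rb, imm):   return (rb | imm) & M
def orni(rb, imm):  return (~rb | imm) & M
def xori(rb, imm):  return (rb ^ imm) & M

# the three missing forms, recovered by inverting the immediate in advance
def nandi(rb, imm): return orni(rb, ~imm & M)   # NOT(rb AND imm)
def nori(rb, imm):  return andni(rb, ~imm & M)  # NOT(rb OR imm)
def xnori(rb, imm): return xori(rb, ~imm & M)   # NOT(rb XOR imm)
```

Each recovered form is exact: for example NOT(rb AND imm) = NOT(rb) OR NOT(imm), which is orni with the inverted immediate.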

§ 3.3. Immediate shift/bitcount instructions

Binary instructions with a register and an immediate shift. Shift or rotation instructions shift the value from the src register by a fixed number of bits given by shift. Syntax:

INSTRUCTION_NAME dst, src, shift

Here the first argument is the number of the result register, second argument is the register number, third is the shift/rotate immediate value from 0 to 63.

Binary instruction format with immediate shift value
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src shift 0 opx
Table 3.3: Instructions where the second argument is shift constant
Instruction  Operation  Description
slli     shift left logical immediate      Shift left, zero fill
srli     shift right logical immediate     Shift right, zero fill
srai     shift right algebraic immediate   Shift right, sign extension
srdi     shift right dividing immediate    Shift right as signed division
cntpop   count population                  Bit population count
cntlz    count leading zeros               Number of consecutive zeros in the most significant bits
cnttz    count trailing zeros              Number of consecutive zeros in the least significant bits
permb    permute bits                      Bit permutation according to mask

The instructions cntpop, cntlz, cnttz count ones/zeros within an interval of shift bits. cntpop gives the total number of ones. cntlz gives the length of the continuous run of zeros from the beginning of the interval (the most significant bits), or shift + 1 if all bits are zero. cnttz gives the length of the continuous run of zeros from the end of the interval (the least significant bits), or shift + 1 if all bits are zero.

The instruction permb (permute bits) reverses the order of bits/bytes in the register according to the immediate mask shift. The mask determines which levels of neighbors participate in the rearrangement: bits, bit pairs, nibbles (four bits), bytes, byte pairs, and the four-byte halves of the original 64-bit value. For example, the maximum mask of 63 (all ones) is a permutation at all levels (a complete reversal of the bit order, as used in FFT), mask 1 is a permutation of adjacent bits only, mask 32 is a permutation of the two four-byte halves, mask 32 + 16 + 8 reverses the byte order (endianness) in the register, and mask 16 + 8 reverses the byte order within each four-byte word of the register.
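Assuming permb implements the classic generalized bit-reversal network (my reading of the description above, not a confirmed specification), a Python reference model looks like this:

```python
def permb(x, mask):
    """Generalized bit reversal sketch: for each set bit k in `mask`
    (k = 1, 2, 4, 8, 16, 32), swap adjacent groups of k bits."""
    M = (1 << 64) - 1
    stages = [
        (1,  0x5555555555555555),  # adjacent bits
        (2,  0x3333333333333333),  # adjacent bit pairs
        (4,  0x0F0F0F0F0F0F0F0F),  # adjacent nibbles
        (8,  0x00FF00FF00FF00FF),  # adjacent bytes
        (16, 0x0000FFFF0000FFFF),  # adjacent byte pairs
        (32, 0x00000000FFFFFFFF),  # the two four-byte halves
    ]
    for k, m in stages:
        if mask & k:
            x = (((x & m) << k) | ((x >> k) & m)) & M
    return x
```

With this model, mask 63 moves bit 0 to bit 63 (full reversal), and mask 32 + 16 + 8 produces the familiar byte swap, consistent with the examples in the text.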

§ 3.4. Register-register unary instructions

Instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 0 opx
Table 3.4: Unary instructions
Instruction Operation Description
mov    move register

Instruction mov (move register) copies data from one register to another.

Syntax:

mov   ra, rb

§ 3.5. Fused instructions

Fused instructions have more than two input parameters and can perform two or more actions in one instruction.

Table 3.5: Fused instructions
Name Operation
mov2     ra,rb,rc,rd
move 2 registers: gr[ra] = gr[rc], gr[rb] = gr[rd]
addadd   ra,rb,rc,rd
add and add: gr[ra] = gr[rb] + gr[rc] + gr[rd]
addsub   ra,rb,rc,rd
add and sub: gr[ra] = gr[rb] + gr[rc] − gr[rd]
subsub   ra,rb,rc,rd
sub and sub: gr[ra] = gr[rb] − gr[rc] − gr[rd]
muladd   ra,rb,rc,rd
multiply and add: gr[ra] = gr[rb] × gr[rc] + gr[rd]
mulsub   ra,rb,rc,rd
multiply and sub: gr[ra] = gr[rb] × gr[rc] − gr[rd]
mulsubf  ra,rb,rc,rd
multiply and sub from: gr[ra] = gr[rd] − gr[rb] × gr[rc]
mbsel    ra,rb,rc,rd
masked bit select: gr[ra] = gr[rb] ? gr[rc]: gr[rd] (bitwise)
slp      ra,rb,rc,rd
shift left pair
srp      ra,rb,rc,rd
shift right pair
slsrl    ra,rb,rc,rd
shift left and shift right logical
slsra    ra,rb,rc,rd
shift left and shift right algebraic
sladd    ra,rb,rc,shift
shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift)
sladdws  ra,rb,rc,shift
shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift)
sladdwz  ra,rb,rc,shift
shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift)
slsub    ra,rb,rc,shift
shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb]
slsubws  ra,rb,rc,shift
shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb]
slsubwz  ra,rb,rc,shift
shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb]
slsubf   ra,rb,rc,shift
shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift)
slsubfws ra,rb,rc,shift
shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift)
slsubfwz ra,rb,rc,shift
shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift)
slor     ra,rb,rc,shift
shift left and or: gr[ra] = gr[rb] | (gr[rc] << shift)
slxor    ra,rb,rc,shift
shift left and xor: gr[ra] = gr[rb] ^ (gr[rc] << shift)
srpi     ra,rb,rc,shift
shift right pair immediate
slsrli   ra,rb,shift,count
shift left and shift right logical immediate
slsrai   ra,rb,shift,count
shift left and shift right algebraic immediate
deps     ra,rb,shift,count
deposit set: Insert a group of units
depc     ra,rb,shift,count
deposit clear: Insert a group of zeros
depa     ra,rb,shift,count
deposit alter: Change group of bits
dep      ra,rb,rc,shift,pos
deposit: deposit of parts from two registers
rlmi     ra,rb,shift,count,pos
Fused 4-register instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc rd opx

The instruction mov2 (move 2 registers) moves 2 registers at once. It may be used for swapping register values or simply for reducing the code path.

Fused instructions of the shift-addition type are intended to reduce the critical data path in address calculations. They combine in one machine instruction a left shift (by 0 to 7 bits) with an addition (or subtraction). An open question remains how to handle overflow in shift-with-add: it may occur in the intermediate calculation (the shift) even though there is no place to record it in the final result.

srpi instruction formats
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc shift opx

The pair shift instructions slp (shift left pair), srp (shift right pair) and srpi (shift right pair immediate) shift two registers as a single value to the left (right) by count bits, and put the low part of the result in the result register. srp takes the count low bits from the second operand and the high bits from the first. The first argument is the result register number, the second and third are the numbers of the pair of shifted operand registers, and the fourth is a register or immediate count from 0 to 63 indicating the shift amount. The instruction can be used to implement many useful 64-bit operations: rotation by a fixed number of bits, left or right shift, extraction of a part of a register.
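A Python sketch of srp as a 128-bit funnel shift; the assignment of rb to the high half and rc to the low half is my reading of the description and may not match the real operand order:

```python
M = (1 << 64) - 1

def srp(rb, rc, count):
    """Shift right pair sketch: treat rb:rc as a 128-bit value
    (rb assumed high, rc assumed low), shift right by count,
    return the low 64 bits of the result."""
    pair = ((rb & M) << 64) | (rc & M)
    return (pair >> count) & M

def rotr(x, count):
    # rotation falls out of srp when both operand registers are equal
    return srp(x, x, count)
```

Passing the same register twice turns the pair shift into a rotate, which is one of the derived operations mentioned above.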

Double shift instructions produce a left and then a right shift (with arithmetic or logical extension). They can be used to extract the bit portion from the register and other manipulations.

slsrai, slsrli, deps, depc, depa instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb shift count opx

The dep (deposit) instruction combines the count least significant bits from the first operand register with the remaining bits from the second operand register: dep takes the high bits from the second and the count low bits from the first. The first argument is the result register number, the second and third are the numbers of the combined registers, and the fourth parameter count is an immediate number from 0 to 63 indicating the portion size taken from the first merged register.

dep instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src src shift pos

Direct deposit instructions copy from one register to another while changing part of the register: deps (deposit set) inserts a block of ones, depc (deposit clear) inserts a block of zeros, depa (deposit alter) inverts a block of bits. The block is count bits long and is located after the first shift bits. If the value of count + shift is greater than the register size (64 bits), the filling with ones/zeros or the inversion continues from the beginning of the register. The first argument is the result register number, the second is the source operand register number, the third and fourth are immediate values shift and count from 0 to 63.
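A Python sketch of the deposit block mask with the wraparound rule described above (function names mirror the mnemonics; the model is illustrative, not the hardware definition):

```python
M = (1 << 64) - 1

def _block_mask(shift, count):
    # a run of `count` ones starting at bit `shift`; bits shifted past
    # bit 63 wrap around to the beginning of the register
    m = ((1 << count) - 1) << shift
    return (m | (m >> 64)) & M

def deps(src, shift, count):  # deposit set: insert a block of ones
    return src | _block_mask(shift, count)

def depc(src, shift, count):  # deposit clear: insert a block of zeros
    return src & ~_block_mask(shift, count) & M

def depa(src, shift, count):  # deposit alter: invert a block of bits
    return src ^ _block_mask(shift, count)
```

For example, shift = 60 with count = 8 sets bits 60..63 and, wrapping, bits 0..3.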

The rlmi instruction extracts a portion of bits of a given length/position from the register and puts it at the specified position in the result register.

rlmi instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src shift count pos

§ 3.6. Conditional move instructions

Conditional move instructions copy data from one of two registers depending on a condition.

Syntax:

NAME ra, rb, rc, rd

Description:

ra =  cond(rb) ? rc : rd
Table 3.6: Conditional move instructions
Instruction Condition
cmovlsb  least significant bit is set
cmovweq  word equal 0
cmovwlt  word less than 0
cmovwle  word less than or equal 0
cmovdeq  doubleword equal 0
cmovdlt  doubleword less than 0
cmovdle  doubleword less than or equal 0

§ 3.7. Load/store instructions

The 1st group of the general-purpose register load/store instructions uses the base-plus-offset addressing mode. The first argument is the number of the loaded (stored) register target, the second is the base register number, the third is a 21-bit immediate offset disp. The instructions in this group allow continuation of the immediate value disp into the next slot of the bundle, forming a dual-slot instruction (63-bit offset). The offset disp, after sign extension, is added to the base register to produce a 64-bit effective address.

EA = gr[base] + sign_extend(disp)
Format of load/store instructions with basic addressing
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base disp21
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
continued disp (63 bits instead of 21)

The 2nd group of general-purpose load/store instructions uses the ip-relative addressing. The first argument is the number of the loaded (or stored) register target, second is an unsigned forward offset disp with a length of 28 bits. The instructions in this group allow continuation of the immediate value disp in the instruction code for the next slot of the bundle with the formation of a dual-slot instruction (64 bit offset).

EA = ip + zero_extend(disp)
Format of load/store instructions with ip-addressing
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target uimm28
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 continued disp (64 bits instead of 28)

The 3rd group of general register load/store instructions uses the basic scaled indexed addressing method. The first argument is the number of the loaded or saved register target, second is base register number base, third is index register index, next is shift amount scale, last is short offset disp 9 bits long, sign extended to 64 bits. The instructions in this group allow continuation of the immediate value disp in the instruction code for the next slot of the bundle with the formation of a dual-slot instruction (52 bit offset).

EA = gr[base] + (SM(gr[index]) << scale) + sign_extend(disp)
Scaled indexed instructions format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base index scale sm disp
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
continued disp (51 bits instead of 9)

The 4th group of load/store instructions use base addressing with base updating after usage by the immediate stride. Arguments: target register, base register, signed immediate stride (10 bits). The instructions in this group allow continuation of the immediate value stride in the instruction code for the next slot of the bundle with the formation of a dual-slot instruction (52 bit offset).

For load: (ld[s]Nmia):

EA = gr[base]
tmp = MEM(EA)
gr[base] = gr[base] + sign_extend(stride)
gr[target] = tmp

For store: (stNmia):

EA = gr[base]
MEM(EA) = gr[target]
gr[base] = gr[base] + sign_extend(stride)

The 5th group of load/store instructions use base addressing with base updating before usage by the immediate stride. Arguments are same as for post-update.

For load: (ld[s]Nmib):

EA = gr[base] + sign_extend(stride)
tmp = MEM(EA)
gr[base] = EA
gr[target] = tmp

For store: (stNmib):

EA = gr[base] + sign_extend(stride)
MEM(EA) = gr[target]
gr[base] = EA
Format of load/store instructions with immediate update
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base stride opx
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
continued stride (52 bits instead of 10)

The signed immediate disp is added to base to form an effective address. The signed immediate stride (non-zero, 10-bit) is added to base to form a new base. For loads, if target is the same as base, either the base update doesn't occur or the loaded value replaces the updated base. For stores, if target is the same as base, the base update occurs after the old value has been stored to memory.
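The five addressing groups above can be summarized in a Python sketch (64-bit wraparound arithmetic; `mem` is a plain dictionary standing in for memory; sign/zero extension of the immediates is assumed to have been done already):

```python
M = (1 << 64) - 1

def ea_base_disp(base, disp):
    """Group 1: base plus sign-extended displacement."""
    return (base + disp) & M

def ea_ip_relative(ip, disp):
    """Group 2: ip plus zero-extended (forward-only) displacement."""
    return (ip + disp) & M

def ea_scaled_index(base, index, scale, disp):
    """Group 3: base + (index << scale) + displacement (the sm
    sign/zero selection for the index is omitted in this sketch)."""
    return (base + ((index << scale) & M) + disp) & M

def ld_post_update(mem, base, stride):
    """Group 4 load: access at base, then advance base by the stride."""
    value = mem[base & M]
    return value, (base + stride) & M

def ld_pre_update(mem, base, stride):
    """Group 5 load: advance base by the stride, then access."""
    ea = (base + stride) & M
    return mem[ea], ea
```

The two update forms differ only in whether the stride is applied before or after the memory access, matching the mia/mib pseudocode above.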

Table 3.7: Load/store instructions
Size in bytes Operation Description, parameters
1 2 4 8 16
ldbz ldhz ldwz lddz ldq load base with offset addressing:
INSN target,base,disp21
ldbs ldhs ldws ldds load signed
stb sth stw std stq store
ldbzr ldhzr ldwzr lddzr ldqr load ip-relative addressing:
INSN target,disp28
ldbsr ldhsr ldwsr lddsr load signed
stbr sthr stwr stdr stqr store
ldbzx ldhzx ldwzx lddzx ldqx load scaled indexed addressing:
INSN target,base,index,scale,disp
ldbsx ldhsx ldwsx lddsx load signed
stbx sthx stwx stdx stqx store
ldbzmia ldhzmia ldwzmia lddzmia ldqmia load base update with immediate stride after memory access:
INSN target,base,stride
ldbsmia ldhsmia ldwsmia lddsmia load signed
stbmia sthmia stwmia stdmia stqmia store
ldbzmib ldhzmib ldwzmib lddzmib ldqmib load base update with immediate stride before memory access:
INSN target,base,stride
ldbsmib ldhsmib ldwsmib lddsmib load signed
stbmib sthmib stwmib stdmib stqmib store

§ 3.8. Branch instructions

Unconditional branch instructions jump to the effective address. Additionally, the return address can be stored in a general register. Using predication can turn an unconditional jump into a conditional one.

Instruction Operation Description
jmp    label
jump relative: ip-relative jump
jmpr   rb,rc
jump register indirect: base-relative jump
jmpt   rb,rc
jump table: jump to table-relative address
jmptws rb,rc
jump table word signed index: jump to table-relative address
jmptwz rb,rc
jump table word unsigned index: jump to table-relative address

The relative branch form is a universal instruction for a conditional or unconditional static branch or a procedure call to a relative address.

Relative branch instructions are generated according to the LDAR rule. After the operation code come a register for saving a possible return address and a 28-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±2 GiB in both directions from the current position for a one-slot instruction, and all of the available address space for a long instruction. The jmp instruction allows continuation of the immediate offset into the next slot of the bundle, forming a dual-slot instruction.

ip = ip + 16 × simm

Instruction format jmp
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode opx simm (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 extended label (60 bits instead of 28)

The instruction jmpr (jump register indirect) is used to branch to a base address held in a register. When calculating the target address, jmpr discards the 4 least significant bits of the result, so the target address is always aligned to the beginning of a bundle.

ip = (gr[base] + gr[index]) & mask{63: 4}

Instruction format jmpr
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 base index 0 opx

The jmpt (jump table), jmptws, jmptwz (jump table word indexed) instructions are intended for organizing table-driven select statements (the C switch operator with a continuous distribution of variants, preferably starting from zero). Traditionally, in most architectures, a table-driven switch uses a table of absolute addresses for storing entry points into the code of the variants. This table is private to each process (if the loader base code address differs). If the architecture implements relative addressing, the table of absolute addresses can be replaced by a table of relative offsets, shared by all processes and placed in the read-only data section.

Instruction format jmpt
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 base index 0 opx

jmpt: ip = base + mem4[base + 4 × index]

jmptws: ip = base + mem4[base + 4 × sign_extend(index)]

jmptwz: ip = base + mem4[base + 4 × zero_extend(index)]

.text
; limit = 7
 bdgti  selector, limit, default
 ldafr  base, table
 jmpt   base, selector

label_0:
...
label_1:
...
...
label_7:
...
default:
...
.rodata
table:
    dw (label_0 - table)
    ...
    dw (label_7 - table)

The instructions for the conditional branch calculate the condition and (if the condition is true) jump to the effective address. Traditionally (x86, x64, SPARC), conditional branch is implemented using two instructions – comparison (with the generation of flags of the logical result) and conditional branch (by flags). However, conditional branches are very common in programs. Therefore, POSTRISC uses combined compare and conditional branch instructions to compress code and shorten the critical data path.

Table 3.9: Conditional branch instructions
Instruction Operation
bdeq     ra, rb, label
branch if doubleword equal
bdne     ra, rb, label
branch if doubleword not equal
bdlt     ra, rb, label
branch if doubleword less than
bdltu    ra, rb, label
branch if doubleword less than unsigned
bdle     ra, rb, label
branch if doubleword less than or equal
bdleu    ra, rb, label
branch if doubleword less than or equal unsigned
bdgt     ra, rb, label
branch if doubleword greater than
bdgtu    ra, rb, label
branch if doubleword greater than unsigned
bdge     ra, rb, label
branch if doubleword greater than or equal
bdgeu    ra, rb, label
branch if doubleword greater than or equal unsigned
bdeqi    ra, simm, label
branch if doubleword equal immediate
bdnei    ra, simm, label
branch if doubleword not equal immediate
bdlti    ra, simm, label
branch if doubleword less than immediate
bdgti    ra, simm, label
branch if doubleword greater than immediate
bdltui   ra, uimm, label
branch if doubleword less than unsigned immediate
bdgtui   ra, uimm, label
branch if doubleword greater than unsigned immediate
bweq     ra, rb, label
branch if word equal
bwne     ra, rb, label
branch if word not equal
bwlt     ra, rb, label
branch if word less than
bwltu    ra, rb, label
branch if word less than unsigned
bwle     ra, rb, label
branch if word less than or equal
bwleu    ra, rb, label
branch if word less than or equal unsigned
bwgt     ra, rb, label
branch if word greater than
bwgtu    ra, rb, label
branch if word greater than unsigned
bwge     ra, rb, label
branch if word greater than or equal
bwgeu    ra, rb, label
branch if word greater than or equal unsigned
bweqi    ra, simm, label
branch if word equal immediate
bwnei    ra, simm, label
branch if word not equal immediate
bwlti    ra, simm, label
branch if word less than immediate
bwgti    ra, simm, label
branch if word greater than immediate
bwltui   ra, uimm, label
branch if word less than unsigned immediate
bwgtui   ra, uimm, label
branch if word greater than unsigned immediate
bbs      ra, rb, label
branch if bit set
bbsi     ra, shift, label
branch if bit set immediate
bbc      ra, rb, label
branch if bit clear
bbci     ra, shift, label
branch if bit clear immediate
bmall    ra, uimm, label
branch if mask all bits set
bmany    ra, uimm, label
branch if mask any bit set
bmnone   ra, uimm, label
branch if mask none bit set
bmnotall ra, uimm, label
branch if mask not all bit set

Relative branch instructions are formed according to the BRC, BRCI, BRCIU, BBIT rules. After the operation code come the first compared register, the second compared register (or an immediate shift amount), and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. In the case of a long instruction, the maximum distance increases to ±8 GiB on both sides of the current position.

The format of bdeq, bdne, bdlt, bdle, bdltu, bdleu, bbs, bbc
bits 41..0:  opcode | srcA | srcB | opx | label17
bits 83..42: 0 | label30

Instructions bgt (blt), bge (ble), bgtu (bltu), bgeu (bleu) are pseudo-instructions: the assembler swaps the order of the arguments and reduces them to the corresponding «less than» instructions.

Format of instructions bbci, bbsi
bits 41..0:  opcode | src | shift | opx | label17
bits 83..42: 0 | label30

Relative branch instructions that compare against an immediate follow the same pattern. After the operation code come the compared register, the immediate constant, and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. In the case of a long instruction, the maximum distance increases to ±8 GiB on both sides of the current position.
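The reach figures follow directly from the bundle-granular addressing: branch offsets are scaled by the 16-byte bundle size. A quick arithmetic check, assuming (as the ±8 GiB figure implies) that the long form effectively provides a 30-bit signed offset field:

```python
BUNDLE = 16  # branch offsets are scaled by the 16-byte bundle size

def reach(bits):
    # (min, max) byte displacement of a `bits`-wide signed, bundle-scaled field
    half = 1 << (bits - 1)
    return -half * BUNDLE, (half - 1) * BUNDLE

assert reach(17)[0] == -(1 << 20)   # 17-bit field reaches 1 MiB backwards
assert reach(30)[0] == -(8 << 30)   # 30-bit field reaches 8 GiB backwards
```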

The format of the instructions is bdeqi, bdnei, bdlti, bdgti
bits 41..0:  opcode | src | simm11 | label17
bits 83..42: simm40 | label30
Instruction format bdltui, bdgtui
bits 41..0:  opcode | src | uimm11 | label17
bits 83..42: uimm40 | label30

The loop control instructions optimize (by shortening the critical execution path) the most common forms of loops with a constant step. A loop control instruction adds the step (1 or -1) to the loop counter (the first register argument), checks the loop continuation condition by comparing the counter with the second register argument, and, if the condition is true, performs a relative branch to the effective address (the label argument).
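A minimal model of one loop-control step, using repdlt as the example. This is an illustrative sketch: the comparison is assumed to use the updated counter value (the same-register special case described below uses the old value instead), and overflow behavior is not modeled.

```python
def repdlt(counter, bound):
    # One repdlt step: counter += 1; the branch is taken while counter < bound.
    counter += 1
    return counter, counter < bound

# Model of a counted loop whose body runs for i = 0..4 (bound = 5),
# with repdlt at the bottom of the loop.
i, body_runs = 0, 0
while True:
    body_runs += 1                 # the loop body
    i, taken = repdlt(i, 5)
    if not taken:
        break

assert body_runs == 5
```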

Format of instructions like rep* (register-register comparison)
bits 41..0:  opcode | dst/src | src | opx | label17
bits 83..42: 0 | label30

Syntax:

INSTRUCTION_NAME ra, rb, label

A loop control instruction whose two register numbers are the same is a special case. The architecture defines that in this case the old register value participates in the comparison (as the bound of the counter). This can be used, for example, to branch when the counter overflows.

Table 3.10: Loop control instructions
Instruction Operation
repdlt   Add 1 and branch if doubleword less than (signed)
repdltu  Add 1 and branch if doubleword less than (unsigned)
repdle   Add 1 and branch if doubleword less than or equal (signed)
repdleu  Add 1 and branch if doubleword less than or equal (unsigned)
repdgt   Add -1 and branch if doubleword greater than (signed)
repdgtu  Add -1 and branch if doubleword greater than (unsigned)
repdge   Add -1 and branch if doubleword greater than or equal (signed)
repdgeu  Add -1 and branch if doubleword greater than or equal (unsigned)

A similar style of loop implementation, with minimal software overhead, is found on almost all DSPs (digital signal processors). Among general-purpose processors, a limited form (with a special iteration-counter register) is implemented in the IBM PowerPC and Intel Itanium architectures, while universal add-compare-branch instructions on general-purpose registers are available in the HP PA-RISC architecture (instructions addb, addib), the DEC VAX architecture (aobleq, aoblss, sobgeq, sobgtr), and the IBM S/390 architecture (brct, bctr, bxle).

§ 3.9. Miscellaneous instructions

The instruction ldi (load immediate) loads a constant into a register (the high 64 bits are cleared). The first argument of ldi is the result register number; the second is an immediate value 28 bits long (in the short form it is sign-extended to 64 bits) or a full 64 bits (in the dual-slot form).

The instruction ldih (load immediate into high 64 bits) loads a constant into the upper part of the 128-bit register (the lower 64 bits remain unchanged). The first argument is the result register number; the second is an immediate value 28 bits long (in the short form it is sign-extended to 64 bits) or a full 64 bits (in the dual-slot form).
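The split between ldi and ldih can be sketched on a 128-bit register modeled as a Python integer. This is an illustrative model of the two halves, not the implementation:

```python
MASK64 = (1 << 64) - 1

def ldi(simm):
    # ldi: sign-extended constant into the low 64 bits, high 64 bits cleared
    return simm & MASK64

def ldih(reg, simm):
    # ldih: replace the high 64 bits, keep the low 64 bits unchanged
    return ((simm & MASK64) << 64) | (reg & MASK64)

r = ldi(-1)                 # low half all ones, high half zero
assert r == MASK64
r = ldih(r, 0x1234)         # only the upper half changes
assert r == (0x1234 << 64) | MASK64
```

An ldi/ldih pair thus builds an arbitrary 128-bit constant in two instructions.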

INSTRUCTION_NAME dst, simm
Instruction format ldi, ldih
bits 41..0:  opcode | dst | simm28
bits 83..42: 0 | simm (extended to 64 bits)

The nop instruction (no operation) exists for a single purpose: code alignment – filling unused slots in instruction bundles and enabling optimal instruction fetch from memory.

For example, to place a label in the code, the compiler should pad the last (incomplete) bundle with dummy instructions and put the first instruction after the label into a new bundle (since a branch can target only the beginning of a bundle). Likewise, various implementations may gain performance if the target address of a frequently taken branch is aligned on a 32/64/128-byte boundary (not just the beginning of a bundle, but the beginning of a cache line).

This instruction should not be used for any other purpose. The architecture has no software-visible delays to pad: no load delays, no branch delays, no pipeline hazards.

The nop instruction is handled at the fetch stage; it need not be issued to the later pipeline stages, it never retires as a real operation, and it never causes an interrupt (at the detect stage) by itself. It has no read or write dependencies.

The assembler automatically adds nop instructions to fill an incomplete bundle when the next instruction must be placed in a new bundle (in the case of a label or a long instruction). The instruction has one immediate argument (unused).

Instruction format nop
bits 41..0:  opcode | opx | simm28
bits 83..42: 0 | extended simm (64 bits instead of 28)

Undefined instruction codes are reserved and can be used for future extensions (new instructions). One instruction, undef, is specially defined as forever reserved. The assembler can add it automatically to fill an incomplete bundle after an unconditional jump, a function call, or a return from a function. It is also used to fill the tail of code segments. The instruction has one immediate argument (unused).

Instruction format undef
bits 41..0: opcode | 0 | opx

Chapter 4. The software exceptions support

Software exceptions support C++-like throw/try/catch exceptions as well as more general SEH-like exceptions. POSTRISC is planned to support deterministic exception handling via frame-based unwinding with sufficient hardware support. Truly zero cost is expected when no exception is thrown, and fast unwinding when one is.

The 128-bit link register r0 preserves an 18-bit eip offset, which allows an alternate return point in case of exception. The exception landing pad must lie no more than 4 MiB after the current return address. The return instructions jump either to the normal return address or to the landing pad, depending on the exception state.

Link register format r0
bits 63..0:   return address | 0 | ri
bits 127..64: preserved caller future eip offset | out-size | framesize

alternate_retaddr = retaddr + 16 × ZEXT(eip_offset)
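The formula above, together with the 18-bit width of the eip offset field and the 16-byte bundle size, bounds the landing pad distance to under 4 MiB. A small check (the concrete return address in the example is arbitrary):

```python
BUNDLE = 16
EIP_OFFSET_BITS = 18   # width of the eip offset field in r0

def alternate_retaddr(retaddr, eip_offset):
    # alternate_retaddr = retaddr + 16 * ZEXT(eip_offset)
    assert 0 <= eip_offset < (1 << EIP_OFFSET_BITS)
    return retaddr + BUNDLE * eip_offset

# The maximum reachable landing pad distance stays under 4 MiB.
assert BUNDLE * ((1 << EIP_OFFSET_BITS) - 1) < (4 << 20)
assert alternate_retaddr(0x40_0000, 0x10) == 0x40_0000 + 256
```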

§ 4.1. Program state for exception

The special register eip always holds the address of the next proper part of the unwinding code. This register is automatically restored during a normal subroutine return. It is modified during object construction and destruction. The special register eca holds the thrown value (usually the address of the thrown object).

Two return addresses are saved to the link register during a subroutine call: one for normal return and one for exception return. Because registers are 128 bits long, there is enough room for both. But since the frame info and the previous future vector must also be stored, the exception return address is stored as a positive offset from the normal return address. The exception landing pad must therefore be within 4 MiB after the function body.

So there is no need to return an optional pair (normal return value plus optional exception info) and check after every call for a possible software exception. A subroutine that throws finally returns directly to the proper next part of the unwinding code.

The instruction ehthrow sets the special register eca to the value (gr[src] + simm21). Usually this is the address of the exception context. Execution then jumps to the eip address.

Instruction format ehthrow
bits 41..0: opcode | opx | src | simm21

The instruction ehadj should be executed after the successful construction of an object that requires destruction. It checks the current eca context and jumps to the current eip if an exception is set. Otherwise, it adjusts the eip register to the new actual unwinding code address and continues normally to the next instruction.

Instruction format ehadj
bits 41..0: opcode | opx | simm (28 bits)

The instruction ehcatch copies the exception context eca to a general register, clears eca, and adjusts eip to the new value ip + offset × 16.

The instruction ehcatch should be executed before a catch block or before an object destructor. For a catch block it adjusts eip to the end of the catch block; before an object destructor it adjusts eip to the position after the destructor.

Instruction format ehcatch
bits 41..0: opcode | opx | dst | 0 | label17 (30)

The instruction ehnext should be executed after an object destructor. It restores the exception context saved by ehcatch before the destructor call and checks for a possible double exception fault. If this is a second software exception raised while unwinding the first one, a hardware exception occurs. Otherwise, if this is a normal destructor call during unwinding of a software exception, execution continues at the new eip address. Otherwise (a normal destructor call without any unwinding), execution continues with the next instruction.

Instruction format ehnext
bits 41..0: opcode | opx | src | 0 | label17 (30)

Chapter 5. The register stack

§ 5.1. Registers rotation

Traditionally (in most architectures) the register file is a global resource: all registers are visible to all procedures in a program. If a procedure wants to use a register, the contents of that register must be saved to memory and later restored from memory. The work of saving/restoring registers is usually divided between the procedure that makes the call (caller) and the one that is called (callee). For example, the caller may be responsible for saving the first 14 of 32 registers, and the callee for the remaining 18. The optimal split depends on the processor architecture – the number of registers and their universality (orthogonality) – and for new architectures it is usually determined experimentally by comparing code efficiency for different variants over a statistically large codebase.

Within one procedure, you can optimize the use of registers well, but in the case of several procedures, and especially if they are compiled separately, the use of register resources becomes suboptimal. A typical example of extreme inefficiency is recursive procedures. Even if the recursive procedure uses only one of the N available registers, each recursive call to such a procedure wants to use exactly this specific register, therefore, this register is repeatedly spilled/filled, despite the presence of many unused registers.

Summing up these arguments, a significant percentage of all memory accesses are register spill/fill operations, which in essence are not useful work. This fraction depends little on the total number of registers, because procedure code is bound to specific registers. So increasing the number of registers, although it helps large and complex procedures, does nothing to reduce inter-procedure register saving and restoring. The fraction grows as the number of procedures increases and their average size decreases (as is typical for object-oriented programming languages).

The solution to this problem is hardware register rotation. Registers are no longer a global resource: each called procedure gets its own working subset of registers. Saving and restoring registers is not required as long as the working register sets of the nested procedure calls fit in the register file.

For example, in the POSTRISC architecture, the file of 128 general-purpose registers is divided into two subsets: up to 120 rotatable (stackable) registers r0 - r119, visible only locally to the current procedure, and 8 static registers g0 - g3, tp, fp, sp, gz, visible globally to all procedures. The register stack mechanism is implemented through circular renaming of registers as a side effect of procedure calls and returns. The renaming mechanism is not visible to the program. There are 128 rotating registers in the hardware circular buffer; the power-of-two size makes the wrap-around position cheap to compute. In total, the logical general-purpose register file has 136 (128 + 8) registers, of which at most 128 are simultaneously available to the program.
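The circular renaming itself reduces to modular arithmetic over the buffer size. The sketch below takes the buffer size of 128 from the text; the mapping function is an assumption about how an implementation might realize the renaming, not a specified mechanism:

```python
BUF_SIZE = 128   # hardware circular buffer of rotating registers

def physical(bof, logical):
    # Map a logical local register to its physical buffer slot; the
    # power-of-two buffer size turns the modulo into a cheap bit mask.
    return (bof + logical) & (BUF_SIZE - 1)

assert physical(0, 5) == 5
assert physical(120, 10) == 2    # wraps around the ring buffer
```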

Static registers must be saved and restored at procedure boundaries in accordance with the software conventions (ABI). Stackable registers are saved and restored automatically by the corresponding hardware mechanism, without explicit participation of the program. All other register files are visible to all procedures and must be saved/restored programmatically in accordance with the ABI.

Table 5.1: General Purpose Registers (Hardware Model)
circular register buffer (128) Global (8)
local A not available global
not available local B not available global
not available local C not available global
local D (cont) not available local D global
not available local E not available global

The above diagram shows how the hardware buffer of local rotating registers is used. Five procedures A, B, C, D, E call each other, pass call arguments through the register buffer, and place their local variables in it. When the hardware circular buffer is exhausted (in procedure D), registers are flushed onto the stack in memory and the buffer is reused from the beginning. Of course, not the entire buffer is flushed – only as many registers as are needed to create the new frame.

Table 5.2: Register buffer (dividing into parts, looped back)
clean dirty local invalid clean

In general, the register buffer contains the following five parts (order matters):

Table 5.3: Register ring buffer parts
Part Description
clean these registers belong to inactive frames and have already been flushed to the stack in memory, but have not yet been reused by other frames (they appear when registers are spilled to memory or restored from memory ahead of time)
dirty these registers belong to inactive frames and have not yet been flushed to the stack in memory (they must be flushed before being reused by other frames)
local the local registers of the active frame
invalid garbage left over from past procedure calls, or registers that have never been used (available to extend the current active frame, to create a new active frame, or to extend the clean zone when returning from procedures or when reading registers back from memory ahead of time)
Table 5.4: General purpose registers (visible to the program model)
Local (120) Global (8)
local A    not available    global
local B    not available    global
local C    not available    global
local D    not available    global
local E    not available    global

Each procedure «sees» only its local registers, and the first physical local register is visible under the logical number r0.

The diagram below shows an example of working with the register stack. First, the current function has a register frame of 17 registers (r0 - r16). The function uses the last 5 of them (r12 - r16) to place the parameters for calling the next function. On the call, the return address lands in the first parameter register (r12), together with the number of preserved registers and the output frame size (12, 5) – these two numbers are packed together with the return address in the link register. The register number for the return address – the boundary between the preserved registers and the output parameters – is specified in the call instruction.

After the call, the second function has at its disposal a register frame of 5 registers. The return address is visible in the register r0. Then the second function expands its register frame to the required number of registers for local computing (up to 10 registers).

After completion of work, the second function restores the saved part of the frame of the first function, and gives the parameter registers back to it. The number of registers to be returned is indicated in the instructions for returning from the function, and, according to ABI, it must match the number of incoming parameter registers.

physical numbering / caller registers / callee immediately after the call (input parameters) / callee after extending the frame / caller registers after return
0 - 3      hidden           hidden           hidden           hidden
4 - 15     r0 - r11         not available    not available    r0 - r11
16 - 20    r12 - r16        r0 - r4          r0 - r4          r12 - r16
21 - 25    not available    not available    r5 - r9          not available
26 - 30    not available    not available    not available    not available

The special register rsc (register stack control) stores the status of the circular register buffer and the current active frame of local general-purpose registers.

Register format rsc
bits 63..0: 0 | ndirty | soc | bof | sof

Four fields hidden from direct access store the positions and sizes of the local register portions of the rotation buffer. Their sizes depend on the implementation (the size of the register ring buffer), except for sof, whose size is always 7 bits. For example, for a buffer of 128 registers each position field is 7 bits wide; for 256 registers, 8 bits. The field sof (size of frame) is the size of the current active frame (possibly empty). The field bof (bottom of frame) is the buffer position where the active frame begins, bordering the dirty section. The field soc (size of clean) is the size of the clean section. The field ndirty is the number of dirty registers.

The special register rsp (register stack pointer) contains the memory address where the next local register will be saved when the hardware circular register buffer overflows. Since the address must be aligned on the 16-byte register size boundary for register spilling/filling, the lower 4 bits of rsp are always zero, and writes to them are ignored. A specific implementation may spill/fill registers in aligned groups of 2-16 at a time to optimize memory traffic, so additional low-order bits of the register may also be fixed at zero.
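The alignment rule means that stores to rsp behave as if the written value were masked. A trivial model of the base requirement (the stricter group alignment an implementation may impose is not modeled):

```python
def write_rsp(value):
    # Writes to rsp ignore the low 4 bits: spill slots are 16-byte aligned.
    return value & ~0xF

assert write_rsp(0x7FF5) == 0x7FF0
assert write_rsp(0x1000) == 0x1000   # already aligned values pass through
```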

Register format rsp
bits 63..0: address | 0 | 0

In the Berkeley RISC research project, where register rotation was apparently first applied, only eight of the 64 existing registers were visible to the program. The full set of 64 registers is called the register file, and the visible portion of eight the register window. The file allows up to eight nested procedure calls, each with its own set of registers. As long as the program's call chain is no deeper than eight, registers never have to be stored in RAM, which is painfully slow compared to register access. For many programs, a chain of six calls is enough.

A direct descendant of the Berkeley RISC project is Sun Microsystems' SPARC (UltraSPARC) architecture. Compared to the prototype, this processor provides simultaneous visibility of four sets of eight registers each (32 simultaneously visible registers). Of these, 8 are global and 24 are windowed. Three sets of eight registers each are implemented as a «register window». The eight registers i0 - i7 are the inputs of the current procedure, the eight registers l0 - l7 are local to the current level, and the eight registers o0 - o7 are the outputs for calling the next-level procedure. When a new procedure is called, the register window shifts by sixteen registers, hiding the old input and local registers and making the output registers of the current procedure the input registers of the new one. Additionally, the eight registers g0 - g7 are globally visible to procedures at all levels.

Unfortunately, the frame size and the number of output registers are fixed in SPARC. It is also unfortunate that flushing of registers pushed out of the window into memory is implemented through interrupts, and that the spill area is not separated from the regular stack of automatic objects.

In the AMD 29000 architecture (64 global and 128 window visible registers), the register rotation design was further refined with variable-sized windows, which helps resource utilization in the general case, when fewer than eight registers are needed to call the procedure. A second separate stack for saving registers was also implemented.

Register rotation was used in the architecture of Intel 80960 (i960) processors for embedded applications (32 visible registers, of which 16 global and 16 windowed, with a fixed rotation step of 16 registers).

The last (of implemented) known processor that uses register rotation is Intel Itanium (IA64 architecture). It has 128 registers, of which 32 are static and 96 are windowed. It is possible to set a frame of any size from 0 to 96 registers with any number of output registers. To spill registers into memory without processor interrupting, an asynchronous hardware mechanism is implemented. The spill occurs on a separate (second) stack, which grows towards the main stack and is not visible to the user program explicitly. Both stacks share the same memory array.

Register rotation is also applied in the architecture of the educational processor MMIX, which replaced the legacy MIX processor in the examples of new editions of Donald Knuth's «The Art of Computer Programming». The MMIX architecture has a register file of 256 program-visible registers, allows a variable window size, and even allows the boundary between global and rotating registers to be changed dynamically, which is usually not done in real architectures.

§ 5.2. Call/return instructions

Because the POSTRISC architecture uses hardware register rotation, execution of the call/return instructions is closely tied to the operation of the circular buffer of local registers. When a procedure is called, the current frame of local registers is partially preserved; on return from the procedure, the previous frame is restored.

POSTRISC may be extended in the future with big-SIMD facilities (256-bit or even 512-bit) using register pairs/groups for SIMD. Such SIMD register pairs/groups shouldn't cross a register frame boundary. The register frame base (bottom of frame) and the preserved frame size should be a multiple of the register pair/group size (2 or 4) to guarantee SIMD register pair/group alignment. The link info may be stored only in an even (or multiple-of-4) register to guarantee register pair alignment. Currently, only 2-register alignment is required for the frame size.

Procedure call instructions callr, callri, callmi, callmrw, callplt perform similar actions. They differ only in how the target call address is computed.

The first argument of every call instruction is the register in which the return address and other link info will be stored. All local registers below it, starting from r0, are hidden by the register window rotation. All currently allocated local registers from the specified register upward become the initial frame of the new procedure.

Then, the branch effective address is calculated (differently for different instructions). The return address along with the current frame info is stored in the return register.

Then the register window is rotated, the frame of local registers is partially saved, and the branch to the target address is performed. The new procedure always sees its return address and previous frame info in the first rotated register r0, and the input parameters in the following registers r1, r2 ...

Link register format r0
bits 63..0:   return address | 0 | ri
bits 127..64: preserved caller future eip offset | out-size | framesize

If the call instruction is the last in its bundle, it saves the return address as a pointer to the next bundle, and the stored slot number ri is set to zero. If the call instruction is not the last in its bundle, the current bundle address and the next slot number ri are saved.

In the general case, returning into the middle of a bundle may be less optimal but saves code size: the processor still fetches and executes the whole bundle, but discards the first ri instructions. For better performance, the bundle before a call instruction may be padded with dummy nop instructions to shift the call to the end of the bundle. A compiler command-line parameter chooses between «dense» and «aligned» calls.

For example, dense calls:

ldi %r33, 1234    ; r33 is future r1 (param for myfunc)
callr %r32, myfunc    ; r32 is future r0 (link info)
callr %r32, myfunc2
callr %r32, myfunc3

For example, aligned call:

ldi %r33, 1234    ; r33 is future r1 (param for myfunc)
nop   0
callr %r32, myfunc    ; r32 is future r0 (link info)
; this is next bundle and aligned return address
add %r34, %r12, %r12    ; next instruction after return from myfunc
sub %r14, %r22, %r11
Table 5.6: Instructions for calling procedures, managing the register frame and the regular stack
Instruction Description
callr   dst,label
ip-relative call
callri  dst,base,index
call register indirect
callmi  dst,base,disp11
memory-indirect call, base addressing
callmrw dst,base,disp11
memory-indirect call, word, base relative addressing
callplt dst,uimm28
call procedure linkage table: indirect, relative addressing
alloc   framesize
allocate register stack frame
allocsp framesize,uimm21
allocate register stack frame, update SP
ret
return from the subroutine
retf    uimm21
return from the subroutine, update SP

The instruction callr (call relative) makes a procedure call using ip-relative addressing with a 28-bit signed immediate offset. This gives a maximum distance of ±2 GiB from the current position for a one-word instruction. A long form of the instruction is also implemented.

Instruction format callr
bits 41..0:  opcode | dst | simm (28 bits)
bits 83..42: 0 | simm (60 bits)

EA = ip + 16 × simm

call (EA)

The instruction callri (call register indirect) takes the procedure call address from registers. The branch address is calculated as base plus index. The callri instruction discards the 4 least significant bits of the address, so the call target is always aligned to the beginning of a bundle.

Instruction format callri
bits 41..0: opcode | dst | base | index | 0 | opx

EA = (gr[base] + gr[index]) & mask {63:4},

call (EA)

The instruction callmi (call memory indirect) takes the callee address from memory using base+displacement addressing. The instruction discards the 4 least significant bits of the loaded value, so the call target is always aligned to the beginning of a bundle. The instruction is intended for loading from an address table with additional checks that the virtual page is in the finalized state. Vtables should be relocated by the linker and marked finalized to disable future access-rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function-pointer tables) with up to 1024 entries.

Instruction format callmi, callmrw
bits 41..0: opcode | dst | base | simm10 | opx

EA = gr[base] + sign_extend(simm10)

EA = mem8 (EA)

EA = EA & mask {63:4},

call (EA)
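The callmi sequence above can be modeled directly. This is an illustrative sketch: the little-endian 8-byte table entries and the concrete addresses are assumptions for the example, and the finalized-page check is not modeled.

```python
import struct

def callmi_target(base, simm10, mem):
    # EA = gr[base] + sign_extend(simm10); EA = mem8(EA); EA &= mask{63:4}
    ea = base + simm10
    ptr = struct.unpack('<Q', bytes(mem[ea + i] for i in range(8)))[0]
    return ptr & ~0xF         # clear low 4 bits: target is bundle-aligned

# Hypothetical vtable with a single 8-byte slot holding a code pointer.
vtable = 0x5000
mem = {vtable + i: b for i, b in enumerate(struct.pack('<Q', 0x7_0000))}
assert callmi_target(vtable, 0, mem) == 0x7_0000
```

callmrw differs only in loading a 4-byte offset and adding it to the base register instead of loading an absolute 8-byte pointer.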

The instruction callmrw (call memory indirect relative word) takes the callee's relative offset from memory using base+displacement addressing. This offset is used to compute the callee address relative to the base address. The instruction discards the 4 least significant bits of the result, so the call target is always aligned to the beginning of a bundle. The instruction is intended for loading from an address table with additional checks that the virtual page is in the finalized state. Vtables should be relocated by the linker and marked finalized to disable future access-rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function-pointer tables) with up to 1024 entries.

EA = gr[base] + sign_extend(simm10)

offset = mem4(EA)

EA = (gr[base] + offset) & mask {63:4},

call (EA)
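
Assuming the loaded word is treated as a signed 32-bit offset (as the pseudocode above suggests), the final address computation can be sketched in C:

```c
#include <stdint.h>

/* Sketch of the callmrw target: a 32-bit relative offset (already loaded
 * from the table by mem4(EA)) is added to the base register value, and
 * the low 4 bits are dropped for bundle alignment.
 * Assumed semantics, illustrative only. */
static uint64_t callmrw_target(uint64_t base, int32_t offset) {
    return (base + (uint64_t)(int64_t)offset) & ~UINT64_C(0xF);
}
```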

The callplt instruction (call via procedure linkage table) takes the call address from memory using ip-relative addressing. The instruction discards the 4 least significant bits of the loaded value, so the obtained address is always aligned to the beginning of a bundle. The instruction is intended for loading from an address table with additional checks for the finalized state of the virtual page. Import tables should be relocated by the linker and marked finalized to disable future access-rights changes (hardware-assisted one-way RELRO).

Instruction format callplt
bits 41…0: opcode | dst | uimm28

EA = ip + zero_extend(uimm28)

EA = mem8 (EA)

EA = EA & mask {63:4},

call (EA)

The ret and retf instructions (return from subroutine) are used to return control from a procedure. They also restore the caller procedure's register window state; retf additionally rolls back a fixed-size stack frame.

Unlike other branch instructions, these instructions may use special hardware structures to predict the branch destination address. While the branch target buffer is generally used for branch address prediction, for ret instructions a hardware branch target stack (a short stack of saved return addresses) can additionally be implemented for better prediction accuracy.

While restoring the previous frame state, the ret instructions may load part or all of the previous frame from memory if necessary (when the circular hardware register buffer has overflowed). The instruction may return control before the recovery from memory is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet recovered from memory will be delayed until the recovery is performed (via the register scoreboard mechanism).

Instruction format ret
bits 41…0: opcode | 0 | opx
Instruction format retf
bits 41…0: opcode | opx | 0 | uimm21 (63)

The link register is an implicit argument of both ret instructions. It is the first local register of the current function and provides the return address and the previous frame info. The argument of retf is the displacement used for the optional stack rollback (it may be 0). The instruction may cause an error if the link register contains broken frame info and there is no room in the local registers for the outgoing and preserved frame parts of the previous procedure, since the maximum frame size is 120 registers.

§ 5.3. Register frame allocation

After a call, each callee procedure obtains the remaining frame part of the calling procedure starting from the link register (the parameters and possibly slightly more). If the callee wants to increase the size of its register frame, it should use the alloc (allocate register stack frame) instruction. The parameter of the instruction is the local register that will be the last in the frame of our procedure (from r0 to r119).

Instruction format alloc
bits 41…0: opcode | opx | framesize | 0

If there is not enough free space in the rotating-register hardware buffer to accommodate a new frame, the alloc instruction flushes registers from the previous functions' frames onto the register stack in memory. The instruction can return control before the flush is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet flushed to the stack will be delayed until the flush finishes (through the register scoreboard mechanism).

The new eip is set up from reip. The reip register should point to a simple universal function epilog consisting of just a ret instruction. This epilog should live in the highest corresponding usermode/kernel region. The reip register should be set up during thread start.

The following minimal program for the virtual processor demonstrates the use of the callr, alloc and ret instructions.

.text
; at the beginning of the program, the register stack is empty
alloc  54   ; expand frame to 54 registers
ehadj  endfunc
ldi    %r47, 1  ; will be saved when called
ldi    %r53, 3  ; first argument
ldi    %r52, 2  ; second argument
ldi    %r51, 1  ; third argument
; func procedure call, all registers up to 50 will be saved,
; return address, eip, frame size (50) are saved in r50
callr  %r50, func
; at this point, after returning, the frame will be again 54
halt
func:
; at the starting point, the func procedure has a 4-register frame
; their previous numbers are 50, 51, 52, 53, new - 0, 1, 2, 3
; extend the frame to 10 registers (plus regs 4,5,6,7,8,9)
alloc  10
write  "r0 = %x128(r0)"    ; print packed return info
write  "r1 = %i64(r1)"    ; print 1st argument
write  "r2 = %i64(r2)"    ; print 2nd argument
write  "r3 = %i64(r3)"    ; print 3rd argument
ret
endfunc:
.end

Result of execution:

r0 = 000000010000c232_fffffffff1230020
r1 = 1
r2 = 2
r3 = 3

Here 0xfffffffff1230020 is the return bundle address, and 0x0000c232 packs the previous frame size (50 registers), the outgoing frame size (3 parameters and the link), and the offset between the return address and the previous eip exception return address (the endfunc label). 0x00000001 is the previous future mask; it is nonzero because callr is the middle of the 3 instructions in its bundle, so we return to the middle of the bundle and skip one instruction.

The allocsp instruction is introduced for code compression. Its function is similar to alloc, but additionally it pushes a frame onto the usual stack: allocsp adjusts sp down by the immediate size.

alloc    framesize
allocsp  framesize, uimm21
Instruction format allocsp
bits 41…0: opcode | opx | framesize | uimm21 (63)

§ 5.4. The function prolog/epilog

The type of prolog/epilog depends on:

In the examples below r1…r5 are arguments, and r6…r10 are optional local registers. The stack frame grows down, toward lower addresses.

The simplest function doesn't allocate local registers (it uses arguments only) and doesn't allocate a stack frame, and no instruction inside it generates software/hardware exceptions (it never touches memory, divides, etc.). Then it's enough to write just:

insn    # can't fail, can use only args r1..r5
...
insn    # can't fail, can use only args r1..r5
ret

The next function can generate software/hardware exceptions but doesn't allocate local registers (it uses arguments only) and doesn't allocate a stack frame. Here, in case of an exception, control can be transferred to eip, so we need a proper eip before execution.

The special register reip is introduced to avoid bloating the code with multiple copies of the standard universal epilog, which consists of only a ret instruction. It stores the address of such an epilog. The proper initialization of reip to an available standard universal epilog happens at runtime at thread start.

Each call instruction sets up eip from a copy of reip, so we needn't worry about a proper eip just after a call. So even if instructions may fail, we don't need additional setup at function start.

insn    # can fail, can use only args r1..r5
...
insn    # can fail, can use only args r1..r5
ret

The next function doesn't allocate a stack frame but allocates local registers. The alloc instruction here does the local register allocation. The register allocation may trigger register spilling to memory, so it may fail and trigger a hardware exception. But again, because eip stores a copy of reip, we needn't worry about eip.

alloc   11
insn    # can fail, can use r1..r5 and r6..r10
...
ret

The next function allocates local registers and a fixed-size stack frame. In this case we need to set a new eip, before execution, to the label before return, for proper traditional stack unwinding. The stack frame should be no bigger than the page size, so we don't touch the page beyond the stack guard page.

std     %gz, %sp, -frame_size_immediate # touch new stack frame
allocsp 11, frame_size_immediate
ehadj   before_return   # immediately after allocsp
...
insn    # can fail, can use r1..r10
ldwz    %r7, %sp, +offset # using sp for local frame addressing
...
before_return:
addi    %sp, %sp, frame_size_immediate
ret

The next function allocates local registers and a fixed-size stack frame. The stack frame is bigger than the page size, so proper guard-page extension via store probing is required.

# guard page probing for frame size bigger than pagesize
std     %gz, %sp, -page_size * 1
std     %gz, %sp, -page_size * 2
...
std     %gz, %sp, -page_size * n
# allocation only after probing
allocsp 11, frame_size_immediate
ehadj   before_return   # immediately after allocsp
...
insn    # can fail, can use r1..r10
ldwz    %r7, %sp, +offset # using sp for local frame addressing
...
before_return:
addi    %sp, %sp, frame_size_immediate
ret
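
The probing sequence above needs one touching store per page spanned by the new frame; a small illustrative helper (not part of the architecture) makes the count explicit:

```c
#include <stdint.h>

/* One guard-page probing store per page spanned by the new stack frame.
 * Frames no bigger than one page need only the single touch shown in
 * the previous example. Illustrative helper. */
static unsigned probe_count(uint64_t frame_size, uint64_t page_size) {
    return (unsigned)((frame_size + page_size - 1) / page_size);
}
```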

The before_return block:

...
before_return:
addi    %sp, %sp, frame_size_immediate
ret

may be changed to one retf instruction:

...
before_return:
retf    frame_size_immediate

and, if there is space in the previous bundle, retf may be copied into it, so the before_return block may potentially be amortized once over several functions with the same frame size:

...
retf    frame_size_immediate
before_return:
retf    frame_size_immediate

The next function allocates local registers and a variable-size stack frame (it uses variable-length arrays or the alloca function), possibly with an initial size of more than the page size. In this case we have 2 rollback points: for a failure in local register allocation, and for a failure in the initial stack allocation. sp can't be used to access the local stack frame (because of the variable frame size), so some local temp register (r6 in the example) is used to save/restore the old sp value, addressed with negative offsets.

# optional guard page probing for frame size bigger than pagesize
std    %gz, %sp, -page_size * 1
std    %gz, %sp, -page_size * 2
...
std     %gz, %sp, -page_size * n
# allocation only after probing, r6 is allocated on the fly
allocsp 11, initial_frame_size_immediate
addi    %r6, %sp, initial_frame_size_immediate
ehadj   before_return   # immediately after saving fp in r6
...
insn    # can fail, can use r1..r10
ldwz    %r7, %r6, -offset # using r6 for local frame addressing

# alloca or VLA
# optional guard page probing for big frame size
std    %gz, %sp, -page_size * 1    
std    %gz, %sp, -page_size * 2
...
std    %gz, %sp, -page_size * m
# allocation only after probing
sub    %sp, %sp, additional_frame_size
# end of alloca or VLA

stw    %r7, %r6, -offset # using r6 for local frame addressing
...
before_return:
mov    %sp, %r6
ret

§ 5.5. The register stack system management

One alloc instruction, along with the instructions for calling procedures and returning control, is in principle enough for user programs to handle the register stack. But system programs that handle interrupts, return from interrupts, switch contexts, or initialize the register stack need some more instructions.

The parameterless instruction rscover (register stack cover frame) is used to put the last (active) frame of the register stack into the dirty state (registers belonging to inactive procedure frames). After executing this instruction, the size of the active frame of local registers is zero. This instruction prepares the register stack for a subsequent disconnection or switch.

The parameterless instruction rsflush (register stack flush) is used to flush all inactive frames of the register stack to memory (transferring them from the dirty state to the clean state). After executing this instruction, the register stack can be disabled without fear of data loss.

The parameterless instruction rsload (register stack load) is used to load from memory the last inactive frame of the register stack and be ready to activate it. After executing this instruction, the register stack is ready to work (a group of clean registers appears in it).

Instruction format rscover, rsflush, rsload
bits 41…0: opcode | 0 | opx

§ 5.6. Calling convention

The ABI defines the standard interaction between functions, namely the stack frame layout, register usage, and parameter passing.

The standard function call convention applies only to global functions. Local functions (not accessible from other object files) may use a different convention, as long as it doesn't prevent correct recovery after an exception.

The convention on register usage in standard function calls divides all the global registers available to a program into two categories: saved (preserved) and non-saved (scratch) registers.

Preserved registers are guaranteed to be saved across a procedure call. The called procedure (callee) guarantees that the contents of such a register survive a normal return from it: it either doesn't touch the register, or saves its contents somewhere and restores them before returning.

Non-saved (scratch) registers may not be preserved across a procedure call. The calling procedure (caller) must store the contents of such a register in memory (on the stack) or in another, preserved register if it doesn't want to lose them when calling the callee. The callee uses such registers for its needs without restriction.

The architecture provides 128 general purpose registers of 128 bits, and several 64/128 bit special purpose registers. General purpose registers are divided into global (static) and rotatable. The following table shows how registers are used.

Table 5.7: Saving registers when calling procedures
registers volatility
sp
stack pointer, preserved. The address of the top of the stack must be aligned on a 16-byte boundary. It should always point to the last placed stack frame, growing down toward lower addresses. The contents of the word at this address always point to the previously placed stack frame. If required, it may be lowered by the called function. The stack-top pointer must be updated atomically with a single instruction, to avoid any window in which an interrupt could observe a partially updated stack.
tp
thread pointer, preserved. This register stores the base address of the TDATA segment of the main program module.
r0
link register, saved automatically by the register rotation mechanism.
r1-r32
used to pass parameters to the called function (not saved). Registers r1 and r2 store the return value.

Static registers g0-g7 must retain their values across function calls. Functions that use these registers must save their values before changing them, and restore them before returning.

External signals can interrupt the flow of instructions at any time. Functions called during signal processing have no special restrictions on their use of registers. In addition, when a signal-handling function returns control, the process resumes with correctly restored registers. Therefore, programs and compilers are free to use all registers listed above, except those reserved for system use, without fear of signal handlers inadvertently changing their values.

The operating system provides each thread with its own stack, in which data is placed from both ends. The stack of rotated registers grows from the bottom toward higher addresses; it is managed by hardware and is not visible to the ABI. The usual stack of program local objects grows from the top toward lower addresses. Each frame corresponds to an activation record of a procedure in a call chain. The stack pointer sp always points to the first byte after the top of the stack. A stack frame should be aligned on a 16-byte boundary, and its size should be a multiple of 16 bytes.
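
The 16-byte size and alignment rule can be expressed as the usual round-up computation (an illustrative sketch, not ABI-mandated code):

```c
#include <stdint.h>

/* Round a requested stack-frame size up to the next multiple of 16,
 * since POSTRISC stack frames must be 16-byte aligned and a multiple
 * of 16 bytes in size. Illustrative only. */
static uint64_t align_frame_size(uint64_t size) {
    return (size + 15) & ~UINT64_C(15);
}
```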

The last function in a call chain, which itself doesn't call anyone, may not have its own frame. Such functions are called leaf or terminal (in the graph of dependencies between functions). All other functions must have their own stack frame in the dynamic stack. The following figure shows the organization of a stack frame. sp in the figure means the stack-top pointer of the called function after it has executed the code that sets up the stack frame.

Stack frame organization

highest address

        +--> Frame header (return address, gp, rsc)
        |    Register save area (aligned on a 16-byte boundary)
        |    Local variable space (aligned on a 16-byte boundary)
sp ---> +--- Header of the next frame (sp + 0)

lowest address

The following requirements apply to the stack frame:

The stack frame header consists of a pointer to the previous frame (link info) and storage areas for rsc, lp and gp, 32 bytes in total. The link info always contains a pointer to the previous frame in the stack. Before function B calls another function C, it must save the contents of the link register received from function A in the lp storage area of function A's stack frame, and must set up its own stack frame.

Except for the stack frame header and padding for 16-byte alignment, a function should not allocate space for areas that it doesn't use. If the function doesn't call other functions and doesn't require anything else from the stack frame, it should not set up a stack frame at all. The parameter save area follows the stack frame; the register save area should not contain any padding.

For RISC-type machines (with many registers) it is generally more efficient to pass arguments to called functions in registers (floating-point and general-purpose) rather than constructing an argument list in memory or pushing them onto the stack. Since all calculations must somehow be performed in registers, extra memory traffic can be eliminated if the caller computes the arguments in registers and passes them in those same registers to the called function (callee), which can immediately use them for its calculations. The number of arguments that can be passed this way is limited by the number of registers available in the processor architecture.

In POSTRISC, up to 16 parameters are passed in general registers and are visible in the callee's new frame as registers r1…r16. The caller may pass parameters starting from any register; the exact register numbers on the caller side depend on the caller's local frame size.

A parameter storage area, located at a fixed distance of 32 bytes from the stack-top pointer, is reserved in each stack frame for the argument list. A minimum of 8 doublewords is always reserved. The size of this area should be sufficient to preserve the longest argument list passed to the function, if it owns a stack frame. Although not all arguments of a particular call land in storage, consider the list as formed in this area, with each argument occupying one or more doublewords.

If more arguments are passed than can be stored in registers, the remaining arguments are stored in the parameter storage area. Values passed through the stack are bitwise identical to those that would be placed in registers.

For variable argument lists, the ABI uses the va_list type, which is a pointer to the memory location of the next parameter. Using this simple va_list type means that variable arguments must always stay in the same location regardless of type, so that they can be found at runtime. This ABI defines that location as the general registers r8-r18 for the first eight doublewords and the parameter storage area on the stack for the rest. Alignment requirements, such as for floating-point types, may require the va_list pointer to be aligned before accessing a value.
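
Because va_list is described as a bare pointer stepping over doubleword slots, fetching the next integer argument can be sketched as below. The type and helper names are hypothetical, not the actual ABI header:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the simple va_list model: a bare pointer that walks 8-byte
 * (doubleword) slots, whether they are spilled registers or the on-stack
 * parameter storage area. Hypothetical names, illustrative only. */
typedef struct { uint8_t *next; } my_va_list;

static int64_t my_va_arg_i64(my_va_list *ap) {
    int64_t value;
    memcpy(&value, ap->next, sizeof value); /* read the current slot */
    ap->next += 8; /* each argument occupies one or more doublewords */
    return value;
}
```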

Function return values. Functions must return values of type int, long, long long, enum, short, char, or a pointer to any type, in register r1, extended to 64 bits (zero- or sign-extended).

Arrays of characters up to 8 bytes long, or bit strings up to 64 bits long, are returned in the g8 register, right-justified. Structures or unions of any length, and character strings longer than 8 bytes, are returned in a storage buffer allocated by the caller. The caller passes the address of this buffer as a hidden extra argument.

Functions must return a single floating-point result of type float, double, or long double (quadruple) in register r1, rounded to the desired precision. Functions must return complex numbers in registers r1 (real part) and r2 (imaginary part), rounded to the desired precision.

Chapter 6. Predication

§ 6.1. Conditional execution of instructions

The architecture defines a model in which control flow passes to the next sequential instruction in memory, unless otherwise directed by a jump instruction or interrupt. The architecture requires that, to the program, the processor appears to execute instructions in the order in which they are located in memory, although in reality the order may be changed inside the processor. The instruction execution model described in this chapter provides a logical representation of the steps involved in executing an instruction. The branch and interrupt sections show how control flow can be changed during program execution.

If the branch direction is incorrectly predicted, the branch instruction causes the pipeline to stall. All speculatively issued instructions are flushed, from the fetch stage to the result-writeback stage – almost the entire length of the pipeline.

Predication is a conditional execution of instructions. The purpose of conditional execution is to remove badly predicted branches from the program. In this case, any instruction becomes a hardware-executed conditional branch operator. For example:

if (a) b = c + d;

add (a) b = c, d

The optional argument «a» (predicate) sets the logical condition – whether to execute the instruction or not. This technique replaces a control dependency with a data dependency and shifts a possible pipeline stall closer to the pipeline end. All instructions issued with a false predicate value are rejected at the completion stage (retire) or earlier (as early as the decode stage) without interrupts.

The instruction predication may be explicit or implicit. With explicit predication, each instruction contains an additional argument – a one-bit predicate register, and, accordingly, the architecture contains a file of several predicate registers (16 predicates in the ARM-32 architecture, 64 in Intel-Itanium).

With implicit predication, the architecture contains a special mask register storing information about the conditionality of execution of future instructions. Before an instruction executes, the first bit of this register is taken as its predicate. Then the register is shifted by one bit, and the current bit is lost. The subsequent instruction takes the second bit as its predicate. The register is constantly refilled from the other end with «clean» bits corresponding to unconditionally executed instructions.

Some instructions may write data to this register, thereby canceling the unconditional execution of some future instructions according to the bit mask. These are the so-called nullification instructions. For example, using mask 0b10011, which contains 3 one-bits, the 3 instructions (the 1st, 2nd and 5th) after the nullification instruction will be canceled.
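
The shift-and-consume behaviour of the future mask can be modelled in a few lines of C (a simulation sketch, not the emulator's code):

```c
#include <stdint.h>

/* Model of the implicit-predication future mask: bit 0 decides whether
 * the current instruction is nullified, then the mask shifts right by
 * one, with clean zero bits entering from the top. With mask 0b10011,
 * the 1st, 2nd and 5th following instructions are cancelled.
 * Simulation sketch only. */
static int step_nullifies(uint64_t *future_mask) {
    int cancelled = (int)(*future_mask & 1);
    *future_mask >>= 1;
    return cancelled;
}
```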

The advantage of conditional execution is the elimination of most branches in short conditional calculations, and hence of pipeline stalls. However, this is a purely brute-force method, which boils down to simultaneously issuing instructions from several execution branches under different predicates into the pipeline. In addition, with explicit predication, room in the instruction encoding is required for the optional argument – the predicate register.

Predication is better suited to short conditional calculations. It makes no sense to apply predication to loops, or to conditional statements whose length outweighs the gain from keeping the pipeline running without branches. However, it is the only means of removing stalls for poorly predictable branches (for example, a conditional branch depending on unpredictable data).

An implicit predication scheme was chosen for the POSTRISC architecture. This is due to the fact that according to statistics collected for other architectures where there is a predication, approximately 90% of instructions are executed without using predication, so spending several bits for the predicate in each instruction is not profitable. On the other hand, the remaining 10% of the instructions depend on unpredictable data and, without predication, introduce a significant delay in the pipeline operation. Therefore, architecture without predication will also be suboptimal.

The special field psr.future is used to control the nullification of the subsequent instructions. The least significant bit of the register corresponds to the current instruction, other bits correspond to the subsequent instructions. At the end of the instruction, a right shift occurs. In the case of the branch, the future mask is completely cleared, thereby canceling all possible established nullifications.

Format of nullification instructions
bits 41…0: opcode | nullification condition | dist-no | dist-yes | opx

Based on the nullification condition (which, depending on the instruction, consists of 2 registers, a register and a shift amount, or a register and an immediate value), either the next «dist-yes» instructions are nullified, or the «dist-no» instructions following that first block of «dist-yes» instructions are nullified in psr.future.
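
The dist-yes/dist-no semantics can be modelled as follows. The behaviour is inferred from the examples in § 6.3 (a true condition nullifies the if-block, a false one nullifies the else-block following it); treat it as a sketch rather than a definitive specification:

```c
#include <stdint.h>

/* Build the future-mask bits contributed by a nullification instruction:
 * if the condition holds, the next dist_yes instructions (the if-block)
 * are cancelled; otherwise the dist_no instructions following them (the
 * else-block) are cancelled. Bit 0 corresponds to the next instruction.
 * Assumed semantics, inferred from the chapter's examples. */
static uint64_t nullify_mask(int condition, unsigned dist_yes, unsigned dist_no) {
    if (condition)
        return (UINT64_C(1) << dist_yes) - 1;
    return ((UINT64_C(1) << dist_no) - 1) << dist_yes;
}
```

For the nuleq example in § 6.3 with distances 5 and 4, a true condition yields mask 0b11111 (the 5-instruction if-block is skipped), and a false one yields 0b111100000 (the 4-instruction else-block is skipped).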

The following are examples of removing branches from short conditional statements with the corresponding use of nullification. In all cases, the control dependency is converted into a data dependency and into masking of future instructions. It is assumed everywhere that evaluating the conditions may have side effects in the form of possible exceptions, and therefore must occur strictly under predication. If there can be no exception side effects, then the evaluation of a complex condition can of course be done without predication, avoiding unnecessary manipulation.

Table 6.1: Schematic examples of using predication
Conditional statement Predication
if (c1) {x1; }
else {x2; }
c1 c1yes, c1no
x1 (c1no)
x2 (c1yes)
if (c1) {
 x1;
 if (c2) x2;
 else x3;
 x4;
} else {
 x5;
 if (c3) x6;
 else x7;
 x8;
}
c1 c1yes, c1no
x1 (c1no)
c2 c2yes, c2no (c1no)
x2 (c1no, c2no)
x3 (c1no, c2yes)
x4 (c1no)
x5 (c1yes)
c3 c3yes, c3no (c1yes)
x6 (c1yes, c3no)
x7 (c1yes, c3yes)
x8 (c1yes)
if (c1) {x1;
} else if (c2) {x2;
} else if (c3) {x3;
} else {x4;
}
c1 c1yes, c1no
c2 c2yes, c2no (c1yes)
c3 c3yes, c3no (c1yes, c2yes)
x1 (c1yes)
x2 (c2yes)
x3 (c3no)
x4 (c3yes)
if (c1 && c2) {x1; }
else {x2; }
c1 c1yes, c1no
c2 c2yes, c2no (c1no)
x1 (c1no, c2no)
x2 (c2yes)
if (c1 || c2) {x1;
} else {x2;
}
c1 c1yes, c1no
c2 c2yes, c2no (c1yes)
x1 (c2no)
x2 (c1yes, c2yes)
if (c1 || (c2 && c3)) {
 x1;
} else {
 x2;
}
c1 (p0) p2, p3
c2 (p3) p4, p5 (unc)
c3 (p4) p2, p3
x1 (p2)
x2 (p3)
if (c1 && (c2 || c3)) {
 x1;
} else {
 x2;
}
c1 (p0) p2, p3
c2 (p2) p4, p5 (unc)
c3 (p5) p4, p5 (unc)
x1 (p4)
x2 (p3)

§ 6.2. Nullification Instructions

Nullification instructions mark in the special field psr.future the fact that execution of certain subsequent instructions is canceled. A nullification instruction creates a mask of 1s for the nullified instructions of the if- or else-block and ORs it into the current future mask. Nullification instructions assume that the «if»-block precedes the «else»-block.

The following instructions cancel future instructions depending on the result of comparing two registers.

Table 6.2: reg-reg nullification instructions
Instruction Operation
nuldeq
nullify if doubleword equal
nuldne
nullify if doubleword not equal
nuldlt
nullify if doubleword less
nuldle
nullify if doubleword less or equal
nuldltu
nullify if doubleword less unsigned
nuldleu
nullify if doubleword less or equal unsigned
nulweq
nullify if word equal
nulwne
nullify if word not equal
nulwlt
nullify if word less
nulwle
nullify if word less or equal
nulwltu
nullify if word less unsigned
nulwleu
nullify if word less or equal unsigned
Nullification instruction format compare-regs
bits 41…0: opcode | ra | rb | opx | dist-no | dist-yes | opx

The following instructions cancel future instructions depending on the result of comparing a register with a 14(40)-bit immediate value, with or without sign. The conditions are the same as for the compare-with-immediate-and-branch instructions.

Table 6.3: reg-imm nullification instructions
Instruction Operation
nuldeqi
nullify if doubleword equal
nuldnei
nullify if doubleword not equal
nuldlti
nullify if doubleword less
nuldlei
nullify if doubleword less or equal
nuldltui
nullify if doubleword less unsigned
nuldleui
nullify if doubleword less or equal unsigned
nulweqi
nullify if word equal
nulwnei
nullify if word not equal
nulwlti
nullify if word less
nulwlei
nullify if word less or equal
nulwltui
nullify if word less unsigned
nulwleui
nullify if word less or equal unsigned
nulmall
nullify if all mask bits set
nulmany
nullify if any mask bit set
nulnone
nullify if no mask bit set
nulnotall
nullify if not all mask bits set
Format of nullification instructions compare-with-immediate
bits 41…0: opcode | ra | imm11 | dist-no | dist-yes | opx
bits 83…42: imm40 | 0

Instructions nulbs (nullify if bit set) and nulbsi (nullify if bit set immediate) cancel future instructions depending on whether or not a bit is set in the register.

Analogous instructions nulbc (nullify if bit clear) and nulbci (nullify if bit clear immediate) cancel future instructions depending on whether or not a bit is clear in the register.

Nullification instruction format nulbs, nulbc
bits 41…0: opcode | ra | rb | opx | dist-no | dist-yes | opx
Format of nullification instructions nulbsi, nulbci
bits 41…0: opcode | ra | shift | opx | dist-no | dist-yes | opx

Floating-point scalar values may also be checked for nullification. Two registers may be compared, or a single register value may be classified (normalized, signed, denormal, NaN, INF, etc.).

Format of nullification instructions fp compare
bits 41…0: opcode | ra | rb | opx | dist-no | dist-yes | opx
Format of nullification instructions fp classify
bits 41…0: opcode | ra | classify | opx | dist-no | dist-yes | opx

§ 6.3. Nullification in assembler

The assembler eliminates the need to set predication distances manually. You can use named markers whose distances are computed automatically. General syntax:

INSTRUCTION NAME regular_parameters (pred1, pred2, pred3, ...)

The predicate list indicates for which previously defined nullification instruction this instruction is the last of the if- or else-block. These predicates must be mentioned in a previous nullification instruction no more than 31 instructions above the current one for a «yes» predicate and 63 instructions for a «no» predicate.

write    "test nullification (explicit distances)"
ldi      %r10, 0
nuleq    %r10, %r10, 5, 4
write    "0" ; if-block of 5: nullified
write    "1" ; if-block of 5: nullified
write    "2" ; if-block of 5: nullified
write    "3" ; if-block of 5: nullified
write    "4" ; if-block of 5: nullified
write    "5" ; else-block of 4: executed
write    "6" ; else-block of 4: executed
write    "7" ; else-block of 4: executed
write    "8" ; else-block of 4: executed
write    "test nullification (predicate names)"
ldi      %r10, 0
nuleq    %r10, %r10, equal, nonequal
write    "0"
write    "1"
write    "2"
write    "3"
write    "4" (equal)
write    "5"
write    "6"
write    "7"
write    "8" (nonequal)

Both variants print «5 6 7 8» (the 4-instruction else-block) and skip printing «0 1 2 3 4» (the 5-instruction if-block) due to predication.

The else-block may be empty if both the «yes» and «no» distances refer to the same instruction (dist_yes == dist_no). To create a zero-length else-block, the last instruction of the if-block should also be marked as the last of the else-block. To create a zero-length if-block, the nullification instruction itself should be marked as the last in the if-block.

In the next sample, all 3 subsequent instructions are nullified because the nullification condition is true, and there is no else-block. Both block-end markers, «(equal)» and «(nonequal)», are on the same instruction.

nuleq    r10, r10, equal, nonequal
write    "0"                     ; part of equal-block
write    "1"                     ; part of equal-block
write    "2" (equal, nonequal)   ; part of equal-block

In the next example, all 3 subsequent else-block instructions are executed because the nullification condition is false, and there is no if-block for the true condition. The «(equal)» marker of the if-block is placed on the nullification instruction itself, so the size of the if-block (the distance from the end of the block to the nullification instruction) is zero.

nuleq    r10, r12, equal, nonequal (equal)
write    "0"                ; part of nonequal-block
write    "1"                ; part of nonequal-block
write    "2" (nonequal)     ; part of nonequal-block
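The mapping of nullification onto ordinary if/else control flow can be sketched in plain C (nuleq_demo is a hypothetical helper, not part of the ISA): when the compared values are equal, the if-block («0»..«4») is nullified and only the else-block («5»..«8») survives, and vice versa.

```c
#include <string.h>

/* Models the nuleq examples above: writes the labels of the surviving
 * block into out.  nuleq nullifies the if-block when a == b. */
void nuleq_demo(long a, long b, char *out) {
    if (a == b) {
        strcpy(out, "5678");    /* if-block nullified, else-block executes */
    } else {
        strcpy(out, "01234");   /* else-block nullified, if-block executes */
    }
}
```

With a == b (as in the examples, where %r10 is compared with itself), the helper produces "5678", matching the printed output «5 6 7 8».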

Chapter 7. Physical memory

From the point of view of most applications, memory is defined as a linear array of bytes, indexed from 0 to 2^64−1. Each byte is identified by its index, or address, and each byte contains a value. This information is sufficient for programming applications that do not require special features of any system environment. Other objects are constructed as sequences of bytes.

The architecture supports composite types of sizes 1, 2, 4, 8, and 16 bytes. The following terminology is used in this guide for composite data types; the word size is assumed to be 4 bytes.

A byte is 8 contiguous bits starting at an arbitrary byte boundary. Bits are numbered from right to left, from 0 to 7.

A halfword is two contiguous bytes starting on an arbitrary (but multiple-of-two) byte boundary. Bits are numbered from right to left, from 0 to 15.

A word is four contiguous bytes starting on an arbitrary (but multiple-of-four) byte boundary. Bits are numbered from right to left, from 0 to 31.

A doubleword is eight contiguous bytes starting on an arbitrary (but multiple-of-eight) byte boundary. Bits are numbered from right to left, from 0 to 63.

A quadword is sixteen contiguous bytes starting on an arbitrary (but multiple-of-16) byte boundary. Bits are numbered from right to left, from 0 to 127.

An octaword (optional) is 32 contiguous bytes starting on an arbitrary (but multiple-of-32) byte boundary. Bits are numbered from right to left, from 0 to 255.
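The composite sizes above map directly onto C fixed-width types; a quick sketch, assuming a GCC/Clang toolchain (the quadword uses the __uint128_t compiler extension, and the octaword has no standard C scalar type):

```c
#include <stdint.h>

/* Checks that the C fixed-width types match the composite sizes above. */
int composite_sizes_ok(void) {
    return sizeof(uint8_t)     == 1    /* byte       */
        && sizeof(uint16_t)    == 2    /* halfword   */
        && sizeof(uint32_t)    == 4    /* word       */
        && sizeof(uint64_t)    == 8    /* doubleword */
        && sizeof(__uint128_t) == 16;  /* quadword (GCC/Clang extension) */
}
```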

This chapter additionally defines physical addressing, physical memory map, physical memory properties, memory ordering.

Extensions of this simple memory model include virtual memory, caches, memory-mapped IO, and multiprocessor systems with shared memory; together with services provided by the operating system, they define the mechanism that allows explicit management of the extended memory model.

A simple sequential execution model allows at most one memory access at a time and requires that all memory accesses appear to execute in program order. Unlike this simple model, a relaxed memory model is defined below. In multiprocessor systems that allow multiple copies of data, aggressive implementations can allow time intervals during which different copies hold different values.

The program accesses memory using an effective address calculated by the processor when it performs a load, store, jump, or cache management instruction, and when it fetches the next sequential instruction. The effective address is converted to a physical address according to the translation procedures. The physical address is used by the memory subsystem to perform the access. The memory model provides the following features:

The architecture allows implementations to gain efficiency from weak ordering of memory accesses between processors, or between processors and external devices.

Memory accesses by a single processor appear to complete sequentially from the point of view of the programming model, but they may not complete in order with respect to the final position in the memory hierarchy. Order is guaranteed at every level of the memory hierarchy only for accesses to the same address from the same processor.

The architecture must provide instructions to allow the programmer to guarantee consistent and ordered state of memory.

The following sections define the operating system resources for translating virtual addresses to physical addresses, physical addressing, memory ordering and physical memory properties, the status registers that support virtual memory management, and virtual memory faults.

§ 7.1. Physical addressing

The blocks of RAM, ROM, flash, memory mapped IO and other control blocks occupy a common 64-bit physical address space with byte addressing. Accesses to RAM and the IO address ranges can be performed either through virtual addressing, by mapping to a 64-bit physical address space, or directly through physical addressing.

While software should always treat physical addresses as 64-bit, the hardware may in fact implement fewer than 64 bits (PALEN bits) of the physical address. As shown below, the physical address consists of two parts: unimplemented and implemented bits. At least 40 bits of physical addressing must be implemented.

The system software can determine the specific value of PALEN by reading the PALEN field of the configuration word with the cpuid instruction.

Not all available addresses have real devices behind them. At startup, the hardware maps the available address blocks to physical memory ranges and notifies the system about the mapping. Similarly, the control register ranges of external devices are mapped to physical addresses. Most physical addresses usually remain unused.

64-bit physical address
bits 63..PALEN: reserved, must be zero
bits PALEN-1..0: implemented physical address bits

When the processor model doesn't implement all bits of the physical address, the missing bits must be zero. If software generates a physical address with non-zero unimplemented bits, a runtime error occurs. Fetching instructions from unimplemented physical addresses results in the error «unimplemented instruction address». Accessing data at unimplemented physical addresses results in the error «unimplemented data address». Accesses to implemented but unused addresses end with an asynchronous «machine check abort» when the platform reports an operation timeout. The exact machine check behavior is implementation-dependent.

§ 7.2. Data alignment and atomicity

Memory accesses take a significant performance hit when operands are not aligned on their natural address boundary. A naturally aligned 2-byte number in memory has a zero low-order address bit. A naturally aligned 4-byte number has two zero bits in the least significant bits of its address. A naturally aligned 8-byte number has three zero bits, and a naturally aligned 16-byte number has four. In general, a naturally aligned object of size 2^N bytes has N zero bits in the least significant bits of its address.
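The 2^N rule above reduces to a single mask test; a minimal sketch (is_naturally_aligned is an illustrative helper, not part of the architecture):

```c
#include <stdint.h>

/* A naturally aligned object of size 2^N bytes has N zero bits in the low
 * bits of its address; equivalently, addr & (size - 1) == 0.
 * size must be a power of two. */
int is_naturally_aligned(uintptr_t addr, uintptr_t size) {
    return (addr & (size - 1)) == 0;
}
```

For example, address 0x1004 is naturally aligned for a 4-byte word (two zero low bits) but not for an 8-byte doubleword.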

Struct data types must provide natural alignment for all of their fields by inserting padding. Additionally, it should be possible to use structs as array elements, which requires final padding according to the strictest alignment among all struct fields.

The following C structure S, containing a set of various scalars and a character string, shows how the fields are laid out in memory.

struct {
   int       a;    /* usual 4 bytes */
   double    b;    /* usual 8 bytes */
   int       c;    /* usual 4 bytes */
   char      d[7];
   short     e;    /* usual 2 bytes */
   int       f;    /* usual 4 bytes */
} S;

C language rules for mapping structures allow the use of padding (byte skipping) to align scalars in memory on natural boundaries.

Table 7.1: Aligned representation of the structure in memory (8 bytes per row)
offsets 0..7:   4 bytes (a) | 4 bytes padding
offsets 8..15:  8 bytes (b)
offsets 16..23: 4 bytes (c) | d[0] d[1] d[2] d[3]
offsets 24..31: d[4] d[5] d[6] | 1 byte padding | 2 bytes (e) | 2 bytes padding
offsets 32..39: 4 bytes (f) | 4 bytes final padding

In the example, the structure is mapped to memory with each scalar aligned on its natural boundary. This alignment adds four padding bytes between a and b, one byte between d and e, and two bytes between e and f. Since the alignment of the double-precision number b is the strictest in this structure, the whole structure must be aligned on an 8-byte boundary. This adds 4 more bytes of padding at the end of the struct.
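The layout described above can be verified with offsetof and sizeof; a sketch assuming a typical ABI with 4-byte int, 8-byte double, and 2-byte short, as in the text (struct S_demo is a renamed copy of S):

```c
#include <stddef.h>

/* The padded layout from the example, with the expected offsets noted. */
struct S_demo {
    int    a;      /* offset 0,  then 4 padding bytes         */
    double b;      /* offset 8                                 */
    int    c;      /* offset 16                                */
    char   d[7];   /* offsets 20..26, then 1 padding byte      */
    short  e;      /* offset 28, then 2 padding bytes          */
    int    f;      /* offset 32, then 4 bytes of final padding */
};                 /* sizeof == 40, aligned to 8 (alignment of b) */
```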

Unaligned memory accesses raise the error «Unaligned data address». POSTRISC will not contain any hardware support for unaligned memory accesses, relying instead on a software handler installed for the corresponding interrupt. Therefore, software is required to align all scalar values on their natural boundaries in memory.

Since instruction fetches, aligned loads/stores, and semaphore operations operate only on aligned target addresses, they are atomic. An operation is atomic if, for other agents working with memory (other processors, IO devices), the memory access from our processor is an indivisible transaction (and vice versa). If our processor stores data to memory, no other agent can read from memory a mixture of the old data and the newly written data replacing it. Similarly, if our processor reads data, it never reads a mixture of old data and the new data written by another agent. Of course, at the machine architecture level, these rules apply only to memory atoms, that is, correctly aligned objects of 1, 2, 4, 8, or 16 bytes in size. For arbitrary objects in memory, atomicity of modification is not guaranteed by the architecture, and software techniques must be applied.

§ 7.3. Byte order

If scalars (individual data elements or instructions) were indivisible, then there would be no concept of «byte order». It makes no sense to consider the order of bits or groups of bits within the smallest addressable memory atom, because this order for an atom cannot be observed and determined. The question of order arises only when scalars, which the programmer and processor refer to as indivisible objects, occupy more than one addressable memory atom.

For most existing computer architectures, the smallest addressable memory atom is the 8-bit byte. Other scalars consist of groups of 2, 4, 8, or 16 bytes. When a 4-byte scalar moves from a register to memory, it occupies four consecutive byte addresses. Thus, it becomes necessary to establish the order of byte addresses relative to the scalar value: which byte contains the most significant eight bits of the scalar, which byte contains the next eight bits in significance, and so on.

For a scalar consisting of several atoms (bytes) of memory, the choice of byte order is essentially arbitrary. There are N! ways to order N bytes within a long number, but only two of these orderings are actually used.

In the first ordering, the smallest address is assigned to the byte that contains the eight lowest-order (rightmost) bits of the scalar, the next consecutive address to the next eight bits in ascending order of significance, and so on. This order is called little-endian because the least significant bits (from the smaller end) of the scalar, regarded as a binary number, go into memory first. Intel x86 is an example of an architecture using this byte order.

In a little-endian machine, bytes within a large number are numbered from right to left in decreasing order of byte addresses, so the low byte is stored in memory at the lowest address. This is the direct byte order: a format for storing and transmitting binary data in which the least significant byte is transmitted or stored first.

7 6 5 4 3 2 1 0

In the second ordering, the smallest address is assigned to the byte that contains the eight highest-order (leftmost) bits of the scalar, the next consecutive address to the next eight bits in descending order of significance, and so on. This order is called big-endian because the most significant bits (from the larger end) of the scalar, regarded as a binary number, go into memory first. IBM PowerPC is an example of an architecture using this byte order.

In a big-endian machine, bytes within a large number are numbered from left to right in ascending order of byte addresses, so the low byte is stored in memory at the highest address. This is the reverse byte order: a format for storing and transmitting binary data in which the most significant byte is transmitted or stored first. The terms little-endian and big-endian come from Gulliver's Travels by Jonathan Swift.

0 1 2 3 4 5 6 7
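The two conventions can be told apart at run time by inspecting the byte stored at the lowest address of a known pattern; a minimal sketch (both helpers are illustrative, not part of the architecture):

```c
#include <stdint.h>
#include <string.h>

/* Detects the host byte order: for the pattern 0x11121314, a little-endian
 * machine stores 0x14 at the lowest address, a big-endian machine 0x11. */
int is_little_endian(void) {
    uint32_t v = 0x11121314;
    uint8_t first;
    memcpy(&first, &v, 1);   /* byte stored at the lowest address */
    return first == 0x14;
}

/* Reverses the 4 bytes of a word, converting between the two conventions. */
uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```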

The following C structure S, containing a set of various scalars and a character string, shows the layout of fields in memory under the two byte-order conventions. The comments show the value of each structure element; these values show how the individual bytes that make up each element are mapped into memory.

struct {
   int     a;     /* 0x1112_1314 (4 bytes) */
   double  b;     /* 0x2122_2324_2526_2728 (8 bytes) */
   int     c;     /* 0x3132_3334 (4 bytes) */
   char    d[7];  /* "A","B","C","D","E","F","G" bytes array */
   short   e;     /* 0x5152 (2 bytes) */
   int     f;     /* 0x6162_6364 (4 bytes) */
} S;

C language rules for mapping structures allow the use of padding (byte skipping) to align scalars in memory on their desired (natural) boundaries. In the examples below, the structure is mapped into memory with natural alignment for each scalar. This alignment adds four padding bytes between a and b, one byte between d and e, and two bytes between e and f. The same amount of padding is present in both the big-endian and little-endian mappings.

The contents of each byte, as defined in the structure S, are displayed as a hexadecimal number or a character (for string elements). Cell addresses (offsets from the beginning of the structure) accompany the data stored at each address.

Table 7.4: Little-endian mapping of structure S (8 bytes per row)
offsets 0..7:   0x14 0x13 0x12 0x11 | padding padding padding padding
offsets 8..15:  0x28 0x27 0x26 0x25 0x24 0x23 0x22 0x21
offsets 16..23: 0x34 0x33 0x32 0x31 | «A» «B» «C» «D»
offsets 24..31: «E» «F» «G» padding | 0x52 0x51 padding padding
offsets 32..39: 0x64 0x63 0x62 0x61 | padding padding padding padding
Table 7.5: Big-endian mapping of structure S (8 bytes per row)
offsets 0..7:   0x11 0x12 0x13 0x14 | padding padding padding padding
offsets 8..15:  0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28
offsets 16..23: 0x31 0x32 0x33 0x34 | «A» «B» «C» «D»
offsets 24..31: «E» «F» «G» padding | 0x51 0x52 padding padding
offsets 32..39: 0x61 0x62 0x63 0x64 | padding padding padding padding

For the POSTRISC architecture, little-endian is the primary byte order. All operations on data in registers and memory follow this order. Implementations may include optional support for big-endian addressing when loading/storing numbers.

The bit numbering within bytes doesn't affect the byte numbering convention (big-endian or little-endian). The byte numbering convention doesn't matter when accessing fully aligned data in memory. However, it is important when accessing smaller or unaligned data, or when manipulating data in registers, as follows:

Retrieving the byte at offset 5 of an 8-byte number into the low byte of a register requires a right shift of 5 bytes under the little-endian convention, but a right shift of 2 bytes under the big-endian convention.

The manipulation of data in registers is almost the same under both conventions. Both integers and floating-point numbers keep their sign bit in the leftmost byte and their least significant bit in the rightmost byte, so the same integer and floating-point instructions are used unchanged under both conventions. However, big-endian character strings have their most significant character on the left, while little-endian strings have their most significant character on the right.

In addition to little-endian and big-endian, there are other (mixed) options for storing long scalars in memory. For example, some architectures (the PDP-11, notably) store 2-byte numbers in little-endian order, but 4-byte numbers as pairs of 2-byte numbers in big-endian order. Sometimes integers are stored according to one convention and floating-point numbers according to another, for example, when a floating-point coprocessor (ARM, TMS320C4x) was added to an integer processor later.

§ 7.4. Memory consistency model

There are several memory-consistency models for SMP systems:

  1. Sequential consistency (all reads and all writes are in-order).
  2. Relaxed consistency (some types of reordering are allowed):
    • loads can be reordered after loads (for better working of cache coherency, better scaling),
    • loads can be reordered after stores,
    • stores can be reordered after stores,
    • stores can be reordered after loads.
  3. Weak consistency (reads and writes are arbitrarily reordered, limited only by explicit memory barriers)

Atomic operations can be reordered with loads and stores.

Instruction fetching is incoherent with data, so self-modifying code can't be executed without special instruction-cache flush/reload instructions, plus possibly jump instructions.

POSTRISC follows the weak memory model. The weak memory model with acquire loads and release stores is also called the release-consistency model. Only the acquire/release atomic instructions are synchronization points.

Table 7.6: Memory ordering in some architectures
Columns (+ means the reordering is possible): loads reordered after loads | loads after stores | stores after loads | stores after stores | atomics with loads | atomics with stores | dependent loads reordered | incoherent instruction cache/pipeline
Alpha + + + + + + + +
ARM + + + + + + +
RISC-V WMO + + + + + + +
RISC-V TSO + +
PA-RISC + + + +
POWER + + + + + + +
SPARC RMO + + + + + + +
SPARC PSO + + + +
SPARC TSO + +
x86 + +
AMD-64 +
IA-64 + + + + + + +
IBM-Z +
Postrisc + + + + + + +

Notes: on Alpha, dependent loads can be reordered. If the processor first fetches a pointer to some data and then the data, it might not fetch the data itself but use stale data that it has already cached and not yet invalidated. Allowing this relaxation makes cache hardware simpler and faster, but leads to the requirement of memory barriers for both readers and writers. On Alpha hardware (such as multiprocessor Alpha 21264 systems), cache line invalidations sent to other processors are processed lazily by default, unless explicitly requested to be processed between dependent loads. The Alpha architecture specification also allows other forms of dependent load reordering, for example, using speculative data reads ahead of knowing the real pointer to be dereferenced.

§ 7.5. Atomic/synchronization instructions

The processor implementation must follow the program order when executing the instructions of a single-threaded program. But the effects of one thread's actions on memory may be observed by other threads out of that thread's program order. Depending on the guarantees the architecture explicitly gives and the relaxations the implementation is explicitly allowed, we speak of stricter or weaker memory ordering. POSTRISC is an architecture with weak memory ordering. There are no implicit restrictions on the order in which third-party processors or other devices (for example, IO) observe the current thread's actions on memory. Similarly, the current thread has no implicit guarantees on the order in which it observes the actions of other agents.

The special instruction «fence» is used as a memory barrier. The supported memory order (mo) types are acquire, release, acquire-release (acq_rel), and sequentially-consistent (seq_cst).

fence.mo
Instruction format for atomic fence
bits 41..0: opcode | 0 | mo | opx

The special «load-atomic» and «store-atomic» instructions are used to propagate the visibility of changes from one thread to another: relaxed is a normal operation, acquire acquires changes, release publishes changes, acquire-release does both (sequentially consistent). Before accessing shared data, a thread performs a load-acquire from the guard variable. Similarly, after changing the shared data, the thread publishes the changes by executing a store-release to the guard variable.

Those are the instructions ldab, ldah, ldaw, ldad, ldaq (load), and stab, stah, staw, stad, staq (store).

INSN_MNEMONIC.mo target, base

These instructions are one-way barriers: they do not allow speculative or out-of-order execution of memory operations across themselves. Acquire doesn't allow subsequent instructions to move up before it, and release doesn't let preceding instructions sink below it. With correct (pairwise) use of acquire and release, a closed section of code is obtained, locked at the top (acquire) and at the bottom (release).
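The acquire/release pairing can be sketched with C11 atomics and POSIX threads (a hypothetical host-side analogue: the release store plays the role of stad.release, the acquire load of ldad.acquire):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;             /* shared data, plain (non-atomic) */
static atomic_int guard_flag;   /* guard variable, zero-initialized */

/* Writer: fill the payload, then publish it with a release store. */
static void *producer(void *arg) {
    (void)arg;
    payload = 42;                       /* ordinary store to shared data */
    atomic_store_explicit(&guard_flag, 1, memory_order_release);
    return NULL;
}

/* Reader: spin with acquire loads; once the guard is observed set,
 * the payload written before the release store is guaranteed visible. */
int run_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&guard_flag, memory_order_acquire) == 0)
        ;                               /* wait until published */
    pthread_join(t, NULL);
    return payload;                     /* never a stale value */
}
```

The release store keeps the payload write from sinking below it; the acquire load keeps the payload read from rising above it, which is exactly the one-way-barrier pairing described above.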

The possible memory orderings for atomic load: relaxed, acquire, seq_cst (sequentially-consistent). The possible memory orderings for atomic store: relaxed, release, seq_cst (sequentially-consistent).

Instruction format for atomic load/store
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base 0 mo opx

The load-op atomic instructions copy the old value of a variable from memory to a register and set a new value in memory, either given directly (test-and-set) or computed from the old one (fetch-and-add and analogues). The possible memory orderings: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).

ea = gr[base]
gr[dst] = mem[ea]
mem[ea] = gr[dst] op gr[src]
Instruction format load-op
bits 41..0: opcode | dst | base | src | mo | opx
ld_op_type.mo dst, base, src
Table 7.7: atomic instructions load-op
Instruction            Description
swap[b|h|w|d|q].mo     swap, 1-16 bytes
ldadd[b|h|w|d].mo      addition, 1-8 bytes
ldand[b|h|w|d].mo      bitwise AND, 1-8 bytes
ldor[b|h|w|d].mo       bitwise OR, 1-8 bytes
ldxor[b|h|w|d].mo      bitwise XOR, 1-8 bytes
ldsmin[b|h|w|d].mo     signed minimum, 1-8 bytes
ldsmax[b|h|w|d].mo     signed maximum, 1-8 bytes
ldumin[b|h|w|d].mo     unsigned minimum, 1-8 bytes
ldumax[b|h|w|d].mo     unsigned maximum, 1-8 bytes
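The load-op semantics (return the old value, apply the operation atomically) match the C11 atomic_fetch_* family; a small sketch (demo_fetch_ops is an illustrative helper):

```c
#include <stdatomic.h>
#include <stdint.h>

/* atomic_fetch_*_explicit returns the old value and applies the operation,
 * like ldadd/ldor above. */
uint32_t demo_fetch_ops(void) {
    atomic_uint v = 0;
    /* fetch-add: returns old value 0, leaves v == 5 */
    uint32_t old = atomic_fetch_add_explicit(&v, 5, memory_order_relaxed);
    /* fetch-or: returns old value 5, leaves v == 0x15 */
    old += atomic_fetch_or_explicit(&v, 0x10, memory_order_acq_rel);
    return atomic_load_explicit(&v, memory_order_relaxed) + old; /* 0x15 + 5 */
}
```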

The store-op atomic instructions update a value in memory via the corresponding operation. Compared to load-op, they don't return the old value from memory to a register, so they may be implemented as one-way communication. The possible memory orderings: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).

ea = gr[base]
mem[ea] = mem[ea] op gr[src]
Instruction format store-op
bits 41..0: opcode | 0 | base | src | mo | opx
store_op_type.mo base, src
Table 7.8: atomic instructions store-op
Instruction            Description
stadd[b|h|w|d].mo      addition, 1-8 bytes
stand[b|h|w|d].mo      bitwise AND, 1-8 bytes
stor[b|h|w|d].mo       bitwise OR, 1-8 bytes
stxor[b|h|w|d].mo      bitwise XOR, 1-8 bytes
stsmin[b|h|w|d].mo     signed minimum, 1-8 bytes
stsmax[b|h|w|d].mo     signed maximum, 1-8 bytes
stumin[b|h|w|d].mo     unsigned minimum, 1-8 bytes
stumax[b|h|w|d].mo     unsigned maximum, 1-8 bytes

The instructions cas[b|h|w|d|q] (compare and swap, 1-16 bytes) are designed for non-blocking interaction in a multi-threaded multiprocessor environment. They are atomic, indivisible memory operations that cannot be partially performed.

The cas instruction reads an N-byte number from memory at the address in the base register and compares it with the value in the register dst. If the values match, the instruction stores the new value from the src register at this address; otherwise it stores nothing. The base address must be aligned on an N-byte boundary. The value read is returned in the register dst.

value = mem[base]
if (value == gr[dst]) {
    mem[base] = gr[src]
}
gr[dst] = value
Format of instructions casX
bits 41..0: opcode | dst | base | src | mo | opx

Using the following procedure, a thread can modify the contents of a memory cell even if the thread may be interrupted and replaced by another thread that updates the cell, or a thread on another processor may modify the cell simultaneously. First, the 8-byte number is loaded entirely into a register. Then the updated value is computed and placed in another register, sval. Then the casd instruction is executed with parameters test (the register holding the initial value), base (the register holding the base address), and sval (the register containing the updated value). If the modification completes successfully, the original value is returned. If the memory cell no longer contains the original value (the current thread was interrupted, or a thread on another processor intervened), the update does not succeed, and the dst register (test) of the casd instruction receives the new current value of the memory cell. The thread may then retry the procedure using the new current value.

loop:
ldad.relaxed test, base       ; load current value
mov save, test
...
addi sval, test, 12           ; some kind of modification
...
casd.relaxed test, base, sval
bne save, test, loop          ; retry if another agent intervened
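The same retry loop can be sketched with the C11 compare-exchange primitive (fetch_and_add12 is a hypothetical helper mirroring the «+12» modification above): on failure, the expected value is reloaded with the current cell contents and the new value is recomputed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* CAS retry loop: reload, compute a new value, retry until no other agent
 * modified the cell in between.  Returns the old value, like casd in dst. */
uint64_t fetch_and_add12(_Atomic uint64_t *cell) {
    uint64_t test = atomic_load_explicit(cell, memory_order_relaxed);
    uint64_t sval;
    do {
        sval = test + 12;       /* some kind of modification */
        /* on failure, test is updated with the current cell value */
    } while (!atomic_compare_exchange_weak_explicit(
                 cell, &test, sval,
                 memory_order_relaxed, memory_order_relaxed));
    return test;
}
```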

The instruction casd can be used for controlled sharing of a common data area, including the ability to send messages (into a linked message list) while the common area is in use. To achieve this, an 8-byte number in memory can be used as a control word. A value of zero indicates that the common area is not in use and that no messages exist. A negative value indicates that the area is in use by someone and that no messages exist. A positive value indicates that the shared area is in use and that the value is the address of the most recent message added to the list. Therefore, any number of threads wishing to capture the common area can use casd to update the control word, to indicate that the area is in use or to add messages to the list. The single thread that has captured the shared area can also safely use casd to remove messages from the list.

The instruction casq can be used similarly to casd; additionally, it has other uses. Consider a linked message list, with a control word used to address the first message in the list, as described above. If multiple threads are allowed to delete messages using casd (and not just the single thread that captured the common area), the list can be modified incorrectly. This can happen if, for example, after one thread reads the address of the last message in order to remove it, another thread deletes the first two messages and then adds the first message back to the linked list (the «ABA» problem, in IBM terminology). The first thread, continuing its interrupted execution, cannot detect that the list has changed. By increasing the size of the control word to a pair of 8-byte numbers, containing the address of the first message and a modification tag (change counter) that increases by 1 each time the list is modified, and using casq to update both fields together, the possibility of an incorrect list update can be reduced to an insignificant level. Namely, an incorrect modification can occur only if the first thread is interrupted and, during that time, the number of changes to the list is an exact multiple of 2^64, and only if the last change to the list reuses the original message address.

§ 7.6. Memory attributes

The architecture of any processor needs a simple and effective mechanism for distinguishing between memory accesses and IO operations for devices mapped into the address space. For memory accesses, write-back caching is possible, blocking semaphore operations are allowed, and the optimizations of reordering loads/stores and write-combining (coalescing) of stores are available. For operations with mapped IO devices, write-through is strictly necessary, caching is forbidden, side effects are possible even on reads, and strict (sequential) ordering without reordering or merging is required.

In addition, a dedicated, fixed address range must exist in the physical address space for the bootloader code, an analogue of the PC EPROM and BIOS. It contains the entry point from which execution starts after a system restart, and other embedded, implementation-dependent code (processor-dependent code, PDC) and platform code (system-dependent code, SDC). It is a read-only memory block, although updating may be permitted.

Memory attributes define speculation, cacheability, ordering, and write policy. If virtual addressing is enabled, the memory attributes of the mapped physical page are determined by the TLB. If physical addressing is used, memory attributes are determined from the physical address.

The software must use the correct address subspaces when using physical addressing. Otherwise, incorrect access to IO devices with side effects is possible.

An address range can be either cacheable or non-cacheable. If the range is cacheable, the processor is allowed to allocate a local copy of the corresponding physical memory at all levels of the processor cache hierarchy. Allocation can be changed by cache management instructions.

A cached page is memory coherent, i.e. the processor and the memory system guarantee a consistent view of memory for each processor. Processors support multiprocessor cache coherence based on physical addresses between all processors in the coherence domain (tightly coupled multiprocessors). Coherence doesn't depend on virtual aliases, since they are forbidden.

The processor is not required to support coherence between the local instruction and data caches; that is, a local store may not be observable by the local instruction cache. Moreover, multiprocessor coherence is not required of the instruction cache. However, the processor must ensure that the operations of other IO agents, such as «Direct Memory Access» (DMA), are physically coherent with the data and instruction caches.

For uncached accesses, the processor doesn't provide any coherence mechanisms; the memory system must ensure that a consistent view of memory is seen by each processor.

When writing to cached memory with a write-back policy, only the processor-owned local copy of the cache line changes. Writing to a lower-level cache (or to the physical location of the data in memory) occurs when a modified cache line is explicitly (or implicitly) evicted from a higher-level cache. With the write-through policy, data changes reach all levels of caching immediately.

For non-cached address ranges, write-combining (coalescing) can be set, which tells the processor that multiple writes to a limited memory area (typically 32 bytes) can be assembled together in a write buffer and issued later as one large combined write. The processor can combine writes for an indefinite period of time. Write-combining is a means to increase processor efficiency. A processor with multiple write buffers should use the buffers approximately evenly, even if some buffers are only partially full.

The processor can flush data from write buffers to memory in any order; combined writes are not performed in their original order. Write-combining can be either spatial or temporal. For example, a write of bytes 4 and 5 and a write of bytes 6 and 7 are combined into a single write of bytes 4, 5, 6, and 7. Likewise, a write of bytes 5 and 6 is combined with a subsequent write of bytes 6 and 7 into a single write of bytes 5, 6, and 7 (the first write to byte 6 is discarded).

The memory attributes may be defined in several ways.

Memory attributes may be defined via special registers at the level of physical address ranges. In X86 the special memory type range registers (MTRRs) are a set of processor supplementary capability control registers that provide system software with control of how accesses to memory ranges by the CPU are cached. It uses a set of programmable model-specific registers (MSRs) which are special registers provided by most modern CPUs. Possible access modes to memory ranges can be uncached, write-through, write-combining, write-protect, and write-back. In write-back mode, writes are written to the CPU's cache and the cache is marked dirty, so that its contents are written to memory later. Write-combining allows bus write transfers to be combined into a larger transfer before bursting them over the bus to allow more efficient writes to system resources like graphics card memory. This often increases the speed of image write operations by several times, at the cost of losing the simple sequential read/write semantics of normal memory. Additional bits, added in AMD64, allow the shadowing of ROM contents in system memory (shadow ROM), and the configuration of memory-mapped I/O.

Memory attributes may be defined at the level of virtual addresses via virtual page properties, as an additional part of cached translation info. Such per-page memory attributes may then override the per-range physical address attributes, or restrict them in a compatible manner.

Memory attributes may be determined by physical memory mapping only. In this case, fixed address ranges have specified memory attributes. Memory attributes are set implicitly during the initial physical address ranges mapping at reset and can't be changed further.

In POSTRISC, the last approach is chosen. Memory attributes of physical address ranges are defined by their mapping to the corresponding physical address ranges. They can't be redefined later via special registers and/or page properties. So the physical address space is divided into fixed parts with mmio-like and memory-like address ranges.

Table 7.9: Classification of physical addresses
Addresses      Use
0 to 1 GiB     mmio-like, for compatible devices (not 64-bit ready)
1 to 4 GiB     memory-like, for compatible devices (not 64-bit ready)
4 to 256 GiB   mmio-like, for 64-bit ready devices
over 256 GiB   memory-like, main space

§ 7.7. Memory map

From the system point of view, the physical address space is a collection of devices, each of which is mapped to a contiguous address range. Everything is a memory-mapped device: RAM units are devices, external IO devices are naturally memory-mapped devices, even processor cores are memory-mapped devices.

The bus controller which controls memory mapping is also the memory-mapped device. The special «device array» device maps all device configuration spaces (similar to PCI root complex). Each device has 4 KiB configuration space maximum in device array.

At least one block address in the physical memory map must be fixed by the architecture: the starting address in ROM for code execution after reset. The layout of other blocks may also be fixed, or may be discovered by the ROM code.

start               end                 size     description
0x0000000000000000  0x00000000ffffffff  4 GiB    reserved
0x0000000100000000  0x00000001000fffff  1 MiB    chipset control
0x00000001f0000000  0x00000001ffffffff  256 MiB  ROM
0x0000000200000000  0x00000002ffffffff  4 GiB    PCIE ECAMs (16×256 MiB)
0x0000004000000000  0x0000004fffffffff  64 GiB   PCIE BARs
0x0000010000000000  0x000003ffffffffff  2 TiB    RAM

The memory map should be consistent with memory attributes. Chipset control, PCIE config spaces, memory-mapped io: should be mapped to mmio-like ranges. Memory devices: should be mapped to memory-like ranges. ROM devices: may be both, but the startup ROM should be memory-like.

The cache flush instructions icbf (instruction cache block flush) and dcbf (data cache block flush) flush any write buffers whose addresses fall within the 32-byte aligned block specified by icbf or dcbf, forcing the data to become visible. The icbf and dcbf instructions may also flush additional write buffers.

The parameterless instruction msync (memory synchronize) is a hint for the processor to speed up the flushing of all pending (buffered) stores, regardless of their addresses. This makes pending writes visible to other memory agents.

There is no way to know when the flushing of writes will complete. The ordering of combined writes is not guaranteed, so later writes may occur before earlier writes. To ensure that earlier combined writes become visible before later ones, software must serialize between the writes.

The processor can flush combined writes to memory at any time, in any order, before the software explicitly requires it.

Pages that allow write-combining are not necessarily coherent with the write buffers or caches of other processors, or with local processor caches. Loads from write-combining pages by a processor see the results of all previous writes by the same processor to the same write-combining page. Memory accesses issued from a combining buffer (such as buffer flushes) have an unordered, non-sequential memory ordering attribute.

The MMGR family includes instructions for working with special registers, barrier instructions, cache management, dynamic procedure calls, interprocess communication, etc.

Bits 41..0:
opcode | sntopx | label28
opcode | sntopx | base | simm21
opcode | sntopx | base | index | scale | sm | disp

The second register contains the base address (an address register only). The rest of the instruction is reserved for the offset, a 9-bit signed number. Formulas for the effective address:

ip + 16 × sign_extend(label23)

gr [base] + sign_extend(disp)

gr [base] + (gr [index] << scale) + sign_extend(disp)

Instructions ECB (Evict cache block), FETCH (Prefetch data), FETCHM (Prefetch data, modify intent), WH64 (Write hint 64 bytes) regulate the use of cache resources.

FETCH - load the block into the cache for reading N times (if N=0, then free the block).

FETCHM - load a block into the cache for modification N times (if N=0, then push the block out of the cache into memory).

Chapter 8. Virtual memory

This chapter additionally defines the operating system resources used to translate 64-bit virtual addresses into physical addresses. The virtual memory model introduces the following key features that distinguish it from the simplified view presented to application programs:

Translation lookaside buffers (TLB) support high-performance paged virtual memory systems. Software handlers for populating and protecting TLBs allow the operating system to control translation policies and protection algorithms.

A page table (PT) with a hardware walk capability has been added to increase TLB performance. The PT is a continuation of the processor TLB, located in RAM, which can be walked automatically by the processor. The use of the PT and its size are entirely under software control.

Sparse 64-bit virtual addressing is supported by providing large translation structures (including multi-level hierarchies, like a cache hierarchy), effective support for handling translation misses, pages of different sizes, pinned (non-replaceable) translations, and mechanisms for sharing TLB and page table resources.

The main addressable object in the architecture is the 8-bit byte. Virtual addresses are 64 bits long. An implementation may support a smaller virtual address space. Virtual addresses visible to the program are translated into physical memory addresses by the memory management mechanism.

§ 8.1. Virtual addressing

From an application point of view, the virtual addressing model represents a 64-bit single flat linear virtual address space. General purpose registers are used as 64-bit pointers in this address space.

Fewer than 64 bits of a virtual address may be implemented in hardware. Unimplemented address bits must be filled with copies of the most significant implemented bit (a sign extension of the implemented part of the address). Addresses in which all unimplemented bits match the most significant implemented bit are called canonical. The implemented virtual address space in this case consists of two parts: user and kernel. For N implemented virtual address bits, user addresses range from 0 to 2^(N-1)−1, and kernel addresses range from 2^64−2^(N-1) to 2^64−1.

So, for example, for 48 bits:

0x0000000000000000 - start of user range
0x00007FFFFFFFFFFF - end of user range
0xFFFF800000000000 - beginning of the kernel range
0xFFFFFFFFFFFFFFFF - end of kernel range

Each virtual address consists of a page table index (1 bit), a virtual page number (VPN), and a page offset. The least significant bits form the page offset; the virtual page number consists of the remaining bits. Page offset bits don't change during translation. The boundary between page offset and VPN in the virtual address varies with the page size used in the mapping. In the current implementation, 16 KiB pages are available, as well as superpages that are power-of-two multiples of 16 KiB (32 MiB and 64 GiB).

Virtual address, unimplemented bits, 16 KiB pages (bits 63..0):
sign extension | virtual page number | 16 KiB page offset (bits 13..0)

Switching between physical and virtual addressing modes is controlled by the privileged special register pta. The mode field sets the page translation mode. After processor reset, this field is zero. Virtual addressing is enabled when pta.mode != 0.

Table 8.1: PTA modes
pta.mode  description
0         no translation (physical addressing)
1         reserved
2         2 translation levels
3         3 translation levels
4         4 translation levels

Variable page sizes are needed to help software map system resources and to improve TLB utilization. Typically, operating systems choose a small set of page sizes to implement their virtual memory algorithms. Large pages can be allocated statically. For example, large areas of the virtual address space can be allocated to the operating system kernel, frame buffers, or mapped IO regions. The software can also selectively pin these translations by placing them in translation registers.

Page size can be specified in: translation cache, translation registers, and PT. Page size can also be used as a parameter for TLB cleanup instructions.

Page sizes are encoded as a 4-bit field ps (pagesize). The field defines a mapping size of 2^(ps+12) bytes.

Virtual and physical pages must be aligned on their natural boundary. For example, 64 KiB pages are aligned on a 64 KiB boundary, and 4 MiB pages on a 4 MiB boundary.

Processors supporting variable virtual page sizes typically require a fully associative TLB in hardware. Processors that use only one page size can get by with a set-associative buffer, although a fully associative one is still usual.

Table 8.2: Page permissions
abbreviation designation description
r read read access with the usual load/store instructions
w write write access with normal load/store instructions
x execute code execution access
b backstore saving/restoring registers from the hardware register stack
f finalized final state, page rights cannot be changed, gives the right to read addresses for indirect call instructions through trusted import tables and virtual function tables
p promote the right to elevate privileges of the current thread to the kernel level

The software can check page level permissions with the instructions mprobe, mprobef, which check the availability of this virtual page, privilege level, read/write permissions at the page level, and read/write permissions with a security key.

Execute-only pages may be used to elevate privileges on entry into operating system code. User-level code should usually branch to such a page (managed by the operating system) and execute the instruction epc (Enter Privileged Code). When epc has successfully elevated privileges, subsequent instructions execute at the target privilege level indicated by the page. A branch can (optionally) lower the current privilege level if the destination page has a lower privilege level.

§ 8.2. Translation lookaside buffers

Virtual addresses are translated to physical addresses using a hardware structure called the Translation Lookaside Buffer (TLB), or translation cache. Given a virtual page number (VPN), the TLB finds and returns the physical page number (PPN). A processor usually has two architectural TLB buffers: the instruction TLB (ITLB) and the data TLB (DTLB), which translate references to instructions and data respectively. In a simplified implementation, a single (combined) buffer may be used for both types of translation. The term TLB itself refers to the union of the instruction, data, and translation cache structures.

When the processor accesses memory, the TLB is searched for a translation record with the matching VPN value. If a matching translation record is found, the physical page number (PPN) is combined with the page offset bits to form the physical address. In parallel with the translation, page permissions are checked against the privilege level, along with the permissions granted for reading, writing, and execution.

If the required translation is not found in the TLB, the processor itself can search the page table in memory and install the translation in the TLB. If the required entry cannot be found in the TLB and/or page table, the processor raises a TLB miss fault so that the operating system can establish the translation. In a simplified implementation, the processor may raise the fault immediately after a TLB miss. After the operating system installs the translation in the TLB and/or page table, the faulting instruction may be restarted and execution continues.

Translation format in TLB (bits 63..0):
word 0: ppn | pl | ma | a | d | 0 | p | ar | v
word 1: vpn | rv | ps
word 2: rv | asid
Table 8.3: TLB translation record fields
Field  Description
v      Valid bit. If this bit is 1, the translation can be used in the search.
ar     Global permissions for the virtual page.
p      Present bit. Indicates that the mapped physical page is present in physical memory and not paged out to disk.
ma     Memory attributes. Describe caching, coherence, write policy, and other attributes of the mapped physical page.
a      Access bit. May cause a fault on access for tracing or debugging purposes. The processor doesn't modify the Access bit when the page is referenced.
d      Dirty bit. There was a write or semaphore instruction on this page.
pl     Privilege Level (Page Level).
ppn    Physical page number.
ps     Page size, 2^ps bytes.
vpn    Virtual page number.
asid   Address space identifier.
rv     Reserved (doesn't exist).

The TLB is a local processor resource (local insertion or removal of translation entries on one processor doesn't affect the TLB of another processor). A global TLB purge is provided to remove translations on all processors within a coherent TLB region of a multiprocessor system.

Translation Cache (TC) is an implementation-defined structure designed to store a small working set of dynamic translations for memory references. The processor directly controls the entry replacement policy in the TC.

The ptc (purge translation cache) instruction removes the local processor's ITC/DTC entries that match the specified range of virtual addresses. The software must handle the case where the purge should be extended to all processors in a multiprocessor system. Purging the translation cache doesn't affect pinned TC entries.

The translation cache has at least 16 entries for the ITC and 16 entries for the DTC. An implementation may have additional levels of a TLB hierarchy to increase efficiency.

The translation cache is controlled by both software and hardware. Generally speaking, the software cannot assume how long any installed translation will remain in the cache. The lifetime, as well as the replacement (eviction) algorithm, depends on the implementation. A processor can evict translations from the cache at any time for various reasons. TC purges can remove more entries than is explicitly required.

Records in the translation cache must be maintained in a consistent state. On a TLB insert or purge, all existing entries that partially or completely overlap the given translation must be removed. In this context, overlap refers to two translations with partially or completely overlapping virtual address ranges. For example: two 64K pages with the same virtual address, or a 128K page at virtual address 0x20000 and a 64K page at address 0x30000.

Translation registers (TR) are the part of the TLB containing translations whose replacement policy is controlled directly by software. Each translation cache entry can be pinned, turning it into a software-controlled translation register, or unpinned and returned to the common pool. Pinned translations are not replaced when the TC overflows (but are flushed when they overlap new translations). A pinned insert into a previously unpinned TC entry removes the translation cached in that entry. The software can explicitly place translations in a TR by specifying the entry number in the cache. Translations are deleted from a TR when the translation register is purged, but not when the translation cache is purged.

Translation registers allow the operating system to pin critical virtual memory translations in the TLB, for example, IO spaces, kernel memory areas, frame buffers, page tables, and sensitive interrupt code. Interrupt handler instruction fetch is performed using virtual addressing; therefore, virtual address ranges containing software translation miss handlers and other critical interrupt handlers should be pinned, otherwise additional recursive TLB misses may occur. Other virtual mappings may be pinned for performance reasons.

An inserted record is pinned if the insertion is done with the pin bit set. Once such a translation enters the TLB, the processor will not replace it to make room for other translations. Pinned translations can only be deleted by a software TLB purge. Insertions into and purges of translation registers can selectively delete other translations (from the translation cache).

A processor must have at least 8 pinned translation registers for the ITC and 8 for the DTC. An implementation may have additional translation registers to increase efficiency.

§ 8.3. Search for translations in memory

In case of a miss in the hardware TLB translation cache (lack of the required record), an interrupt occurs and the software miss handler comes into play. It must find the required translation in the page table in memory and place it in the TLB, after which the instruction that caused the interrupt is restarted. However, many systems contain a hardware (or semi-hardware) translation walk unit. Then, on a TLB miss, the hardware block for finding the translation in memory comes into play, and only if this block doesn't find the desired translation does an interrupt occur and the system (software) miss handler get called.

If the processor implements an automatic block for finding translations in memory, then the format of individual translation records, the format of the translation table as a whole, and the search algorithm over the translation table cease to be a free choice of the operating system. The system (OS-owned) translation structures must then work in close cooperation with the hardware translation search unit.

Page Table Walker (PTW) is a hardware unit for independently finding translations in RAM when they are absent from the TLB. The PTW is designed to increase the performance of the virtual address translation system.

Page Table (PT) is a translation table in memory, walkable by the PTW hardware unit (it must be configured according to the requirements of the PTW hardware).

The processor PTW block can be (optionally) configured to search for a translation in the PT after a failed search in the TLB for instructions or data. The PTW unit provides a significant performance increase by reducing the number of interrupts (and therefore delays and pipeline flushes) caused by TLB misses, and by allowing the PTW block to populate the TLB translation cache in parallel with other processor activity.

To organize a page table in memory, traditionally in different architectures, the following schemes are used with varying success:

top-down is the traditional multi-level translation search scheme based on direct downward parsing of a virtual address, where each level of the table tree is directly indexed by the next portion of the virtual address. It is the easiest for a hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for finding a translation equals the number of levels (the depth of the tree): 2 for X86, 3 for DEC Alpha, 4 for X64, 5-6 for IBM zSeries. It has problems with sparseness and fragmentation and limited support for variable page sizes. It takes up too much space for translation tables (proportional to the size of virtual memory) and uses table space inefficiently under heavy fragmentation.

guarded top-down is an improved multi-level translation search scheme based on direct downward parsing of a virtual address, where each level of the table tree is directly indexed by the next portion of the virtual address, and some levels may be skipped. It is harder for a hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for a translation search may be less than the maximum number of levels. It reduces the problems with sparseness and fragmentation and the limited support for variable page sizes.

bottom-up is a reverse, recursive, ascending scheme of viewing translation tables, where recursive misses are taken in one large linear table located in virtual memory. It requires hardware support for nested interrupts. The number of memory accesses for finding a translation depends on the number of recursive TLB misses: at best it is 1, but in the worst case it is proportional to the top-down method. It has problems with sparseness and fragmentation and limited support for variable page sizes. It takes up too much space for translation tables (in the worst case, proportional to the size of virtual memory) and uses table space inefficiently under heavy fragmentation.

inverted is a hash table of pages. Its size is proportional to the size of physical (rather than virtual) memory and doesn't depend on the degree of fragmentation of the virtual space. The number of memory accesses for a translation search doesn't depend on the size of the page table and, if the hash function and hash table size are chosen well, is usually 1. It copes well with sparseness and fragmentation, with limited support for variable page sizes. It caches poorly when looking up translations for neighboring pages.

In the POSTRISC architecture, a multi-level translation search scheme based on the direct top-down order of viewing the translation tables was chosen to implement the page table, where each next level is directly indexed by a new portion of the virtual address. The number of memory accesses for a translation search equals the number of levels (variable, currently 3). The page table is located in the physical memory space as a multi-level structure of service tables.

Virtual address: 16 KiB pages and possible translation levels (bits 63..0):
sign extension | 11 bits | 11 bits | 11 bits | 16 KiB page offset (bits 13..0)

In the event of a miss in the hardware TLB translation cache (lack of the required record), the hardware block for finding translations in memory comes into play, and if this block doesn't find the required translation, an interrupt occurs and the software miss handler is called.

The special register page table address (pta) defines the parameters of the in-memory translation search for the virtual space, and describes the location and size of the PT root page in the address space. The operating system must ensure that page tables are naturally aligned.

Special register pta (root level) and translation records for the next levels (bits 63..0):
pta:          reserved | ppn | 0 | mod
intermediate: reserved | ppn | ma | 0 | s | v
intermediate: reserved | ppn | ma | 0 | s | v
final:        reserved | ppn | ar | g | d | a | p | 0 | v
Table 8.4: Translation record fields
Field  Bits           Description
mod    3              Translation mode: 0 - no translation; 1, 2, 3, and so on - the number of indexing levels in the search.
v      1              Valid bit. For intermediate and final formats: if 1, the page entry is valid, otherwise a search error occurs.
ppn    varies, 30-50  Physical page number if p=1, or other system data if p=0.
s      1              Superpage bit: stop the search (final format instead of intermediate).
p      1              The page is present in memory.
ma     4              Page physical attributes. Should be defined per superpage.
a      1              Access bit.
d      1              Dirty bit: indicates whether the page was modified. When a page is evicted to swap, it need not be written back if it is already in the swap and unchanged.
ar     6              Permissions.
rv     -              Reserved (must be zeros).

The format of the page tables should take into account the mapping of virtual addresses to a physical address space of a total depth of 64 bits.

§ 8.4. Translation instructions

List of translation instructions. The processor doesn't guarantee that modification of translation resources is observed by subsequent instruction fetches or data memory accesses. The software must provide instruction serialization (by issuing a synchronizing barrier instruction) before any dependent instruction fetch, and data serialization before any dependent data reference.

Table 8.5: Instructions that modify the TLB
Syntax Description
ptc     ra,rb,rc
Purge translation cache
ptri    rb,rc
Purge the instruction translation register. ITR ← gr[rc], ifa
ptrd    rb,rc
Purge the data translation register. DTR ← gr[rc], ifa
mprobe  ra,rb,rc
Returns page permissions for the privilege level gr[rc]
tpa     ra,rb
Translates a virtual address to a physical address

The ptc instruction invalidates all translations in the local processor cache that match the specified address and ASID. The processor determines the ASID-specific page that contains the address and invalidates all TLB entries for that page. The instruction deletes all translations from both translation caches that intersect the specified address range. If the paging structures map the linear address using large pages and/or there are multiple TLB entries for that page, the instruction invalidates all of them.

Format of instructions ptc
Bits 41..0:
opcode | base | address | asid | 0 | opx

Translation records can be inserted into pinned translation registers by the instructions mtitr (move to instruction translation register) and mtdtr (move to data translation register). The data for the inserted translation is taken from the first register argument of the instruction and the special register ifa. The translation register number is taken from the second argument register.

Translation records can be deleted from translation registers by instructions ptri (Purge Translation register for Instruction) and ptrd (Purge Translation register for Data). The first argument is the base address register number, the second argument is the register number that stores the translation register number. The instructions also delete all translations from both translation caches that intersect with the specified address range. The instructions only remove translations from the local processor registers.

Instruction format mtitr, mtdtr, ptri, ptrd
Bits 41..0:
opcode | target | base | asid | 0 | opx

Permissions for a virtual page can be queried with the instructions mprobe (memory probe) and mprobef (memory probe faulting). The mprobe instruction, for a given base address and privilege level, returns the mask of available rights. The privilege level is given as a value in a register. The mprobef instruction doesn't return rights, but tests for the required access rights for a given base address and privilege level. If the rights are absent, the mprobef instruction raises a «Data Access rights fault»; otherwise the instruction does nothing.

Instruction format mprobe
Bits 41..0:
opcode | dst | base | pl | 0 | opx
Instruction format mprobef
Bits 41..0:
opcode | 0 | base | pl | 0 | opx

Privileged instruction tpa (translate to physical address) returns the physical address corresponding to the given virtual address.

Instruction format tpa
Bits 41..0:
opcode | dst | base | 0 | 0 | opx

The common TLB/PT search sequence looks like this. If the TLB search fails and the PTW is disabled (pta.mode=0), an ITLB/DTLB miss fault occurs. If the PTW is enabled (pta.mode!=0), the PTW computes the index into the root page table and tries to find the missing translation in the PT in memory, walking the table tree. If additional TLB misses occur during the PTW operation, the walk generates a fault. If the PTW doesn't find the required translation in memory (that is, the PT doesn't contain it), or the search is interrupted, an instruction/data TLB miss fault occurs. Otherwise, the record is loaded into the ITC or DTC. The processor can load records into the ITC or DTC even if the program did not request the translation.

Insertions from PT to TC follow the same «purge before insert» rules as program inserts. PT insertion of entries that exist in TR registers is not allowed. Specifically, the PT walk can search for any virtual address, but if the address is mapped by a TR, such a translation must not be inserted into the TC. The software should not place translations in the PT that intersect current TR translations. An insert from the PT may result in a machine abort if the inserted PT record overlaps a TR.

After the translation entry is loaded into the TLB, additional translation faults are checked (in priority order): missing access rights for the page, the access bit check, the dirty bit check, the page missing from memory.
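The priority order above can be sketched as a small check function. This is an illustrative model only: the TlbEntry field names and the rights bitmask encoding are assumptions, not the spec's layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical TLB entry layout (illustrative names, not the spec's):
// a rights bitmask plus accessed/dirty/present bits.
struct TlbEntry {
    uint8_t rights;   // bitmask: 1=read, 2=write, 4=execute
    bool accessed;
    bool dirty;
    bool present;
};

// Check post-translation faults in the priority order described above:
// access rights first, then the access bit, then the dirty bit (stores
// only), then page presence.
std::string check_faults(const TlbEntry& e, uint8_t requested, bool is_store) {
    if ((e.rights & requested) != requested) return "access rights fault";
    if (!e.accessed)                         return "access bit fault";
    if (is_store && !e.dirty)                return "dirty bit fault";
    if (!e.present)                          return "page not present fault";
    return "ok";
}
```

The ordering matters: a store to a read-only page reports the rights fault, not the dirty-bit fault, because the rights check has higher priority.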

Chapter 9. The floating-point facility

This chapter describes the floating-point and vector subsystem of the virtual processor instruction set.

§ 9.1. Floating-point formats

The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE 754-1985) defines two floating-point formats – single and double precision – in two groups – basic and extended. The architecture supports all four formats in IEEE terminology: the basic single and double formats and the double-extended format. The basic double format simultaneously serves as the single-extended format.

The architecture defines the representation of floating-point values in four fixed-length binary formats: 16-bit for half precision, 32-bit for single precision, 64-bit for double precision, and 128-bit for quadruple precision. A value in each format is composed of three fields: the sign bit (S), the exponent (E), and the fractional part or mantissa (F).

float number format - half: S (1 bit), Exp (5 bits), Fraction (10 bits)
float number format - single: S (1 bit), Exp (8 bits), Fraction (23 bits)
float number format - double: S (1 bit), Exp (11 bits), Fraction (52 bits)
float number format - quadruple: S (1 bit), Exp (15 bits), Fraction (112 bits)

Single precision numbers occupy four adjacent bytes of memory, starting at any address that is a multiple of 4. Double precision numbers occupy eight adjacent bytes, starting at any address that is a multiple of 8. Quadruple precision numbers occupy sixteen adjacent bytes, starting at any address that is a multiple of 16.

The values representable in each format are determined by two integer parameters – the format width B and the number of exponent bits P. All other parameters are derived from these two.

Table 9.1: Parameters of the formats of float numbers
Format parameter                            Half    Single  Double   Quadruple
Format bits B                               16      32      64       128
Exponent bits P (P<B)                       5       8       11       15
Sign bit S                                  1       1       1        1
Fraction bits FB: (B−P−1)                   10      23      52       112
Significant fraction bits (B−P)             11      24      53       113
Significant decimal digits log10(2^(B−P))   3.311   7.225   15.955   34.016
Maximum exponent EMAX: (2^(P−1)−1)          15      127     1023     16383
Minimum exponent EMIN: −(2^(P−1)−2)         −14     −126    −1022    −16382
Exponent bias (2^(P−1)−1)                   15      127     1023     16383
Maximum biased exponent EBMAX: (2^P−1)      31      255     2047     32767
Bias adjustment 3×2^(P−2)                   24      192     1536     24576
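All the derived parameters follow from B and P alone, so Table 9.1 can be reproduced mechanically. A minimal sketch (the struct and its member names are mine, not the spec's):

```cpp
// Derived IEEE-style format parameters from format width B and exponent
// bits P, per Table 9.1: FB = B − P − 1, bias = EMAX = 2^(P−1) − 1, etc.
struct FpFormat {
    int B, P;
    int fraction_bits()   const { return B - P - 1; }
    int emax()            const { return (1 << (P - 1)) - 1; }
    int emin()            const { return -((1 << (P - 1)) - 2); }
    int bias()            const { return emax(); }
    int ebmax()           const { return (1 << P) - 1; }
    int bias_adjustment() const { return 3 << (P - 2); }
};
```

For example, FpFormat{32, 8} reproduces the single-precision column, and FpFormat{128, 15} the quadruple one.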

The following table shows the exact limit values for the formats in decimal:

Limit                            Value
Normalized values                (−1)^S × 1.F × 2^(E−EMAX)
Maximum normalized value         (2.0 − 2^−FB) × 2^EMAX
Single absolute maximum          3.40282347e+38
Double absolute maximum          1.7976931348623158e+308
Quadruple absolute maximum       1.1897314953572317650857593266280070162e+4932
Minimum normalized value         1.0 × 2^EMIN
Single absolute minimum          1.17549435e−38
Double absolute minimum          2.2250738585072013e−308
Quadruple absolute minimum       3.3621031431120935062626778173217526026e−4932
Subnormalized values             (−1)^S × 0.F × 2^EMIN
Maximum subnormalized value      (1 − 2^−FB) × 2^EMIN
Quadruple maximum subnormal      3.3621031431120935062626778173217519551e−4932
Minimum subnormalized value      1.0 × 2^(EMIN−FB)
Single minimum (subnormal)       1.401298464324817071e−45 (inexact)
Double minimum (subnormal)       4.940656458412465442e−324 (inexact)
Quadruple minimum (subnormal)    6.4751751194380251109244389582276465525e−4966

The following objects are allowed within each format:

NAN is short for «not a number». A NAN is an IEEE binary floating-point representation of something other than a number. NANs come in two kinds: «signaling» NANs and «quiet» NANs.

Arithmetic with infinities treats the operands as arbitrarily large quantities. Negative infinity is less than any finite number; positive infinity is greater than any finite number.

Notation: S is the sign bit, EXP is the biased exponent, i.e. reduced to unsigned form, F is the fractional part or mantissa (fraction), XXXXX is an arbitrary but non-zero bit sequence, EBMAX is the maximum biased exponent. The value of a float number is interpreted as follows.

If EXP = EBMAX (all exponent bits set), this is a special IEEE value. To recognize the special values, F is examined further. If F is non-zero, this is +NAN or −NAN. Specifically, if the first bit of the mantissa is 0, it is a signaling NAN, and if 1, a quiet NAN. If EXP = EBMAX and F = 0, this is infinity, +INF or −INF depending on S. If 0 < EXP < EBMAX, this is a finite normalized number. If EXP = 0 and the mantissa is non-zero, this is a finite denormalized number. If EXP = 0 and F = 0, this is +0 or −0 depending on S.

Exponent     Fraction   IEEE value
EBMAX        0XXXXXX    SNAN
EBMAX        1XXXXXX    QNAN
EBMAX        0          INF
0<E<EBMAX    any        Finite (Normalized): (−1)^S × 2^(E−BIAS) × 1.F
0            XXXXXXX    Finite (Denormal): (−1)^S × 2^EMIN × 0.F
0            0          ±0
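The decoding rules above can be checked against a real double by inspecting its raw bits. A sketch in standard C++ (the function names are mine); the exponent field of all ones selects the special values, and the top fraction bit separates quiet from signaling NANs:

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

// Decode a double per the interpretation rules above: EBMAX for double
// is 2047 (11 exponent bits), the fraction is the low 52 bits.
std::string ieee_class(double x) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // safe bit-level view
    uint64_t exp  = (bits >> 52) & 0x7FF;
    uint64_t frac = bits & 0xFFFFFFFFFFFFFull;
    if (exp == 0x7FF) {
        if (frac == 0) return "INF";
        return ((frac >> 51) & 1) ? "QNAN" : "SNAN";
    }
    if (exp == 0) return frac == 0 ? "zero" : "subnormal";
    return "normal";
}
```

For instance, 5e-324 (the smallest double) decodes as subnormal, and std::numeric_limits<double>::quiet_NaN() as QNAN.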

Floating-point operations can raise arithmetic exceptions for several reasons: invalid operation, overflow, underflow, division by zero, and inexact result.

§ 9.2. Special floating-point values

NAN is the abbreviation for «not a number». A NAN is an IEEE floating-point bit pattern that represents something other than a number. These are the values with the maximum biased exponent and a non-zero fractional part. The sign bit is ignored (a NAN is neither positive nor negative), although it can be examined. NANs come in two kinds: signaling NANs and quiet NANs. If the high bit of the mantissa is zero, it is a signaling NAN, otherwise a quiet NAN.

A signaling NAN (SNAN) is used to provide values for uninitialized variables and for arithmetic extensions. A signaling NAN reports an invalid operation when it is the operand of an arithmetic operation, and may raise an arithmetic exception. A signaling NAN thus raises an exception whenever such a value appears as the operand of a computational instruction.

A quiet NAN (QNAN) provides retrospective diagnostic information about previous invalid or unavailable data and results. Quiet NANs propagate through almost every operation without raising arithmetic exceptions.

A QNAN is used to represent the results of certain invalid operations, such as invalid arithmetic on infinities or on a NAN, when the invalid-operation exception is disabled. Quiet NANs propagate through all floating-point operations except ordered comparisons (LT, LE, GT, GE) and conversions to integer, which do report exceptions. QNAN codes can thus be carried through a sequence of floating-point operations and used to convey diagnostic information, helping to identify the consequences of invalid operations.

When a QNAN is the result of a floating-point operation, either because one of the operands is a NAN or because a QNAN was generated due to a disabled invalid-operation exception, the following rule determines which NAN (with the high mantissa bit set to 1) is saved as the result. If either operand is an SNAN, that SNAN, quieted, is returned as the result of the operation. Otherwise, if a QNAN was generated because the invalid-operation exception is disabled, that QNAN is returned. A newly generated QNAN has a positive sign, an exponent of all ones, and the most significant mantissa bit set to 1 (all other bits 0). An instruction that generates a QNAN as the result of a disabled invalid-operation exception should generate exactly such a QNAN (e.g. 0x7FF8000000000000 for double).
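The default QNAN pattern 0x7FF8000000000000 can be constructed and verified directly (a sketch; the helper names are mine):

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Build the double whose encoding is 0x7FF8000000000000: positive sign,
// exponent all ones, top mantissa bit 1, remaining mantissa bits 0.
double default_qnan() {
    uint64_t bits = 0x7FF8000000000000ull;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Raw bit view of a double, for round-trip checking.
uint64_t double_bits(double x) {
    uint64_t b;
    std::memcpy(&b, &x, sizeof b);
    return b;
}
```

The constructed value is a NaN for the C library as well, so std::isnan() agrees with the bit-level definition.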

§ 9.3. IEEE implementation options

Floating-point instructions provide a subset of the IEEE standard for binary floating-point arithmetic (ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic). The following describes how to create a full implementation of IEEE.

Four IEEE rounding modes are supported in hardware: to nearest, toward zero (truncation), toward plus infinity, and toward minus infinity. The hardware supports the IEEE enabling/disabling of software traps for special situations. Addition, subtraction, multiplication, division, conversion between floating formats, rounding to an integer in floating-point format, conversion between floating and integer formats, comparison, and square root are supported in hardware. The remainder of division and conversion between binary and decimal formats are supported in software. Copying (possibly with a sign change) without a format change is not considered an operation (non-finite numbers are not checked). Operations on mixed formats are not provided; calculations are carried out with the maximum precision available for the given vector format.

The precision of conversion between decimal strings and binary floating-point numbers is no less than the IEEE standard requires. Whether the conversion procedures to decimal format treat excess digits (beyond 9, 17, or 36 digits) as zeros is implementation-dependent.

Overflow, underflow, NAN and INF encountered by the binary-to-decimal conversion software are returned as strings identifying these states.

The hardware supports comparisons of numbers of the same format. Numbers of different formats can be compared in software. The result of a comparison is true or false. The hardware supports the six required predicates and the unordered predicate. The other 19 optional predicates can be constructed from comparisons and bitwise operations. Infinities are supported in hardware by the comparison instructions.

QNANs provide retrospective diagnostic information. Copying a signaling NAN without a format change doesn't raise an invalid-operation exception (the fmerge instructions likewise do not check for non-finite numbers).

The hardware fully supports negative zero operands and follows the IEEE rules for producing negative zero results. The hardware supports underflow and denormal numbers.

Tininess is detected by hardware after rounding; loss of accuracy is detected by software as an inexact result.

§ 9.4. Representation of floats in registers

The universal registers, 128 bits wide each, can hold one quadruple precision float, 2 double precision, 4 single precision, or 8 half precision floats, or an integer vector with elements 1, 2, 4, or 8 bytes long.

Table 9.4: Representation format for real data
register bytes
15141312 111098 7654 3210
half half half half half half half half
single single single single
double double
quadruple
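The overlapping layouts of Table 9.4 correspond to a union view of one 128-bit register. A sketch (the union and its member names are mine; the real register file is of course opaque to software):

```cpp
#include <cstdint>
#include <cstring>

// One 128-bit register viewed under the element layouts of Table 9.4.
// (No quadruple member: standard C++ has no portable 128-bit float.)
union Reg128 {
    uint8_t  bytes[16];   // 16 x 1-byte elements
    float    singles[4];  // 4 x single precision
    double   doubles[2];  // 2 x double precision
    uint64_t dwords[2];   // 2 x 8-byte integer elements
};
```

Writing one lane and reading the overlapping bytes back shows the aliasing: the encoding 0x3FF0000000000000 placed in the low doubleword reads back as the double 1.0.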

The special register fpcr governs the execution of floating-point and vector operations. It controls the arithmetic rounding mode for all instructions except explicit rounding instructions, indicates which user-level traps are enabled, and accumulates the exceptions that have occurred, both trapped and masked.

FPU control register format
313029282726252423222120191817161514131211109876543210
IEEE masked flags IEEE masked traps IEEE nonmasked traps control bits
0 im um om zm dm vm 0 i u o z d v 0 i u o z d v 0 td ftz 0 rm
Table 9.5: SF Field Bits
bits  description
v     Invalid Operation
d     Denormal/Unnormal Operand
z     Zero Divide
o     Overflow
u     Underflow
i     Inexact result
td    Traps disabled
rm    Rounding mode
ftz   Flush-to-Zero mode (zeroing without underflow)

The rm (rounding mode) bits control the rounding mode of the results. The rounding mode doesn't affect the execution of explicit rounding instructions, for which only the rounding mode specified directly in the instructions matters.

Rounding mode (RM)  Description
0                   Round to nearest (round)
1                   Round toward minus infinity (floor)
2                   Round toward plus infinity (ceil)
3                   Round toward zero (chopping)
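The same four modes exist in the C/C++ floating-point environment, which makes their effects easy to observe. A sketch using the standard <cfenv> macros (this demonstrates the IEEE modes themselves, not POSTRISC's fpcr encoding):

```cpp
#include <cfenv>
#include <cmath>

// Round x to an integral value under the given <cfenv> rounding mode
// (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO), restoring the
// previous mode afterwards. std::nearbyint honours the current mode.
double round_in_mode(double x, int mode) {
    int old = std::fegetround();
    std::fesetround(mode);
    double r = std::nearbyint(x);
    std::fesetround(old);
    return r;
}
```

Note that «round to nearest» breaks ties to even, so 2.5 rounds to 2.0 rather than 3.0 in that mode.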

The masked flags vector stores the mask of flags enabling IEEE interrupts of the corresponding type. The bits of the nonmasked traps and masked traps vectors store the flags of exceptions that occurred while they were enabled (or, respectively, disabled) in the masked flags vector.

The fldi instruction loads immediate floating-point constants into registers. It can load constants representable in formats up to extended (80 bits) without loss of accuracy. The instruction cannot encode zero or the special values, and restricts the exponent to 6 bits. The instruction encodes numbers 28 bits long (or 70 bits in the dual-slot form).

Instruction format fldi
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target s exponent mantissa (high 21 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
mantissa (full 63 bits)

§ 9.5. Floating-point computational instructions

All computational operations are performed only on registers. The basic operation for maximum performance is the vector (or scalar) fused «multiply-add» MAC (multiply-accumulate). Floating-point arithmetic instructions that fuse multiplication with addition, possibly with a sign change, are formed according to the FMAC rule.

Ternary «fused» floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 src3 opx
Binary floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx
Unary floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 0 opx
Unary floating-point instruction format with rounding
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 rm opx

The following table lists these instructions. Each exists in half, single, double, and quadruple variants (suffixes h, s, d, q).

Table 9.10: Floating point computational instructions
Scalar Packed Description
Fused instructions
fmadds[h|s|d|q] fmaddp[h|s|d|q] a = b × c + d
fmsubs[h|s|d|q] fmsubp[h|s|d|q] a = b × c − d
fnmadds[h|s|d|q] fnmaddp[h|s|d|q] a = − b × c + d
fnmsubs[h|s|d|q] fnmsubp[h|s|d|q] a = − b × c − d
fmaddap[h|s|d] ax = bx × cx + dx, ay = by × cy − dy
fmsubap[h|s|d] ax = bx × cx − dx, ay = by × cy + dy
binary instructions
fadds[h|s|d|q] faddp[h|s|d|q] a = b + c
faddhp[h|s|d] ax = bx + by, ay = cx + cy
faddcp[h|s|d] ax = bx + cx, ay = bx − cy
fnadds[h|s|d|q] fnaddp[h|s|d|q] a = − (b + c)
fsubs[h|s|d|q] fsubp[h|s|d|q] a = b − c
fsubhp[h|s|d] ax = bx − by, ay = cx − cy
fsubcp[h|s|d] ax = bx − cx, ay = bx + cy
fabsds[h|s|d|q] fabsdp[h|s|d|q] a = abs (b − c)
fnabsds[h|s|d|q] fnabsdp[h|s|d|q] a = − abs (b − c)
fmuls[h|s|d|q] fmulp[h|s|d|q] a = b × c
fdivs[h|s|d|q] fdivp[h|s|d|q] a = b/c
fmins[h|s|d|q] fminp[h|s|d|q] a = min (b, c)
fmaxs[h|s|d|q] fmaxp[h|s|d|q] a = max (b, c)
famins[h|s|d|q] faminp[h|s|d|q] a = min (abs (b), abs (c))
famaxs[h|s|d|q] famaxp[h|s|d|q] a = max (abs (b), abs (c))
fcmps[h|s|d|q]oeq fcmpp[h|s|d]oeq fp compare ordered and equal
fcmps[h|s|d|q]one fcmpp[h|s|d]one fp compare ordered and not-equal
fcmps[h|s|d|q]olt fcmpp[h|s|d]olt fp compare ordered and less
fcmps[h|s|d|q]ole fcmpp[h|s|d]ole fp compare ordered and less-equal
fcmps[h|s|d|q]o fcmpp[h|s|d]o fp compare ordered
fcmps[h|s|d|q]ueq fcmpp[h|s|d]ueq fp compare unordered or equal
fcmps[h|s|d|q]une fcmpp[h|s|d]une fp compare unordered or not-equal
fcmps[h|s|d|q]ult fcmpp[h|s|d]ult fp compare unordered or less
fcmps[h|s|d|q]ule fcmpp[h|s|d]ule fp compare unordered or less-equal
fcmps[h|s|d|q]uo fcmpp[h|s|d]uo fp compare unordered
p[s|d]pk pack two vectors into one
Conversion to integer with rounding
fcvtiw2s[h|s|d|q] fcvtiw2ps convert signed word to floats
fcvtuw2s[h|s|d|q] fcvtuw2ps convert unsigned word to floats
fcvts[h|s|d|q]2iw fcvtps2iw convert floats to signed word
fcvts[h|s|d|q]2uw fcvtps2uw convert floats to unsigned word
fcvtid2s[h|s|d|q] fcvtid2pd convert signed doubleword to floats
fcvtud2s[h|s|d|q] fcvtud2pd convert unsigned doubleword to floats
fcvts[h|s|d|q]2id fcvtpd2id convert floats to signed doubleword
fcvts[h|s|d|q]2ud fcvtpd2ud convert floats to unsigned doubleword
fcvtiq2s[h|s|d|q] convert signed quadword to floats
fcvtuq2s[h|s|d|q] convert unsigned quadword to floats
fcvts[h|s|d|q]2iq convert floats to signed quadword
fcvts[h|s|d|q]2uq convert floats to unsigned quadword
Conversion to narrower float with rounding
fcvts[s|d|q]2sh convert float to half-float
fcvts[d|q]2ss convert float to single float
fcvtsq2sd convert float to double float
Extending to wider float instructions
fextsh2ss extend float to single float
fexts[h|s]2sd extend float to double float
fexts[h|s|d]2sq extend float to quadruple float
Rounding instructions
frnds[h|s|d|q] frndp[h|s|d] floating-point round
unary instructions
fnegs[h|s|d|q] fnegp[h|s|d] floating-point negate value
fabss[h|s|d|q] fabsp[h|s|d] floating-point absolute value
fnabss[h|s|d|q] fnabsp[h|s|d] floating-point negate absolute value
frsqrts[h|s|d|q] frsqrtp[h|s|d] floating-point reciprocal square root
fsqrts[h|s|d|q] fsqrtp[h|s|d] floating-point square root
funphp[h|s|d] unpack high half the vector into wider precision vector
funplp[h|s|d] unpack lower half the vector into wider precision vector

The fcmp instructions generate predicates from the results of floating-point comparisons. They produce boolean scalars/vectors as the result of a floating-point vector comparison. Two vectors are compared elementwise and the result is written to a third vector. For elements where the condition holds, all bits of the result element are set to 1, otherwise to 0. After the comparison, a single predicate bit can be obtained by a conjunction or disjunction over all bits of the result vector.
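The mask-then-reduce scheme can be modelled on two-element double vectors. This is a scalar sketch of the semantics (function names are mine, not instruction mnemonics):

```cpp
#include <array>
#include <cstdint>

// Elementwise ordered-less-than over two 2-element double vectors:
// lanes where the condition holds become all ones, others all zeros.
// An ordered comparison is false whenever either operand is a NAN.
std::array<uint64_t, 2> fcmp_olt(const std::array<double, 2>& a,
                                 const std::array<double, 2>& b) {
    std::array<uint64_t, 2> m{};
    for (int i = 0; i < 2; ++i)
        m[i] = (a[i] < b[i]) ? ~0ull : 0ull;
    return m;
}

// Reduce the lane masks to a single predicate bit by conjunction
// ("all elements satisfy") or disjunction ("some element satisfies").
bool all_lanes(const std::array<uint64_t, 2>& m) { return (m[0] & m[1]) == ~0ull; }
bool any_lane (const std::array<uint64_t, 2>& m) { return (m[0] | m[1]) != 0; }
```

Comparing {1, 5} with {2, 3} sets only the first lane, so the disjunction predicate is true while the conjunction predicate is false.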

For some instructions, the second operand is replaced by a 7-bit immediate value count, from 0 to 127, which specifies the accuracy of a non-pipelined unary operation, e.g. fsqrt or frcp.

The accuracy of the fsqrt, frcp, and frsqrt instructions is given by the constant count directly in the instruction. With minimal accuracy the instruction executes in the same time as a regular MAC, without pipeline delays.

§ 9.6. Floating-point branch and nullification instructions

Quadruple scalar Scalar Double Scalar Single Description
branch if compare is true
bfsqoeq bfsdoeq bfssoeq ordered and equal
bfsqone bfsdone bfssone ordered and not-equal
bfsqolt bfsdolt bfssolt ordered and less
bfsqole bfsdole bfssole ordered and less-or-equal
bfsqo bfsdo bfsso ordered
bfsqueq bfsdueq bfssueq unordered or equal
bfsqune bfsdune bfssune unordered or not-equal
bfsqult bfsdult bfssult unordered or less
bfsqule bfsdule bfssule unordered or less-or-equal
bfsquo bfsduo bfssuo unordered
branch if classification is true
bfsqclass bfsdclass bfssclass compare
nullify if compare is true
nulfsqoeq nulfsdoeq nulfssoeq ordered and equal
nulfsqone nulfsdone nulfssone ordered and not-equal
nulfsqolt nulfsdolt nulfssolt ordered and less
nulfsqole nulfsdole nulfssole ordered and less-or-equal
nulfsqo nulfsdo nulfsso ordered
nulfsqueq nulfsdueq nulfssueq unordered or equal
nulfsqune nulfsdune nulfssune unordered or not-equal
nulfsqult nulfsdult nulfssult unordered or less
nulfsqule nulfsdule nulfssule unordered or less-or-equal
nulfsquo nulfsduo nulfssuo unordered
nullify if classification is true
nulfsqclass nulfsdclass nulfssclass
Format of fp scalar compare branch instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src1 src2 opx disp17x16
Format of fp scalar compare nullification instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src1 src2 opx dist-no dist-yes opx

The branch-on-floating-point-classification instructions test the class of a floating-point value. The classification instructions use a 7-bit immediate mask whose flags describe which floating-point value types meet the condition.

Classification flag  Description                        Assembler mnemonic
0x01                 Zero                               @zero
0x02                 Negative                           @neg
0x04                 Positive                           @pos
0x08                 Infinity                           @inf
0x10                 Normalized                         @norm
0x20                 Denormalized                       @denorm
0x40                 NaN (Quiet)                        @nan
0x80                 fixme: no place for Signaling NaN  @snan
Format of fp scalar classification branch instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src classify opx disp17x16
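A classifier producing such a flag mask can be sketched with the standard fpclassify machinery. The mapping of class bits follows the table above; treating the sign flags as accompanying the class bit (and leaving a NAN with neither sign flag) is my reading, not a statement of the spec:

```cpp
#include <cmath>
#include <cstdint>

// Map a double to the classification flag mask of the table above:
// 0x01 zero, 0x08 infinity, 0x10 normalized, 0x20 denormalized,
// 0x40 quiet NaN, plus 0x02 negative / 0x04 positive for non-NANs.
uint32_t classify(double x) {
    uint32_t f = 0;
    switch (std::fpclassify(x)) {
    case FP_ZERO:      f |= 0x01; break;
    case FP_INFINITE:  f |= 0x08; break;
    case FP_NORMAL:    f |= 0x10; break;
    case FP_SUBNORMAL: f |= 0x20; break;
    case FP_NAN:       f |= 0x40; break;
    }
    if (!std::isnan(x))
        f |= std::signbit(x) ? 0x02 : 0x04;
    return f;
}
```

A branch instruction then fires when the AND of its immediate mask with the value's flags is non-zero.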

The nullification-on-floating-point-classification instructions nfclsd, nfclsq, nfclss test the class of a floating-point value.

Format of fp scalar classification nullification instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src classify 0 dist-no dist-yes opx

§ 9.7. Logical vector instructions

The instructions for manipulating registers as bit vectors are independent of the type of data stored in the registers. They are intended for conditional moves, operations on bit masks, and predicate generation.

41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx
Name    Description
vsll    shift left
vsrl    shift right
vrll    rotate left
vrrl    rotate right
p1perm  permute bytes
lvsr    vector load for shift left (permutation)

The instruction vsel (vector bitwise select) performs a bitwise selection from two registers based on the contents of a third register, whose bit mask is typically the precomputed result of a logical or comparison operation.

Instruction format vsel, p1perm
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 src3 opx

The instructions dep16 (vector deposit) and srp16 (vector shift right pair) produce a bitwise selection from two registers. The dep16 instruction takes the first count bits of the result from the first operand register and the remaining bits from the second operand register. The srp16 instruction takes the first count bits of the result from the upper part of the first operand register and the remaining bits from the lower part of the second operand register.

Instruction format dep16, srp16
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 count opx
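Scalar 64-bit models of the two selections are easy to state; reading «first count bits» as the low-order bits is my interpretation of the description above, not a statement of the spec:

```cpp
#include <cstdint>

// dep: take the low `count` bits from a, the remaining bits from b.
uint64_t dep64(uint64_t a, uint64_t b, unsigned count) {
    uint64_t mask = (count >= 64) ? ~0ull : ((1ull << count) - 1);
    return (a & mask) | (b & ~mask);
}

// srp: shift the 128-bit pair hi:lo right by `count` (1..63) and
// return the low 64 bits, i.e. lo's high bits filled from hi.
uint64_t srp64(uint64_t hi, uint64_t lo, unsigned count) {
    return (lo >> count) | (hi << (64 - count));
}
```

For example, dep64(0xFFFF, 0xAAAA0000, 16) splices the low 16 bits of the first operand into the second, giving 0xAAAAFFFF.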

§ 9.8. Integer vector operations

These are DSP (digital signal processing) instructions for working with multimedia integer data. The instructions are formed according to the FBIN format rule: the first register is the result, the second and third are the operands.

41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx

The vector elements can be 1, 2, 4, or 8 bytes in size. Calculations can be carried out modulo or with saturation (saturate). Saturation can be signed or unsigned. Modular arithmetic can either truncate the carry or return it (carry-out).
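The difference between modulo and saturating arithmetic shows clearly on single bytes. A scalar sketch (the C++ models one lane; the function names are mine, not the instruction mnemonics):

```cpp
#include <cstdint>
#include <algorithm>

// Signed saturating byte addition: widen to int, then clamp the sum
// to the int8_t range [-128, 127] instead of wrapping.
int8_t add_sat_s8(int8_t a, int8_t b) {
    int s = int(a) + int(b);
    return int8_t(std::min(127, std::max(-128, s)));
}

// Modulo (wrapping) unsigned byte addition: the carry out of bit 7
// is simply discarded.
uint8_t add_mod_u8(uint8_t a, uint8_t b) {
    return uint8_t(a + b);
}
```

So 100 + 100 saturates to 127 in signed-saturate form, while 200 + 100 wraps to 44 in unsigned-modulo form.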

Name      Description                          Element size
vaddc*    add carry-out unsigned               1,2,4,8
vaddu*    add unsigned modulo                  1,2,4,8
vaddo*    add overflow                         1,2,4,8
vaddss*   add signed saturate                  1,2,4,8
vaddus*   add unsigned saturate                1,2,4,8
vavgs*    average signed                       1,2,4,8
vavgu*    average unsigned                     1,2,4,8
vcmpeq*   compare equal                        1,2,4,8
vcmplts*  compare less than signed             1,2,4,8
vcmpltu*  compare less than unsigned           1,2,4,8
vmaxs*    maximum signed                       1,2,4,8
vmaxu*    maximum unsigned                     1,2,4,8
vmins*    minimum signed                       1,2,4,8
vminu*    minimum unsigned                     1,2,4,8
vmrgh*    merge high                           1,2,4,8
vmrgl*    merge low                            1,2,4,8
vpkssm*   pack signed as signed modulo         2,4,8
vpksss*   pack signed as signed saturate       2,4,8
vpksum*   pack signed as unsigned modulo       2,4,8
vpksus*   pack signed as unsigned saturate     2,4,8
vpkuum*   pack unsigned as unsigned modulo     2,4,8
vpkuus*   pack unsigned as unsigned saturate   2,4,8
vrol*     rotate left                          1,2,4,8
vror*     rotate right                         1,2,4,8
vsll*     shift left logical                   1,2,4,8
vsra*     shift right algebraic                1,2,4,8
vsrl*     shift right logical                  1,2,4,8
vsubb*    subtract carry-out unsigned          1,2,4,8
vsubu*    subtract unsigned modulo             1,2,4,8
vsubus*   subtract unsigned saturate           1,2,4,8
vsubss*   subtract signed saturate             1,2,4,8
vupkhs*   unpack high signed                   1,2,4
vupkls*   unpack low signed                    1,2,4

In the table, the asterisk * stands for the size of the vector elements: 1, 2, 4, or 8.

Chapter 10. Extended instruction set

This chapter describes the extended virtual processor instruction set which was not included in the basic set.

§ 10.1. Helper Address Calculation Instructions

To simplify addressing, several instructions compute effective addresses without accessing memory. The ldax instruction returns the effective address computed with indexed addressing.

Instruction format ldax
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base index scale sm disp

The instruction ldar (load address relative) calculates an ip-relative base address in the same way as a jump instruction. The first argument is the result register number, the second is the distance in instruction bundles from the current position (in assembler this is a label in the code section, or a label in the immutable data section aligned on a 16-byte boundary). It is used to obtain the base address of immutable data in a code section, a function address, or a label. The instruction doesn't generate interrupts.

ldar dst, label

This instruction is needed by position-independent code to obtain the absolute address of objects stored at a fixed distance from the current position, for example intra-module procedures or immutable local module data. On MAS (Multiple Address Spaces) systems, where a module's private data is stored at a fixed distance from the code section, it can also be used to obtain the absolute base address of the module's private data.

Instruction format ldar
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst label (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 label (expanding to 60 bits instead of 28)

The instruction is formed according to the ldar rule. The result register is followed by a 28-bit field encoding the offset relative to the instruction counter. The data must be aligned on at least a 16-byte boundary, since the offset expresses the distance in instruction bundles, not bytes. The general formula for the address:

gr[dst] = ip + 16 × sign_extend(label)

The 28-bit offset field (64 bits in the dual-slot form), after sign extension and a left shift by 4 positions, is added to the contents of the instruction counter ip to produce a 64-bit effective address. The maximum distance for a one-slot instruction is 2 GiB either side of the instruction counter. The ldar instruction allows the immediate value to continue into the next slot of the bundle, forming a dual-slot instruction.
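The address formula above is small enough to model exactly (a sketch; the helper name is mine):

```cpp
#include <cstdint>

// ldar effective address: sign-extend the 28-bit label field, scale by
// the 16-byte bundle size, and add to the instruction pointer.
uint64_t ldar_target(uint64_t ip, uint32_t label28) {
    // sign-extend from bit 27 via shift up to bit 63 and arithmetic
    // shift back down
    int64_t sext = int64_t(uint64_t(label28) << 36) >> 36;
    return ip + 16 * uint64_t(sext);
}
```

A label field of 1 addresses the next bundle (ip + 16); the all-ones field 0x0FFFFFFF is −1, addressing the previous bundle (ip − 16).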

The ldar instruction might be used to compute the address of static module data, but the instruction ldafr (load address forward relative) is intended specially for this; it can address any byte, not only 16-byte bundle-aligned addresses. It computes the effective address the same way as the ip-relative load/store instructions. This reduces the maximum reach 16-fold, but since only forward references with an unsigned offset are possible, the effective reduction is only 8-fold. To use ldafr, the distance from the current bundle to the data must not exceed 256 MiB. Usually 256 MiB is enough for any module.

Instruction format ldafr
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst label (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 label (expanding to 64 bits instead of 28)

gr[dst] = ip + zero_extend(label)

If a signed constant fits in 28 bits, it is more efficient to use the ldi instruction, and if it fits in 56 bits, ldi together with ldan. However, when loading constants in bulk, one ldar instruction serves several load instructions, and then the pair of ldi and ldan instructions is less compact than a single load instruction. As for 8-byte integer constants, floating-point constants, and vector constants, using ldar together with ld8, ld4 and the other load instructions is the recommended, and often the only possible, way to load them.

Base-plus-offset addressing reaches 1 MiB either side of the base address with one-slot instructions (21-bit offset). If an object lies beyond 1 MiB, dual-slot instructions have to be used. But, by the principle of locality of access, the program will very likely access further objects located near the first one. This can be exploited: compute a base address once, such that several needed objects lie within 1 MiB of it, and then use one-slot instructions to address them.

The instruction ldan (load address near) calculates the nearest base address. It is used to optimize memory accesses that are local in place and time, without dual-slot instructions and long offsets. Another nearest-base-address instruction is ldanrc (load address near relative consistent).

ldan   dst, base, simm
ldanrc dst, base, simm

The first argument is the result register number, the second is the base address register number, and the third is an immediate value 21 bits long (or 63 bits in the dual-slot form), extended to 64 bits. The instruction allows the immediate value to continue into the next slot of the bundle, up to 63 bits, forming a dual-slot instruction.

Instruction format ldan, ldanrc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base simm (21 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 (44 bits instead of 21)

The target 64-bit address is calculated (for ldan and ldanrc) as:

gr[dst] = gr[base] + (simm << 20)          ; ldan

gr[dst] = ip + gr[base] + (simm << 20)     ; ldanrc

The following example shows how to use the ldan instruction to access a group of closely spaced objects (no more than 512 KiB apart) that are far away (the distance from the base address to the object sym is more than 512 KiB).

Without using ldan (4 dual-slot instructions, 8 slots)

    ldsw.l  %r1, base, sym + 4
    ldw.l   %r2, base, sym + 8
    std.l   %r2, base, sym + 16
    ldd.l   %r3, base, sym + 32

Using ldan (5 single instructions, 5 slots)

    ldan  tmp, base, data_hi (sym); put the nearest address in tmp
    ldsw  g11, tmp, data_lo (sym) +4; tmp addressing
    ldw   g12, tmp, data_lo (sym) +8
    std   g12, tmp, data_lo (sym) +16
    ldd   g13, tmp, data_lo (sym) +32

§ 10.2. Multiprecision arithmetic

For hardware support of multiprecision arithmetic, it is advisable to add special instructions. In the general case, an intermediate addition/subtraction of parts of high-precision numbers requires specifying the incoming carry (borrow), two operands, the result, and the outgoing carry (borrow).

Explicitly encoding all dependencies, without global flags (which is good for parallel/pipelined execution of instructions), requires 5 parameters: the result, two operands, and the input and output carry/borrow. There is not enough space in an instruction for all five parameters. Therefore, the high part of the 128-bit registers is used to return the carry/borrow.

The special instruction mulh (multiply high) was introduced for hardware support of long multiplication; it calculates the upper half of the 128-bit product of two 64-bit numbers.

Instruction format addc, subb, mulh
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc 0 opx
Instruction format addaddc, subsubb
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc rd opx

Syntax:

addc      ra, rb, rc
addaddc   ra, rb, rc, rd
subb      ra, rb, rc
subsubb   ra, rb, rc, rd
mulh      ra, rb, rc
Table 10.1: Fused instructions
Name     Operation                          Description
addc     add with carry                     ra = carry(rb + rc), sum(rb + rc)
subb     subtract with borrow               ra = borrow(rb − rc), rb − rc
addaddc  add and add with carry             ra = carry(rb + rc + rd.high), rb + rc + rd.high
subsubb  subtract and subtract with borrow  ra = borrow(rb − rc − rd.high), rb − rc − rd.high

It is assumed that numbers of arbitrary length are already loaded into the registers. For example, the addition of 256-bit numbers will occur as follows:

addc      a1, b1, c1      ; sum of lower parts, first carry-out
addaddc   a2, b2, c2, a1  ; sum of middles and carry-in, next carry-out
addaddc   a3, b3, c3, a2  ; sum of middles and carry-in, next carry-out
addaddc   a4, b4, c4, a3  ; sum of higher and carry-in, last carry-out
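The same carry chain can be modeled in a short Python sketch (the helper names mirror the instructions; each helper returns the pair that the text maps to the high (carry) and low (sum) halves of a 128-bit register):

```python
MASK64 = (1 << 64) - 1

def addc(b: int, c: int) -> tuple[int, int]:
    """addc: (carry-out, 64-bit sum) of two limbs."""
    s = b + c
    return s >> 64, s & MASK64

def addaddc(b: int, c: int, carry_in: int) -> tuple[int, int]:
    """addaddc: add two limbs plus the carry from the previous step
    (held in the high part of the previous 128-bit result)."""
    s = b + c + carry_in
    return s >> 64, s & MASK64

def mulh(b: int, c: int) -> int:
    """mulh: upper 64 bits of the 128-bit product of two 64-bit numbers."""
    return ((b & MASK64) * (c & MASK64)) >> 64

def add256(b: list[int], c: list[int]) -> tuple[list[int], int]:
    """Add two 256-bit numbers given as four little-endian 64-bit limbs,
    mirroring the addc/addaddc sequence above."""
    carry, limb = addc(b[0], c[0])
    out = [limb]
    for i in range(1, 4):
        carry, limb = addaddc(b[i], c[i], carry)
        out.append(limb)
    return out, carry
```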

§ 10.3. Software interrupts, system calls

The syscall (system call) instruction calls the operating system kernel to process a system request. The system call number is taken from r1, the arguments from subsequent registers.

Unlike interrupts, a system call is an analogue of a function call, with a similarly implemented return to the next bundle. Therefore, after the syscall instruction in assembler, you need to put a label to ensure that the subsequent instructions fall into a new bundle. Bits of future predication are cleared.

The register frame is rotated, and the return address is stored in register zero of the new frame. Subsequent local registers contain the syscall arguments.

The sysret (system return) instruction returns from the handler of a system request made with syscall. The instruction uses the return address and frame state from register zero.

Instruction format syscall, sysret
    bits 41..0: opcode | 0 | opx

The int (interrupt) instruction is provided for programmatically sending interrupts to the current core itself. The sent interrupt does not happen synchronously with the instruction stream; it can be delayed until the moment when its vector is unmasked. For a user-mode program with all interrupts unmasked, the sent interrupt happens synchronously with the instruction stream. The interrupt index is calculated as gr[src] + simm10. The instruction supports both styles of passing the interrupt code: hardcoded codes with the zero register gz, or a dynamically computed code.

Instruction format int
    bits 41..0: opcode | 0 | src | simm10 | opx

The rfi (return from interruption) instruction returns from the interrupt handler. It returns to the beginning of the bundle containing the interrupted incomplete instruction (in the case of an error), or to the bundle containing the subsequent instruction (in the case of a trap).

Instruction format rfi
    bits 41..0: opcode | 0 | opx

§ 10.4. Cipher and hash instructions

Table 10.2: AES/hash instructions
Instruction                      | Operation
aesdec          ra, rb, rc       | aes decrypt round
aesdeclast      ra, rb, rc       | aes decrypt last round
aesenc          ra, rb, rc       | aes encrypt round
aesenclast      ra, rb, rc       | aes encrypt last round
aesimc          ra, rb           | aes inverse mix columns
aeskeygenassist ra, rb, uimm8    | aes key generation assist
clmulll         ra, rb, rc       | carry-less multiply low parts
clmulhl         ra, rb, rc       | carry-less multiply high and low parts
clmulhh         ra, rb, rc       | carry-less multiply high parts
crc32c          ra, rb, rc, rd   | crc32c hash
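The clmul* instructions perform carry-less (polynomial, XOR-based) multiplication of selected 64-bit halves of the source registers. A minimal Python sketch of the underlying operation (the function name is illustrative):

```python
def clmul64(a: int, b: int) -> int:
    """Carry-less multiply of two 64-bit values: for each set bit of b,
    XOR in a shifted copy of a (addition without carry propagation)."""
    acc = 0
    for i in range(64):
        if (b >> i) & 1:
            acc ^= a << i
    return acc
```

For example, clmul64(0b11, 0b11) yields 0b101: the middle partial products cancel by XOR instead of producing a carry.
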
Instruction format aesenc, aesenclast, aesdec, aesdeclast, clmul
    bits 41..0: opcode | dst | src1 | src2 | 0 | opx
Instruction format aesimc
    bits 41..0: opcode | dst | src | 0 | 0 | opx
Instruction format aeskeygenassist
    bits 41..0: opcode | dst | src | round constant | opx
Instruction format crc32c
    bits 41..0: opcode | dst | prev | data | len | opx

The crc32c instruction computes the crc32c hash. The new hash value is based on the previous hash value «prev». The hashed data is in register «data». The len parameter may be any value; if it is bigger than 16, only 16 bytes of the data register are used.
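A sketch of the chained computation, assuming the conventional reflected CRC-32C (Castagnoli) polynomial; the instruction's exact bit-ordering and inversion conventions are not stated here, so treat this as illustrative:

```python
CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, reflected form

def crc32c(prev: int, data: bytes) -> int:
    """Bit-by-bit CRC-32C over at most 16 bytes of data, chaining
    from the previous hash value as the instruction description says."""
    crc = prev ^ 0xFFFFFFFF
    for byte in data[:16]:  # only 16 bytes are used, as stated above
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```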

§ 10.5. Random number generation instruction

The special random instruction is designed to generate random numbers. Executing it returns the next 64-bit random number. The instruction returns random numbers compliant with the «U.S. National Institute of Standards and Technology (NIST)» standards for random number generators.

Instruction format random
    bits 41..0: opcode | dst | src | 0 | 0 | opx

The src operand specifies the used random generator.

Instruction  | Source                                                 | NIST Compliance
rdrand (0)   | Cryptographically secure pseudorandom number generator | SP 800-90A
rdseed (1)   | Non-deterministic random bit generator                 | SP 800-90B & C (drafts)

The numbers returned by rdseed are referred to as "seed-grade entropy" and are the output of a true random number generator (TRNG), or an enhanced non-deterministic random number generator (ENRNG) in NIST-speak. rdseed is intended for software vendors who have an existing PRNG but would like to benefit from the hardware entropy source. With rdseed you can seed a PRNG of any size.

The numbers returned by rdseed have multiplicative prediction resistance. If you use two 64-bit samples with multiplicative prediction resistance to build a 128-bit value, you end up with a random number with 128 bits of prediction resistance (2^64 × 2^64 = 2^128). Combine two of those 128-bit values together, and you get a 256-bit number with 256 bits of prediction resistance (2^128 × 2^128 = 2^256). You can continue in this fashion to build a random value of arbitrary width, and the prediction resistance will always scale with it. Because its values have multiplicative prediction resistance, rdseed is intended for seeding other PRNGs.

In contrast, rdrand is the output of a 128-bit PRNG that is compliant with «NIST SP 800-90A». It is intended for applications that simply need high-quality random numbers. The numbers returned by rdrand have additive prediction resistance because they are the output of a pseudorandom number generator. If you put two 64-bit values with additive prediction resistance together, the prediction resistance of the resulting value is only 65 bits (2^64 + 2^64 = 2^65). To ensure that rdrand values are fully prediction-resistant when combined together to build larger values, you can follow the procedures in the «DRNG Software Implementation Guide» on generating seed values from rdrand, but it is generally best and simplest to just use rdseed for PRNG seeding.

The decision for which generator to use is based on what the output will be used for. Use rdseed if you wish to seed another pseudorandom number generator (PRNG), use rdrand for all other purposes. rdseed is intended for seeding a software PRNG of arbitrary width. rdrand is intended for applications that merely require high-quality random numbers.
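The arithmetic behind the two kinds of prediction resistance is easy to check directly:

```python
# Additive prediction resistance: concatenating two rdrand 64-bit
# samples only doubles the attacker's work (one extra bit).
additive = 2**64 + 2**64
assert additive == 2**65

# Multiplicative prediction resistance: combining two rdseed-derived
# 128-bit values multiplies the work factors.
multiplicative = 2**128 * 2**128
assert multiplicative == 2**256
```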

§ 10.6. CPU identification instructions

The cpuid instruction is used to dynamically identify which features of POSTRISC are implemented by the running processor. The implemented functional characteristics are recorded in a series of configuration information words. One configuration information word is read each time the cpuid instruction is executed. The number of the configuration word to be accessed is computed as gr[index] + sext(imm10). The 64-bit configuration information is written into the general register dst.

Instruction format cpuid
    bits 41..0: opcode | dst | index | simm10 | opx

Syntax:

cpuid ra, rb, simm10

The configuration information word contains a series of configuration bit fields. For example, the PALEN field, which occupies bits 11 through 4 of configuration word number 1 and records the number of supported physical address bits, is written as cpuid.1.PALEN[11:4].
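Reading such a field from a configuration word returned by cpuid is a plain shift-and-mask; a small Python sketch (the helper name is illustrative):

```python
def cpuid_field(word: int, hi: int, lo: int) -> int:
    """Extract bit field [hi:lo] from a 64-bit configuration word,
    e.g. cpuid.1.PALEN[11:4] would be cpuid_field(word1, 11, 4)."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)
```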

The configuration information accessible by the cpuid instruction is listed in the table below. A cpuid access to an undefined configuration word causes a general protection exception. Reserved fields in the defined configuration words read back as zero.

Word number | Bit field | Description
0           | 31:0      | number of implemented configuration words
1           | 47:32     | vendor
1           | 31:16     | version
1           | 15:0      | revision
1           | 63:0      | capabilities flags
2           | 63:0      | L1I info
3           | 63:0      | L1D info
4           | 63:0      | L2D info
5           | 63:0      | L3D info
6           | 63:0      | L1 ITLB
7           | 63:0      | L1 DTLB
8           | 63:0      | L2 TLB
9           | 63:0      | PMR info

§ 10.7. Instructions for the emulation support

Currently the OS and standard libraries are not implemented for the virtual processor. Therefore, a few special instructions have been added to provide minimal emulation of them.

The write instruction is for outputting a formatted string. It uses forward ip-relative addressing to address the format string. An unsigned 28-bit ip-relative offset gives a maximum distance of 256 MiB forward from the current position for a one-slot instruction, and all available address space for a long instruction. The write instruction allows the immediate value to continue into the next bundle slot, forming a dual-slot instruction. It is assumed that the effective address points to a zero-terminated string. In assembler, you can use both labels referring to strings in the rodata section and string literals directly (the assembler will place them in the rodata section and insert the offset into the instruction).

ea = ip + zext(disp)

Instruction format write
    bits 41..0: opcode | opx | disp (28 bits)

The following formatters are used to display the content of the current core registers. The common syntax is «%formatter(register)» or «%m(command)».

Table 10.5: write formatters
Formatter                    | Description
%%                           | the % character
%c                           | low part of general register as a 1-byte character
%i8, %i16, %i32, %i64        | low part of general register as a signed decimal value
%u8, %u16, %u32, %u64        | low part of general register as an unsigned decimal value
%x8, %x16, %x32, %x64, %x128 | low part of general register as an unsigned hexadecimal value
%b8, %b16, %b32, %b64        | low part of general register as a binary value
%f32, %f64, %f128            | low part of general register as a floating-point value
%vf32, %vf64                 | general register as a vector of floating-point values
%vi8, %vi16, %vi32, %vi64    | general register as a vector of signed decimal values
%vu8, %vu16, %vu32, %vu64    | general register as a vector of unsigned decimal values
%vx8, %vx16, %vx32, %vx64    | general register as a vector of hexadecimal values
%m(dump)                     | full core state dump

The halt instruction without parameters is intended to turn off the processor core, switching it to the deepest sleep level without saving state, from which the core may exit only by the reset signal. In the emulator, this instruction serves to shut down the emulator.

Instruction format halt
    bits 41..0: opcode | 0 | opx

Note: the halt instruction is not used in unit tests, because it is automatically added to each test source by the test scripts.

Chapter 11. Application Model (Application Binary Interface)

This chapter gathers information about the ABI – the application binary interface. It covers the runtime and program model, the sections and segments of a program, how a program finds its private data, the available addressing methods, the accepted conventions for relationships between procedures and register preservation, relocation types, and object file formats.

Depending on the execution environment (hardware capabilities of the target architecture, type of operating system), the possible program models are (in historical order of appearance and growing universality):

For the POSTRISC system, a combined SAS/MAS environment with implicit segmentation was selected, when each segment can be configured as SAS or MAS.

§ 11.1. Sections and segments

The compiler divides the different parts of the generated object code and data into different sections. During linking, when combining object files, sections with the same name come together and consolidate, getting an output file with one instance of each type of section. These sections of the output file are further grouped into several segments, which are processed by the loader as indivisible units.

The purpose of sections is to allow the compiler to generate separate pieces of code and data, which can be combined with other similar parts from other object files at the build stage. This makes it possible to achieve link locality, and confidence in the correct addressability of the contents of these sections. The most important section attributes are the type of access to the section pages; all data in one section shares the same minimum set of permissions.

The purpose of the segments is to allow the linker to group sections into fewer program units. Each segment has unique addressing methods for it, and sections of one segment are addressed in the same way. The compiler may make assumptions that any two objects in the same segment have a fixed offset relative to each other when the program is executed, but cannot assume the same for two objects in different segments.

The runtime architecture also defines some additional segments which do not get their contents directly from the compiled object file. These segments – the heap, stack, and shared memory segments – are created at program startup or dynamically at runtime.

Table 11.1: Standard scheme of a software module (using ELF format as an example)
Segment             | Section      | Type of program | Description
TEXT                | header       | all             | file header
TEXT                | sectab       | all             | section header table
TEXT                | shstrtab     | all             | section names
TEXT                | .dynamic     | shared          | dynamic linking information header
TEXT                | .liblist     | shared          | list of the names of the required shared libraries
TEXT                | .rel.dyn     | shared          | relocations for DATA process data
TEXT                | .rel.tdata   | shared          | relocations for TDATA thread data
TEXT                | .conflict    | shared          | additional dynamic linking information
TEXT                | .msym        | shared          | additional dynamic linking symbol table
TEXT                | .dynstr      | shared          | names of linked external functions
TEXT                | .dynsym      | shared          | link table of external functions
TEXT                | .hash        | shared          | hash table for quick search in the export table
TEXT                | .rconst      | all             | read-only constants (no configuration)
TEXT                | .rodata      | all             | immutable global data (set at first load into the system)
TEXT                | .litanon     | shared          | literal address pool section
TEXT                | .lit         | all             | literals (literal pool section)
TEXT                | .tlsinit     | all             | initial copy of TDATA data
TEXT                | .pdata       | all             | exception procedure table
TEXT                | .text        | all             | main program code (not patched during loading; may be configured at first load into the system)
TEXT                | .init        | all             | program initialization code section
TEXT                | .fini        | all             | program termination code section
TEXT                | .comment     | all             | comment section
TEXT (not loadable) | rsrc         | all             | compiled resources
TEXT (not loadable) | line         | all             | debug information
TEXT (not loadable) | debug        | all             | debug information
TEXT (not loadable) | unwind       | all             | table for stack unwinding after exceptions
TEXT (not loadable) | unwind_info  | all             | blocks of information for stack unwinding after exceptions
DATA                | .data        | all             | initialized private process data (set at load)
DATA                | .xdata       | all             | exception scope table
DATA                | .sdata       | all             | near-addressed small initialized private process data (set at load)
DATA                | .got         | shared          | GOT (Global Offset Table) for references to DATA variables of other modules
DATA                | .sbss        | all             | small uninitialized private process data
DATA                | .bss         | all             | uninitialized private process data
TDATA               | .tdata       | all             | initialized thread-local data (set at load)
TDATA               | .tsdata      | all             | near-addressed small initialized thread-local data (set at load)
TDATA               | .tgot        | shared          | GOT table of the module for the thread (links to TDATA variables of other modules)
TDATA               | .tsbss       | all             | small uninitialized thread-local data
TDATA               | .tbss        | all             | uninitialized thread-local data

A program in the POSTRISC architecture consists of a main program module, dynamically loaded libraries (the same program modules), stacks of the main and other threads, several heaps. Each program module consists of four types of sections.

The TEXT segment is shared by all processes in the system and is read-only and executable. The addressing within the segment is relative to the instruction pointer. Its CODE section contains program code. Its RODATA section contains immutable data, placed after the CODE section.

The DATA segment contains private process data. The segment is read-write. The addressing within the segment is relative to the instruction pointer. The DATA segment of the main software module, in addition to its private data, contains a table of base addresses for all DATA segments of dynamically loaded libraries.

The TDATA segment contains private thread-local data. The segment is read-write. After creation, the segment is at an unknown distance from all other segments and is addressed relative to the dedicated base register tp. The TDATA segment of the main software module, in addition to its private data, contains a table of base addresses for the TDATA segments of all dynamically loaded libraries.

§ 11.2. Data model

There are several data models for binding fundamental integer scalar data types from programming languages to architectural data types.

Table 11.2: Dimensions of fundamental types
Data model | 1-byte | 2-byte                                  | 4-byte                  | 8-byte
ILP16      | char   | short int, int, long int, near pointer |                         |
LP32       | char   | short int, int, near pointer           | long int, far pointer   |
ILP32      | char   | short int                              | int, long int, pointer  | long long int
LLP64      | char   | short int                              | int, long int           | long long int, pointer
LP64       | char   | short int                              | int                     | long int, long long int, pointer
ILP64      | char   | short int                              | wchar_t                 | int, long int, long long int, pointer

The ILP16 variant was used by very old 16-bit systems; LP32 was used by MS-DOS; ILP32 is used by all 32-bit systems; LLP64 was chosen by Microsoft for 64-bit Windows; LP64 was selected for 64-bit Linux and most other 64-bit Unix systems; ILP64 is used in some versions of Unix systems.

The choice between LLP64, LP64, and ILP64 is determined by different criteria. If you need to support (without recompiling) an existing body of 32-bit software when migrating to 64-bit systems, then LLP64 is the best choice. The disadvantage: porting to 64 bits requires a deep modernization of the program. If you want the existing code base to take advantage of 64-bit addressing with minimal rework, then ILP64 is a good fit. The disadvantage: such a superficial code upgrade wastes memory where 64 bits are not needed. If you follow a balanced approach between the complexity of converting to 64-bit systems and the need to support existing 32-bit programs, then choose LP64. ILP64 was chosen for POSTRISC, with the addition of the new fundamental type long char to describe four-byte numbers (wchar_t).

Table 11.3: Binding to fundamental types
Data Type                    | Size (alignment) | Machine Type
signed char                  | 1 (1)            | signed byte
unsigned char                | 1 (1)            | unsigned byte
char                         | 1 (1)            | byte, the sign depends on the compiler
bool                         | 1 (1)            | unsigned byte, 0 or 1
[signed] short int           | 2 (2)            | signed 2-byte
unsigned short int           | 2 (2)            | unsigned 2-byte
[signed] long char           | 4 (4)            | signed 4-byte
unsigned long char           | 4 (4)            | unsigned 4-byte
enum                         | 1, 2, 4, 8       | depends on the range of values
[signed] int                 | 8 (8)            | signed 8-byte
unsigned int                 | 8 (8)            | unsigned 8-byte
[signed] long int            | 8 (8)            | signed 8-byte
unsigned long int            | 8 (8)            | unsigned 8-byte
[signed] long long int       | 8 (8)            | signed 8-byte
unsigned long long int       | 8 (8)            | unsigned 8-byte
data pointer: type *         | 8 (8)            | unsigned 8-byte
function pointer: type (*)() | 8 (8)            | unsigned 8-byte
float                        | 4 (4)            | IEEE single
double                       | 8 (8)            | IEEE double
long double                  | 16 (16)          | IEEE quadruple

Aggregate data types (structures – struct, class – and arrays) and unions (union) are aligned to their most strictly aligned component. The size of any object, including aggregates and unions, is always a multiple of the object's alignment. An array uses the same alignment as its elements. Structure and union objects may require padding to meet size and alignment restrictions. The content of any padding is undefined.
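These rules can be sketched as a small layout calculator (a hypothetical helper, assuming each field is given as a (size, alignment) pair):

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def layout(fields: list[tuple[int, int]]) -> tuple[list[int], int, int]:
    """Return (field offsets, total size, alignment) of an aggregate:
    each field is padded to its own alignment, the aggregate takes the
    strictest field alignment, and its size is rounded up to a multiple
    of that alignment."""
    offset, offsets, max_align = 0, [], 1
    for size, align in fields:
        offset = align_up(offset, align)
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    return offsets, align_up(offset, max_align), max_align
```

For example, a struct of a char followed by a long int (sizes 1 and 8 under ILP64) lays out at offsets 0 and 8, with size 16 and alignment 8 – seven bytes of undefined padding.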

C structures and unions can contain bit fields that define integer objects with a specified number of bits. The table shows the permissible bit-field widths for each base type, and the corresponding limits.

Table 11.4: Binding of bit fields to fundamental types
Data Type                    | Field Width W | Limits
char, signed char            | 1-8           | −2^(W−1) … 2^(W−1) − 1
long char, signed long char  | 1-16          |
short, signed short, enum    | 1-32          |
int, signed int              | 1-64          |
long, signed long            | 1-64          |
long long, signed long long  | 1-64          |
unsigned char                | 1-8           | 0 … 2^W − 1
unsigned long char           | 1-16          |
unsigned short               | 1-32          |
unsigned int                 | 1-64          |
unsigned long                | 1-64          |
unsigned long long           | 1-64          |

Bit fields whose base type (with the exception of enumerated types) is declared without an explicit signed or unsigned keyword are considered unsigned (fixme). Bit fields of enumerated types are considered signed, unless an unsigned type is needed to represent all constants of the enumeration type. Bit fields obey the same size and alignment rules as other structure or union members, with the following additions:

Bit fields of types int and long (signed and unsigned) are usually packed more densely than those of smaller base types (fewer restrictions on crossing base-type boundaries). You can use bit fields of types char and short to force placement within those types, but int is generally more efficient.

§ 11.3. Reserved registers

Although all 128 general-purpose registers are physically equal (except for the difference between global and rotated registers, and some other differences), the application binary interface reserves several general-purpose registers for its own (special) purposes. Unlike real special-purpose registers, these registers are special only in the sense that the program is obliged to use them only in an authorized way. The choice of numbers for these registers is (almost) arbitrary and not part of the architecture.

The initial contents of the registers sp and tp are set by the loader at process/thread start and should be changed by the program only according to the ABI rules. The contents of register sp must always correctly reflect the state of the stack and be aligned to the strictest boundary for the base types – 16 bytes. Register r0 must contain the return info when a procedure is called.

Table 11.5: Dedicated General Purpose Registers
Register | Content
r0       | link pointer – return address from the procedure. The called procedure receives the return address in the first register of the new frame of local registers, register r0.
sp       | stack pointer – pointer to the top of the stack.
tp       | thread pointer – pointer to the beginning of thread-local data for the main (static) module. Used by load/store instructions and ldan only inside the main module.

§ 11.4. Position independent code and GOT

The code segment must not contain relocations (PIC). To create a PIC, the compiler must:

  1. Use only ip-relative branches for all internal branches, rather than branches to absolute addresses.
  2. Similarly, do not use absolute references to static data; instead use addressing with an offset relative to some standard base register. If the code and data segments are guaranteed to be located at a known distance from each other (MAS), then a function from a shared library can calculate the corresponding base address using ip. Otherwise (SAS), the caller must set the base register as part of the call sequence.
  3. Use an additional level of indirection for each control transfer outside the monolithic PIC segment, and for each access to static memory outside the corresponding data segment. Indirection allows non-PIC target addresses to be kept in the DATA segment, private to each instance of the program.

Position-independent code cannot contain absolute addresses directly in the instruction code; it addresses both data and code with offsets relative to the instruction counter. Data-binding-independent code uses ip-relative offsets for code addressing, but cannot address private data the same way – only relative to base registers.

The Global Offset Table (GOT) stores absolute addresses and is part of the process's private data, which makes addresses accessible without violating positional independence and sharing of program code. Each program module refers to its GOT table in a position-independent manner and extracts absolute addresses from it. So position-independent links are converted to absolute positions.

Initially, the GOT contains information about relocation points (annotations for the dynamic linker). After the system creates memory segments for the loaded object file, the dynamic linker processes the relocation points, some of which refer to the GOT. The dynamic linker determines the symbolic names associated with them, calculates their absolute addresses, and sets the appropriate values in the corresponding GOT entries. Although the absolute addresses are unknown to the link editor when it builds the object file, the dynamic linker knows the addresses of all memory segments and can therefore calculate the absolute addresses of the objects contained in them.

If the program requires direct access to the absolute address of the object, this object will have an entry in the GOT. Since the executable file and each shared object have separate GOTs, the address of a symbolic name may appear in several tables. The dynamic linker processes all GOT relocations before transferring control to the process code, which guarantees the availability of absolute addresses at runtime.
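Conceptually, the dynamic linker's GOT processing can be modeled like this (a simplified Python sketch; the names and data structures are illustrative, not an actual linker interface):

```python
def resolve_got(relocations, symbol_addresses):
    """Fill a module's GOT: each relocation names a symbol plus an
    addend, and the linker writes the computed absolute address."""
    return [symbol_addresses[name] + addend for name, addend in relocations]

def load_via_got(got, index):
    """What a position-independent load through the GOT yields at
    runtime: the absolute address stored in the chosen entry."""
    return got[index]
```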

Thanks to the GOT, the system can select different memory segment addresses for the same shared object in different programs. It can even choose different library addresses for different executions of the same program. At the same time, memory segments do not change addresses once the process image is established. As long as the process exists, its segments are located at fixed addresses.

Short summary: if the program has several data segments (private or shared), they are accessed indirectly through the GOT address table. The GOT table is part of one selected segment – the DATA private data segment. However, objects in the DATA segment itself can also be addressed indirectly through the GOT table in DATA (for example, if the relative offset is too large to fit in the instruction).

§ 11.5. Program relocation

A relocation, or unresolved link, is a place in the code or static data reserved by the compiler for a later-computed value – substituted at compile time or even later at load time – either containing no data (a field of zero bits) or containing incomplete information (an additional addend may be stored for calculating the resolved link). Usually an immediate value is stored at the relocation site: an absolute address, or an offset relative to a base address or the bundle counter, used for memory access or address calculations.

There are as many basic relocation types as there are distinct ways in the processor architecture to place an immediate value into the code of a machine instruction (not counting the constants describing shifts and some other constants too short to be used for relocation) or into a data object. The link editor uses these unfilled (under-computed) places at the link stage to embed into the previously compiled code its information about the links in the program between individual segments, sections, object modules, and dynamically linked executable modules.

The compiler creates (for later use by the linker) a table of relocation records as part of the object file. Relocation records describe how the linker (or later the loader) should modify an instruction or data field.

The following distinct data relocation types are defined for data sections in the POSTRISC architecture:

  1. RELOC_WORD. A 4-byte boundary aligned 32-bit field in any data section.
  2. RELOC_DWORD. An 8-byte boundary aligned 64-bit field in any data section.

For code in the POSTRISC architecture (following the format of single-slot instructions and their extensions into the second slot), the following distinct code relocation types are defined:

  1. RELOC_LDI. A 28-bit signed immediate (or 64 bits for a long instruction) is embedded in the ldi instruction.
     slot 1, bits 41..0:  opcode | other | simm (28 bits)
     slot 2, bits 83..42: 0 | extended simm (64 bits instead of 28)
  2. RELOC_JUMP. A signed 28-bit constant (or 60 bits for a long instruction) for an ip-relative offset in the text (or rodata) program segment, embedded in the callr, jmp or ldar instructions (distance ±2 GiB, or ±8 EiB for long instructions).
     slot 1, bits 41..0:  opcode | other | simm (28 bits)
     slot 2, bits 83..42: 0 | extended simm (60 bits instead of 28)
  3. RELOC_BRANCH. A signed 17-bit immediate (or 30 bits for a long instruction) for an offset in the code segment, embedded in a compare-and-branch-like instruction as a branch distance (distance ±1 MiB, or ±8 GiB for a long instruction).
     slot 1, bits 41..0:  opcode | other | simm (17 bits)
     slot 2, bits 83..42: other | extended simm (30 bits)
  4. RELOC_BINIMM. A signed 21-bit constant (or 63 bits for a long instruction) for instructions ld1, lds1, addi, subfi and others.
     slot 1, bits 41..0:  opcode | other | simm (21 bits)
     slot 2, bits 83..42: extended simm (63 bits instead of 21)
  5. RELOC_BINIMMU. An unsigned 21-bit constant (or 63 bits for a long instruction) for instructions maxui, minui.
     slot 1, bits 41..0:  opcode | other | uimm (21 bits)
     slot 2, bits 83..42: extended uimm (63 bits instead of 21)
  6. RELOC_BRCI_SIMM. An 11-bit signed constant (or 40 bits for a long instruction) embedded in a compare-with-constant-and-branch-like instruction as the constant to be compared.
     slot 1, bits 41..0:  opcode | other | simm11 | other
     slot 2, bits 83..42: extended simm (40 bits instead of 11) | other
  7. RELOC_BRCI_UIMM. An 11-bit unsigned constant (or 40 bits for a long instruction) embedded in a compare-with-constant-and-branch-like instruction as the constant to be compared.
     slot 1, bits 41..0:  opcode | other | uimm11 | other
     slot 2, bits 83..42: extended uimm (40 bits instead of 11) | other

Instruction-dependent basic relocation types further differ in different ways of forming the implemented constant and conditions checked in this case. The set of these methods depends on the program model used and additional features of the instruction set architecture.

For example, the Intel X86 architecture has only one basic type of code relocation (in 32-bit mode) – the 4-byte field as part of the instruction, for which there are only two ways of forming the constant – just an absolute address for the data or ip-relative address for the code.

The POSTRISC system is oriented towards a position-independent code. In addition, the special instructions ldan, ldar, designed to optimize relative addressing, require special support from the linker. Hence a large selection of possible ways of referencing an object of code or data depending on the location of the object, the remoteness of the object from the base of relative addressing, the presence of indirect links through the GOT table, the number of repeated calls to the same object or objects near it.

The method of referencing an object at the place of relocation (the method of converting a symbol name into an embedded constant) is usually written in assembler as a call to a special function when computing the constant parameter of an instruction, or as a suffix added to the name of an object. It should be understood that this is not actually a function call, but a mark for the linker (hello to him from the compiler) specifying exactly how to construct the relocation constant from the name of the object. The set of relocation methods used depends on the machine instruction architecture (whether, for example, long constants are synthesized from parts split between several instructions, or several typical program sizes with different addressing methods are provided) and on the selected program model (absolute code for the system kernel, or a user program). The set of POSTRISC assembler functions below is generally traditional for the 64-bit PIC program model and is found (with some variations) in all 64-bit architectures: DEC Alpha, SGI MIPS, IBM PowerPC, Sun UltraSPARC, Intel Itanium.

Table 11.6: Assembler functions to set the relocation method
Group (scope)                          Function (method)   How the address is obtained from the offset at runtime
Absolute addresses (for data only)     symbol              symbol
                                       expr                symbol + offset
                                       got(symbol)         mem8[offset]
Private process data                   pcrel(expr)         ip + offset
                                       ltoff(expr)         mem8[ip + offset]
Thread local data (main program)       tprel(expr)         tp + offset
                                       @tprel@got(expr)    mem8[tp + offset]
Private thread data (dynamic modules)  dtprel(expr)        mid = mem4[gp + mid_offset]
                                                           local_tp = mem8[dtv + mid]
                                                           ea = local_tp + offset
Support for the ldan instruction       data_hi(expr)       base + (offset << 15)
(all data types)                       data_lo(expr)       base + offset
Support for the ldar instruction       text_hi(expr)       ip + 16 × offset
(all data types)                       text_lo(expr)       base + offset
Miscellaneous functions                segrel(expr)        segbase + offset
                                       secrel(expr)        secbase + offset

A symbol name symbol by itself denotes the absolute address of the object symbol. The expression expr denotes a formula combining the absolute address of an object and a constant offset: symbol + offset. Absolute addresses are not computed at runtime; they are used as is. Absolute addresses can be embedded in instruction code only in an absolute program (system kernel, drivers).

The function got(symbol) (global offset table) denotes the absolute address of the GOT table entry used for indirect access to the object symbol. It is also a request to create a GOT record for the object symbol if no such record exists yet.

The got function cannot be used by itself, only together with pcrel or tprel, e.g. as @pcrel@got(expr), since the GOT table is split in two (depending on whether the referenced objects are per-process or per-thread) and is part of the DATA and TDATA segments, and therefore must be addressed accordingly.

The function pcrel(expr) (program counter relative) denotes the offset relative to the instruction counter. The absolute address of the object is computed at runtime as ip + offset. It is used to access code and/or static data of the same module.

The function tprel(expr) (thread pointer relative) denotes the offset relative to the base register tp when addressing thread-private data. The absolute address of the object is computed at runtime as tp + offset. The object expr must belong to the TDATA segment of the main module.

The function dtprel(expr) (dynamic thread pointer relative) denotes the offset of the object expr relative to the beginning of this module's thread-private data dtv[ModID] (taken from the dtv array). The absolute address of the object is computed at runtime as dtv[ModID] + offset. The object expr must belong to the TDATA segment of the module ModID itself.

The function data_lo(offset) describes the lower 15-bit part of the offset: data_lo(offset) = sign_extend(offset, 15). For position-independent programs the offset is usually computed as gprel(expr) or tprel(expr), depending on the location of the object expr. It is used for addressing relative to an intermediate base address computed earlier with the ldan instruction. This intermediate address can be reused for references to the object or its nearest neighbors with short load/store instructions (whose offsets are of minimum length, no more than 16 bits).

The function data_hi(offset) describes the upper part of the offset: data_hi(offset) = (offset − sign_extend(offset, 15)) >> 15. It is used by the ldan instruction to compute an intermediate base address before short load/store instructions are used; the computed address lies no further than 16 KiB from the relatively addressed object expr. A long (over 32 KiB) offset relative to the base register is split into two parts, offset_hi and offset_lo, so that offset = (offset_hi << 15) + offset_lo: the lower part offset_lo is always placed in a 16-bit signed constant, and the upper part offset_hi becomes the argument of the ldan instruction that computes the intermediate base address.
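The data_hi/data_lo split above can be checked with a short Python sketch (for illustration only); the key invariant is that the two parts always reconstruct the original offset:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed field."""
    mask = (1 << bits) - 1
    v = value & mask
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def data_lo(offset: int) -> int:
    """Lower, sign-extended 15-bit part of the offset."""
    return sign_extend(offset, 15)

def data_hi(offset: int) -> int:
    """Upper part: what remains after removing the signed low part."""
    return (offset - data_lo(offset)) >> 15

# The reconstruction performed by the ldan + load/store pair:
for offset in (0, 100, 0x12345, -0x9876, 1 << 30):
    assert (data_hi(offset) << 15) + data_lo(offset) == offset
```

Because data_lo is sign-extended, data_hi points at the 32-KiB-granular base nearest to the object, not simply at the truncated upper bits.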

Objects in the TEXT segment (the RODATA section with read-only data) must be addressed in a position-independent program relative to the instruction counter ip. However, the ldar instruction can produce only the starting address of a 16-byte bundle, that is, the 16-byte-aligned address closest to the target. The subsequent memory-access instructions must account for the short offset (0 to 15 bytes) from this starting address to the object.

The function text_hi(expr) denotes the part of the ip-relative offset within the TEXT segment that the ldar instruction uses to compute the 16-byte-aligned absolute address closest to the object. The ldar instruction computes ip + 16 × text_hi(symbol), where text_hi(symbol) is computed by the assembler as (symbol − text_lo(symbol)) >> 4. The paired function text_lo(symbol) describes the lower part of the ip-relative offset as sign_extend(symbol, 4), that is, the difference between the address of the object and the nearest 16-byte boundary. This value is used for direct addressing in load/store instructions after the intermediate address of the 16-byte bundle has been computed with the ldar instruction.
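The same split can be sketched in Python (illustrative, with an arbitrary example address): ldar lands on a 16-byte boundary, and text_lo supplies the residual signed distance to the object:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed field."""
    mask = (1 << bits) - 1
    v = value & mask
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def text_lo(offset: int) -> int:
    """Signed distance from the nearest 16-byte bundle boundary."""
    return sign_extend(offset, 4)

def text_hi(offset: int) -> int:
    """Bundle-granular part of the ip-relative offset."""
    return (offset - text_lo(offset)) >> 4

ip = 0x10000
target = 0x10F37                  # arbitrary ip-relative target (assumption)
off = target - ip
bundle = ip + 16 * text_hi(off)   # address computed by ldar
assert bundle % 16 == 0           # ldar always yields an aligned bundle
assert bundle + text_lo(off) == target
```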

The function segrel(expr) (segment relative) describes the offset of the object expr relative to the start of its segment. This relocation is intended for data structures that are placed in read-only shared segments but must contain pointers. In this case the relocation point and the relocation object must be located in the same segment. Applications using such relative pointers must be aware of their relativity and add the segment base address to them at runtime.

The function secrel(expr) (section relative) describes the offset of the object expr relative to the beginning of its section. This relocation is intended for references from one section to another within the same segment.

As a result, combining the type of relocation and the method of linking to an object, we get a complete set of all valid types of unresolved links, which the linker should be able to handle (minus some never-seen combinations).

Table 11.7: Types of Relocation Entries
Group                       Name                    Relocation method
absolute addressing         R_ADDR_WORD             sym + addend
(data only)                 R_ADDR_DWORD            sym + addend
relative to ip              R_PCREL_JUMP            pcrel(sym + addend), jump/call
                            R_PCREL_JUMP_EXT        pcrel(sym + addend), jump/call
                            R_PCREL_BRANCH          pcrel(sym + addend), compare-and-branch
                            R_PCREL_BRANCH_EXT      pcrel(sym + addend), compare-and-branch
                            R_PCREL_LDAR            text_hi(pcrel(sym + addend)), ldar
                            R_PCREL_LDAR_EXT        text_hi(pcrel(sym + addend)), ldar
section-base relative       R_SECREL_WORD           sym - SC + addend, .mem4
                            R_SECREL_DWORD          sym - SC + addend, .mem8
segment-base relative       R_SEGREL_WORD           sym - SB + addend, .mem4
                            R_SEGREL_DWORD          sym - SB + addend, .mem8
base-relative               R_BASEREL_LDI           L(sym - base + addend)
                            R_BASEREL_LDI_EXT       L(sym - base + addend)
                            R_BASEREL_BINIMM        sym - base + addend
                            R_BASEREL_BINIMM_EXT    sym - base + addend
dynamic layout?             R_SETBASE               set base
                            R_SEGBASE               set SB
                            R_COPY                  dyn reloc, data copy
                            R_IPLT                  dyn reloc, imported PLT
                            R_EPLT                  dyn reloc, exported PLT
tp-relative                 R_TPREL_WORD            tprel(sym + addend), .mem4
                            R_TPREL_DWORD           tprel(sym + addend), .mem8
                            R_TPREL_LDI             tprel(sym + addend), LDI
                            R_TPREL_LDI_LONG        tprel(sym + addend), LDI
                            R_TPREL_HI_BINIMM       data_hi(tprel(sym + addend))
                            R_TPREL_HI_BINIMM_EXT   data_hi(tprel(sym + addend))
                            R_TPREL_LO_BINIMM       data_lo(tprel(sym + addend))
                            R_TPREL_BINIMM          tprel(sym + addend), load/store
                            R_TPREL_BINIMM_EXT      tprel(sym + addend), load/store

The assembler syntax must be consistent with the set of types of unresolved references that the linker can handle.

For example, almost no assembler/compiler can account for and handle the subtraction of two addresses from the same segment as an immediate constant, although it is one. At compile time the result of this subtraction is still unknown, and at link time, when it could be determined, the corresponding relocation types are not provided, so the task cannot be posed to the linker. As a result, the compiler is forced to defer these calculations to program load time or run time.

The most «advanced» compilation/linking systems support the ability to postpone to the link stage unresolved references of arbitrary complexity, provided they reduce to a constant result.

§ 11.6. Thread local storage

Managing thread local storage (TLS), which is private to a thread, isn't as simple as managing per-process private data. TLS sections cannot simply be loaded from a file into memory and made available to the program. Instead, multiple copies must be created (one per thread), each initialized from the primary image of the TLS section in the program file. New threads may continue to be created dynamically throughout the lifetime of the program.

TLS support should avoid creating TLS data blocks if possible, for example, using deferred memory allocation on the first request (first attempt to access TLS). Most threads will probably never use private data of all dynamic modules at once. Unfortunately, the mechanism of deferred memory allocation requires at least introducing a separate functional level (layer) to control access to TLS objects, which may be too inefficient.

The problem is the very process of compiling TLS data and accessing it when there are many copies of it. The TLS variable is characterized by two parameters: a reference to the TLS block of a particular dynamic module and an offset within this block. To get the address of a variable, you need to somehow map these two parameters to the virtual address space at runtime.

The traditional TLS mapping approach is as follows. One of the general registers (tp or thread pointer) permanently stores the address of the static TLS block of data associated with the current thread. The data block is conditionally divided into two parts: a statically allocated single TLS data block of the main module (exe file) and the dtv vector (dynamic thread vector), storing addresses of dynamically (possibly lazy) dedicated TLS blocks for dynamically loaded dynamic modules. If the dynamic module is loaded into the program, then it is allocated one slot (a place to store the address) in the dtv vector.

Knowing its number mid, a dynamic module can find the beginning of its TLS data for the current thread at dtv[mid], i.e. MEM(tp + mid + offset), where offset is the position of dtv relative to tp (usually 0). The address of a variable is then dtv[mid] + var_offset, where var_offset is the position of the variable within the dynamic TLS block.

General dynamic model TLS (general dynamic) is the most universal. The code compiled for it can be used anytime, anywhere, and it can access TLS variables defined anywhere. For example, from one dynamic module, access the TLS variable in another dynamic module. By default, the compiler generates code for this model, and can use more limited TLS models only when explicitly allowed by the compiler options.

In this model, neither the number (slot) of the module containing a TLS variable nor the offset inside that module's TLS block is known at build time (let alone at compile time). The module number (ModuleID) and the offset in the TLS block are determined only at runtime (taken from the GOT table, where the loader writes them) and passed to a special function __tls_get_addr (the standard name on many Unix systems), which checks whether the TLS block exists, creates it if it does not, and returns the address of the variable for the current thread. The implementation of this function is itself a problem requiring OS assistance.

addr1 = __tls_get_addr(GOT[ModuleID], GOT[offset1])
addr2 = __tls_get_addr(GOT[ModuleID], GOT[offset2])

The code size and runtime cost are such that it is best to avoid this model altogether. If the module number and/or offset are known, optimization or simplification is possible.

Local dynamic model TLS (local dynamic) is an optimization of the general dynamic model. The compiler uses this model when it knows that TLS variables are used in the same module in which they are defined. Now the variable offsets (at least within this module's own TLS block) are known at link time; the module number is still unknown. The function __tls_get_addr still has to be called, but only once (with offset 0) to determine the start address of the module's «own» block of TLS variables. The addresses of individual variables are then obtained simply by adding known offsets.

addr0 = __tls_get_addr(GOT[ModuleID], 0)
addr1 = addr0 + offset1
addr2 = addr0 + offset2

Dynamic models using the __tls_get_addr function allow lazy allocation of memory for TLS data at the first request to the block.
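The two dynamic models can be sketched with a small Python model of __tls_get_addr (names taken from the text; the data layout and the dict-based dtv are illustrative assumptions, since Python has no raw pointers the function returns a block/offset pair):

```python
# Per-module TDATA images from the program file (module ids assumed).
tls_templates = {1: b"\x00" * 64, 2: b"\x07" * 32}
dtv = {}                                   # current thread's dynamic thread vector

def tls_get_addr(module_id: int, offset: int):
    """Return the (block, offset) 'address' of a TLS variable, lazily
    allocating this thread's copy of the module's block on first use."""
    if module_id not in dtv:               # lazy, first-touch allocation
        dtv[module_id] = bytearray(tls_templates[module_id])
    return dtv[module_id], offset

# General dynamic: both module id and offset come from the GOT.
block, off = tls_get_addr(1, 16)

# Local dynamic: one call with offset 0, then add link-time-known offsets.
base_block, base = tls_get_addr(2, 0)
addr1 = base + 8                           # offset1 == 8 (assumed)
addr2 = base + 24                          # offset2 == 24 (assumed)
```

The local dynamic variant amortizes the expensive call across all variables of the module, which is exactly the optimization described above.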

The static load model TLS (initial exec) assumes that a certain set of dynamic modules is always loaded together with the main program. The loader can then compute the total size of the TLS blocks of all such modules and their positions within a single combined TLS block. The individual modules' TLS blocks lie at fixed distances from the beginning of this combined block, which the loader computes and stores in the GOT table. To compute the address of a TLS variable it is no longer necessary to call __tls_get_addr, only to read a GOT record, and the module number need not be known. The combined block is allocated immediately when a new thread starts (no lazy allocation). If this model is used for a dynamic library, the library cannot be loaded dynamically, only statically. Addressing is relative to the dedicated register tp with an offset known at load time (taken from the GOT).

addr = tp + GOT[offset]

The local static model TLS (local exec) is obtained by combining the local dynamic model with the initial exec model: static loading and local references (no dynamically linked modules). The main module of the program refers to TLS variables defined in itself. Addressing is relative to the dedicated register tp with an offset known at link time.

addr = tp + offset

The compiler usually (when object modules are compiled separately, when libraries are created) doesn't have full information about the future program as a whole. It is forced to make the most conservative assumptions about the nature of the future program, which usually means using the most general mechanisms for addressing private data. For TLS this is the general dynamic addressing model.

It is therefore important that the linker, when assembling the final program, can optimize and modify previously compiled object modules, replacing the addressing method used for some variables with another (optimized) one. For this, at a minimum, the linker must know such places (unresolved references to TLS sections), and the compiler must generate the addressing code so that it can be replaced by another. This requires the different TLS addressing methods to be equivalent in code size and in the number and types of registers used.

If the optimized version is shorter than the original, the replacement may leave empty space, which is filled with dummy nop instructions. If the optimized version is longer than the original, the compiler must add the dummies in advance to allow the future replacement by the linker with an optimized addressing variant.

§ 11.7. Modules and private data

The POSTRISC system is focused on sharing code and translation tables. It should be possible to replace shared libraries without recompiling the applications that depend on them. Any software module can be used by several processes, and there should be no difference between application code and shared library code. For addressing code and global data, addressing relative to the instruction pointer is used, with software redirection to the corresponding regions of private process/thread data.

A single address range is used to map the code sections of all program modules. This address range is shared by all processes and is executable only. For each process, another address range is allocated for the static process data sections of all program modules. For each thread, yet another address range is allocated for the static thread data sections of all program modules. All three types of address ranges have the same size, a power of two, and are aligned to the same boundary.

For each program module, the following three values must be equal: the offset from the beginning of the code range to the beginning of the module's code section; the offset from the beginning of the process's private DATA range to the beginning of its DATA section; and the offset from the beginning of the thread's private TDATA range to the beginning of its TDATA section. Knowing only ip and the base address of the private range (stored in the dedicated registers gp and tp for DATA and TDATA, respectively), the location of position-independent private data can always be computed by the formula:

base = gp | ip{gtssize−1:0}

or indirectly

base = mem[gp + ip{gtssize−1:0} >> tgsize]

Private static data can easily be found by library code. It is not necessary to explicitly pass the correct address of the data segment of the module of the new gp (global pointer) when called through the border of a module or when called through a pointer to a function. A pointer to a function becomes just a pointer to a place in a code segment, without additional levels of indirect access via function descriptor.
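The direct form of the base computation can be sketched in Python. The range size (and hence gtssize, assumed here to be 32 bits for illustration) and the sample addresses are assumptions, not architectural values:

```python
GTSSIZE = 32                        # assumed log2 of the range size
LOW_MASK = (1 << GTSSIZE) - 1

def data_base(gp: int, ip: int) -> int:
    """Combine the module's offset within its range (low bits of ip)
    with the aligned base of the private data range held in gp."""
    return gp | (ip & LOW_MASK)

gp = 0x7F00_0000_0000_0000          # private data range of this process (example)
ip = 0x4000_0000_1234_5600          # somewhere in the shared TEXT range (example)
assert data_base(gp, ip) == 0x7F00_0000_1234_5600
```

Because the ranges are equally sized, power-of-two aligned, and module offsets coincide across ranges, a simple OR suffices instead of an add.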

Table 11.8: Sample map for several modules
Each row below is one address range, divided into address granules 0–39. Loaded modules m1–m6 occupy the same granule positions in every range («-» marks unused granules), so the occupancy pattern is shown once:
    granule occupancy: m1 - m2 m3 - m4 m5 m6 -
Ranges: TEXT (global); process A DATA; threads A1, A2 TDATA; process B DATA; threads B1, B2, B3 TDATA; process C DATA; threads C1, C2, C3 TDATA; process D DATA; thread D1 TDATA; process E DATA; threads E1, E2, E3 TDATA.

When a program module is loaded dynamically or statically, the loader first determines whether this module is already present among the modules loaded in the system. If it isn't, the loader determines the module's (at most three) sections: CODE, DATA and TDATA. The loader then consults the system-wide map of loaded module regions and searches for an unoccupied address range of sufficient size. By the rules above, identical unoccupied address ranges exist in all three regions at the same distance from the beginning of each region. Having found such a range, the loader reserves it for use by this module in all processes and threads.

The module occupies the selected address range until the last process using it terminates. The system can then unload the module and free the address range, so the next time the same module may be loaded into a different address range. While the module is loaded, the base address of its text section is the same for all processes using it.

§ 11.8. Examples of assembler code

The following are examples of using ldar, callr, callri and bv to load constants, obtain procedure addresses, and call procedures.

Literals and other local read-only data in the TEXT segment can be loaded using ip-relative addressing. Loading a group of constants:

ldar base, text_hi(_local_data)
ldws gb, base, text_lo(_local_data)+0
ldwz gc, base, text_lo(_local_data)+4
lddz gd, base, text_lo(_local_data)+8

Getting the address of a static procedure (within 64 MiB of the current ip):

ldar base, _myfunc

Getting the address of a static procedure (more than 64 MiB from the current ip):

ldar.l base, _myfunc

Getting the address of a dynamic procedure:

ldar base, text_hi(_reloc_table)
lddz gt, base, text_lo(_reloc_table) + __imp_myfunc

Call a static procedure (within 8 GB of the current ip):

callr _myfunc
_ret_label:

Call a static procedure (more than 8 GB from the current ip):

callr.l _myfunc
_ret_label:

Calling a procedure through a pointer (in the addr register):

callri lp, addr, gz
_ret_label:

Calling an explicit dynamic procedure (the call is fixed up by the compiler):

ldar base, text_hi(_reloc_table)
lddz addr, base, text_lo(_reloc_table) + __imp_myfunc
callri lp, addr, gz
_ret_label:

Calling an implicit dynamic procedure (the call is fixed up by the linker using a stub function):

callr _glu_myfunc
_ret_label:
...
_glu_myfunc:
ldar gt, _reloc_table
ldd gt, gt, _imp_myfunc
bv gt, g0
_glu_ret_label:

Private process data (distance up to 1 MiB):

ldd gt, gp, _local_data

Private process data (distance greater than 1 MiB):

ldan gt1, gp, data_hi (_local_data)
ldd gt2, gt1, data_lo (_local_data)

thread local data (distance less than 1 MiB):

ldd gt, tp, _local_data

thread local data (distance greater than 1 MiB):

ldan g30, tp, data_hi (_local_data1)
ldd g31, g30, data_lo (_local_data1)
ldan g31, tp, data_hi (_local_data2)
ldd g32, g31, data_lo (_local_data2)

Chapter 12. Interrupts and hardware exceptions

An interruption is an action in which the processor automatically stops executing the current instruction thread. The processor usually saves part of the thread context (at minimum, the address of the instruction from which normal execution of the instruction flow should continue). The machine state changes to a special interrupt-processing mode, and the processor starts executing from the predefined address of the interruption handler routine. Having finished interrupt processing, the handler routine (usually) restores the previous processor state (the context of the interrupted thread) and makes it possible to continue execution of the thread from the interrupted (or following) instruction (return from interruption).

An exception is an event that, if enabled, forces the processor to interrupt. Exceptions are generated by signals from internal and external peripheral devices, by instructions of the processor itself, by the internal timer, by debugger events, or by error conditions. In general, exceptions do not coincide one-to-one with interrupts: different exceptions may generate an interrupt of the same type, and one exception can produce several interrupts.

§ 12.1. Classification of interrupts

All interrupts can be classified according to the following independent characteristics: the location of the interrupt service code, synchronism with the context, synchronism with the instruction flow, criticality, and accuracy.

By the location of the code servicing them, interrupts are divided into two groups. Interrupts of the first group depend on the specific implementation of the processor and/or platform: RESET (power-up, hardware or «cold» start), INIT (soft or «warm» restart), CHECK (testing and, possibly, recovery of the processor and/or platform after a failure), PMI (request to the processor/platform for an implementation-specific service). The method of handling such interrupts is unknown to the operating system. The code handling them is stored in an intermediate layer between the OS and the hardware (PAL). The handler addresses for such interrupts are fixed for a given processor implementation and tied to the address range of the PAL library. The code, in whole or in part (if the implementation allows PAL updates), resides in the write-protected PAL memory area.

Interrupts of the second group are determined by the architecture (fixed) and do not depend on the specific processor implementation. The method of servicing such interrupts is selected by the operating system. The code for their processing is stored in the interrupt table, the location address of this table and its contents are set by the OS. Interrupts of the second group are also called vector interrupts, since the processor uses the interrupt vector number to select the handler code from the interrupt table.

Synchronism with the context determines the ability to continue the interrupted instruction flow. For RESET or CHECK, continuing the interrupted execution context is impossible: it either doesn't exist yet or is not restored. A machine check (restart, reset) breaks context synchronization with respect to subsequent instructions. For other types of interrupts, the interrupted thread context is usually restored after the interrupt is processed. Such interrupts are called context-synchronous or recoverable: after the interruption completes, execution of the interrupted instruction sequence can continue (the execution context is saved and restored). An interrupt is unrecoverable if, during its generation or processing, the contents of processor registers, cache memory, write buffers, etc. are lost.

Synchronism with the thread specifies the relation of the interruption to the interrupted instruction thread. Interrupts asynchronous to the thread are caused by events that do not explicitly depend on the instructions being executed; for them, the address reported to the exception-handling routine is simply the address of the next thread instruction that would have been executed had the asynchronous interrupt not occurred. Interrupts synchronous to the thread are caused directly by the execution, or attempted execution, of an instruction from the current thread. Synchronous interrupts are processed strictly in program order and, if several interrupts occur for a single instruction, in order of interrupt precedence. Thread-synchronous interrupts are divided into two classes: errors (faults) and traps.

An error, or fault, is an interrupt that occurs before the instruction completes. The current instruction cannot (or should not) be executed, or system intervention is required before it is executed. Faults are synchronous relative to the instruction flow. The processor completes the state changes of the instructions preceding the faulting instruction; the faulting instruction and subsequent instructions have no effect on the machine state. Any intermediate results of the instruction's execution are completely canceled upon the fault, and after the interrupt is processed the instruction restarts. Faults accurately indicate the address of the instruction that caused the exception that generated the interrupt.

A trap is an interrupt that occurs after the execution of an instruction. The completed instruction requires system intervention. Traps are synchronous relative to the instruction flow. The trapping instruction and all previous instructions are complete; the following instructions have no effect on the machine state. The instruction that generated the trap is neither canceled nor restarted. Traps accurately indicate the address of the instruction following the one that raised the exception that produced the interrupt.
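The contrast between the two synchronous classes can be shown with a toy model: a fault reports the address of the faulting instruction (which will be re-executed), a trap reports the address of the next instruction (the trapping one has completed). The instruction size used here is an illustrative assumption, not the POSTRISC bundle format:

```python
def interrupt_return_address(pc: int, kind: str, insn_size: int = 4) -> int:
    """Return address reported to the handler for a synchronous interrupt."""
    if kind == "fault":       # instruction canceled, will be restarted
        return pc
    if kind == "trap":        # instruction completed, continue after it
        return pc + insn_size
    raise ValueError(f"unknown interrupt kind: {kind}")

assert interrupt_return_address(0x1000, "fault") == 0x1000
assert interrupt_return_address(0x1000, "trap") == 0x1004
```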

When the execution of an instruction causes a trap, or an attempt to execute an instruction causes a fault, the following conditions must hold at the interruption point:

Critical interrupts. Some types of interruption require immediate attention even if another interrupt is currently being processed and the machine state (return address and contents of the machine status registers) has not yet been saved. In addition, the interrupt handler itself may generate an interrupt that requires a new handler. For example, if the page table is placed in virtual memory, handling a DTLB or ITLB miss may itself cause another DTLB miss.

According to these requirements, interruptions are classified by criticality level. To allow a more critical interrupt to be taken immediately after the start of processing a less critical one (that is, before the machine state is saved), the architecture provides several sets of shadow registers for saving the machine state. Interrupts of each criticality class use their own register set.

All interrupts except machine check are ordered into two categories of interrupt criticality, so that only one interrupt of each category is processed at a time, and while it is being processed no part of the program state is lost. Since the register group for saving/restoring the processor state upon interruption is a serially reusable resource shared by all interrupts of the same class, program state may be lost when an unordered interrupt occurs.

Interrupt accuracy is an optional property of synchronous interrupts. Accurate interrupts are raised on a predictable instruction: the place where the instruction thread breaks is exactly the instruction that causes the synchronous event. All previous instructions (in program order) are completed before control passes to the interrupt handler. The instruction address is stored automatically by the processor. When the interrupt handler completes, it returns to the interrupted program and resumes its execution from the interrupted instruction.

Inaccurate interrupts are not guaranteed to be raised on a predictable instruction: any instruction that had not yet completed when the interrupt occurred could be the place where the thread was interrupted. Inaccurate interrupts can be considered asynchronous, because the instruction reported as the source doesn't necessarily correspond to the interrupted instruction, and they lag behind the interrupted thread. Inaccurate interrupts and their handlers usually collect information about the machine state related to the interruption for reporting through the system diagnostic software. The interrupted program usually cannot be restarted (its context is not restorable).

Table 12.1: Classification of Interrupts
PAL code (asynchronous to the thread or inaccurate, critical):
    unrecoverable: RESET, CHECK
    recoverable: INIT, PMI, CHECK
Vector:
    asynchronous to the instruction thread, recoverable: INT (external interrupts)
    synchronous to the thread:
        inaccurate errors, unrecoverable: ? maybe FPU?
        accurate, recoverable errors: TLB, access rights
        accurate, recoverable traps: debug, FPU traps

Since not all combinations of handler code location, synchronism with the context and/or flow, criticality and accuracy exist, it is convenient to divide all interrupts into four kinds: failures (aborts), asynchronous interrupts (interrupts), and synchronous interruptions, which in turn divide into errors (faults) and traps.

Failures. The processor has detected an internal failure, or a processor reset has been requested. A failure is synchronous neither with the context nor with the instruction flow. It can leave the current instruction thread in an unpredictable state with partially modified registers and/or memory. Failures are interrupts whose handlers are stored in PAL.

Asynchronous interrupts. An external or independent entity (such as an IO device, the processor's own timer, or another processor) needs attention. Interrupts are asynchronous relative to the instruction flow, but usually synchronous with the context: all previous instructions are completed, and the current and subsequent instructions have no effect on the machine state. Interrupts are divided into initialization interrupts, platform management interrupts, and external interrupts. Initialization and platform management interrupts are PAL interrupts; external interrupts are vectored interrupts.

Faults and traps. These are always synchronous with both context and flow, and are vectored interrupts.

Machine check interrupts are a special case of asynchronous interruption. They are usually caused by a hardware failure, a failure of the memory subsystem, or an attempt to access an invalid address. A machine check can be caused indirectly by instruction execution, if an error caused by the instruction is not recognized in time and turns into a hardware failure. Machine check interrupts cannot be said to be strictly synchronous or asynchronous, precise or imprecise. They are, however, treated as critical-class interrupts.

In the case of a machine check, the following general rules apply: 1. No instruction after the one whose address is passed to the machine check interrupt routine in the iip register has started execution. 2. The instruction whose address is passed in iip, and all previous instructions, may or may not have completed successfully. All instructions that are ever going to complete have already done so, within the context existing prior to the machine check interruption; no further interruption (other than a new machine check) will occur as a result of those instructions.

§ 12.2. Processor state preservation upon interruption

When an interrupt occurs, the processor saves part of the context of the interrupted instruction stream in special registers. This is necessary for the subsequent correct restoration of the interrupted stream after interrupt processing completes. These registers are: iip, a copy of ip, and ipsr, a copy of psr.

The processor provides the interrupt handler with a minimum of free registers for intermediate computations, so that the handler can use them for its own purposes. The special register group ifa, cause, iib stores information about the characteristics of the interrupt needed to recognize and process it.

The special register group (iip, iipa, ipsr, ifa, cause, iib) is used to quickly save part of the machine state during an interruption, service the interrupt, and restore the initial machine state when returning from the interrupt. This group exists in two instances, to service interrupts of two priority (criticality) levels, and forms a file of 2 banks of 16 special registers.

These registers store information during interruption and are used by interrupt handlers. These registers can only be read or written while psr.ic=0 (while interrupt processing is in progress), otherwise the error «Illegal Operation fault» occurs. For these registers, their contents are guaranteed to be saved only when psr.ic=0. When psr.ic=1, the processor doesn't save their contents.

The special register interruption instruction pointer (iip) saves a copy of the ip register upon interruption and indicates the return point from the interrupt. In general, iip contains the address of the instruction bundle containing the instruction that caused the fault, or the address of the bundle containing the next instruction to return to after processing a trap. The indicated and following instructions are restarted; previous ones are ignored. Outside of the interrupt context, the value of this register is undefined.

The special register interruption instruction previous address (iipa) saves, when an interruption occurs, the address of the last instruction bundle all of whose slots executed successfully.

Register format iip and iipa
bits 63-4: bundle address; bits 3-0: 0

The special register interruption processor status register (ipsr) saves a copy of the psr register (machine status) upon interruption, and has the same format and set of fields as psr. ipsr is used to restore the processor state when returning from an interrupt with the rfi (return from interruption) instruction.

The special register interruption cause register (cause) stores, during a non-critical (primary) interruption, information about the interruption that occurred. The cause register contains data that differentiates between the different kinds of exceptions a single interrupt type can generate. When one of these interrupts is raised, the bit or bits corresponding to the particular exception that generated the interrupt are set, and all other bits of cause are cleared. Other interruption types do not affect the contents of cause. The cause register must not be cleared by software. cause records information about the nature of the interrupt on every interrupt event, regardless of psr.ic, except for «Data Nested TLB faults». cause stores information about an interrupted instruction and its properties, such as read, write, execute, speculative, or non-access. Several bits of cause can be set simultaneously; for example, a faulting semaphore operation can set both cause.r and cause.w. Additional information about the fault or trap is available through cause.code and cause.vector.

Register format cause
bits 63-32: reserved | vector
bits 31-0: code | reserved | ei | d | n | a | r | w | x
Table 12.2: cause register fields
Field | Bits | Description
r | 1 | Read exception. If 1, the interrupt is associated with a data read.
w | 1 | Write exception. If 1, the interrupt is associated with a data write.
x | 1 | Execute exception. If 1, the interrupt is associated with instruction fetch.
n | 1 | Non-access: translation-request instructions (dcbf, fetch, mprobe, tpa).
d | 1 | Exception deferral: this bit is set from the TLB exception deferral bit (tlb.ed) of the code page containing the faulting instruction. If no translation exists, or translation for the code is prohibited, cause.d=0. If 1, the interrupt is deferred.
ei | 2 | Excepting instruction: the slot number of the bundle on which the interrupt occurred. For faults and external interrupts, cause.ei matches the slot in iip, but not for traps; for traps, cause.ei identifies the slot of the trapping instruction.
code | 16 | Interruption code: a 16-bit code with additional information about the current interrupt.
vector | 8 | An 8-bit code with additional information about an external interrupt.

Notes: The information in the cause register is not complete. System software may also need to identify the type of the instruction that caused the interrupt, and examine the TLB entry accessed by the data or instruction memory access, to fully determine which exception or exceptions caused the interrupt. For example, a data memory interruption can be caused both by a protection violation exception and by a byte-order exception. System software would have to examine, besides cause, the processor state in ipsr and the page protection bits in the TLB entry accessed by the memory access, to determine whether a protection violation has also occurred. The bits of the saved ipsr register can be changed before returning from the interrupt via rfi.
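The field extraction an interrupt handler would perform on cause can be sketched as follows. The exact bit positions are read off the format diagram above and are assumptions for illustration (the diagram gives only the field order, not offsets); the struct and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed layout, high to low within each word (see the cause format diagram):
// bits 39-32: vector, bits 31-16: code, bits 7-6: ei,
// bit 5: d, bit 4: n, bit 3: a, bit 2: r, bit 1: w, bit 0: x.
struct CauseFields {
    bool     x, w, r, a, n, d;  // execute / write / read / access / non-access / deferral
    unsigned ei;                // excepting instruction slot (0-2)
    unsigned code;              // 16-bit interruption code
    unsigned vector;            // 8-bit external interrupt vector
};

inline CauseFields decode_cause(uint64_t cause) {
    CauseFields f;
    f.x      = (cause >> 0) & 1;
    f.w      = (cause >> 1) & 1;
    f.r      = (cause >> 2) & 1;
    f.a      = (cause >> 3) & 1;
    f.n      = (cause >> 4) & 1;
    f.d      = (cause >> 5) & 1;
    f.ei     = (cause >> 6) & 0x3;
    f.code   = (cause >> 16) & 0xFFFF;
    f.vector = (cause >> 32) & 0xFF;
    return f;
}
```

A faulting semaphore operation, for instance, would show both decode_cause(c).r and decode_cause(c).w set.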

The special register interruption faulting address (ifa) provides, upon interruption, the effective address calculated by the interrupted instruction (virtual, or physical if translation is not used). For loads, stores, atomics, or cache management instructions that caused an interrupt while accessing memory, whether due to misalignment, a data/instruction TLB miss, or any other reason, ifa contains the faulting data address and points to the first byte of the faulting operand. For other instructions, ifa contains the address of the instruction bundle. For faulting instruction addresses, ifa stores the 16-byte-aligned bundle address of the faulting instruction. ifa is also used to temporarily store the virtual address of a translation, when a translation entry is inserted into the TLB (instruction or data).

The special 128-bit register interruption instruction bundle (iib) saves, upon interruption if psr.ic=1, the current instruction bundle containing the faulting instruction. The interrupt handler may use iib if it needs to disassemble the faulting instruction and emulate its execution.

Register format iib
bits 63-0: slot2 (low part) | slot1 | tp
bits 127-64: slot3 | slot2 (high part)

§ 12.3. Exception Priority

There are two types of exceptions: those caused directly by the execution of an instruction (synchronous to the instruction stream) or caused by asynchronous events. In both cases, an exception can cause one of several types of interrupts.

The architecture requires that all synchronous interrupts be reported to software according to the sequential execution model. The exception to this rule is the case of multiple synchronous interrupts from a single instruction.

For any instruction trying to raise several exceptions for which the corresponding synchronous interrupt types are enabled, a priority order is defined in which the instruction is allowed to generate a single interrupt. This exception priority mechanism, besides the requirement that synchronous interrupts be generated in program order, ensures that at any given time only one of the synchronous interrupt types is up for consideration. The exception priority mechanism also suppresses some debug exceptions that occur in combination with other synchronously generated interrupts.

This section doesn't define the permitted setting of multiple exceptions for which the corresponding interrupt types are disabled. Raising exceptions whose corresponding interrupt types are disabled has no effect on raising other exceptions whose interrupt types are enabled. Conversely, if a specific exception whose interrupt type is enabled has, per the following sections, a higher priority than another exception, it prevents the setting of that other exception, regardless of whether the other exception's interrupt type is enabled or disabled.

The priority of exception types is listed below from highest to lowest. Some types of exceptions can be mutually exclusive and can be considered as exceptions of the same priority. In these cases, the exceptions are listed according to the sequential execution model.

Table 12.3: Priority of exception types (highest to lowest)
Aborts:
1. Machine reset abort (RESET): reboot
2. Machine check abort (CHECK): processor check
External interrupts:
3. Initialization interrupt (INIT): warm restart
4. Platform management interrupt (PMI): platform interrupt (chipset, board)
5. External interrupt (INT): external devices, timer, other processors
Runtime errors for the asynchronous register stack (spill/fill faults):
7. RS Data debug fault: address and memory access match one of the debug registers
8. RS Unimplemented data address fault: non-zero bits among the unimplemented address bits
10. RS Data TLB Alternate fault: data TLB miss (without HPT)
11. RS Data HPT fault: HPT error
12. RS Data TLB fault: data TLB miss (after HPT)
13. RS Data page not present fault: data page is not in physical memory
16. RS Data access rights fault: accessing a virtual memory page in an unauthorized way, for example reading from a page for which reading is prohibited
17. RS Data access bit fault: first access to the virtual memory page
18. RS Unsupported data reference fault: data access is not supported by the memory attributes
Instruction fetch phase faults:
21. Instruction TLB Alternate fault: instruction TLB miss (without HPT)
22. Instruction HPT fault: HPT error
23. Instruction TLB fault: instruction TLB miss (after HPT)
24. Instruction page not present fault: instruction page is not in physical memory
25. Instruction access rights fault: fetching instructions from a virtual memory page for which execution is not allowed
26. Instruction access bit fault: first fetch from the virtual memory page
Decode faults:
27. Illegal operation fault: reserved instruction
28. Privileged operation fault: privileged instruction
29. Undefined operation fault: invalid instruction form
30. Disabled floating-point fault: forbidden FP instruction
31. Unimplemented operation fault: unimplemented standard instruction (emulation required)
32. Unsupported operation fault: unimplemented dedicated instruction (emulation required)
Execute faults:
33. Reserved register/field fault: invalid instruction field value (in particular, a register number)
34. Out-of-frame rotated register: access to a rotated register outside the local frame
35. Privileged register fault: attempt by an unprivileged program to perform a privileged operation on a privileged register
36. Invalid register field fault: attempt to write an invalid value to a register or the TLB
37. Virtualization fault: attempt to execute a special instruction in processor virtualization mode
38. Integer overflow fault: integer overflow
39. Integer divide by zero fault: integer division by zero
40. Floating-point fault: floating-point error
Execute faults, memory access:
42. Data debug fault: address and memory access match one of the debug registers
43. Unimplemented data address fault: non-zero bits among the unimplemented address bits
44. Data TLB Alternate fault: data TLB miss (without HPT)
45. Data HPT fault: HPT error
46. Data TLB fault: data TLB miss (after HPT)
47. Data page not present fault: data page is not in physical memory
48. Data access rights fault: accessing a virtual memory page in an unauthorized way, such as reading from a page for which reading is prohibited
49. Data access bit fault: first access to the virtual memory page
50. Unaligned data reference fault: accessing data at an unaligned address
51. Unsupported data reference fault: data access is not supported by the memory attributes
Traps:
53. Lower-privilege transfer trap: debugger, privilege level change
54. Taken branch trap: debugger, taken branch
55. Instruction debug trap: debugger, attempt to jump to an address matching one of the address ranges in the debug registers
56. System call trap: debugger, system call interception
57. Single step trap: debugger, trap after each instruction
58. Unimplemented instruction address trap: unimplemented address of the next instruction bundle
59. Floating-point trap: floating-point instruction requires intervention
60. Software trap: software trap instruction

If an instruction raises multiple debug exceptions and doesn't raise any other exceptions, then it is permissible to generate a single debug interrupt (highest priority).

§ 12.4. Interrupt handling

The start addresses of interrupt handler code can be fixed in the architecture (as in old ARM and MIPS). But it is desirable to be able to relocate the entry point (for example, for updates), and possibly to assign different handlers to different processors in a multiprocessor system: several interrupts may be processed simultaneously by different processors, and no processor can use shared memory blocks for the needs of its interrupt handler. The special register interruption vector address (iva) determines the position of the system table of interrupt handlers in the virtual address space (or in the physical address space if translation is disabled). The vector table is 64 KiB in size and must be aligned on a 64 KiB boundary, so the lower 16 bits of the register must be zero.

Register format iva
bits 63-16: iva; bits 15-0: 0

For each of the 64 interrupt types in the table, 1024 bytes of code are allocated (64 bundles, or 192 short instructions). The address of an interrupt handler is obtained by combining the iva register and the interrupt vector number inum. If some vector is unused, the space for its code in the table can be used by the preceding vector. If a handler nevertheless doesn't fit in the table, it should branch outside the vector table.
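The combination of iva and inum can be sketched as a simple address computation; the function name is illustrative, and the masking follows from the alignment rules stated above (iva is 64 KiB aligned, each of the 64 vectors gets 1024 bytes).

```cpp
#include <cstdint>

// Illustrative sketch of interrupt vectoring: the 64 KiB table base comes
// from iva (low 16 bits are zero), and vector inum selects a 1024-byte
// (1 << 10) region inside it.
inline uint64_t handler_address(uint64_t iva, unsigned inum) {
    return (iva & ~uint64_t(0xFFFF))        // 64 KiB aligned table base
         | (uint64_t(inum & 0x3F) << 10);   // vector number 0-63, 1024 bytes each
}
```

For example, vector 3 of a table based at 0x10000 starts at 0x10C00.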

Address of the interrupt handler
bits 63-16: IVA base; bits 15-10: inum; bits 9-0: 0

Interrupt handling is implemented as a quick context switch (much simpler than completely changing the context of the process). When an interrupt occurs, the hardware does the following:

  1. If psr.ic=1, the register psr is stored in ipsr.
  2. If psr.ic=1, the register ip is stored in iip, and the address of the last fully executed instruction bundle (the previous ip) is saved in iipa.
  3. If psr.ic=1, the interrupted instruction bundle (or its first slot for a long instruction) is stored in iib.
  4. If psr.ic=1 and there is an effective address associated with the interrupted instruction (loads/stores, atomic memory operations, branches), this address is copied to ifa.
  5. In cause.ei, the slot number that caused the interrupt is stored.
  6. Other additional information about the interrupted instruction is stored in cause.
  7. In psr.ri, a mask of the remaining instructions is saved, to continue execution from the middle of the bundle after the interruption.
  8. The psr.ic bit is cleared (the ban on saving the state for subsequent critical interrupts is introduced).
  9. The psr.i bit is cleared (prohibition of other interrupts).
  10. The current privilege level psr.cpl changes to the kernel level (zero).
  11. Execution continues from the address: iva + (1024 × interruption_number).

interruption_number is a unique integer assigned to each interrupt. Vectoring is done by jumping into the interrupt vector table indexed by this integer. The interrupt vector table allocates 1024 bytes (64 instruction bundles) for each interrupt handling routine. The value in the iva register must be aligned on a 64 KiB boundary.
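The hardware entry sequence above can be modeled in emulator style as follows. This is a minimal sketch: the CpuState structure and field names are illustrative (not the actual postrisc sources), the iib/ifa capture of steps 3-4 is omitted, and the cause.ei bit position is an assumption.

```cpp
#include <cstdint>

// Illustrative machine state; real psr fields are bits of one register.
struct CpuState {
    uint64_t psr, ip, iva;
    uint64_t ipsr, iip, iipa, cause;
    bool     psr_ic, psr_i;
    unsigned psr_cpl;
};

inline void enter_interruption(CpuState& s, unsigned inum, unsigned slot) {
    if (s.psr_ic) {                      // state is collected only when psr.ic=1
        s.ipsr = s.psr;                  // 1. save machine status
        s.iip  = s.ip;                   // 2. save return address (faulting bundle)
        s.iipa = s.ip - 16;              //    previous bundle (bundles are 16 bytes)
        // 3-4. iib/ifa capture omitted in this sketch
    }
    s.cause = uint64_t(slot & 0x3) << 6; // 5. cause.ei gets the slot (assumed position)
    s.psr_ic  = false;                   // 8. further state collection disabled
    s.psr_i   = false;                   // 9. external interrupts disabled
    s.psr_cpl = 0;                       // 10. switch to kernel privilege level
    s.ip = s.iva + 1024ull * inum;       // 11. jump into the vector table
}
```

The rfi instruction performs the inverse: psr is restored from ipsr and execution resumes at iip.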

Notes: The task of interrupt handlers is to resolve (unmask) external interrupts (by setting the psr.i bit to 1) as soon as possible, to minimize the worst latency for external interrupts.

At the end of the interrupt routine, rfi (return from interruption) is executed, which restores the machine status register psr from ipsr, and normal instruction execution resumes from the address contained in iip.

Chapter 13. External interrupts

The architecture defines a mechanism for delivering external interrupts to the processor from other devices, external interrupt controllers, other processors; interrupt handling mechanism; a mechanism for sending interrupts to other processors. All this is handled by the processor's embedded interrupt controller.

Traditionally, interrupts are delivered to the processor via a separate serial bus, unlike ordinary data, which is delivered via the system bus. This creates an ordering problem, traditionally solved by software or by complex bus matching logic. If a data write is followed by an interrupt, the interrupt may reach the processor before the data write takes effect, causing the processor to see stale data. If only the system bus is used to deliver interrupts, along with normal data, the ordering problem disappears.

The POSTRISC architecture replaces the traditional serial interrupt bus with a system bus interrupt delivery implementation. Therefore, interrupt transfer capabilities are scaled along with the system bus speed. External IO interrupts are delivered directly via the IO bus, which also speeds up delivery to the system bus.

Unlike PCI, where a device sends everyone a shared interrupt signal, a device can now send a unique vector by writing it to a specific address. The OS can configure, for each device in the system, the address of the receiver of its interrupts (possibly one per device) and select up to 32 different vectors per device.

The architecture introduces batch interrupt handling to minimize the number of context switches, unlike the previous approach where each interrupt is processed in its own context. This allows the interrupt handler to handle all pending interrupts without changing the processor priority level, which reduces the number of context switches and improves performance.

The architecture rejects interrupts based on individual pins, in favor of interrupts delivered as special messages on the system-wide bus. To add more interrupt sources with the pin mechanism, more pins are needed, while the message mechanism places no restriction on the number of interrupt sources on a shared bus.

External interrupts are not related to the execution of the instruction thread (asynchronous to the thread). The processor is responsible for the sequence and masking (prohibition) of interruptions, sending and receiving interprocessor interrupt messages, receiving interrupt messages from external interrupt controllers, and managing local interrupt sources (from itself). External interrupts are generated by four sources in the system:

External interrupt controllers. Interrupt messages from any external source can be sent to any processor from an External Programmable Interrupt Controller (EXTPIC), which collects interrupts from several simple devices, or from an IO device capable of sending interrupt messages directly (with a built-in controller). The interrupt message informs the processor that an interrupt request has been made and specifies the unique vector number of the external interrupt. An interrupt request from a simple device is raised either when a steady signal level is detected or when the signal level changes. Processors and external interrupt controllers communicate via the system bus according to the interrupt message protocol defined by the bus architecture.

Devices locally attached to the processor. Interrupts from these devices are signaled on the processor's direct interrupt pins (LINT, INIT, PMI) and are always directed to the local processor. LINT pins can be connected directly to a local external interrupt controller. LINT pins are programmable as either edge-sensitive or level-sensitive, and for the type of interrupt generated. If they are programmed to generate external interrupts, each LINT pin has its own vector number. Only LINT pins connected to the processor can directly generate level-sensitive interrupts; LINT pins cannot be programmed to generate level-sensitive PMI or INIT interrupts. The INIT and PMI pins generate their corresponding interrupts. An interrupt on the PMI pin is generated with PMI vector number 0.

Internal processor interrupts. These are, for example, interruptions from the processor timer, from the performance monitor, or interruptions due to machine checks. These interrupts are always routed to the local processor. A unique vector number can be programmed for each interrupt source.

Other processors. Each processor can interrupt any other processor, including itself, by sending an interprocessor interrupt message to a specific target processor. The destination of the interrupt message (one of the processors in the system) is determined by the unique identifier of the processor in the system.

§ 13.1. Programmable external interrupt controllers

An external interrupt controller (EXTPIC) provides incoming interrupt signal lines, by which devices inject interrupts into the system as a steady signal level (level-triggered) or a signal transition (edge-triggered).

EXTPIC contains a Redirection Table (RT) with an entry for each incoming interrupt line. Each RT entry can be individually programmed to specify how interrupts on the line are recognized (edge or level), which vector (and therefore which priority) the interrupt has, and which of the processors should service the interrupt. RT contents are controlled by software (mapped into the physical address space and writable by processors) and take default values on reset. The table information is used to send messages to the local interrupt controller of the target processor via the system bus.

EXTPIC functionality can be integrated directly into the end device, but any component of the system that is capable of sending interrupt messages on the IO bus must behave like EXTPIC and must have EXTPIC functionality.

Table 13.1: EXTPIC controller registers
Name | Address | Description
EXTPIC Version register | Base + 0x00
IO eoi register | Base + 0x08
Redirection Table Entry X | Base + 0x10, then every 8 bytes

EXTPIC block format
register select: 0 | selected register
window register
version register: 0 | max RT num | 0 | version
eoi register: eoi

Redirection Table Entry format
high word: 0 | pid
low word: 0 | p | s | m | t | dm | 0 | vector

Delivery Mode (dm): delivery method. Delivery Status (s): 0 (idle) or 1 (pending). Interrupt Input Pin Polarity (p): 0 (high) or 1 (low). Trigger Mode (t): 0 (edge) or 1 (level). Mask (m): mask the interrupt. Processor ID (pid): identifier of the destination processor.
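Packing these fields into an RT entry can be sketched as below. The exact bit offsets are not given in the text, so the positions used here are assumptions for illustration only, as is the function name; only the field set and the pid-in-high-word / vector-in-low-word split follow the format above.

```cpp
#include <cstdint>

// Hedged sketch of composing a Redirection Table entry. Bit offsets are
// assumed (the spec above gives field order, not positions).
inline uint64_t make_rt_entry(unsigned vector,      // interrupt vector, 8 bits
                              unsigned dm,          // delivery mode
                              bool trigger_level,   // t: 0=edge, 1=level
                              bool mask,            // m: 1=masked
                              bool polarity_low,    // p: 0=high, 1=low
                              unsigned pid) {       // destination processor id
    uint64_t lo = (vector & 0xFFu)
                | (uint64_t(dm & 0x7)       << 8)
                | (uint64_t(polarity_low)   << 13)
                | (uint64_t(trigger_level)  << 15)
                | (uint64_t(mask)           << 16);
    uint64_t hi = uint64_t(pid & 0xFFFFu);
    return (hi << 32) | lo;                 // pid in the high word, per the format
}
```

Software would write such an entry to Base + 0x10 + 8 × line to route that interrupt line.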

§ 13.2. Built-in interrupt controller

From the point of view of other processors and IO devices, the processor itself is a device with a built-in programmable external interrupt controller. The only difference is that the processor itself programs its built-in interrupt controller, and is not programmed by other processors.

The local interrupt controller determines whether the processor should accept interrupts sent via the system bus, provides local registers for pending interrupts, handles nesting and masking of interrupts, manages interactions with its local processor, and delivers interprocessor messages to its local processor.

In older architectures, this programming was even carried out the same way as for external controllers, via memory-mapped registers. This required each processor to allocate its own address range to map its interrupt controller, and made possible strange errors with access to the controller of a «foreign» processor.

Later architectures prefer to implement the registers of the embedded controller as special registers inside the processor, without mapping them into the address space. This removes the need to map controller registers to physical addresses, and solves the access problem. Examples are Intel's new integrated interrupt controller architecture X2APIC, replacing XAPIC (the full chronology is PIC - APIC - XAPIC - X2APIC), and the IA64 SAPIC (streamlined integrated interrupt controller).

The POSTRISC architecture naturally follows the newer approach. Software manages external interrupts by changing special processor registers that control the built-in external interrupt controller. These registers are summarized in the table below; they are used to prioritize and deliver external interrupts, and to assign external interrupt vectors to interrupt sources inside the processor, such as the timer, the performance monitor, and machine checks.

Table 13.2: External interrupt control registers
Name | Description
lid | Local Identification register
tpr | Task Priority register
irr0-irr3 | Interrupt Request registers (read-only)
isr0-isr3 | Interrupt Service registers
itcv | interval time counter vector
tsv | thermal sensor vector
pmv | performance monitor vector
cmcv | corrected machine-check vector

The special task priority register (tpr) controls the forced masking of external interrupts depending on their priority. All external interrupt vectors with a number greater than mip (mask interrupt priority) are masked.

Register format tpr
bits 63-32: reserved
bits 31-0: reserved | mip

§ 13.3. Handling external interrupts

To minimize the cost of handling external interrupts, you need to reduce the total number of context switches (processor interrupts and return from interrupts). It is desirable to be able to batch process interrupts, that is, once interrupting the processor, process all interrupts awaiting processing. To do this, the mechanism for determining which external interrupts are pending should be separated from the processor interrupt mechanism.

The special register group interrupt request registers (irr0-irr3) stores a 256-bit vector of external interrupts awaiting processing (one bit for each possible interrupt vector number from 0 to 255). A bit set to 1 in irr means that the processor has received the corresponding external interrupt. The registers are read-only; writing is prohibited (an invalid operation). Vector numbers 1-15 are reserved for internal and local interrupts. The zero bit of irr0 is always zero: vector 0 is the special «spurious» (empty) interrupt vector. Reading from the register iv clears the bit corresponding to the highest priority pending interrupt and returns its vector number (or the spurious vector if there are no pending interrupts).
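The selection a read of iv performs over the four irr words can be sketched as follows. Following the tpr description, lower vector numbers are assumed here to have higher priority (vectors above mip are masked); this is an illustration of the selection logic, not the actual hardware algorithm, and the function name is hypothetical.

```cpp
#include <cstdint>

// Scan the 256-bit pending vector (irr0-irr3, 64 bits each) for the
// highest priority unmasked pending interrupt. Returns the vector
// number, or 0 (the spurious vector) if nothing qualifies.
inline int highest_pending(const uint64_t irr[4], unsigned mip) {
    for (unsigned v = 0; v < 256; ++v) {   // lower number = higher priority (assumed)
        if (v > mip)
            break;                         // everything above mip is masked by tpr
        if ((irr[v / 64] >> (v % 64)) & 1)
            return int(v);
    }
    return 0;                              // spurious vector: no pending interrupt
}
```

A real iv read would additionally clear the returned bit and move the vector to the in-service category.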

Register format irr
irr0: bits 63-16: pending interrupt bits; bits 15-1: reserved; bit 0: always 0
irr1, irr2, irr3: 64 pending interrupt bits each

The privileged register interrupt vector (iv) returns the highest priority unmasked vector number among the external interrupts awaiting processing. If there is such an interrupt, the processor moves its vector from the pending category to the in-service category. All vectors of the same and lower priority are masked until the processor finishes processing this interrupt. If there are no pending external interrupts, or all of them are masked, then reading iv returns the special value 0 (the spurious interrupt vector).

The end-of-interrupt indication is a write to iv. This signals that software has finished servicing the last claimed high priority interrupt, whose vector was obtained by reading iv. The processor removes this interrupt vector from the in-service category and unmasks interrupts of lower or equal priority.
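The batch-handling discipline built on these two operations can be sketched as a loop: read iv until the spurious vector comes back, service each claimed vector, and write iv to signal end of interrupt. Here read_iv, write_iv, and service are illustrative stand-ins for the privileged register accesses and the per-vector handler.

```cpp
// Batch interrupt service sketch: one processor interruption drains all
// pending vectors, as described in §13.3.
constexpr unsigned SPURIOUS_VECTOR = 0;

template <class ReadIv, class WriteIv, class Service>
void service_pending(ReadIv read_iv, WriteIv write_iv, Service service) {
    for (;;) {
        unsigned v = read_iv();   // claims the highest priority pending vector
        if (v == SPURIOUS_VECTOR)
            break;                // nothing left pending: return from interrupt
        service(v);               // handle this interrupt
        write_iv(v);              // end of interrupt: unmask lower priorities
    }
}
```

Only after the loop drains does the handler execute rfi, so the whole batch costs a single context switch.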

§ 13.4. Handling local interrupts

The processor itself may generate interrupts that are asynchronous to the current instruction thread and not related to external devices, for example on the end of a time slice (itc matches itm), itc overflow, performance monitor counter overflow, processor overheating, an internal processor error, etc.

In this case, it is convenient to present these interrupts as external and service them by the same principles. To do this, a dedicated external interrupt vector is mapped to each asynchronous interrupt source inside the processor. Accordingly, some interrupt vectors are mapped to specific types of asynchronous intraprocessor interrupts and cannot be used to program external devices.

The interval time counter vector is associated with the processor interval timer counter (itc) match or overflow. The performance monitor vector is associated with interrupts from the performance monitor. The corrected machine check vector is associated with interrupts due to the need to correct machine errors. The thermal sensor vector is associated with interrupts due to processor overheating.

For these types of interrupts, the corresponding register lets you set the interrupt vector number or mask the interrupt (field m).

Register format itcv, tsv, pmv, cmcv
313029282726252423222120191817161514131211109876543210
reserved m rv vector

§ 13.5. Processor identification and interprocessor messages

The special local identification register (lid) contains the processor core identifier. It serves as the physical name of the processor for all interrupt messages (external interrupts, INIT interrupts, PMI platform interrupts). The contents of the lid register are set by the platform during boot/initialization, based on the physical location of this processor in the system. This value is implementation-dependent and must not be changed by software (the register is read-only). When receiving interrupt messages on the system bus, processors compare their lid with the destination address of the interrupt message. On a match, the processor accepts the interrupt and stores it in its queue of pending interrupts.

Register format lid
313029282726252423222120191817161514131211109876543210
reserved lid

Each processor can interrupt any other processor, including itself, by sending an inter-processor interrupt message (IPI). Different architectures have different approaches to the organization of interprocessor interrupts and their delivery.

For example, in the x86 architecture, each processor implements a special interrupt command register (icr), and a processor generates an IPI by writing to its own copy of this register. The delivery method is not defined in this case, nor is the bus used (a separate narrow dedicated interrupt bus may be used). The pid field in the register selects the target processor to interrupt. The remaining fields are interrupt parameters (interrupt vector number and delivery mode). Hint directs the external system either to deliver the interrupt exactly to the addressee (Hint=0), or to load-balance and deliver it to another (unoccupied) addressee of the system's choice (Hint=1).

This method, despite its simplicity and universality (the interrupt delivery method is not defined by the architecture), also has problems. Sending an interrupt is usually preceded by data modifications performed over the shared bus, and an interrupt sent over the interrupt bus may overtake the data change on the shared bus. This requires complex hardware-software synchronization schemes.

Register format icr
6362616059585756555453525150494847464544434241403938373635343332
target processor id
313029282726252423222120191817161514131211109876543210
reserved h dm rv vector

Another approach is for each processor to behave like any other I/O device mapped into the physical address space. A processor generates an IPI for another processor by writing to a specific, architecture-defined region of physical addresses. This removes the problem of synchronizing data and interrupts, since both travel over the same shared bus, and doesn't require a separate interrupt bus. The load on the shared bus increases, but insignificantly, since interrupt signals make up a small percentage of its total traffic. IPI is implemented on this principle in the Intel Itanium and IBM Power architectures.

For example, in the IA64 architecture, a 1 MiB range of physical addresses in the memory-mapped device area is allocated for mapping processors (16 bytes per processor) and transmitting interrupt signals. The base address of this range is naturally aligned and architecturally fixed at 0xFEE00000. Any address of the form 0xFEENNNN0 is recognized as an interrupt signal for processor 0xNNNN. An 8-byte aligned write of a number to an address in this range sends an interrupt to the corresponding processor. Other kinds of writes, as well as reads, are not supported. The pid field of the address identifies the target processor to interrupt. The hint (h) field of the address directs the external system either to deliver the interrupt exactly to the addressee (Hint=0), or to load-balance and deliver it to another (unoccupied) destination of the system's choice (Hint=1). The remaining fields (in the written number) are interrupt parameters (interrupt vector number and delivery mode).

Physical address to send the message
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
0xFEE pid h 0
Record format for interrupt message
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
reserved dm vector
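The IA64-style address encoding described above can be sketched as bit manipulation in C. This is an illustrative model: the exact placement of the hint bit (bit 3 here) is our assumption from the field order in the figure, and the function names are ours.

```c
#include <stdint.h>

/* Assumed layout: bits 31:20 = 0xFEE, bits 19:4 = pid,
 * bit 3 = redirection hint, bits 2:0 = 0. */

/* compose the IPI target physical address for processor `pid` */
static inline uint32_t ipi_address(uint16_t pid, int hint) {
    return 0xFEE00000u | ((uint32_t)pid << 4) | ((uint32_t)(hint & 1) << 3);
}

/* a write is an IPI if the top 12 bits are 0xFEE and the low 3 bits are 0 */
static inline int is_ipi_address(uint32_t pa) {
    return (pa & 0xFFF00007u) == 0xFEE00000u;
}

/* recover the target processor id from the address */
static inline uint16_t ipi_pid(uint32_t pa) {
    return (uint16_t)((pa >> 4) & 0xFFFFu);
}
```

For example, the address for processor 0x1234 with exact delivery is 0xFEE12340.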

The POSTRISC architecture uses a generalization of the second method: the interrupt message is sent over the common system bus as a write to a dedicated physical address. Each processor core is mapped as a device into physical memory via the standard PCI Express config space (4 KiB per core). Every processor core can be found in the PCI Express config space of the corresponding chipset/socket. The first byte of the range is used to deliver interrupts; the remaining bytes are for remote processor tuning, debugging, and monitoring, or are reserved. The physical addresses 0xPPPPP0000000-0xPPPPPFFFFFFF are reserved for mapping existing processors (up to 65536 cores per PCIe ECAM). In the current emulator implementation, for simplicity, they are mapped to the similar kernel virtual addresses 0xFFFFFFFFE0000000-0xFFFFFFFFEFFFFFFF.

Physical address for memory-mapped cores
6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
reserved ECAM base Bus-Device-Function offset

Writing an 8-byte value with only the low 8 bits nonzero to address 0xFFFFFFFFENNNN000 sends an interrupt to the processor core with device id NNNN. A write to any other address of the form 0xFFFFFFFFENNNNXXX, or a load from any address inside the block, leads to a platform management interrupt for the sending core.

Table 13.3: The processor core memory-mapped physical address block layout
address bytes
0 1 2 3 4 5 6 7
0xFENNNN000 vector reserved (0)
0xFENNNN008 timecmp (test stuff)
0xFENNNN010 reserved
0xFENNNN018 reserved
... ...
0xFENNNNFF8 reserved

Chapter 14. Debugging and monitoring

The POSTRISC architecture provides debugging tools that enable hardware and software debugging features, such as step-by-step program execution, instruction breakpoints, and data breakpoints.

Debugging tools consist of a special debugging control register dbscr (debug status and control register), a set of debugging events as a subset of interrupts, special registers for comparing instruction addresses ibr (instruction breakpoint register), special registers for comparing data addresses dbr (data breakpoint register).

Debug registers are available for program execution, but they are intended for use only by special debuggers and debugging software, not general software or operating system code.

Monitoring tools include the following resources: special registers and/or bit fields controlling the monitoring, implemented types of counted events, a fixed number of event counters, additional type of interrupts for processing the monitoring counter overflow events.

§ 14.1. Debug Events

Debugging tools are based on a special group of debug interrupts built into the general interrupt mechanism. Debug type interrupts can be thrown for various reasons that can be analyzed in the handler of this interrupt. There are seven types of predefined debugging events:

Table 14.1: Debug events (in priority order)
Name Event Type
IB «Instruction address match» debug event occurs on an instruction address match: if the address of an instruction bundle matches one of the criteria specified in the ibr debug registers, an instruction debug event is raised (unless the instruction is canceled). One or more ibr debug events occur if instructions execute at addresses matching the criteria specified in the ibr registers.
DB «Data address match» debug event occurs on a data address match: the address of a data memory access meets one of the criteria specified in the dbr debug registers. Data debug events are reported only if the qualifying predicate is true. The reported trap code returns the matching state of the first 4 dbr registers that matched during execution of the instruction. Zero, one, or more dbr registers can be reported as matching.
TR «Software trap»
TB «Taken branch» trap occurs on each taken branch instruction if psr.tb=1. This trap is useful for profiling a program. After the trap, iip and ipsr.ri point to the branch target instruction, while iipa and cause.ei point to the branch instruction that caused the trap. The «taken branch» (TB) debug event occurs if psr.tb=1 (taken-branch debug events are enabled), a branch instruction is taken (either an unconditional branch, or a conditional branch whose condition is satisfied), and psr.de=1 or dbcr.idm=0.
SS «Single step» trap occurs after each successfully completed instruction if dbscr.ss=1 (single-step debug events are enabled). After the trap, iip and ipsr.ri point to the next instruction to be executed; iipa and cause.ei point to the trapped instruction.
IRPT «Interrupt taken» debug event occurs if dbcr.irpt=1 (interrupt-taken debug events are enabled) and any non-critical interrupt occurs while dbcr.idm=1, or any critical or non-critical interrupt occurs while dbcr.idm=0. This debug event may occur regardless of the setting of psr.de.
IR «Return from interrupt»

Debug events include instruction and data breakpoints. These debug events set status bits in the debug status and control register (dbscr). A set bit in the dbscr register constitutes a debug exception. Debug exceptions, if enabled, cause debug interrupts. The dbscr register is used to enable debug events, manage timer operation during debug events, and set the processor debug mode. It also holds the status of debug events.

The DBE (debug enabled events) bit group of the dbscr register is set in supervisor mode and cannot be changed by the program. The DBTE (debug taken enabled event) and DBT (debug taken event) bit groups of the dbscr register are set by hardware; they can be read and cleared by software. The contents of the dbscr register can be read into a general register with the mfspr instruction.

Debug events force debug exceptions and are recorded in the dbscr debug status register.

To enable a debug event, the corresponding bit in the DBE group of the dbscr register must be set; thus, for a debug exception of a certain type to be raised, that event type must be enabled by the corresponding bit or bits in the dbcr debug control registers. Once the dbscr register bit is set and debug interrupts are enabled (the corresponding DBE bit is 1), a debug interrupt is generated.

The corresponding bit in the special dbscr debug control register must be set to 1 to enable the debug interrupt for that event. Debug events are not allowed to occur while the corresponding dbscr bit is 0: no debug exception of that type occurs, and no status bits of that type are set in the dbscr register.

If the corresponding bit in the dbscr register is 1 (debug interrupts of this type are enabled) at the time of a debug exception, the debug interrupt occurs immediately (unless a higher-priority exception is allowed to cause an interrupt), execution of the instruction causing the exception is suppressed, and CSRR0 is set to the address of that instruction.

If debug interrupts of this type are blocked at the time of a debug exception, no debug interrupt occurs, and the instruction completes execution (provided the instruction doesn't cause some other exception that generates an enabled interrupt).

Notes: If an instruction is suppressed because it raised some other exception that generates an enabled interrupt, the attempted execution of that instruction does not cause an instruction-complete debug event. The trap instruction doesn't fall into the category of instructions whose execution is suppressed: it actually completes execution and then generates the system-call interrupt. In this case, the instruction-complete debug exception is also recorded.

A trap debug event occurs if dbscr.trap=1 (trap debug events are enabled) and the trap instruction is unconditional or its trap conditions are met.

Execution of a trap instruction results in a trap-instruction interrupt. This interrupt can be used for profiling, debugging, and entering the operating system (although the instruction for entering privileged code (syscall) is recommended, since it has lower cost).

If dbcr.trap=0 (trap debug interrupts are blocked) at the time of a trap debug exception, no debug interrupt occurs; a program interrupt of the trap type occurs instead if the trap condition is met.

Trap «Decrease privilege level»: when psr.lp=1 and a taken branch lowers the privilege level (psr.cpl becomes 1), this trap occurs. It allows the debugger to track privilege drops, for example, to remove permissions granted to higher privileged code. After the trap, iip and ipsr.ri point to the effective branch address, while iipa and cause.ei point to the branch instruction that caused the trap.

When dbcr.idm=1, only non-critical interrupts can trigger the «interrupt taken» debug event. This is because all critical interrupts automatically clear psr.de, which would always prevent the associated debug interrupt from occurring. Also, debug interrupts are themselves critical-class interrupts, so any debug interrupt (for any other debug event) would always additionally set dbscr.irpt after entering the debug interrupt handler. At that point, the debug interrupt routine would be unable to determine whether the interrupt is a valid debug event or merely an artifact of the initial debug event.

When dbcr.idm=0, both critical and non-critical interrupts can cause the «interrupt taken» debug event. In this case, the assumption is that debug events are not used to cause interrupts (software can poll dbscr instead), and it is therefore proper to record the exception in dbscr even though the critical interrupt that caused the accepted debug event will clear psr.de.

The «return from interrupt interception» debug event occurs if dbcr.ret=1 (return-from-interrupt debug events are enabled) and an attempt is made to execute the rfi instruction. When this event occurs, dbscr.ret is set to 1 to record the debug exception.

§ 14.2. Debug registers

Debug registers are designed to intercept program accesses to specific address ranges for specific purposes (e.g. execution or writing), and allow the debugger to verify the correctness of the program. Their number depends on the implementation. Read/write ability depends on the privilege level and processor model. They are used in pairs, with at least 4 pairs for instructions and 4 for data.

The 128-bit instruction breakpoint registers ibr are for debug comparison of instruction addresses. A debug event can be enabled to occur on an attempt to execute an instruction at an address in a range specified by an ibr register. Since all instruction addresses must be aligned on a bundle boundary, the four least significant bits of the ibr register are reserved and do not participate in comparison with the address of the instruction bundle.

6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
address
x 0 plm mask 0

The 128-bit data breakpoint registers dbr are for debug comparison of data addresses. A debug event can be enabled to occur on loads, stores, or atomic instructions to an address range specified by a dbr register.

6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
address
r w 0 plm mask

The contents of the dbr register are compared with the address computed by a memory access instruction. A data debug event occurs if it is enabled, execution of a data memory access instruction is attempted, and the type, address, and possibly the value of the memory access match the criteria specified in dbr.

All load instructions are treated as reads with respect to debug events, while all store instructions are treated as writes. Additionally, cache management instructions and some special cases are handled specially.

The cmp bits determine whether all or some bits of the instruction address must match the contents of the debug register, and whether the address must lie inside or outside a specific range specified by an ibr register for a debug event to occur.

There are four modes for comparing instruction addresses.

The high part of the register contains the breakpoint address; the low part contains the offset or breakpoint mask. At least 4 data and 4 instruction registers are implemented on all processor models. Registers are implemented starting from number zero upward.

The instruction and data memory addresses presented for matching are always in the implemented address space. Programming an unimplemented physical address into ibr/dbr guarantees that the physical addresses presented to ibr/dbr will never match. Similarly, programming an unimplemented virtual address into ibr/dbr guarantees that the virtual addresses presented to ibr/dbr will never match.

Table 14.2: Debug breakpoint register fields (dbr/ibr)
Field Description
Address
63:0
Matching address: 64-bit virtual or physical breakpoint address. The address is interpreted as virtual or physical depending on psr.dt and psr.it. The «data breakpoint address» trap occurs on load, store, and semaphore instructions. For instruction fetches, the lower four bits ibr.addr{3:0} are ignored when comparing addresses. All 64 bits are implemented on all processors, regardless of the number of implemented address bits.
mask
55:0
The address mask determines which address bits in the corresponding address register are compared when determining a breakpoint match. Address bits for which the mask bits are 1 must match the breakpoint address; otherwise, the address bit is ignored. Address bits {63:56}, which have no corresponding mask bits, are always compared. All 56 bits are implemented on all processors, regardless of the number of implemented address bits.
plm
59:56
Privilege level mask: enables breakpoints that match the specified privilege levels. Each bit corresponds to one of the 4 privilege levels: bit 56 corresponds to privilege level 0, bit 57 to level 1, etc. A value of 1 indicates that debug comparisons are enabled at that privilege level.
w
62
Write: when dbr.w=1, any non-canceled store, semaphore, probe.w.fault, or probe.rw.fault to an address matching the address register causes a breakpoint.
r
63
Read: when dbr.r=1, any non-canceled load, semaphore, lfetch.fault, probe.r.fault, or probe.rw.fault to an address matching the address register causes a breakpoint. When dbr.r=1, page table accesses that match dbr (except those for the tak instruction) cause an «Instruction/Data TLB miss» fault. If dbr.r=0 and dbr.w=0, the data breakpoint register is disabled.
x
63
Execution: when ibr.x=1, executing an instruction at an address matching the address register causes a breakpoint. If ibr.x=0, the instruction breakpoint register is disabled. Instruction breakpoints are reported even if the instruction is canceled.
ig
62:60
Ignored
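The match rule from Table 14.2 can be sketched as a small C predicate. This is an illustrative model under the table's field definitions: mask bits 55:0 select which low address bits are compared, bits 63:56 are always compared, and plm gates the match on the current privilege level; the struct and function names are ours.

```c
#include <stdint.h>

/* simplified dbr/ibr model: breakpoint address, 56-bit mask, plm bits */
typedef struct {
    uint64_t addr;    /* breakpoint address (high doubleword)        */
    uint64_t mask56;  /* mask bits 55:0 (1 = this bit must match)    */
    uint8_t  plm;     /* one enable bit per privilege level 0..3     */
} dbr_t;

static int dbr_match(const dbr_t *d, uint64_t va, unsigned cpl) {
    if (!(d->plm & (1u << cpl)))
        return 0;                               /* level not enabled */
    uint64_t cmp = (d->mask56 & 0x00FFFFFFFFFFFFFFull) /* bits 55:0 per mask */
                 | 0xFF00000000000000ull;              /* bits 63:56 always  */
    return ((va ^ d->addr) & cmp) == 0;
}
```

For example, clearing the low 6 mask bits makes the breakpoint cover a 64-byte range around the programmed address.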

The registers dbr/ibr can only be accessed at the highest privilege level 0, otherwise, the «privileged operation» error occurs.

Debug register changes are not necessarily observed by immediately following instructions. Software must use data serialization to ensure that modifications to dbr, psr.db, psr.tb, and psr.lp are observed before a dependent instruction is executed. Because changes to the ibr registers and the psr.db flag affect subsequent instruction fetching, software must execute an instruction serialization.

In some implementations, a hardware debugger may reserve two or more registers for its own use. When a hardware debugger is attached, only 2 dbr and 2 ibr registers are available for program use. Software should be able to run with fewer implemented ibr and/or dbr registers if a hardware debugger is present. When a hardware debugger is not attached, at least 4 ibr and 4 dbr registers are available for program use.

Implemented debug registers available to software are assigned lowest numbers first (for example, if only 2 dbr registers are available to software, they are dbr[0] and dbr[1]).

Notes: when a hardware debugger is attached and uses two or more debug registers, the processor doesn't partition the registers between the program and the hardware debugger; that is, the processor doesn't prevent the program from reading or changing any of the debug registers. However, if the program modifies any of the registers used by the hardware debugger, the operation of the processor and/or hardware debugger may become undefined; the processor and/or hardware debugger may crash.

The instructions mfibr (move from instruction breakpoint register), mtibr (move to instruction breakpoint register), mfdbr (move from data breakpoint register), and mtdbr (move to data breakpoint register) are used to indirectly read/write the instruction/data debug registers. The sum of a general register and simm10 is used to pass the index of the debug register.

    mfibr  ra, rb  # ra = ibr[rb+imm]
    mtibr  ra, rb  # ibr[rb+imm] = ra
    mfdbr  ra, rb  # ra = dbr[rb+imm]
    mtdbr  ra, rb  # dbr[rb+imm] = ra
instruction format for debug registers read/write
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target index simm10 opx

§ 14.3. Monitoring registers

Monitoring Registers are designed to count various internal events when executing an instruction thread. Their number depends on the implementation (minimum 4). Read/write ability depends on the priority level, processor model.

6362616059585756555453525150494847464544434241403938373635343332313029282726252423222120191817161514131211109876543210
counter for the number of such events
0 event type

There are at least 8 128-bit performance monitoring registers (mr0-mr7). Unimplemented monitoring registers read as zero; writes to them are ignored. Each monitoring register is associated with a specific event type that it counts.

Table 14.3: The counted events types
Name Event
Page access
DTLB miss
ITLB miss
I1-cache miss
D1-cache miss
D1-cache write-back
L2-cache miss
L2-cache write-back

An overflow of a monitoring counter raises an asynchronous event.
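A monitoring register's counter/event-selector pairing and its overflow event can be sketched in C. This is a behavioral model only: the struct layout and names are ours, and the asynchronous interrupt is represented by a flag.

```c
#include <stdint.h>

/* model of one monitoring register: 64-bit counter + event selector */
typedef struct {
    uint64_t counter;  /* high doubleword: event count               */
    uint32_t event;    /* low doubleword: event type being counted   */
    int      overflow; /* stands in for the asynchronous PM interrupt */
} mr_t;

/* called by the model each time an event of type `event` occurs */
static void mr_count(mr_t *mr, uint32_t event) {
    if (mr->event != event)
        return;                 /* this register counts a different event */
    if (++mr->counter == 0)     /* 64-bit wraparound */
        mr->overflow = 1;       /* would raise the performance monitor interrupt */
}
```

A counter programmed near its maximum shows the overflow event on the next matching count.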

The instructions mfmr (move from monitor register) and mtmr (move to monitor register) are used to indirectly read/write the monitoring registers. The sum of a general register and simm10 is used to pass the index of the monitoring register.

    mfmr  ra, rb  # ra = MR[rb+imm]
    mtmr  ra, rb  # MR[rb+imm] = ra
instruction format for mfmr, mtmr
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target index simm10 opx

Chapter 15. PAL (Privileged Architecture Library)

In a family of binary-compatible machines, application and operating system developers require hardware functions to be implemented uniformly. When functions conform to a common interface, code using them can run on several different implementations of the architecture without modification.

These include such functions as: binary encoding of instructions and data, exception mechanisms, and synchronization primitives. Some of these functions can be implemented cost-effectively in hardware; others are impractical to implement directly in hardware. The latter include low-level features such as translation buffer miss routines, interrupt handling, and interrupt vector control. They also include support for privileged and atomic operations that require long instruction sequences.

In earlier architectures, these functions were usually provided by microcode. Modern architectures try to avoid microcode mechanisms. However, it is still desirable to provide an architectural interface to these functions that is compatible across the entire family of machines. The Privileged Architecture Library (PAL) provides a mechanism for implementing these functions without microcode.

Three main components of Privileged Architecture Library: Processor Abstraction Layer (PAL), System Abstraction Layer (SAL), Extensible Firmware Interface (EFI). PAL, SAL, and EFI together initialize the processor and system before and for loading the operating system. PAL and SAL also provide machine check abort handling and other processor and system functions that may vary from implementation to implementation.

Extensible Firmware Interface (EFI) is the firmware layer that isolates the operating system loader from the details (differences) in the implementation of the platform and processor and organizes the basic functionality for controlling a machine without an OS.

System Abstraction Layer (SAL) is the firmware layer that isolates the operating system, the overlying EFI layer, and other high-level software from the details (differences) in the implementation of the platform.

Processor Abstraction Layer (PAL) is a software layer that abstracts the details of the processor implementation and isolates them from all: from the operating system, from the EFI layer and from the SAL layer. PAL is independent of the number of processors in the system. PAL encapsulates processor functions that are likely to change from implementation to implementation, so that SAL, EFI, and OS are independent of the processor version. This includes non-performance-critical functions, such as processor initialization, configuration, and correction of internal errors. PAL consists of two components:

The PAL address space occupies a maximum of 2 GB of physical address space. The PAL space contains addresses from 0x80000000 to 0xffffffff inclusive. Code execution after restart starts with the address 0x80000000.

§ 15.1. PAL instructions and functions

PAL should perform the following functions:

The architecture allows these functions to be implemented in standard machine code that resides in main memory. The PAL library is written in standard machine code with some implementation-specific extensions that provide access to low-level hardware. This allows an implementation to make various design trade-offs based on the hardware technology being used. The PAL library abstracts these differences and makes them invisible to system software.

The PAL environment differs from the normal environment in the following ways:

Full control of the machine state allows managing all functions of the machine. Disabling interrupts allows sequences of several instructions to be provided as atomic operations. Access to implementation-specific hardware features allows access to low-level system hardware. Disabling memory-management traps on the instruction stream allows PAL to implement memory management functions such as translation buffer fills.

Special Features Required for PAL

PAL uses the POSTRISC instruction set for most of its operations. A small number of additional functions are required to implement PAL. Some of the free primary and/or extended opcodes can be used for PAL functions. These instructions generate an error if executed outside the PAL environment.

The presence of PAL has only one effect on system code. Because PAL resides in main memory and maintains privileged data structures in main memory, operating system code that allocates physical memory cannot use all of it. The amount of memory PAL requires is small, so the loss to the system is negligible.

§ 15.2. PAL replacement

POSTRISC systems require that PAL be replaceable with a version defined by the operating system. The following functions can be implemented in PAL code rather than directly in hardware, to facilitate replacement with different versions.

Translation buffer fill. Different operating systems may wish to replace the translation buffer (TLB) fill routines, since replacement routines may use different page table data structures. Therefore, no part of the TLB fill machinery that would change with a change in the page tables can be placed in hardware, unless it can be overridden by PAL code.

Process structure. Different operating systems may wish to replace the process context switch routines, since replacement routines may use different data structures. Therefore, no part of the context switch path that would change with a change in the process structure can be placed in hardware.

PAL consists of three components:

Chapter 16. LLVM backend

Development of the POSTRISC backend for the LLVM compiler: github.com/bdpx/llvm-project.

§ 16.1. LLVM backend intro

How to build/use.

§ 16.2. LLVM backend limitations

Nullification doesn't work, in progress.

Pre/post update addressing is not used.

Currently, only static PIE executables are supported by compiler and emulator.

§ 16.3. MUSL port

POSTRISC port for MUSL: github.com/bdpx/musl.

MUSL limitations: doesn't support f128.

POSTRISC limitations: currently, buildable only as a static lib.

§ 16.4. Code density comparison

Here are results for SQLite 3.33.0 compiled with Clang 10.0.1 on FreeBSD 12.1 with -Os for various architectures:

text    data   bss   arch, comments
445205  4576   964   ARMv7-A, thumb mode
649095  4576   964   ARMv7-A, ARM mode (a32)
588115  8280   1304  ARMv8-A (a64)
641257  8320   1312  amd64
584276  4576   952   i686
795319  16688  1304  mips64el
725083  4576   960   mipsel
691715  9148   960   ppc
712559  49144  1304  ppc64
689035  4960   959   rv32g
509583  4960   959   rv32gc (compressed)
689035  4960   959   rv64g
512500  8668   1299  rv64gc (compressed)
917929  8280   1304  s390x

The clear winner is ARM Thumb, but RISC-V with compressed instructions does well indeed: it is clearly the most space-efficient 64-bit ISA. i686 does a little worse (still the third most compact, after RV32gc and T32), and the classic RISC instruction sets are just terrible. The clear loser is z/Architecture (s390x).

Probably the same SQLite 3.33.0 (sqlite-chromium-version-3.33.0), compiled with the POSTRISC port of Clang 20.0 on Linux, and for comparison with x86-64 Clang 16.0.6 and gcc 13.2.

text    data   bss   arch, comments
519367  8320   1691  x86_64, Os, clang 16
772703  8320   1691  x86_64, O2, clang 16
430514  17032  1784  x86_64, Os, gcc 14
705880  16864  1784  x86_64, O2, gcc 14
757856  8280   1683  postrisc, Os, clang 20, dense calls
772528  8280   1683  postrisc, O2, clang 20, dense calls
801248  8280   1683  postrisc, Os, aligned calls
815792  8280   1683  postrisc, O2, aligned calls

The results for POSTRISC are without nullification and without the post-update addressing modes (not yet implemented in the compiler), which may improve code density a bit. The main factor for code density is the possibility of returning into the middle of an instruction bundle (dense calls). Overall, code density for POSTRISC is roughly similar to MIPS, PowerPC, s390, and RISC-V (without compression), and only slightly worse. This is surprising, taking into account 128 registers, bundles, nops, etc. For O2 mode, the results are similar to clang x86_64, even slightly smaller.

§ 16.5. DOOM port

POSTRISC port of Doom-1: github.com/bdpx/postrisc_doom. It uses the MUSL standard library (as a static lib). The Doom generic interface is implemented as additional system calls. Workable, with minor graphical artifacts.

The emulator log doom-log.html contains static/dynamic instruction statistics for the Doom Shareware demo scene autoplay (roughly the first 3 demo scenes). Time: 341.022 seconds. Frames: 16321. Instructions per frame: 851263 (up to the 8-bit indexed image, not counting emulator scaling/mapping). Instructions per pixel: 13.301. Frames per second: 48.191.