The POSTRISC architecture

This document describes the architecture of a non-existent, virtual processor. Both the instruction set architecture and the processor itself are referred to as POSTRISC. The name was chosen because «POST-RISC» is a common label for hypothetical processor projects intended to succeed processors of the RISC (Reduced Instruction Set Computer) architecture. The POSTRISC virtual processor combines the best (as it seems to the author) qualities of existing and past architectures.

Main features of the POSTRISC architecture:

Sources are available at https://github.com/bdpx/postrisc. The repository contains the source code, sample programs for POSTRISC, and this description of the virtual processor. To build the program, cmake and clang++/g++ are needed. To build this documentation from its XML sources, xsltproc/xmllint are needed.

The postrisc console application covers all basic tasks. It uses the standard streams for input/output, which need to be redirected from or to files. It is invoked with different options from the command line (table below). Additional parameters are set in an XML configuration file.

Table 1: Command-line options
Option Operation description
--scan <src.s
scan and recognize the individual tokens of a source program written in the virtual processor assembler
--scan-html <src.s
scan a source program in the virtual processor assembler and emit it as marked-up HTML
--assemble <src.s >src.bin
assemble the program src.s into the object file src.bin
--assemble-c
assemble into a C++ array for embedding into a C++ program
--disasm <src.bin
disassemble the file src.bin
--dumpbin <src.bin
disassemble the file src.bin together with the binary representation of each instruction
--execute <src.bin
execute the raw program src.bin in the emulator
--exeapp <src.bin
execute the ELF program src.bin in the emulator. Static PIE executables only; a (very limited) syscall emulation of a Linux-like system.
--html >out.html
output html-text with information about the syntax of assembler instructions, the format of machine instructions, operation codes, statistics for the instruction set
--llvm
output llvm tablegen file with information about the instructions encoding, the format of machine instructions, operation codes, etc. Used as «PostriscInstrEncoding.td» in LLVM compiler backend.
--export-definitions
list the predefined constants known to the assembler
--config file
import xml configuration file
--dump-file file
dump final emulation state to file
--log-file file
sets log file path
--log-level level
set logging level
--log-subsystem list
set logging subsystem mask
--
separator between POSTRISC engine options and emulated guest program options

For example, running a POSTRISC static-PIE ELF image:

/path/to/postrisc \
    --exeapp --log-file "test-log.txt" \
    -- \
    postrisc-app -app-option1 -app-options2

The qtpostrisc is a Qt-based graphical application with an assembler editor, a debugger, a Doom graphical backend, etc. (Qt5 required). It supports the same command-line options as the console application.

On Wayland systems it currently requires switching to X11 via the «QT_QPA_PLATFORM=xcb» environment variable:

QT_QPA_PLATFORM=xcb /path/to/qtpostrisc \
    --exeapp --log-file "doom-log.txt" \
    -- \
    doomgeneric.postrisc -doom -options

If this manual for the instruction architecture and assembler syntax doesn't exactly match the sample programs (not yet updated), the gen.html file contains a brief instruction set reference automatically generated by the assembler.

The example POSTRISC assembler program program.html does nothing sensible, but it uses all machine instructions and assembler pseudo-instructions, all program sections, and all addressing modes; its only purpose is joint testing of the assembler, disassembler, and emulator while they are being written. It is a concatenation of separate little tests.

To run the samples, the assembler/emulator needs a configuration in XML format; a sample is config.xml.

The resulting binary program may be disassembled: out_diz.s and out_dump.s (the latter with the binary representation included).

The results: result.txt. And full system dump: dump.html.

POSTRISC is built with cmake. Use the generators "MSYS Makefiles" (Windows) or "Unix Makefiles" (Linux). Set -DCMAKE_BUILD_TYPE=Release by default. Set -DCMAKE_CXX_COMPILER=g++ or clang++ (MSVC isn't supported). The «USE_QUADMATH» macro selects the internal implementation of long floating-point arithmetic (quadmath or mpreal). Set -DUSE_QUADMATH=0 for clang (it doesn't support libquadmath); either 0 or 1 works for g++.
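
A possible build sequence on Linux (the build directory name and the source path are arbitrary, shown here only as an illustration):

mkdir build && cd build
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_COMPILER=clang++ -DUSE_QUADMATH=0 \
    /path/to/postrisc
make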

The author is grateful in advance for reporting errors and inaccuracies in the virtual processor description and in the source code (which is far from error-free).

The set of tools for working with POSTRISC (an assembler, disassembler, and emulator are implemented) is incomplete. Further directions for improvement (the number in brackets, like [2], indicates the comparative complexity of the task):

  1. Development of a set of sample programs to illustrate the capabilities of the POSTRISC virtual processor and assembler [1].
  2. Development of a macroprocessor for the POSTRISC assembler [3].
  3. Refactoring the formula calculation unit to introduce strict control over the use of object names in formulas (according to sections and segments of the program and the relocation types) [2].
  4. Development of the object file format for a virtual processor compatible with the ELF standard, processing assembler for ELF [2].
  5. Development of separate compilation tools, support for relocation records according to the ELF standard (Executable and Linkable Format) in assembler and the development of the linker [2].
  6. Development of the block of floating-point calculations in the emulator (implemented), and support for real constants in assembler [3].
  7. Development of a visual debugger for the POSTRISC emulator [2].
  8. Development of a text editor with syntax highlighting for the virtual processor assembler [2].
  9. Adding new «hardware» tools to the virtual processor architecture and its emulator: cache management, virtual memory and translation buffer [2].
  10. Development of a simulator (cycle-accurate emulator) of the possible hardware implementation of a virtual processor, with the simulation of hardware such as multi-level instruction/data caches, translation buffers, branch prediction cache, return prediction stack, and other modern hardware. [5]

Contents

Chapter 1. Choosing an instruction set

§ 1.1. Bottlenecks issue

§ 1.2. Memory non-uniformity problem

§ 1.3. The technologies of parallel operation execution

§ 1.4. Instruction format budget

Chapter 2. Instruction set architecture (ISA)

§ 2.1. General description of the instruction set

§ 2.2. Register files

§ 2.3. Instruction formats

§ 2.4. Instruction addressing modes

§ 2.5. Data addressing modes

§ 2.6. Special registers

Chapter 3. Basic instruction set

§ 3.1. Register-register binary instructions

§ 3.2. Register-immediate instructions

§ 3.3. Immediate shift/bitcount instructions

§ 3.4. Register-register unary instructions

§ 3.5. Fused instructions

§ 3.6. Load/store instructions

§ 3.7. Branch instructions

§ 3.8. Miscellaneous instructions

Chapter 4. The software exceptions support

§ 4.1. Program state for exception

Chapter 5. The register stack

§ 5.1. Registers rotation

§ 5.2. Call/return instructions

§ 5.3. Register frame allocation

§ 5.4. The function prolog/epilog

§ 5.5. The register stack system management

§ 5.6. Calling convention

Chapter 6. Predication

§ 6.1. Conditional execution of instructions

§ 6.2. Nullification Instructions

§ 6.3. Nullification in assembler

Chapter 7. Physical memory

§ 7.1. Physical addressing

§ 7.2. Data alignment and atomicity

§ 7.3. Byte order

§ 7.4. Memory consistency model

§ 7.5. Atomic/synchronization instructions

§ 7.6. Memory attributes

§ 7.7. Memory map

§ 7.8. Memory-related instructions

Chapter 8. Virtual memory

§ 8.1. Virtual addressing

§ 8.2. Translation lookaside buffers

§ 8.3. Search for translations in memory

§ 8.4. Translation instructions

Chapter 9. The floating-point facility

§ 9.1. Floating-point formats

§ 9.2. Special floating-point values

§ 9.3. Selection for IEEE options

§ 9.4. Representation of floats in registers

§ 9.5. Floating-point computational instructions

§ 9.6. Floating-point branch and nullification instructions

§ 9.7. Logical vector instructions

§ 9.8. Integer vector operations

Chapter 10. Extended instruction set

§ 10.1. Helper Address Calculation Instructions

§ 10.2. Multiprecision arithmetic

§ 10.3. Software interrupts, system calls

§ 10.4. Cipher and hash instructions

§ 10.5. Random number generation instruction

§ 10.6. CPU identification instructions

§ 10.7. Instructions for the emulation support

Chapter 11. Application Model (Application Binary Interface)

§ 11.1. Sections and segments

§ 11.2. Data model

§ 11.3. Reserved registers

§ 11.4. Position independent code and GOT

§ 11.5. Program relocation

§ 11.6. Thread local storage

§ 11.7. Modules and private data

§ 11.8. Examples of assembler code

Chapter 12. Interrupts and hardware exceptions

§ 12.1. Classification of interrupts

§ 12.2. Processor state preservation upon interruption

§ 12.3. Exception Priority

§ 12.4. Interrupt handling

Chapter 13. External interrupts

§ 13.1. Programmable external interrupt controllers

§ 13.2. Built-in interrupt controller

§ 13.3. Handling external interrupts

§ 13.4. Handling local interrupts

§ 13.5. Processor identification and interprocessor messages

Chapter 14. Debugging and monitoring

§ 14.1. Debug Events

§ 14.2. Debug registers

§ 14.3. Monitoring registers

Chapter 15. PAL (Privileged Architecture Library)

§ 15.1. PAL instructions and functions

§ 15.2. PAL replacement

Chapter 16. LLVM backend

§ 16.1. LLVM backend intro

§ 16.2. LLVM backend limitations

§ 16.3. MUSL port

§ 16.4. DOOM port

Chapter 1. Choosing an instruction set

When creating the instruction sets of existing processor architectures, in different years, their architects pursued various, often mutually exclusive, goals. Among these goals are the following:

There are different architectures: those that have gone far in one particular direction, up to outright conceptualism, and universal ones that seek a balance of priorities across different directions. A choice made at the stage of designing the instruction set architecture may later affect the possibility of developing the architecture in one direction or another. Errors in architecture design can cut off the possibility of implementing the architecture effectively on new technologies because of an incorrect prediction of technological trends, can narrow the scope of the architecture, and can reduce the effectiveness of its application.

§ 1.1. Bottlenecks issue

The traditional architecture of a programmable computing device is based on the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory. The execution of an instruction consists of a sequence of steps:

Naturally, processor performance equals the performance of the bottleneck of this system. It doesn't make sense to increase the capabilities of one pipeline stage if problems at other stages are not resolved. Accordingly, several processor bottlenecks arise:

You can immediately say that the bandwidth of RAM is the fatal bottleneck of the processor, and this problem is only removed by increasing the amount of built-in cache.

§ 1.2. Memory non-uniformity problem

At the heart of the traditional architecture are two principles that concern its central element, memory. These are the principle of random access to any memory element (the uniformity property) and the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory.

However, even early computers acquired accumulator registers, and later counter and index registers appeared, stored as close as possible to the arithmetic unit. The emergence of architectures with general-purpose registers meant the final division of memory into fast registers and slower RAM. The appearance of registers made it possible to explicitly track, analyze, and plan data dependencies in the instruction stream and, when there are no dependencies, to execute instructions simultaneously.

The further increase in memory size, the miniaturization of computing devices, and the growing gap between memory and processor speed gave rise to cache memory. Caching removed some of the problems with memory access speed without changing the programming paradigm. However, more was needed: cache levels of increasing size. The current computer scheme is as follows: a set of specialized computing devices, each with its own register files, relies on a system of logically uniform memory with implicit multi-level caching.

The architecture of a computer with 16 general-purpose registers is certainly better than an architecture with 8 registers, and an architecture with two floating-point multiply-add pipelines is better than one with a single pipeline. It might seem that an architecture with 1024 registers and 16 multiply-add pipelines would be almost ideal. However, a register file of 1024 registers with 16×4=64 read/write ports would be a technological absurdity. Caching also reached its limit with the advent of the fourth cache level. Further enhancement of parallel data processing is achieved by creating massively parallel systems with shared memory, which abandon the uniformity property of memory, keeping it only for the local memory of one multiprocessor node. But these issues lie outside the architecture of the processor itself.

The new architecture doesn't abolish the traditional architecture based on logically homogeneous memory and doesn't offer a new programming paradigm. The architecture is still based on logically homogeneous RAM. Architectural changes can only affect the model of a computing device with its state explicitly described by internal registers.

§ 1.3. The technologies of parallel operation execution

In addition to memory non-uniformity, there is another fundamental fact that determines the development of architectures – the parallelism of operations. Unlike the traditional strictly sequential architecture, modern architectures seeking maximum performance try to execute more than one instruction at a time, and more than one operation in one instruction.

The problem of parallel computing is ultimately reduced to the problem of organizing consistent simultaneous access of many computing devices to logically homogeneous memory, that is, to the same problem of real memory non-uniformity and insufficient bandwidth. Accordingly, it is the memory hierarchy level at which data is shared that determines the parallelization technologies used.

There are several technologies for increasing the degree of parallelism of calculations, depending on the hierarchical level of memory for which they are intended. These technologies are implemented either at the ISA level (instruction set architecture) or at the software level. Here we are more interested in the first case, since we want to evaluate the possibilities of parallel operations due to the correct choice of ISA.

Table 1.1: Technologies for achieving parallel computing
Memory hierarchy level | Data exchange | Technology | Hardware | Acceleration | New ISA | Code density | Implementation
Separate register | inside the pipeline | SIMD: subword parallelism | Wide registers and data buses | 4-16 operations in one instruction | 4-16 | 8 | At ISA level, compiler
Pipeline data | inside the pipeline | Fused instructions | Longer pipeline, additional read port | 2-3 operations in one instruction | 2-3 | 1.25 | At ISA level, compiler
Separate register file | Crossbar before register file | OOOE+SS: out-of-order super-scalar execution | Increase in the number of register file ports, associative hardware for instruction issuing | 2-10 instructions per cycle | 2 5 | 0 | At ISA level, compiler
Many computing units with local register files | inter-file transfer instructions | MIMD+VLIW: very long instruction word | Wide fetching of instructions, scheduling | 2-8 instructions per cycle | 0 | 0.25 | At ISA level, compiler
Cache | Explicit sync memory access instructions | SMT/CMP: simultaneous multi-threading, chip multi-processing | Multiport cache, next instruction fetch | 2-4 microkernels (threads) on one chip | 4 | 0 | At the program level
Local shared RAM | Explicit sync memory access instructions | SMP: shared memory processing | Memory banks, wide crossbar | 2-64 microchips in one node | 64 | 0 | At the program level
Computing network | Library network transfer functions | MPP: massively parallel processing | Developed network topology (hypercube, torus, mesh, fat tree) | any number of nodes in the array | 4096 | 0 | At the program level

A commercially successful ISA implementation is a compromise between implementation complexity and the benefits of each technology. A successful ISA implementation doesn't give preference to any one technology (it isn't purely conceptual), but organically and in moderate doses combines several technologies.

SIMD (Single Instruction Multiple Data) instructions are homogeneous vector operations on elements (8, 16, 32, 64 bits long) of a wide register (64 or 128 bits long). They allow performing several (2-16) operations in one instruction per cycle. However, software handling of exceptional situations is complicated (where exactly is the error in the vector operation?). The program should contain a sufficient proportion of operations that allow vector execution, and the optimizing compiler must be able to find such operations. When accessing memory, there are problems with data alignment. Wide register read and write ports have to be implemented.

Fused instructions are three-operand instructions that combine two binary operations, for example: a = b × c + d. This reduces the total number of instructions and (ideally) doubles the number of computational operations performed per clock cycle (but not machine instructions). It requires a longer execution pipeline, and hence larger delays during branches. An additional read port for the third operand is needed. The construction of an exception handler becomes more complicated, since collisions are possible in both the first and the second of the fused operations. Encoding the fourth operand takes space in the instruction. There is a discrepancy between the formats of the computational instructions (binary and ternary), which complicates decoding, or all binary formats have to be artificially converted into ternary instructions. The program must contain operations that allow fusing, and the optimizing compiler must be able to find such operations; the percentage of fused operations should be large enough. The total number of possible fused instructions is O(N²), where N is the number of basic operations, which is quite large. In practice, it is impossible to fuse all instructions, since the amount of decoding hardware and the space in the instruction allocated for the operation code are limited. Therefore, only some frequently occurring combinations of operations are fused.
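
For example, in C++ an explicitly fused multiply-add can be requested with std::fma; a minimal sketch of the difference between two separate binary operations and one ternary fused operation:

#include <cmath>

// Two separate binary operations: a multiply, then an add
// (two operations, two roundings).
double separate(double b, double c, double d) { return b * c + d; }

// Explicitly fused multiply-add: one ternary operation with a single
// rounding, typically one machine instruction where hardware FMA exists.
double fused(double b, double c, double d) { return std::fma(b, c, d); }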

Predication is the conditional execution of instructions: any instruction turns into a hardware-executed conditional statement, for example: if (a) b = c + d. An additional operand encodes the logical condition register. This technology replaces a control dependency with a data dependency and shifts a possible pipeline stall closer to the pipeline end. Most poorly predicted branches occur in short conditional computations; the resulting pipeline stalls are eliminated by the simultaneous execution of instructions from both branches of the conditional statement. However, this is a purely brute-force method, which boils down to simultaneously issuing instructions from several execution branches under different predicates into the pipeline. Encoding the extra operand (the predicate register) takes space in the instruction.
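
A minimal C++ sketch of the underlying if-conversion idea (not POSTRISC code; the function names are invented for the example): the conditional assignment is computed unconditionally and the condition merely selects the result, so a control dependency becomes a data dependency:

#include <cstdint>

// Branchy form: a poorly predicted branch may stall the pipeline.
uint64_t with_branch(bool a, uint64_t b, uint64_t c, uint64_t d) {
    if (a) b = c + d;
    return b;
}

// If-converted (predicated) form: the addition is always executed,
// and the predicate only selects which value is kept.
uint64_t if_converted(bool a, uint64_t b, uint64_t c, uint64_t d) {
    uint64_t t = c + d;
    return a ? t : b;
}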

Superscalar execution of instructions. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: exception handling becomes more complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select N simultaneously executed instructions. Additional read and write ports and additional pipeline stages are needed.

Out-of-order execution (OOOE) is the execution of instructions not in the order prescribed by the program, but as their operands become ready, which makes it possible to bypass dependencies and do useful work while waiting for the completion of previous instructions, such as loads from memory. However, exception handling is complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select the next instruction from N buffered instructions. Additional pipeline stages are needed. A register file of sufficient size and hardware for dynamic register renaming are required.

VLIW (Very Long Instruction Word) or MIMD (Multiple Instruction Multiple Data) is the execution of instruction packages. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: synchronization is needed, i.e. accurate knowledge of the delay times of all pipelines and of memory, and hence programs do not survive a change of processor model and are incompatible with data caching. Additional read and write ports are needed. The program must contain operations that allow synchronous execution, and the compiler must be able to find such operations. The program size increases due to empty slots for which no useful instructions were found.

§ 1.4. Instruction format budget

The program size should be as small as possible. The requirement of code density requires efficient usage of space in the instructions. The question arises about the most advantageous distribution of the bit budget between different types of information in the instruction. The following table shows what the instruction bit budget can be spent on:

Table 1.2: Allocation of the instruction bits
Type of information | Advantages | Disadvantages
Operation code | Increasing the variety of implemented functions reduces the data path (the number of instructions per operation) | Complicates functional units and the compiler
Wider register numbers | More registers in a uniform register file facilitate variable allocation and data flow organization | Statistically useless when procedures with a small number of variables prevail; increases the length of data buses and the number of intersections
Additional operand registers | Non-destructive 3-ary instructions and complex fused 4-ary operations reduce the number of data moves and shorten the data path | Problems with additional register read ports and register renaming ports (for OOOE)
Explicit predicate description | Conditional execution of short conditional statements without branches reduces the branch delay of incorrect dynamic predictions | Favorable only for short conditional statements that do not exceed the pipeline length
Longer constants in the instruction code | Loading constants becomes easier; special instruction sequences for synthesizing long constants are needed less often | Statistically useless when short constants prevail
Templates for an early description of the instruction distribution to functional units | Facilitate decoding and distribution of instructions among functional units | Problems with porting programs to machines with a different set of functional units
Explicit description of instructions that allow parallel execution | Simplifies the scheduling of instructions to functional units | Useless with unpredictable execution times
Hints to the processor about the direction and frequency of branches | Reduced downtime due to incorrect dynamic predictions | Useless if the compiler doesn't have the necessary information, harmful if the prediction is incorrect. May conflict with the hardware branch predictor
Hints to the processor about the frequency and nature of future accesses to a cache line | Reduced cache misses, better cache utilization | Useless if the compiler doesn't have the necessary information or predicts incorrectly, harmful for differing predictions of access patterns to the same line. May conflict with a microarchitectural hardware prefetcher
Explicit clustering of register files with binding to functional units | The data bus length and the number of intersections are reduced, power consumption is reduced, less space is needed for register numbers in instructions | Requires explicit data transfers between register files in different clusters by separate instructions, which lengthens the data path. Doesn't allow reducing or increasing the number of clusters specified by the architecture. Doesn't allow redistributing functional units

A successful ISA implementation is a trade-off between the cost of instruction space and the benefits of using coding techniques. Preference should not be given to any one technique. The instruction set architecture combines different coding techniques at the ISA level.

In a broader sense, there is a question about splitting the workflow into separate instructions. The same sequence of operations can be represented in different ways as a sequence of instructions. The complication of the semantics of instructions makes it possible to increase their length and reduce the number without compromising the overall size of the program, and this, in turn, raises the question of a new redistribution of the bit budget.

However, for high-performance architectures, what matters is not so much the code size as the efficiency of instruction cache usage. A cache line contains several instructions and begins at a naturally aligned address. The most effective option is to fetch all the instructions of one cache line starting from the first instruction. Fetches that do not start at the beginning of a line require aligners and introduce delays; incomplete fetches use the cache irrationally.

Summarizing the above: a regular instruction format is needed that shortens the data path without significantly complicating the semantics and hardware support of each instruction, that is dense enough, but that, when choosing between code density and caching efficiency, gives preference to caching. The instruction format should provide scalability of parallel computing devices within a single portable architecture.

It should be noted that the growth in the volume of processed information is significantly ahead of the growth in the program complexity. The relative part of the RAM occupied by the program code is constantly decreasing. Therefore, the problem of minimizing the size of the program is gradually relegated to the background, remaining relevant only for embedded systems.

The pipelined, parallel organization of a computing device requires a regular instruction format, that is, a constant length of the portion of code supplied to the input of the pipelined instruction decoder. This is necessary in order to start decoding the next portion before the decoding of instructions from the previous portion is completed.

The regularity of the instruction format means that all instructions are bounded: their length is limited by the length of the decoded portion. However, not all instruction lengths are practical to implement. It should also be possible to determine the start of the next instruction before decoding of the current instruction is completed. Instructions must satisfy the natural alignment requirements of data in memory, which keep growing as memory systems evolve.

Table 1.3: Possible regular instruction formats
Format | Instruction lengths (bytes) | Alignment | Example
Irregular | 1-15 | 1 | Intel X86
Irregular | 2,4,6,8 | 2 | Motorola 68000, IBM S390
Semi-regular | 2,4 | 2 | MIPS-16
4-byte regular | 4 | 4 | Alpha, PowerPC, PA-RISC, Sparc, MIPS
8-byte bundles | 4,8 | 8 | Intel 80960
8-byte instructions | 8 | 8 | Fujitsu VPP
16-byte bundles | 5,10 | 16 | IA-64

The first three rows of the table relate either to the legacy instruction architectures, or to special architectures for embedded applications for which the program size flashed directly into the ROM is more important than performance.

A regular 4-byte format is used by all modern RISC architectures. Now they are completing the cycle of their development, reaching the limits of improving this format. It is hardly worth hoping for significant progress based on new architectures based on this format.

The format of 8-byte instructions is used only on some vector and graphics processors, where there is generally no possibility of access to smaller memory atoms. Its application for general-purpose architecture would mean more than double the size of programs, which is unacceptable.

Table 1.4: Comparison of some architectures
Number of registers | Architecture | Advantages | Disadvantages
8 | Intel X86 | scaled indexed addressing mode, SSE2 double-precision vector instructions | CISC: only 8 non-universal and non-orthogonal registers, lack of uniformity in encoding
16 | AMD X86-64 | PC-relative addressing | compatible with old X86, only 16 registers
16 | ARM32 | Predication, fused instructions | Combination of the instruction counter with the general registers
32 | ARM64 | fused instructions | usually 2 instructions to address global/static data (hi/lo parts of the address)
32 | SGI MIPS | First RISC: fixed instruction format, PC-relative addressing (MIPS16) | delayed branches
32 | Intel 80960 | Regular but not fixed format with 4- and 8-byte instructions |
32 | HP PA-RISC | instruction nullification, speculative execution, system calls without interruptions, global virtual address space, inverted page hash tables | delayed branches, comparison in each instruction
32 | DEC Alpha | out-of-order execution of instructions, a fixed instruction format, a unified PAL code, the absence of global dependencies outside the registers | insufficient memory access formats, poor code density, lack of good SIMD extensions, inaccurate interrupts
32 | IBM PowerPC | out-of-order execution of instructions with ordered completion and exact interruptions, fused «multiply-add» instructions, multiplicity of the condition register, saving or restoring several registers with one instruction, global virtual address space, inverted cluster page tables | optional comparison in each computational instruction, dependencies between global flag instructions, inconvenient ABI
32 | IBM/Motorola PowerPC AltiVec | Vector extension | missing double-precision vector instructions (as in SSE2)
32 | Sun UltraSPARC | Recursive interrupts, register rotation | register windows of a fixed size, large register files but a small number of registers
128 | Intel IA-64 | Predication, register rotation, instruction bundles | in-order-only execution of instructions, large multi-port register files, sparse code, complex compiler
128 | IBM Cell | Unified register file for all types | Explicit non-uniform scratchpad memory without cache, explicit DMA for exchange with main memory
256 | Fujitsu SPARC64 IX-FX | Vector instructions for paired registers | Separate preparation instructions for specifying numbers from an extended set of registers

Chapter 2. Instruction set architecture (ISA)

This chapter provides a basic description of the POSTRISC virtual processor instruction architecture (instruction set architecture or ISA).

§ 2.1. General description of the instruction set

The architecture prefers security over performance. Exploitation of unplanned program behavior should be prevented by design as far as possible, and ambiguous interpretation of code should be avoided. This was done for security reasons: to hinder return-oriented programming attacks like «return to libc» and to make all binary code available for inspection.

Variable-length instruction encoding allows starting execution from the middle of an instruction and extracting unplanned instruction sequences: alternative interpretations of the program code become possible by decoding from the middle of a variable-length instruction. It should be impossible to continue execution from the middle of an instruction. To ensure this, either a fixed format or a self-synchronizing variable-length format can be used. POSTRISC chose a fixed format, so variable-length instructions are forbidden and only fixed instruction encoding with aligned code chunks is allowed.

Some architectures allow placing data inside code, by design or due to global data addressing limitations. In such architectures, data may be placed near the function that uses it or accumulated into bigger «data islands» shared by several functions. Data in a code section may lead to data execution and exploitation of unplanned program behavior. So strong separation of code and data should be enforced at the architecture level, and mixing code and data in the code section should be prohibited. This also improves paging/caching/TLB behavior.

The instruction set architecture is aimed at maximally parallel fetching of instructions from memory and decoding. The instruction format is regular (the length of the decoded portion of code is constant), not strictly fixed (where all instructions are necessarily the same length), but almost fixed (inside the regular portion, the initial parts of instructions have the same length, and a possible continuation also has a fixed length). The unit of instruction flow is a 16-byte bundle assembled from three (usually) or two instructions. Bundles are always 16-byte aligned in memory.

Unlike traditional VLIW (very long instruction word) systems, the instruction bundling reflects only the parallel fetching and decoding process, not the process of dispatching, executing, or completing instructions. The instruction bundles do not describe the binding of individual instructions to functional units, the possibility (or necessity) of parallel execution and/or completion, or execution timings. The architecture doesn't expose microarchitectural details to software, such as load data delays, branch delays, other fixed pipeline delays (pipeline hazards), or a fixed set of functional units. This is necessary for program portability within a family of machines with different microarchitecture/performance. It is assumed that a program can be used without recompilation on machines with different sets of functional units and timings.

Wherever possible, the instruction set tends to be uniform, that is, if some part of the instruction with the same meaning (for example, the number of the first register, the number of the second register, immediate value, etc.) is present in many instructions, then in all those instructions this part is placed at the same position.

The instruction set architecture uses a non-destructive instruction format for any calculation over registers, i.e. the result register is always encoded separately from the operand registers, unlike CISC two-argument architectures, where the result is forcibly combined with one of the operands. Accordingly, two-argument unary instructions, three-argument binary instructions, and four-argument fused (ternary) instructions are valid.

Fighting unpredictable branches or using vector extensions requires predicates and conditional execution, but encoding an additional predicate argument costs extra space in each instruction. The POSTRISC architecture uses implicit predication via nullification: each instruction can be turned into a nop by preceding nullification instructions. Instructions are executed conditionally, and canceled instructions are treated as no-ops. When predication isn't used, no instruction bits are paid for it.

In the new architecture, to reduce the data path, a limited number of frequently encountered combinations of operations are fused (combined in one machine instruction): addition (or subtraction) with a shift; multiplication with addition or subtraction; addition of a constant combined with memory access (the base plus displacement addressing mode); register addition (with shift) combined with memory access (the indexed scaled addressing mode); comparison with a branch on the result of the comparison; update of a loop counter with comparison and a branch on the result, etc. The architecture assumes true hardware support for fused operations, rather than just compiling fused code that the hardware breaks back into the original operations.

In this architecture, superscalar out-of-order execution hardware can be used effectively. To allow this, the instruction set has several restrictions. There are no implicit or optional instruction results, no global registers, and no flags. The number of possible side effects of instructions is limited. Most instructions have a single register result; several instructions have two register results. The number of operands is limited to three (and for most instructions, two) registers.

For the POSTRISC instruction architecture, the underlying technology is parallel (super-scalar) out-of-order execution of complex (fused) instructions with implicit predication.

Instruction fetching and decoding occur sequentially in program order. Out-of-order concurrent execution is used to process at least one instruction bundle per cycle. The final completion of instructions, with the analysis of exceptions, occurs sequentially in program order.

All operations on integer data occur in general registers, with 2-3 source operand registers (one of which may be replaced by an immediate value or an immediate shift amount) and one result register.

All actions on floating-point data occur in general registers, with 1, 2 or 3 registers of the source operands and one register of the result. Floating-point instructions work on single/double/quadruple precision numbers in scalar or packed vector forms.

Many scalar operation codes are complemented by a wide range of vector operations. A special vector extension is used to process multimedia and numerical data in ordinary registers.

The architecture is of the load/store type. Memory accesses are limited to load and store instructions that move data between registers and memory and are not combined with computations on the accessed value. A memory access instruction usually performs exactly one memory access with a single virtual address translation. Unaligned memory accesses are possible, but strict data alignment is preferred.

Global flags and dedicated registers prevent efficient parallel execution of instructions, but duplicating resources and introducing explicit dependencies between instructions also require extra bits to be explicitly described in the instruction. Branch instructions do not use flags but check the values of general registers. The basic operation is the combination of «compare and jump» in one instruction.

To speed up subroutine calls, to pass arguments through registers, and to reduce the number of memory accesses, a hardware circular buffer of rotated registers is implemented. It also improves code density by minimizing function prologues and epilogues. The second, protected stack for rotated registers also protects the contents of all register frames from erroneous changes. Register rotation also complicates return-oriented programming: there is no fixed correspondence between the physical registers of different function frames.

Optional hints about the frequency and nature of future cache line accesses are given (if such information is available) in separate instructions.

For immediate encoding there exist different variants: optional compression, interpreting binary values as signed/unsigned, a separate sign bit plus an unsigned value, etc. POSTRISC uses the simple two's complement binary representation. Each immediate class is defined as signed or unsigned depending on its usage: base addressing displacements are always signed, shift amounts are always unsigned, compare immediates for less/greater are signed or unsigned depending on the comparison type, and compare immediates for equality are chosen to be signed.

§ 2.2. Register files

Processor resources include register files, special registers, associative search structures, interrupts. Some resources are available for user programs, others are necessary for the functioning of the operating system. Each processor core has its own set of registers that contain the current state of the core. All registers are divided into register files. There are no registers that are not included in any register file.

It is known that for ordinary code, increasing the register file size beyond 32 yields negligible gains. But using more registers makes sense for high-performance computing, digital signal processing, 3D graphics acceleration, and game physics. IBM uses the 128x128 SIMD register file in its POWER VMX extension and 64x128 in its POWER VSX extension. Fujitsu uses the 256x128 register file in its SPARC FX HPC-ACE extension. Intel Itanium had 128x82 floating-point registers for HPC.

For the POSTRISC architecture, the 128x128 register file is chosen as a compromise between ordinary usage and special computing purposes.

Table 2.1: Register Files
Register file | Number of registers | Register size in bits | Additional info
General Purpose Registers | 128 | 128 | General-purpose registers are intended for manipulations with scalars 1, 2, 4, 8, or 16 bytes long, or with vectors of numbers 1, 2, 4, or 8 bytes long. General-purpose registers are divided into 120 rotated (windowed) and 8 global registers. In each group, all registers are equal at the architecture level. Registers can be used to manipulate quadruple-precision real numbers, packed vectors of single- and double-precision real numbers, and packed integer vectors with elements 1, 2, 4, or 8 bytes long. Exceptions from equality: local: r0; globals: tp, fp, sp, gz.
Special Purpose Registers | up to 128 | 32/64/128 | As the name implies, special-purpose registers have different purposes. Not all of the 128 possible special registers are implemented. The ability to read/write depends on the privilege level, register number, etc.
CPU identification registers | implementation-defined | 64 | Read-only registers for reporting hardware capabilities/features. Available only indirectly.
Instruction TLB translation registers | implementation-defined | 128 | Fixed translations which can't be evicted from the instruction TLB. Available only indirectly.
Data TLB translation registers | implementation-defined | 128 | Fixed translations which can't be evicted from the data TLB. Available only indirectly.
Performance monitor registers | implementation-defined | 64 | Counters for internal processor core statistics, such as the number of TLB misses, instruction/data cache misses, branch mispredictions, etc. Available only indirectly.
Instruction breakpoint registers | implementation-defined | 64 | An instruction breakpoint register, when enabled, allows stopping execution at selected code addresses.
Data breakpoint registers | implementation-defined | 64 | A data breakpoint register, when enabled, allows stopping execution at selected data addresses and/or access types (read/write/backstore/etc).

§ 2.3. Instruction formats

Existing RISC architectures have exhausted the possibilities of a fixed 32-bit instruction format. Deep loop unrolling, function inlining, other compiler optimization technologies require more than 32 general-purpose (and floating-point) registers, preferably at least 128. However, increasing the number of registers over 32 with the 32-bit RISC instruction length turned out to be difficult. The three-address format requires at least 3×log2(128) or 21 bits for register numbers (and a four-address fused instruction even 28 bits).

The decision to separate code and data forces support of effective addressing modes for accessing global/static/const data outside the code section. The approach with several instructions (like high/low offset parts) to access global data seems unfit, but an existing 32-bit instruction isn't enough to reach global data from any code position in one instruction. The biggest known projects are estimated at 150-250 MiB (210 MiB Chromium, 380 MiB Linux kernel «allyesconfig» build, various CADs, etc), which requires offsets of at least 28-30 bits to accommodate future code growth. POSTRISC supports programs up to 256 MiB with direct access to global data in one instruction.

Some vector processors (like NEC SX Aurora) or video cards use a longer fixed 64-bit format. But this doubles the program size, and the possible benefits don't justify it for a general-purpose architecture. The only remaining intermediate format, consistent with 2^n-byte alignment, is three 42-bit instructions (slots) packed into 128-bit bundles. With the 128-bit format, control cannot be transferred to any instruction in the bundle except the first, nor can part of a bundle be executed: the bundle is a minimal execution unit. This encoding approach is similar to Intel IA-64 Itanium.

The POSTRISC architecture defines that a 128-bit bundle consists of a 2-bit template and three 42-bit slots. There are two instruction lengths: one or two bundle slots. A bundle may contain three simple one-slot instructions, or a dual-slot instruction followed by a one-slot one (direct order), or a one-slot instruction followed by a dual-slot one (reversed order).

All operation codes are placed in the first slot of double-slot instruction, so the second slot is used for the immediate extensions only. If the instruction format allows expansion to the second slot and the formation of a long instruction, then some immediate fields may have different lengths in short and long formats. For example, simm21(63) means that it is a 21-bit short format field, expandable to 63 bits in a long format.

The splitting of a bundle into instructions is completely determined by the 2-bit template, so the primary and extended operation codes of the single-slot and dual-slot instruction forms would not have to coincide; however, they are defined to be always identical. A long instruction is always the extended version of a short instruction with extended immediates. The following table shows the packing of the template and instructions into bundles.

Table 2.2: The bundle splitting into slots and template
Slot 3 (bits 86…127) | Slot 2 (bits 44…85) | Slot 1 (bits 2…43) | Template (bits 0…1)
42 bits | 42 bits | 42 bits | 00
84 bits (slots 3+2 combined) | 42 bits | 01
42 bits | 84 bits (slots 2+1 combined) | 10
126 bits (reserved) | 11
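
For illustration, here is a minimal C++ sketch of how an emulator might split a 128-bit bundle into slots according to the template (this is not code from the POSTRISC sources; the type and function names are invented; the __int128 extension is available in g++/clang++):

#include <cstdint>

// A 128-bit bundle as two 64-bit halves; 'lo' holds bit 0.
struct Bundle { uint64_t lo, hi; };

// Extract 'len' bits starting at bit 'pos' of the 128-bit value (len <= 64).
static uint64_t bits(const Bundle& b, unsigned pos, unsigned len) {
    unsigned __int128 v = ((unsigned __int128)b.hi << 64) | b.lo;
    return (uint64_t)((v >> pos) & ((((unsigned __int128)1) << len) - 1));
}

struct Slots {
    unsigned tmpl;                 // 2-bit template
    uint64_t slot1, slot2, slot3;  // raw 42-bit slots
};

Slots split_bundle(const Bundle& b) {
    Slots s;
    s.tmpl  = (unsigned)bits(b, 0, 2);  // bits 0…1
    s.slot1 = bits(b, 2, 42);           // bits 2…43
    s.slot2 = bits(b, 44, 42);          // bits 44…85
    s.slot3 = bits(b, 86, 42);          // bits 86…127
    // tmpl == 0: three one-slot instructions (slot1, slot2, slot3)
    // tmpl == 1: one-slot instruction in slot1, dual-slot instruction in slots 2+3
    // tmpl == 2: dual-slot instruction in slots 1+2, one-slot instruction in slot3
    // tmpl == 3: reserved
    return s;
}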

The following table shows the instruction formats and the lengths of the instruction fields in bits for one-slot instructions. The high 7 bits {41:35} always contain the primary operation code (or simply opcode) of the instruction. Many instructions also have one or two extended opcodes (opx). The remaining bits of the instruction contain one or more fields in various formats.

Table 2.3: Instruction formats
Format name | Format bits (fields listed from bit 41 down to bit 0)
r1i * opcode ra simm28 (64)
RaU28 * opcode ra uimm28 (64)
r1b * opcode ra label28 (64)
br * opcode opx label28 (64)
RaU28 * opcode opx uimm28 (64)
alloc opcode opx framesize 0
allocsp * opcode opx framesize uimm21 (63)
raopxUI21 * opcode opx 0 uimm21 (63)
raopx2i * opcode opx rb simm21 (63)
r2si * opcode ra rb simm21 (63)
r2ui * opcode ra rb uimm21 (63)
raopx2b * opcode opx rb 0 label17 (30)
r2b * opcode ra rb opx label17 (30)
bbit * opcode ra shift opx label17 (30)
brcsi * opcode ra simm11 (40) label17 (30)
brcui * opcode ra uimm11 (40) label17 (30)
RaSIN * opcode ra simm11 (40) dist-no dist-yes opx
RaUIN * opcode ra uimm11 (40) dist-no dist-yes opx
RaSbN opcode ra shift opx dist-no dist-yes opx
RabN opcode ra rb opx dist-no dist-yes opx
r4 opcode ra rb rc rd opx
r3s1 opcode ra rb rc pos opx
r2s2 opcode ra rb shift pos opx
r2s3 opcode ra rb shift shift pos
r3s2 opcode ra rb rc shift pos
gmemx * opcode ra rb rc scale sm disp
RbcScale opcode 0 rb rc scale opx
Rbc opcode 0 rb rc 0 opx
mspr opcode ra spr 0 0 opx
r2 opcode ra rb 0 0 opx
Round opcode ra rb 0 rm opx
r2s1 opcode ra rb shift 0 opx
r3 opcode ra rb rc 0 opx
RabcMo opcode ra rb rc mo opx
RabMo opcode ra rb 0 mo opx
RbcMo opcode 0 rb rc mo opx
fence opcode 0 mo opx
gmemu opcode ra rb simm10 opx
int opcode 0 rb simm10 opx
NoArgs opcode 0 opx
Table 2.4: Used text (and color) notation for instruction fields
Field | Length | Description
opcode | 7 | primary operation code
opx | 4, 7, 11 | extended operation code
ra, rb, rc, rd | 7 | general register number, operand or result
spr | 7 | special register number
uimm, simm | 9, 10, 11, 21, 28 | unsigned/signed immediate
disp | 9, 21, 28 | signed immediate for the address offset
label | 17, 28 | signed immediate for branch/jump/call targets
stride | 10 | signed immediate for base update
dist-yes, dist-no | 5 | nullification block size
shift, pos | 7 | bit number, shift value, field length
scale | 3 | indexing scale factor
sm | 2 | indexing scaling mode
rm | 3 | floating-point rounding mode
mo | 3 | memory ordering mode
0 | various | unused (reserved, must be zeros)

Formats marked with an asterisk (*) in the table allow the instruction to continue into the next bundle slot, forming a two-slot instruction. The primary codes of single-slot and dual-slot instructions are the same. The assembler code should explicitly request extension of an instruction into the second slot with the additional suffix «.l» (long). The assembler adds dummy nop instructions to the code if a long instruction doesn't fit in the remainder of the bundle and has to start a new bundle.

addi    r23, r23, 1234
addi.l  r23, r23, 1234

Notes: Btw, 42-bit slot format is in line with the «Answer to the Ultimate Question of Life, The Universe, and Everything»!

§ 2.4. Instruction addressing modes

Effective addresses are calculated with wraparound modulo 2^64. Absolute addressing directly in the instructions is absent; only position-independent code (PIC) can be used. Target addresses of executable code can only be calculated relative to the address of the current instruction bundle (the instruction pointer ip) or relative to base addresses in general registers.

The architecture supports 2 modes for ip-relative code addressing:

EA = ip + 16 × sign_extend(disp)

The call/jump offset occupies 28 bits in the instruction slot and allows encoding a branch up to ±2 GiB in both directions from the current address. If the two-slot instruction is used, the branch distance is up to ±8 EiB on either side of the current address.

The jump offset
bits 41…0: opcode | other | offset (28 bits)
bits 83…42 (second slot): 0 | offset continued (60 bits instead of 28)

The branch offset takes 17 bits in the instruction slot and allows encoding a branch up to ±1 MiB in both directions from the current address. If the two-slot instruction is used, the offset takes 30 bits, and the branch distance is ±8 GiB in both directions. The branch condition is encoded by the other parts of the instruction.

The branch offset
bits 41…0: opcode | other | offset (17 bits)
bits 83…42 (second slot): other (the offset takes 30 bits instead of 17)

The linker, when creating the image of a program module, must correctly replace all symbolic references to procedures and global data with offsets from the place where the symbol is accessed to the location of the symbol itself. That is, for example, calls to the same static procedure from different places in the program use different relative offsets.

The architecture also supports base-relative instruction addressing. The effective address is computed as the sum of two registers, aligned down to the bundle boundary.

EA = (GR[base] + GR[index]) & mask{63:4}.

base-relative branch
bits 41…0: opcode | other | base | index | other
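
As an illustration of the formulas above, here is a minimal C++ sketch of how an emulator might compute branch targets (this is not code from the POSTRISC sources; the function names are invented for the example):

#include <cstdint>

// Sign-extend the low 'bits' bits of 'value' to 64 bits (1 <= bits <= 63).
static int64_t sign_extend(uint64_t value, unsigned bits) {
    uint64_t sign = 1ull << (bits - 1);
    value &= (1ull << bits) - 1;
    return (int64_t)((value ^ sign) - sign);
}

// ip-relative target: EA = ip + 16 * sign_extend(disp).
// disp is 28 bits in the short form and 60 bits in the two-slot form.
uint64_t ip_relative_target(uint64_t ip, uint64_t disp, unsigned disp_bits) {
    return ip + 16 * (uint64_t)sign_extend(disp, disp_bits); // wraps modulo 2^64
}

// base-relative target: EA = (GR[base] + GR[index]) & mask{63:4},
// i.e. the sum aligned down to the 16-byte bundle boundary.
uint64_t base_relative_target(uint64_t base, uint64_t index) {
    return (base + index) & ~0xFull;
}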

§ 2.5. Data addressing modes

Absolute data addressing isn't directly supported: the architecture makes it impossible to put absolute static addresses into the instruction code. Only position-independent code (PIC/PIE) is possible. Absolute target addresses can be calculated only relative to the address of the current instruction bundle or to reserved base registers. The architecture supports the following data addressing modes:

  1. base ip-relative
  2. base plus displacement addressing mode or later simply base with offset addressing
  3. base plus scaled index addressing mode or later simply scaled indexed addressing
  4. base with base immediate pre or post-update

For relative addressing, the unsigned immediate disp field, which is 28 bits (or 64 bits in a dual-slot instruction), after zero extension, is added to the contents of the instruction pointer to produce a 64-bit effective address. It is assumed that program data sections like «.data» or «.rodata» are placed strictly after code sections like «.text» in the loaded program. The 28-bit immediate value allows addressing 256 MiB forward from the current bundle. Dual-slot instructions allow addressing the full 64-bit address space.

EA = ip + zero_extend(disp)

Relative addressing
bits 41…0: opcode | target | disp (28 bits)
bits 83…42 (second slot): 0 | disp continued (64 bits instead of 28)

For the base plus displacement addressing mode the disp offset, which is 21 bits or 63 bits for a dual-slot instruction, after sign extension, is added to the contents of the base register, to produce a 64-bit effective address. The 21-bit immediate value disp allows addressing ±1 MiB in both directions from the base address.

EA = GR [base] + sign_extend(disp)

Base with offset addressing
bits 41…0: opcode | target | base | disp (21 bits)
bits 83…42 (second slot): disp continued (63 bits instead of 21)

For the scaled indexed addressing mode, first the contents of the index register are extended according to the sm instruction modifier, which may be x64 (no extension), u32 (32-bit unsigned), or i32 (32-bit signed). Then the extended index is shifted left by scale, added to the 9-bit signed offset disp (−256…255), and added to the contents of the base register to produce a 64-bit effective address.

EA = GR[base] + (SM(GR[index]) << scale) + sign_extend(disp)

Indexed (scaled) addressing
bits 41…0: opcode | target | base | index | scale | sm | disp (9 bits)
bits 83…42 (second slot): disp continued (51 bits instead of 9)

For base with base immediate post-update addressing mode the 10-bit stride immediate is added to base after memory access.

EA = GR[base]

GR[base] = EA + sign_extend(stride)

For base with base immediate pre-update addressing mode the 10-bit stride immediate is added to base before memory access.

EA = GR[base] + sign_extend(stride)

GR[base] = EA

base with base immediate pre/post-update
bits 41…0: opcode | target | base | stride (10 bits) | opx
bits 83…42 (second slot): stride continued (52 bits instead of 10)
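
For the data addressing modes above, a similar minimal C++ sketch of the effective address calculations (again not from the POSTRISC sources; the names and the Sm enum are invented for the example):

#include <cstdint>

// Sign-extend the low 'bits' bits of 'v' to 64 bits (1 <= bits <= 63).
static int64_t sign_extend(uint64_t v, unsigned bits) {
    uint64_t sign = 1ull << (bits - 1);
    v &= (1ull << bits) - 1;
    return (int64_t)((v ^ sign) - sign);
}

// ip-relative data: EA = ip + zero_extend(disp), disp is 28 (or 64) bits.
uint64_t ea_ip_relative(uint64_t ip, uint64_t disp) { return ip + disp; }

// base + displacement: EA = GR[base] + sign_extend(disp), disp is 21 (or 63) bits.
uint64_t ea_base_disp(uint64_t base, uint64_t disp, unsigned disp_bits) {
    return base + (uint64_t)sign_extend(disp, disp_bits);
}

// scaled index: EA = GR[base] + (SM(GR[index]) << scale) + sign_extend(disp).
enum class Sm { x64, u32, i32 };  // index extension mode
uint64_t ea_scaled_index(uint64_t base, uint64_t index, Sm sm,
                         unsigned scale, uint64_t disp9) {
    uint64_t x = index;
    if (sm == Sm::u32) x = (uint32_t)index;                    // zero-extend low 32 bits
    if (sm == Sm::i32) x = (uint64_t)(int64_t)(int32_t)index;  // sign-extend low 32 bits
    return base + (x << scale) + (uint64_t)sign_extend(disp9, 9);
}

// post-update: EA = GR[base]; afterwards GR[base] = EA + sign_extend(stride).
uint64_t ea_post_update(uint64_t& base, uint64_t stride10) {
    uint64_t ea = base;
    base = ea + (uint64_t)sign_extend(stride10, 10);
    return ea;
}

// pre-update: EA = GR[base] + sign_extend(stride); GR[base] = EA.
uint64_t ea_pre_update(uint64_t& base, uint64_t stride10) {
    base += (uint64_t)sign_extend(stride10, 10);
    return base;
}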

Other addressing methods can be implemented through the ones above. Absolute data addressing can be implemented by using any register holding the value 0 as the base. Static data should be aligned to a 4-byte boundary, but even if it is not, it can be addressed with base+displacement addressing after placing a base address in one of the free registers. The special instruction ldar (load address relative) makes this preparation easier.

ldar base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)

Here ip_relative_offset is the label of the loaded object in the immutable data segment, text_hi is a built-in assembler function for calculating the relative address of the instruction bundle (or aligned 16-byte data portion), text_lo is a built-in assembler function for calculating the displacement within a bundle (portion). Using the ldar instruction, you can address 1 GiB on either side of the current position, or the entire address space, if you use the two-slot version of ldar:

ldar.l base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)

Addressing of private data can be implemented by first placing a suitable base address in one of the free registers. The special instruction ldan (load address near) calculates the nearest base address pointing to the middle of the page containing the desired object.

ldan base, gp, data_hi (gp_relative_offset)
ldd dst, base, data_lo (gp_relative_offset)

Here gp_relative_offset is the label of the object in the data segment, data_hi is a built-in assembler function to calculate the older part of the relative offset (relative to gp) to the middle of the data page where the label is located, data_lo is a built-in assembler function to calculate the offset of the label relative to the middle of the page. Using the ldan instruction, you can address 1 GiB of private data (or the entire address space if you use the two-slot version of ldan).

You can also immediately use the dual-slot memory access instructions, which can address 2^63 bytes in both directions from the base address.

ldd.l dst, gp, gp_relative_offset

§ 2.6. Special registers

There are several special registers, each 64 bits long. Not all special registers are available for direct access; most are available only to privileged software (at the system level). The table provides information on the purpose of the special registers and their availability in protected and privileged mode.

Table 2.5: Special Registers
Group | Registers | Description
Registers available to the program at any privilege level for direct and/or indirect reading and updating:
    ip : instruction pointer
    fpcr : floating-point status/control register
    rsc : register stack control
    rsp : register stack pointer
    eip : exception instruction pointer
    ebs : exception bit stack
    eca : exception context address
Registers available for reading/writing only at the system privilege level:
    bsp : bottom stack pointer
    peb : process env block
    teb : thread env block
    reip : returnable default exception instruction pointer
    itc : interval time counter
    itm : interval time match register
    psr : processor status register
    pta : page table address
Debug facility registers:
    ibr0…ibr3 : instruction breakpoint registers
    dbr0…dbr3 : data breakpoint registers
    mr0…mr8 : monitoring registers
Registers for switching to the kernel and making system calls (available only in the kernel):
    kip : kernel instruction pointer
    ksp : kernel stack pointer
    krsp : kernel register stack pointer
Registers for interrupt handling (interrupt context descriptors, shadow copies of general registers), available in the interrupt handler:
    iip : interruption instruction pointer
    iipa : interruption instruction previous address
    ipsr : interruption processor status register
    cause : interruption cause register
    iva : interruption vector address
    ifa : interruption faulting address
    iib : interruption instruction bundle
Registers of the built-in interrupt controller for controlling external interrupts and asynchronous interrupts from the processor itself (available only at the system level):
    tpr : task priority register
    iv : interrupt vector
    lid : local identification register (read only)
    irr0…irr3 : interrupt request registers (read only)
    isr0…isr3 : interrupt service registers (read only)
    itcv : interval time counter vector
    tsv : thermal sensor vector
    pmv : performance monitor vector
    cmcv : corrected machine-check vector

Direct access to special registers can be obtained using instructions mfspr (move from special-purpose register) and mtspr (move to special-purpose register). You can copy the special register to the general register (mfspr), perform the necessary operations, and then put the new value in a special register (mtspr).

The format of the mtspr and mfspr instructions
bits 41…0: opcode | ra | spr | 0 | 0 | opx

Syntax:

mfspr ra, spr
mtspr ra, spr

The special register ip (instruction pointer) stores the address of the end of the bundle containing the currently executing instruction, or, in other words, the address of the next bundle in the case of sequential execution without a branch. The ip register can be read directly via the mfspr instruction, but it is better to obtain an ip-relative address (including one with a zero offset) using the ldar/ldafr instructions. The ip register cannot be changed directly (via the mtspr instruction); it is automatically incremented at the end of bundle execution, and also receives a new value as a result of taken branch instructions. ip is also an implicit operand of relative branches. Because instruction bundles have a fixed length of 16 bytes and are aligned on a 16-byte boundary, the lower 4 bits of ip are always zero and writes to them are ignored.

Register format ip
bits 63…4: bundle address
bits 3…0: 0

The special floating-point status/control register (fpcr) is designed to control the floating-point unit (FPU).

Special registers rsc, rsp are used to control register rotation and flushing the contents of the circular register buffer into memory.

Special registers eip (exception instruction pointer), reip (returnable default exception instruction pointer), ebs (exception bit stack), eca (exception context address) are used to implement almost zero-cost software exceptions (like C++ try/catch/throw).

The 64-bit special processor status register (psr) controls the current core behavior. It is writable only at the most privileged level, and changing it requires explicit serialization.

Register format psr
bits 63…32: future
bits 31…0: 0 | ri | 0 | pl | vm | pp | mc | us | ib | ic | ss | tb | lp | dd | id | pm
Table 2.6: The psr fields
Group | Field | Size | Description
Miscellaneous:
    pm (1 bit): User performance monitor enable. If 1, the performance monitor is turned on and counts events; otherwise the performance monitor is disabled.
Predication:
    future (32 bits): controls the nullification of subsequent instructions. A nullification instruction may mark any of the following 10 instructions as non-executing in this field. A bit value of 0 means the instruction executes; 1 means it is nullified (not executed).
    The field is automatically shifted right as each instruction is executed, with zeros shifted in for the new farthest instructions. In the case of a taken branch, the mask is completely cleared, thereby canceling all pending nullifications.
Debugger:
    id (1 bit): Instruction Debug Breakpoint fault. If psr.id=1, instruction breakpoints are enabled and may cause an Instruction Debug fault. Otherwise, faults and traps on the address breakpoint are disabled.
    dd (1 bit): Data Debug Breakpoint fault. If psr.dd=1, data breakpoints are enabled and may cause a Data Debug fault. Otherwise, faults and traps on the address breakpoint are disabled.
    lp (1 bit): Lower Privilege transfer trap. If 1, the Lower Privilege Transfer trap occurs when a transition lowers the privilege level (the psr.cpl number increases to 1).
    tb (1 bit): Taken Branch trap. If 1, any taken branch causes the Taken Branch debug trap. Interruptions and returns from them do not cause this trap.
    ss (1 bit): Single Step trap. If 1, the Single Step debug trap occurs after the successful execution of each instruction.
Privileges, restrictions:
    cpl (1 bit): current privilege level of the executing thread. Controls the availability of system registers, instructions, and virtual memory pages. The value 0 is the kernel level, the value 1 is the user level. Modified by the instructions syscall, sysret, rfi, trap.
Interrupts:
    ri (2 bits): Restart Instruction. Stores the size of the executed part of the current instruction bundle. Used to partially restart the bundle after an interruption. Instructions in the ipsr.ri range are not executed (that is, instructions are skipped while psr.ri is less than the value stored in ipsr.ri at interruption).
    ib (1 bit): Interruption Bit. If 1, unmasked pending external interrupts can interrupt the processor and transfer control to the external interrupt handler. If 0, pending external interrupts cannot interrupt the processor.
    ic (1 bit): Interruption Collection. If 1, then upon interruption a partial preservation of the context occurs (using the registers iip, iipa, ipsr, ifa, iib).
    us (1 bit): Used Shadow registers. If 1, then during the interruption a partial preservation of the context occurred (the shadow registers iip, ipsr are used).
    mc (1 bit): Machine Check. If 1, machine check aborts are masked.
    vm (1 bit): Virtual Machine. If 1, attempting to execute some instructions results in a «Virtualization fault». If there is no virtualization implementation, this bit is not implemented and is reserved. The psr.vm bit is accessible only to the rfi and vmsw instructions.
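
As an illustration of the future field, a minimal C++ sketch of how the nullification mask might evolve, based solely on the description above (an assumption for clarity, not the emulator's actual code):

#include <cstdint>

struct Psr { uint32_t future = 0; };

// a nullification instruction marks up to 10 following instructions:
// bit k == 1 means "skip the k-th next instruction"
void nullify(Psr& psr, uint32_t mask10) { psr.future |= (mask10 & 0x3FFu); }

// performed for every executed instruction: bit 0 tells whether to skip it,
// then the mask shifts right; a taken branch clears it completely
bool step(Psr& psr, bool taken_branch)
{
    bool skip = (psr.future & 1u) != 0;
    psr.future >>= 1;
    if (taken_branch) psr.future = 0;
    return skip;
}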

The special register bsp (bottom stack pointer) stores the bottom limit for the downward-growing stack, whose current position is stored in the general register sp. The architecture assumes that all unused stack pages are premapped as guard pages and may be allocated in any order; it doesn't use pre-touching for allocated stack frames. The bsp should be page-aligned.

Special registers peb (process env block) and teb (thread env block) store read-only user-mode addresses of the associated process and thread data blocks respectively.

The special register interval time counter (itc) is an unsigned 64-bit number for measuring time intervals and for synchronization at intervals on the order of nanoseconds. itc increases at a fixed ratio to the processor frequency: it increments once every N cycles, where N is an implementation-defined power of two from 1 to 32. Applications can directly read itc for time-based computations and performance measurements. itc can only be written at the most privileged level. The OS must ensure that an interrupt from the system timer occurs before itc overflows. It is not architecturally guaranteed that the interval time counters of other processors in a multiprocessor system are synchronized with each other, nor with the system clock. The software must calibrate itc against a valid calendar time and periodically adjust for possible drift.

Modifications of itc aren't necessarily synchronized with the instruction thread. Explicit synchronization may be required to ensure that modifications to itc are observed by the subsequent program instructions. The software should take into account the possible spread of errors when reading the interval timer due to various machine stops, such as interrupts, etc.

Special interval timer match register (itm) is a 64-bit unsigned number which contains the future value of itc at which an «interval time match» interrupt will occur.

The special register pta (page table address) controls hardware address translation and stores the root address of the page table.

Special registers iip, iipa, ipsr save part of the context (state) of the processor upon interruption.

Special registers iva, cause, ifa, iib manage the interrupt table (iva), as well as recognition and processing of interrupts.

Special registers lid, iv, tpr, irr0 - irr3, isr0 - isr3, itcv, tsv, pmv, cmcv are for the embedded programmable interrupt controller and manage external interrupts.

Special registers ibr0-ibr3, dbr0-dbr3, mr0-mr8 are for the debugging and monitoring facilities.

Chapter 3. Basic instruction set

This chapter describes the basic virtual processor instruction set: approximately 300 true machine instructions and 30 pseudo-instructions (assembler instructions that have no exact machine analogs and are replaced by the assembler with other machine instructions, possibly with argument correction). It includes instructions for working with general registers, branch instructions, and instructions for working with special registers. It doesn't include privileged instructions, floating-point instructions, multimedia instructions, or support instructions for the extended (virtual) memory system.

§ 3.1. Register-register binary instructions

The register-register binary instructions have 3 arguments. The first argument is the result register number, the second and third are the numbers of the operand registers.

register-register instruction format
bits 41…0: opcode | ra | rb | rc | 0 | opx

Syntax:

INSTRUCTION_NAME ra, rb, rc
Instruction | Operation | Description
Arithmetic instructions
add | Ra = Rb + Rc | Addition (64 bits)
addw | Ra = Rb + Rc | Addition (32 bits)
sub | Ra = Rb − Rc | Subtraction (64 bits)
subw | Ra = Rb − Rc | Subtraction (32 bits)
absd | Ra = abs(Rb − Rc) | Absolute difference (64 bits)
absdw | Ra = abs(Rb − Rc) | Absolute difference (32 bits)
mul | Ra = LOPART(Rb × Rc) | Multiplication (the low part of the 128-bit product)
mulhs | Ra = HIPART(Rb × Rc) | Signed multiplication (the high part of the 128-bit product)
mulhu | Ra = HIPART(Rb × Rc) | Unsigned multiplication (the high part of the 128-bit product)
div | Ra = Rb / Rc | Signed division
divu | Ra = Rb / Rc | Unsigned division
mod | Ra = Rb % Rc | Remainder of signed division
modu | Ra = Rb % Rc | Remainder of unsigned division
Bitwise instructions
and | Ra = Rb AND Rc | Bitwise AND
andn | Ra = NOT(Rb) AND Rc | Bitwise AND with inversion of the first operand
or | Ra = Rb OR Rc | Bitwise OR
orn | Ra = NOT(Rb) OR Rc | Bitwise OR with inversion of the first operand
nand | Ra = NOT(Rb AND Rc) | Bitwise AND with inversion of the result
nor | Ra = NOT(Rb OR Rc) | Bitwise OR with inversion of the result
xor | Ra = Rb XOR Rc | Bitwise XOR
xnor | Ra = NOT(Rb XOR Rc) | Bitwise XOR with inversion of the result
Compare instructions (64 bit)
cmpdeq | Ra = Rb == Rc | Comparison for equality
cmpdne | Ra = Rb != Rc | Comparison for inequality
cmpdlt | Ra = Rb < Rc | Signed less-than comparison
cmpdle | Ra = Rb <= Rc | Signed less-or-equal comparison
cmpdltu | Ra = Rb < Rc | Unsigned less-than comparison
cmpdleu | Ra = Rb <= Rc | Unsigned less-or-equal comparison
cmpdgt | pseudo-instruction | argument swap and cmpdlt
cmpdge | pseudo-instruction | argument swap and cmpdle
cmpdgtu | pseudo-instruction | argument swap and cmpdltu
cmpdgeu | pseudo-instruction | argument swap and cmpdleu
Compare instructions (32 bit)
cmpweq | Ra = Rb == Rc | Comparison for equality
cmpwne | Ra = Rb != Rc | Comparison for inequality
cmpwlt | Ra = Rb < Rc | Signed less-than comparison
cmpwle | Ra = Rb <= Rc | Signed less-or-equal comparison
cmpwltu | Ra = Rb < Rc | Unsigned less-than comparison
cmpwleu | Ra = Rb <= Rc | Unsigned less-or-equal comparison
cmpwgt | pseudo-instruction | argument swap and cmpwlt
cmpwge | pseudo-instruction | argument swap and cmpwle
cmpwgtu | pseudo-instruction | argument swap and cmpwltu
cmpwgeu | pseudo-instruction | argument swap and cmpwleu
Min/Max instructions
mins | Ra = MIN(Rb, Rc) | Minimum (signed)
minu | Ra = MIN(Rb, Rc) | Minimum (unsigned)
maxs | Ra = MAX(Rb, Rc) | Maximum (signed)
maxu | Ra = MAX(Rb, Rc) | Maximum (unsigned)
Shift instructions
sll | Ra = Rb << Rc | Left shift and zero extension
srl | Ra = Rb >> Rc | Right shift and zero extension
sra | Ra = Rb >> Rc | Right shift and sign extension
srd | Ra = Rb >> Rc | Right shift as a signed division

The architecture doesn't use bit flags to store comparison results and doesn't use them as implicit operands/results, as, for example, do the architectures Intel X86, SPARC, IBM POWER. The comparison result as a value of 0 or 1 is stored in the general register. In this sense, POSTRISC is similar to MIPS or Alpha architectures. Additionally, to reduce the data path, instructions for determining the minimum/maximum are implemented (comparison and selection in one instruction).

These eight bitwise register-register instructions are enough to implement any binary logic function with a single instruction.

The shift value for the register-register shift instructions is defined as the lower bits of the third register: 5 bits (for 32 bit operations) or 6 bits (for 64-bit operations) or 7 bits (for 128 bit operations). High bits are ignored.
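
For example, for the 64-bit forms the semantics can be sketched in C++ as (a hypothetical illustration, not the emulator's code):

#include <cstdint>

uint64_t sll(uint64_t rb, uint64_t rc) { return rb << (rc & 63); }  // only the low 6 bits of rc are used
uint64_t srl(uint64_t rb, uint64_t rc) { return rb >> (rc & 63); }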

The shift right as division instructions produce a right shift according to the rules for dividing signed numbers. First, an arithmetic right shift is performed (with sign-bit extension). If the obtained value is negative and non-zero bits were shifted out, the result is corrected by adding one. The instruction was introduced to quickly divide signed numbers by 2^shift according to the rules of languages like C/C++ when dividing negative numbers: the result is symmetric with respect to zero, and the remainder can be negative.
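
A minimal C++ sketch of the srd rounding rule described above (assuming arithmetic right shifts of signed values):

#include <cstdint>

int64_t srd(int64_t x, unsigned shift)
{
    if (shift == 0) return x;
    int64_t r = x >> shift;                                        // arithmetic shift: rounds toward minus infinity
    uint64_t lost = static_cast<uint64_t>(x) & ((1ull << shift) - 1);
    if (x < 0 && lost != 0) r += 1;                                // correct toward zero, like C/C++ division by 2^shift
    return r;
}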

§ 3.2. Register-immediate instructions

The register-immediate arithmetic instructions. The first argument is the result register number, the second is the operand register number, the third is an immediate value of 21 or 63 bits, sign- or zero-extended to 64 bits. Instructions of this group allow continuation of the immediate into the next bundle slot, forming a dual-slot instruction.

register-immediate instruction format
bits 41…0: opcode | dst | src | imm21(63)

Syntax:

INSTRUCTION_NAME ra, rb, simm
INSTRUCTION_NAME ra, rb, imm
Instruction | Operation | Description
Arithmetic instructions
addi | Ra = Rb + imm | Addition
subfi | Ra = imm − Rb | Subtraction from the immediate
addiws | Ra = Rb + imm | Addition (32 bit signed)
addiwz | Ra = Rb + imm | Addition (32 bit unsigned)
subfiws | Ra = imm − Rb | Subtraction from the immediate (32 bit signed)
subfiwz | Ra = imm − Rb | Subtraction from the immediate (32 bit unsigned)
muli | Ra = LOPART(Rb × imm) | Multiplication (the low part of the 128-bit product)
divi | Ra = Rb / imm | Signed division
divui | Ra = Rb / imm | Unsigned division
modi | Ra = Rb % imm | Remainder of signed division
modui | Ra = Rb % imm | Remainder of unsigned division
Bitwise instructions
andi | Ra = Rb AND imm | Bitwise AND
andni | Ra = NOT(Rb) AND imm | Bitwise AND with register inversion
ori | Ra = Rb OR imm | Bitwise OR
orni | Ra = NOT(Rb) OR imm | Bitwise OR with register inversion
xori | Ra = Rb XOR imm | Bitwise XOR
Compare instructions (64 bit)
cmpdeqi | Ra = Rb == imm | Comparison for equality
cmpdnei | Ra = Rb != imm | Comparison for inequality
cmpdlti | Ra = Rb < imm | Signed less-than comparison
cmpdltui | Ra = Rb < imm | Unsigned less-than comparison
cmpdgti | Ra = Rb > imm | Signed greater-than comparison
cmpdgtui | Ra = Rb > imm | Unsigned greater-than comparison
cmpdlei | Ra = Rb <= imm | Signed less-or-equal comparison (pseudo)
cmpdleui | Ra = Rb <= imm | Unsigned less-or-equal comparison (pseudo)
cmpdgei | Ra = Rb >= imm | Signed greater-or-equal comparison (pseudo)
cmpdgeui | Ra = Rb >= imm | Unsigned greater-or-equal comparison (pseudo)
Compare instructions (32 bit)
cmpweqi | Ra = Rb == imm | Comparison for equality
cmpwnei | Ra = Rb != imm | Comparison for inequality
cmpwlti | Ra = Rb < imm | Signed less-than comparison
cmpwltui | Ra = Rb < imm | Unsigned less-than comparison
cmpwgti | Ra = Rb > imm | Signed greater-than comparison
cmpwgtui | Ra = Rb > imm | Unsigned greater-than comparison
cmpwlei | Ra = Rb <= imm | Signed less-or-equal comparison (pseudo)
cmpwleui | Ra = Rb <= imm | Unsigned less-or-equal comparison (pseudo)
cmpwgei | Ra = Rb >= imm | Signed greater-or-equal comparison (pseudo)
cmpwgeui | Ra = Rb >= imm | Unsigned greater-or-equal comparison (pseudo)
Min/Max instructions
minsi | Ra = smin(Rb, imm) | Minimum (signed)
minui | Ra = umin(Rb, imm) | Minimum (unsigned)
maxsi | Ra = smax(Rb, imm) | Maximum (signed)
maxui | Ra = umax(Rb, imm) | Maximum (unsigned)

For bitwise register-immediate instructions the immediate value is always sign-extended. Since the immediate can be inverted in advance, 5 instructions are enough instead of the 8 needed for the register-register case.
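
A short C++ illustration of why the remaining three bitwise forms are unnecessary (the function names are just for this example):

#include <cstdint>

uint64_t andni(uint64_t rb, uint64_t imm) { return ~rb & imm; }
uint64_t orni (uint64_t rb, uint64_t imm) { return ~rb | imm; }
uint64_t xori (uint64_t rb, uint64_t imm) { return  rb ^ imm; }

// nand(rb, imm) == orni(rb, ~imm)
// nor(rb, imm)  == andni(rb, ~imm)
// xnor(rb, imm) == xori(rb, ~imm)
// so the assembler (or compiler) can pre-invert the immediate instead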

§ 3.3. Immediate shift/bitcount instructions

These are binary instructions with a register operand and an immediate shift amount. Shift or rotate instructions shift the value from the src register by the fixed number of bits given in shift. Syntax:

INSTRUCTION_NAME dst, src, shift

Here the first argument is the result register number, the second is the source register number, and the third is the immediate shift/rotate amount from 0 to 63.

Binary instruction format with immediate shift value
bits 41…0: opcode | dst | src | shift | 0 | opx
Table 3.3: Instructions where the second argument is shift constant
Instruction | Operation | Description
slli | shift left logical immediate | Left shift and zero extension
srli | shift right logical immediate | Right shift and zero extension
srai | shift right algebraic immediate | Right shift and sign extension
srdi | shift right dividing immediate | Right shift as a signed division
cntpop | count population | Bit population count
cntlz | count leading zeros | Number of consecutive zeros in the most significant bits
cnttz | count trailing zeros | Number of consecutive zeros in the least significant bits
permb | permute bits | Bit permutation according to a mask

The instructions cntpop, cntlz, cnttz count ones/zeros within an interval of shift bits. cntpop gives the total number of ones. cntlz gives the length of the continuous run of zeros from the beginning of the interval (the most significant bits), or shift + 1 if all bits are zero. cnttz gives the length of the continuous run of zeros from the end of the interval (the least significant bits), or shift + 1 if all bits are zero.

The instruction permb (permute bits) permutes the bits/bytes in the register according to the immediate mask shift. Each mask bit selects whether neighbors of a given size are swapped: bits, bit pairs, nibbles (four bits), bytes, byte pairs, and four-byte halves of the original 64-bit value. For example, the maximum mask of 63 (all ones) means a permutation at all levels (a complete reversal of the bit order, as for FFT), mask 1 swaps only adjacent bits, mask 32 swaps the four-byte halves, mask 32 + 16 + 8 reverses the byte order (endianness) in the register, and mask 16 + 8 reverses the byte order within each four-byte word.
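
A hypothetical C++ sketch of the permb semantics, assuming mask bit k controls the swap of neighboring 2^k-bit blocks (bit 0: bits, bit 1: bit pairs, …, bit 5: four-byte halves):

#include <cstdint>

uint64_t permb(uint64_t x, unsigned mask)
{
    static const uint64_t m[6] = {
        0x5555555555555555ull,   // single bits
        0x3333333333333333ull,   // bit pairs
        0x0F0F0F0F0F0F0F0Full,   // nibbles
        0x00FF00FF00FF00FFull,   // bytes
        0x0000FFFF0000FFFFull,   // byte pairs
        0x00000000FFFFFFFFull    // four-byte halves
    };
    for (unsigned k = 0; k < 6; ++k) {
        if (mask & (1u << k)) {
            unsigned s = 1u << k;
            x = ((x >> s) & m[k]) | ((x & m[k]) << s);
        }
    }
    return x;
}

With this model, permb(x, 63) reverses all bits, permb(x, 32+16+8) reverses the byte order, and permb(x, 16+8) reverses the byte order within each four-byte word, matching the examples above.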

§ 3.4. Register-register unary instructions

Instruction format
bits 41…0: opcode | dst | src | 0 | 0 | opx
Table 3.4: Unary instructions
Instruction | Operation | Description
mov | move register | copy one register to another

Instruction mov (move register) copies data from one register to another.

Syntax:

mov   ra, rb

§ 3.5. Fused instructions

Fused instructions have more than two input parameters and can perform two or more actions in one instruction.

Table 3.5: Fused instructions
Name Operation
mov2    ra,rb,rc,rd
move 2 registers: gr[ra] = gr[rc], gr[rb] = gr[rd]
addadd  ra,rb,rc,rd
add and add: gr[ra] = gr[rb] + gr[rc] + gr[rd]
addsub  ra,rb,rc,rd
add and sub: gr[ra] = gr[rb] + gr[rc] − gr[rd]
subsub  ra,rb,rc,rd
sub and sub: gr[ra] = gr[rb] − gr[rc] − gr[rd]
muladd  ra,rb,rc,rd
multiply and add: gr[ra] = gr[rb] × gr[rc] + gr[rd]
mulsub  ra,rb,rc,rd
multiply and sub: gr[ra] = gr[rb] × gr[rc] − gr[rd]
mulsubf ra,rb,rc,rd
multiply and sub from: gr[ra] = gr[rd] − gr[rb] × gr[rc]
mbsel   ra,rb,rc,rd
masked bit select: gr[ra] = gr[rb] ? gr[rc]: gr[rd] (bitwise)
slp     ra,rb,rc,rd
shift left pair
srp     ra,rb,rc,rd
shift right pair
slsrl   ra,rb,rc,rd
shift left and shift right logical
slsra   ra,rb,rc,rd
shift left and shift right algebraic
sladd   ra,rb,rc,shift
shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift)
slsub   ra,rb,rc,shift
shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb]
slsubf  ra,rb,rc,shift
shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift)
srpi    ra,rb,rc,shift
shift right pair immediate
slsrli  ra,rb,shift,count
shift left and shift right logical immediate
slsrai  ra,rb,shift,count
shift left and shift right algebraic immediate
deps    ra,rb,shift,count
deposit set: Insert a group of units
depc    ra,rb,shift,count
deposit clear: Insert a group of zeros
depa    ra,rb,shift,count
deposit alter: Change group of bits
dep     ra,rb,rc,shift,pos
deposit: deposit of parts from two registers
rlmi    ra,rb,shift,count,pos
Fused 4-register instruction format
bits 41…0: opcode | ra | rb | rc | rd | opx

The instruction mov2 (move 2 registers) moves 2 registers at once. It may be used for swapping register values or simply for code path reduction.

Fused instructions of the shift-and-add type are intended to reduce the critical data path in address calculations. They combine in one machine instruction a left shift (by 0 to 7 bits) with an addition (or subtraction). An open question remains how to handle overflow that may occur in the intermediate calculation (the shift) when there is no room for the final result.

srpi instruction formats
bits 41…0: opcode | ra | rb | rc | shift | opx

Pair shift instructions slp (shift left pair), srp (shift right pair), and srpi (shift right pair immediate) shift two registers as a single quantity to the left (right) by count bits, and put the low part of the result in the result register. srp takes the count low bits from the second pair register and the high bits from the first. The first argument is the result register number, the second and third are the numbers of the pair of shifted operand registers, and the fourth is a register or an immediate count from 0 to 63 indicating the shift amount. The instruction can be used to implement many useful 64-bit operations: rotation by a fixed number of bits, left or right shifts, extraction of a part of a register:
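
A hedged C++ sketch of srp as a funnel shift (which of the two pair registers supplies the high half is an assumption here; the rotation identity below holds either way):

#include <cstdint>

// shift the pair (hi:lo) right by count bits, keep the low 64 bits
uint64_t srp(uint64_t hi, uint64_t lo, unsigned count)
{
    count &= 63;
    if (count == 0) return lo;
    return (lo >> count) | (hi << (64 - count));
}

uint64_t rotr(uint64_t x, unsigned count) { return srp(x, x, count); }   // rotation as a special case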

Double shift instructions produce a left shift and then a right shift (with arithmetic or logical extension). They can be used to extract a bit field from a register and for other manipulations.
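
For example, a bit-field extraction with a double shift can be sketched as follows (hypothetical helper names; pos + len must not exceed 64 and len must be at least 1):

#include <cstdint>

// slsrli pattern: logical final shift
uint64_t extract_u(uint64_t x, unsigned pos, unsigned len)
{
    return (x << (64 - pos - len)) >> (64 - len);
}

// slsrai pattern: arithmetic final shift
int64_t extract_s(int64_t x, unsigned pos, unsigned len)
{
    return static_cast<int64_t>(static_cast<uint64_t>(x) << (64 - pos - len)) >> (64 - len);
}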

slsrai, slsrli, deps, depc, depa instruction format
bits 41…0: opcode | ra | rb | shift | count | opx

The dep (deposit) instruction combines the count least significant bits from the first operand register with the remaining bits from the second operand register. dep takes the high bits from the second and the count low bits from the first. The first argument is the result register number, the second and third are the numbers of the combined registers, and the fourth parameter count is an immediate from 0 to 63 indicating the portion size taken from the first merged register.

dep instruction format
bits 41…0: opcode | dst | src | src | shift | pos

Direct deposit instructions copy from one register to another while changing part of the register: deps (deposit set) inserts a block of ones, depc (deposit clear) inserts a block of zeros, depa (deposit alter) inverts a block of bits. The block has a length of count bits and is located after the first shift bits. If the value of count+shift is greater than the register size (64 bits), the filling with ones/zeros or the inversion continues from the beginning of the register. The first argument is the result register number, the second is the source operand register number, the third and fourth are the immediate values shift and count from 0 to 63.
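
A minimal C++ sketch of these three instructions, including the wrap-around behavior described above:

#include <cstdint>

// count bits starting at bit `shift`, wrapping past bit 63
static uint64_t block_mask(unsigned shift, unsigned count)
{
    uint64_t m = 0;
    for (unsigned i = 0; i < count; ++i) m |= 1ull << ((shift + i) & 63);
    return m;
}

uint64_t deps(uint64_t src, unsigned shift, unsigned count) { return src |  block_mask(shift, count); }   // insert ones
uint64_t depc(uint64_t src, unsigned shift, unsigned count) { return src & ~block_mask(shift, count); }   // insert zeros
uint64_t depa(uint64_t src, unsigned shift, unsigned count) { return src ^  block_mask(shift, count); }   // invert bits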

The rlmi instruction extracts a portion of bits of a given length/position from the register and puts it at the specified position in the result register.

rlmi instruction format
bits 41…0: opcode | dst | src | shift | count | pos

§ 3.6. Load/store instructions

The 1st group of general-purpose register load/store instructions uses the base plus offset addressing mode. The first argument is the number of the loaded (stored) register target, the second is the base register number, the third is a 21-bit immediate offset disp. The instructions in this group allow continuation of the immediate value disp into the next slot of the bundle, forming a dual-slot instruction (63-bit offset). The offset disp, after sign extension, is added to the base register to produce a 64-bit effective address.

EA = gr[base] + sign_extend(disp)
Format of load/store instructions with basic addressing
bits 41…0: opcode | target | base | disp21
bits 83…42: continued disp (63 bits instead of 21)

The 2nd group of general-purpose load/store instructions uses the ip-relative addressing. The first argument is the number of the loaded (or stored) register target, second is an unsigned forward offset disp with a length of 28 bits. The instructions in this group allow continuation of the immediate value disp in the instruction code for the next slot of the bundle with the formation of a dual-slot instruction (64 bit offset).

EA = ip + zero_extend(disp)
Format of load/store instructions with ip-addressing
bits 41…0: opcode | target | uimm28
bits 83…42: 0 | continued disp (64 bits instead of 28)

The 3rd group of general register load/store instructions uses the scaled indexed addressing mode. The first argument is the number of the loaded or stored register target, the second is the base register number base, the third is the index register index, next is the shift amount scale, and the last is a short 9-bit offset disp, sign-extended to 64 bits. The instructions in this group allow continuation of the immediate value disp into the next slot of the bundle, forming a dual-slot instruction (51-bit offset).

EA = gr[base] + (SM(gr[index]) << scale) + sign_extend(disp)
Scaled indexed instructions format
bits 41…0: opcode | target | base | index | scale | sm | disp
bits 83…42: continued disp (51 bits instead of 9)

The 4th group of load/store instructions use base addressing with base updating after usage by the immediate stride. Arguments: target register, base register, signed immediate stride (10 bits). The instructions in this group allow continuation of the immediate value stride in the instruction code for the next slot of the bundle with the formation of a dual-slot instruction (52 bit offset).

For load: (ld[s]Nmia):

EA = gr[base]
tmp = MEM(EA)
gr[base] = gr[base] + sign_extend(stride)
gr[target] = tmp

For store: (stNmia):

EA = gr[base]
MEM(EA) = gr[target]
gr[base] = gr[base] + sign_extend(stride)

The 5th group of load/store instructions use base addressing with base updating before usage by the immediate stride. Arguments are same as for post-update.

For load: (ld[s]Nmib):

EA = gr[base] + sign_extend(stride)
tmp = MEM(EA)
gr[base] = EA
gr[target] = tmp

For store: (stNmib):

EA = gr[base] + sign_extend(stride)
MEM(EA) = gr[target]
gr[base] = EA
Format of load/store instructions with immediate update
bits 41…0: opcode | target | base | stride | opx
bits 83…42: continued stride (52 bits instead of 10)

The signed immediate disp is added to base to form an effective base. The signed immediate stride (non-zero, 10-bit) is added to base to form the new base. For loads, if target is the same register as base, either the base update doesn't occur or the loaded value replaces the updated base. For stores, if target is the same register as base, the base update occurs after the old value is stored to memory.

Table 3.6: Load/store instructions
Size in bytes Operation Description, parameters
1 2 4 8 16
ldbz ldhz ldwz lddz ldq load base with offset addressing:
INSN target,base,disp21
ldbs ldhs ldws ldds load signed
stb sth stw std stq store
ldbzr ldhzr ldwzr lddzr ldqr load ip-relative addressing:
INSN target,disp28
ldbsr ldhsr ldwsr lddsr load signed
stbr sthr stwr stdr stqr store
ldbzx ldhzx ldwzx lddzx ldqx load scaled indexed addressing:
INSN target,base,index,scale,disp
ldbsx ldhsx ldwsx lddsx load signed
stbx sthx stwx stdx stqx store
ldbzmia ldhzmia ldwzmia lddzmia ldqmia load base update with immediate stride after memory access:
INSN target,base,stride
ldbsmia ldhsmia ldwsmia lddsmia load signed
stbmia sthmia stwmia stdmia stqmia store
ldbzmib ldhzmib ldwzmib lddzmib ldqmib load base update with immediate stride before memory access:
INSN target,base,stride
ldbsmib ldhsmib ldwsmib lddsmib load signed
stbmib sthmib stwmib stdmib stqmib store

§ 3.7. Branch instructions

Unconditional branch instructions jump to the effective address. Additionally, the return address can be stored in a general register. Using predication can turn an unconditional jump into a conditional one.

Instruction | Operation | Description
jmp    label
    jump relative: ip-relative jump
jmpr   rb,rc
    jump register indirect: base-relative jump
jmpth  rb,rc
    jump table with halfword offsets: jump to a relative address using a table offset
jmptw  rb,rc
    jump table with word offsets: jump to a relative address using a table offset

Relative branch forms are universal instructions for conditional or unconditional static branches or procedure calls to a relative address.

Relative branch instructions are generated according to the LDAR rule. After the operation code, there is a register for saving a possible return address, and a 28-bit field for encoding the offset (with a sign) relative to ip. This gives a maximum distance of ±2 GiB in both directions from the current position for a one-slot instruction and all available address space for a long instruction. The jmp instruction allows the continuation of the immediate offset in the instruction code to the next slot of the bundle with the formation of a dual-slot instruction.

ip = ip + 16 × simm

Instruction format jmp
bits 41…0: opcode | opx | simm (28 bits)
bits 83…42: 0 | extended label (60 bits instead of 28)

The instruction jmpr (jump register indirect) is used to branch to a base address held in a register. When calculating the target address, jmpr discards the 4 least significant bits of the result, so the resulting address is always aligned to the beginning of a bundle.

ip = (gr[base] + gr[index]) & mask{63: 4}

Instruction format jmpr
bits 41…0: opcode | 0 | base | index | 0 | opx

The jmpth, jmptw (jump table) instructions are intended for implementing table-driven select statements (the C switch statement with a dense distribution of case values, preferably starting from zero). Traditionally, in most architectures, a table-driven switch uses a table of absolute addresses to store the entry points of the case code. Such a table is private for each process (if the loader's base code address differs). If the architecture supports relative addressing, the table of absolute addresses can be replaced by a table of relative offsets, shared by all processes and placed in the read-only data section. The idea is from the HP PA-RISC architecture.

Instruction format jmpth, jmptw
bits 41…0: opcode | 0 | base | index | 0 | opx

EA = gr[base] + {2 | 4} × gr[index]

EA = mem {2 | 4} [EA]

ip = ip + 16 × EA

.text
; limit = 7
 bdgti  selector, limit, default
 ldafr  base, table
 jmptw  base, selector
label_0:
...
label_1:
...
...
label_7:
...
default:
...
.rodata
table:
dw secrel (label_0) - secrel (label_0) + 1,
...,
dw secrel (label_7) - secrel (label_0) + 1

The instructions for the conditional branch calculate the condition and (if the condition is true) jump to the effective address. Traditionally (x86, x64, SPARC), conditional branch is implemented using two instructions – comparison (with the generation of flags of the logical result) and conditional branch (by flags). However, conditional branches are very common in programs. Therefore, POSTRISC uses combined compare and conditional branch instructions to compress code and shorten the critical data path.

Table 3.8: Conditional branch instructions
Instruction Operation
bdeq     ra, rb, label
branch if doubleword equal
bdne     ra, rb, label
branch if doubleword not equal
bdlt     ra, rb, label
branch if doubleword less than
bdltu    ra, rb, label
branch if doubleword less than unsigned
bdle     ra, rb, label
branch if doubleword less than or equal
bdleu    ra, rb, label
branch if doubleword less than or equal unsigned
bdgt     ra, rb, label
branch if doubleword greater than
bdgtu    ra, rb, label
branch if doubleword greater than unsigned
bdge     ra, rb, label
branch if doubleword greater than or equal
bdgeu    ra, rb, label
branch if doubleword greater than or equal unsigned
bdeqi    ra, simm, label
branch if doubleword equal immediate
bdnei    ra, simm, label
branch if doubleword not equal immediate
bdlti    ra, simm, label
branch if doubleword less than immediate
bdgti    ra, simm, label
branch if doubleword greater than immediate
bdltui   ra, uimm, label
branch if doubleword less than unsigned immediate
bdgtui   ra, uimm, label
branch if doubleword greater than unsigned immediate
bweq     ra, rb, label
branch if word equal
bwne     ra, rb, label
branch if word not equal
bwlt     ra, rb, label
branch if word less than
bwltu    ra, rb, label
branch if word less than unsigned
bwle     ra, rb, label
branch if word less than or equal
bwleu    ra, rb, label
branch if word less than or equal unsigned
bwgt     ra, rb, label
branch if word greater than
bwgtu    ra, rb, label
branch if word greater than unsigned
bwge     ra, rb, label
branch if word greater than or equal
bwgeu    ra, rb, label
branch if word greater than or equal unsigned
bweqi    ra, simm, label
branch if word equal immediate
bwnei    ra, simm, label
branch if word not equal immediate
bwlti    ra, simm, label
branch if word less than immediate
bwgti    ra, simm, label
branch if word greater than immediate
bwltui   ra, uimm, label
branch if word less than unsigned immediate
bwgtui   ra, uimm, label
branch if word greater than unsigned immediate
bbs      ra, rb, label
branch if bit set
bbsi     ra, shift, label
branch if bit set immediate
bbc      ra, rb, label
branch if bit clear
bbci     ra, shift, label
branch if bit clear immediate
bmall    ra, uimm, label
branch if mask all bits set
bmany    ra, uimm, label
branch if mask any bit set
bmnone   ra, uimm, label
branch if mask none bit set
bmnotall ra, uimm, label
branch if mask not all bit set

Relative branch instructions are formed according to the rules BRC, BRCI, BRCIU, BBIT. After the operation code come the first compared register, the second compared register (or an immediate shift amount), and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. In the case of a long instruction, the maximum distance increases to ±8 GiB on both sides of the current position.

The format of bdeq, bdne, bdlt, bdle, bdltu, bdleu, bbs, bbc
bits 41…0: opcode | srcA | srcB | opx | label17
bits 83…42: 0 | label30

Instructions bgt (blt), bge (ble), bgtu (bltu), bgeu (bleu) are pseudo-instructions: the assembler swaps the order of the arguments and reduces them to the corresponding «less» instructions.

Format of instructions bbci, bbsi
bits 41…0: opcode | src | shift | opx | label17
bits 83…42: 0 | label30

Relative branch instructions that compare with an immediate. After the operation code come the compared register, the immediate constant, and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. In the case of a long instruction, the maximum distance increases to ±8 GiB on both sides of the current position.

The format of the instructions is bdeqi, bdnei, bdlti, bdgti
bits 41…0: opcode | src | simm11 | label17
bits 83…42: simm40 | label30
Instruction format bdltui, bdgtui
bits 41…0: opcode | src | uimm11 | label17
bits 83…42: uimm40 | label30

The loop control instructions optimize (by shortening the critical execution path) the most common forms of loops with a constant step. A loop control instruction adds the step (1 or -1, depending on the form) to the loop counter (the first argument register), checks the loop continuation condition (compares the counter with the second argument register), and, if the condition is true, performs a relative branch to the effective address (the label argument).

Format of instructions like rep* (register-register comparison)
bits 41…0: opcode | dst/src | src | opx | label17
bits 83…42: 0 | label30

Syntax:

INSTRUCTION_NAME ra, rb, label

A variant of the loop control instructions in which both register numbers are the same is a special case. The architecture defines that in this case the old register value participates in the comparison (as the boundary of the counter change). This can be used, for example, for branches that should occur on overflow.

Table 3.9: Loop control instructions
Instruction Operation
repdlt | Add 1 and branch if doubleword less than (signed)
repdltu | Add 1 and branch if doubleword less than (unsigned)
repdle | Add 1 and branch if doubleword less than or equal (signed)
repdleu | Add 1 and branch if doubleword less than or equal (unsigned)
repdgt | Add -1 and branch if doubleword greater than (signed)
repdgtu | Add -1 and branch if doubleword greater than (unsigned)
repdge | Add -1 and branch if doubleword greater than or equal (signed)
repdgeu | Add -1 and branch if doubleword greater than or equal (unsigned)
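
At the C level, a single repdlt at the bottom of a loop corresponds to the following add-compare-branch step (a sketch of the described behavior, not the emulator's code):

#include <cstdint>

// what "repdlt counter, bound, label" does in one machine instruction
bool repdlt_step(int64_t& counter, int64_t bound)
{
    counter += 1;                  // step is +1 for the "less" forms
    return counter < bound;        // true means: branch back to the loop head
}

void example(int64_t n)
{
    int64_t i = 0;
    do { /* loop body */ } while (repdlt_step(i, n));   // one instruction per iteration for loop control
}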

A similar style of loop implementation with minimal software overhead is found on almost all DSPs (digital signal processors). Among general-purpose processors, a limited form (with a special iteration counter register) is implemented in the IBM PowerPC and Intel Itanium architectures, and universal add-compare-branch instructions on general-purpose registers are available in the HP PA-RISC architecture (instructions addb, addib), in the DEC VAX architecture (aobleq, aoblss, sobgeq, sobgtr), and in the IBM S/390 architecture (brct, bctr, bxle).

§ 3.8. Miscellaneous instructions

Instruction ldi (load immediate) loads a constant into a register (the high 64 bits are cleared). The first argument of ldi is the result register number, the second is the immediate value: 28 bits long (for the short form, sign-extended to 64 bits) or a full 64 bits (for a dual-slot instruction).

Instruction ldih (load immediate into high 64 bits) loads a constant into the upper part of the 128-bit register (the lower 64 bits remain unchanged). The first argument is the result register number, the second is the immediate value: 28 bits long (for the short form, sign-extended to 64 bits) or a full 64 bits (for a dual-slot instruction).

INSTRUCTION_NAME dst, simm
Instruction format ldi, ldih
bits 41…0: opcode | dst | simm28
bits 83…42: 0 | simm (extended to 64 bits)

The instruction nop (no operation) is intended for a single purpose: code alignment, to fill missing slots in instruction bundles and to enable optimal instruction fetch from memory.

For example, to place a label in the code, the compiler should pad the last (incomplete) bundle with dummy instructions and put the first instruction after the label in a new bundle (since a branch is possible only to the beginning of a bundle). Or, for example, some implementations may gain performance if the destination address of a frequently taken jump is aligned on a 32/64/128-byte boundary (not just the beginning of a bundle, but the beginning of a cache line).

This instruction should not be used for any other purpose. The architecture doesn't contain software delays when loading data (load delays), conditional branch (branch delays), pipeline delays (pipeline hazards).

The nop instruction is processed at the fetch stage, but may not be fed to the following pipeline stages (issue, retire) and never itself causes an interrupt (detect stage). This instruction has no read or write dependencies.

The nop instruction is automatically added by the assembler to fill an incomplete instruction bundle when the next instruction must be placed in a new bundle (in the case of a label or a dual-slot instruction). The instruction has one immediate argument (unused).

Instruction format nop
bits 41…0: opcode | opx | simm28
bits 83…42: 0 | extended simm (64 bits instead of 28)

Undefined instruction codes are reserved and can be used for future extensions (new instructions). But one instruction, undef, is permanently defined as reserved. It can be added automatically by the assembler to fill an incomplete instruction bundle after an unconditional jump, a function call, or a return from a function. It is also used to fill the tail of code segments. The instruction has one immediate argument (unused).

Instruction format undef
bits 41…0: opcode | 0 | opx

Chapter 4. The software exceptions support

Software exceptions support C++-like throw/try/catch exceptions and more general SEH-like exceptions. POSTRISC is planned to support deterministic exception handling via frame-based unwinding with sufficient hardware support. Truly zero cost is expected for the no-exception path, and fast unwinding for the exception path.

The 128-bit link register r0 preserves an 18-bit eip offset, which provides an alternate return point in case of an exception. The exception landing pad address must follow the current return address by no more than 4 MiB. The return instructions may jump to the usual return address or to the landing pad, depending on the exception state.

Link register format r0
low doubleword, bits 63…4: return address; bits 3…0: 0
high doubleword: preserved | caller future | eip offset | out-size | framesize

alternate_retaddr = retaddr + 16 × ZEXT(eip_offset)

§ 4.1. Program state for exception

The special register eip always holds the address of the next proper part of the unwinding code. This register is automatically restored during a normal subroutine return. It is modified during object construction and destruction. The special register eca holds the thrown value (usually the address of the thrown object).

Two return addresses are saved to the link register during a subroutine call: one for the normal return and one for the exception return. Because registers are 128 bits long, there is enough room for both. But because frame info and the previous future vector must also be stored, the exception return address is stored as a positive offset from the normal return address. The exception landing pad should be located within 4 MiB after the function body.

So we don't need to return an optional pair (normal return value plus optional exception info) and check after each call for a possible software exception. A subroutine that raised an exception finally returns directly to the proper next part of the unwinding code.

The instruction ehthrow sets the special register eca to the value (gr[src] + simm21). Usually this is the address of the exception context. Execution then jumps to the eip address.

Instruction format ehthrow
bits 41…0: opcode | opx | src | simm21

The instruction ehadj should be called after the successful construction of an object which requires destruction. It checks the current eca context and jumps to the current eip if an exception is set. Otherwise, it adjusts the eip register to the new actual unwinding code address and continues normally to the next instruction.

Instruction format ehadj
bits 41…0: opcode | opx | simm (28 bits)

The instruction ehcatch copies the exception context eca to a general register, clears eca, and adjusts the new eip value to ip + offset × 16.

The instruction ehcatch should be called before a catch block or before an object destructor. For a catch block it adjusts the eip register to the end of the catch block. Before an object destructor it adjusts the eip register to the position after the destructor.

Instruction format ehcatch
bits 41…0: opcode | opx | dst | 0 | label17 (30)

The instruction ehnext should be called after an object destructor. It restores the exception context saved by ehcatch before the destructor call and checks for a possible double exception fault. If this is a second software exception raised while unwinding the first one, a hardware exception occurs. Otherwise, if this is a normal destructor call during the unwinding of the first software exception, execution continues at the new eip address. Otherwise, if this is a normal destructor call without any unwinding, execution continues with the next instruction.

Instruction format ehnext
bits 41…0: opcode | opx | src | 0 | label17 (30)

Chapter 5. The register stack

§ 5.1. Registers rotation

Traditionally (in most architectures) the register file is a global resource, where all registers are visible to all program procedures. If a procedure wants to use a register, the contents of this register must be stored to memory and later restored from memory. The work of saving/restoring registers is usually divided between the procedure that makes the call (caller) and the one that is called (callee). For example, the caller may be required to save the first 14 of 32 registers, while the callee must save the remaining 18. The optimal split depends on the processor architecture: the number of registers and their universality (orthogonality); for new architectures it is usually determined experimentally by comparing code effectiveness for different variants, based on the analysis of a statistically large codebase.

Within one procedure, you can optimize the use of registers well, but in the case of several procedures, and especially if they are compiled separately, the use of register resources becomes suboptimal. A typical example of extreme inefficiency is recursive procedures. Even if the recursive procedure uses only one of the N available registers, each recursive call to such a procedure wants to use exactly this specific register, therefore, this register is repeatedly spilled/filled, despite the presence of many unused registers.

Summing up these arguments, we can say that a significant percentage of all memory accesses are the operations of spilling/filling registers, which in essence are not related to useful work. This fraction is not very dependent on the total number of registers due to the binding of the procedure code to specific registers. So an increase in the number of registers, although it helps to improve the efficiency of large and complex procedures, doesn't help in any way to reduce inter-procedure save and restore registers. This proportion grows with an increase in the number of procedures and a decrease in their average size (as is usually the case for object-oriented programming languages).

The solution to this problem is hardware register rotation. The registers are no longer a global resource. Each called procedure gets its own working subset of registers. Register saving/restoring is not required as long as the working sets of the nested called procedures fit in the register file.

For example, in the POSTRISC architecture, the file of 128 general-purpose registers visible to a procedure is divided into two subsets: up to 120 rotating (stackable) registers r0 - r119 (locally visible only to the current procedure) and 8 static registers g0 - g4, tp, sp, gz (globally visible to all procedures). The register stack mechanism is implemented through the circular renaming of registers as a side effect of procedure calls and returns. The renaming mechanism is not visible to the program. There are 128 rotating registers in the hardware circular buffer, and the power-of-two size makes the cyclic position arithmetic easy. In total, the general-purpose register file has 136 (128 + 8) registers, of which at most 128 are simultaneously available to the program.

Static registers must be saved and restored at procedure boundaries in accordance with the software conventions (ABI). Stackable registers are automatically saved and restored by the corresponding hardware mechanism without explicit participation of the program. All other register files are visible to all procedures and must be saved/restored programmatically in accordance with the software conventions.

Table 5.1: General Purpose Registers (Hardware Model)
circular register buffer (128) Global (8)
local A not available global
not available local B not available global
not available local C not available global
local D (cont) not available local D global
not available local E not available global

The above diagram shows the process of using the hardware buffer of local rotating registers. Five procedures A, B, C, D, E call each other, pass call arguments through the register buffer, and place their local variables in the buffer. As the hardware circular buffer is exhausted (in procedure D), registers are flushed onto the stack in memory and the buffer is reused from the beginning. Of course, not the entire buffer is flushed, only as many registers as are needed to create the new frame.

Table 5.2: Register buffer (dividing into parts, looped back)
clean dirty local invalid clean

In general, the register buffer contains the following five parts (order matters):

Table 5.3: Register ring buffer parts
Part Description
clean : these registers belong to inactive frames and have already been flushed to the stack in memory, but have not yet been reused by other frames (applicable if registers are spilled to memory or refilled from memory ahead of time)
dirty : these registers belong to inactive frames and have not yet been flushed to the stack in memory (mandatory spilling to memory is required before they can be used by other frames)
local : these are the local registers of the active frame
invalid : garbage left over from past procedure calls, or registers that have never been used (can be used to expand the current active frame, to create a new active frame, or to expand the clean-register zone when returning from procedures or when reading registers from memory ahead of time)
Table 5.4: General purpose registers (visible to the program model)
Local (120) Global (8)
local A | not available | global
local B | not available | global
local C | not available | global
local D | not available | global
local E | not available | global

Each procedure «sees» only its local registers, and the first physical local register is visible under the logical number r0.

The diagram below shows an example of working with the register stack. First, we have a register frame of 17 registers (r0 - r16) for the current function. The function uses the last 5 of them (r12 - r16) to place the parameters for calling the next function. On the call, the return address falls into the first parameter register (r12), together with the number of preserved registers and the output frame size (12, 5): these two numbers can be packed together with the return address in the link register. This register number for the return address, as the boundary between the preserved registers and the output parameters, is indicated in the call instruction.

After the call, the second function has at its disposal a register frame of 5 registers. The return address is visible in the register r0. Then the second function expands its register frame to the required number of registers for local computing (up to 10 registers).

After completion of work, the second function restores the saved part of the frame of the first function, and gives the parameter registers back to it. The number of registers to be returned is indicated in the instructions for returning from the function, and, according to ABI, it must match the number of incoming parameter registers.

physical numbering | caller function registers | callee registers immediately after the call (input parameters) | callee extends the register frame | caller registers after returning
0…3 | hidden | hidden | hidden | hidden
4…15 | r0…r11 | not available | not available | r0…r11
16…20 | r12…r16 | r0…r4 | r0…r4 | r12…r16
21…25 | not available | not available | r5…r9 | not available
26…30 | not available | not available | not available | not available

The special register rsc (register stack control) stores information about the status of the circular register buffer and about the current active frame of local general purpose registers.

Register format rsc
63 … 0
0 ndirty soc bof sof

Four fields, hidden from direct access, store the positions and sizes of the portions of the rotation register buffer. Their widths depend on the implementation (on the register ring buffer size), except for sof, whose width is always 7 bits. For example, for a buffer of 128 registers each position takes 7 bits, for 256 registers it takes 8 bits. The sof field (size of frame) is the size of the last active frame (possibly empty). The bof field (bottom of frame) is the position in the buffer of the beginning of the last active frame and the border with the dirty section. The soc field (size of clean) is the size of the clean section. The ndirty field is the number of dirty registers.
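As a rough illustration of these fields, the following C sketch models an rsc value for a 128-register implementation (7-bit positions). The exact bit positions of ndirty, soc and bof are implementation-defined and are assumed here only for the example; only the 7-bit width of sof is architectural.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical packing for a 128-register buffer: four 7-bit fields,
 * lowest field first. Real implementations may place them differently. */
enum { RSC_FIELD_BITS = 7, RSC_FIELD_MASK = (1u << RSC_FIELD_BITS) - 1 };

static unsigned rsc_sof(uint64_t rsc)    { return (rsc >>  0) & RSC_FIELD_MASK; } /* size of frame   */
static unsigned rsc_bof(uint64_t rsc)    { return (rsc >>  7) & RSC_FIELD_MASK; } /* bottom of frame */
static unsigned rsc_soc(uint64_t rsc)    { return (rsc >> 14) & RSC_FIELD_MASK; } /* size of clean   */
static unsigned rsc_ndirty(uint64_t rsc) { return (rsc >> 21) & RSC_FIELD_MASK; } /* dirty registers */

int main(void) {
    uint64_t rsc = (10u << 21) | (4u << 14) | (30u << 7) | 17u;  /* example value */
    printf("sof=%u bof=%u soc=%u ndirty=%u\n",
           rsc_sof(rsc), rsc_bof(rsc), rsc_soc(rsc), rsc_ndirty(rsc));
    return 0;
}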

The special register stack pointer (rsp) contains the memory address where the next local register should be saved when the hardware circular register buffer overflows. Since this address must be aligned on the 16-byte register size boundary for register spilling/filling, the lower 4 bits of rsp are always zero, and writes to them are ignored. A specific architecture implementation may spill/fill registers in aligned groups of 2-16 registers at a time to optimize memory traffic, so additional least significant bits of the register may also be fixed at zero.

Register format rsp
63 … 0
address 0 0

In the Berkeley RISC research project, where register rotation was first applied (?), only eight of the 64 existing registers were visible to the program. The full set of 64 registers is called the register file, and a visible portion of eight registers a register window. The file allows up to eight nested procedure calls, each with its own set of registers. As long as the program's call chain is no deeper than eight calls, the registers never have to be stored in RAM, which is very slow compared to register access. For many programs, a chain of six calls is enough.

A direct descendant of the Berkeley RISC project is Sun Microsystems' SPARC (UltraSPARC) architecture. Compared to the prototype, this processor provides simultaneous visibility of four sets of eight registers each (32 simultaneously visible registers). Of these, 8 are global and 24 are windowed. Three sets of eight registers each are implemented as a «register window». The eight registers i0-i7 are the inputs of the current procedure, the eight registers l0-l7 are local to the procedure of the current level, and the eight registers o0-o7 are the outputs for calling the next-level procedure. When a new procedure is called, the register window shifts by sixteen registers, hiding the old input and local registers and making the output registers of the current procedure the input registers of the new procedure. Additionally, eight registers g0-g7 are globally visible to procedures at all levels.

The size of the frame and the number of output registers, unfortunately, are fixed in SPARC. It's also unfortunate that flushing registers pushed out of the window buffer into memory is implemented through interrupts, and that the spill area is not separated from the regular stack of automatic objects.

In the AMD 29000 architecture (64 global and 128 window visible registers), the register rotation design was further refined with variable-sized windows, which helps resource utilization in the general case, when fewer than eight registers are needed to call the procedure. A second separate stack for saving registers was also implemented.

Register rotation was used in the architecture of Intel 80960 (i960) processors for embedded applications (32 visible registers, of which 16 global and 16 windowed, with a fixed rotation step of 16 registers).

The last known implemented processor that uses register rotation is the Intel Itanium (IA-64 architecture). It has 128 registers, of which 32 are static and 96 are windowed. It is possible to set a frame of any size from 0 to 96 registers with any number of output registers. To spill registers into memory without interrupting the processor, an asynchronous hardware mechanism is implemented. The spill occurs on a separate (second) stack, which grows towards the main stack and is not explicitly visible to the user program. Both stacks share the same memory array.

Register rotation is also applied in the architecture of the educational processor MMIX, which replaced the legacy MIX processor in examples for new editions of Donald Knuth's book «The Art of Computer Programming». The MMIX architecture has a register file of 256 registers visible to the program, allows a variable window size of visible registers, and even allows changing the boundary between the global and rotated registers dynamically, which is usually not done in real architectures.

§ 5.2. Call/return instructions

Because the POSTRISC architecture uses hardware register rotation, the execution of call/return instructions is closely related to the operation of the circular buffer of local registers. When a procedure is called, the current frame of local registers is partially saved; when returning from the procedure, the previous frame is restored.

The POSTRISC may be extended in the future with big-SIMD facilities (256 or even 512 bit) using register pairs/groups for SIMD. Such SIMD register pairs/groups shouldn't cross a register frame boundary. The register frame base (bottom of frame) and the preserved frame size should be a multiple of the register pair/group size (2 or 4) to guarantee SIMD register pair/group alignment. The link info may be stored only in an even (or multiple of 4) register to guarantee register pair alignment. Currently, only 2-register alignment is required for the frame size.

The procedure call instructions callr, callri, callmi, callmrw, and callplt perform similar actions. They differ only in the way the target call address is computed.

The first argument of all call instructions is the register in which the return address and other link info will be stored. All local registers from r0 up to (but not including) this register will be hidden after the register window rotation. All currently allocated local registers starting from the register specified in the instruction, with greater numbers, become the initial frame of the new procedure.

Then, the branch effective address is calculated (differently for different instructions). The return address along with the current frame info is stored in the return register.

Then the register window is rotated, the frame of local registers is partially saved, and the branch to the target address is performed. The new procedure always sees its return address and previous frame info in the first rotated register r0, and the input parameters in the following registers r1, r2 ...
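The following C sketch is a rough, non-normative model of the window rotation just described, for a 128-register ring. Spilling/filling to memory and the packed link-info encoding are omitted; the out-register count given to the return is taken from the stored link info rather than from the instruction, since the ABI requires them to match.

#include <stdio.h>

/* Toy model of register window rotation on call/return (ring of 128). */
enum { RING = 128 };

struct frame_state { unsigned bof, sof; };       /* bottom and size of active frame            */
struct link_info   { unsigned preserved, out; }; /* what a call packs into the link register   */

/* call with link register number d: registers 0..d-1 are preserved (hidden),
 * registers d..sof-1 become the callee's initial frame starting at r0 */
static struct link_info do_call(struct frame_state *s, unsigned d) {
    struct link_info li = { d, s->sof - d };
    s->bof = (s->bof + d) % RING;
    s->sof = s->sof - d;
    return li;
}

/* return: unhide the preserved part and give the out registers back */
static void do_ret(struct frame_state *s, struct link_info li) {
    s->bof = (s->bof + RING - li.preserved) % RING;
    s->sof = li.preserved + li.out;
}

int main(void) {
    struct frame_state s = { 0, 17 };             /* caller frame r0..r16        */
    struct link_info li = do_call(&s, 12);        /* call with link register r12 */
    printf("callee frame: %u regs\n", s.sof);     /* 5                           */
    s.sof = 10;                                   /* callee: alloc 10            */
    do_ret(&s, li);
    printf("caller frame after return: %u regs\n", s.sof); /* 17                 */
    return 0;
}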

Link register format r0
63 … 0
return address 0
preserved caller future eip offset out-size framesize

If the call instruction is the last in the bundle, it saves the return address as a pointer to the next bundle after the current one, and the stored slot number ri is set to zero. If the call instruction isn't the last in the bundle, then the current bundle address and the next slot number ri are saved.

In the general case, returning to the middle of a bundle may be less optimal but saves code size. The processor fetches and executes the whole bundle anyway, but nullifies the execution of the first ri instructions. For better performance, the bundle before the call instruction may be padded with dummy nop instructions to shift the call instruction to the end of the bundle. There are corresponding compiler command line parameters to choose between «dense» and «aligned» calls.

For example, dense calls:

ldi %r33, 1234    ; r33 is future r1 (param for myfunc)
callr %r32, myfunc    ; r32 is future r0 (link info)
callr %r32, myfunc2
callr %r32, myfunc3

For example, aligned call:

ldi %r33, 1234    ; r33 is future r1 (param for myfunc)
nop   0
callr %r32, myfunc    ; r32 is future r0 (link info)
; this is next bundle and aligned return address
add %r34, %r12, %r12    ; next instruction after return from myfunc
sub %r14, %r22, %r11
Table 5.6: Instructions for calling procedures, managing the register frame and the regular stack
Instruction Description
callr   dst,label
ip-relative call
callri  dst,base,index
call register indirect
callmi  dst,base,disp11
memory-indirect call, base addressing
callmrw dst,base,disp11
memory-indirect call, word, base relative addressing
callplt dst,uimm28
call procedure linkage table: indirect, relative addressing
alloc   framesize
allocate register stack frame
allocsp framesize,uimm21
allocate register stack frame, update SP
ret
return from the subroutine
retf    uimm21
return from the subroutine, update SP

The instruction callr (call relative) makes a procedure call using ip-relative addressing with a 28-bit signed immediate offset. This gives a maximum distance of ±2 GiB from the current position for a one-word instruction. A long form of the instruction is also implemented.

Instruction format callr
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst simm (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 simm (60 bits)

EA = ip + 16 × simm

call (EA)

The instruction callri (call register indirect) takes the procedure call address from registers. The branch address is calculated as base plus index. The callri instruction discards the 4 least significant bits of the address, so the call address is always aligned at the beginning of a bundle.

Instruction format callri
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base index 0 opx

EA = (gr[base] + gr[index]) & mask {63:4},

call (EA)

The instruction callmi (call memory indirect) takes the callee address from memory using base+displacement addressing. The instruction discards the 4 least significant bits of the loaded value, so the resulting address is always aligned at the beginning of a bundle. The instruction is intended for loading from an address table with additional checks for the finalized state of the virtual page. The vtables should be relocated by the linker and marked as finalized to disable future access rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function pointer tables) with up to 1024 items.

Instruction format callmi, callmrw
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base simm10 opx

EA = gr[base] + sign_extend(simm10)

EA = mem8 (EA)

EA = EA & mask {63:4},

call (EA)

The instruction callmrw (call memory indirect relative word) takes the callee's relative offset from memory using base+displacement addressing. This offset is used to compute the callee address relative to the base address. The instruction discards the 4 least significant bits of the computed value, so the resulting address is always aligned at the beginning of a bundle. The instruction is intended for loading from an address table with additional checks for the finalized state of the virtual page. The vtables should be relocated by the linker and marked as finalized to disable future access rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function pointer tables) with up to 1024 items.

EA = gr[base] + sign_extend(simm10)

offset = mem4(EA)

EA = (base + offset) & mask {63:4},

call (EA)

The callplt instruction (call procedure linkage table) takes the call address from memory using ip-relative addressing. The instruction discards the 4 least significant bits of the loaded value, so the resulting address is always aligned at the beginning of a bundle. The instruction is intended for loading from an address table with additional checks for the finalized state of the virtual page. The import tables should be relocated by the linker and marked as finalized to disable future access rights changes (hardware-assisted one-way relro).

Instruction format callplt
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst uimm28

EA = ip + zero_extend(uimm28)

EA = mem8 (EA)

EA = EA & mask {63:4},

call (EA)
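The C sketch below restates the effective-address computations of the five call instructions, modeled directly on the EA formulas above. The flat memory array and the mem8()/mem4() helpers are placeholders standing in for real memory access.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* tiny flat memory image for the sketch */
static uint8_t mem[4096];
static uint64_t mem8(uint64_t ea) { uint64_t v; memcpy(&v, &mem[ea], 8); return v; }
static int32_t  mem4(uint64_t ea) { int32_t  v; memcpy(&v, &mem[ea], 4); return v; }

#define BUNDLE_MASK (~(uint64_t)0xF)   /* clear the 4 low bits: bundle alignment */

static uint64_t ea_callr  (uint64_t ip, int64_t simm)      { return ip + 16u * (uint64_t)simm; }
static uint64_t ea_callri (uint64_t base, uint64_t index)  { return (base + index) & BUNDLE_MASK; }
static uint64_t ea_callmi (uint64_t base, int64_t simm10)  { return mem8(base + (uint64_t)simm10) & BUNDLE_MASK; }
static uint64_t ea_callmrw(uint64_t base, int64_t simm10)  { return (base + (uint64_t)(int64_t)mem4(base + (uint64_t)simm10)) & BUNDLE_MASK; }
static uint64_t ea_callplt(uint64_t ip, uint64_t uimm28)   { return mem8(ip + uimm28) & BUNDLE_MASK; }

int main(void) {
    uint64_t target = 0x2040;          /* some bundle-aligned callee address */
    int32_t  rel    = 0x1040;          /* word-sized relative offset         */
    memcpy(&mem[0x100], &target, 8);   /* a function-pointer (vtable) slot   */
    memcpy(&mem[0x200], &rel, 4);      /* a relative-word table slot         */
    printf("callr:   %#llx\n", (unsigned long long)ea_callr(0x1000, 4));
    printf("callri:  %#llx\n", (unsigned long long)ea_callri(0x1000, 0x23));
    printf("callmi:  %#llx\n", (unsigned long long)ea_callmi(0x0F0, 0x10));
    printf("callmrw: %#llx\n", (unsigned long long)ea_callmrw(0x1F0, 0x10));
    printf("callplt: %#llx\n", (unsigned long long)ea_callplt(0x0E0, 0x20));
    return 0;
}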

The ret and retf instructions (return from subroutine) are used to return control from a procedure. They also restore the caller's register window state, and retf additionally rolls back a fixed-size stack frame.

Unlike other branch instructions, these instructions may use special hardware structures to predict the branch destination address. While the branch target buffer is generally used for branch address prediction, for ret instructions a hardware branch target stack (a short stack of saved return addresses) can additionally be implemented for better prediction accuracy.

While restoring the previous frame state, the ret instructions may load part or all of the previous frame from memory if necessary (when the circular hardware register buffer has overflowed). The instruction may return control before the recovery from memory is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet recovered from memory will be delayed until the recovery is performed (via the register scoreboard mechanism).

Instruction format ret
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0opx
Instruction format retf
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode opx 0 uimm21 (63)

The link register is an implicit argument for both ret instructions. It is the first local register of the current function and provides the return address and the previous frame info. The argument of retf is the displacement used for the optional stack rollback (may be 0). The instruction may cause an error if the link register contains broken frame info and there is no place in the local registers for the outgoing and preserved frame parts of the previous procedure, since the maximum frame size is 120 registers.

§ 5.3. Register frame allocation

Each callee procedure after a call obtains the remaining frame part of the calling procedure starting from the link register (the parameters and possibly slightly more). If the callee wants to increase the size of its register frame, it should use the alloc (allocate register stack frame) instruction. The first parameter of the instruction is the local register which will be the last in the frame of our procedure (from r0 to r119).

Instruction format alloc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode opx framesize 0

If there is not enough free space in the hardware buffer of rotated registers to accommodate the new frame, the alloc instruction flushes registers of previous functions' frames onto the register stack in memory. The instruction can return control before the flush is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet flushed to the stack will be delayed until the flush is done (through the register scoreboard mechanism).

The new eip is set up from reip. The reip register should point to a simple universal function epilog consisting of just a ret instruction. This epilog should live in the highest corresponding usermode/kernel region. The reip register should be set up during thread start.

The following minimal program for the virtual processor demonstrates the use of the callr, alloc and ret instructions.

.text
; at the beginning of the program, the register stack is empty
alloc  54   ; expand frame to 54 registers
ehadj  endfunc
ldi    %r47, 1  ; will be saved when called
ldi    %r53, 3  ; first argument
ldi    %r52, 2  ; second argument
ldi    %r51, 1  ; third argument
; func procedure call, all registers up to 50 will be saved,
; return address, eip, frame size (50) are saved in r50
callr  %r50, func
; at this point, after returning, the frame will be again 54
halt
func:
; at the starting point, the func procedure has a 4-register frame
; their previous numbers are 50, 51, 52, 53, new - 0, 1, 2, 3
; extend the frame to 10 registers (plus regs 4,5,6,7,8,9)
alloc  10
write  "r0 = %x128(r0)"    ; print packed return info
write  "r1 = %i64(r1)"    ; print 1st argument
write  "r2 = %i64(r2)"    ; print 2nd argument
write  "r3 = %i64(r3)"    ; print 3rd argument
ret
endfunc:
.end

Result of execution:

r0 = 000000010000c232_fffffffff1230020
r1 = 1
r2 = 2
r3 = 3

Here 0xfffffffff1230020 is the return bundle address; 0x0000c232 packs the previous frame size (50 registers), the output frame size (3 parameters plus link), and the offset between the return address and the previous eip exception return address (the endfunc label); 0x00000001 is the previous future mask, nonzero because callr is the middle one of 3 instructions in the bundle, so we return to the middle of the bundle and skip one instruction.

The instruction allocsp is introduced for code compression. Its function is similar to alloc, but it additionally pushes the regular stack: allocsp adjusts sp downward by the immediate size.

alloc    framesize
allocsp  framesize, uimm21
Instruction format allocsp
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode opx framesize uimm21 (63)

§ 5.4. The function prolog/epilog

The type of prolog/epilog depends on whether the function allocates local registers, whether it allocates a stack frame (fixed or variable size, smaller or larger than a page), and whether its instructions can generate exceptions.

In the examples below r1…r5 are arguments, and r6…r10 are optional local registers. The stack frame grows downward, from higher to lower addresses.

The simplest case is a function which doesn't allocate local registers (uses arguments only) and doesn't allocate a stack frame, and whose instructions don't generate software/hardware exceptions (never touch memory, divide, etc). Then it is enough to write just:

insn    # can't fail, can use only args r1..r5
...
insn    # can't fail, can use only args r1..r5
ret

The next function can generate software/hardware exceptions but doesn't allocate local registers (uses arguments only) and doesn't allocate a stack frame. In case of an exception, control can be transferred to eip, so we need a proper eip before execution.

The special register reip is introduced to avoid bloating the code with multiple copies of the standard universal epilog, which consists of only a ret instruction. It stores the address of such an epilog. The proper initialization of reip to an available standard universal epilog happens at runtime at thread start.

Each call instruction sets up eip from a copy of reip, so we don't need to worry about a proper eip just after a call. So even if instructions may fail, no additional setup is needed at the start of the function.

insn    # can fail, can use only args r1..r5
...
insn    # can fail, can use only args r1..r5
ret

The next function doesn't allocate a stack frame but allocates local registers. The alloc instruction here does the local register allocation. The register allocation may trigger register spilling to memory, so it may fail and raise a hardware exception. But again, because eip stores a copy of reip, we don't need to worry about eip.

alloc   11
insn    # can fail, can use r1..r5 and r6..r10
...
ret

The next function allocates local registers and a fixed-size stack frame. In this case we need to set a new eip, pointing to the label before the return, before execution, for proper traditional stack unwinding. The stack frame should be no bigger than the page size, so that we don't touch the page beyond the stack guard page.

std     %gz, %sp, -frame_size_immediate # touch new stack frame
allocsp 11, frame_size_immediate
ehadj   before_return   # immediately after allocsp
...
insn    # can fail, can use r1..r10
ldwz    %r7, %sp, +offset # using sp for local frame addressing
...
before_return:
addi    %sp, %sp, frame_size_immediate
ret

The next function allocates local registers and a fixed-size stack frame bigger than the page size, so proper guard-page extension via store probing is required.

# guard page probing for frame size bigger than pagesize
std     %gz, %sp, -page_size * 1
std     %gz, %sp, -page_size * 2
...
std     %gz, %sp, -page_size * n
# allocation only after probing
allocsp 11, frame_size_immediate
ehadj   before_return   # immediately after allocsp
...
insn    # can fail, can use r1..r10
ldwz    %r7, %sp, +offset # using sp for local frame addressing
...
before_return:
addi    %sp, %sp, frame_size_immediate
ret

The before_return block:

...
before_return:
addi    %sp, %sp, frame_size_immediate
ret

may be changed to one retf instruction:

...
before_return:
retf    frame_size_immediate

and, if there is space in the previous bundle, retf may be copied into it, so the before_return block can potentially be amortized across several functions with the same frame size:

...
retf    frame_size_immediate
before_return:
retf    frame_size_immediate

The next function allocates local registers and a variable-size stack frame (uses variable length arrays or the alloca function), possibly with an initial size larger than the page size. In this case we have 2 rollback points: one for a failure during local register allocation, and one for a failure during the initial stack allocation. The sp register can't be used to access the local stack frame (because of the variable frame size), so a local temporary register (r6 in the example) is used to save/restore the old sp value, addressed with negative offsets.

# optional guard page probing for frame size bigger than pagesize
std    %gz, %sp, -page_size * 1
std    %gz, %sp, -page_size * 2
...
std     %gz, %sp, -page_size * n
# allocation only after probing, r6 is allocated on the fly
allocsp 11, initial_frame_size_immediate
addi    %r6, %sp, initial_frame_size_immediate
ehadj   before_return   # immediately after saving fp in r6
...
insn    # can fail, can use r1..r10
ldwz    %r7, %r6, -offset # using r6 for local frame addressing

# alloca or VLA
# optional guard page probing for big frame size
std    %gz, %sp, -page_size * 1    
std    %gz, %sp, -page_size * 2
...
std    %gz, %sp, -page_size * m
# allocation only after probing
sub    %sp, %sp, additional_frame_size
# end of alloca or VLA

stw    %r7, %r6, -offset # using r6 for local frame addressing
...
before_return:
mov    %sp, %r6
ret

§ 5.5. The register stack system management

The single alloc instruction, together with the instructions for calling procedures and returning control, is in principle enough for user programs to work with the register stack. But for system software that handles interrupts, returns from interrupts, context switching, and register stack initialization, a few more instructions are needed.

The parameterless instruction rscover (register stack cover frame) is used to put the last (active) frame of the register stack into the dirty state (registers belonging to inactive procedure frames). After executing this instruction, the size of the active frame of local registers is zero. This instruction prepares the register stack for subsequent disconnection or switching.

The parameterless instruction rsflush (register stack flush) is used to flush all inactive frames of the register stack into memory (transferring them from the dirty state to the clean state). After executing this instruction, the register stack can be disabled without fear of data loss.

The parameterless instruction rsload (register stack load) is used to load from memory the last inactive frame of the register stack and make it ready for activation. After executing this instruction, the register stack is ready to work (a group of clean registers appears in it).

Instruction format rscover, rsflush, rsload
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0opx

§ 5.6. Calling convention

The ABI defines a standard relationship between functions, namely stack frame layout, register usage, and parameter passing.

The standard function call convention applies only to global functions. Local functions (not visible from other object files) may use another convention, as long as it doesn't prevent correct recovery after an exception.

The convention on register usage in standard function calls divides all the global registers available to the program into two categories: saved (preserved) and non-saved (scratch) registers.

Preserved registers are guaranteed to survive a procedure call. The called procedure (callee) guarantees that the contents of such a register are intact on a normal return: it either doesn't touch the register at all, or saves its contents somewhere and restores them before returning.

Non-saved (scratch) registers may not be preserved across a procedure call. The calling procedure (caller) must store the contents of such a register in memory (on the stack) or in another, preserved register if it doesn't want to lose the contents when calling the callee. The called function uses such registers for its needs without restriction.

The architecture provides 128 general purpose registers of 128 bits, and several 64/128 bit special purpose registers. General purpose registers are divided into global (static) and rotatable. The following table shows how registers are used.

Table 5.7: Saving registers when calling procedures
registers volatility
sp    stack pointer, saved. The address of the top of the stack must be aligned on a 16-byte boundary. It should always point to the last placed stack frame, growing down towards lower addresses. The word at this address always points to the previously placed stack frame. If required, it may be decreased by the called function. The stack top pointer must be updated atomically by a single instruction, to avoid any window in which an interrupt could observe a partially updated stack.
tp    thread pointer, saved. This register stores the base address of the TDATA segment of the main program module.
r0    link (communication) register, saved automatically by the register rotation mechanism.
r1-r32    used to pass parameters to the called function (not saved). Registers r1 and r2 hold the return value.

Static registers g0-g7 must retain their values across function calls. Functions that use these registers must save their values before changing them and restore them before returning.

External signals can interrupt the flow of instructions at any time. Functions called during signal processing have no special restrictions on their use of registers. Moreover, when the signal handling function returns control, the process resumes its work with correctly restored registers. Therefore, programs and compilers are free to use all the registers listed above, except those reserved for system use, without fear that signal handlers might inadvertently change their values.

The operating system provides each thread with its own stack, which is filled from both ends. The stack of rotated registers grows from the bottom towards higher addresses; it is managed by the hardware and is not visible to the ABI. The usual stack of software local objects grows from the top towards lower addresses. Each frame corresponds to an activation record of a procedure in the call chain. The stack pointer sp always points to the first byte after the top of the stack. The stack frame should be aligned on a 16-byte boundary and should be a multiple of 16 bytes in size.

The last function in a call chain, which itself doesn't call anyone, may have no frame of its own. Such functions are called leaf or terminal (in the graph of dependencies between functions). All other functions must have their own stack frame in the dynamic stack. The following figure shows the organization of the stack frame. Here sp means the stack top pointer of the called function after it has executed the code that sets up the stack frame.

Stack frame organization

highest address

        + -> Frame header (return address, gp, rsc)
        | Register storage area (aligned on the boundary of 16 bytes)
        | Local variable space (aligned on the boundary of 16 bytes)
sp ---> + - Header of the next frame (sp + 0)

lowest address

The following requirements apply to the stack frame:

The header of the stack frame consists of a pointer to the previous frame (link info) and storage areas for rsc, lp and gp, 32 bytes in total. The link info always contains a pointer to the previous frame in the stack. Before function B calls another function C, it must save the contents of the link register received from function A in the lp storage area of function A's stack frame, and must set up its own stack frame.

Except for the header of the stack frame and padding for alignment on the 16-byte boundary, a function should not allocate space for areas it doesn't use. If the function doesn't call other functions and doesn't need anything else from the stack frame, it need not set up a stack frame at all. The parameter saving area follows the stack frame; the register saving area should not contain any padding.

For RISC-type machines (with many registers) it is generally more efficient to pass arguments to called functions in registers (floating-point and general purpose) rather than constructing an argument list in memory or pushing them onto the stack. Since all calculations must in any case be performed in registers, extra memory traffic can be eliminated if the caller computes the arguments in registers and passes them in the same registers to the called function (callee), which can immediately use them for its calculations. The number of arguments that can be passed this way is limited by the number of registers available in the processor architecture.

For POSTRISC, up to 16 parameters are passed in general registers and are visible in the callee's new frame in registers r1…r16. The caller may pass parameters starting from any register; the exact register numbers on the caller side depend on the caller's local frame size.

The parameter storage area, located at a fixed offset of 32 bytes from the stack top pointer, is reserved in each stack frame for the argument list. A minimum of 8 doublewords is always reserved. The size of this area should be sufficient to hold the longest argument list passed by the function owning the stack frame. Although not all arguments of a particular call are actually stored there, their list is considered to be formed in this area, with each argument occupying one or more doublewords.

If more arguments are passed than are allowed to be stored in registers, the remaining arguments are stored in the parameter storage area. Values passed through the stack are bitwise identical to those that would be placed in registers.

For variable argument lists, the ABI uses the va_list type, which is a pointer to the memory location of the next parameter. Using the simple va_list type means that variable arguments must always be located in the same place regardless of type, so that they can be found at runtime. This ABI defines that location as the general registers r8-r18 for the first eight doublewords and the parameter storage area on the stack for the rest. Alignment requirements, for example for floating-point types, may require the va_list pointer to be aligned before accessing the value.
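For illustration, a standard C variadic function uses exactly this va_list machinery; the register/stack split described above is handled transparently by the compiler. This is ordinary ISO C, not POSTRISC-specific code.

#include <stdarg.h>
#include <stdio.h>

/* sum of 'count' long arguments passed as a variable argument list */
static long sum_longs(int count, ...) {
    va_list ap;
    long total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, long);   /* each argument occupies at least one doubleword slot */
    va_end(ap);
    return total;
}

int main(void) {
    printf("%ld\n", sum_longs(3, 10L, 20L, 30L));  /* prints 60 */
    return 0;
}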

Return value of the function: functions must return values of type int, long, long long, enum, short, char, or pointers to any type, in register r1, extended to 64 bits (zero- or sign-extended).

Arrays of characters up to 8 bytes long, or bit strings up to 64 bits long, are returned in the g8 register, right justified. Structures or unions of any length, and character strings longer than 8 bytes, are returned in a storage buffer allocated by the caller. The caller passes the address of this buffer as a hidden additional argument.

Functions must return a single floating-point result of type float, double, or long double (quadruple) in register r1, rounded to the proper precision. Functions must return complex numbers in registers r1 (real part) and r2 (imaginary part), rounded to the proper precision.

Chapter 6. Predication

§ 6.1. Conditional execution of instructions

The architecture defines a model in which control flow passes to the next sequential instruction in memory unless otherwise directed by a branch instruction or an interrupt. The architecture requires that, to the program, the processor appears to execute instructions in the order in which they are located in memory, although in reality the order may be changed inside the processor. The instruction execution model described in this chapter gives a logical representation of the steps involved in executing an instruction. The branch and interrupt sections show how flow control can be changed during program execution.

If the branch direction is incorrectly predicted, the branch instruction causes the pipeline to stop. All speculatively launched instructions are reset from the fetch stage to the stage of writing results – for almost the entire length of the pipeline.

Predication is a conditional execution of instructions. The purpose of conditional execution is to remove badly predicted branches from the program. In this case, any instruction becomes a hardware-executed conditional branch operator. For example:

if (a) b = c + d.

add (a) b = c, d

The optional argument «a» (predicate) sets the logical condition – to execute the instruction or not. This technology replaces a control dependency with a data dependency and shifts a possible pipeline shutdown closer to the pipeline end. All instructions issued with a false value of the predicate are rejected at the completion stage (retire) or earlier (up to the decode stage) without interruptions.

The instruction predication may be explicit or implicit. With explicit predication, each instruction contains an additional argument – a one-bit predicate register, and, accordingly, the architecture contains a file of several predicate registers (16 predicates in the ARM-32 architecture, 64 in Intel-Itanium).

With implicit predication, the architecture contains a special mask register storing information about the conditionality of execution of future instructions. Before an instruction is executed, the first bit of this register is taken as its predicate. Then the register is shifted by one bit, and the current bit is lost. The subsequent instruction takes the next bit as its predicate. The register is constantly refilled from the other end with «clean» bits corresponding to unconditionally executed instructions.

Some instructions may write data into this register, thereby canceling the unconditional execution of some future instructions according to the bitmask. These are the so-called nullification instructions. For example, with the mask 0b10011 containing three 1-bits, the 1st, 2nd, and 5th instructions after the nullification instruction will be canceled.
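The following toy C model illustrates that example; it is consistent with the 0b10011 case above, and glosses over the precise shift timing relative to the nullification instruction itself.

#include <stdint.h>
#include <stdio.h>

/* Toy model of implicit predication via a future mask.
 * Bit 0 of the mask corresponds to the next instruction; the mask is
 * shifted right by one after each instruction. A set bit means "nullify". */
int main(void) {
    uint64_t future = 0;
    future |= 0x13;               /* nullification instruction applies mask 0b10011 */
    for (int i = 1; i <= 6; i++) {
        int nullified = future & 1;
        future >>= 1;
        printf("instruction %d after the mask: %s\n",
               i, nullified ? "nullified" : "executed");
    }
    return 0;                     /* instructions 1, 2 and 5 are nullified */
}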

The advantage of conditional execution is the elimination of most branches in short conditional calculations, and hence of pipeline stalls. However, this is essentially a brute-force method, which boils down to simultaneously issuing instructions from several execution branches under different predicates into the pipeline. In addition, with explicit predication, space is required in each instruction to encode the additional argument, the predicate register.

Predication is better suited to short conditional calculations. It makes no sense to apply predication to loops, or to conditional statements longer than the gain from keeping the pipeline running without branches. However, it is the only means of removing stalls for poorly predictable branches (for example, a conditional branch that depends on unpredictable data).

An implicit predication scheme was chosen for the POSTRISC architecture. This is due to the fact that according to statistics collected for other architectures where there is a predication, approximately 90% of instructions are executed without using predication, so spending several bits for the predicate in each instruction is not profitable. On the other hand, the remaining 10% of the instructions depend on unpredictable data and, without predication, introduce a significant delay in the pipeline operation. Therefore, architecture without predication will also be suboptimal.

The special field psr.future is used to control the nullification of the subsequent instructions. The least significant bit of the register corresponds to the current instruction, other bits correspond to the subsequent instructions. At the end of the instruction, a right shift occurs. In the case of the branch, the future mask is completely cleared, thereby canceling all possible established nullifications.

Format of nullification instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode nullification condition dist-no dist-yes opx

Based on the nullification condition (which, depending on the instruction, consists of 2 registers, or a register and a shift amount, or a register and an immediate value), either the next «n-yes» instructions are nullified, or the «n-no» instructions following the first block of «n-yes» instructions are nullified in psr.future.

The following are examples of removing branches from short conditional statements and the corresponding use of nullification. In all cases, the control dependency is converted to a data dependency and a mask for future instructions. Everywhere it is assumed that the evaluation of the conditions has side effects in the form of possible exceptions and therefore must occur strictly under predication. If there can be no side effects in the form of exceptions, then the evaluation of a complex condition can naturally be done without predication, avoiding unnecessary manipulations.

Table 6.1: Schematic examples of using predication
Conditional statement Predication
if (c1) {x1; }
else {x2; }
c1 c1yes, c1no
x1 (c1no)
x2 (c1yes)
if (c1) {
 x1;
 if (c2) x2;
 else x3;
 x4;
} else {
 x5;
 if (c3) x6;
 else x7;
 x8;
}
c1 c1yes, c1no
x1 (c1no)
c2 c2yes, c2no (c1no)
x2 (c1no, c2no)
x3 (c1no, c2yes)
x4 (c1no)
x5 (c1yes)
c3 c3yes, c3no (c1yes)
x6 (c1yes, c3no)
x7 (c1yes, c3yes)
x8 (c1yes)
if (c1) {x1;
} else if (c2) {x2;
} else if (c3) {x3;
} else {x4;
}
c1 c1yes, c1no
c2 c2yes, c2no (c1yes)
c3 c3yes, c3no (c1yes, c2yes)
x1 (c1yes)
x2 (c2yes)
x3 (c3no)
x4 (c3yes)
if (c1 && c2) {x1; }
else {x2; }
c1 c1yes, c1no
c2 c2yes, c2no (c1no)
x1 (c1no, c2no)
x2 (c2yes)
if (c1 || c2) {x1;
} else {x2;
}
c1 c1yes, c1no
c2 c2yes, c2no (c1yes)
x1 (c2no)
x2 (c1yes, c2yes)
if (c1 || (c2 && c3)) {
 x1;
} else {
 x2;
}
c1 (p0) p2, p3
c2 (p3) p4, p5 (unc)
c3 (p4) p2, p3
x1 (p2)
x2 (p3)
if (c1 && (c2 || c3)) {
 x1;
} else {
 x2;
}
c1 (p0) p2, p3
c2 (p2) p4, p5 (unc)
c3 (p5) p4, p5 (unc)
x1 (p4)
x2 (p3)

§ 6.2. Nullification Instructions

Nullification instructions mark in the special field psr.future the fact that the execution of certain subsequent instructions is canceled. A nullification instruction creates a mask of 1-bits for the nullified instructions of the if- or else-block and ORs it into the current future mask. Nullification instructions assume that the «if»-block precedes the «else»-block.

The following instructions cancel future instructions depending on the result of comparing two registers.

Table 6.2: reg-reg nullification instructions
Instruction Operation
nuldeq   nullify if doubleword equal
nuldne   nullify if doubleword not equal
nuldlt   nullify if doubleword less
nuldle   nullify if doubleword less or equal
nuldltu  nullify if doubleword less unsigned
nuldleu  nullify if doubleword less or equal unsigned
nulweq   nullify if word equal
nulwne   nullify if word not equal
nulwlt   nullify if word less
nulwle   nullify if word less or equal
nulwltu  nullify if word less unsigned
nulwleu  nullify if word less or equal unsigned
Nullification instruction format compare-regs
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb opx dist-no dist-yes opx

The following instructions cancel future instructions depending on the result of comparing a register with a 14(40)-bit immediate value, signed or unsigned. The conditions are the same as for the compare-with-immediate-and-branch instructions.

Table 6.3: reg-imm nullification instructions
Instruction Operation
nuldeqi   nullify if doubleword equal
nuldnei   nullify if doubleword not equal
nuldlti   nullify if doubleword less
nuldlei   nullify if doubleword less or equal
nuldltui  nullify if doubleword less unsigned
nuldleui  nullify if doubleword less or equal unsigned
nulweqi   nullify if word equal
nulwnei   nullify if word not equal
nulwlti   nullify if word less
nulwlei   nullify if word less or equal
nulwltui  nullify if word less unsigned
nulwleui  nullify if word less or equal unsigned
nulmall   nullify if mask all bits set
nulmany   nullify if mask any bit set
nulnone   nullify if mask no bit set
nulnotall nullify if mask not all bits set
Format of nullification instructions compare-with-immediate
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra imm11 dist-no dist-yes opx
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
imm40 0

Instructions nulbs (nullify if bit set) and nulbsi (nullify if bit set immediate) cancel future instructions depending on whether or not a bit is set in the register.

Analogous instructions nulbc (nullify if bit clear) and nulbci (nullify if bit clear immediate) cancel future instructions depending on whether or not a bit is clear in the register.

Nullification instruction format nulbs, nulbc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb opx dist-no dist-yes opx
Format of nullification instructions nulbsi, nulbci
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra shift opx dist-no dist-yes opx

Floating-point scalar values may be checked for nullification. Two registers may be compared, or single register value may be classified (normalized, signed, denormal, NaN, INF, etc).

Format of nullification instructions fp compare
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb opx dist-no dist-yes opx
Format of nullification instructions fp classify
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra classify opx dist-no dist-yes opx

§ 6.3. Nullification in assembler

The assembler eliminates the need to set predication distances manually. You can use named markers whose distances are computed automatically. General syntax:

INSTRUCTION NAME regular_parameters (pred1, pred2, pred3, ...)

The predicate list indicates for which previously defined nullification predicates this instruction is the last one of the if- or else-block. These predicates must be mentioned in previous nullification instructions no farther than 31 instructions from the current one for a «yes» predicate and 63 instructions for a «no» predicate.

write    "test nullification (explicit distances)"
ldi      %r10, 0
nuleq    %r10, %r10, 5, 4
write    "0" ; nullified in 5
write    "1" ; nullified in 5
write    "2" ; nullified in 5
write    "3" ; nullified in 5
write    "4" ; nullified in 5
write    "5" ; nullified in 4
write    "6" ; nullified in 4
write    "7" ; nullified in 4
write    "8" ; nullified in 4
write    "test nullification (predicate names)"
ldi      %r10, 0
nuleq    %r10, %r10, equal, nonequal
write    "0"
write    "1"
write    "2"
write    "3"
write    "4" (equal)
write    "5"
write    "6"
write    "7"
write    "8" (nonequal)

Both variants print «5 6 7 8» (4 instructions else-block) and avoid printing «0 1 2 3 4» (5 instructions if-block) due to predication.

The else-block may be empty if both the «yes» and «no» distances refer to the same instruction (dist_yes == dist_no). To create a zero-length else-block, the last instruction of the if-block should also be marked as the last of the else-block. To create a zero-length if-block, the nullification instruction itself should be marked as the last of the if-block.

In the next sample all 3 subsequent instructions will be nullified, because the nullification condition is true, and there is no else-block. Both block-end markers, «(equal)» and «(nonequal)», are on the same instruction.

nuleq    %r10, %r10, equal, nonequal
write    "0"                     ; part of equal-block
write    "1"                     ; part of equal-block
write    "2" (equal, nonequal)   ; part of equal-block

In the next example all 3 subsequent else-block instructions will be executed, because the nullification condition is false and there is no if-block for a true nullification condition. The «(equal)» marker of the if-block is placed on the nullification instruction itself, so the size of the if-block (the distance from the end of the block to the nullification instruction) is zero.

nuleq    %r10, %r12, equal, nonequal (equal)
write    "0"                ; part of nonequal-block
write    "1"                ; part of nonequal-block
write    "2" (nonequal)     ; part of nonequal-block

Chapter 7. Physical memory

From the point of view of most applications, memory is a linear array of bytes indexed from 0 to 2^64 − 1. Each byte is identified by its index, or address, and each byte contains a value. This information is sufficient for programming applications that do not require special features of any system environment. Other objects are constructed as sequences of bytes.

The architecture supports composite types of sizes 1, 2, 4, 8, and 16 bytes. The following is the terminology used in this guide for composite data types. The word size is taken to be 4 bytes.

A byte is 8 contiguous bits starting on an arbitrary addressable byte boundary. Bits are numbered from right to left, from 0 to 7.

A halfword is two contiguous bytes starting on an arbitrary (but multiple of two) byte boundary. Bits are numbered from right to left, from 0 to 15.

A word is four contiguous bytes starting on an arbitrary (but multiple of four) byte boundary. Bits are numbered from right to left, from 0 to 31.

A doubleword is eight contiguous bytes starting on an arbitrary (but multiple of eight) byte boundary. Bits are numbered from right to left, from 0 to 63.

A quadword is sixteen contiguous bytes starting on an arbitrary (but multiple of 16) byte boundary. Bits are numbered from right to left, from 0 to 127.

An octaword (optional) is 32 contiguous bytes starting on an arbitrary (but multiple of 32) byte boundary. Bits are numbered from right to left, from 0 to 255.
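This terminology maps directly onto C fixed-width types; the short sketch below just asserts the sizes, assuming a compiler that provides the __int128 extension for the 16-byte case (the octaword has no standard C scalar counterpart).

#include <stdint.h>
#include <stdio.h>

_Static_assert(sizeof(uint8_t)  == 1,  "byte is 1 byte");
_Static_assert(sizeof(uint16_t) == 2,  "halfword is 2 bytes");
_Static_assert(sizeof(uint32_t) == 4,  "word is 4 bytes");
_Static_assert(sizeof(uint64_t) == 8,  "doubleword is 8 bytes");
_Static_assert(sizeof(unsigned __int128) == 16, "quadword is 16 bytes"); /* compiler extension */

int main(void) { printf("sizes checked at compile time\n"); return 0; }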

This chapter additionally defines physical addressing, physical memory map, physical memory properties, memory ordering.

Extensions of the simple memory model include virtual memory, caches, memory-mapped IO, and multiprocessor systems with shared memory; together with the services provided by the operating system, this describes the mechanism that allows explicit management of the extended memory model.

A simple sequential execution model allows at most one memory access at a time and requires that all memory accesses appear to be executed in program order. Unlike this simple model, a relaxed memory model is defined below. In multiprocessor systems that allow multiple copies of data, aggressive architecture implementations may allow time intervals during which different copies hold different values.

The program accesses memory using the effective address calculated by the processor when it performs a load, store, branch, or cache management instruction, and when it fetches the next sequential instruction. The effective address is converted to a physical address according to the translation procedures. The physical address is used by the memory subsystem to perform the memory access. The memory model provides the following features:

The architecture allows the memory subsystem to gain efficiency from weak ordering of memory accesses between processors, or between processors and external devices.

Memory accesses by a single processor appear to complete sequentially from the point of view of the programming model, but they may complete out of order with respect to their final position in the memory hierarchy. Ordering is guaranteed at every level of the memory hierarchy only for accesses to the same address from the same processor.

The architecture must provide instructions to allow the programmer to guarantee consistent and ordered state of memory.

The following sections define the operating system's resources for translating virtual addresses to physical addresses, physical addressing, memory ordering and physical memory properties, status registers supporting virtual memory management, and virtual memory errors.

§ 7.1. Physical addressing

The blocks of RAM, ROM, flash, memory mapped IO and other control blocks occupy a common 64-bit physical address space with byte addressing. Accesses to RAM and the IO address ranges can be performed either through virtual addressing, by mapping to a 64-bit physical address space, or directly through physical addressing.

While software should always treat physical addresses as 64-bit, in fact fewer than 64 bits (PALEN bits) of the physical address may be implemented in hardware. As shown below, the physical address consists of two parts: unimplemented and implemented bits. At least 40 bits of physical addressing must be implemented.

The system software can determine the specific value of PALEN by reading the PALEN field of the configuration word with the cpuid instruction.

Not all of these available addresses have real devices under them. The hardware at startup maps the available address blocks to the physical memory ranges and notifies the system about mapping. Similarly, the control ranges of the registers of external devices are mapped to physical addresses. Most physical addresses usually remain unused.

64-bit physical address
63 … 0
reserved implemented physical address bits

When the processor model doesn't implement all bits of the physical address, the missing bits must be zero. If software generates a physical address with nonzero unimplemented bits, a runtime error occurs. Fetching instructions from an unimplemented physical address results in the error «unimplemented instruction address». Accessing data at an unimplemented physical address results in the error «unimplemented data address». Accesses to implemented but unused addresses end with an asynchronous «machine check abort» when the platform reports an operation timeout. The exact behavior of the machine check is implementation-dependent.
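A minimal sketch of the check an implementation might perform, assuming PALEN has already been obtained (for example from the configuration word via cpuid); the helper name is hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* true if all unimplemented (high) bits of the physical address are zero */
static bool physical_address_is_valid(uint64_t pa, unsigned palen /* 40..64 */) {
    if (palen >= 64)
        return true;               /* all 64 address bits implemented */
    return (pa >> palen) == 0;     /* any nonzero unimplemented bit is an error */
}

int main(void) {
    printf("%d\n", physical_address_is_valid(0x000000FF00000000ull, 40)); /* 1: fits in 40 bits        */
    printf("%d\n", physical_address_is_valid(0x0000010000000000ull, 40)); /* 0: bit 40 is unimplemented */
    return 0;
}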

§ 7.2. Data alignment and atomicity

Memory accesses incur a significant performance hit when operands are not aligned on their natural address boundary. A naturally-aligned 2-byte number in memory has one zero bit in the low-order bit of its address. A naturally-aligned 4-byte number has two zero bits in the least significant bits of its address. A naturally-aligned 8-byte number has three zero bits in the least significant bits of its address. A naturally-aligned 16-byte number has four zero bits in the least significant bits of its address. In general, a naturally aligned object of size 2^N bytes has N zero bits in the least significant bits of its address.
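The natural-alignment rule is just a mask check on the low address bits, as in this small C sketch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* true if addr is naturally aligned for an object of 'size' bytes,
 * size being a power of two (1, 2, 4, 8, 16): the low log2(size)
 * address bits must be zero */
static bool is_naturally_aligned(uint64_t addr, uint64_t size) {
    return (addr & (size - 1)) == 0;
}

int main(void) {
    printf("%d\n", is_naturally_aligned(0x1008, 8));  /* 1: 8-byte aligned      */
    printf("%d\n", is_naturally_aligned(0x100A, 4));  /* 0: not 4-byte aligned  */
    return 0;
}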

Struct data types must provide natural alignment for all of their fields by inserting padding. Additionally, it should be possible to use structs as array elements, by adding final padding according to the strictest alignment among all struct fields.

Using the following C language structure S, containing a set of various scalars and a character string, the layout of the fields in memory is shown.

struct {
   int       a;    /* usual 4 bytes */
   double    b;    /* usual 8 bytes */
   int       c;    /* usual 4 bytes */
   char      d[7];
   short     e;    /* usual 2 bytes */
   int       f;
} S;

C language rules for mapping structures allow the use of padding (byte skipping) to align scalars in memory on natural boundaries.

Table 7.1: Aligned representation of the structure in memory
0 1 2 3 4 5 6 7
4 bytes (a) padding
8 bytes (b)
4 bytes (c) d[0] d[1] d[2] d[3]
d[4] d[5] d[6] padding 2 bytes (e) padding
4 bytes (f) final padding

In the example, the structure is mapped to memory with each scalar aligned on its natural boundary. This alignment adds four padding bytes between a and b, one byte between d and e, and two bytes between e and f. Since the alignment of the double precision number b is the strictest in this structure, the whole structure must be aligned on an 8-byte boundary. This adds 4 more bytes at the end of the struct.
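The layout above can be verified directly in C. On a typical LP64 compiler (assuming a 4-byte int, 8-byte double and 2-byte short) the offsets match Table 7.1:

#include <stddef.h>
#include <stdio.h>

struct S_example {
    int    a;      /* offset 0  */
    double b;      /* offset 8  (4 bytes of padding before it) */
    int    c;      /* offset 16 */
    char   d[7];   /* offset 20 */
    short  e;      /* offset 28 (1 byte of padding before it)  */
    int    f;      /* offset 32 */
};                 /* total size 40: 4 bytes of final padding  */

_Static_assert(offsetof(struct S_example, b) == 8,  "b at offset 8");
_Static_assert(offsetof(struct S_example, c) == 16, "c at offset 16");
_Static_assert(offsetof(struct S_example, d) == 20, "d at offset 20");
_Static_assert(offsetof(struct S_example, e) == 28, "e at offset 28");
_Static_assert(offsetof(struct S_example, f) == 32, "f at offset 32");
_Static_assert(sizeof(struct S_example) == 40,      "total size 40");

int main(void) { printf("sizeof(S) = %zu\n", sizeof(struct S_example)); return 0; }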

Unaligned memory accesses raise the «Unaligned data address» error. POSTRISC contains no hardware support for unaligned memory accesses, limiting itself to the program handler installed for the corresponding interrupt. Therefore, software is required to align all scalar values on their natural boundaries in memory.

Since instruction fetch, aligned loads/stores, and semaphore operations work only on aligned target addresses, they are atomic. An operation is atomic if, for other agents working with memory (other processors, IO devices), the memory access from our processor is an indivisible transaction (and vice versa). If our processor stores data to memory, no other agent can read from memory a mixture of the old data and the newly written data replacing it. Similarly, if our processor reads data, it will never read a mixture of old data and newly written data from another agent. Of course, at the machine architecture level these rules apply only to memory atoms, that is, correctly aligned objects of 1, 2, 4, 8, or 16 bytes in size. For arbitrary objects in memory, atomicity of change is not guaranteed by the architecture, and software techniques must be applied.

§ 7.3. Byte order

If scalars (individual data elements or instructions) were indivisible, then there would be no concept of «byte order». It makes no sense to consider the order of bits or groups of bits within the smallest addressable memory atom, because this order for an atom cannot be observed and determined. The question of order arises only when scalars, which the programmer and processor refer to as indivisible objects, occupy more than one addressable memory atom.

For most existing computer architectures, the smallest addressable memory atom is an 8-bit byte. Other scalars consist of groups of 2, 4, 8, or 16 bytes. When a 4-byte scalar moves from a register to memory, it occupies four consecutive byte addresses. Thus, it becomes necessary to establish the order of byte addresses relative to the scalar value: which byte contains the most significant eight bits of the scalar, which byte contains the next eight bits in significance, and so on.

For a scalar consisting of several atoms (bytes) of memory, the choice of byte order in memory is essentially arbitrary. There are N! ways to order N bytes within a long number, but only two of these orderings are actually used.

In the first ordering, the smallest address is assigned to the byte containing the eight least significant bits of the scalar (the rightmost bits), the next consecutive address to the next eight bits in ascending order of significance, and so on. This order is called little-endian because the least significant bits of the scalar, regarded as a binary number (the bits from the «smaller end»), go into memory first. Intel x86 is an example of an architecture using this byte order.

In a little-endian machine, the bytes within a large number are numbered from right to left in order of increasing byte address, so the least significant byte is stored at the lowest address. This is the «direct» byte order: a format for storing and transmitting binary data in which the least significant byte is transmitted or stored first.

Byte addresses within an 8-byte little-endian number (most significant byte on the left): 7 6 5 4 3 2 1 0

In the second order, the smallest address is assigned to the byte that contains the most significant eight bits of the scalar (the leftmost bits), the next consecutive address to the next eight bits in descending order of significance, and so on. This order is called big-endian because the most significant bits of the scalar, regarded as a binary number (the bits «from the larger end»), are the first to go into memory. IBM PowerPC is an example of an architecture using this byte order.

In a big-endian machine, the bytes within a large number are numbered from left to right in order of increasing byte address, so the least significant byte is stored at the highest address. This is the «reverse» byte order: a format for storing and transmitting binary data in which the most significant byte is transmitted or stored first. The terms little/big-endian come from Jonathan Swift's Gulliver's Travels.

Byte addresses within an 8-byte big-endian number (most significant byte on the left): 0 1 2 3 4 5 6 7

The following C structure S, containing a set of various scalars and a character string, is used to show the location of fields in memory under the different byte-order conventions. Comments show the value of each element of the structure. These values show how the individual bytes that make up each element are mapped into memory.

struct {
   int     a;     /* 0x1112_1314 (4 bytes) */
   double  b;     /* 0x2122_2324_2526_2728 (8 bytes) */
   int     c;     /* 0x3132_3334 (4 bytes) */
   char    d[7];  /* "A","B","C","D","E","F","G" bytes array */
   short   e;     /* 0x5152 (2 bytes) */
   int     f;     /* 0x6162_6364 (4 bytes) */
} S;

C language rules for laying out structures allow padding (byte skipping) to align scalars in memory on the desired (natural) boundaries. In the examples below, the structure is mapped into memory with natural alignment for each scalar. This alignment adds four bytes of padding between a and b, one byte between d and e, and two bytes between e and f. The same amount of padding is present in the big-endian and little-endian mappings.

The contents of each byte, as defined in the structure S, are displayed as a hexadecimal number or a character (for string elements). The offsets from the beginning of the structure are shown in the leftmost column.

Table 7.4: Little-endian structure mapping S
offset   +0    +1    +2    +3    +4    +5    +6    +7
 0       0x14  0x13  0x12  0x11  pad   pad   pad   pad
 8       0x28  0x27  0x26  0x25  0x24  0x23  0x22  0x21
16       0x34  0x33  0x32  0x31  «A»   «B»   «C»   «D»
24       «E»   «F»   «G»   pad   0x52  0x51  pad   pad
32       0x64  0x63  0x62  0x61  pad   pad   pad   pad
Table 7.5: Big-endian structure mapping S
offset   +0    +1    +2    +3    +4    +5    +6    +7
 0       0x11  0x12  0x13  0x14  pad   pad   pad   pad
 8       0x21  0x22  0x23  0x24  0x25  0x26  0x27  0x28
16       0x31  0x32  0x33  0x34  «A»   «B»   «C»   «D»
24       «E»   «F»   «G»   pad   0x51  0x52  pad   pad
32       0x61  0x62  0x63  0x64  pad   pad   pad   pad

For the POSTRISC architecture, little-endian is the primary byte order. All operations on data in registers and memory follow this order. Implementations may include optional support for big-endian addressing when loading and storing numbers.

The bit numbering within bytes doesn't affect the byte numbering convention (big-endian or little-endian). The byte numbering convention doesn't matter when accessing fully aligned data in memory. However, the convention matters when accessing partial or unaligned data, or when manipulating data in registers, as follows:

Retrieving the 5th byte of an 8-byte number into the low byte of a register requires a right shift of 5 bytes under the little-endian convention, but a right shift of 2 bytes under the big-endian convention.
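A sketch of this in C++ (x is assumed to be the 8-byte number already loaded into a register; the index 5 refers to the byte's memory offset):

#include <cstdint>

// Byte that lives at memory offset 5 within an aligned 8-byte scalar,
// brought down to the low byte of a register.
std::uint8_t byte5_little_endian(std::uint64_t x) {
    return static_cast<std::uint8_t>(x >> (8 * 5));       // shift right 5 bytes
}

std::uint8_t byte5_big_endian(std::uint64_t x) {
    return static_cast<std::uint8_t>(x >> (8 * (7 - 5)));  // shift right 2 bytes
}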

The manipulation of data in registers is almost the same under both conventions. Both integers and floating-point numbers keep the sign bit in the leftmost byte and their least significant bits in the rightmost byte, so the same integer and floating-point instructions are used unchanged under both conventions. However, when loaded into a register, a big-endian character string has its first character on the left (in the most significant byte), while a little-endian string has its first character on the right.

In addition to little-endian and big-endian, there are other (mixed) options for storing long scalars in memory. For example, some architectures (the PDP-11?) store 2-byte numbers in little-endian order, but 4-byte numbers as pairs of 2-byte numbers in big-endian order. It also happens that integers are stored according to one convention and floating-point numbers according to another, for example when a floating-point coprocessor (ARM, TMS320C4x) was added to an integer processor later.

§ 7.4. Memory consistency model

There are several memory-consistency models for SMP systems:

  1. Sequential consistency (all reads and all writes are in-order).
  2. Relaxed consistency (some types of reordering are allowed):
    • loads can be reordered after loads (for better working of cache coherency, better scaling),
    • loads can be reordered after stores,
    • stores can be reordered after stores,
    • stores can be reordered after loads.
  3. Weak consistency (reads and writes are arbitrarily reordered, limited only by explicit memory barriers)

Atomic operations can be reordered with loads and stores.

The instruction fetching is incoherent with data, so self-modifying code can't be executed without special instruction cache flush/reload instructions plus maybe jump instructions.

POSTRISC follows the weak memory model. The weak memory model with acquire loads and release stores is also called the release-consistency model. Only the acquire/release atomic instructions are synchronization points.

Table 7.6: Memory ordering in some architectures
Architecture | Loads can be reordered after: loads, stores | Stores can be reordered after: loads, stores | Atomics can be reordered with: loads, stores | Dependent loads can be reordered | Incoherent instruction cache/pipeline
Alpha + + + + + + + +
ARM + + + + + + +
RISC-V WMO + + + + + + +
RISC-V TSO + +
PA-RISC + + + +
POWER + + + + + + +
SPARC RMO + + + + + + +
SPARC PSO + + + +
SPARC TSO + +
x86 + +
AMD-64 +
IA-64 + + + + + + +
IBM-Z +
Postrisc + + + + + + +

Notes: On Alpha the dependent loads can be reordered. If the processor first fetches a pointer to some data and then the data, it might not fetch the data itself but use stale data which it has already cached and not yet invalidated. Allowing this relaxation makes cache hardware simpler and faster but leads to the requirement of memory barriers for readers and writers. On Alpha hardware (like multiprocessor Alpha 21264 systems) cache line invalidations sent to other processors are processed in lazy fashion by default, unless requested explicitly to be processed between dependent loads. The Alpha architecture specification also allows other forms of dependent loads reordering, for example using speculative data reads ahead of knowing the real pointer to be dereferenced.

§ 7.5. Atomic/synchronization instructions

A processor implementation must follow the program order when executing the instructions of a single-threaded program. But the effects of one thread's actions on memory can be observed by other threads in an order different from that thread's program order. Depending on the guarantees that the architecture explicitly gives and the relaxations that an implementation is explicitly allowed, one speaks of stricter or weaker memory ordering. POSTRISC is an architecture with weak memory ordering. There are no implicit restrictions on the order in which third-party processors or other devices (for example, IO) observe the current thread's actions on memory. Similarly, the current thread has no implicit guarantees about the order in which it observes the actions of other agents.

The special instruction «fence» is used as a memory barrier. The supported memory-order (mo) types are acquire, release, acquire-release (acq_rel), and sequentially-consistent (seq_cst).

fence.mo
Instruction format for atomic fence
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 mo opx

The special instructions «load-atomic» and «store-atomic» are used to push the visibility of changes from one thread to another: relaxed – a normal operation, acquire – acquire changes, release – publish changes, seq_cst (sequentially-consistent) – both. Before accessing the shared data, a thread performs an acquire via a load-acquire from the guard variable. Similarly, after the shared data has been changed, the thread publishes the changes via a store-release to the guard variable.

Those are the instructions ldab, ldah, ldaw, ldad, ldaq (load), and stab, stah, staw, stad, staq (store).

INSN_MNEMONIC.mo target, base

These instructions are one-way barriers: they do not allow speculative or out-of-order execution of memory operations to cross them. Acquire doesn't allow subsequent instructions to move up ahead of it, and release doesn't allow preceding instructions to sink below it. With correct (pairwise) use of acquire and release, a closed section of code is obtained, locked at the top (acquire) and at the bottom (release).
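In portable code this pairing is what C++ std::atomic acquire/release expresses. A minimal sketch of the guard-variable pattern described above (the names shared_data and guard are illustrative; the comments naming ldad/stad orderings are an assumption about how a compiler might map these operations):

#include <atomic>

int shared_data = 0;                     // ordinary (non-atomic) shared data
std::atomic<int> guard{0};               // the guard variable

void producer() {
    shared_data = 42;                    // modify the shared data
    guard.store(1, std::memory_order_release);     // e.g. a store-release such as stad.release
}

void consumer() {
    while (guard.load(std::memory_order_acquire) == 0) { /* spin */ }
    // e.g. a load-acquire such as ldad.acquire; shared_data == 42 is now guaranteed visible
    int v = shared_data;
    (void)v;
}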

The possible memory orderings for atomic load: relaxed, acquire, seq_cst (sequentially-consistent). The possible memory orderings for atomic store: relaxed, release, seq_cst (sequentially-consistent).

Instruction format for atomic load/store
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base 0 mo opx

The load-op atomic instructions copy the old value of a variable from memory to a register and set a new value in memory, either supplied directly (swap, test-and-set) or computed from the old one (fetch-and-add and analogues). The possible memory orderings: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).

ea = gr[base]
gr[dst] = mem[ea]
mem[ea] = gr[dst] op gr[src]
Instruction format load-op
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base src mo opx
ld_op_type.mo dst, base, src
Table 7.7: atomic instructions load-op
Instruction Description
swap[b|h|w|d|q].mo     swap, 1-16 bytes
ldadd[b|h|w|d].mo      addition, 1-8 bytes
ldand[b|h|w|d].mo      bitwise AND, 1-8 bytes
ldor[b|h|w|d].mo       bitwise OR, 1-8 bytes
ldxor[b|h|w|d].mo      bitwise XOR, 1-8 bytes
ldsmin[b|h|w|d].mo     signed minimum, 1-8 bytes
ldsmax[b|h|w|d].mo     signed maximum, 1-8 bytes
ldumin[b|h|w|d].mo     unsigned minimum, 1-8 bytes
ldumax[b|h|w|d].mo     unsigned maximum, 1-8 bytes

The store-op atomic instructions update a value in memory via the corresponding operation. Compared to load-op, they don't return the old value of the variable from memory to a register, so they may be implemented as one-way communication. The possible memory orderings: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).

ea = gr[base]
mem[ea] = mem[ea] op gr[src]
Instruction format store-op
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 base src mo opx
store_op_type.mo base, src
Table 7.8: atomic instructions store-op
Instruction Description
stadd[b|h|w|d].mo      addition, 1-8 bytes
stand[b|h|w|d].mo      bitwise AND, 1-8 bytes
stor[b|h|w|d].mo       bitwise OR, 1-8 bytes
stxor[b|h|w|d].mo      bitwise XOR, 1-8 bytes
stsmin[b|h|w|d].mo     signed minimum, 1-8 bytes
stsmax[b|h|w|d].mo     signed maximum, 1-8 bytes
stumin[b|h|w|d].mo     unsigned minimum, 1-8 bytes
stumax[b|h|w|d].mo     unsigned maximum, 1-8 bytes

The instructions cas[b|h|w|d|q] (compare and swap, 1-16 bytes) are designed for non-blocking interaction in a multi-threaded, multiprocessor environment. They are atomic, indivisible memory operations that cannot be partially performed.

The cas instruction reads an N-byte number from memory at the address from the base register, and compares it with the value in the register dst. If the values match, the instruction saves the new value from the src register at this address. Otherwise, the instruction doesn't save anything at this address. The base address must be aligned at the N-byte boundary. The read value is stored in the register dst.

value = mem [base]
if (value == gr [dst]) {
 mem [base] = gr [src]
}
gr [dst] = value
Format of instructions casX
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base src mo opx

Using the following procedure, a thread can modify the contents of a memory cell even if there is a possibility that the thread can be interrupted and replaced by another thread that will update the cell, or that a thread on another processor can simultaneously modify the cell. First, an 8-byte number is loaded entirely into a register. Then the updated value is computed and placed in another register, sval. Then the casd instruction is executed with the parameters test (the register holding the initial value), base (the base address register), and sval (the register holding the updated value). If the modification completed successfully, the original value is returned. If the memory cell no longer contains the original value (the current thread was interrupted, or a thread on another processor intervened), the update will not succeed, and the dst register of the casd instruction receives the current value of the memory cell. In that case the thread may repeat the procedure using the new current value.

loop:
ldad.relaxed test, base
mov save, test
...
addi sval, test, 12; some kind of modification
...
casd.relaxed test, base, sval
bne save, test, loop
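The same retry loop expressed against the portable C++ atomics API, for comparison (only a sketch; the «add 12» modification matches the assembler fragment above):

#include <atomic>
#include <cstdint>

void add12(std::atomic<std::uint64_t>& cell) {
    std::uint64_t old = cell.load(std::memory_order_relaxed);   // ldad.relaxed
    std::uint64_t desired;
    do {
        desired = old + 12;                                      // some kind of modification
        // compare-and-swap: store desired only if the cell still holds old;
        // on failure compare_exchange_weak reloads old with the current value
    } while (!cell.compare_exchange_weak(old, desired,
                                         std::memory_order_relaxed));
}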

The instruction casd can be used for controlled sharing of a common data area, including the ability to send messages (into a linked message list) while the common area is in use. To achieve this, an 8-byte control number in memory can be used. A value of zero indicates that the common area is not in use and no messages exist. A negative value indicates that the area is in use and no messages exist. A positive value indicates that the shared area is in use and that the value is the address of the most recent message added to the list. Therefore, any number of threads wishing to capture the common area can use casd to update the control number, to indicate that the area is in use or to add messages to the list. The single thread that has captured the shared area can also safely use casd to remove messages from the list.

The instruction casq can be used similarly to casd. Additionally, it has other uses. Consider a linked data list, with a control word used to address the first message in the list, as described above. If multiple threads are allowed to delete messages using casd (and not only the thread that captured the common area), then the list may be modified incorrectly. This can happen if, for example, after one thread reads the address of the first message in order to remove it, another thread deletes the first two messages and then adds the first message back to the linked list (the «ABA» issue in IBM terminology). The first thread, continuing its interrupted execution, will not be able to determine that the list has changed. By increasing the size of the control word to a pair of 8-byte numbers, containing the address of the first message and a modification tag (change counter) which is incremented by 1 each time the list is modified, and using casq to update both fields together, the possibility of incorrect list updating can be reduced to an insignificant level. Namely, an incorrect modification can occur only if the first thread was interrupted and, during that time, the number of changes to the list was exactly a multiple of 2^64, and only if the last change to the list reuses the original message address.
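A sketch of that double-width control word in C++ (the structure and function names are illustrative; whether std::atomic of a 16-byte struct is lock-free, i.e. maps to a 16-byte compare-and-swap such as casq, depends on the target):

#include <atomic>
#include <cstdint>

struct Node { Node* next; };

struct alignas(16) Head {
    Node*         first;   // address of the first message
    std::uint64_t tag;     // modification counter, incremented on every change
};

std::atomic<Head> list_head;   // updated only with a 16-byte compare-and-swap

void push(Node* n) {
    Head old = list_head.load(std::memory_order_relaxed);
    Head desired;
    do {
        n->next       = old.first;
        desired.first = n;
        desired.tag   = old.tag + 1;   // the tag change defeats «ABA» reuse of the same address
    } while (!list_head.compare_exchange_weak(old, desired,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed));
}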

§ 7.6. Memory attributes

The architecture of any processor needs a simple and effective mechanism for distinguishing ordinary memory accesses from IO operations on devices mapped into the address space. When accessing memory, data may be cached with write-back, blocking semaphore operations are allowed, and reordering of load/store operations and write-combining (coalescing) of stores are available as optimizations. For operations on memory-mapped IO devices, write-through is strictly necessary, caching is not possible, side effects are possible even on reads, and a strict (sequential) order without reordering or merging is required.

In addition, a dedicated, fixed address range must exist in the physical address space for the bootloader code, analogous to the PC EPROM and BIOS. It contains the entry point from which execution starts after a system restart, and other embedded, implementation-dependent code (processor-dependent code, PDC) and platform code (system-dependent code, SDC). It is a read-only memory block, although updating may be permitted.

Memory attributes define speculation, cacheability, ordering, and write policy. If virtual addressing is enabled, the memory attributes of the mapped physical page are determined by the TLB. If physical addressing is used, memory attributes are determined from the physical address itself.

The software must use the correct address subspaces when using physical addressing. Otherwise, incorrect access to IO devices with side effects is possible.

An address range can be either cacheable or non-cacheable. If the range is cacheable, the processor is allowed to allocate a local copy of the corresponding physical memory at all levels of the processor cache hierarchy. The allocation can be changed by cache management instructions.

The cached page is memory coherent, i.e. the processor and memory system guarantee that there is a consistent representation of memory for each processor. Processors support multiprocessor cache coherence based on physical addresses between all processors in the coherence domain (tightly coupled multiprocessors). Coherence doesn't depend on virtual aliases, since they are forbidden.

The processor is not required to support coherence between local instruction and data caches; that is, locally, the entry may not be observable by the local instruction cache. Moreover, multiprocessor coherence is not required from the instruction cache. However, the processor must ensure that the operations of other IO agents like «Direct Memory Access» (DMA) are physically coherent with a cache of data and instructions.

For an uncached access, the processor doesn't provide any coherence mechanisms. The memory system must ensure that a consistent memory representation is seen by each processor.

When writing to cached memory with write-back, only the processor-owned local copy of the data cache line changes. Writing to a lower level cache system (or to the level of the physical arrangement of data in memory) occurs when a changed cache line is explicitly (or implicitly) pushed out of a higher level cache. With write through policy, data changes affect all levels of caching immediately.

For non-cached address ranges, write-combining (coalescing) can be enabled, which tells the processor that multiple writes to a limited memory area (typically 32 bytes) can be gathered in a write buffer and performed later as one large combined write. The processor can keep combining writes for an indefinite period of time; several writes can be merged into one large write that accumulates in the buffer. Write-combining is purely a means of increasing processor efficiency. A processor with multiple write buffers should evict them in an approximately fair order, using the buffers about equally, even if some buffers are only partially full.

The processor can flush data from write buffers to memory in any order. The combined writes aren't performed in the original order. Write-combining can be either spatial or time-based. For example, writing bytes 4 and 5 and writing bytes 6 and 7 are combined into a single writing of bytes 4, 5, 6, and 7. In addition, writing bytes 5 and 6 is combined with subsequent writing of bytes 6 and 7, into a single write of bytes 5, 6, and 7 (with the removing of the first write to the byte 6).

The memory attributes may be defined in several ways.

Memory attributes may be defined via special registers at the level of physical address ranges. In X86 the special memory type range registers (MTRRs) are a set of processor supplementary capability control registers that provide system software with control of how accesses to memory ranges by the CPU are cached. It uses a set of programmable model-specific registers (MSRs) which are special registers provided by most modern CPUs. Possible access modes to memory ranges can be uncached, write-through, write-combining, write-protect, and write-back. In write-back mode, writes are written to the CPU's cache and the cache is marked dirty, so that its contents are written to memory later. Write-combining allows bus write transfers to be combined into a larger transfer before bursting them over the bus to allow more efficient writes to system resources like graphics card memory. This often increases the speed of image write operations by several times, at the cost of losing the simple sequential read/write semantics of normal memory. Additional bits, added in AMD64, allow the shadowing of ROM contents in system memory (shadow ROM), and the configuration of memory-mapped I/O.

Memory attributes may be defined at the level of virtual addresses via virtual page properties, as an additional part of the cached translation info. Such per-page memory attributes may then override the per-range physical address attributes or restrict them in a compatible manner.

Memory attributes may be determined by physical memory mapping only. In this case, fixed address ranges have specified memory attributes. Memory attributes are set implicitly during the initial physical address ranges mapping at reset and can't be changed further.

In POSTRISC, the last way is chosen. Memory attributes of physical addresses are defined by their mapping into the corresponding physical address ranges. They can't be redefined later via special registers and/or page properties. So the physical address space is divided into fixed parts with mmio-like and memory-like address ranges.

Table 7.9: Classification of physical addresses
Addresses Use
0 to 1 GiB       mmio-like, for compatible devices (not 64-bit ready)
1 to 4 GiB       memory-like, for compatible devices (not 64-bit ready)
4 to 256 GiB     mmio-like, for 64-bit ready devices
over 256 GiB     memory-like, main space
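A sketch of how system code might classify a physical address according to Table 7.9 (the function name and return type are illustrative, not part of the architecture):

#include <cstdint>

enum class PhysKind { MmioCompat, MemoryCompat, Mmio64, Memory };

// Classify a physical address according to the fixed ranges of Table 7.9.
PhysKind classify_phys(std::uint64_t pa) {
    constexpr std::uint64_t GiB = 1ull << 30;
    if (pa <   1 * GiB) return PhysKind::MmioCompat;    // mmio-like, not 64-bit ready devices
    if (pa <   4 * GiB) return PhysKind::MemoryCompat;  // memory-like, not 64-bit ready devices
    if (pa < 256 * GiB) return PhysKind::Mmio64;        // mmio-like, 64-bit ready devices
    return PhysKind::Memory;                            // memory-like main space
}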

§ 7.7. Memory map

From the system point of view, the physical address space is a collection of devices, each of which is mapped to one or more contiguous address ranges. Everything is a memory-mapped device: RAM units are devices, external IO devices are naturally memory-mapped devices, even processor cores are memory-mapped devices. The bus controller which controls the memory mapping is itself a memory-mapped device.

At least one block address in the physical memory map should be fixed in architecture: starting address in ROM for code execution after reset. Other blocks layout may be also fixed. Or may be known from the ROM code.

start                end                  size     description
0x0000000000000000   0x00000000ffffffff   4 GiB    reserved
0x0000000100000000   0x00000001000fffff   1 MiB    chipset control
0x00000001f0000000   0x00000001ffffffff   256 MiB  ROM
0x0000000200000000   0x00000002ffffffff   4 GiB    PCIE ECAMs (16x256MiB)
0x0000004000000000   0x0000004fffffffff   64 GiB   PCIE BARs
0x0000010000000000   0x000003ffffffffff   2 TiB    RAM

The memory map should be consistent with memory attributes. Chipset control, PCIE config spaces, memory-mapped io: should be mapped to mmio-like ranges. Memory devices: should be mapped to memory-like ranges. ROM devices: may be both, but the startup ROM should be memory-like.

The cache management instructions icbf (instruction cache block flush) and dcbf (data cache block flush) flush the entire contents of any write buffers whose addresses lie within the 32-byte aligned block specified by icbf or dcbf, forcing the data to become visible. The icbf and dcbf instructions may also flush additional write buffers.

The parameterless instruction msync (memory synchronize) is a hint for the processor to speed up the flushing of all pending (buffered) stores, regardless of their addresses. This makes pending stores visible to other memory agents.

There is no way to know when the flushing of buffered writes will complete. The ordering of combined writes is not guaranteed, so later writes may become visible before earlier writes. To ensure that earlier combined writes become visible before later ones, software must serialize between the writes.

The processor can flush combined writes to memory at any time and in any order, before software explicitly requires it.

Pages that allow write-combining are not necessarily coherent with the write buffers or caches of other processors, or with the local processor caches. Loads from a write-combining page by the processor see the results of all previous writes by the same processor to the same page. Memory accesses issued by a combining buffer (such as buffer flushes) have an unordered, non-sequential memory ordering attribute.

The MMGR family includes instructions for working with special registers, barrier instructions, cache management, dynamic procedure calls, interprocess communication, etc.

41403938373635343332313029282726252423222120191817161514131211109876543210
opcode sntopx label28
opcode sntopx base simm21
opcode sntopx base index scale sm disp

The second register contains the base address (only the address register). The rest of the instruction is reserved for storing the offset, a 9-bit signed number. Formulas to get the effective address:

ip + 16 × sign_extend(label23)

gr [base] + sign_extend(disp)

gr [base] + (gr [index] << scale) + sign_extend(disp)

Instructions ECB (Evict cache block), FETCH (Prefetch data), FETCHM (Prefetch data, modify intent), WH64 (Write hint 64 bytes) regulate the use of cache resources.

FETCH - load the block into the cache for reading N times (if N=0, then free the block).

FETCHM - load a block into the cache for modification N times (if N=0, then push the block out of the cache into memory).

Chapter 8. Virtual memory

This chapter additionally defines operating system resources to translate 64-bit virtual addresses to physical addresses. The virtual memory model introduces the following key features that distinguish it from the simplified presentation of application programs:

Translation lookaside buffers (TLB) support high-performance paged virtual memory systems. Software handlers for populating and protecting TLBs allow the operating system to control translation policies and protection algorithms.

A page table (PT) with hardware walk capability has been added to increase TLB performance. The PT is an extension of the processor TLB that resides in RAM and can be walked automatically by the processor. The use of the PT and its size are entirely under software control.

Sparse 64-bit virtual addressing is supported by provisioning large translation structures (including multi-level hierarchies, like a cache hierarchy), effective support for handling translation misses, pages of different sizes, pinned (non-replaceable) translations, and mechanisms for sharing TLB and page table resources.

The main addressable object in the architecture is an 8-bit byte. Virtual addresses are 64 bits long. An implementation may support less virtual address space. Virtual addresses visible by the program are translated into physical memory addresses by the memory management mechanism.

§ 8.1. Virtual addressing

From an application point of view, the virtual addressing model represents a 64-bit single flat linear virtual address space. General purpose registers are used as 64-bit pointers in this address space.

Less than 64 bits of a virtual address may be implemented in hardware. Unimplemented address bits must be filled with copies of the most significant implemented bit (i.e. be a sign extension of the implemented part of the address). Addresses in which all unimplemented bits match the most significant implemented bit are called canonical. The implemented virtual address space in this case consists of two parts: user and kernel. For N implemented virtual address bits, user addresses range from 0 to 2^(N−1)−1, and kernel addresses range from 2^64−2^(N−1) to 2^64−1.

So, for example, for 48 bits:

0x0000000000000000 - start of user range
0x00007FFFFFFFFFFF - end of user range
0xFFFF800000000000 - beginning of the kernel range
0xFFFFFFFFFFFFFFFF - end of kernel range
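A canonical-address check is just a sign-extension test on the implemented bits. A sketch in C++ (the helper name and the default N = 48 are illustrative):

#include <cstdint>

// An address is canonical if bits [63:N-1] are all copies of bit N-1.
bool is_canonical(std::uint64_t va, unsigned n_implemented = 48) {
    std::int64_t shifted = static_cast<std::int64_t>(va << (64 - n_implemented));
    return static_cast<std::uint64_t>(shifted >> (64 - n_implemented)) == va;
}
// is_canonical(0x00007FFFFFFFFFFF) == true   (end of the user range)
// is_canonical(0xFFFF800000000000) == true   (start of the kernel range)
// is_canonical(0x0000800000000000) == false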

Each virtual address consists of a page table index (1 bit), a virtual page number (VPN), and a page offset. The least significant bits form the page offset; the virtual page number consists of the remaining bits. Page offset bits don't change during translation. The boundary between page offset and VPN in the virtual address changes depending on the page size used in the virtual mapping. In the current implementation, 16 KiB pages are available, plus superpages that are multiples of 16 KiB (32 MiB and 64 GiB).

Virtual address layout (unimplemented bits, 16 KiB pages):
  bits 63..N    sign extension (copies of bit N−1)
  bits N−1..14  virtual page number
  bits 13..0    16 KiB page offset

Switching between physical and virtual addressing modes is controlled by the privileged special register pta. The mode field selects the page translation mode. After a processor restart this field is zero. Virtual addressing is enabled when pta.mode != 0.

Table 8.1: PTA modes
pta.mode   description
0          without translation (physical addressing)
1          reserved
2          2 translation levels
3          3 translation levels
4          4 translation levels

A variable page size is needed to help the software map system resources and to improve TLB utilization. Typically, operating systems choose a small range of page sizes to implement their virtual memory algorithms. Large pages can be statically allocated. For example, large areas of the virtual address space can be allocated to the kernel of the operating system, frame buffers, or mapped IO regions. The software can also selectively pin these translations by placing them in translation registers.

Page size can be specified in: translation cache, translation registers, and PT. Page size can also be used as a parameter for TLB cleanup instructions.

The page sizes are encoded as a 4-bit field ps (pagesize). Each value defines a mapping size of 2^(ps+12) bytes.

Virtual and physical pages should be aligned on their natural boundary. For example, 64 KiB pages are aligned on a 64 KiB boundary, and 4 MiB pages on a 4 MiB boundary.

Processors that use variable virtual page sizes typically require a fully associative hardware TLB. Processors that use only one page size can make do with a partially (set) associative buffer, although a fully associative one is still usual.

Table 8.2: Page permissions
abbreviation designation description
r read read access with the usual load/store instructions
w write write access with normal load/store instructions
x execute code execution access
b backstore saving/restoring registers from the hardware register stack
f finalized final state, page rights cannot be changed, gives the right to read addresses for indirect call instructions through trusted import tables and virtual function tables
p promote the right to elevate privileges of the current thread to the kernel level

The software can check page level permissions with the instructions mprobe, mprobef, which check the availability of this virtual page, privilege level, read/write permissions at the page level, and read/write permissions with a security key.

Executable-only pages may be used, to increase privileges on entering operating system code. User level code should usually go to such a page (managed by the operating system) and execute the instruction epc (Enter Privileged Code). When epc has successfully elevated privileges, the subsequent instructions are executed at the target privilege level indicated by the page. A branch can (optionally) lower the current privilege level if the page where the branch is made has a lower privilege level.

§ 8.2. Translation lookaside buffers

Virtual addresses are translated to physical addresses using a hardware structure called the Translation Lookaside Buffer (TLB), or translation cache. Using the virtual page number (VPN), the TLB finds and returns the physical page number (PPN). A processor usually has two architectural TLB buffers: an instruction TLB (ITLB) and a data TLB (DTLB). Each buffer translates, respectively, instruction and data references. In a simplified implementation, a single (combined) buffer may be used for both types of translation. The term TLB itself refers to the union of the instruction, data, and translation cache structures.

When the processor accesses memory, the TLB is searched for a translation record with the corresponding VPN value. If a matching translation record is found, the physical page number (PPN) is combined with the page offset bits to form the physical address. In parallel with the translation, page permissions are checked against the privilege level and the rights granted for reading, writing, and execution.

If the required translation is not found in the TLB, the processor itself can search the page table in memory and install it in the TLB. If the required entry cannot be found in the TLB and/or the page table, the processor raises a TLB miss fault so that the operating system can establish the translation. In a simplified implementation, the processor may raise the fault immediately after a TLB miss. After the operating system installs the translation in the TLB and/or the page table, the faulting instruction may be restarted and execution continues.

Translation record format in the TLB (three 64-bit words):
  word 0:  ppn | pl | ma | a | d | 0 | p | ar | v
  word 1:  vpn | rv | ps
  word 2:  rv | asid
Table 8.3: TLB translation record fields
Translation field Description
v      Valid bit. If this bit is 1, then the translation can be used in the search.
ar     Global permissions for the virtual page.
p      Present bit. Indicates that the mapped physical page is present in physical memory and has not been evicted to disk.
ma     Memory attributes. Describe caching, coherence, write policy, and other attributes of the mapped physical page.
a      Access bit. This bit may cause a fault on access for tracing or debugging purposes. The processor doesn't modify the Access bit when the page is referenced.
d      Dirty bit. There was a write or semaphore instruction on this page.
pl     Privilege Level or Page Level.
ppn    Physical page number.
ps     Page size, 2^ps bytes.
vpn    Virtual Page Number.
asid   Address Space identifier.
rv     Reserved (doesn't exist).

The TLB is a local processor resource (local insertion or removal of translation entries on one processor doesn't affect the TLB of another processor). A global TLB purge is provided to remove translations on all processors within a coherent TLB domain in a multiprocessor system.

Translation Cache (TC) is an implementation-defined structure, designed to store a small working set of dynamic translations for links to memory. The processor directly controls the record replacement policy in the TC.

The instruction ptc (purge translation cache) removes from the local processor's ITC/DTC the entries that match the specified range of virtual addresses. The software should handle the case where the purge must be extended to all processors in a multiprocessor system. Purging the translation cache doesn't affect pinned TC entries.

The translation cache has at least 16 entries for the ITC and 16 entries for the DTC. An implementation may have additional levels of a TLB hierarchy to increase efficiency.

The translation cache is controlled by both software and hardware. Generally speaking, the software cannot assume how long any installed translation will remain in the cache. This lifetime, as well as the replacement (eviction) algorithm, depends on the implementation. A processor can evict translations from the cache at any time for various reasons. TC purges can remove more entries than is explicitly required.

Records in the translation cache must be maintained in a consistent state. When a TLB entry is inserted or purged, all existing entries that partially or completely overlap with the given translation must be deleted. In this context, overlap refers to two translations with partially or completely overlapping virtual address ranges. For example: two 64 KiB pages with the same virtual address, or a 128 KiB page at virtual address 0x20000 and a 64 KiB page at address 0x30000.

Translation registers (TR) are the part of the TLB that contains translations whose replacement policy is controlled directly by software. Each translation cache entry can be pinned, turning it into a software-controlled translation register, or unpinned and returned to the common pool. Pinned translations are not replaced when the TC overflows (but are flushed when they overlap with new translations). A pinned insert into a previously unpinned TC entry removes the cached translation in that entry. The software can explicitly install translations into TRs by specifying the entry number in the cache. Translations are deleted from a TR when the translation register is purged, but not when the translation cache is purged.

Translation registers allow the operating system to pin critical virtual memory translations into TLB, for example, IO spaces, kernel memory areas, frame buffers, page tables, sensitive interrupt code, etc. The interrupt handler instruction fetching is performed using virtual addressing, and therefore, virtual address ranges containing software translation miss handlers and other critical interrupt handlers should be fixed, otherwise, additional recursive misses in the TLB may occur. Other virtual mappings may be pinned for performance reasons.

An inserted entry is pinned if the insertion is done with the pin bit set. Once such a translation gets into the TLB, the processor will not replace it to make room for other translations. Pinned translations can only be removed by an explicit TLB software purge. Insertions and purges of translation registers may remove other overlapping translations (from the translation cache).

A processor must have at least 8 pinned translation registers for the ITC and 8 for the DTC. An implementation may have additional translation registers to increase efficiency.

§ 8.3. Search for translations in memory

In case of a miss in the TLB hardware translation cache (absence of the necessary record), an interrupt occurs and the software miss handler comes into play. It must find the necessary translation in the page table in memory and place it into the TLB, after which the instruction that caused the interrupt is restarted. However, many systems contain a hardware (or semi-hardware) translation walk unit. Then, in case of a TLB miss, the hardware unit searches for the translation in memory, and only if this unit doesn't find the desired translation does an interrupt occur and a system (software) miss handler get called.

If the processor implements an automatic search block for translations in memory, then the format of individual translation records, the format of the translation table as a whole, and the search algorithm in the translation table ceases to be the free choice of the operating system. At the same time, the system (owned by the OS) translation structures should work in close cooperation with the hardware translation search unit.

Page Table Walker (PTW) is a hardware unit for independent search for translations in RAM in case of their absence in the TLB. PTW is designed to increase the performance of a virtual address translation system.

Page Table (PT) is a translation table in memory, viewable by the PTW hardware unit (must be configured according to the requirements of the PTW equipment).

The processor PTW block can be (optionally) configured to search for translation in PT after a failed search in the TLB for instructions or data. PTW unit provides a significant increase in productivity by reducing the number of interrupts (and therefore delays and cleanup of the processor pipeline), caused by misses in the TLB, and by ensuring the parallel operation of the PTW block to populate the TLB translation cache at the same time as other processor actions.

To organize a page table in memory, traditionally in different architectures, the following schemes are used with varying success:

top-down is a traditional multi-level translation search scheme based on direct downward parsing of a virtual address, when each level of the table tree is directly indexed by the next portion of the virtual address. The easiest way for a hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for searching for translation is equal to the number of levels (depth of the tree) - 2 for X86, 3 for DEC Alpha, 4 for X64, 5-6 for IBM zSeries. It has problems with sparseness and fragmentation, limited support for variable page sizes. It takes up too much space for translation tables (proportional to the size of virtual memory) and inefficiently uses table space with large fragmentation.

guarded top-down is an improved multi-level translation search scheme based on direct downward parsing of a virtual address, when each level of the table tree is directly indexed by the next portion of the virtual address, and omissions of some levels are possible. Harder for hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for translation search may be less than the maximum number of levels. Reduces problems with sparseness and fragmentation, limited support for variable page sizes.

bottom-up is a scheme of the reverse recursive ascending order of viewing translation tables, when recursive misses are used in one large linear table located in virtual memory. Requires hardware implementation of nested interrupts. The number of memory accesses for searching for a translation depends on the number of recursive misses in the TLB and, at best, is 1, but in the worst case, it is proportional to the top-down method. Has problems with sparseness and fragmentation, limited support for variable page sizes. It takes up too much space for translation tables (in the worst case, it is proportional to the size of virtual memory) and inefficiently uses table space with large fragmentation.

inverted page table – a hash table of pages. Its size is proportional to the size of physical (rather than virtual) memory and doesn't depend on the degree of fragmentation of the virtual space. The number of memory accesses per translation search doesn't depend on the size of the page table and, with a well-chosen hash function and hash table size, is usually 1. It copes well with sparseness and fragmentation, with limited support for variable page sizes. It caches poorly when looking up translations for neighboring pages.

In the architecture POSTRISC, a multi-level translation search scheme was chosen to implement the page table based on the direct top-down order of viewing the translation tables, when each next level is directly indexed by a new portion of the virtual address. The number of memory accesses for translation search is equal to the number of levels (variable, currently 3 levels). The page table is located in the physical memory space as a multi-level structure of service tables.

Virtual address layout: 16 KiB pages and possible translation levels:
  bits 63..47   sign extension
  bits 46..36   11 bits (table index)
  bits 35..25   11 bits (table index)
  bits 24..14   11 bits (table index)
  bits 13..0    16 KiB page offset
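A sketch of how the PTW (or a software miss handler) would slice a virtual address under this layout (16 KiB pages, three 11-bit indices; the structure and field names are illustrative):

#include <cstdint>

constexpr unsigned PAGE_SHIFT = 14;   // 16 KiB page offset
constexpr unsigned INDEX_BITS = 11;   // 2048 entries per table level

struct WalkIndices {
    std::uint64_t offset;   // bits 13..0
    std::uint64_t l3;       // bits 24..14 - index into the leaf table
    std::uint64_t l2;       // bits 35..25
    std::uint64_t l1;       // bits 46..36 - index into the root table (addressed by pta)
};

WalkIndices split_va(std::uint64_t va) {
    const std::uint64_t index_mask = (1ull << INDEX_BITS) - 1;
    WalkIndices w;
    w.offset = va & ((1ull << PAGE_SHIFT) - 1);
    w.l3     = (va >> PAGE_SHIFT) & index_mask;
    w.l2     = (va >> (PAGE_SHIFT + INDEX_BITS)) & index_mask;
    w.l1     = (va >> (PAGE_SHIFT + 2 * INDEX_BITS)) & index_mask;
    return w;
}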

In the event of a miss in the TLB translation hardware cache (lack of the necessary record), the hardware translation search block in memory comes into play, and if this block doesn't find the required translation, an interrupt occurs and the program miss handler is called.

Special register page table address (pta) defines the search parameters for translation in memory for the virtual space, describes the location and size of the PT root page in the address space. The operating system must ensure that page tables are aligned naturally.

Special register pta (root level) and translation records for the next levels (one 64-bit word each):
  pta:                 reserved | ppn | 0 | mod
  intermediate level:  reserved | ppn | ma | 0 | s | v
  intermediate level:  reserved | ppn | ma | 0 | s | v
  final level:         reserved | ppn | ar | g | d | a | p | 0 | v
Table 8.4: Translation record fields
Field  Bits           Description
mod    3              Translation mode: 0 - no translation; 1, 2, 3 and so on - the number of indexing levels used in the search.
v      1              Valid bit. For intermediate and final formats: if 1, the page entry is valid, otherwise a search error occurs.
ppn    varies, 30-50  Physical page number if p=1, or other system data if p=0.
s      1              Superpage bit: stop the search (final format instead of intermediate).
p      1              The page is present in memory.
ma     4              Page physical attributes. Should be defined per superpage.
a      1              Access bit.
d      1              Dirty bit - indicates whether the page has been modified. When a page is evicted to swap, it need not be written out if it is already in the swap and has not changed.
ar     6              Permissions.
rv     -              Reserved (must be zeros).

The format of the page tables should take into account the mapping of virtual addresses to a physical address space of a total depth of 64 bits.

§ 8.4. Translation instructions

List of translation instructions. The processor doesn't guarantee that a modification of translation resources is observed by subsequent instruction fetches or data accesses. The software should provide instruction serialization (by issuing a synchronizing barrier instruction) before any dependent instruction fetch, and data serialization before any dependent data reference.

Table 8.5: Instructions that modifies TLB
Syntax Description
ptc     ra,rb,rc
Purge translations cache
ptri    rb,rc
Purge the instruction translation register. ITR ← gr[rC], ifa
ptrd    rb,rc
Purge the data translation register. DTR ← gr[rC], ifa
mprobe  ra,rb,rc
Returns page permissions for the privilege level gr[rC]
tpa     ra,rb
Translates the virtual address to the physical address

The ptc instruction invalidates all translations in the local processor cache that match the specified address and ASID. The processor determines the ASID-specific page that contains the address and invalidates all TLB entries for that page. The instruction deletes all translations from both translation caches that intersect with the specified address range. If the paging structures map the linear address using large pages and/or there are multiple TLB entries for the page, the instruction invalidates all of them.

Format of instructions ptc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode base address asid 0 opx

Translation records can be inserted into fixed translation registers by instructions mtitr (move to instruction translation register) and mtdtr (move to data translation register). The data for the inserted translation is taken from the first register argument of the instruction and special registers ifa. The translation register number is taken from the second argument register.

Translation records can be deleted from translation registers by instructions ptri (Purge Translation register for Instruction) and ptrd (Purge Translation register for Data). The first argument is the base address register number, the second argument is the register number that stores the translation register number. The instructions also delete all translations from both translation caches that intersect with the specified address range. The instructions only remove translations from the local processor registers.

Instruction format mtitr, mtdtr, ptri, ptrd
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target base asid 0 opx

Permissions for a virtual page can be queried with the instructions mprobe (memory probe) and mprobef (memory probe, faulting). The mprobe instruction, for a given base address and privilege level, returns the mask of available rights. The privilege level is given as a value in a register. The mprobef instruction doesn't return the rights but tests for the required access rights at the given base address and privilege level. If the rights are absent, the mprobef instruction raises a «Data Access rights fault»; otherwise the instruction does nothing.

Instruction format mprobe
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base pl 0 opx
Instruction format mprobef
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 base pl 0 opx

Privileged instruction tpa (translate to physical address) returns the physical address corresponding to the given virtual address.

Instruction format tpa
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base 0 0 opx

The common TLB/PT search sequence looks like this. If the TLB search fails and the PTW is disabled (pta.mode=0), an ITLB/DTLB miss fault occurs. If the PTW is enabled (pta.mode!=0), the PTW computes the index into the root page table and tries to find the missing translation in the PT in memory, walking the table tree. If additional TLB misses occur during the PTW operation, the PTW raises a fault. If the PTW doesn't find the required translation in memory (that is, the PT doesn't contain it), or the search is interrupted, an instruction/data TLB miss fault occurs. Otherwise, the record is loaded into the ITC or DTC. The processor may load records into the ITC or DTC even if the program did not require the translation.

Insertions from the PT into the TC follow the same «purge before insert» rules as program inserts. PT insertion of entries that exist in TR registers is not allowed. Specifically, the PT may be searched for any virtual address, but if the address is mapped by a TR, such a translation must not be inserted into the TC. The software should not place into the PT translations that overlap with current TR translations. An insert from the PT may result in an abnormal machine termination if the inserted PT record overlaps a TR.

After the translation record is loaded into the TLB, additional translation errors are checked (in order of priority): lack of rights to the page, enabled access bit, enabled dirty bit, lack of a page in memory.

Chapter 9. The floating-point facility

This chapter describes the floating-point and vector subsystem of the virtual processor instruction set.

§ 9.1. Floating-point formats

The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE 754-1985) defines two floating-point formats – single and double precision – in two groups, basic and extended. The architecture supports all four formats in IEEE terminology: the basic single and double formats and the extended double format. The basic double format serves simultaneously as the extended single format.

The architecture defines the representation of floating-point values in four different fixed-length binary formats. The format can be 16-bit for half-float precision values, 32-bit for single precision values, 64-bit for double precision values, 128-bit for quadruple precision values. Values in each format are composed of three fields: sign bit (S), exponent (E), fractional part or mantissa (F).

float number format - half
1514131211109876543210
S Exp Fraction
float number format - single
1514131211109876543210
S Exp Fraction
Fraction
float number format - double
1514131211109876543210
S Exp Fraction
Fraction
Fraction
Fraction
float number format - quadruple
1514131211109876543210
S Exp
Fraction
Fraction
Fraction
Fraction
Fraction
Fraction
Fraction

Single precision numbers occupy four adjacent bytes of memory, starting with an arbitrary address multiple of 4. Double precision numbers occupy eight adjacent bytes of memory, starting with an arbitrary address multiple of 8. Quadruple numbers occupy sixteen contiguous bytes of memory, starting with an arbitrary address multiple of 16.

The values represented within each format are determined by two integer parameters – the size of the format B and the number of exponent bits P. All other parameters are derived from these two.

Table 9.1: Parameters of the formats of float numbers
Format parameter                             Half    Single  Double   Quadruple
Format bits B                                16      32      64       128
Exponent bits P (P<B)                        5       8       11       15
Sign bit S                                   1       1       1        1
Fraction bits FB = B−P−1                     10      23      52       112
Fraction significant bits B−P                11      24      53       113
Significant decimal digits log10(2^(B−P))    3.311   7.225   15.955   34.016
Maximum exponent EMAX = 2^(P−1)−1            15      127     1023     16383
Minimum exponent EMIN = −(2^(P−1)−2)         −14     −126    −1022    −16382
Exponent bias 2^(P−1)−1                      15      127     1023     16383
Maximum biased exponent EBMAX = 2^P−1        31      255     2047     32767
Bias adjustment 3×2^(P−2)                    24      192     1536     24576

The following table shows the exact limit values of these formats in decimal notation:

Limit                              Value
Normalized values                  (−1)^S × 1.F × 2^(E−EMAX)
Maximum normalized value           (2.0−2^(−FB)) × 2^EMAX
Single absolute maximum            3.40282347e+38
Double absolute maximum            1.7976931348623158e+308
Quadruple absolute maximum         1.1897314953572317650857593266280070162e+4932
Minimum normalized value           1.0 × 2^EMIN
Single absolute minimum            1.17549435e−38
Double absolute minimum            2.2250738585072013e−308
Quadruple absolute minimum         3.3621031431120935062626778173217526026e−4932
Subnormalized values               (−1)^S × 0.F × 2^EMIN
Maximum subnormalized value        (1−2^(−FB)) × 2^EMIN
Quadruple maximum subnormal        3.3621031431120935062626778173217519551×10^−4932
Minimum subnormalized value        1.0 × 2^(EMIN−FB)
Single minimum (subnormal)         1.401298464324817071e−45 (inaccurate)
Double minimum (subnormal)         4.940656458412465442e−324 (inaccurate)
Quadruple minimum (subnormal)      6.4751751194380251109244389582276465525×10^−4966

The following objects are allowed within each format:

NAN – short for «not a number» (Not A Number). A NAN is an IEEE binary floating-point representation of something other than a number. NANs come in two forms: «signaling» NANs and «quiet» NANs.

Arithmetic with infinities treats the operands as arbitrarily large quantities. Negative infinity is less than any finite number; positive infinity is greater than any finite number.

Notation: S is the sign bit, EXP is the biased exponent (i.e. reduced to an unsigned value), F is the fractional part or mantissa (fraction), XXXXX is an arbitrary but non-zero sequence of bits, EBMAX is the maximum biased unsigned exponent. The value of a float number is interpreted as follows.

If EXP = EBMAX (all exponent bits are ones), then this is a special IEEE value. To recognize special values, F is examined further. If F is not equal to zero, then it is +NAN or −NAN. In particular, if the first (most significant) bit of the mantissa is 0, then it is a signaling NAN, and if 1 – a «quiet» NAN. If EXP = EBMAX and F = 0, then it is infinity, +INF or −INF depending on S. If 0 < EXP < EBMAX, then this is a finite normalized number. If EXP = 0 and the mantissa is not zero, then this is a finite denormalized number. If EXP = 0 and F = 0, then this is +0 or −0 depending on S.

Exponent        Fraction   IEEE value
EBMAX           1XXXXXX    QNAN
EBMAX           0XXXXXX    SNAN
EBMAX           0          INF
0 < E < EBMAX   any        Finite (Normalized): (−1)^S × 2^(E−BIAS) × 1.F
0               XXXXXXX    Finite (Denormal): (−1)^S × 2^EMIN × 0.F
0               0          ±0

Floating-point operations can raise arithmetic exceptions for many reasons, including invalid operations, overflow, underflow, division by zero, and inexact results.

§ 9.2. Special floating-point values

NAN is the abbreviation for «not a number». A NAN is an IEEE floating-point bit pattern that represents something other than a number. These are the values with the maximum biased exponent and a non-zero fractional part. The sign bit is ignored (a NAN is neither positive nor negative), although it can be examined. NANs come in two forms: signaling NANs and quiet NANs. If the high bit of the mantissa is zero, then this is a signaling NAN, otherwise a quiet NAN.

Signaling NAN (SNAN) is used to provide values for uninitialized variables and for arithmetic enhancements. A signaling NAN reports an invalid operation when it is the operand of an arithmetic operation, and may raise an arithmetic exception. The signaling NAN is used to raise an exception when such a value appears as the operand of a computational instruction.

Quiet NAN (QNAN) provides retrospective diagnostic information about previous invalid or unavailable data and results. Quiet NANs propagate through almost every operation without generating arithmetic exceptions.

QNAN is used to represent the results of certain invalid operations, such as invalid arithmetic operations on infinities or on a NAN, when the invalid-operation exception is masked. Quiet NANs propagate through all floating-point operations except ordered comparisons (LT, LE, GT, GE) and conversions to integer, for which they report exceptions. QNAN codes can thus be preserved through a sequence of floating-point operations and used to carry diagnostic information, helping to identify the consequences of illegal operations.

When a QNAN is the result of a floating-point operation, either because one of the operands is a NAN or because a QNAN was generated due to a masked invalid-operation exception, the following rule determines which NAN (with the high mantissa bit set to 1) is saved as the result. If either operand is an SNAN, then that NAN is returned as the result of the operation. Otherwise, if a QNAN is generated because the invalid-operation exception is masked, then this QNAN is returned as the result. A QNAN generated as a result has a positive sign, an all-ones exponent, and the most significant bit of the mantissa set to 1 (all other mantissa bits 0). An instruction that generates a QNAN as a result of a masked invalid-operation exception should generate exactly such a QNAN (e.g. 0x7FF8000000000000 for double).

§ 9.3. Selection of IEEE options

Floating-point instructions provide a subset of the IEEE standard for binary floating-point arithmetic (ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic). The following describes how to create a full implementation of IEEE.

Four IEEE rounding modes are supported in hardware: to nearest, toward zero (truncation), toward plus infinity, and toward minus infinity. The hardware supports IEEE enabling/disabling of software traps for special situations. Addition, subtraction, multiplication, division, conversion between floating-point formats, rounding to an integer in floating-point format, conversion between floating-point and integer formats, comparison, and square-root calculation are supported in hardware. The remainder of division and conversion between binary and decimal representations are supported in software. Copying (possibly with a sign change) without changing the format is not considered an operation (non-finite numbers are not checked). Operations on mixed formats are not provided; calculations occur with the maximum accuracy available for the given vector format.

The precision of conversion between decimal strings and binary floating-point numbers is no less than the IEEE standard requires. Whether the conversion procedures to decimal format treat excess digits (beyond 9, 17 or 36 digits) as zeros depends on the implementation.

When the binary-to-decimal conversion software encounters overflow, underflow, NAN, or INF, it returns strings that identify these states.

The hardware supports comparisons of numbers of the same format. Numbers of different formats can be compared in software. The result of a comparison is true or false. The hardware supports the six required predicates and the unordered predicate. The other 19 optional predicates can be constructed from comparisons and bitwise operations. Infinities are supported in hardware by the comparison instructions.

QNANs provide retrospective diagnostic information. Copying signaling NANs without changing the format doesn't report an invalid-operation exception (the fmerge instructions also do not check for non-finite numbers).

The hardware fully supports negative zero operands and follows the IEEE rules for creating negative zero results. The hardware supports underflow and denormal numbers.

Tininess is detected by hardware after rounding, and loss of accuracy is detected by software as an inexact result.

§ 9.4. Representation of floats in registers

Universal registers, 128 bits wide each, can store one quadruple-precision float, 2 double-precision floats, 4 single-precision floats, 8 half-precision floats, or an integer vector with elements of 1, 2, 4 or 8 bytes.

Table 9.4: Representation format for real data
register bytes
15 14 13 12  11 10 9 8  7 6 5 4  3 2 1 0
half half half half half half half half
single single single single
double double
quadruple

The special register fpcr controls the execution of floating-point and vector operations. It controls the arithmetic rounding mode for all instructions except explicit rounding instructions, indicates the enabled user-level traps, and stores the exceptions that have occurred, both enabled and masked.

FPU control register format
313029282726252423222120191817161514131211109876543210
IEEE masked flags IEEE masked traps IEEE nonmasked traps control bits
0 im um om zm dm vm 0 i u o z d v 0 i u o z d v 0 td ftz 0 rm
Table 9.5: fpcr field bits
bits description
v    Invalid Operation
d    Denormal/Unnormal Operand
z    Zero Divide
o    Overflow
u    Underflow
i    Inexact result
td   Traps disabled
rm   Rounding mode
ftz  Flush-to-Zero mode (zeroing without underflow)

The rm (rounding mode) bits control the rounding mode of the results. The rounding mode doesn't affect the execution of explicit rounding instructions, for which only the rounding mode specified directly in the instructions matters.

Rounding mode (RM) Description
0  Round to nearest (round)
1  Round toward minus infinity (floor)
2  Round toward plus infinity (ceil)
3  Round toward zero (chopping)

The masked flags vector stores a mask of flags that enable IEEE interrupts of the corresponding type. The bits of the nonmasked traps and masked traps vectors store the flags of exceptions that occurred while enabled (or, respectively, masked) in the masked flags vector.

The fldi instruction is used to load immediate floating-point constants into registers. It can load constants represented in formats up to extended precision (80 bits) without loss of accuracy. The instruction cannot encode zero or special values, and the exponent is restricted to 6 bits. The instruction encodes a 28-bit immediate (or 70 bits for a dual-slot instruction).
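
A minimal usage sketch (the register name, the literal, and the exact operand syntax are illustrative assumptions):

    fldi    %r7, 1.5        ; load an immediate floating-point constant into r7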

Instruction format fldi
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode target s exponent mantissa (high 21 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
mantissa (full 63 bits)

§ 9.5. Floating-point computational instructions

All computational operations are performed only on registers. The basic operation for maximum performance is vector (or scalar) operation «multiply-add» MAC (multiply-accumulate fused). Floating-point arithmetic instructions that fuse multiplication with addition and possibly sign change, formed according to the FMAC rule.
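
For example, a scalar double-precision fused multiply-add might be written as follows (register names are illustrative; fmaddsd is the scalar double variant listed in the table below):

    fmaddsd %r4, %r1, %r2, %r3    ; r4 = r1 × r2 + r3 with a single rounding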

Ternary «fused» floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 src3 opx
Binary floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx
Unary floating-point instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 0 opx
Unary floating-point instruction format with rounding
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 rm opx

The following table lists these instructions. They exist in all variants – half, single, double or quadruple – respectively.

Table 9.10: Floating point computational instructions
Scalar Packed Description
Fused instructions
fmadds[h|s|d|q] fmaddp[h|s|d|q] a = b × c + d
fmsubs[h|s|d|q] fmsubp[h|s|d|q] a = b × c − d
fnmadds[h|s|d|q] fnmaddp[h|s|d|q] a = − b × c + d
fnmsubs[h|s|d|q] fnmsubp[h|s|d|q] a = − b × c − d
fmaddap[h|s|d] ax = bx × cx + dx, ay = by × cy − dy
fmsubap[h|s|d] ax = bx × cx − dx, ay = by × cy + dy
binary instructions
fadds[h|s|d|q] faddp[h|s|d|q] a = b + c
faddhp[h|s|d] ax = bx + by, ay = cx + cy
faddcp[h|s|d] ax = bx + cx, ay = bx − cy
fnadds[h|s|d|q] fnaddp[h|s|d|q] a = − (b + c)
fsubs[h|s|d|q] fsubp[h|s|d|q] a = b − c
fsubhp[h|s|d] ax = bx − by, ay = cx − cy
fsubcp[h|s|d] ax = bx − cx, ay = bx + cy
fabsds[h|s|d|q] fabsdp[h|s|d|q] a = abs (b − c)
fnabsds[h|s|d|q] fnabsdp[h|s|d|q] a = − abs (b − c)
fmuls[h|s|d|q] fmulp[h|s|d|q] a = b × c
fdivs[h|s|d|q] fdivp[h|s|d|q] a = b/c
fmins[h|s|d|q] fminp[h|s|d|q] a = min (b, c)
fmaxs[h|s|d|q] fmaxp[h|s|d|q] a = max (b, c)
famins[h|s|d|q] faminp[h|s|d|q] a = min (abs (b), abs (c))
famaxs[h|s|d|q] famaxp[h|s|d|q] a = max (abs (b), abs (c))
fcmps[h|s|d|q]oeq fcmpp[h|s|d]oeq fp compare ordered and equal
fcmps[h|s|d|q]one fcmpp[h|s|d]one fp compare ordered and not-equal
fcmps[h|s|d|q]olt fcmpp[h|s|d]olt fp compare ordered and less
fcmps[h|s|d|q]ole fcmpp[h|s|d]ole fp compare ordered and less-equal
fcmps[h|s|d|q]o fcmpp[h|s|d]o fp compare ordered
fcmps[h|s|d|q]ueq fcmpp[h|s|d]ueq fp compare unordered or equal
fcmps[h|s|d|q]une fcmpp[h|s|d]une fp compare unordered or not-equal
fcmps[h|s|d|q]ult fcmpp[h|s|d]ult fp compare unordered or less
fcmps[h|s|d|q]ule fcmpp[h|s|d]ule fp compare unordered or less-equal
fcmps[h|s|d|q]uo fcmpp[h|s|d]uo fp compare unordered
p[s|d]pk pack two vectors into one
Conversion to integer with rounding
fcvtiw2s[h|s|d|q] fcvtiw2ps convert signed word to floats
fcvtuw2s[h|s|d|q] fcvtuw2ps convert unsigned word to floats
fcvts[h|s|d|q]2iw fcvtps2iw convert floats to signed word
fcvts[h|s|d|q]2uw fcvtps2uw convert floats to unsigned word
fcvtid2s[h|s|d|q] fcvtid2pd convert signed doubleword to floats
fcvtud2s[h|s|d|q] fcvtud2pd convert unsigned doubleword to floats
fcvts[h|s|d|q]2id fcvtpd2id convert floats to signed doubleword
fcvts[h|s|d|q]2ud fcvtpd2ud convert floats to unsigned doubleword
fcvtiq2s[h|s|d|q] convert signed quadword to floats
fcvtuq2s[h|s|d|q] convert unsigned quadword to floats
fcvts[h|s|d|q]2iq convert floats to signed quadword
fcvts[h|s|d|q]2uq convert floats to unsigned quadword
Conversion to narrower float with rounding
fcvts[s|d|q]2sh convert float to half-float
fcvts[d|q]2ss convert float to single float
fcvtsq2sd convert float to double float
Extending to wider float instructions
fextsh2ss extend float to single float
fexts[h|s]2sd extend float to double float
fexts[h|s|d]2sq extend float to quadruple float
Rounding instructions
frnds[h|s|d|q] frndp[h|s|d] floating-point round
unary instructions
fnegs[h|s|d|q] fnegp[h|s|d] floating-point negate value
fabss[h|s|d|q] fabsp[h|s|d] floating-point absolute value
fnabss[h|s|d|q] fnabsp[h|s|d] floating-point negate absolute value
frsqrts[h|s|d|q] frsqrtp[h|s|d] floating-point reciprocal square root
fsqrts[h|s|d|q] fsqrtp[h|s|d] floating-point square root
funphp[h|s|d] unpack the high half of the vector into a wider-precision vector
funplp[h|s|d] unpack the low half of the vector into a wider-precision vector

The fcmp instructions are intended for generating predicates from the results of floating-point comparisons. They produce boolean scalars/vectors as the result of a floating-point vector comparison. Comparison of floating-point numbers is done by element-wise comparison of two vectors and recording the result in a third vector. All bits of the result elements for which the condition is satisfied are set to 1, the rest to 0. After the comparison, a single predicate bit can be obtained by performing a conjunction or disjunction of all bits of the result vector.
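
A sketch of predicate generation (register names are illustrative; fcmppsolt is the packed single ordered-less-than variant from the table above):

    fcmppsolt %r5, %r1, %r2   ; each element of r5 is set to all ones where r1 < r2 (ordered), else all zeros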

For some instructions, the second operand is replaced with the 7-bit immediate value count from 0 to 127, which describes the accuracy of a non-pipelined unary operation, e.g. fsqrt or frcp.

The accuracy of the fsqrt, frcp and frsqrt instructions is indicated by the constant count directly in the instruction. With minimal accuracy, the instruction executes in the same time as a regular MAC, without pipeline delays.

§ 9.6. Floating-point branch and nullification instructions

Quadruple scalar Scalar Double Scalar Single Description
branch if compare is true
bfsqoeq bfsdoeq bfssoeq ordered and equal
bfsqone bfsdone bfssone ordered and not-equal
bfsqolt bfsdolt bfssolt ordered and less
bfsqole bfsdole bfssole ordered and less-or-equal
bfsqo bfsdo bfsso ordered
bfsqueq bfsdueq bfssueq unordered or equal
bfsqune bfsdune bfssune unordered or not-equal
bfsqult bfsdult bfssult unordered or less
bfsqule bfsdule bfssule unordered or less-or-equal
bfsquo bfsduo bfssuo unordered
branch if classification is true
bfsqclass bfsdclass bfssclass compare
nullify if compare is true
nulfsqoeq nulfsdoeq nulfssoeq ordered and equal
nulfsqone nulfsdone nulfssone ordered and not-equal
nulfsqolt nulfsdolt nulfssolt ordered and less
nulfsqole nulfsdole nulfssole ordered and less-or-equal
nulfsqo nulfsdo nulfsso ordered
nulfsqueq nulfsdueq nulfssueq unordered or equal
nulfsqune nulfsdune nulfssune unordered or not-equal
nulfsqult nulfsdult nulfssult unordered or less
nulfsqule nulfsdule nulfssule unordered or less-or-equal
nulfsquo nulfsduo nulfssuo unordered
nullify if classification is true
nulfsqclass nulfsdclass nulfssclass
Format of fp scalar compare branch instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src1 src2 opx disp17x16
Format of fp scalar compare nullification instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src1 src2 opx dist-no dist-yes opx

The branch-on-floating-point-classification instructions check the class of a floating-point value. The floating-point classification instructions use a 7-bit immediate mask whose flags describe which floating-point value classes meet the condition.

Classification flag Description Assembler mnemonic
0x01  Zero          @zero
0x02  Negative      @neg
0x04  Positive      @pos
0x08  Infinity      @inf
0x10  Normalized    @norm
0x20  Denormalized  @denorm
0x40  NaN (Quiet)   @nan
0x80  fixme: no place for Signaling NaN  @snan
Format of fp scalar classification branch instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src classify opx disp17x16

The nullification-on-floating-point-classification instructions nfclsd, nfclsq, nfclss check the class of a floating-point value.

Format of fp scalar classification nullification instructions
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode src classify 0 dist-no dist-yes opx
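
For example, a scalar compare-and-branch might look as follows (register names and the label are illustrative; bfsdolt is the double-precision ordered-less-than branch from the table above):

    bfsdolt %r1, %r2, less_label   ; branch to less_label if r1 < r2 (ordered, double)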

§ 9.7. Logical vector instructions

The instructions for manipulating real registers as bit vectors are independent of the type of data stored in the registers. They are intended for conditional movements, operations on bit masks, generation of predicates.

41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx
Name Description
vsll    shift left
vsrl    shift right
vrll    rotate left
vrrl    rotate right
p1perm  permute bytes
lvsr    vector load for shift left (permutation)

Instruction vsel (vector bitwise select) produces a bitwise selection of two registers based on the contents of the third register, where the bit mask is the preliminarily computed result of a logical operation or a comparison operation.

Instruction format vsel, p1perm
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 src3 opx

Instructions dep16 (vector deposit) and srp16 (vector shift right pair) produce a bitwise selection of two registers. The dep16 instruction takes the first count bits of the result from the first operand register; the remaining bits come from the second operand register. The srp16 instruction takes the first count bits of the result from the upper part of the first operand register; the remaining bits come from the lower part of the second operand register.

Instruction format dep16, srp16
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 count opx

§ 9.8. Integer vector operations

These are DSP (digital signal processing) instructions for working with multimedia integer data. Instructions are generated according to the FBIN rule (format). The first register is the result. The second and third are operands.

41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx

The size of vector elements can be 1, 2, 4, or 8 bytes. Calculations can be carried out with modular (wraparound) arithmetic or with saturation. Saturation can be signed or unsigned. Modular arithmetic can either truncate the result or produce a carry-out.

Name Description Element Size
vaddc*    add carryout unsigned            1,2,4,8
vaddu*    add unsigned modulo              1,2,4,8
vaddo*    add overflow                     1,2,4,8
vaddss*   add signed saturate              1,2,4,8
vaddus*   add unsigned saturate            1,2,4,8
vavgs*    average signed                   1,2,4,8
vavgu*    average unsigned                 1,2,4,8
vcmpeq*   compare equal                    1,2,4,8
vcmplts*  compare less than signed         1,2,4,8
vcmpltu*  compare less than unsigned       1,2,4,8
vmaxs*    maximum signed                   1,2,4,8
vmaxu*    maximum unsigned                 1,2,4,8
vmins*    minimum signed                   1,2,4,8
vminu*    minimum unsigned                 1,2,4,8
vmrgh*    merge high                       1,2,4,8
vmrgl*    merge low                        1,2,4,8
vpkssm*   pack signed as signed modulo     2,4,8
vpksss*   pack signed as signed saturate   2,4,8
vpksum*   pack signed as unsigned modulo   2,4,8
vpksus*   pack signed as unsigned saturate 2,4,8
vpkuum*   pack unsigned as unsigned modulo 2,4,8
vpkuus*   pack unsigned as unsigned saturate 2,4,8
vrol*     rotate left                      1,2,4,8
vror*     rotate right                     1,2,4,8
vsll*     shift left logical               1,2,4,8
vsra*     shift right algebraic            1,2,4,8
vsrl*     shift right logical              1,2,4,8
vsubb*    subtract carryout unsigned       1,2,4,8
vsubu*    subtract unsigned modulo         1,2,4,8
vsubus*   subtract unsigned saturate       1,2,4,8
vsubss*   subtract signed saturate         1,2,4,8
vupkhs*   unpack high signed               1,2,4
vupkls*   unpack low signed                1,2,4

In the table, the asterisk * replaces the size of the vector elements: 1, 2, 4, or 8 bytes.
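
A short sketch (register names are illustrative; the element-size digit replaces the asterisk as described above):

    vaddss2 %r3, %r1, %r2   ; eight 16-bit signed saturating additions
    vminu1  %r4, %r1, %r2   ; sixteen 8-bit unsigned minima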

Chapter 10. Extended instruction set

This chapter describes the extended virtual processor instruction set which was not included in the basic set.

§ 10.1. Helper Address Calculation Instructions

To simplify addressing, several instructions have been introduced that calculate effective addresses without accessing memory. The ldax instruction returns the effective address computed as for indexed addressing.

Instruction format ldax
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base index scale sm disp

The ldar (load address relative) instruction calculates an ip-relative base address in the same way as a jump instruction. The first argument is the number of the result register, the second is the distance in instruction bundles from the current position (in assembler, this is a label in the code section, or a label in the immutable data section aligned on a 16-byte boundary). It is used to get the base address of immutable data in a code section, a function address, or a label. The instruction doesn't generate interrupts.

ldar dst, label

This instruction is necessary for position-independent code to get the absolute address of objects stored at a fixed distance from the current position, for example intra-module procedures or immutable local module data. On MAS (Multiple Address Spaces) systems, where the module's private data is stored at a fixed distance from the code section, it can also be used to obtain the absolute base address of the module's private data.

Instruction format ldar
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst label (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 label (expanding to 60 bits instead of 28)

The instruction is formed according to the ldar rule. The result register is followed by a 28-bit field encoding the offset relative to the instruction counter. The data block must be aligned to at least a 16-byte boundary, since the offset expresses the distance in instruction bundles, not bytes. The general formula for obtaining the address:

gr[dst] = ip + 16 × sign_extend(label)

The 28-bit offset field (60 bits for a dual-slot instruction), after sign extension and a left shift by 4 positions, is added to the contents of the instruction counter ip to produce a 64-bit effective address. The maximum distance for a one-slot instruction is 2 GiB on either side of the instruction counter. The ldar instruction allows the immediate value in the instruction code to be continued into the next slot of the bundle, forming a dual-slot instruction.

The ldar instruction can be used to compute the address of static module data. But specifically for this purpose another instruction, ldafr (load address forward relative), is intended; it can address any byte address, not only 16-byte bundle-aligned ones. It computes the effective address in the same way as all ip-relative load/store instructions. This would reduce the maximum available distance 16 times, but since only forward references with an unsigned offset are possible, the reduction is only 8 times. To use ldafr, the distance from the current bundle to the data must not exceed 256 MiB. Usually 256 MiB is enough for any module.

Instruction format ldafr
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst label (28 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 label (expanding to 64 bits instead of 28)

gr[dst] = ip + zero_extend(label)

If a signed constant fits in 28 bits, then it is more efficient to use the ldi instruction, and if it fits in 56 bits, then ldi along with ldan. However, when loading constants in bulk, a single ldar instruction is amortized over several load instructions, and then a pair of ldi and ldan instructions is less compact than a single load instruction. As for loading 8-byte integer constants, floating-point constants, and vector constants, using ldar along with ld8, ld4 and other load instructions is the recommended, and often the only possible, way to load such constants.
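
A sketch of loading constants through ldar (the label, register names, and offsets are illustrative; the load mnemonics follow the examples later in this section):

    ldar    %r2, const_pool    ; base address of a 16-byte aligned constant block
    ldd     %r3, %r2, 0        ; load the first 8-byte constant
    ldd     %r4, %r2, 8        ; load the next 8-byte constant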

Base-plus-offset addressing allows addressing 1 MiB on either side of the base address when using one-slot instructions (21-bit offset). If the object is beyond 1 MiB, dual-slot instructions have to be used. But, according to the principle of locality of access, the program will with high probability access further objects located near the first one. This fact can be exploited: calculate a base address once, such that several needed objects lie no further than 1 MiB from it, and then use one-slot instructions to address them.

The ldan (load address near) instruction calculates the nearest base address. It is used to optimize local (in place and time) memory accesses without using dual-slot instructions and long offsets. Another nearest-base-address instruction is ldanrc (load address near relative consistent).

ldan   dst, base, simm
ldanrc dst, base, simm

The first argument is the result register number, the second is the base address register number, and the third is an immediate value 21 bits long (or 63 bits for a long instruction), extended to 64 bits. The instruction allows the immediate value in the instruction code to be continued, up to 63 bits, into the next slot of the bundle, forming a dual-slot instruction.

Instruction format ldan, ldanrc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst base simm (21 bits)
838281807978777675747372717069686766656463626160595857565554535251504948474645444342
0 (44 bits instead of 21)

The target 64-bit address is calculated (for ldan and ldanrc respectively) as:

gr[dst] = gr[base] + (simm << 20)

gr[dst] = ip + gr[base] + (simm << 20)

The following example shows how to use the ldan instruction to access a group of closely spaced (no more than 512 KiB from each other), but far-away data (the distance to the sym object is more than 512 KiB from the base address).

Without using ldan (4 double instructions, 8 slots)

    ldsw.l  %r1, base, sym + 4
    ldw.l   %r2, base, sym + 8
    std.l   %r2, base, sym + 16
    ldd.l   %r3, base, sym + 32

Using ldan (5 single instructions, 5 slots)

    ldan  tmp, base, data_hi(sym)       ; put the nearest address in tmp
    ldsw  g11, tmp, data_lo(sym) + 4    ; addressing relative to tmp
    ldw   g12, tmp, data_lo(sym) + 8
    std   g12, tmp, data_lo(sym) + 16
    ldd   g13, tmp, data_lo(sym) + 32

§ 10.2. Multiprecision arithmetic

For hardware support for long arithmetic, it is advisable to add special instructions. In the general case, for intermediate addition/subtraction of parts of high precision numbers it is required to specify the incoming carry (borrow), two operands, the result and the outgoing carry (borrow).

When explicitly coding all dependencies and not using global flags (which is good for parallel/pipeline execution of instructions) it requires 5 parameters: the result, two operands, input and output carry/borrow. There is not enough space in the instructions for all five parameters. Therefore, the high part of 128-bit registers is used to return the carry/borrow.

A special instruction mulh (multiply high) was introduced for hardware support for multiplying long numbers calculating the upper half of a 128-bit product of two 64-bit numbers.

Instruction format addc, subb, mulh
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc 0 opx
Instruction format addaddc, subsubb
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode ra rb rc rd opx

Syntax:

addc      ra, rb, rc
addaddc   ra, rb, rc, rd
subb      ra, rb, rc
subsubb   ra, rb, rc, rd
mulh      ra, rb, rc
Table 10.1: Fused instructions
Name Operation Description
addc     add with carry                     ra = carry(rb + rc), sum(rb + rc)
subb     subtract with borrow               ra = borrow(rb − rc), rb − rc
addaddc  add and add with carry             ra = carry(rb + rc + rd.high), rb + rc + rd.high
subsubb  subtract and subtract with borrow  ra = borrow(rb − rc − rd.high), rb − rc − rd.high

It is assumed that numbers of arbitrary length are already loaded into the registers. For example, the addition of 256-bit numbers will occur as follows:

addc      a1, b1, c1      ; sum of lower parts, first carry-out
addaddc   a2, b2, c2, a1  ; sum of middles and carry-in, next carry-out
addaddc   a3, b3, c3, a2  ; sum of middles and carry-in, next carry-out
addaddc   a4, b4, c4, a3  ; sum of higher and carry-in, last carry-out

§ 10.3. Software interrupts, system calls

The syscall (system call) instruction performs a call to the system kernel to process a system request. The system call number is taken from r1, the arguments from subsequent registers.

Unlike interrupts, a system call is analogous to a function call and, similarly, returns to the next bundle. Therefore, after the syscall instruction in assembler you need to put a label to ensure that the subsequent instructions fall into a new bundle. Future predication bits are cleared.
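
A sketch of a call sequence (the call number, register names, and label are illustrative assumptions; ldi loads an immediate constant):

    ldi     %r1, 42        ; hypothetical system call number in r1
    syscall
after_syscall:             ; the label forces subsequent instructions into a new bundle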

The register frame is rotated, and the return address is stored in the zero register of the new frame. Subsequent local registers contain the syscall arguments.

The sysret (system return) instruction returns from the system request handler that was called using syscall. The instruction uses the return address and frame state from the zero register.

Instruction format syscall, sysret
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0opx

The int (interrupt) instruction is provided for programmatically sending interrupts to the current core itself. The sent interrupt doesn't occur synchronously with the instruction stream; it can be delayed until the moment this vector is unmasked. For a user-mode program, when all interrupts are unmasked, the sent interrupt occurs synchronously with the instruction stream. The interrupt index is calculated as gr[src] + simm10. The instruction supports both styles of passing the interrupt code: hardcoded codes using the zero register gz, or dynamic code passing.

Instruction format int
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0 src simm10 opx

The rfi (return from interruption) instruction returns from an interrupt handler. It returns to the beginning of the bundle containing the interrupted, incomplete instruction (in the case of a fault), or to the bundle containing the subsequent instruction (in the case of a trap).

Instruction format rfi
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0opx

§ 10.4. Cipher and hash instructions

Table 10.2: AES/hash instructions
Name Operation
aesdec          ra, rb, rc
aes decrypt round
aesdeclast      ra, rb, rc
aes decrypt last round
aesenc          ra, rb, rc
aes encrypt round
aesenclast      ra, rb, rc
aes encrypt last round
aesimc          ra, rb
aes inverse mix columns
aeskeygenassist ra, rb, uimm8
aes key generation assist
clmulll         ra, rb, rc
carry-less multiply low parts
clmulhl         ra, rb, rc
carry-less multiply high and low parts
clmulhh         ra, rb, rc
carry-less multiply high parts
crc32c          ra, rb, rc, rd
crc32c hash
Instruction format aesenc, aesenclast, aesdec, aesdeclast, clmul
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src1 src2 0 opx
Instruction format aesimc
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 0 opx
Instruction format aeskeygenassist
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src round constant opx
Instruction format crc32c
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst prev data len opx

The crc32c instruction computes the crc32c hash. The new hash value is based on the previous hash value «prev». The hashed data is in register «data». The len parameter may be any value; if it is bigger than 16, only the 16 bytes of data in the «data» register are used.

§ 10.5. Random number generation instruction

The special random instructions are designed to generate random values. Reading returns the next 64-bit random number. The instructions return random numbers compliant with the «U.S. National Institute of Standards and Technology (NIST)» standards on random number generators.

Instruction format random
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst src 0 0 opx

The src operand specifies the used random generator.

Instruction  Source  NIST Compliance
rdrand (0)   Cryptographically secure pseudorandom number generator  SP 800-90A
rdseed (1)   Non-deterministic random bit generator                  SP 800-90B & C (drafts)

The numbers returned by rdseed are referred to as «seed-grade entropy» and are the output of a true random number generator (TRNG), or an enhanced non-deterministic random number generator (ENRNG) in NIST-speak. rdseed is intended for software vendors who have an existing PRNG but would like to benefit from the hardware entropy source. With rdseed you can seed a PRNG of any size.

The numbers returned by rdseed have multiplicative prediction resistance. If you use two 64-bit samples with multiplicative prediction resistance to build a 128-bit value, you end up with a random number with 128 bits of prediction resistance (2^128 × 2^128 = 2^256). Combine two of those 128-bit values together, and you get a 256-bit number with 256 bits of prediction resistance. You can continue in this fashion to build a random value of arbitrary width, and the prediction resistance will always scale with it. Because its values have multiplicative prediction resistance, rdseed is intended for seeding other PRNGs.

In contrast, rdrand is the output of a 128-bit PRNG that is compliant with «NIST SP 800-90A». It is intended for applications that simply need high-quality random numbers. The numbers returned by rdrand have additive prediction resistance because they are the output of a pseudorandom number generator. If you put two 64-bit values with additive prediction resistance together, the prediction resistance of the resulting value is only 65 bits (2^64 + 2^64 = 2^65). To ensure that rdrand values are fully prediction-resistant when combined to build larger values, you can follow the procedures in the «DRNG Software Implementation Guide» on generating seed values from rdrand, but it is generally best and simplest to just use rdseed for PRNG seeding.

The decision for which generator to use is based on what the output will be used for. Use rdseed if you wish to seed another pseudorandom number generator (PRNG), use rdrand for all other purposes. rdseed is intended for seeding a software PRNG of arbitrary width. rdrand is intended for applications that merely require high-quality random numbers.

§ 10.6. CPU identification instructions

The cpuid instruction is used to dynamically identify which features of POSTRISC are implemented in the running processor. The implemented functional features of the instruction set are recorded in a series of configuration information words. One configuration information word is read each time the cpuid instruction is executed. The number of the configuration information word to be accessed is computed as gr[index] + sext(simm10). The 64-bit configuration information is written into the general register dst.

cpuid instruction format
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode dst index simm10 opx

Syntax:

cpuid ra, rb, simm10

The configuration information word contains a series of configuration bit fields. For example, the PALEN field, which records the number of supported physical address bits in bits 11 through 4 of configuration word No. 1, is denoted cpuid.1.PALEN[11:4].
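
For example, configuration word No. 1 could be read as follows (the register names and the spelling of the zero register are illustrative):

    cpuid   %r5, %gz, 1    ; r5 = configuration word 1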

The configuration information accessible through the cpuid instruction is listed in the table below. cpuid access to undefined configuration words causes a general protection exception. The reserved fields in the defined configuration words read back as zeros.

Word number  Bit field  Description
0  31:0   number of implemented configuration words
1  47:32  vendor
1  31:16  version
1  15:0   revision
1  63:0   capabilities flags
2  63:0   L1I info
3  63:0   L1D info
4  63:0   L2D info
5  63:0   L3D info
6  63:0   L1 ITLB
7  63:0   L1 DTLB
8  63:0   L2 TLB
9  63:0   PMR info

§ 10.7. Instructions for the emulation support

Currently no OS or standard libraries are implemented for the virtual processor. Therefore, a few special instructions have been added to provide minimal emulation of them.

The write instruction outputs a formatted string. It uses forward ip-relative addressing to address the format string. An unsigned 28-bit ip-relative offset gives a maximum distance of 256 MiB forward from the current position for a one-slot instruction, and all of the available address space for a long instruction. The write instruction allows the immediate value in the instruction code to be continued into the next bundle slot, forming a dual-slot instruction. It is assumed that the effective address points to a zero-terminated string. In assembler, you can use either labels of strings in the rodata section, or string literals directly (the assembler will place them in the rodata section and insert the offset into the instruction).

ea = ip + zext(disp)

Instruction format write
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode opx disp (28 bits)

The following formatters are used to display the content of the current core registers. The common syntax is «%formatter(register)» or «%m(command)».

Table 10.5: write formatters
formatter  description
%%                            the % character
%c                            low part of a general register as a 1-byte character
%i8, %i16, %i32, %i64         low part of a general register as a signed decimal value
%u8, %u16, %u32, %u64         low part of a general register as an unsigned decimal value
%x8, %x16, %x32, %x64, %x128  low part of a general register as an unsigned hexadecimal value
%b8, %b16, %b32, %b64         low part of a general register as a binary value
%f32, %f64, %f128             low part of a general register as a floating-point value
%vf32, %vf64                  general register as a vector of floating-point values
%vi8, %vi16, %vi32, %vi64     general register as a vector of signed decimal values
%vu8, %vu16, %vu32, %vu64     general register as a vector of unsigned decimal values
%vx8, %vx16, %vx32, %vx64     general register as a vector of hexadecimal values
%m(dump)                      full core state dump
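
A sketch of a formatted output call (the register and the string contents are illustrative):

    write   "r7 = %i64(%r7), hex = %x64(%r7)"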

The halt instruction, without parameters, is intended to turn off the processor core, switching it to the deepest level of sleep without saving state, from which the core may exit only via the reset signal. In the emulator this instruction serves to shut down the emulator.

Instruction format halt
41403938373635343332313029282726252423222120191817161514131211109876543210
opcode 0opx

Notes: The halt instruction is not used in unittests, because it is automatically added for each test source by test scripts.

Chapter 11. Application Model (Application Binary Interface)

This chapter gathers information about the ABI – the application binary interface: the runtime and program model, the sections and segments of a program, how a program finds its private data, the available addressing methods, the accepted conventions for procedure calls and register preservation, relocation types, and object file formats.

Depending on the execution environment (the hardware capabilities of the target architecture, the type of operating system), several program models are possible, listed in the order of their historical appearance and growing universality.

For the POSTRISC system, a combined SAS/MAS environment with implicit segmentation was selected, when each segment can be configured as SAS or MAS.

§ 11.1. Sections and segments

The compiler divides the different parts of the generated object code and data into different sections. During linking, when object files are combined, sections with the same name are merged, producing an output file with one instance of each section type. These sections of the output file are further grouped into several segments, which are processed by the loader as indivisible units.

The purpose of sections is to allow the compiler to generate separate pieces of code and data that can be combined with similar parts from other object files at the build stage. This makes it possible to achieve link-time locality and confidence in the correct addressability of the contents of these sections. The most important section attribute is the access type of the section's pages; all data in one section shares the same minimum set of permissions.

The purpose of the segments is to allow the linker to group sections into fewer program units. Each segment has unique addressing methods for it, and sections of one segment are addressed in the same way. The compiler may make assumptions that any two objects in the same segment have a fixed offset relative to each other when the program is executed, but cannot assume the same for two objects in different segments.

The runtime architecture also defines some additional segments which do not get their contents directly from the compiled object file. These segments – the heap, stack, and shared memory segments – are created at program startup or dynamically at runtime.

Table 11.1: Standard scheme of a software module (using ELF format as an example)
segment  section  type of program  description
TEXT  header      all     file header
      sectab      all     section header table
      shstrtab    all     section names
      .dynamic    shared  dynamic linking information header
      .liblist    shared  list of the names of the required shared libraries
      .rel.dyn    shared  relocations for DATA process data
      .rel.tdata  shared  relocations for TDATA thread data
      .conflict   shared  additional dynamic linking information
      .msym       shared  additional dynamic linking symbol table
      .dynstr     shared  names of linked external functions
      .dynsym     shared  link table of external functions
      .hash       shared  hash table for quick search in the export table
      .rconst     all     read-only constants (no configuration)
      .rodata     all     immutable global data (configured at first load into the system)
      .litanon    shared  literal address pool section
      .lit        all     literals (literal pool section)
      .tlsinit    all     initial copy of TDATA data
      .pdata      all     exception procedure table
      .text       all     main program code (not modified during loading; may be configured at first load into the system)
      .init       all     program initialization code section
      .fini       all     program termination code section
      .comment    all     comment section
TEXT, not loadable
      rsrc         all    compiled resources
      line         all    debug information
      debug        all    debug information
      unwind       all    table for stack unwinding after exceptional situations
      unwind_info  all    blocks of information for stack unwinding after exceptional situations
DATA  .data   all     initialized private process data (set at load)
      .xdata  all     exception scope table
      .sdata  all     near-addressed small initialized private process data (set at load)
      .got    shared  GOT (Global Offset Table) for references to DATA variables of other modules
      .sbss   all     near-addressed small (bss) uninitialized private process data
      .bss    all     uninitialized private process data
TDATA .tdata  all     initialized thread-local data (set at load)
      .tsdata all     near-addressed small initialized thread-local data (set at load)
      .tgot   shared  module GOT table for the thread (references to TDATA variables of other modules)
      .tsbss  all     near-addressed small (bss) uninitialized thread-local data
      .tbss   all     uninitialized thread-local data

A program in the POSTRISC architecture consists of a main program module, dynamically loaded libraries (the same program modules), stacks of the main and other threads, several heaps. Each program module consists of four types of sections.

The TEXT segment is shared by all processes in the system and is read-only and executable. The addressing within the segment is relative to the instruction pointer. Its CODE section contains program code. Its RODATA section contains immutable data, placed after the CODE section.

The DATA segment contains private process data. The segment is read-write. The addressing within the segment is relative to the instruction pointer. The DATA segment of the main software module, in addition to its private data, contains a table of base addresses for all DATA segments of dynamically loaded libraries.

The TDATA segment contains private thread data. The segment is read-write. After creation the segment is at an unknown distance from all other segments and is addressed relative to the dedicated base register tp. The TDATA segment of the main program module, in addition to its private data, contains a table of base addresses for all TDATA segments of dynamically loaded libraries.

§ 11.2. Data model

There are several data models for binding fundamental integer scalar data types from programming languages to architectural data types.

Table 11.2: Dimensions of fundamental types
Data model Architectural types
1-byte | 2-byte | 4-byte | 8-byte
ILP16  char | short int, int, long int, near pointer | – | –
LP32   char | short int, int, near pointer | long int, far pointer | –
ILP32  char | short int | int, long int, pointer | long long int
LLP64  char | short int | int, long int | long long int, pointer
LP64   char | short int | int | long int, long long int, pointer
ILP64  char | short int | wchar_t | int, long int, long long int, pointer

The ILP16 variant was used by very old 16-bit systems, LP32 was used by MS DOS, ILP32 is used by all 32-bit systems, LLP64 was chosen by Microsoft for 64-bit Windows, LP64 was selected for 64-bit Linux and most other 64-bit Unix systems, and ILP64 is used in some versions of Unix systems.

The choice between LLP64, LP64, and ILP64 is determined by different criteria. If you need to support (without recompiling) an existing body of 32-bit software when migrating to 64-bit systems, then LLP64 is the best choice; the disadvantage is that adapting a program for 64 bits requires deep modernization. If you want the existing code base to take advantage of 64-bit addressing with minimal rework, then ILP64 is a good fit; the disadvantage is that a superficial code upgrade leads to memory waste where 64 bits are not needed. If you follow a balanced approach between the complexity of converting to 64-bit systems and the need to support existing 32-bit programs, then choose LP64. ILP64 was chosen for POSTRISC, with the addition of the new fundamental type long char to describe four-byte values (wchar_t).

Table 11.3: Binding to fundamental types
Data Type Size and alignment Machine Type
signed char                    1 (1)    signed byte
unsigned char                  1 (1)    unsigned byte
char                           1 (1)    byte, the sign depends on the compiler
bool                           1 (1)    unsigned byte, 0 or 1
[signed] short int             2 (2)    signed 2-byte
unsigned short int             2 (2)    unsigned 2-byte
[signed] long char             4 (4)    signed 4-byte
unsigned long char             4 (4)    unsigned 4-byte
enum                           1,2,4,8  depends on the range of values
[signed] int                   8 (8)    signed 8-byte
unsigned int                   8 (8)    unsigned 8-byte
[signed] long int              8 (8)    signed 8-byte
unsigned long int              8 (8)    unsigned 8-byte
[signed] long long int         8 (8)    signed 8-byte
unsigned long long int         8 (8)    unsigned 8-byte
data pointer: type *           8 (8)    unsigned 8-byte
function pointer: type (*) ()  8 (8)    unsigned 8-byte
float                          4 (4)    IEEE single
double                         8 (8)    IEEE double
long double                    16 (16)  IEEE quadruple

Aggregate data types (structures – struct, class – and arrays) and unions (union) are aligned according to their most strictly aligned component. The size of any object, including aggregates and unions, is always a multiple of the object's alignment. An array uses the same alignment as its elements. Structure and union objects may require padding to meet the size and alignment restrictions. The content of any padding is undefined.

C structures and unions can contain bit fields that define integer objects with a specified number of bits. The table shows the permissible bit-field widths for each base type and the corresponding value limits.

Table 11.4: Binding of bit fields to fundamental types
Data Type Field Width W Limits
char, signed char            1-8   −2^(W−1) … 2^(W−1) − 1
long char, signed long char  1-16
short, signed short, enum    1-32
int, signed int              1-64
long, signed long            1-64
long long, signed long long  1-64
unsigned char                1-8   0 … 2^W − 1
unsigned long char           1-16
unsigned short               1-32
unsigned int                 1-64
unsigned long                1-64
unsigned long long           1-64

Bit fields whose base type (with the exception of enumerated types) is written without an explicit signed or unsigned keyword are considered unsigned (fixme). Bit fields of enumerated types are considered signed, unless an unsigned type is needed to represent all constants of the enumeration type. Bit fields obey the same size and alignment rules as other fields in a structure or union, with the following additions:

Bit fields of types int and long (signed and unsigned) are usually packed more densely than those of smaller base types (there are fewer restrictions on crossing base-type boundaries). You can use bit fields of types char and short to force placement within those types, but int is generally more efficient.

§ 11.3. Reserved registers

Although all 128 general-purpose registers are physically equal (except for the difference between global and rotating registers, and some other differences), the application binary interface reserves several general-purpose registers for its (special) purposes. Unlike real special-purpose registers, these registers are special only in the sense that the program is obliged to use them only in an authorized way. The choice of numbers for these registers is (almost) arbitrary and not part of the architecture.

The initial contents of the registers sp and tp are set by the loader at process/thread start and should be changed by the program only according to the ABI rules. The contents of the sp register must always correctly reflect the state of the stack and be aligned to the strictest boundary among the base types – 16 bytes. Register r0 must contain the return information when a procedure is called.

Table 11.5: Dedicated General Purpose Registers
Register Content
r0  link pointer – return address from the procedure. The called procedure receives the return address in the first register of the new frame of local registers, register r0.
sp  stack pointer – pointer to the top of the stack.
tp  thread pointer – pointer to the beginning of thread-local data for the main (static) module. Used by load/store instructions and ldan only inside the main module.

§ 11.4. Position independent code and GOT

The code segment must not contain relocations (PIC). To create a PIC, the compiler must:

  1. Use for all internal branches the ip-relative branches only rather than branches to absolute addresses.
  2. Similarly, do not use absolute references to static data, instead use addressing with an offset relative to some standard base register. If the code and data segments are guaranteed to be located at a known distance from each other (MAS), then the function from the shared library can calculate the corresponding base address using ip. Otherwise (SAS), the caller must set the base register as part of the call sequence.
  3. Use an additional level of indirection for each control transfer outside the monolithic PIC segment, and for each access to static memory outside the corresponding data segment. Indirection allows non-PIC target addresses to be kept in the DATA segment, which is private to each instance of the program.

Position-independent code cannot contain absolute addresses directly in the instruction code; it addresses both data and code with offsets relative to the instruction counter. Code that is independent of data placement uses ip-relative offsets for code addressing, but cannot address private data that way – only relative to base registers.

The Global Offset Table (GOT) stores absolute addresses and is part of the process's private data, which makes the addresses accessible without violating the position independence and shareability of program code. Each program module refers to its GOT table in a position-independent manner and extracts absolute addresses from it. In this way position-independent references are converted to absolute addresses.

Initially, the GOT contains information about relocation points (annotations for the dynamic linker). After the system creates memory segments for the loaded object file, the dynamic linker processes the relocation points, some of which refer to the GOT. The dynamic linker determines the symbolic names associated with them, calculates their absolute addresses, and sets the appropriate values in the corresponding GOT entries. Although the absolute addresses are unknown to the link editor when it builds the object file, the dynamic linker knows the addresses of all memory segments and can therefore calculate the absolute addresses of the objects contained in them.

If the program requires direct access to the absolute address of the object, this object will have an entry in the GOT. Since the executable file and each shared object have separate GOTs, the address of a symbolic name may appear in several tables. The dynamic linker processes all GOT relocations before transferring control to the process code, which guarantees the availability of absolute addresses at runtime.

Thanks to the GOT, the system can select different memory segment addresses for one shared object in different programs. It can even choose different library addresses for different executions of the same program. At the same time, memory segments do not change addresses after the process image is set up. As long as the process exists, its segments are located at fixed addresses.

Short summary: if the program has several data segments (private or shared), then they are accessed indirectly through the GOT address table. The GOT table is part of one selected private data segment – DATA. Objects in the DATA segment itself can also be addressed indirectly through the GOT table in DATA (for example, if the relative offset distance is too large to be encoded in the instruction).

§ 11.5. Program relocation

A relocation, or unresolved reference, is a place in code or static data reserved by the compiler for a value computed later – at link time or even later at load time. It either contains no data (a field of zero bits) or contains incomplete information (an addend may be stored there for computing the resolved reference). Usually an immediate value is stored at the relocation site – an absolute address, or an offset relative to a base address or to the bundle counter – which is used in memory accesses or address calculations.

There are as many basic relocation types as there are distinct ways in the processor architecture to place an immediate value into the code of a machine instruction (not counting the constants describing shifts and some other constants that are too short and therefore not used for relocation) or into a data object. The link editor uses these unfilled (partially computed) fields at link time to embed into the previously compiled code its information about the references in the program between individual segments, sections, object modules, and dynamically linked executable modules.

The compiler creates (for later use by the linker) a table of relocation records as part of the object file. Relocation records describe how the linker (or, later, the loader) should modify an instruction or data field.

The following distinct data relocation types are defined for data sections in the POSTRISC architecture:

  1. RELOC_WORD. A 4-byte boundary aligned 32-bit field in any data section.
  2. RELOC_DWORD. An 8-byte boundary aligned 64-bit field in any data section.

For code in the POSTRISC architecture (according to the format of single-slot instructions and their extensions into the second slot), the following distinct code relocation types are defined:

  1. RELOC_LDI. A 28-bit signed immediate (or 64 bits for a long instruction) is embedded in the ldi instruction.
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other simm 28 bits
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    0 extended simm (64 bits instead of 28)
  2. RELOC_JUMP. A constant with a sign length of 28 bits (or 60 bits for a long instruction) for ip-relative offset in the program segment text (or rodata) embedded in callr, jmp or ldar instructions (distance ±2 GiB or ±8 EiB for long instructions).
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other simm 28 bits
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    0 extended simm (60 bits instead of 28)
  3. RELOC_BRANCH. A signed 17-bit immediate (or 30-bit for a long instruction) for offset in the code segment embedded in an instruction like compare-and-branch as a branch distance (distance ±1 MiB or ±8 GiB for a long instruction).
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other simm 17 bits
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    other extended (30 bits)
  4. RELOC_BINIMM. A 21-bit signed constant (or 63 bits for a long instruction) for instructions ld1, lds1, addi, subfi and others.
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other simm 21 bits
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    extended simm (63 bits instead of 21)
  5. RELOC_BINIMMU. An unsigned constant of 21 bits (or 63 bits for a long instruction) for instructions maxui, minui.
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other uimm 21 bits
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    extended uimm (63 bits instead of 21)
  6. RELOC_BRCI_SIMM. A 11-bit signed constant (or 40-bit for a long instruction) is embedded in an instruction like compare-with-constant-and-jump as a constant to be compared.
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other simm11 other
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    extended simm (40 bits instead of 11) other
  7. RELOC_BRCI_UIMM. A 11-bit unsigned constant (or 40-bit for a long instruction) embedded in an instruction like constant-compare-and-jump as a constant to be compared.
    41403938373635343332313029282726252423222120191817161514131211109876543210
    opcode other uimm11 other
    838281807978777675747372717069686766656463626160595857565554535251504948474645444342
    extended uimm (40 bits instead of 11) other
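To make the role of these records concrete, here is a minimal C++ sketch of the kind of fix-up a linker performs when it patches a signed immediate into a 64-bit instruction slot. The field position and width are hypothetical parameters of the sketch, not the exact POSTRISC encodings listed above.

#include <cassert>
#include <cstdint>

// Patch an N-bit signed immediate into an instruction slot at a given bit
// offset. 'shift' and 'width' are hypothetical parameters selected by the
// relocation type; they do not reproduce the real POSTRISC field positions.
uint64_t patch_simm(uint64_t slot, int shift, int width, int64_t value)
{
    // Check that the value fits into a signed field of 'width' bits.
    const int64_t lo = -(int64_t(1) << (width - 1));
    const int64_t hi =  (int64_t(1) << (width - 1)) - 1;
    assert(value >= lo && value <= hi && "relocation overflow");

    const uint64_t mask = ((uint64_t(1) << width) - 1) << shift;
    return (slot & ~mask) | ((uint64_t(value) << shift) & mask);
}

int main()
{
    uint64_t slot = 0x123400000000ull;                           // instruction with an empty field
    slot = patch_simm(slot, /*shift=*/0, /*width=*/28, -12345);  // RELOC_LDI-like fixup
    (void)slot;
    return 0;
}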

Instruction-dependent basic relocation types further differ in how the embedded constant is formed and which conditions are checked. The set of these methods depends on the program model in use and on additional features of the instruction set architecture.

For example, the Intel x86 architecture in 32-bit mode has only one basic code relocation type – a 4-byte field inside the instruction – and only two ways of forming the constant: an absolute address for data, or an ip-relative address for code.

The POSTRISC system is oriented towards position-independent code. In addition, the special instructions ldan and ldar, designed to optimize relative addressing, require special support from the linker. Hence the large selection of possible ways of referencing a code or data object, depending on the location of the object, its distance from the base of relative addressing, the presence of indirect links through the GOT table, and the number of repeated references to the same object or to objects near it.

The method of referencing an object at the relocation site (the way a symbol name is converted into an embedded constant) is usually written in assembler as a call to a special function in the constant operand of an instruction, or as a suffix added to the object name. This is not really a function call, but a mark for the linker (a greeting to it from the compiler) describing exactly how to construct the relocation constant from the object name. The set of relocation methods depends on the machine instruction architecture (whether, for example, long constants are synthesized from parts split across several instructions, or several typical program sizes with different addressing methods are provided) and on the chosen program model (absolute code for a system kernel versus a user program). The set of POSTRISC assembler functions below is generally traditional for the 64-bit PIC program model and is found (with some variations) in all 64-bit architectures: DEC Alpha, SGI MIPS, IBM PowerPC, Sun UltraSPARC, Intel Itanium.

Table 11.6: Assembler functions to set the relocation method
Group (scope)                                 Function (method)   How the address is obtained from the offset at runtime
Absolute addresses (data only)                symbol              symbol
                                              expr                symbol + offset
                                              got(symbol)         mem8[offset]
Private process data                          pcrel(expr)         ip + offset
                                              ltoff(expr)         mem8[ip + offset]
Thread local data (main program)              tprel(expr)         tp + offset
                                              @tprel@got(expr)    mem8[tp + offset]
Private thread data (dynamic modules)         dtprel(expr)        mid = mem4[gp + mid_offset]
                                                                  local_tp = mem8[dtv + mid]
                                                                  ea = local_tp + offset
Support for the ldan instruction (all data)   data_hi(expr)       base + (offset << 15)
                                              data_lo(expr)       base + offset
Support for the ldar instruction (all data)   text_hi(expr)       ip + 16 × offset
                                              text_lo(expr)       base + offset
Miscellaneous functions                       segrel(expr)        segbase + offset
                                              secrel(expr)        secbase + offset

The mere mention of a symbol symbol means the absolute address of the symbol object. The expression expr means a formula built from the absolute address of the object and a constant offset: symbol + offset. Absolute addresses are not computed at runtime and are used as is. Absolute addresses can be embedded in instruction code only in an absolute program (system kernel, drivers).

The function got(symbol) (global offset table) means the absolute address of the GOT table entry used for indirect access to the symbol object. At the same time it is a request to create a GOT record for the symbol object if no such record exists yet.

The got function cannot be used on its own, only together with pcrel or tprel, e.g. @pcrel@got(expr), since the GOT table is split in two (by the locality of the linked objects – process or thread), is part of the DATA and TDATA segments, and therefore must be addressed accordingly.

The function pcrel(expr) (program counter relative) means the offset relative to the instruction counter. The absolute address of the object is computed at runtime as ip + offset. It is used to access code and/or static data of the same module.

The function tprel(expr) (thread pointer relative) means the offset relative to the base register tp when addressing thread private data. The absolute address of the object is computed at runtime as tp + offset. The expr object must belong to the TDATA segment of the main module.

The function dtprel(expr) (dynamic thread pointer relative) means the offset of the object expr relative to the beginning of this module's thread private data dtv[ModID] (taken from the dtv array addressed by the dtv register). The absolute address of the object is computed at runtime as dtv[ModID] + offset. The expr object must belong to the TDATA segment of the ModID module itself.

The function data_lo(offset) describes the lower 15-bit part of the offset: data_lo(offset) = sign_extend(offset, 15). The offset is usually computed for position-independent programs as gprel(expr) or tprel(expr), depending on the location of the expr object. It is used for addressing relative to an intermediate base address computed earlier with the ldan instruction. This intermediate address can be reused for accesses to the object or its nearest neighbours with short load/store instructions (whose offsets are of minimal length, at most 16 bits).

The data_hi(offset) function describes the upper part of the offset: data_hi(offset) = (offset − sign_extend(offset, 15)) >> 15. It is used by the ldan instruction to compute an intermediate base address before short load/store instructions are used. The ldan instruction computes an absolute address no further than 16 KiB from the relatively addressed object expr. A long (over 32 KiB) offset relative to the base register is split into two parts, offset_hi and offset_lo, such that offset = (offset_hi << 15) + offset_lo; the lower part offset_lo always fits into a 16-bit signed constant, and the upper part offset_hi becomes the argument of the ldan instruction that computes the intermediate base address.

Objects in the TEXT segment (the RODATA section with read-only data) in a position-independent program must be addressed relative to the ip instruction counter. However, the ldar instruction can only produce the starting address of a 16-byte bundle, that is, the 16-byte-aligned address nearest to the target. The memory access instructions that follow must account for the short offset (0 to 15 bytes) from this starting address to the object.

The function text_hi(expr) means the part of the ip-relative offset within the text segment that the ldar instruction uses to compute the 16-byte-aligned absolute address nearest to the object. The ldar instruction computes ip + 16 × text_hi(symbol), where text_hi(symbol) is calculated by the assembler as (symbol − text_lo(symbol)) >> 4. The paired function text_lo(symbol) describes the lower part of the ip-relative offset as sign_extend(symbol, 4), that is, the difference between the address of the object and the nearest 16-byte boundary. This value is used for direct addressing in load/store instructions after the intermediate address of the 16-byte bundle has been computed with ldar.
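The two splits above follow directly from the formulas in the preceding paragraphs; below is a small C++ sketch of how an assembler or linker could compute both halves. The 15-bit data split and the 16-byte text split are taken from the text; the text split uses the unsigned 0..15 reading (the text also gives a sign_extend(·, 4) formula).

#include <cassert>
#include <cstdint>

// data_lo / data_hi split: offset == (hi << 15) + sign_extend(lo, 15).
struct Split { int64_t hi; int64_t lo; };

Split data_split(int64_t offset)
{
    const int64_t lo = ((offset & 0x7fff) ^ 0x4000) - 0x4000; // sign_extend(offset, 15)
    const int64_t hi = (offset - lo) / 32768;                 // data_hi(offset), exact division
    return {hi, lo};
}

// text_hi / text_lo split for ldar: hi is the bundle distance used as
// ip + 16*hi, lo is the position inside the 16-byte bundle.
Split text_split(int64_t offset)
{
    const int64_t lo = offset & 0xf;
    const int64_t hi = (offset - lo) / 16;
    return {hi, lo};
}

int main()
{
    const Split d = data_split(0x123456);
    assert(d.hi * 32768 + d.lo == 0x123456);
    const Split t = text_split(0x123456);
    assert(t.hi * 16 + t.lo == 0x123456);
    return 0;
}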

Function segrel (expr) (segment relative) describes the offset of the object expr relative to the start of the segment. This relocation is for data structures, which are placed in read-only shared segments but must contain pointers. In this case, the relocation point and the relocation object must be located in one segment. Applications using such relative pointers should be aware of their relativity and add the base address of the segment to them at runtime.

Function secrel (expr) (section relative) describes the offset of the expr object relative to the beginning of the section. This relocation is for links from one section to another within the same segment.

As a result, combining the type of relocation and the method of linking to an object, we get a complete set of all valid types of unresolved links, which the linker should be able to handle (minus some never-seen combinations).

Table 11.7: Types of Relocation Entries
Group                              Name                       Relocation method
absolute addressing (data only)    R_ADDR_WORD                sym + addend
                                   R_ADDR_DWORD               sym + addend
relative to ip                     R_PCREL_JUMP               pcrel(sym + addend), jump/call
                                   R_PCREL_JUMP_EXT           pcrel(sym + addend), jump/call
                                   R_PCREL_BRANCH             pcrel(sym + addend), compare-and-branch
                                   R_PCREL_BRANCH_EXT         pcrel(sym + addend), compare-and-branch
                                   R_PCREL_LDAR               text_hi(pcrel(sym + addend)), ldar
                                   R_PCREL_LDAR_EXT           text_hi(pcrel(sym + addend)), ldar
section-base relative              R_SECREL_WORD              sym - SC + addend, .mem4
                                   R_SECREL_DWORD             sym - SC + addend, .mem8
segment-base relative              R_SEGREL_WORD              sym - SB + addend, .mem4
                                   R_SEGREL_DWORD             sym - SB + addend, .mem8
base-relative                      R_BASEREL_LDI              L(sym - base + addend)
                                   R_BASEREL_LDI_EXT          L(sym - base + addend)
                                   R_BASEREL_BINIMM           sym - base + addend
                                   R_BASEREL_BINIMM_EXT       sym - base + addend
dynamic layout                     R_SETBASE                  set base
                                   R_SEGBASE                  set SB
                                   R_COPY                     dyn reloc, data copy
                                   R_IPLT                     dyn reloc, imported PLT
                                   R_EPLT                     dyn reloc, exported PLT
tp-relative                        R_TPREL_WORD               tprel(sym + addend), .mem4
                                   R_TPREL_DWORD              tprel(sym + addend), .mem8
                                   R_TPREL_LDI                tprel(sym + addend), LDI
                                   R_TPREL_LDI_LONG           tprel(sym + addend), LDI
                                   R_TPREL_HI_BINIMM          data_hi(tprel(sym + addend))
                                   R_TPREL_HI_BINIMM_EXT      data_hi(tprel(sym + addend))
                                   R_TPREL_LO_BINIMM          data_lo(tprel(sym + addend))
                                   R_TPREL_BINIMM             tprel(sym + addend), load/store
                                   R_TPREL_BINIMM_EXT         tprel(sym + addend), load/store

The assembler syntax must be consistent with the set of types of unresolved references that the linker can handle.

For example, almost no assembler/compiler can account for and handle the subtraction of two addresses from the same segment as an immediate, even though it is a link-time constant. At compile time the difference is still unknown, and at link time, when it could be determined, no corresponding relocation types are provided to pose such a task to the linker. As a result, the compiler is forced to defer these calculations to load time or run time.

The most «advanced» compilation/linking systems support the ability to postpone to the link stage unresolved links of arbitrary complexity, provided they reduce to a constant result.

§ 11.6. Thread local storage

Managing thread local storage (TLS), which is private to a thread, is not as simple as per-process private data. TLS sections cannot simply be loaded from a file into memory and made available to the program. Instead, multiple copies must be created (one per thread) and all of them must be initialized from the primary image of the TLS section in the program file. New threads can be created dynamically throughout the lifetime of the program.

TLS support should avoid creating TLS data blocks where possible, for example by using deferred allocation on the first access to the TLS. Most threads will probably never use the private data of all dynamic modules at once. Unfortunately, the deferred allocation mechanism requires at least introducing a separate functional layer to control access to TLS objects, which may be too inefficient.

The problem is the very process of compiling TLS data and accessing it when many copies of it exist. A TLS variable is characterized by two parameters: a reference to the TLS block of a particular dynamic module, and an offset within that block. To obtain the address of a variable, these two parameters must somehow be mapped onto the virtual address space at runtime.

The traditional TLS mapping approach is as follows. One of the general registers (tp or thread pointer) permanently stores the address of the static TLS block of data associated with the current thread. The data block is conditionally divided into two parts: a statically allocated single TLS data block of the main module (exe file) and the dtv vector (dynamic thread vector), storing addresses of dynamically (possibly lazy) dedicated TLS blocks for dynamically loaded dynamic modules. If the dynamic module is loaded into the program, then it is allocated one slot (a place to store the address) in the dtv vector.

Knowing its mid number, a dynamic module can find the beginning of its TLS data for the current thread in dtv[mid], or MEM(tp + mid + offset), where offset is the position of dtv relative to tp (usually 0). The address of a variable is then dtv[mid] + var_offset, where var_offset is the position of the variable within the dynamic TLS block.
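A minimal C++ sketch of this address computation follows; the names dtv, mid and var_offset are taken from the paragraph above, while the concrete placement of the dtv vector relative to tp is only an assumption made for illustration.

#include <cstddef>

// tp points at the static TLS block; the dtv vector lies at a fixed
// (usually zero) offset from tp, and dtv[mid] is the start of module mid's
// TLS block for the current thread.
char* tls_addr(char* tp, std::size_t dtv_offset, std::size_t mid, std::size_t var_offset)
{
    char** dtv = reinterpret_cast<char**>(tp + dtv_offset); // locate the dtv vector
    return dtv[mid] + var_offset;                           // block base + variable offset
}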

The general dynamic TLS model is the most universal. Code compiled for it can be used anytime, anywhere, and it can access TLS variables defined anywhere – for example, from one dynamic module it can access a TLS variable in another dynamic module. By default the compiler generates code for this model, and it may use more limited TLS models only when explicitly allowed by compiler options.

For code in this model, neither the module number (slot) in which a TLS variable resides nor the offset inside that module's TLS block is known at build time (let alone at compile time). The module number (ModuleID) and the offset in the TLS block are determined only at runtime (taken from the GOT table, where the loader writes them) and passed to the special function __tls_get_addr (the standard name on many Unix systems), which checks whether the TLS block exists, creates it if it does not, and returns the address of the variable for the current thread. Implementing this function is itself a problem requiring OS assistance.

addr1 = __tls_get_addr(GOT[ModuleID], GOT[offset1])
addr2 = __tls_get_addr(GOT[ModuleID], GOT[offset2])

The code size and runtime are such that it is best to avoid this model altogether. If the module number and / or offset are known, optimization or simplification is possible.

The local dynamic TLS model is an optimization of the general dynamic model. The compiler uses it when it knows that the TLS variables are used in the same module in which they are defined. Now the variable offsets (at least within this module's own TLS block) are known at link time; the module number is not. The function __tls_get_addr still has to be called, but now only once (with offset 0) to determine the start address of the module's «own» block of TLS variables. The addresses of individual variables are then obtained simply by adding a known offset.

addr0 = __tls_get_addr(GOT[ModuleID], 0)
addr1 = addr0 + offset1
addr2 = addr0 + offset2

Dynamic models using the __tls_get_addr function allow lazy allocation of memory for TLS data on the first access to the block.
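The lazy-allocation behaviour could look roughly like the sketch below; this is not the real runtime, only an illustration of «create the block on first access, then return dtv[mid] + offset». The tls_block_size table is a hypothetical input that the loader would provide.

#include <cstddef>
#include <vector>

// Hypothetical per-thread state: dtv[mid] stays null until the thread
// first touches module mid's TLS data.
struct ThreadTls {
    std::vector<char*> dtv;
};

// Sizes of the TLS templates of all loaded modules, filled in by the loader
// (a hypothetical table, not a real runtime structure).
std::vector<std::size_t> tls_block_size;

// Rough equivalent of __tls_get_addr(mid, offset) with lazy allocation.
char* tls_get_addr(ThreadTls& t, std::size_t mid, std::size_t offset)
{
    if (t.dtv.size() <= mid)
        t.dtv.resize(mid + 1, nullptr);
    if (t.dtv[mid] == nullptr)
        t.dtv[mid] = new char[tls_block_size[mid]](); // zero-filled; a real runtime copies the TLS template
    return t.dtv[mid] + offset;
}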

The static load TLS model (initial exec) assumes that a certain set of dynamic modules will always be loaded together with the main program. The loader can then compute the total size of the TLS blocks of all such modules and their positions within a single combined TLS block. The separate TLS blocks of the different modules lie at fixed distances from the beginning of this combined block, which the loader computes and stores in the GOT table. Now, to compute the address of a TLS variable, __tls_get_addr need not be called – a single read from the GOT record suffices – and the module number need not be known. The combined block is allocated immediately when a new thread starts (no lazy allocation). If this model is used for a dynamic library, the library cannot be loaded dynamically, only statically. Addressing is done relative to the dedicated register tp with an offset known at load time (taken from the GOT).

addr = tp + GOT[offset]

The local static TLS model (local exec) is obtained by combining the local dynamic model and the initial exec model: static loading and local references (no dynamically linked modules). The main module of the program (main) refers to TLS variables defined within it. Addressing is done relative to the dedicated register tp with an offset known at link time.

addr = tp + offset

The compiler usually (when compiling object modules separately or creating libraries) doesn't have full information about the future program as a whole. It is forced to make the most conservative decisions about the nature of the future program, which usually comes down to using the most general mechanisms for addressing private data. For TLS this is the general dynamic addressing model.

Therefore it is important that the linker, when assembling the finished program, can optimize and modify previously compiled object modules, replacing for some variables the existing addressing method with another (optimized) one. To do this, the linker must at least know where such places are (unresolved references to TLS sections), and the compiler must generate the addressing code so that it can be replaced with another. This requires the different TLS addressing methods to be equivalent in code size and in the number and kind of registers used, etc.

If the optimized version is shorter than the original, after replacement the program may be left with empty spaces filled with dummy nop instructions. If the optimized version is longer than the original, the compiler must add the dummies in advance to allow the linker to later substitute the optimized addressing variant.

§ 11.7. Modules and private data

The POSTRISC system is focused on sharing code and translation tables. It should be possible to replace shared libraries without recompiling the applications that depend on them. Any software module can be used by several processes. There should be no difference between application code and shared library code. Code and global data are addressed relative to the instruction pointer, with software redirection to the corresponding regions of private process/thread data.

A single address range is used for mapping the code sections of all program modules. This address range is shared by all processes and is execute-only. For each process, another address range is allocated for the static process data sections of all program modules. For each thread, yet another address range is allocated for the static thread data sections of all program modules. All three address range types have the same size, a power of two, and are aligned on the same boundary.

For each program module, the following three values must be equal: the offset from the beginning of the code range to the beginning of the code section; the offset from the beginning of the process private data range to the beginning of the DATA section; the offset from the beginning of the thread private data range to the beginning of the TDATA section. Knowing only ip and the base address of the private range (kept in the dedicated registers gp and tp for DATA and TDATA, respectively), the location of position-independent private data can always be computed as:

base = gp | ip{gtssize−1:0}

or indirectly

base = mem[gp + ip{gtssize−1:0} >> tgsize]

Private static data can thus easily be found by library code. It is not necessary to explicitly pass the correct address of the module's data segment (a new gp, global pointer) when calling across a module boundary or through a function pointer. A function pointer becomes just a pointer to a place in a code segment, without additional levels of indirection through a function descriptor.
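The direct form of the formula amounts to a couple of bit operations; a C++ sketch under the assumption that gp holds the aligned start of the data region and gtssize is the log2 of the shared region size:

#include <cstdint>

// Compute the base of the module's private data from the current ip and the
// region base kept in gp (or tp for thread data). gtssize is assumed to be
// log2 of the shared region size, so the low gtssize bits of ip give the
// module's offset inside the TEXT region.
uint64_t private_base(uint64_t gp, uint64_t ip, unsigned gtssize)
{
    const uint64_t offset_in_region = ip & ((uint64_t(1) << gtssize) - 1);
    return gp | offset_in_region;     // base = gp | ip{gtssize-1:0}
}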

Table 11.8: Sample map for several modules
(The original figure divides the shared address range into granules 0..39. Modules m1..m6 occupy disjoint granule ranges, and the same granule layout repeats in the global TEXT region, in the DATA region of each process A..E, and in the TDATA regions of their threads: A1, A2; B1, B2, B3; C1, C2, C3; D1; E1, E2, E3.)

When loading a software module dynamically or statically, the loader first determines whether this module is already present among the modules loaded in the system. If it is not loaded, the loader determines for the module up to three module sections: CODE, DATA and TDATA. The loader then consults the system-wide map of loaded-module regions and searches for an unoccupied address range of sufficient size. According to the rules above, identical unoccupied address ranges exist in all three regions at the same distance from the beginning of each region. Having found such a range, the loader reserves it for future use by this software module in all processes and threads.
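A sketch of the range search just described; the granule size and the bitmap representation of the system-wide map are assumptions, the point is only that one free range found in the shared map is valid for TEXT, DATA and TDATA at the same distance from each region base.

#include <cstddef>
#include <vector>

// System-wide map of address granules already reserved by loaded modules.
// One flag per granule; by construction the same index is free (or busy)
// in the TEXT, DATA and TDATA regions simultaneously.
std::vector<bool> granule_busy;

// Find 'count' consecutive free granules and reserve them for a module.
// Returns the starting granule index, or -1 if no range is large enough.
long reserve_granules(std::size_t count)
{
    std::size_t run = 0;
    for (std::size_t i = 0; i < granule_busy.size(); ++i) {
        run = granule_busy[i] ? 0 : run + 1;
        if (run == count) {
            const std::size_t start = i + 1 - count;
            for (std::size_t j = start; j <= i; ++j)
                granule_busy[j] = true;       // reserve in the shared map
            return static_cast<long>(start);
        }
    }
    return -1;
}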

The module occupies the selected address range until the last process that uses it terminates. Then the system can unload the module and free the address range, so the next time the same module may be loaded into a different address range. While the module is loaded, the base address of its text section is the same for all processes using it.

§ 11.8. Examples of assembler code

The following are examples of using ldar, jmp and bv to load constants, obtain procedure addresses, and call procedures.

Literals and other local read-only data from the TEXT segment can be loaded using ip-relative addressing. Loading a group of constants:

ldar base, text_hi(_local_data)
ldws gb, base, text_lo(_local_data)+0
ldwz gc, base, text_lo(_local_data)+4
lddz gd, base, text_lo(_local_data)+8

Getting the address of a static procedure (within 64 MiB of the current ip):

ldar base, _myfunc

Getting the address of a static procedure (further 64 MiB from the current ip):

ldar.l base, _myfunc

Getting the address of a dynamic procedure:

ldar base, text_hi(_reloc_table)
lddz gt, base, text_lo(_reloc_table)+__imp_myfunc

Call a static procedure (within 8 GB of the current ip):

callr _myfunc
_ret_label:

Call a static procedure (beyond 8 GB of the current ip):

callr.l _myfunc
_ret_label:

Calling a procedure through a pointer (in the addr register):

callri lp, addr, gz
_ret_label:

The call of the explicit dynamic procedure (correction of the call by the compiler):

ldar base, text_hi(_reloc_table)
lddz addr, base, text_lo(_reloc_table)+__imp_myfunc
callri lp, addr, gz
_ret_label:

Invoking an Implicit Dynamic Procedure (correction of the call by the linker using the stub function):

callr _glu_myfunc
_ret_label:
...
_glu_myfunc:
ldar gt, _reloc_table
ldd gt, gt, _imp_myfunc
bv gt, g0
_glu_ret_label:

Private process data (distance up to 1 MiB):

ldd gt, gp, _local_data

Private process data (distance greater than 1 MiB):

ldan gt1, gp, data_hi(_local_data)
ldd gt2, gt1, data_lo(_local_data)

thread local data (distance less than 1 MiB):

ldd gt, tp, _local_data

thread local data (distance greater than 1 MiB):

ldan g30, tp, data_hi(_local_data1)
ldd g31, g30, data_lo(_local_data1)
ldan g31, tp, data_hi(_local_data2)
ldd g32, g31, data_lo(_local_data2)

Chapter 12. Interrupts and hardware exceptions

An interruption is an action in which the processor automatically stops executing the current instruction thread. The processor usually saves part of the thread context (at least the address of the instruction from which normal execution of the instruction flow should resume). The machine state changes to a special interrupt processing mode, and the processor starts executing from the predefined address of the interrupt handler routine. Having finished processing the interrupt, the handler routine (usually) restores the previous processor state (the context of the interrupted thread) and makes it possible to continue execution of the thread with the interrupted (or following) instruction (return from interruption).

An exception is an event that, if enabled, forces the processor to interrupt. Exceptions are generated by signals from internal and external peripheral devices, instructions of the processor itself, the internal timer, debugger events, or error conditions. In general, exceptions do not coincide one-to-one with interrupts: different exceptions may generate an interrupt of the same type, and one exception can produce several interrupts.

§ 12.1. Classification of interrupts

All interrupts can be classified according to the following independent characteristics: the location of the interrupt service code, synchronism with the context, synchronism with the instruction flow, criticality, and accuracy.

By the location of the service code, interrupts are divided into two groups. Interrupts of the first group depend on the specific implementation of the processor and/or platform. These are RESET (power-up, hardware or «cold» start), INIT (soft or «warm» restart), CHECK (test and, possibly, recovery of the processor and/or platform upon failure), and PMI (a request to the processor/platform for an implementation-specific service). The method of handling such interrupts is unknown to the operating system. The code for processing them is stored in an intermediate layer between the OS and the hardware (PAL). The handler addresses for such interrupts are fixed for a given processor implementation and are tied to the address range of the PAL library. The code, in whole or in part (if the implementation allows PAL updates), is stored in the write-protected PAL memory area.

Interrupts of the second group are determined by the architecture (fixed) and do not depend on the specific processor implementation. The method of servicing such interrupts is selected by the operating system. The code for their processing is stored in the interrupt table, the location address of this table and its contents are set by the OS. Interrupts of the second group are also called vector interrupts, since the processor uses the interrupt vector number to select the handler code from the interrupt table.

Synchronism with the context determines whether the interrupted instruction flow can be continued. For RESET or CHECK, continuation of the interrupted execution context is impossible – it either doesn't exist yet, or it is not restored. A machine check (restart, reset) breaks context synchronization with respect to subsequent instructions. For other interrupt types, the interrupted thread context is usually restored after the interrupt is processed. Such interrupts are also called context-synchronous or recoverable: after the interruption is handled, the interrupted sequence of instructions can continue (the execution context is saved and restored). An interrupt is unrecoverable if, during its generation or processing, the contents of processor registers, cache memory, write buffers, etc. are lost.

Synchronism with the thread describes the relation of the interruption to the interrupted instruction thread. Interrupts asynchronous to the thread are caused by events that do not explicitly depend on the instructions being executed. For asynchronous interrupts, the address reported to the exception handling routine is simply the address of the next thread instruction that would have been executed had the asynchronous interrupt not occurred. Interrupts synchronous to the thread are caused directly by the execution, or attempted execution, of an instruction from the current thread. Synchronous interrupts are processed strictly in program order, and if several interrupts occur for a single instruction, in order of interrupt precedence. Thread-synchronous interrupts are divided into two classes: errors (faults) and traps.

An error, or fault, is an interrupt that occurs before the instruction completes. The current instruction cannot (or should not) be executed, or system intervention is required before it executes. Errors are synchronous with respect to the instruction flow. The processor completes the state changes of the instructions preceding the erroneous one; the erroneous instruction and subsequent instructions have no effect on the machine state. Any intermediate results of the instruction are completely cancelled upon the error, and after the interrupt is processed the instruction is restarted. Synchronous error interrupts precisely indicate the address of the instruction that caused the exception that generated the interrupt.

A trap is an interrupt that occurs after an instruction has executed. The completed instruction requires system intervention. Traps are synchronous with respect to the instruction flow. The trapping instruction and all previous instructions are complete; the following instructions have no effect on the machine state. The instruction that generated the trap is neither cancelled nor restarted. Synchronous traps precisely indicate the address of the instruction following the one that raised the exception that generated the interrupt.

When the execution of an instruction causes a trap, or an attempt to execute an instruction causes an error, the following conditions must hold at the interruption point:

Critical interrupts. Some types of interruptions require immediate attention even if other interrupt types are currently being processed and the machine state (return address and contents of the machine status registers) has not yet been saved. In addition, the interrupt handler itself may generate an interrupt that requires a new handler. For example, if the page table resides in virtual memory, handling a DTLB or ITLB miss may itself cause another DTLB miss.

According to these requirements, interruptions can be classified by criticality level. To allow a more critical interrupt to occur immediately after the start of processing a less critical one (that is, before the machine state has been saved), the architecture provides several sets of shadow registers for saving the machine state. Interrupts of each criticality class use their own register set.

All interrupts, except machine check, are ordered into two categories of interrupt criticality, so that only one interrupt of each category is processed at a time and no part of the program state is lost while it is being processed. Since the group of registers for saving/restoring the processor state upon interruption is a serially reusable resource shared by all interrupts of the same class, program state may be lost when an unordered interrupt occurs.

Interrupt accuracy is an optional property of flow-synchronous interrupts. Accurate interrupts are raised on a predictable instruction: the place where the instruction thread breaks is exactly the instruction that caused the synchronous event. All previous instructions (in program order) complete before control passes to the interrupt handler. The instruction address is saved automatically by the processor. When the interrupt handler finishes, it returns to the interrupted program and restarts execution from the interrupted instruction.

Inaccurate interrupts are not guaranteed to be raised on a predictable instruction: any instruction not yet executed when the interrupt occurred could be the point where the thread was interrupted. Inaccurate interrupts can be considered asynchronous, because the instruction that caused the interrupt doesn't necessarily correspond to the interrupted instruction. Inaccurate interrupts lag behind the interrupted thread. Inaccurate interrupts and their handlers usually collect machine state information related to the interruption for reporting through system diagnostic software. The interrupted program usually doesn't restart (cannot be recovered).

Table 12.1: Classification of Interrupts
PAL code (asynchronous to the thread or inaccurate, critical):
    unrecoverable: RESET, CHECK
    recoverable:   INIT, PMI, CHECK
Vector:
    asynchronous to the instruction thread, recoverable: INT (external interrupts)
    synchronous to the thread:
        inaccurate errors (unrecoverable or recoverable): ? maybe FPU?
        accurate, recoverable errors: TLB, access rights
        accurate, recoverable traps:  debug, FPU traps

Since not all combinations of handler code location, synchronism with the context and/or flow, criticality and accuracy exist, it is convenient to divide all interrupts into four kinds: aborts, asynchronous interrupts, and synchronous interruptions, which are in turn divided into errors (faults) and traps.

Aborts. The processor has detected an internal failure, or a processor reset has been requested. An abort is not synchronous with the context or the instruction flow. An abort can leave the current instruction thread in an unpredictable state, with partially modified registers and/or memory. Aborts are interrupts whose handler code resides in PAL.

Asynchronous interrupts. An external or independent entity (such as an IO device, the processor's own timer, or another processor) needs attention. These interrupts are asynchronous with respect to the instruction flow but usually synchronous with the context: all previous instructions are completed, and the current and subsequent instructions have no effect on the machine state. They are divided into initialization interrupts, platform management interrupts, and external interrupts. Initialization and platform management interrupts are PAL interrupts; external interrupts are vectored interrupts.

Errors and traps. Always synchronous with context and flow. These are vector interrupts.

A machine check interruption is a special case of asynchronous interruption. Machine checks are usually caused by hardware, by a failure of the memory subsystem, or by an attempt to access an invalid address. A machine check can be caused indirectly by the execution of an instruction, if an error caused by that instruction is not recognized in time and turns into a hardware failure. Machine check interrupts cannot be said to be either synchronous or asynchronous, accurate or inaccurate. They are, however, treated as critical-class interrupts.

In the case of a machine check, the following general rules apply: 1. No instruction after the one whose address is communicated to the machine check interrupt routine in the iip register has started execution. 2. The instruction whose address is communicated to the machine check interrupt routine in the iip register, and all previous instructions, may or may not have completed successfully. All instructions that are ever going to complete appear to have done so already, within the context existing prior to the machine check interruption. No further interruption (other than new machine check interruptions) will occur as a result of those instructions.

§ 12.2. Processor state preservation upon interruption

When an interrupt occurs, the processor saves part of the context of the interrupted instruction stream in special registers. This is necessary for the subsequent correct restoration of the interrupted stream after interrupt processing completes. These registers are: iip, a copy of ip, and ipsr, a copy of psr.

The processor provides the interrupt handler with a minimal set of free registers for intermediate computations, so that the handler can use these registers for its own purposes. The special register group ifa, cause, iib stores information about the characteristics of the interrupt needed to recognize and process it.

The special register group (iip, iipa, ipsr, ifa, cause, iib) is used to quickly save part of the machine state during interruptions, service the interrupt, and restore the initial machine state when returning from the interrupt. This group exists in two instances to service two interrupt priority (criticality) levels, forming a file of 2 banks of 16 special registers.

These registers hold information during interruption and are used by interrupt handlers. They can only be read or written while psr.ic=0 (while interrupt processing is in progress); otherwise an «Illegal Operation fault» occurs. The contents of these registers are guaranteed to be preserved only when psr.ic=0; when psr.ic=1, the processor doesn't preserve their contents.

The special register interruption instruction pointer (iip) saves a copy of the ip register upon interruption and indicates the return point from the interrupt. In general, iip contains the address of the instruction bundle containing the instruction that caused the error, or the address of the bundle containing the next instruction to return to after processing a trap. The indicated and following instructions are restarted; the previous ones are ignored. Outside the interrupt context the value of this register is undefined.

Special register interruption instruction previous address (iipa), when interruption occurs, saves the address of the last successfully executed (all slots) instruction bundle.

Register format iip and iipa
    bits 63..4: bundle address    bits 3..0: 0

The special register interruption processor status register (ipsr) saves a copy of the psr (machine status) register upon interruption and has the same format and set of fields as psr. ipsr is used to restore the processor state when returning from an interrupt with the rfi (return from interruption) instruction.

The special register cause stores, during a non-critical (primary) interruption, information about the interruption that occurred. It contains data to differentiate between the different exceptions that a single interrupt type can generate. When one of these interrupts is raised, the bit or bits corresponding to the particular exception that generated the interrupt are set, and all other bits of cause are cleared. Other interrupt types do not affect the contents of cause. The cause register must not be cleared by software. It records the nature of the interrupt and is written by the processor on all interrupt events, regardless of psr.ic, except for «Data Nested TLB faults». cause stores information about the interrupted instruction and its properties, such as read, write, execute, speculative, or non-access. Several bits can be set simultaneously; for example, an erroneous semaphore operation can set both cause.r and cause.w. Additional information about the fault or trap is available through cause.code and cause.vector.

Register format cause
    bits 63..32: reserved | vector (8 bits)
    bits 31..0:  code (16 bits) | reserved | ei (2 bits) | d | n | a | r | w | x
Table 12.2: cause register fields
Field   Bits  Description
r       1     Read exception. If 1, the interrupt is associated with a data read.
w       1     Write exception. If 1, the interrupt is associated with a data write.
x       1     Execute exception. If 1, the interrupt is associated with instruction fetch.
n       1     Non-access – translation-request instructions (dcbf, fetch, mprobe, tpa).
d       1     Exception deferral – this bit is set from the TLB exception deferral bit (tlb.ed) of the code page containing the faulting instruction. If no translation exists, or translation for the code is prohibited, cause.ed=0. If 1, the interrupt is deferred.
ei      2     Excepting instruction – the slot number of the bundle on which the interrupt occurred. For errors and external interrupts cause.ei = iip.sn, but not for traps; for traps, cause.ei identifies the slot of the trapping instruction.
code    16    Interruption code – 16 bits of additional information about the current interrupt.
vector  8     8-bit code with additional information about an external interrupt.

Notes: The information in the cause register is not complete. System software may also need to identify the type of instruction which caused the interrupt and examine the TLB entry accessed by the data or instruction reference to fully determine which exception or exceptions caused the interrupt. For example, a data memory interruption can be caused both by protection violation exceptions and by byte-order exceptions. System software would have to examine, besides cause, the saved status in ipsr and the page protection bits in the TLB entry accessed by the memory reference, to determine whether a protection violation also occurred. The bits of the saved ipsr register can be changed before returning from the interrupt via rfi.
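For illustration, a decoder for the cause fields might look like the sketch below. The exact bit positions are assumptions read off the format picture above (x, w, r in the low bits, a 2-bit ei field, code at bits 31..16 and vector at bits 39..32), not a normative layout.

#include <cstdint>

// Decoded view of the cause register. Bit positions are assumed from the
// format picture: ... | code(16) | reserved | ei(2) | d | n | a | r | w | x.
// The 'a' bit shown in the picture is not described in the table and is
// skipped here.
struct Cause {
    bool x, w, r, n, d;        // execute / write / read / non-access / deferral
    unsigned ei;               // excepting instruction slot
    unsigned code;             // 16-bit interruption code
    unsigned vector;           // 8-bit external interrupt vector
};

Cause decode_cause(uint64_t cause)
{
    Cause c{};
    c.x      = cause & 1;                  // bit 0 (assumed)
    c.w      = (cause >> 1) & 1;
    c.r      = (cause >> 2) & 1;
    c.n      = (cause >> 4) & 1;
    c.d      = (cause >> 5) & 1;
    c.ei     = (cause >> 6) & 0x3;         // 2-bit slot number (assumed position)
    c.code   = (cause >> 16) & 0xffff;
    c.vector = (cause >> 32) & 0xff;
    return c;
}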

The special register interruption faulting address (ifa) provides, upon interruption, the effective address computed by the interrupted instruction (virtual, or physical if translation is not used). For loads, stores, atomics, or cache management instructions that caused an interrupt while accessing memory – due to misalignment, a data/instruction TLB miss, or any other reason – ifa contains the faulting data address and points to the first byte of the faulting operand. For other instructions, ifa contains the address of the instruction bundle. For faulting instruction addresses, ifa stores the 16-byte-aligned bundle address of the faulting instruction. ifa is also used to temporarily hold the virtual address of a translation when a translation entry is inserted into the TLB (instruction or data).

The special 128-bit register interruption instruction bundle (iib) saves, upon interruption with psr.ic=1, the current instruction bundle containing the faulting instruction. The interrupt handler may use iib, if needed, to disassemble the faulting instruction and emulate its execution.

Register format iib (128 bits, two 64-bit words)
    bits 63..0:   slot2 (low part) | slot1 | tp
    bits 127..64: slot3 | slot2 (high part)

§ 12.3. Exception Priority

There are two types of exceptions: those caused directly by the execution of an instruction (synchronous to the instruction stream) or caused by asynchronous events. In both cases, an exception can cause one of several types of interrupts.

The architecture requires that all synchronous interrupts be handled according to the sequential (program-order) execution model. The exception to this rule is the case of multiple synchronous interrupts from a single instruction.

For any instruction attempting to raise several exceptions whose corresponding synchronous interrupt types are enabled, a priority order defines which single interrupt the instruction is allowed to generate. This exception priority mechanism, besides the requirement that synchronous interrupts be generated in program order, also ensures that at any given time only one of the synchronous interrupt types is under consideration. The mechanism also suppresses some debug exceptions that occur in combination with other synchronously generated interrupts.

This section doesn't define whether multiple exceptions whose corresponding interrupt types are disabled may be recorded. Raising exceptions whose corresponding interrupt types are disabled has no effect on raising other exceptions whose interrupt types are enabled. Conversely, if a specific exception whose interrupt type is enabled is shown in the following sections to have a higher priority than another exception, it prevents that other exception from being recorded, regardless of whether the other exception's interrupt type is enabled or disabled.

The priority of exception types is listed below from highest to lowest. Some types of exceptions can be mutually exclusive and can be considered as exceptions of the same priority. In these cases, the exceptions are listed according to the sequential execution model.

Table 12.3: Priority of exception types (group / number / exception / description)

Aborts
  1   Machine reset abort (RESET) – Reboot
  2   Machine check abort (CHECK) – Processor check
External interrupts
  3   Initialization interrupt (INIT) – Warm restart
  4   Platform management interrupt (PMI) – Platform interrupt (chipset, board)
  5   External interrupt (INT) – External devices, timer, other processors
Runtime errors for the asynchronous register stack (spill/fill faults)
  7   RS Data debug fault – Address and memory access match one of the debug registers
  8   RS Unimplemented data address fault – Non-zero bits in the unimplemented bits of the address
  10  RS Data TLB Alternate fault – Data TLB miss (without HPT)
  11  RS Data HPT fault – HPT error
  12  RS Data TLB fault – Data TLB miss (after HPT)
  13  RS Data page not present fault – Data page is not in physical memory
  16  RS Data access rights fault – Access to a virtual memory page in an unauthorized way, for example reading from a page for which reading is prohibited
  17  RS Data access bit fault – Access to a virtual memory page (first access)
  18  RS Unsupported data reference fault – Data access is not supported by the memory attributes
Instruction fetch faults
  21  Instruction TLB Alternate fault – Instruction TLB miss (without HPT)
  22  Instruction HPT fault – HPT error
  23  Instruction TLB fault – Instruction TLB miss (after HPT)
  24  Instruction Page Not Present fault – Instruction page is not in physical memory
  25  Instruction Access rights fault – Instruction fetch from a virtual memory page for which execution is not allowed
  26  Instruction Access Bit fault – Instruction fetch from a virtual memory page (first fetch)
Decode faults
  27  Illegal operation fault – Reserved instruction
  28  Privileged operation fault – Privileged instruction
  29  Undefined operation fault – Invalid instruction form
  30  Disabled floating-point fault – Forbidden FP instruction
  31  Unimplemented operation fault – Unimplemented standard instruction (emulation required)
  32  Unsupported operation fault – Unimplemented dedicated instruction (emulation required)
Execute faults
  33  Reserved register/field fault – Invalid instruction field value (in particular, register number)
  34  Out-of-frame rotated register – Access to a rotated register outside the local frame
  35  Privileged register fault – Attempt by an unprivileged program to perform a privileged operation on a privileged register
  36  Invalid register field fault – Attempt to write an invalid value to registers or the TLB
  37  Virtualization fault – Attempt to execute a special instruction in processor virtualization mode
  38  Integer overflow fault – Integer overflow
  39  Integer divide by zero fault – Integer division by zero
  40  Floating-point fault – Floating-point error
Execute faults (memory access)
  42  Data debug fault – Address and memory access match one of the debug registers
  43  Unimplemented data address fault – Non-zero bits in the unimplemented bits of the address
  44  Data TLB Alternate fault – Data TLB miss (without HPT)
  45  Data HPT fault – HPT error
  46  Data TLB fault – Data TLB miss (after HPT)
  47  Data page not present fault – Data page is not in physical memory
  48  Data access rights fault – Access to a virtual memory page in an unauthorized way, such as reading from a page for which reading is prohibited
  49  Data access bit fault – Access to a virtual memory page (first access)
  50  Unaligned data reference fault – Data access at an unaligned address
  51  Unsupported data reference fault – Data access is not supported by the memory attributes
Traps
  53  Lower-Privilege Transfer trap – Debugger, privilege level change
  54  Taken branch trap – Debugger, taken branch
  55  Instruction Debug trap – Debugger, attempt to jump to an address that matches one of the address ranges in the debug registers
  56  System call trap – Debugger, system call intercept
  57  Single step trap – Debugger, trap after each instruction
  58  Unimplemented Instruction address trap – Unimplemented address of the next instruction bundle
  59  Floating-point trap – Floating-point instruction requires intervention
  60  Software trap – Software trap instruction

If an instruction raises multiple debug exceptions and doesn't raise any other exceptions, then it is permissible to generate a single debug interrupt (highest priority).

§ 12.4. Interrupt handling

The start addresses of interrupt handler code can be fixed in the architecture (old ARM, MIPS). But it is desirable to be able to switch the entry point (for example, for updates), and also, possibly, to assign different handlers to different processors in a multiprocessor system, since several interrupts may be processed simultaneously by different processors and no processor can use shared memory blocks for the needs of its interrupt handler. The special register interruption vector address (iva) determines the position of the system table of interrupt handlers in the virtual address space (or the physical address space if translation is disabled). The vector table is 64 KiB in size and must be aligned on a 64 KiB boundary, so the lower 16 bits of the register must be zero.

Register format iva
    bits 63..16: iva (vector table base)    bits 15..0: 0

For each of the 64 interrupt types, the table allocates 1024 bytes of code (64 bundles, or 192 short instructions). The address of an interrupt handler is obtained by combining the iva register and the interrupt vector number inum. If some vector is unused, the preceding vector (by number) may use its slot in the table for code. If an interrupt handler nevertheless doesn't fit in the table, it should branch outside the vector table.

Address of the interrupt handler
    bits 63..16: iva base    bits 15..10: inum    bits 9..0: 0

Interrupt handling is implemented as a quick context switch (much simpler than completely changing the context of the process). When an interrupt occurs, the hardware does the following:

  1. If psr.ic=1, the psr register is stored in ipsr.
  2. If psr.ic=1, the ip register is stored in iip, and the address of the last fully executed instruction bundle (the previous ip) is saved in iipa.
  3. If psr.ic=1, the interrupted instruction bundle (or its first slot for a long instruction) is stored in iib.
  4. If psr.ic=1 and there is an effective address associated with the interrupted instruction (load/store instructions, atomic memory operations, branches), this address is copied to ifa.
  5. The number of the slot that caused the interrupt is stored in cause.ei.
  6. Other additional information about the interrupted instruction is stored in cause.
  7. A mask of the instructions that must continue execution after the interruption from the middle of the bundle is saved in psr.ri.
  8. The psr.ic bit is cleared (saving of state for subsequent critical interrupts is disabled).
  9. The psr.i bit is cleared (other interrupts are disabled).
  10. The current privilege level psr.cpl changes to the kernel level (zero).
  11. Execution continues from the address iva + (1024 × interruption_number).

interruption_number is a unique integer assigned to each interrupt. Vectoring is done by branching into the interrupt vector table indexed by this integer. The interrupt vector table contains 1024 bytes (64 instruction bundles) per interrupt handling routine. The value in the iva register must be aligned on a 64 KiB page boundary.
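The dispatch step itself amounts to a couple of lines; the sketch below only mirrors the iva + 1024 × interruption_number rule and the 64-vector, 64 KiB table size given in the text.

#include <cstdint>

// Each of the 64 interrupt vectors owns 1024 bytes (64 bundles) of handler
// code inside the 64 KiB table pointed to by iva.
uint64_t handler_address(uint64_t iva, unsigned interruption_number)
{
    // iva must be 64 KiB aligned; the vector number selects a 1 KiB slot.
    return (iva & ~uint64_t(0xffff)) + 1024u * (interruption_number & 63u);
}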

Notes: The task of interrupt handlers is to resolve (unmask) external interrupts (by setting the psr.i bit to 1) as soon as possible, to minimize the worst latency for external interrupts.

At the end of the interrupt routine, rfi (return from interruption) is executed which restores the state register of the machine psr from ipsr, and normal instruction execution resumes from the address contained in iip.

Chapter 13. External interrupts

The architecture defines a mechanism for delivering external interrupts to the processor from other devices, external interrupt controllers, other processors; interrupt handling mechanism; a mechanism for sending interrupts to other processors. All this is handled by the processor's embedded interrupt controller.

Traditionally, interrupts are delivered to the processor over a separate serial bus, unlike ordinary data, which is delivered over the system bus. This creates an ordering problem that is traditionally solved by software or by complex bus-matching logic. If a data write is followed by an interrupt, the interrupt may reach the processor before the data write takes effect, so the processor sees stale data. If only the system bus is used to deliver interrupts along with normal data, the ordering problem disappears.

The POSTRISC architecture replaces the traditional serial interrupt bus with a system bus interrupt delivery implementation. Therefore, interrupt transfer capabilities are scaled along with the system bus speed. External IO interrupts are delivered directly via the IO bus, which also speeds up delivery to the system bus.

Unlike PCI, where a device sends everyone a common interrupt signal, a device can now send a unique vector by writing it to a specific address. For each device in the system, the OS can configure the address of the receiver of its interrupts (possibly one per device) and select up to 32 different vectors per device.

The architecture introduces batch interrupt handling to minimize the number of context switches, unlike the previous approach where each interrupt is processed in its own context. This allows the interrupt handler to handle all pending interrupts without changing the processor priority level, reducing the number of context switches and improving performance.

The architecture rejects interrupts based on individual pins in favour of interrupts delivered as special messages on the system-wide bus. To add more interrupt sources with the pin mechanism, more pins are needed; with the message mechanism there is no restriction on the number of interrupt sources on a shared bus.

External interrupts are not related to the execution of the instruction thread (they are asynchronous to the thread). The processor is responsible for ordering and masking (disabling) interrupts, sending and receiving interprocessor interrupt messages, receiving interrupt messages from external interrupt controllers, and managing local interrupt sources (from itself). External interrupts are generated by four kinds of sources in the system:

External interrupt controllers. Interrupt messages from any external source can be sent to any processor from an External Programmable Interrupt Controller (EXTPIC), which collects interrupts from several simple devices, or from an IO device capable of sending interrupt messages directly (with a built-in controller). The interrupt message informs the processor that an interrupt request has been made and specifies the unique vector number of the external interrupt. An interrupt request from a simple device is raised when a steady signal level is detected, or when a change of signal level is detected. Processors and external interrupt controllers communicate over the system bus according to the interrupt message protocol defined by the bus architecture.

Devices locally attached to the processor. Interrupts from these devices are signaled via the processor pins for direct interrupts (LINT, INIT, PMI) and are always directed to the local processor. LINT pins can be connected directly to the local external interrupt controller. LINT pins are programmable both for edge or level sensitivity and for the type of interrupt that is generated. If they are programmed to generate external interrupts, then each LINT pin has its own vector number. Only LINT pins connected to the processor can directly generate level-sensitive interrupts. LINT pins cannot be programmed to generate level-sensitive PMI or INIT interrupts. The INIT and PMI pins generate their corresponding interrupts. An interrupt on the PMI pin is generated with PMI vector number 0.

Internal processor interrupts. These are, for example, interrupts from the processor timer, from the performance monitor, or interrupts due to machine checks. These interrupts are always routed to the local processor. A unique vector number can be programmed for each interrupt source.

Other processors. Each processor can interrupt any other processor, including itself, by sending an interprocessor interrupt message to a specific target processor. The destination of the interrupt message (one of the processors in the system) is determined by the unique identifier of the processor in the system.

§ 13.1. Programmable external interrupt controllers

An external interrupt controller (EXTPIC) provides incoming interrupt signal lines, by which devices inject interrupts into the system in the form of a steady-state signal level (level) or a signal level change (edge).

EXTPIC contains a Redirection Table (RT) with an entry for each incoming interrupt line. Each entry in RT can be individually programmed to recognize how interrupts are signaled on the line (edge or level), which vector (and therefore which priority) the interrupt has, and which of all possible processors should serve the interrupt. RT content is controlled by software (mapped to physical addresses and writable by processors) and receives default values at reset. The table information is used to send messages to the local interrupt controller of the target processor via the system bus.

EXTPIC functionality can be integrated directly into the end device, but any component of the system that is capable of sending interrupt messages on the IO bus must behave like an EXTPIC and must have EXTPIC functionality.

Table 13.1: EXTPIC controller registers
Name | Address | Description
EXTPIC Version register | Base + 0x00 |
IO eoi register | Base + 0x08 |
Redirection Table Entry X | Base + 0x10 + 8·X |
EXTPIC block format (bits 31..0)
0 | Selected register
Window register
0 | max RT num | 0 | version
eoi
Redirection Table Entry format (bits 31..0)
0 | pid
0 | p | s | m | t | dm | 0 | vector

Delivery Mode (DM): delivery method. Delivery Status (S): 0 (Idle) or 1 (Pending). Interrupt Input Pin Polarity (P): 0 (High) or 1 (Low). Trigger Mode (T): edge (0) or level (1). Mask (M): masks the interrupt. Processor ID (PID): identifier of the target processor.
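For illustration, the following C sketch shows how an operating system might program one redirection table entry, assuming RT entry X is memory-mapped at Base + 0x10 + 8·X as in Table 13.1. The bit positions used for the pid, vector and trigger fields are assumptions of this sketch (the figure above fixes only the field order, not the bit numbers).

    #include <stdint.h>

    /* Hypothetical bit positions inside a 64-bit RT entry; the figure above
     * fixes only the field order (pid; p, s, m, t, dm, vector), not the
     * exact bit numbers, so the shifts here are illustrative. */
    #define RT_VECTOR_SHIFT   0u   /* 8-bit interrupt vector          */
    #define RT_TRIGGER_SHIFT 12u   /* t: 0 = edge, 1 = level          */
    #define RT_PID_SHIFT     32u   /* target processor id (high word) */

    /* Program redirection table entry 'line' of an EXTPIC whose register block
     * starts at 'base': route the line to processor 'pid' with 'vector',
     * with the requested trigger mode, unmasked (m = 0). */
    static void extpic_route_line(volatile uint64_t *base, unsigned line,
                                  uint32_t pid, uint8_t vector, int level)
    {
        uint64_t entry = ((uint64_t)pid << RT_PID_SHIFT)
                       | ((uint64_t)(level ? 1 : 0) << RT_TRIGGER_SHIFT)
                       | ((uint64_t)vector << RT_VECTOR_SHIFT);

        /* RT entry X is located at Base + 0x10 + 8*X (Table 13.1). */
        base[2u + line] = entry;
    }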

§ 13.2. Built-in interrupt controller

From the point of view of other processors and IO devices, the processor itself is a device with a built-in programmable external interrupt controller. The only difference is that the processor itself programs its built-in interrupt controller, and is not programmed by other processors.

The local interrupt controller determines whether the processor should accept interrupts sent via the system bus, provides local registers for pending interrupts, nesting and masking of interrupts, manages interactions with its local processor, and delivers interprocessor messages to its local processor.

In older architectures, this programming was carried out in the same way as for external controllers, via memory-mapped registers. This required each processor to allocate its own address range for its interrupt controller and made it possible to make strange errors by accessing the controller of a «foreign» processor.

Later architectures prefer to implement the registers of the embedded controller as special registers inside the processor, without mapping them into the address space. This removes the need to map controller registers to physical addresses and solves the access problem. Examples are the newer integrated Intel X2APIC interrupt controller, which replaces XAPIC (the full chronology is: PIC - APIC - XAPIC - X2APIC), and the IA64 SAPIC (streamlined integrated interrupt controller).

The POSTRISC architecture naturally follows the newer approach. The processor software manages external interrupts by changing special processor registers that control the built-in external interrupt controller. These registers are summarized in the table below; they are used to prioritize and deliver external interrupts and to assign external interrupt vectors to interrupt sources inside the processor, such as the timer, the performance monitor, and machine checks.

Table 13.2: External interrupt control registers
Name | Description
lid | Local Identification register
tpr | Task Priority register
irr0-irr3 | Interrupt Request registers (read only)
isr0-isr3 | Interrupt Service registers
itcv | interval time counter vector
tsv | thermal sensor vector
pmv | performance monitor vector
cmcv | corrected machine-check vector

Special task priority register (tpr) controls the forced masking (prohibition) of external interrupts depending on their priority. All external interrupt vectors with a number greater than mip (mask interrupt priority) are masked.

Register format tpr
bits 63..32: reserved
bits 31..0: reserved | mip

§ 13.3. Handling external interrupts

To minimize the cost of handling external interrupts, you need to reduce the total number of context switches (processor interrupts and return from interrupts). It is desirable to be able to batch process interrupts, that is, once interrupting the processor, process all interrupts awaiting processing. To do this, the mechanism for determining which external interrupts are pending should be separated from the processor interrupt mechanism.

The special register group interrupt request registers (irr0-irr3) stores a 256-bit vector of external interrupts awaiting processing (one bit for each possible interrupt vector number from 0 to 255). A bit set to 1 in irr means that the processor has received the corresponding external interrupt. The registers are read-only; writing is prohibited (invalid operation). Vector numbers 1-15 are reserved for internal and local interrupts. The zero bit of the register irr0 is always zero: it is the special «spurious» (empty) interrupt vector. Reading from the register iv clears the bit corresponding to the highest-priority pending interrupt and returns its vector index (or the spurious vector if there are no received interrupts).

Register format irr (each register holds 64 bits of the 256-bit pending vector)
irr0: bits 63..16 — pending bits; bits 15..1 — rv; bit 0 — always 0 (spurious)
irr1: pending bits for vectors 127..64
irr2: pending bits for vectors 191..128
irr3: pending bits for vectors 255..192

The privileged register interrupt vector (iv) returns the highest-priority unmasked number (vector) among external interrupts awaiting processing. If there is such an external interrupt, the processor moves the interrupt vector from the pending category to the in-service category. All vectors of the same and lower priority are masked until the processor finishes processing this interrupt. If there are no pending external interrupts or all external interrupts are masked, then iv returns the special value 0 (the spurious interrupt vector).

The end-of-interrupt indication is a write to iv (end of interrupt). This is a signal that the software has finished servicing the last high-priority interrupt whose vector was obtained by reading from iv. The processor removes this interrupt vector from the in-service category and unmasks interrupts with lower or equal priority.
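A minimal sketch of such batch handling, assuming hypothetical read_iv()/write_iv() helpers that wrap the privileged special-register accesses: the handler drains the pending interrupts by reading iv until the spurious vector is returned, dispatching each vector and then signalling end of interrupt by writing to iv.

    #include <stdint.h>

    #define SPURIOUS_VECTOR 0  /* iv returns 0 when nothing unmasked is pending */

    /* Placeholders for the privileged special-register access instructions;
     * their names and signatures are assumptions of this sketch. */
    extern uint64_t read_iv(void);            /* read vector, pending -> in-service */
    extern void     write_iv(uint64_t eoi);   /* end of interrupt                   */

    extern void dispatch_external_interrupt(unsigned vector);

    /* Batch handler: service all pending external interrupts in one entry
     * into the handler, without changing the processor priority level. */
    void external_interrupt_handler(void)
    {
        for (;;) {
            uint64_t vector = read_iv();      /* highest-priority unmasked vector */
            if (vector == SPURIOUS_VECTOR)
                break;                        /* nothing left to service */

            dispatch_external_interrupt((unsigned)vector);

            write_iv(vector);                 /* EOI: unmask equal/lower priorities */
        }
    }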

§ 13.4. Handling local interrupts

The processor itself may generate interrupts that are asynchronous to the current instruction thread and not related to external devices, for example, in the case of a time slice end (itc matches itm), itc overflow, performance monitor counter overflow, processor overheating, an internal processor error, etc.

In this case, it is convenient to present these interrupts as if they were external and serve them according to the same principles. To do this, a dedicated external interrupt vector is mapped to each asynchronous interrupt from the processor. Accordingly, some interrupt vectors are mapped to specific types of asynchronous intraprocessor interrupts; they cannot be used to program external devices.

The interval time counter vector is associated with the processor interval timer counter (itc) match or overflow. The performance monitoring vector is associated with interrupts from the performance monitor. The corrected machine check vector is associated with interrupts due to the need to correct machine errors. The thermal sensor vector is associated with interrupts due to processor overheating.

For these types of interrupts, the corresponding register allows setting the interrupt vector number or masking them (the m field).

Register format itcv, tsv, pmv, cmcv (bits 31..0)
reserved | m | rv | vector
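As an example, a sketch of programming the timer interrupt via itcv according to the layout above. The write_itcv() helper and the bit position chosen for the m field are assumptions of this sketch; only the 8-bit vector field follows from the 0..255 vector range.

    #include <stdint.h>

    #define ITCV_MASK_BIT (1u << 16)   /* position of the m bit is an assumption */

    /* Placeholder for the privileged write to the itcv special register. */
    extern void write_itcv(uint32_t value);

    /* Route interval-timer interrupts to 'vector' and unmask them.
     * Vectors 1..15 are reserved for internal/local interrupts, so a local
     * source such as the timer would typically use one of them. */
    void arm_timer_interrupt(uint8_t vector)
    {
        write_itcv((uint32_t)vector);      /* m = 0: timer interrupt enabled */
    }

    void mask_timer_interrupt(void)
    {
        write_itcv(ITCV_MASK_BIT);         /* m = 1: timer interrupt masked */
    }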

§ 13.5. Processor identification and interprocessor messages

The special local identification register (lid) contains the processor core identifier. It serves as the physical name of the processor for all interrupt messages (external interrupts, INIT interrupts, PMI platform interrupts). The contents of the register lid are set by the platform during boot/initialization, based on the physical location of this processor in the system. This value is implementation-dependent and should not be changed by software (it is available read-only). When receiving interrupt messages on the system bus, processors compare their lid with the destination address of the interrupt message. In case of a match, the processor accepts the interrupt and stores it in its queue of pending interrupts.

Register format lid (bits 31..0)
reserved | lid

Each processor can interrupt any other processor, including itself, by sending an inter-processor interrupt message (IPI). Different architectures have different approaches to the organization of interprocessor interrupts and their delivery.

For example, in the X86 architecture, each processor implements a special interrupt command register (icr), and a processor generates an IPI by writing to its own copy of this special register. The message delivery method is not defined in this case, nor is the bus used for it (a separate narrow dedicated interrupt bus can be used). The pid field in the register defines the target processor to interrupt. The remaining fields are interrupt parameters (interrupt vector number and delivery mode). Hint is an instruction to the external system either to deliver the interrupt exactly to the addressee (Hint=0) or to balance the load and deliver it to another (unoccupied) addressee chosen by the system (Hint=1).

This method, despite the simplicity and universality of the implementation (the method for delivering interrupts is not defined by the architecture), also has problems. Sending an interrupt is usually preceded by data modification operations performed on the shared bus, and an interrupt sent on the interrupt bus may overtake a data change on the shared bus. This requires complex hardware-software synchronization schemes.

Register format icr
bits 63..32: target processor id
bits 31..0: reserved | h | dm | rv | vector

Another approach is that each processor behaves like any other IO device mapped to a physical address space. A processor generates IPI for another processor by writing to a specific, architecture-specific area of physical addresses. This removes the problem of synchronizing data and interrupts, since they are sent on the same common bus, and doesn't require the implementation of a separate bus for interrupts. At the same time, however, the load on the common bus increases, but insignificantly, since the interrupt signals make up a small percentage of the total traffic of the common bus. By this principle, IPI is implemented in Intel Itanium and IBM Power architectures.

For example, in the IA64 architecture, a 1 MiB range of physical addresses from the memory-mapped device area is allocated for mapping processors (16 bytes per processor) and transmitting interrupt signals. The base address of this range is naturally aligned and architecturally fixed at 0xFEE00000. Any address of the form 0xFEENNNN0 is recognized as an interrupt signal for the processor 0xNNNN. Writing an 8-byte aligned number to an address in this range sends an interrupt to the corresponding processor. Other types of writes are not supported, nor is reading. The PID field of the address identifies the target processor to interrupt. The Hint (h) field of the address is a command to the external system either to deliver the interrupt exactly to the addressee (Hint=0) or to balance the load and deliver it to another (unoccupied) destination chosen by the system (Hint=1). The remaining fields (in the written value) are interrupt parameters (interrupt vector number and delivery mode).

Physical address to send the message (bits 63..0)
0xFEE | pid | h | 0
Record format for interrupt message (bits 63..0)
reserved | dm | vector

The POSTRISC architecture uses a generalization of the second method: the interrupt message is sent over the common system bus as a write to a dedicated physical address. Each processor core is mapped as a device into physical memory via the standard PCI-Express config space (4 KiB per device). Every processor core can be found in the PCI Express config space of the corresponding chipset/socket. The first byte of the block is used to deliver interrupts; the remaining bytes are for remote processor tuning, debugging, and monitoring, or are reserved. The physical addresses 0xPPPPP0000000-0xPPPPPFFFFFFF are reserved for mapping existing processors (up to 65536 cores per PCIe ECAM). In the current emulator implementation, for simplicity, they are mapped to the similar kernel virtual addresses 0xFFFFFFFFE0000000-0xFFFFFFFFEFFFFFFF.

Physical address for memory-mapped cores (bits 63..0)
reserved | ECAM base | Bus-Device-Function | offset

Writing an 8-byte value whose only nonzero bits are the low 8 (the vector) to the address 0xFFFFFFFFENNNN000 sends an interrupt to the processor core with device id NNNN. Writing to any other address of the form 0xFFFFFFFFENNNNXXX, or loading from any address inside the block, leads to a platform management interrupt for the sending core.

Table 13.3: The processor core memory-mapped physical address block layout
address | bytes 0..7
0xFENNNN000 | vector (byte 0), reserved = 0 (bytes 1..7)
0xFENNNN008 | timecmp (test stuff)
0xFENNNN010 | reserved
0xFENNNN018 | reserved
... | ...
0xFENNNNFF8 | reserved
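A sketch of sending an interprocessor interrupt under this scheme: the sender computes the per-core block address from the target core's device id and stores the vector as an 8-byte value at offset 0 of the block (Table 13.3). The kernel-virtual base used below is the emulator mapping mentioned above; treating the access as a plain volatile store is an assumption of this sketch.

    #include <stdint.h>

    /* Base of the kernel-virtual window onto the per-core blocks
     * (0xFFFFFFFFE0000000 in the current emulator), 4 KiB per core. */
    #define CORE_MMIO_BASE   0xFFFFFFFFE0000000ull
    #define CORE_BLOCK_SIZE  0x1000ull
    #define CORE_VECTOR_OFF  0x000ull   /* byte 0: interrupt vector */

    /* Send external interrupt 'vector' to the core with device id 'core_id'
     * by writing an 8-byte value whose only nonzero bits are the vector. */
    static inline void send_ipi(unsigned core_id, uint8_t vector)
    {
        volatile uint64_t *slot =
            (volatile uint64_t *)(uintptr_t)(CORE_MMIO_BASE
                                             + (uint64_t)core_id * CORE_BLOCK_SIZE
                                             + CORE_VECTOR_OFF);
        *slot = (uint64_t)vector;
    }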

Chapter 14. Debugging and monitoring

The POSTRISC architecture provides debugging tools that enable hardware and software debugging features, such as step-by-step program execution, instruction breakpoints, and data breakpoints.

Debugging tools consist of a special debugging control register dbscr (debug status and control register), a set of debugging events as a subset of interrupts, special registers for comparing instruction addresses ibr (instruction breakpoint register), special registers for comparing data addresses dbr (data breakpoint register).

Debug registers are available for program execution, but they are intended for use only by special debuggers and debugging software, not general software or operating system code.

Monitoring tools include the following resources: special registers and/or bit fields controlling the monitoring, implemented types of counted events, a fixed number of event counters, additional type of interrupts for processing the monitoring counter overflow events.

§ 14.1. Debug Events

Debugging tools are based on a special group of debug interrupts built into the general interrupt mechanism. Debug type interrupts can be thrown for various reasons that can be analyzed in the handler of this interrupt. There are seven types of predefined debugging events:

Table 14.1: Debug events (in priority order)
Name | Event Type
IB | «Instruction address match» debug event occurs on an instruction address match. If the address of an instruction bundle matches one of the criteria specified in the debug registers ibr, an instruction debug event is raised (unless the instruction is canceled). One or more ibr debug events occur if instructions are executed at an address that matches the criteria specified in the ibr registers.
DB | «Data address match» debug event occurs on a data address match, if the address of a data memory access meets one of the criteria specified in the debug registers dbr. Data debug events are reported only if the qualifying predicate is true. The reported trap code returns the matching state of the first 4 dbr registers that matched during execution of the instruction. Zero, one, or more dbr registers can be reported as matching.
TR | Software Trap
TB | «Taken branch» trap occurs on each taken branch instruction if psr.tb=1. This trap is useful for profiling a program. After the trap, iip and ipsr.ri point to the branch destination instruction, and iipa and cause.ei point to the branch instruction that caused the trap. The «taken branch» (TB) debug event occurs if psr.tb=1 (that is, «taken branch» debug events are allowed), a branch instruction is taken (i.e. either an unconditional branch, or a conditional branch whose condition is satisfied), and psr.de=1 or dbcr0.idm=0.
SS | «Single step» trap occurs on each successfully completed instruction if dbscr.ss=1 (step-by-step debugging events are allowed). After the trap, iip and ipsr.ri point to the next instruction to be executed; iipa and cause.ei point to the trapped instruction.
IRPT | An interrupt has occurred. The «interrupt occurred» (IRPT) debug event occurs if dbcr.irpt=1 (that is, the «interrupt occurred» debug event is allowed) and any non-critical interrupt occurs while dbcr.idm=1, or any critical or non-critical interrupt occurs while dbcr.idm=0. This debug event may occur regardless of the setting of psr.de.
IR | Return from Interrupt

Debug events include instruction and data breakpoints. These debug events set status bits in the DBSCR debug status register. A set bit in the DBSCR register is considered a debug exception. Debug exceptions, if allowed, cause debug interrupts. The debug status and control register (DBSCR) is used to set the allowed debug events, manage timer operation during debug events, and set the processor debugging mode. It also contains the status of debug events.

The group of bits DBE (debug enabled events) of the DBSCR register is set in supervisor mode and cannot be changed by the program. The bit groups DBTE (debug taken enabled event) and DBT (debug taken event) of the DBSCR register are set by hardware; they can be read and cleared by software. The contents of the DBSCR register can be read into a general register by the instruction mfspr.

Debug events are used to force debug exceptions and are registered in the DBSCR debug status register.

To enable a debug event, the corresponding bit from the DBE group of the DBSCR register must be set; for a debug exception of a certain type to be raised, that type of event must also be allowed by the corresponding bit or bits in the dbcr debug control registers. Once a DBSCR register bit is set and debug interrupts are enabled (the bit from the DBE group is 1), a debug interrupt will be generated.

The bit in the special DBSCR debug control register must be set to 1 to allow the debug interrupt corresponding to this bit. Debug events are not allowed to occur when the corresponding bit in the DBSCR register is 0: in that case, no debug exception of this type occurs and no bits of this type are set in the DBSCR register.

If the corresponding bit in the register dbscr is 1 (that is, debug interrupts of this type are allowed) at the time of this debug exception, the debug interrupt occurs immediately (unless there is a higher-priority exception that is allowed to cause interrupts), the execution of the instruction causing the exception is suppressed, and CSRR0 is set to the address of this instruction.

If debug interrupts of this type are blocked at the time of a debug exception, the debug interrupt will not occur and the instruction will complete execution (provided the instruction doesn't cause some other exception that generates an allowed interrupt).

Notes: If an instruction is suppressed because it raised some other exception that generates an allowed interrupt, then the attempted execution of that instruction does not cause an «instruction complete» debug event. The trap instruction doesn't fall into the category of instructions whose execution is suppressed: it actually completes execution and then generates the system call interrupt. In this case, the «instruction complete» debug exception is also set.

A trap debug event occurs if dbscr.trap=1 (that is, trap debug events are allowed) and the trap instruction is unconditional or the trap condition is met.

Execution of a trap instruction results in a trap-instruction interrupt. This interrupt can be used for profiling, debugging, and entering the operating system (although the instruction for entering privileged code (syscall) is recommended, since it has lower cost).

If dbcr.trap=0 (that is, trap-type debug interrupts are blocked) at the time of a trap debug exception, the debug interrupt will not occur; instead, a trap-type program interrupt occurs if the trap condition is met.

Trap «Decrease privilege level»: when psr.lp=1 and a taken branch lowers the privilege level (psr.cpl becomes 1), this trap occurs. This trap allows the debugger to keep track of privilege drops, for example, to remove permissions granted to higher-privileged code. After the trap, iip and ipsr.ri point to the effective address of the branch, and iipa and cause.ei point to the branch instruction that caused the trap.

When dbcr.idm=1, only non-critical interrupts can trigger «interrupt occurred» debug events. This is because all critical interrupts automatically clear psr.de, which would always prevent the associated debug interrupt from being delivered precisely. Also, debug interrupts are themselves critical-class interrupts, so any debug interrupt (for any other debug event) would always set an additional dbscr.irpt exception after entering the debug interrupt handler. At this point, the debug interrupt routine would be unable to determine whether the interrupt is a valid debug event or whether it was related to the initial debug event.

When dbcr.idm=0, both critical and non-critical class interrupts can cause the «interrupt occurred» debug event. In this case, the assumption is that debug events are not used to cause interrupts (software can poll DBSCR instead), and therefore it is proper to record the exception in DBSCR even though the critical interrupt that delivers the accepted debug event will clear psr.de.

The debug event «Interception of return from interrupt» occurs if dbcr.ret=1 (i.e. debug events on return from interrupt are allowed) and an attempt is made to execute the rfi instruction. When a return debug event occurs, dbsr.ret is set to 1 to record the debug exception.

§ 14.2. Debug registers

Debug registers are designed to intercept program accesses to specific address ranges for specific purposes (e.g. execution or writing) and allow the debugger to verify the correctness of the program. Their number depends on the implementation. Read/write ability depends on the privilege level and processor model. They are used in pairs, with at least 4 pairs for instructions and 4 for data.

The 128-bit instruction breakpoint registers ibr are for debug comparison of instruction addresses. A debug event can be allowed to occur after an attempt to execute an instruction at an address range specified by an ibr register. Since all instruction addresses must be aligned on a bundle boundary, the four least significant bits of the ibr register are reserved and do not participate in the comparison with the address of the instruction bundle.

Register format ibr (two 64-bit words, bits 63..0)
address
x | 0 | plm | mask | 0

The 128-bit data breakpoint registers dbr are for debug comparison of data addresses. A debug event can be allowed to occur after a load, store, or atomic instruction to an address range specified by a dbr register.

Register format dbr (two 64-bit words, bits 63..0)
address
r | w | 0 | plm | mask

The contents of the register dbr are compared with the address computed by the memory access instruction. A data debug event occurs if it is enabled, a data memory access instruction is executed, and the type, address, and possibly the value of the memory access match the criteria specified in dbr.

All load instructions are treated as reads with regard to debug events, while all store instructions are treated as writes. Additionally, cache management instructions and some other cases are handled specially.

The cmp bits determine whether all or only some of the bits of the instruction address must match the contents of the debug register, and whether the address must be inside or outside a specific range specified by an ibr register for a debug event to occur.

There are four modes for comparing instruction addresses.

The high part of the register contains the breakpoint address; the low part contains the offset or breakpoint mask. At least 4 data and 4 instruction registers are implemented on all processor models. The implemented registers are the lowest-numbered ones, starting from zero.

The instruction and data memory addresses presented for matching are always in the implemented address space. Programming an unimplemented physical address in ibr/dbr ensures that physical addresses presented to ibr/dbr will never match. Similarly, programming an unimplemented virtual address in ibr/dbr ensures that virtual addresses presented to ibr/dbr will never match.

Table 14.2: Debug breakpoint register fields (dbr/ibr)
Field | Bits | Description
Address | 63:0 | Matching address – 64-bit virtual or physical breakpoint address. The address is interpreted as virtual or physical depending on psr.dt and psr.it. The data breakpoint trap occurs on load, store, and semaphore instructions. For instruction fetches, the lower four bits ibr.addr{3:0} are ignored when comparing addresses. All 64 bits are implemented on all processors, regardless of the number of implemented address bits.
mask | 55:0 | Address mask – determines which address bits in the corresponding address register are compared when determining a breakpoint match. Address bits for which the mask bits are 1 must match the breakpoint address; otherwise the address bit is ignored. Address bits {63:56}, which have no corresponding mask bits, are always compared. All 56 bits are implemented on all processors, regardless of the number of implemented address bits.
plm | 59:56 | Privilege level mask – enables breakpoints that match the specified privilege levels. Each bit corresponds to one of the 4 privilege levels: bit 56 corresponds to privilege level 0, bit 57 to level 1, etc. A value of 1 indicates that debug comparisons are allowed at this privilege level.
w | 62 | Write – when dbr.w=1, any non-canceled store, semaphore, probe.w.fault, or probe.rw.fault to an address matching the address register causes a breakpoint.
r | 63 | Read – when dbr.r=1, any non-canceled load, semaphore, lfetch.fault, probe.r.fault, or probe.rw.fault to an address matching the address register causes a breakpoint. When dbr.r=1, a page table access that matches dbr (except accesses for the tak instruction) will cause the «Missing in Instruction/Data TLB» fault. If dbr.r=0 and dbr.w=0, the data breakpoint register is disabled.
x | 63 | Execution – when ibr.x=1, execution of instructions at an address matching the address register causes a breakpoint. If ibr.x=0, the instruction breakpoint register is disabled. Instruction breakpoints are reported even if the instruction is canceled.
ig | 62:60 | Ignored
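The following C sketch models the matching rule implied by the table: address bits 55..0 participate in the comparison only where the corresponding mask bit is 1, bits 63..56 are always compared, and the access must be performed at a privilege level enabled in plm with a permitted access type (r/w). This is an illustrative model of the semantics, not an implementation.

    #include <stdbool.h>
    #include <stdint.h>

    struct dbr {
        uint64_t address;   /* breakpoint address (high word of the register pair) */
        uint64_t mask;      /* bits 55:0  - address mask                           */
        uint8_t  plm;       /* bits 59:56 - privilege level mask (bit 0 = level 0) */
        bool     r, w;      /* bits 63,62 - break on read / write                  */
    };

    /* Does a data access of the given kind at 'addr', performed at privilege
     * level 'cpl' (0..3), match this data breakpoint register? */
    static bool dbr_matches(const struct dbr *d, uint64_t addr,
                            bool is_write, unsigned cpl)
    {
        if (!d->r && !d->w)
            return false;                          /* register disabled */
        if (is_write ? !d->w : !d->r)
            return false;                          /* wrong access type */
        if (!(d->plm & (1u << cpl)))
            return false;                          /* privilege level not selected */

        /* Bits 63..56 always compared; bits 55..0 compared only where mask=1. */
        uint64_t compare = 0xFF00000000000000ull | (d->mask & 0x00FFFFFFFFFFFFFFull);
        return ((addr ^ d->address) & compare) == 0;
    }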

The registers dbr/ibr can only be accessed at the highest privilege level 0, otherwise, the «privileged operation» error occurs.

Debug register changes are not necessarily observed by the following instructions. The software must use data serialization to ensure that modifications to dbr, psr.db, psr.tb, and psr.lp are observed before a dependent instruction is executed. Because changes to the ibr registers and the psr.db flag affect subsequent instruction fetching, the software must execute instruction serialization.

In some implementations, a hardware debugger may use two or more registers for its own use. When a hardware debugger is in use, only 2 dbr and only 2 ibr are available for program use. The software should be able to run with fewer implemented ibr and/or dbr registers if a hardware debugger is present. When a hardware debugger is not in use, at least 4 ibr and 4 dbr are available for programmatic use.

The implemented debug registers available to software are arranged by number first (for example, if only 2 dbr are available to software, the registers dbr[0-1] are the available ones).

Notes: When a hardware debugger is present and uses two or more debug registers, the processor doesn't enforce a separation of registers between the program and the hardware debugger; that is, the processor doesn't prohibit the program from reading or changing any of the debug registers. However, if the program modifies any of the registers used by the hardware debugger, the operation of the processor and/or hardware debugger may become undefined; the processor and/or hardware debugger may crash.

The instructions mfibr (move from instruction breakpoint register), mtibr (move to instruction breakpoint register), mfdbr (move from data breakpoint register), mtdbr (move to data breakpoint register) are used to indirectly read/write the instruction/data debug registers. The sum of a general register and simm10 is used to pass the index of the debug register.

    mfibr  ra, rb  # ra = ibr[rb+imm]
    mtibr  ra, rb  # ibr[rb+imm] = ra
    mfdbr  ra, rb  # ra = dbr[rb+imm]
    mtdbr  ra, rb  # dbr[rb+imm] = ra
instruction format for debug registers read/write
bits 41..0: opcode | target | index | simm10 | opx

§ 14.3. Monitoring registers

Monitoring registers are designed to count various internal events during execution of an instruction thread. Their number depends on the implementation (minimum 4). Read/write ability depends on the privilege level and processor model.

Monitoring register format (two 64-bit words, bits 63..0)
counter for the number of such events
0 | event type

There are at least 8 128-bit performance monitoring registers (mr0-mr7). Unimplemented monitoring registers read as zero; writes to them are ignored. Each monitoring register is associated with a specific event for which it is counting.

Table 14.3: The counted events types
Event
Page access
DTLB miss
ITLB miss
I1-cache miss
D1-cache miss
D1-cache write-back
L2-cache miss
L2-cache write-back

An overflow of monitor counter raises an asynchronous event.

The instructions mfmr (move from monitor register) and mtmr (move to monitor register) are used to indirectly read/write the monitor registers. The sum of a general register and simm10 is used to pass the index of the monitor register.

    mfmr  ra, rb  # ra = MR[rb+imm]
    mtmr  ra, rb  # MR[rb+imm] = ra
instruction format for mfmr, mtmr
bits 41..0: opcode | target | index | simm10 | opx
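For example, a monitoring counter can be sampled around a region of code to measure how many events of its associated type occurred. The mfmr() wrapper below is a hypothetical stand-in for the instruction above, and the mapping of the 128-bit monitor register onto a 64-bit read is an assumption of this sketch.

    #include <stdint.h>

    /* Placeholder for the mfmr instruction (index = rb + simm10); how the
     * 128-bit monitor register maps onto a 64-bit read is assumed here. */
    extern uint64_t mfmr(unsigned index);

    extern void workload(void);

    /* Measure how many events monitor register 'mr' counts across 'workload',
     * assuming mr is already associated with the desired event type
     * (e.g. D1-cache miss) and that mfmr(mr) returns its counter part. */
    uint64_t count_events(unsigned mr)
    {
        uint64_t before = mfmr(mr);
        workload();
        uint64_t after = mfmr(mr);
        return after - before;   /* counter overflow raises an asynchronous event */
    }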

Chapter 15. PAL (Privileged Architecture Library)

In a family of binary compatible machines, application and operating system developers require that hardware functions be implemented consistently (uniformly). When functions conform to a common interface, code using these functions can be used by several different implementations of the architecture without modification.

These can be functions such as: the binary encoding of instructions and data, exception mechanisms, and synchronization primitives. Some of these functions can be implemented cost-effectively in hardware; others are impractical to implement directly in hardware. These features include low-level hardware support such as translation buffer miss routines, interrupt handling, and interrupt vector control. They also include support for privileged and atomic operations that require long instruction sequences.

In earlier architectures, these functions were usually provided by microcode. Modern architectures try not to use microcode mechanisms. However, it is still desirable to provide an architectural interface to these functions that is compatible across the entire family of machines. The Privileged Architecture Library (PAL) provides a mechanism for implementing these functions without microcode.

Three main components of the Privileged Architecture Library: Processor Abstraction Layer (PAL), System Abstraction Layer (SAL), Extensible Firmware Interface (EFI). PAL, SAL, and EFI together initialize the processor and system before loading the operating system. PAL and SAL also provide machine check abort handling and other processor and system functions that may vary from implementation to implementation.

Extensible Firmware Interface (EFI) is the firmware layer that isolates the operating system loader from the details (differences) in the implementation of the platform and processor and organizes the basic functionality for controlling a machine without an OS.

System Abstraction Layer (SAL) is the firmware layer that isolates the operating system, the overlying EFI layer, and other high-level software from the details (differences) in the implementation of the platform.

Processor Abstraction Layer (PAL) is a software layer that abstracts the details of the processor implementation and isolates them from all: from the operating system, from the EFI layer and from the SAL layer. PAL is independent of the number of processors in the system. PAL encapsulates processor functions that are likely to change from implementation to implementation, so that SAL, EFI, and OS are independent of the processor version. This includes non-performance-critical functions, such as processor initialization, configuration, and correction of internal errors. PAL consists of two components:

The PAL address space occupies a maximum of 2 GB of physical address space. The PAL space contains addresses from 0x80000000 to 0xffffffff inclusive. Code execution after restart starts at the address 0x80000000.

§ 15.1. PAL instructions and functions

PAL should perform the following functions:

The architecture allows these functions to be implemented in standard machine code that resides in main memory. The PAL library is written in standard machine code with some implementation-specific extensions that provide access to low-level hardware. This allows an implementation to make various design trade-offs based on the hardware technology used. The PAL library abstracts these differences and makes them invisible to system software.

The PAL environment differs from the normal environment in the following ways:

Full control of the machine state allows all machine functions to be managed. Disabling interrupts allows sequences of several instructions to be provided as an atomic operation. Providing implementation-specific hardware functions allows access to low-level system hardware. Preventing memory-management traps on the instruction stream allows PAL to implement memory management functions such as filling the translation buffer.

Special Features Required for PAL

PAL uses the POSTRISC instruction set for most of its operations. A small number of additional functions are required to implement PAL. Some of the free primary and/or extended opcodes can be used for PAL functions. These instructions generate an error if executed outside the PAL environment.

Having PAL has only one effect on system code. Because PAL can reside in main memory and maintain privileged data structures in main memory, the operating system code that allocates physical memory cannot use all of the physical memory. The amount of memory required by PAL is small, so the loss for the system is negligible.

§ 15.2. PAL replacement

POSTRISC systems require that you can replace PAL with a version defined by the operating system. The following functions can be implemented in PAL code, not directly in the hardware, to facilitate the replacement with different versions.

Translation buffer fill. Various operating systems may wish to replace the translation buffer (TLB) fill routines. Replacement routines will use different data structures for page tables. Therefore, no part of the TLB fill mechanism that would change with a change in the page tables can be placed in hardware, unless it can be overridden by PAL code.

Process structure. Various operating systems may wish to replace the process context switch routines. Replacement routines will use different data structures. Therefore, no part of the context switch mechanism that would change with a change in the process structure can be placed in hardware.

PAL consists of three components:

Chapter 16. LLVM backend

Development of the POSTRISC backend for the LLVM compiler: github.com/bdpx/llvm-project.

§ 16.1. LLVM backend intro

How to build/use.

§ 16.2. LLVM backend limitations

Nullification doesn't work, in progress.

Pre/post update addressing is not used.

Currently, only static PIE executables are supported by compiler and emulator.

§ 16.3. MUSL port

POSTRISC port for MUSL: github.com/bdpx/musl.

MUSL limitations: doesn't support f128.

POSTRISC limitations: currently, buildable only as a static lib.

§ 16.4. DOOM port

POSTRISC port for Doom-1: github.com/bdpx/postrisc_doom. Workable, with minor graphical artifacts.

Table 16.1: Static instruction frequencies: 58553
position | instruction | occurrence | percent of all instructions | cumulative percent
1nop660411.27867060611811.278670606118
2mov45557.77927689443819.057947500555
3callr45237.72462555291826.782573053473
4ldi25644.37893873926231.161511792735
5ldafr21023.58990999607234.751421788807
6ldwz19243.28591190886938.037333697676
7addi18273.12025002988741.157583727563
8ldwzr16012.73427493040543.891858657968
9retf15292.61130941198646.503168069954
10jmp14742.51737741874949.020545488703
11stw13802.35683910303551.377384591737
12or12422.12115519273153.498539784469
13addiws12112.06821170563455.566751490103
14lddz12062.05967243352257.626423923625
15stwr11481.96061687701759.587040800642
16srliw10891.85985346609161.446894266733
17slliw10621.81374139668363.260635663416
18std10351.76762932727665.028264990692
19lddzr9871.68565231499766.713917305689
20bweq9791.67198947961768.385906785306
21ldbz9211.57293392311269.958840708418
22alloc8601.46875480334171.427595511759
23addw8591.46704694891872.894642460677
24mov28101.38336208221674.278004542893
25andi7531.28601438013475.564018923027
26bdeq5850.99909483715676.563113760183
27bwne5770.98543200177677.548545761959
28stb5650.96493774870678.513483510666
29sraiw5560.94956705890479.463050569569
30stdr4830.82489368606280.287944255632
31subw4740.80952299626081.097467251891
32ldwzx4670.79756801530281.895035267194
33muli3930.67118678803882.566222055232
34select3820.65240038939183.218622444623
35ldbzx3650.62336686420883.841989308831
36bwnei3380.57725479480184.419244103633
37xor3370.57554694037984.994791044011
38slli3220.54992912404185.544720168053
39bweqi3120.53285057981786.077570747869
40bwlt2990.51064847232486.588219220194
41stbx2780.47478352945287.063002749646
42bwlti2750.46965996618487.532662715830
43add2730.46624425734087.998906973170
44slsrli2640.45087356753788.449780540707
45bdne2480.42354789677788.873328437484
46bwle2360.40305364370789.276382081191
47lddzx2300.39280651717289.669188598364
48bwltui2200.37572797294890.044916571311
49ldax2090.35694157430090.401858145612
50ldbzr1940.33132375796390.733181903575
51bwgeui1930.32961590354091.062797807115
52cmpweqi1750.29887452393691.361672331050
53ldwsr1740.29716666951391.658839000564
54allocsp1720.29375096066891.952589961232
55stwx1660.28350383413392.236093795365
56bwgei1630.27838027086692.514474066231
57bbsi1600.27325670759892.787730773829
58ldbs1470.25105460010693.038785373935
59sth1450.24763889126193.286424265196
60sub1450.24763889126193.534063156457
61bdnei1350.23056034703693.764623503493
62srli1320.22543678376993.990060287261
63and1260.21518965723494.205249944495
64ori1240.21177394838994.417023892883
65mul1220.20835823954494.625382132427
66ldhz1160.19811111300994.823493245436
67subfiws1120.19127969531995.014772940755
68bdeqi1060.18103256878495.195805509538
69muladd1050.17932471436195.375130223900
70ldwsx1030.17590900551695.551039229416
71bbci1000.17078544224995.721824671665
72stdx940.16053831571495.882362987379
73ldar880.15029118917996.032654176558
74ldhs850.14516762591296.177821802470
75bdleu720.12296551841996.300787320889
76minsw700.11954980957496.420337130463
77halt670.11442624630796.534763376770
78maxsw610.10417911977296.638942496542
79stbr610.10417911977296.743121616313
80ldhzx590.10076341092796.843885027240
81srai570.09734770208296.941232729322
82bdlt560.09563984765997.036872576982
83slsrai550.09393199323797.130804570218
84divw530.09051628439297.221320854610
85ldws500.08539272112497.306713575735
86xori490.08368486670297.390398442437
87cmpwlt480.08197701227997.472375454716
88bdgeui470.08026915785797.552644612573
89bdle440.07514559459097.627790207163
90cmpwnei440.07514559459097.702935801752
91bwleu430.07343774016797.776373541919
92jmpr430.07343774016797.849811282086
93bdltu420.07172988574597.921541167831
94cmpwgti420.07172988574597.993271053575
95bdltui410.07002203132298.063293084897
96sthr410.07002203132298.133315116219
97ldqr400.06831417690098.201629293119
98bwltu390.06660632247798.268235615596
99addiwz370.06319061363298.331426229228
100callmi370.06319061363298.394616842860
101cmpwlti370.06319061363298.457807456492
102minuw360.06148275921098.519290215702
103sllw350.05977490478798.579065120489
104ldhsr330.05635919594298.635424316431
105ldhsx330.05635919594298.691783512373
106absdw320.05465134152098.746434853893
107addadd320.05465134152098.801086195413
108ldhzr320.05465134152098.855737536932
109sthx290.04952777825298.905265315185
110cmpdeqi280.04781992383098.953085239014
111ldbsx280.04781992383099.000905162844
112fmulsq260.04440421498599.045309377829
113cmpwltui230.03928065171799.084590029546
114callri190.03244923402799.117039263573
115subfi190.03244923402799.149488497600
116divui180.03074137960599.180229877205
117fence180.03074137960599.210971256810
118addsub170.02903352518299.240004781992
119algnup170.02903352518299.269038307175
120bmnone170.02903352518299.298071832357
121cmpweq160.02732567076099.325397503117
122bmany150.02561781633799.351015319454
123ldq150.02561781633799.376633135792
124cmpwle140.02390996191599.400543097706
125sll140.02390996191599.424453059621
126cmpdnei130.02220210749299.446655167114
127srlw130.02220210749299.468857274606
128stq130.02220210749299.491059382098
129cmpdltu120.02049425307099.511553635168
130fcvtiw2sq120.02049425307099.532047888238
131bdlti110.01878639864799.550834286885
132casw110.01878639864799.569620685533
133divuw110.01878639864799.588407084180
134faddsq90.01537068980299.603777773983
135maxuw90.01537068980299.619148463785
136bdgei80.01366283538099.632811299165
137fmulsd80.01366283538099.646474134545
138mulsubf80.01366283538099.660136969925
139callplt70.01195498095799.672091950882
140cmpdeq70.01195498095799.684046931840
141cmpwgtui70.01195498095799.696001912797
142cmpwltu70.01195498095799.707956893754
143fcvtsd2sq70.01195498095799.719911874712
144fsubsq70.01195498095799.731866855669
145minu70.01195498095799.743821836627
146andn60.01024712653599.754068963162
147bbc60.01024712653599.764316089697
148bfsqoeq60.01024712653599.774563216231
149cmpwne60.01024712653599.784810342766
150fcvtuw2sq60.01024712653599.795057469301
151cmpdgtui50.00853927211299.803596741414
152cmpdleu50.00853927211299.812136013526
153cmpdlt50.00853927211299.820675285639
154cnttz50.00853927211299.829214557751
155fcvtiw2sd50.00853927211299.837753829864
156fmaddsq50.00853927211299.846293101976
157bfsqune40.00683141769099.853124519666
158divu40.00683141769099.859955937356
159fcvtiw2ss40.00683141769099.866787355046
160fnegsq40.00683141769099.873618772736
161sraw40.00683141769099.880450190426
162srlqi40.00683141769099.887281608116
163subsub40.00683141769099.894113025806
164maxu30.00512356326799.899236589073
165cmpdltui20.00341570884599.902652297918
166fcmpsqole20.00341570884599.906068006763
167fcmpsqune20.00341570884599.909483715608
168fcvtsq2sd20.00341570884599.912899424453
169fcvtsq2ss20.00341570884599.916315133298
170fcvtss2sd20.00341570884599.919730842143
171fdivsq20.00341570884599.923146550988
172fdivss20.00341570884599.926562259833
173fmulss20.00341570884599.929977968678
174ldaddw20.00341570884599.933393677523
175maxui20.00341570884599.936809386368
176mins20.00341570884599.940225095213
177minui20.00341570884599.943640804058
178srl20.00341570884599.947056512903
179swapw20.00341570884599.950472221748
180bfsdolt10.00170785442299.952180076170
181bfsqole10.00170785442299.953887930593
182bmall10.00170785442299.955595785015
183cmpdne10.00170785442299.957303639438
184div10.00170785442299.959011493860
185fabsss10.00170785442299.960719348283
186faddss10.00170785442299.962427202705
187fcmpsqult10.00170785442299.964135057128
188fcmpsquo10.00170785442299.965842911550
189fcvtsd2ss10.00170785442299.967550765973
190fcvtsq2iw10.00170785442299.969258620395
191fcvtsq2uw10.00170785442299.970966474818
192fcvtss2iw10.00170785442299.972674329240
193fcvtss2sq10.00170785442299.974382183663
194fmergesq10.00170785442299.976090038085
195fnaddsq10.00170785442299.977797892508
196ldbsr10.00170785442299.979505746930
197ldord10.00170785442299.981213601353
198ldorw10.00170785442299.982921455775
199ldqx10.00170785442299.984629310198
200maxs10.00170785442299.986337164620
201maxsi10.00170785442299.988045019043
202nand10.00170785442299.989752873465
203nor10.00170785442299.991460727888
204orn10.00170785442299.993168582310
205sllqi10.00170785442299.994876436733
206staw10.00170785442299.996584291155
207subfiwz10.00170785442299.998292145578
208syscall10.001707854422100.000000000000

Dynamic instruction statistics for Doom Shareware demo scene autoplay (3 demo scenes):

  performance            => 111.858138 mips
  time                   => 341031470092 341.031470 seconds
  bundles fetched        => 14860954598
  slots fetched          => 44582863794
  instructions issued    => 38147145094
  short instructions     => 38130463729 99.956271%
  long instructions      => 16681365 0.043729%
  nops                   => 1247666399 3.270668%
  qualified nops         => 38210312 0.100166%
  register spills        => 2654784
  register fills         => 2654780
Table 16.2: Dynamic instruction frequencies: 38108934782
position | instruction | occurrence | percent of all instructions | cumulative percent
1addiws459018156012.04489599685212.044895996852
2srlw436906368611.46467019084423.509566187695
3addi381530759110.01158288161433.521149069309
4stb33478878058.78504692968042.306195998989
5bbsi31805551488.34595657473650.652152573725
6and31791251818.34220426046058.994356834185
7cmpwltui31779046148.33900142362767.333358257812
8mov14224825513.73267465789171.066032915703
9nop12476663993.27394718886174.339980104564
10bwle12248193013.21399511166277.553975216226
11jmp12071950243.16774801212880.721723228354
12or8547397892.24288554348182.964608771835
13bdne7977557642.09335623932785.057965011162
14slsrli7955553752.08758229415587.145547305316
15sllw7945214422.08486919549289.230416500808
16andi5758802191.51114226176790.741558762575
17ldbz4272899551.12123300597291.862791768547
18ldwzx4111816641.07896393943392.941755707980
19bwlti4029392091.05733527138793.999090979367
20srliw3675422680.96445169643994.963542675807
21ldbzx3627378300.95184457942795.915387255234
22lddzr3626004380.95148405505096.866871310284
23bwne2074192720.54427989967897.411151209962
24addw1901989930.49909291374397.910244123706
25ldwzr1002391650.26303323767498.173277361379
26ldwz669476570.17567443798498.348951799363
27sraiw497855110.13063999632898.479591795692
28bweq414194510.10868698177298.588278777464
29retf372186510.09766384500998.685942622473
30stwr341009930.08948293410798.775425556580
31alloc309947210.08133190071498.856757457294
32callr302659050.07941944631498.936176903608
33stbx218911730.05744367593998.993620579547
34bwlt218640730.05737256400799.050993143553
35subw190147510.04989578194499.100888925497
36ldafr180673510.04740975076799.148298676264
37select179516600.04710617104099.195404847304
38mul175738330.04611473162599.241519578929
39lddz172812400.04534695104799.286866529976
40ldi167055220.04383623445799.330702764433
41ldhsx134982210.03542009525499.366122859687
42stw130749550.03430942133399.400432281020
43bdeq129607730.03400980130899.434442082328
44lddzx108042070.02835085016699.462792932494
45add106913260.02805464403999.490847576533
46bweqi97450860.02557165676799.516419233300
47mov290770150.02381860068299.540237833982
48ldbzr88602020.02324967110999.563487505091
49srli68991270.01810369940799.581591204498
50muli66158080.01736025432899.598951458826
51sth64305700.01687417934199.615825638167
52cmpwlt62685760.01644909792499.632274736091
53allocsp62239370.01633196266399.648606698754
54ldax61754810.01620481137999.664811510133
55stdr58960360.01547153189699.680283042029
56ldhzx58551390.01536421585599.695647257885
57ldwsr58102600.01524645082199.710893708706
58ldhz56675160.01487188249399.725765591198
59slli56256200.01476194501999.740527536218
60bwgei50779030.01332470463799.753852240855
61bdle50182120.01316807207799.767020312932
62muladd50113470.01315005792999.780170370861
63slliw49786900.01306436411499.793234734976
64bbci47214220.01238927833399.805624013309
65ldwsx43606170.01144250560999.817066518918
66sthx42365630.01111698089899.828183499816
67callri40667320.01067133474999.838854834565
68bdeqi35511120.00931831871099.848173153275
69std34043200.00893312820199.857106281476
70bwltui30521190.00800893285999.865115214335
71bwnei29979160.00786670112299.872981915457
72minuw29769610.00781171401799.880793629473
73minsw28558930.00749402473899.888287654211
74callplt28363620.00744277429099.895730428501
75maxsw27108340.00711338171999.902843810220
76slsrai25830770.00677814012699.909621950346
77cmpweqi24683180.00647700602099.916098956366
78divuw21633010.00567662416399.921775580528
79ldbsx18517040.00485897601399.926634556542
80absdw16920540.00444004538599.931074601927
81stwx16149360.00423768339299.935312285318
82ldbs15892770.00417035272599.939482638043
83xor15347290.00402721568899.943509853731
84subfiws15294420.00401334230199.947523196032
85cmpwgti15034590.00394516143999.951468357471
86bdlt14870300.00390205081499.955370408285
87bwleu14700640.00385753107199.959227939356
88addadd14412680.00378196873899.963009908095
89bdgeui13606610.00357045141299.966580359507
90cmpweq12865500.00337597995799.969956339463
91bwltu11848790.00310918950399.973065528967
92bdltu11395020.00299011768999.976055646656
93ldhs9778650.00256597306099.978621619716
94bwgeui9276760.00243427428699.981055894001
95ldws8722840.00228892254599.983344816547
96bmnone7972970.00209215241799.985436968963
97bdleu5895790.00154708863899.986984057601
98xori5590950.00146709689899.988451154499
99bdltui5519510.00144835063899.989899505137
100bdnei5331170.00139892915799.991298434294
101div4632870.00121569128899.992514125582
102ori4337290.00113812942499.993652255006
103sub4156170.00109060251299.994742857518
104cmpwle3512500.00092169986499.995664557381
105divw2741120.00071928539099.996383842771
106ldar2061260.00054088628199.996924729052
107bmany1457950.00038257432599.997307303377
108cmpdnei1278950.00033560371399.997642907090
109bbc1198050.00031437509599.997957282185
110maxuw1158910.00030410453899.998261386723
111srai902790.00023689720199.998498283924
112jmpr858130.00022517816599.998723462089
113cmpwnei809870.00021251446899.998935976557
114callmi496590.00013030802499.999066284581
115syscall490080.00012859976399.999194884345
116cmpwgtui485950.00012751602899.999322400373
117subfi376800.00009887445199.999421274824
118sraw358100.00009396746599.999515242289
119stdx312070.00008188893399.999597131222
120cmpwltu275840.00007238197699.999669513198
121bdlti227590.00005972090399.999729234101
122cmpwlti199690.00005239978599.999781633886
123cmpwne159170.00004176710899.999823400994
124sthr146280.00003838469999.999861785693
125ldhzr131580.00003452733699.999896313029
126cmpdeqi127650.00003349608299.999929809111
127ldhsr109990.00002886199899.999958671109
128addiwz68360.00001793805199.999976609160
129andn43020.00001128869199.999987897851
130sll26220.00000688027699.999994778127
131minu3280.00000086069199.999995638818
132divui2730.00000071636799.999996355185
133mulsubf2550.00000066913499.999997024320
134algnup2230.00000058516599.999997609484
135cmpdlt2140.00000056154899.999998171033
136maxs2000.00000052481199.999998695844
137maxui1640.00000043034599.999999126189
138cmpdeq1250.00000032800799.999999454196
139stbr420.00000011021099.999999564407
140cmpdltu320.00000008397099.999999648376
141addsub240.00000006297799.999999711354
142cmpdgtui210.00000005510599.999999766459
143cnttz210.00000005510599.999999821564
144cmpdltui180.00000004723399.999999868797
145cmpdne150.00000003936199.999999908158
146cmpdleu130.00000003411399.999999942271
147divu40.00000001049699.999999952767
148fence30.00000000787299.999999960639
149ldbsr30.00000000787299.999999968511
150nor30.00000000787299.999999976383
151srlqi30.00000000787299.999999984256
152ldorw20.00000000524899.999999989504
153bdgei10.00000000262499.999999992128
154bmall10.00000000262499.999999994752
155orn10.00000000262499.999999997376
156sllqi10.000000002624100.000000000000