The POSTRISC architecture
This document describes the architecture of a non-existent virtual processor. The processor instruction set architecture and the processor itself are referred to as POSTRISC. The name was chosen because «POST-RISC» is a common name for projects of hypothetical processors meant to replace processors with the RISC (Reduced Instruction Set Computer) architecture. The POSTRISC virtual processor combines the best (as it seems to me) qualities of existing and past architectures.
Main features of the POSTRISC architecture:
Sources are available at https://github.com/bdpx/postrisc. The repository contains the source code, sample programs for POSTRISC, and this description of the virtual processor. To build the program, cmake and clang++/g++ are needed. To build this documentation from its XML sources, xsltproc/xmllint are needed.
The postrisc console application covers all basic tasks. It uses the standard streams for input/output, which need to be redirected from or to files, and it is invoked with different options from the command line (table below).
Option | Operation description |
---|---|
--scan <src.s | scan and recognize the individual tokens of the source program in the virtual processor assembler |
--scan-html <src.s | scan the source program in the virtual processor assembler and mark it up as HTML |
--assemble <src.s >src.bin | assemble the program src.s into the object file src.bin |
--assemble-c | assemble as a C++ array for embedding into a C++ program |
--base-address vaddr | set virtual base address for image loading |
--device-array paddr vaddr config-size | set device array info: physical address, virtual address, device configuration space size |
--disasm <src.bin | disassemble the file src.bin |
--dumpbin <src.bin | disassemble the src.bin file with binary representation |
--env key=value | add guest app environment variable |
--execute <src.bin | execute the raw program src.bin in the emulator |
--exeapp <src.bin | execute the ELF program src.bin in the emulator. Static PIE executables only, syscall emulator for Linux-like system (very limited). |
--html >out.html | output html-text with information about the syntax of assembler instructions, the format of machine instructions, operation codes, statistics for the instruction set |
--llvm | output llvm tablegen file with information about the instructions encoding, the format of machine instructions, operation codes, etc. Used as «PostriscInstrEncoding.td» in LLVM compiler backend. |
--export-definitions | list the predefined constants known to the assembler |
--dump-file file | dump final emulation state to file |
--log-file file | sets log file path |
--log-level level | set logging level |
--log-subsystem list | set logging subsystem mask |
--memory size paddr | add memory device with size (in hexadecimal) starting from physical address paddr |
--pa-size nbits | set physical address size in bits |
--paging offset nlevels | set paging info: page offset in bits and number of indexing levels. The depth of the virtual address space will be offset+nlevels*(offset-3) bits; for example, offset=12 with nlevels=4 gives a 48-bit virtual address space |
--profiling | do profiling (per bundle) |
--rom path paddr vaddr | add ROM image, map it to corresponding physical and virtual addresses |
--timing_info | report timing info |
--verbose | verbose logging |
--video paddr vaddr width height | add video device (experimental) |
-- | separator between POSTRISC engine options and emulated guest program options |
For example, running a POSTRISC ELF static-PIE image:
/path/to/postrisc \
    --exeapp --log-file "test-log.txt" \
    -- \
    postrisc-app -app-option1 -app-options2
The qtpostrisc is a Qt-based graphical application with an assembler editor, debugger, DOOM graphical backend, etc. (Qt5 required). It supports the same command-line options as the console app.
On Wayland systems it currently requires switching to X11 via the «QT_QPA_PLATFORM=xcb» environment variable:
QT_QPA_PLATFORM=xcb /path/to/qtpostrisc \
    --exeapp --log-file "doom-log.txt" \
    -- \
    doomgeneric.postrisc -doom -options
If this manual for the instruction architecture and assembler syntax doesn't exactly match the sample program (not yet updated), the gen.html file contains a brief instruction set manual automatically generated by the assembler.
The example POSTRISC assembler program program.html does nothing sensible, but it uses all machine instructions and assembler pseudo-instructions, all sections of the program, and all addressing modes; its only purpose is the joint testing of the assembler, disassembler, and emulator while they are being written. It is a concatenation of separate little tests.
The resulting binary program may be disassembled: out_diz.s and out_dump.s (the latter with the binary representation reported).
The results are in result.txt, and the full system dump is in dump.html.
The POSTRISC build is based on cmake. Use the "MSYS Makefiles" (Windows) or "Unix Makefiles" (Linux) generators. Set -DCMAKE_BUILD_TYPE=Release as the default. Set -DCMAKE_CXX_COMPILER=g++ or clang++ (MSVC isn't supported). The «USE_QUADMATH» macro selects the internal long floating-point implementation (quadmath or mpreal). Set -DUSE_QUADMATH=0 for clang (it doesn't support libquadmath); set 0 or 1 for g++.
The author is grateful in advance for reports of errors and inaccuracies in the virtual processor description and in the source code (which is far from error-free).
The set of tools for working with POSTRISC (an assembler, disassembler, and emulator are implemented) is incomplete. Further directions for improvement (the number in brackets, like [2], characterizes the comparative complexity of the task):
Chapter 1. Choosing an instruction set
§ 1.1. Bottlenecks issue
§ 1.2. Memory non-uniformity problem
§ 1.3. The technologies of parallel operation execution
§ 1.4. Instruction format budget
Chapter 2. Instruction set architecture (ISA)
§ 2.1. General description of the instruction set
§ 2.2. Register files
§ 2.3. Instructions format
§ 2.4. Instruction addressing modes
§ 2.5. Data addressing modes
§ 2.6. Special registers
Chapter 3. Basic instruction set
§ 3.1. Register-register binary instructions
§ 3.2. Register-immediate instructions
§ 3.3. Immediate shift/bitcount instructions
§ 3.4. Register-register unary instructions
§ 3.5. Fused instructions
§ 3.6. Conditional move instructions
§ 3.7. Load/store instructions
§ 3.8. Branch instructions
§ 3.9. Miscellaneous instructions
Chapter 4. The software exceptions support
§ 4.1. Program state for exception
Chapter 5. The register stack
§ 5.1. Registers rotation
§ 5.2. Call/return instructions
§ 5.3. Register frame allocation
§ 5.4. The function prolog/epilog
§ 5.5. The register stack system management
§ 5.6. Calling convention
Chapter 6. Predication
§ 6.1. Conditional execution of instructions
§ 6.2. Nullification Instructions
§ 6.3. Nullification in assembler
Chapter 7. Physical memory
§ 7.1. Physical addressing
§ 7.2. Data alignment and atomicity
§ 7.3. Byte order
§ 7.4. Memory consistency model
§ 7.5. Atomic/synchronization instructions
§ 7.6. Memory attributes
§ 7.7. Memory map
§ 7.8. Memory-related instructions
Chapter 8. Virtual memory
§ 8.1. Virtual addressing
§ 8.2. Translation lookaside buffers
§ 8.3. Search for translations in memory
§ 8.4. Translation instructions
Chapter 9. The floating-point facility
§ 9.1. Floating-point formats
§ 9.2. Special floating-point values
§ 9.3. Selection for IEEE options
§ 9.4. Representation of floats in registers
§ 9.5. Floating-point computational instructions
§ 9.6. Floating-point branch and nullification instructions
§ 9.7. Logical vector instructions
§ 9.8. Integer vector operations
Chapter 10. Extended instruction set
§ 10.1. Helper Address Calculation Instructions
§ 10.2. Multiprecision arithmetic
§ 10.3. Software interrupts, system calls
§ 10.4. Cipher and hash instructions
§ 10.5. Random number generation instruction
§ 10.6. CPU identification instructions
§ 10.7. Instructions for the emulation support
Chapter 11. Application Model (Application Binary Interface)
§ 11.1. Sections and segments
§ 11.2. Data model
§ 11.3. Reserved registers
§ 11.4. Position independent code and GOT
§ 11.5. Program relocation
§ 11.6. Thread local storage
§ 11.7. Modules and private data
§ 11.8. Examples of assembler code
Chapter 12. Interrupts and hardware exceptions
§ 12.1. Classification of interrupts
§ 12.2. Processor state preservation upon interruption
§ 12.3. Exception Priority
§ 12.4. Interrupt handling
Chapter 13. External interrupts
§ 13.1. Programmable external interrupt controllers
§ 13.2. Built-in interrupt controller
§ 13.3. Handling external interrupts
§ 13.4. Handling local interrupts
§ 13.5. Processor identification and interprocessor messages
Chapter 14. Debugging and monitoring
§ 14.1. Debug Events
§ 14.2. Debug registers
§ 14.3. Monitoring registers
Chapter 15. PAL (Privileged Architecture Library)
§ 15.1. PAL instructions and functions
§ 15.2. PAL replacement
Chapter 16. LLVM backend
§ 16.1. LLVM backend intro
§ 16.2. LLVM backend limitations
§ 16.3. MUSL port
§ 16.4. Code density comparison
§ 16.5. DOOM port
When creating the instruction sets of existing processor architectures in different years, their architects proceeded from various, often mutually exclusive, goals. Among these goals are the following:
There are different architectures: some go far in one particular direction, up to explicit conceptualism, while universal ones seek a balance of priorities across directions. Choices made at the instruction set design stage may later affect the possibility of developing the architecture in one direction or another. Errors in the design of an architecture can cut off the possibility of its effective implementation on new technologies because of an incorrect prediction of technological trends, narrow the scope of the architecture, or reduce the effectiveness of its application.
The traditional architecture of a programmable computing device is based on the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory. The execution of an instruction consists of a sequence of steps:
Naturally, processor performance equals the performance of the bottleneck of this system. It doesn't make sense to increase the capabilities of one pipeline stage if problems at other stages are not resolved. Accordingly, several processor bottlenecks arise:
You can immediately say that the bandwidth of RAM is the fatal bottleneck of the processor, and this problem is only removed by increasing the amount of built-in cache.
At the heart of the traditional architecture are two principles that concern its central element, the memory. These are the principle of random access to any memory element (the uniformity property) and the principle of controlling the system by executing a program, which is a sequence of instructions stored in memory.
However, early computers already had accumulator registers, and later counter and index registers appeared, kept as close as possible to the arithmetic unit. The emergence of architectures with general-purpose registers meant the final division of memory into fast registers and slower RAM. The appearance of registers made it possible to explicitly track, analyze, and schedule data dependencies in the instruction stream and, if there are no dependencies, to execute instructions simultaneously.
The further increase in memory size, the miniaturization of computing devices, and the growing gap between memory and processor speed gave rise to cache memory. Caching removed some of the problems with memory access speed without changing the programming paradigm. However, more was needed: cache levels of increasing size. The current computer scheme is as follows: a set of specialized computing devices, each with its own register files, relies on a system of logically uniform memory with implicit multi-layer caching.
The architecture of a computer with 16 general-purpose registers is certainly better than an architecture with 8 registers. And an architecture with two floating-point multiply-add pipelines is better than one with a single pipeline. It might seem that an architecture with 1024 registers and 16 multiply-add pipelines would be almost ideal. However, a register file of 1024 registers with 16×4=64 read/write ports would be a technological absurdity. Caching also reached its limit after the advent of the fourth cache level. Further enhancement of parallel data processing capabilities is carried out by creating massively parallel systems with shared memory, which abandoned the property of memory uniformity, keeping it only for the local memory of a single multiprocessor node. But these issues already lie outside the architecture of the processor itself.
The new architecture doesn't abolish the traditional architecture based on logically homogeneous memory and doesn't offer a new programming paradigm. The architecture is still based on logically homogeneous RAM. Architectural changes can only affect the model of a computing device with its state explicitly described by internal registers.
In addition to memory non-uniformity, there is another fundamental fact that determines the development of architectures: the parallelism of operations. Unlike the traditional strictly sequential architecture, modern architectures, to achieve maximum performance, seek to execute more than one instruction at a time, and more than one operation in one instruction.
The problem of parallel computing is ultimately reduced to the problem of organizing coherent simultaneous access of many computing devices to logically homogeneous memory, that is, to the same problem of real memory non-uniformity and insufficient bandwidth. Accordingly, it is the level of parallel memory sharing that determines the parallelization technologies used.
There are several technologies for increasing the degree of parallelism of calculations, depending on the hierarchical level of memory for which they are intended. These technologies are implemented either at the ISA level (instruction set architecture) or at the software level. Here we are more interested in the first case, since we want to evaluate the possibilities for parallel operations offered by the correct choice of ISA.
Memory hierarchy level | Data exchange | Technology | Hardware | Acceleration | New ISA | Code density | Implementation |
---|---|---|---|---|---|---|---|
Separate register | inside the pipeline | SIMD: subword parallelism | Wide registers and data buses | 4-16 operations in one instruction | 4-16 | 8 | At ISA level, compiler |
Pipeline data | inside the pipeline | Fused instructions | Longer pipeline, additional read port | 2-3 operations in one instruction | 2-3 | 1.25 | At ISA level, compiler |
Separate register file | Crossbar before register file | OOOE+SS: out-of-order super-scalar execution | Increase in the number of ports of the register file, associative hardware for instruction issuing | 2-10 instructions per cycle | 2-5 | 0 | At ISA level, compiler |
Many computing units with local register files | inter-file transfer instructions | MIMD+VLIW: very long instruction word | Wide fetching of instructions, scheduling | 2-8 instructions per cycle | 0 | 0.25 | At ISA level, compiler |
Cache | Explicit sync memory access instructions | SMT/CMP: simultaneous multi-threading, chip multi-processing | Multiport cache, next instruction fetch | 2-4 cores (threads) on one chip | 4 | 0 | At the program level |
Local shared RAM | Explicit Sync Memory Access Instructions | SMP: shared memory processing | Memory banks, wide crossbar | 2-64 microchips in one node | 64 | 0 | At the program level |
Computing network | Library network transfer functions | MPP: massively parallel processing | Developed network topology (hypercube, torus, mesh, fat tree) | any number of nodes in the array | 4096 | 0 | At the program level |
A commercially successful ISA is a compromise between implementation complexity and the benefits of each technology. A successful ISA doesn't give preference to any one technology (it isn't purely conceptual), but organically and in moderate doses combines several technologies.
SIMD (Single Instruction Multiple Data) instructions perform homogeneous vector operations on elements (8, 16, 32, or 64 bits long) of a wide register (64 or 128 bits long). They allow performing several (2-16) operations in one instruction per cycle. However, software handling of exceptional situations is complicated (where exactly in the vector operation is the error?). The program should contain a sufficient proportion of operations that allow vector execution, and the optimizing compiler must be able to find such operations. Memory accesses run into data alignment problems. Wide register read and write ports must be implemented.
Fused instructions are three-operand instructions that combine two binary operations, for example a = b × c + d. This reduces the total number of instructions and (ideally) doubles the number of computational operations performed per clock cycle (but not machine instructions). It requires a longer execution pipeline, and hence larger branch delays. An additional read port is needed for the third operand. The construction of an exception handler becomes more complicated, since collisions are possible in both the first and the second of the fused operations. Space is needed in the instruction to encode the fourth operand. The formats of computational instructions diverge into binary and ternary ones, which complicates decoding, or all binary formats have to be artificially converted into ternary instructions. The program must contain operations that allow fusing, the optimizing compiler must be able to find such operations, and the percentage of fused operations should be large enough. The total number of possible fused instructions is O(N²), where N is the number of basic operations, which is quite large. In practice, it is impossible to fuse all instructions, since the amount of decoding hardware and the space in the instruction allocated for the operation code are limited. Therefore, only some frequently occurring combinations of operations are fused.
Predication is the conditional execution of instructions: any instruction turns into a hardware-executed conditional statement, for example if (a) b = c + d. An additional operand encodes the logical condition register. This technology replaces a control dependency with a data dependency and shifts a possible pipeline stall closer to the pipeline end. Most poorly predicted branches occur in short conditional calculations; the resulting pipeline stalls are eliminated by simultaneously executing instructions from different branches of the conditional statement. However, this is a brute-force method, which boils down to simultaneously issuing instructions from several execution branches under different predicates into the pipeline. Space is needed in the instruction to encode the extra operand, the predicate register.
Superscalar (super-scalar) execution of instructions. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: exception handling becomes more complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select N simultaneously executed instructions. Additional read and write ports and additional pipeline stages are needed.
Out-of-order execution (OOOE) is the execution of instructions not in the order prescribed by the program, but as their operands become ready, which makes it possible to work around dependency stalls and do something useful while waiting for the completion of previous instructions, such as reads from memory. However, exception handling is complicated, since instructions must complete strictly in program order. Associative hardware of complexity O(N²) is required to analyze and select the next instruction from N buffered instructions. Additional pipeline stages are needed. A register file of sufficient size and hardware for dynamic register renaming are required.
VLIW (Very Long Instruction Word) or MIMD (Multiple Instruction Multiple Data) is the execution of instruction packages. Advantages: execution of several (1-4) instructions per cycle. Disadvantages: synchronization is needed, that is, accurate knowledge of the delay times of all pipelines and of memory, and hence programs do not survive a change of processor model and are incompatible with data caching. Additional read and write ports are needed. The program must contain operations that allow synchronous execution, and the compiler must be able to find such operations. The program size grows because of empty slots for which no useful instructions were found.
The program size should be as small as possible. The code density requirement demands efficient use of the space within instructions. The question arises of the most advantageous distribution of the bit budget between different types of information in the instruction. The following table shows what the instruction bit budget can be spent on:
Type of information | Advantages | Disadvantages |
---|---|---|
Operation code | Increasing the variety of implemented functions shortens the data path (the number of instructions per operation) | Complicates the functional units and the compiler |
Wider register numbers | More registers in a uniform register file facilitate variable allocation and data flow organization | Statistically useless when procedures with a small number of variables prevail; increases the length of data buses and the number of intersections. |
Additional operand registers | Non-destructive 3-ary instructions and complex fused 4-ary operations reduce the number of data moves, shorten the data path | Problems with additional register reading ports and register renaming ports (for OOOE) |
Explicit predicate description | conditional execution of short conditional statements without branches reduces the branch-delay of incorrect dynamic predictions | Favorable only for short conditional statements that do not exceed the pipeline length. |
Longer constants in the instruction code | loading constants becomes easier, less often special instruction sequences are required for synthesizing long constants | Statistically useless when short constants prevail |
Templates for an early description of the instruction distribution to functional units | facilitating decoding and distribution of instructions among functional units | Problems with porting programs to machines with a different set of functional units |
Explicit description of instructions that allow parallel execution | Simplifies instruction scheduling to functional units | Useless with unpredictable execution times |
Hints to the processor about the direction and frequency of branches | Reduced downtime due to incorrect dynamic predictions | Useless if the compiler doesn't have the necessary information, harmful if the prediction is incorrect. May conflict with hardware branch predictor. |
Hints to the processor about the frequency and nature of future accesses to the cache line | Reduced cache misses, better cache utilization | Useless if the compiler doesn't have the necessary information or does incorrect predictions, harmful for different predictions of access patterns to the same line. May conflict with microarchitectural hardware prefetcher. |
Explicit clustering of register files with binding to functional units | The data bus length and the number of intersections are reduced, power consumption is reduced, and less space is needed for register numbers in instructions | Requires explicit data transfers between register files in different clusters by separate instructions, which lengthens the data path. It doesn't allow reducing or increasing the number of clusters specified by the architecture. It doesn't allow redistributing functional units. |
A successful ISA implementation is a trade-off between the cost of instruction space and the benefits of using coding techniques. Preference should not be given to any one technique. The instruction set architecture combines different coding techniques at the ISA level.
In a broader sense, there is the question of how to split the work flow into separate instructions. The same sequence of operations can be represented in different ways as a sequence of instructions. Making instruction semantics more complex makes it possible to increase instruction length and reduce the instruction count without increasing the overall program size, and this, in turn, raises the question of a new redistribution of the bit budget.
However, for high-performance architectures, what matters is not so much the code size as the efficiency of the instruction cache. A cache line contains several instructions and begins at a naturally aligned address. The most effective option is to fetch all the instructions of a cache line starting from the first instruction. Fetches that do not start at the beginning of a line require aligners and introduce delays; incomplete fetches use the cache irrationally.
Summarizing the above, we can say that a regular instruction format is needed: one that shortens the data path without significantly complicating the semantics and hardware support of each instruction, that is dense enough, but that, when choosing between code density and caching efficiency, gives preference to caching. The instruction format should provide scalability of parallel computing devices within a single portable architecture.
It should be noted that the growth in the volume of processed information significantly outpaces the growth in program complexity. The share of RAM occupied by program code is constantly decreasing. Therefore, the problem of minimizing the program size is gradually relegated to the background, remaining relevant only for embedded systems.
The pipelined, parallel organization of a computing device requires a regular instruction format, that is, a constant length of the portion supplied to the input of the pipelined instruction decoder. This is necessary in order to start decoding the next portion before decoding of the instructions from the previous portion is completed.
The regularity of the instruction format means that all instructions are bounded, their length being limited by the length of the decoded portion. However, not all instruction lengths are practical to implement. It should also be possible to determine the start of the next instruction before decoding of the current instruction is complete. Instructions must satisfy the conditions of natural data alignment in memory, which keep growing as memory systems evolve.
Format | Instruction lengths | Alignment | Example |
---|---|---|---|
Irregular | 1-15 | 1 | Intel X86 |
Irregular | 2,4,6,8 | 2 | Motorola 68000, IBM S390 |
Semi-regular | 2,4 | 2 | MIPS-16 |
4-byte regular | 4 | 4 | Alpha, PowerPC, PA-RISC, Sparc, MIPS |
8-byte bundles | 4,8 | 8 | Intel 80960 |
8 byte instructions | 8 | 8 | Fujitsu VPP |
16-byte bundles | 5,10 | 16 | IA-64 |
The first three rows of the table relate either to legacy instruction architectures or to special architectures for embedded applications, for which the size of the program flashed directly into ROM is more important than performance.
A regular 4-byte format is used by all modern RISC architectures. They are now completing their development cycle, reaching the limits of improving this format. It is hardly worth hoping for significant progress from new architectures based on this format.
The format of 8-byte instructions is used only in some vector and graphics processors, where there is generally no possibility of accessing smaller memory atoms. Its application to a general-purpose architecture would mean more than doubling the size of programs, which is unacceptable.
Number of registers | Architecture | Advantages | Disadvantages |
---|---|---|---|
8 | Intel X86 | scaled indexed addressing mode, SSE2 double-precision vector instructions | CISC: only 8 non-universal and non-orthogonal registers, lack of uniformity in coding |
16 | AMD X86-64 | PC-relative addressing | compatible with old X86, only 16 registers |
16 | ARM32 | Predication, fused instructions | The program counter is combined with a general register |
32 | ARM64 | fused instructions | usually 2 instructions to address global/static data (hi/lo parts of the address) |
32 | SGI MIPS | First RISC: Fixed Instruction Format, PC-relative addressing (MIPS16) | delayed branches |
32 | Intel 80960 | Regular but not fixed format with 4 and 8 bytes instructions | |
32 | HP PA-RISC | instruction nullification, speculative execution, system calls without interruptions, global virtual address space, inverted page hash tables | delayed branches, comparison in each instruction |
32 | DEC Alpha | out-of-order execution of instructions, a fixed instruction format, a unified PAL code, the absence of global dependencies outside the registers | insufficient memory access formats, poor code density, lack of good SIMD extensions, imprecise interrupts |
32 | IBM PowerPC | out-of-order execution of instructions with ordered completion and precise interrupts, fused «multiply-add» instructions, multiple condition register fields, saving or restoring several registers with one instruction, global virtual address space, inverted cluster page tables | optional comparison in each computational instruction, dependencies between instructions through global flags, inconvenient ABI. |
32 | IBM/Motorola PowerPC | AltiVec Vector Extension | missing double-precision vector instructions (as in SSE2) |
32 | Sun UltraSPARC | Recursive interrupts, register rotation | register windows of a fixed size, large register files but a small number of registers |
128 | Intel IA-64 | Predication, register rotation, instruction bundles | only in-order execution of instructions, large multi-port register files, sparse code, complex compiler |
128 | IBM Cell | Unified register file for all types | Explicit non-uniform scratchpad memory without cache, explicit DMA for exchange with main memory |
256 | Fujitsu SPARC64 IX-FX | Vector instructions for paired registers. | Separate preparation instructions for specifying numbers from an extended set of registers |
This chapter provides a basic description of the POSTRISC virtual processor instruction architecture (instruction set architecture or ISA).
The architecture prefers security over performance. Exploitation of unplanned program behavior should be prevented by design as far as possible, and ambiguous interpretation of code should be avoided. This is done for security reasons, to prevent return-oriented programming attacks like «return to libc» and to make all binary code available for inspection.
Variable-length instruction encoding allows starting execution from the middle of an instruction and extracting unplanned instruction sequences: alternative interpretations of the program code become possible by decoding from the middle of a variable-length instruction. It should be impossible to continue execution from the middle of an instruction. To ensure this, either a fixed format or a self-synchronizing variable-length format can be used. POSTRISC chose the fixed format: variable-length instructions are forbidden, and only fixed instruction encoding with aligned code chunks is allowed.
Some architectures allow placing data inside code, by design or due to limitations of global data addressing. In such architectures, data may be placed near the function that uses it or accumulated into bigger «data islands» shared by several functions. Data in a code section may lead to data execution and to exploitation of unplanned program behavior. So strong separation of code and data should be enforced at the architecture level, and mixing code and data in the code section should be prohibited. This also improves paging/caching/TLB behavior.
The instruction set architecture aims at maximally parallel fetching of instructions from memory and parallel decoding. The instruction format is regular (the length of the decoded code portion is constant) but not strictly fixed (where all instructions would necessarily be the same length); it is almost fixed: within a regular portion, the initial parts of instructions have the same length, and a possible continuation also has a fixed length. The unit of the instruction flow is a 16-byte bundle assembled from three (usually) or two instructions. Bundles are always 16-byte aligned in memory.
Unlike traditional systems like VLIW (very long instruction word), instruction bundling reflects only the parallel fetching and decoding process, not the process of dispatching, executing, or completing instructions. The instruction bundles do not describe the binding of individual instructions to functional units, the possibility (or necessity) of parallel execution and/or completion, or execution timings. The architecture doesn't expose microarchitectural details to software, such as load data delays, branch delays, other fixed pipeline delays (pipeline hazards), or a fixed set of functional units. This is necessary for program portability within a family of machines with different microarchitecture/performance. It is assumed that the program can be used without recompilation on machines with different sets of functional units and timings.
Wherever possible, the instruction set tends to be uniform, that is, if some part of the instruction with the same meaning (for example, the number of the first register, the number of the second register, immediate value, etc.) is present in many instructions, then in all those instructions this part is placed at the same position.
The instruction set architecture uses a non-destructive instruction format for any calculation over registers, i.e. the result register is always encoded separately from the operand registers, unlike CISC dual-argument architectures, where the result is forcibly combined with one of the operands. Accordingly, two-argument unary instructions, three-argument binary instructions, and four-argument fused (ternary) instructions are valid.
Fighting unpredictable branches or using the vector extension requires the introduction of predicates and conditional execution, but encoding an additional predicate argument would cost extra space in each instruction. The POSTRISC architecture uses implicit predication via nullification. Each instruction can be turned into a nop by preceding nullification instructions: instructions are executed conditionally, and cancelled instructions are treated as nops. When predication is not used, we don't pay for it in instruction bits.
In the new architecture, to shorten the data path, a limited number of frequently encountered combinations of operations are fused (combined into one machine instruction): addition (or subtraction) with a shift; multiplication with addition or subtraction; addition of a constant with memory access (base plus displacement addressing mode); register addition (with shift) with memory access (scaled indexed addressing mode); comparison with a branch on the comparison result; loop counter update with comparison and branch on the comparison result, etc. The architecture assumes true hardware support for fused operations, rather than just compiling fused code that hardware cracks back into the original operations.
The architecture allows superscalar out-of-order execution hardware to be used effectively. To this end, the instruction set has several limitations. There are no implicit or optional instruction results, and no global registers or flags. The number of possible side effects of instructions is limited. Most instructions have a single register result; several instructions have two register results. The number of operands is limited to three (and for most instructions, two) registers.
For the POSTRISC instruction architecture, the underlying technology is parallel (super-scalar) out-of-order execution of complex (fused) instructions with implicit predication.
Instruction fetching and decoding occur sequentially in program order. Out-of-order concurrent execution is used to process at least one instruction bundle per cycle. The final completion of instructions, with the analysis of exceptions, occurs sequentially in program order.
All operations on integer data take place in general registers, with 2-3 source operand registers (an immediate value or an immediate shift amount may be used instead) and one result register.
All actions on floating-point data occur in general registers, with 1, 2 or 3 registers of the source operands and one register of the result. Floating-point instructions work on single/double/quadruple precision numbers in scalar or packed vector forms.
Many scalar operation codes are complemented by a wide range of vector operations. A special vector extension is used to process multimedia and numerical data in ordinary registers.
The architecture is of the load/store type. Memory accesses are limited to load and store instructions that move data between registers and memory and are not combined with use of the loaded value. A memory access instruction usually performs exactly one memory access with a single virtual address translation. Unaligned memory accesses are possible, but strict data alignment is preferred.
Global flags and dedicated registers prevent efficient parallel execution of instructions, while duplicating resources and introducing explicit dependencies between instructions requires extra bits in the instruction to describe them explicitly. Branch instructions do not use flags but check the values of general registers. The basic operation is a combined «compare and branch» in one instruction.
To speed up subroutine calls, pass arguments through registers, and reduce the number of memory accesses, a hardware circular buffer of rotated registers is implemented. It also improves code density by minimizing function prologs and epilogs. The second, protected stack for rotated registers also protects the contents of all register frames from erroneous modification. Register rotation also complicates return-oriented programming: there is no known correspondence between the physical registers of different function frames.
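As a minimal sketch (an assumption: the alloc mnemonic and its framesize field appear in the instruction format tables below, but its exact assembler operand syntax is not shown in this chapter), a function entry could request its register frame with a single instruction:
alloc 14
after which the function works in its own window of rotated registers; Chapter 5 describes the actual register stack and frame allocation rules.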
Optional hints about the frequency and nature of future cache line accesses are given (if such information is available) in separate instructions.
For immediate encoding there exist different options: optional compression, interpreting binary values as signed/unsigned, a separate sign bit plus an unsigned value, etc. POSTRISC uses the simple two's-complement binary representation. Each immediate class is defined as signed or unsigned depending on its usage. Base addressing displacements are always signed. Shift amounts are always unsigned. Compare immediates for less/greater are signed or unsigned depending on the type. Compare immediates for equality are chosen to be signed.
Processor resources include register files, special registers, associative search structures, interrupts. Some resources are available for user programs, others are necessary for the functioning of the operating system. Each processor core has its own set of registers that contain the current state of the core. All registers are divided into register files. There are no registers that are not included in any register file.
It is known that for ordinary code, increasing the register file size beyond 32 gives negligible results. But using more registers makes sense for high-performance computing, digital signal processing, 3D graphics acceleration, and game physics. IBM uses a 128×128 SIMD register file in its POWER VMX extension and a 64×128 file in its POWER VSX extension. Fujitsu uses a 256×128 register file in its SPARC64 FX HPC-ACE extension. Intel Itanium had 128 82-bit floating-point registers for HPC.
For the POSTRISC architecture, the 128x128 register file is chosen as a compromise between ordinary usage and special computing purposes.
Register file | Number of registers | The size of the registers in bits | Additional info |
---|---|---|---|
General Purpose Registers | 128 | 128 | General-purpose registers are intended for manipulations with scalars 1,2,4,8,16 bytes long or vectors of numbers 1,2,4,8 bytes long. General purpose registers are divided into 120 rotated windowed and 8 global registers. In each group, all registers are equal at the architecture level. Registers can be used to manipulate real numbers of quadruple precision, single and double precision packed vectors of real numbers, packed integer vectors of length 1,2,4,8 bytes. Exceptions from equality: local: r0, globals: tp, fp, sp, gz. |
Special Purpose Registers | up to 128 | 32/64/128 | As the name implies, special-purpose registers have different purposes. Not all of the 128 possible special registers are implemented. The ability to read/write depends on the priority level, register number, etc. |
CPU identification registers | implementation-defined | 64 | The read-only registers for reporting hardware capabilities/features. Available only indirectly. |
Instruction TLB translation registers | implementation-defined | 128 | The fixed translations which can't be evicted from the Instruction TLB buffer. Available only indirectly. |
Data TLB translation registers | implementation-defined | 128 | The fixed translations which can't be evicted from the Data TLB buffer. Available only indirectly. |
Performance monitor registers | implementation-defined | 64 | The counters for internal processor core statistics such as the number of TLB misses, instruction/data cache misses, branch mispredictions, etc. Available only indirectly. |
Instruction breakpoint registers | implementation-defined | 64 | The instruction breakpoint register when enabled allows stopping execution on preferred code addresses. |
Data breakpoint registers | implementation-defined | 64 | The data breakpoint register when enabled allows stopping execution on preferred data addresses and/or addressing types like read/write/backstore/etc. |
Existing RISC architectures have exhausted the possibilities of a fixed 32-bit instruction format. Deep loop unrolling, function inlining, other compiler optimization technologies require more than 32 general-purpose (and floating-point) registers, preferably at least 128. However, increasing the number of registers over 32 with the 32-bit RISC instruction length turned out to be difficult. The three-address format requires at least 3×log2(128) or 21 bits for register numbers (and a four-address fused instruction even 28 bits).
The decision to separate code and data forces us to support effective addressing modes for accessing global/static/const data outside the code section. The approach with several instructions (like high/low offset parts) for accessing global data seems unsuitable. But existing 32-bit instructions aren't enough to access global data from any code position with one instruction. The biggest known projects are estimated at 150-250 MiB (210 MiB Chromium, 380 MiB Linux kernel «allyesconfig» build, various CADs, etc.), which requires offsets of at least 28-30 bits to leave room for future code growth. POSTRISC supports programs of up to 256 MiB with direct access to global data in one instruction.
Some vector processors (like NEC SX Aurora) or video cards use a longer fixed 64-bit format. But this doubles the program size and doesn't justify the possible benefits for a general-purpose architecture. The only remaining intermediate format, consistent with 2^n-byte alignment, has 3 instructions of 42 bits each (slots), packed into 128-bit bundles. With the 128-bit format we can't transfer control to any instruction in the bundle except the first, nor execute only part of a bundle; the bundle is the minimal execution unit. This encoding approach is similar to Intel IA-64 Itanium.
The POSTRISC architecture defines that a 128-bit bundle consists of a 2-bit template and three 42-bit slots. There are two types of instructions: one or two bundle slots in length. A bundle may contain three simple one-slot instructions, or a dual-slot instruction followed by a one-slot one (direct order), or a one-slot instruction followed by a dual-slot one (reversed order).
All operation codes are placed in the first slot of a double-slot instruction, so the second slot is used only for the immediate extensions. If the instruction format allows expansion into the second slot to form a long instruction, some immediate fields may have different lengths in the short and long formats. For example, simm21(63) means a 21-bit field in the short format, expandable to 63 bits in the long format.
The splitting of a bundle into instructions is completely determined by the 2-bit template, so the primary and extended operation codes for different instruction lengths do not conflict. However, they are defined to be always identical: the long instructions are always extended versions of the short instructions with extended immediates. The following table shows the packing of the template and instructions into bundles.
Slot 3 (bits 86…127) | Slot 2 (bits 44…85) | Slot 1 (bits 2…43) | Template (bits 0…1) |
---|---|---|---|
42 bits | 42 bits | 42 bits | 00 |
84 bits (slots 2+3) | 42 bits | 01 |
42 bits | 84 bits (slots 1+2) | 10 |
126 bits (reserved) | 11 |
The following table shows the instruction formats and the lengths (in bits) of the instruction fields for one-slot instructions. The high 7 bits {41:35} always define the primary operation code (or simply opcode) of the instruction. Many instructions also have one or two extended opcodes (opx). The remaining bits of the instruction contain one or more fields in various formats.
Format name | Fields, listed from bit 41 down to bit 0 |
---|---|
r1i * | opcode | ra | simm28 (64) | |||||||||||||||||||||||||||||||||||||||
RaU28 * | opcode | ra | uimm28 (64) | |||||||||||||||||||||||||||||||||||||||
r1b * | opcode | ra | label28 (64) | |||||||||||||||||||||||||||||||||||||||
br * | opcode | opx | label28 (64) | |||||||||||||||||||||||||||||||||||||||
RaU28 * | opcode | opx | uimm28 (64) | |||||||||||||||||||||||||||||||||||||||
alloc | opcode | opx | framesize | 0 | ||||||||||||||||||||||||||||||||||||||
allocsp * | opcode | opx | framesize | uimm21 (63) | ||||||||||||||||||||||||||||||||||||||
raopxUI21 * | opcode | opx | 0 | uimm21 (63) | ||||||||||||||||||||||||||||||||||||||
raopx2i * | opcode | opx | rb | simm21 (63) | ||||||||||||||||||||||||||||||||||||||
r2si * | opcode | ra | rb | simm21 (63) | ||||||||||||||||||||||||||||||||||||||
r2ui * | opcode | ra | rb | uimm21 (63) | ||||||||||||||||||||||||||||||||||||||
raopx2b * | opcode | opx | rb | 0 | label17 (30) | |||||||||||||||||||||||||||||||||||||
r2b * | opcode | ra | rb | opx | label17 (30) | |||||||||||||||||||||||||||||||||||||
bbit * | opcode | ra | shift | opx | label17 (30) | |||||||||||||||||||||||||||||||||||||
brcsi * | opcode | ra | simm11 (40) | label17 (30) | ||||||||||||||||||||||||||||||||||||||
brcui * | opcode | ra | uimm11 (40) | label17 (30) | ||||||||||||||||||||||||||||||||||||||
RaSIN * | opcode | ra | simm11 (40) | dist-no | dist-yes | opx | ||||||||||||||||||||||||||||||||||||
RaUIN * | opcode | ra | uimm11 (40) | dist-no | dist-yes | opx | ||||||||||||||||||||||||||||||||||||
RaSbN | opcode | ra | shift | opx | dist-no | dist-yes | opx | |||||||||||||||||||||||||||||||||||
RabN | opcode | ra | rb | opx | dist-no | dist-yes | opx | |||||||||||||||||||||||||||||||||||
r4 | opcode | ra | rb | rc | rd | opx | ||||||||||||||||||||||||||||||||||||
r3s1 | opcode | ra | rb | rc | pos | opx | ||||||||||||||||||||||||||||||||||||
r2s2 | opcode | ra | rb | shift | pos | opx | ||||||||||||||||||||||||||||||||||||
r2s3 | opcode | ra | rb | shift | shift | pos | ||||||||||||||||||||||||||||||||||||
r3s2 | opcode | ra | rb | rc | shift | pos | ||||||||||||||||||||||||||||||||||||
gmemx * | opcode | ra | rb | rc | scale | sm | disp | |||||||||||||||||||||||||||||||||||
RbcScale | opcode | 0 | rb | rc | scale | opx | ||||||||||||||||||||||||||||||||||||
Rbc | opcode | 0 | rb | rc | 0 | opx | ||||||||||||||||||||||||||||||||||||
mspr | opcode | ra | 0 | spr | 0 | opx | ||||||||||||||||||||||||||||||||||||
r2 | opcode | ra | rb | 0 | 0 | opx | ||||||||||||||||||||||||||||||||||||
Round | opcode | ra | rb | 0 | rm | opx | ||||||||||||||||||||||||||||||||||||
r2s1 | opcode | ra | rb | shift | 0 | opx | ||||||||||||||||||||||||||||||||||||
r3 | opcode | ra | rb | rc | 0 | opx | ||||||||||||||||||||||||||||||||||||
RabcMo | opcode | ra | rb | rc | mo | opx | ||||||||||||||||||||||||||||||||||||
RabMo | opcode | ra | rb | 0 | mo | opx | ||||||||||||||||||||||||||||||||||||
RbcMo | opcode | 0 | rb | rc | mo | opx | ||||||||||||||||||||||||||||||||||||
fence | opcode | 0 | mo | opx | ||||||||||||||||||||||||||||||||||||||
gmemu | opcode | ra | rb | simm10 | opx | |||||||||||||||||||||||||||||||||||||
int | opcode | 0 | rb | simm10 | opx | |||||||||||||||||||||||||||||||||||||
NoArgs | opcode | 0 | opx |
Field | Length | Description |
---|---|---|
opcode | 7 | primary operation code |
opx | 4, 7, 11 | extended operation code |
ra, rb, rc, rd | 7 | general register number, operand or result |
spr | 7 | special register number |
uimm, simm | 9, 10, 11, 21, 28 | unsigned/signed immediate |
disp | 9, 21, 28 | signed immediate for the address offset |
label | 17, 28 | signed immediate of branch/jump/call |
stride | 10 | signed immediate for base update |
dist-yes, dist-no | 5 | nullification block size |
shift, pos | 7 | bit number, shift value, field length |
scale | 3 | indexing scale factor |
sm | 2 | indexing scaling mode |
rm | 3 | floating-point rounding mode |
mo | 3 | memory ordering mode |
0 | various | unused (reserved, must be zeros) |
Formats marked in the table with an asterisk (*) allow the instruction to continue into the next bundle slot, forming a two-slot instruction. The primary codes of single-slot and dual-slot instructions are the same. The assembler code should explicitly request the forced extension of an instruction into the second slot with the additional suffix «.l» (long). The assembler adds dummy nop instructions to the code if a long instruction doesn't fit into the rest of the bundle and has to start a new bundle.
addi r23, r23, 1234
addi.l r23, r23, 1234
Note: incidentally, the 42-bit slot format is in line with the «Answer to the Ultimate Question of Life, the Universe, and Everything»!
The calculation of effective addresses takes place with cyclic wraparound modulo 2^64. There is no absolute addressing directly in instructions. Only position-independent code (PIC) can be used. Target addresses for addressing executable code can only be calculated relative to the address of the current instruction bundle (the instruction pointer ip) or relative to base addresses in general registers.
The architecture supports 2 modes for ip-relative code addressing:
EA = ip + 16 × sign_extend(disp)
The call/jump offset takes up 28 bits in the instruction slot and allows encoding a branch of at most ±2 GiB in either direction from the current address (2^27 bundles × 16 bytes = 2 GiB). If the two-slot instruction is used, the branch distance is at most ±8 EiB in either direction from the current address.
Bits 41…0:
opcode | other | offset (28 bits) |
Bits 83…42 (second slot):
0 | continued (60 bits instead of 28) |
The branch offset takes 17 bits in the instruction slot and allows encoding a branch of at most ±1 MiB in either direction from the current address (2^16 bundles × 16 bytes = 1 MiB). If the two-slot instruction is used, the offset takes 30 bits and the branch distance is ±8 GiB in either direction. The branch condition is encoded by the other parts of the instruction.
Bits 41…0:
opcode | other | offset (17 bits) |
Bits 83…42 (second slot):
other | (30 bits instead of 17) |
The linker, creating the image of a program module, must correctly replace all symbolic references with offsets from the place where the symbol is accessed to the location of the symbol itself, for procedures and global data alike. That is, for example, calls to the same static procedure from different places in the program occur with different relative offsets.
The architecture also supports base-relative instruction addressing. The effective address is computed as the sum of two registers, aligned down to the bundle boundary.
EA = (GR[base] + GR[index]) & mask{63:4}.
Bits 41…0:
opcode | other | base | index | other |
Absolute data addressing isn't directly supported. The architecture makes it impossible to put absolute static addresses into the instruction code. Only position-independent code (PIC/PIE) is possible. Target absolute addresses can be calculated only relative to the address of the current instruction bundle or to reserved base registers. The architecture supports the following data addressing modes:
For ip-relative addressing, the unsigned immediate disp field, which is 28 bits (or 64 bits for a dual-slot instruction), is zero-extended and added to the contents of the instruction pointer to produce a 64-bit effective address. We assume that program data sections like «.data» or «.rodata» are placed strictly after code sections like «.text» in the loaded program. The 28-bit immediate value allows addressing 256 MiB forward from the current bundle. Dual-slot instructions allow addressing the full 64-bit address space.
EA = ip + zero_extend(disp)
Bits 41…0:
opcode | target | disp (28 bit) |
Bits 83…42 (second slot):
0 | continued (64 bits instead of 28) |
For the base plus displacement addressing mode, the disp offset, which is 21 bits (or 63 bits for a dual-slot instruction), is sign-extended and added to the contents of the base register to produce a 64-bit effective address. The 21-bit immediate value disp allows addressing ±1 MiB in both directions from the base address.
EA = GR[base] + sign_extend(disp)
Bits 41…0:
opcode | target | base | disp (21 bits) |
Bits 83…42 (second slot):
continued (63 bits instead of 21) |
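For example (a sketch only: the ldd mnemonic and its dst, base, disp operand order are taken from the load examples later in this chapter, and the register numbers and displacements here are arbitrary), a load of an 8-byte value at offset 64 from the stack pointer, and its dual-slot form for a displacement that doesn't fit into 21 bits, might look like this:
ldd r11, sp, 64
ldd.l r11, sp, 123456789012
The second form pays one extra bundle slot for the 63-bit displacement.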
For the scaled indexed addressing mode, first the contents of the index register are extended according to the sm instruction modifier, which may be x64 (no extension), u32 (32-bit unsigned), or i32 (32-bit signed). Then the extended index is shifted left by scale, the 9-bit signed offset disp (−256…255) is added, and the contents of the base register are added to produce a 64-bit effective address.
EA = GR[base] + (SM(GR[index]) << scale) + sign_extend(disp)
Bits 41…0:
opcode | target | base | index | scale | sm | disp (9 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
disp continued (51 bits instead of 9) |
For the base addressing mode with immediate post-update, the 10-bit stride immediate is added to the base register after the memory access.
EA = GR[base]
GR[base] = EA + sign_extend(stride)
For the base addressing mode with immediate pre-update, the 10-bit stride immediate is added to the base register before the memory access.
EA = GR[base] + sign_extend(stride)
GR[base] = EA
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | stride (10 bits) | opx |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
stride continued (52 bits instead of 10) |
Other addressing modes can be implemented on top of the above. Absolute addressing of data can be implemented by using any register holding the value 0 as the base. Static data should be aligned on a 4-byte boundary; if it is not, it can be addressed with base-plus-displacement addressing after placing a base address in one of the free registers. The special instruction ldar (load address relative) makes this preparation easier.
ldar base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)
Here ip_relative_offset is the label of the loaded object in the immutable data segment, text_hi is a built-in assembler function that computes the relative address of the instruction bundle (or aligned 16-byte data portion), and text_lo is a built-in assembler function that computes the displacement within that bundle (portion). Using the ldar instruction, you can address 1 GiB on either side of the current position, or the entire address space if you use the dual-slot version of ldar:
ldar.l base, text_hi (ip_relative_offset)
ldd dst, base, text_lo (ip_relative_offset)
Addressing of private data can be implemented by first placing a suitable base address in one of the free registers. The special instruction ldan (load address near) computes the nearest base address pointing to the middle of the page containing the desired object.
ldan base, gp, data_hi (gp_relative_offset)
ldd dst, base, data_lo (gp_relative_offset)
Here gp_relative_offset is the label of the object in the data segment, data_hi is a built-in assembler function that computes the high part of the gp-relative offset to the middle of the data page containing the label, and data_lo is a built-in assembler function that computes the offset of the label relative to the middle of that page. Using the ldan instruction, you can address 1 GiB of private data (or the entire address space with the dual-slot version of ldan).
You can also directly use the dual-slot memory access instructions, which address ±2^63 bytes around the base address.
ldd.l dst, gp, gp_relative_offset
There are several special registers, each 64 bits long. Not all special registers are available for direct access; most are accessible only to privileged software (at the system level). The table below describes the purpose of the special registers and their availability in protected and privileged modes.
Group | Registers | Description |
---|---|---|
Registers available to the program at any privilege level for direct and/or indirect reading and updating | ip | instruction pointer |
fpcr | floating-point status/control register | |
rsc | register stack control | |
rsp | register stack pointer | |
eip | exception instruction pointer | |
ebs | exception bit stack | |
eca | exception context address | |
Registers available for reading/writing only at the system privilege level | bsp | bottom stack pointer |
peb | process env block | |
teb | thread env block | |
reip | returnable default exception instruction pointer | |
itc | interval time counter | |
itm | interval time match register | |
psr | processor status register | |
pta | page table addresses | |
Debug facility registers | ibr0…ibr3 | instruction breakpoint registers |
dbr0…dbr3 | data breakpoint registers | |
mr0…mr8 | monitoring registers | |
Registers for switching to the kernel and making system calls are available only in the kernel | kip | kernel instruction pointer |
ksp | kernel stack pointer | |
krsp | kernel register stack pointer | |
Registers for interrupt handling (interrupt context descriptors, shadow copies of general registers), interrupts available in the handler | iip | interruption instruction pointer |
iipa | interruption instruction previous address | |
ipsr | interruption processor status register | |
cause | interruption cause register | |
iva | interruption vector address | |
ifa | interruption faulting address | |
iib | interruption instruction bundle | |
Registers of the built-in interrupt controller for controlling external interrupts and asynchronous interrupts from the processor itself (available only at the system level) | tpr | task priority register |
iv | interrupt vector | |
lid | local identification register (read only) | |
irr0…irr3 | interrupt request registers (read only) | |
isr0…isr3 | interrupt service registers (read only) | |
itcv | interval time counter vector | |
tsv | thermal sensor vector | |
pmv | performance monitor vector | |
cmcv | corrected machine-check vector |
Direct access to special registers is provided by the instructions mfspr (move from special-purpose register) and mtspr (move to special-purpose register). You can copy a special register to a general register (mfspr), perform the necessary operations, and then put the new value back into the special register (mtspr); see the example after the syntax below.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | 0 | spr | 0 | opx |
Syntax:
mfspr ra, spr
mtspr ra, spr
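For example, a read-modify-write of the user-accessible fpcr register might look as follows (a sketch only; t is an illustrative general register name):

mfspr t, fpcr    ; copy the special register to a general register
; ... change the needed bits of t with ordinary bitwise instructions ...
mtspr t, fpcr    ; write the new value back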
The special register instruction pointer (ip) stores the address of the bundle containing the currently executing instruction. The ip register can be read directly via the mfspr instruction, but it is better to obtain an ip-relative address (including one with zero offset) using the ldar/ldafr instructions. The ip register cannot be changed directly (via the mtspr instruction); it is automatically incremented at the end of bundle execution and receives a new value as the result of taken branch instructions. ip is also an implicitly implied operand of relative branches. Because the instruction format is regular and instruction bundles have a fixed length of 16 bytes and are aligned on a 16-byte boundary, the lower 4 bits of ip are always zero, and writes to them are ignored.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
bundle address | 0 |
The special floating-point status/control register (fpcr) is designed to control the floating-point unit (FPU).
Special registers rsc, rsp are used to control register rotation and flushing the contents of the circular register buffer into memory.
Special registers eip (exception instruction pointer), reip (returnable default exception instruction pointer), ebs (exception bit stack), eca (exception context address) are used to implement almost zero-cost software exceptions (like C++ try/catch/throw).
The 64-bit special processor status register (psr) controls the behavior of the current core. It is writable only at the most privileged level, and changing it requires explicit serialization.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 |
future |
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | ri | 0 | pl | vm | pp | mc | us | ib | ic | ss | tb | lp | dd | id | pm |
Group | Field | Size | Description |
---|---|---|---|
Miscellaneous | pm | 1 | User performance monitor enabled. If 1, the performance monitor is turned on and counts events, otherwise the performance monitor is disabled. |
Predication | future | 32 | The future field is used to control the nullification of the subsequent instructions. The nullification instruction may mark any of the subsequent 10 instructions as non-executing in this field. A value of 0 for a bit means that the instruction is executing, 1 - is not executing (nullified). The field is automatically shifted to the right when each instruction is executed, with zeros added for the new farthest instructions. In the case of the branch, the mask is completely cleared, thereby canceling all possible nullifications. |
Debugger | id | 1 | Instruction Debug Breakpoint fault. If psr.id=1, instruction breakpoints are enabled and may cause an Instruction Debug fault. Otherwise, faults and traps on instruction address breakpoints are disabled. |
dd | 1 | Data Debug Breakpoint fault. If psr.dd=1, breakpoints for the data are enabled and may cause a Data Debug error. Otherwise, errors and traps on the address breakpoint are prohibited. | |
lp | 1 | Lower Privilege transfer trap. If 1, the Lower Privilege Transfer trap occurs when a control transfer lowers the privilege level (the psr.cpl value increases to 1). | |
tb | 1 | Taken branch trap. If 1, then any branch that occurs causes the debug trap Taken branch. Interrupting and returning from it doesn't cause this trap. | |
ss | 1 | Single Step Trap. If 1, then the debug trap Single Step occurs after the successful execution of each instruction. | |
Privileges, restrictions | cpl | 1 | Current privilege level of the executing thread. Controls the availability of system registers, instructions, and virtual memory pages. The value 0 is the kernel level, the value 1 is the user level. Modified by the instructions syscall, sysret, rfi, trap. |
Interrupts | ri | 2 | Restart Instruction. Stores the size of the already executed part of the current instruction bundle. Used to partially restart the bundle after a call, syscall, or interruption: on restart, the already executed slots are skipped (instructions are skipped while psr.ri is less than the value saved in ipsr.ri at interruption). |
ib | 1 | interruption Bit. If 1, unmasked delayed external interrupts can interrupt the processor and transfer control to the external interrupt handler. If 0, pending external interrupts cannot interrupt the processor. | |
ic | 1 | interruption Collection. If 1, then upon interruption, partial preservation of the context occurs (using the registers iip, iipa, ipsr, ifa, iib). | |
us | 1 | Used Shadow registers. If 1, then during the interruption a partial preservation of the context occurred and the shadow registers (iip, ipsr) are used. | |
mc | 1 | Machine Check. If 1, machine check aborts are masked. | |
vm | 1 | Virtual Machine. If 1, attempting to execute some instructions results in a «Virtualization fault» error. If there is no virtualization implementation, this bit is not implemented and is reserved. The psr.vm bit is available only for the rfi and vmsw instructions. |
The special register bsp (bottom stack pointer) stores the bottom limit of the downward-growing stack, whose current position is stored in the general register sp. The architecture assumes that all not-yet-used stack pages are pre-mapped as guard pages and may be allocated in any order; it doesn't use pre-touching for allocated stack frames. bsp should be page-aligned.
Special registers peb (process env block) and teb (thread env block) store read-only user-mode addresses of the associated process and thread data blocks respectively.
The special register interval time counter (itc) is an unsigned 64-bit counter for measuring time intervals and for synchronization at roughly nanosecond granularity. itc advances at a fixed ratio to the processor frequency: it increments once every N cycles, where N is an implementation-defined power of two from 1 to 32. Applications can read itc directly for time-based computation and performance measurements; itc can be written only at the most privileged level. The OS must ensure that an interrupt from the system timer occurs before itc overflows. It is not architecturally guaranteed that the interval time counters of other processors in a multiprocessor system are synchronized with each other or with the system clock. Software must calibrate itc against valid calendar time and periodically adjust for possible drift.
Modifications of itc aren't necessarily synchronized with the instruction stream. Explicit synchronization may be required to ensure that modifications of itc are observed by subsequent program instructions. Software should take into account the possible measurement error when reading the interval timer caused by various machine stalls, such as interrupts, etc.
The special interval timer match register (itm) is an unsigned 64-bit number that contains the future value of itc at which an «interval time match» interrupt will occur.
The special register pta (page table address) controls hardware address translation and stores the root address of the page table.
Special registers iip, iipa, ipsr save part of the context (state) of the processor upon interruption.
Special registers iva, cause, ifa, iib manage the interrupt table (iva), as well as recognition and processing of interrupts.
Special registers lid, iv, tpr, irr0 - irr3, isr0 - isr3, itcv, tsv, pmv, cmcv are for the embedded programmable interrupt controller and manage external interrupts.
Special registers ibr0…ibr3, dbr0…dbr3, and mr0…mr8 are for the debugging and monitoring facilities.
This chapter describes the basic virtual processor instruction set: approximately 300 true machine instructions and about 30 pseudo-instructions (assembler instructions that have no exact machine analogs and are replaced by the assembler with other machine instructions, possibly with argument adjustment). It includes instructions for working with general registers, branch instructions, and instructions for working with special registers. It doesn't include privileged instructions, floating-point instructions, multimedia instructions, or support instructions for the extended (virtual) memory system.
The register-register binary instructions have 3 arguments. The first argument is the result register number, the second and third are the numbers of the operand registers.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | rc | 0 | opx |
Syntax:
INSTRUCTION_NAME ra, rb, rc
Instruction | Operation | Description |
---|---|---|
Arithmetic instructions | ||
add | Ra = Rb + Rc | Addition (64 bits) |
addws | Ra = Rb + Rc | Addition (word, sign-extend) |
addwz | Ra = Rb + Rc | Addition (word, zero-extend) |
sub | Ra = Rb − Rc | Subtraction (64 bits) |
subws | Ra = Rb − Rc | Subtraction (word, sign-extend) |
subwz | Ra = Rb − Rc | Subtraction (word, zero-extend) |
absd | Ra = abs(Rb − Rc) | Absolute difference (64 bits) |
absdw | Ra = abs(Rb − Rc) | Absolute difference (32 bits) |
mul | Ra = LOPART(Rb × Rc) | Multiply (the lower part of 128 bits) |
mulws | Ra = sext(Rb × Rc) | Multiply word, sign-extend |
mulwz | Ra = zext(Rb × Rc) | Multiply word, zero-extend |
mulhs | Ra = HIPART (Rb × Rc) | Signed multiplication (the high part from 128 bits) |
mulhu | Ra = HIPART (Rb × Rc) | Unsigned multiplication (the high part of 128 bits) |
div | Ra = Rb / Rc | Signed division |
divu | Ra = Rb / Rc | Unsigned division |
mod | Ra = Rb % Rc | The remainder of the signed division |
modu | Ra = Rb % Rc | The remainder of the unsigned division |
Bitwise instructions | ||
and | Ra = Rb AND Rc | Bitwise AND |
andn | Ra = NOT (Rb) AND Rc | Bitwise AND with inverse of the first operand |
or | Ra = Rb OR Rc | Bitwise OR |
orn | Ra = NOT (Rb) OR Rc | Bitwise OR with inverse of the first operand |
nand | Ra = NOT (Rb AND Rc) | Bitwise AND with the inverse of the result |
nor | Ra = NOT (Rb OR Rc) | Bitwise OR with result inversion |
xor | Ra = Rb XOR Rc | Bitwise XOR |
xnor | Ra = NOT (Rb XOR Rc) | Bitwise XOR with result inversion |
compare instructions (64 bit) | ||
cmpdeq | Ra = Rb == Rc | Comparison for equality |
cmpdne | Ra = Rb != Rc | Comparison of inequality |
cmpdlt | Ra = Rb < Rc | Signed comparison less |
cmpdle | Ra = Rb <= Rc | Signed less-equal comparison |
cmpdltu | Ra = Rb < Rc | Unsigned comparison less |
cmpdleu | Ra = Rb <= Rc | Unsigned less-or-equal comparison |
cmpdgt | pseudo instruction | permutation of arguments and cmpdlt |
cmpdge | pseudo instruction | permutation of arguments and cmpdle |
cmpdgtu | pseudo instruction | permutation of arguments and cmpdltu |
cmpdgeu | pseudo instruction | permutation of arguments and cmpdleu |
compare instructions (32 bit) | ||
cmpweq | Ra = Rb == Rc | Comparison for equality |
cmpwne | Ra = Rb != Rc | Comparison of inequality |
cmpwlt | Ra = Rb < Rc | Signed comparison less |
cmpwle | Ra = Rb <= Rc | Signed less-equal comparison |
cmpwltu | Ra = Rb < Rc | Unsigned comparison less |
cmpwleu | Ra = Rb <= Rc | Unsigned less-or-equal comparison |
cmpwgt | pseudo instruction | permutation of arguments and cmpwlt |
cmpwge | pseudo instruction | permutation of arguments and cmpwle |
cmpwgtu | pseudo instruction | permutation of arguments and cmpwltu |
cmpwgeu | pseudo instruction | permutation of arguments and cmpwleu |
Min/Max instructions | ||
mins | Ra = MIN (Rb, Rc) | Minimum (signed) |
minu | Ra = MIN (Rb, Rc) | Minimum (unsigned) |
maxs | Ra = MAX (Rb, Rc) | Maximum (signed) |
maxu | Ra = MAX (Rb, Rc) | Maximum (unsigned) |
Shift instructions | ||
sll | Ra = Rb << Rc | Left shift and zero expansion |
srl | Ra = Rb >> Rc | Right shift and zero expansion |
sra | Ra = Rb >> Rc | Right shift and sign extension |
srd | Ra = Rb >> Rc | Right shift as a signed division |
The architecture doesn't use bit flags to store comparison results and doesn't use them as implicit operands/results, as, for example, the Intel x86, SPARC, and IBM POWER architectures do. The comparison result, as a value of 0 or 1, is stored in a general register; in this sense POSTRISC is similar to the MIPS and Alpha architectures. Additionally, to shorten the critical data path, minimum/maximum instructions are provided (comparison and selection in one instruction).
These eight bitwise register-register instructions are enough to implement any binary logic function with a single instruction.
The shift value for the register-register shift instructions is defined as the lower bits of the third register: 5 bits (for 32 bit operations) or 6 bits (for 64-bit operations) or 7 bits (for 128 bit operations). High bits are ignored.
The shift-right-as-division instructions perform a right shift according to the rules for signed division. First, an arithmetic right shift is performed (with sign-bit extension). If the value being shifted is negative and nonzero bits were shifted out, the result is corrected by adding one. The instruction was introduced to quickly divide signed numbers by 2^shift according to C/C++-style division rules for negative numbers: the result is symmetric with respect to zero, and the remainder can be negative.
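For example, with the immediate forms (srai, srdi, described further below), dividing a negative value by 8 gives different results (a sketch; t1, t2, rb are illustrative register names and rb is assumed to hold -9):

srai t1, rb, 3    ; arithmetic shift: -9 >> 3 = -2 (rounds toward minus infinity)
srdi t2, rb, 3    ; dividing shift:   -9 / 8  = -1 (rounds toward zero, as in C/C++)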
Register-immediate arithmetic instructions: the first argument is the result register number, the second is the operand register number, the third is an immediate value 21 or 63 bits long, sign- or zero-extended to 64 bits. Instructions of this group allow the immediate to continue into the next bundle slot, forming a dual-slot instruction.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | imm21(63) |
Syntax:
INSTRUCTION_NAME ra, rb, simm
INSTRUCTION_NAME ra, rb, imm
Instruction | Operation | Description |
---|---|---|
Arithmetic instructions | ||
addi | Ra = Rb + imm | Addition |
subfi | Ra = imm − Rb | Subtraction from the immediate |
addiws | Ra = Rb + imm | Addition (32 bit signed) |
addiwz | Ra = Rb + imm | Addition (32 bit unsigned) |
subfiws | Ra = imm − Rb | Subtract from immediate (32 bit signed) |
subfiwz | Ra = imm − Rb | Subtract from immediate (32 bit unsigned) |
muli | Ra = LOPART (Rb × imm) | Multiplication (the lower part is 128 bits) |
mulwsi | Ra = sext(Rb × imm) | Multiply words, sign extension |
mulwzi | Ra = zext(Rb × imm) | Multiply words, zero extension |
divi | Ra = Rb / imm | Signed division |
divui | Ra = Rb / imm | Unsigned division |
modi | Ra = Rb % imm | The remainder of the signed division |
modui | Ra = Rb % imm | The remainder of the unsigned division |
Bitwise instructions | ||
andi | Ra = Rb & imm | Bitwise AND |
andni | Ra = not (Rb) & imm | Bitwise AND with register inversion |
ori | Ra = Rb | imm | Bitwise OR |
orni | Ra = not (Rb) | imm | Bitwise OR with register inversion |
xori | Ra = Rb xor imm | Bitwise XOR |
compare instructions (64 bit) | ||
cmpdeqi | Ra = Rb == imm | Comparison for equality |
cmpdnei | Ra = Rb != imm | Comparison of inequality |
cmpdlti | Ra = Rb < imm | Signed comparison less |
cmpdltui | Ra = Rb < imm | Unsigned comparison less |
cmpdgti | Ra = Rb > imm | Signed comparison more |
cmpdgtui | Ra = Rb > imm | Unsigned comparison more |
cmpdlei | Ra = Rb <= imm | Signed comparison less or equal (pseudo) |
cmpdleui | Ra = Rb <= imm | Unsigned comparison less or equal (pseudo) |
cmpdgei | Ra = Rb >= imm | Signed comparison more or equal (pseudo) |
cmpdgeui | Ra = Rb >= imm | Unsigned comparison more or equal (pseudo) |
compare instructions (32 bit) | ||
cmpweqi | Ra = Rb == imm | Comparison for equality |
cmpwnei | Ra = Rb != imm | Comparison of inequality |
cmpwlti | Ra = Rb < imm | Signed comparison less |
cmpwltui | Ra = Rb < imm | Unsigned comparison less |
cmpwgti | Ra = Rb > imm | Signed comparison more |
cmpwgtui | Ra = Rb > imm | Unsigned comparison more |
cmpwlei | Ra = Rb <= imm | Signed comparison less or equal (pseudo) |
cmpwleui | Ra = Rb <= imm | Unsigned comparison less or equal (pseudo) |
cmpwgei | Ra = Rb >= imm | Signed comparison more or equal (pseudo) |
cmpwgeui | Ra = Rb >= imm | Unsigned comparison more or equal (pseudo) |
Min/Max instructions | ||
minsi | Ra = smin (Rb, imm) | Minimum (signed) |
minui | Ra = umin (Rb, imm) | Minimum (unsigned) |
maxsi | Ra = smax (Rb, imm) | Maximum (signed) |
maxui | Ra = umax (Rb, imm) | Maximum (unsigned) |
For bitwise register-immediate instructions the immediate value is always sign-extended. Since the immediate can be inverted in advance, 5 instructions are enough instead of the 8 needed in the register-register form.
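For example, the missing immediate forms can be obtained by inverting the immediate in advance (an illustration; ra, rb are illustrative register names, and ~42 = -43 in two's complement):

andni ra, rb, -43    ; equals nor  ra, rb, 42:  NOT(rb) AND NOT(42)
orni  ra, rb, -43    ; equals nand ra, rb, 42:  NOT(rb) OR  NOT(42)
xori  ra, rb, -43    ; equals xnor ra, rb, 42:  rb XOR NOT(42)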
Binary register-immediate shift instructions. Shift or rotation instructions shift the value in the src register by a fixed number of bits given by shift. Syntax:
INSTRUCTION_NAME dst, src, shift
Here the first argument is the result register number, the second is the source register number, and the third is the immediate shift/rotate amount from 0 to 63.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | shift | 0 | opx |
Instruction | Operation | Description |
---|---|---|
slli | shift left logical immediate | Left shift and zero expansion |
srli | shift right logical immediate | Right shift and zero expansion |
srai | shift right algebraic immediate | Right shift and sign extension |
srdi | shift right dividing immediate | Right shift as a signed division |
cntpop | count population | Bit population |
cntlz | count leading zeros | Number of consecutive zeros in the most significant bits |
cnttz | count trailing zeros | Number of consecutive zeros in the least significant bits |
permb | permute bits | The bits permutation according to mask |
The instructions cntpop, cntlz, cnttz count ones/zeros within an interval of shift bits. cntpop gives the total number of ones; cntlz gives the length of the continuous run of zeros starting from the most significant bits, or shift + 1 if all bits are zero; cnttz gives the length of the continuous run of zeros starting from the least significant bits, or shift + 1 if all bits are zero.
The instruction permb (permute bits) reverses the order of bits/bytes in a register according to the immediate mask shift. The mask determines which levels of neighbors participate in the permutation: bits, bit pairs, nibbles (four bits), bytes, byte pairs, and the four-byte halves of the original 64-bit value. For example, the maximum mask 63 (all ones) permutes all levels (a complete bit reversal, as used for FFT), mask 1 swaps only adjacent bits, mask 32 swaps the two four-byte halves, mask 32 + 16 + 8 reverses the byte order (endianness) of the register, and mask 16 + 8 reverses the byte order within each four-byte half.
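For example (a sketch; t and src are illustrative register names):

permb t, src, 56    ; mask 32+16+8: reverse the byte order (endianness swap)
permb t, src, 32    ; swap the two 4-byte halves
permb t, src, 63    ; full bit reversal of the 64-bit value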
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | 0 | 0 | opx |
Instruction | Operation | Description |
---|---|---|
mov | move register |
Instruction mov (move register) copies data from one register to another.
Syntax:
mov ra, rb
Fused instructions have more than two input parameters and can perform two or more actions in one instruction.
Name | Operation |
---|---|
mov2 ra,rb,rc,rd | move 2 registers: gr[ra] = gr[rc], gr[rb] = gr[rd] |
addadd ra,rb,rc,rd | add and add: gr[ra] = gr[rb] + gr[rc] + gr[rd] |
addsub ra,rb,rc,rd | add and sub: gr[ra] = gr[rb] + gr[rc] − gr[rd] |
subsub ra,rb,rc,rd | sub and sub: gr[ra] = gr[rb] − gr[rc] − gr[rd] |
muladd ra,rb,rc,rd | multiply and add: gr[ra] = gr[rb] × gr[rc] + gr[rd] |
mulsub ra,rb,rc,rd | multiply and sub: gr[ra] = gr[rb] × gr[rc] − gr[rd] |
mulsubf ra,rb,rc,rd | multiply and sub from: gr[ra] = gr[rd] − gr[rb] × gr[rc] |
mbsel ra,rb,rc,rd | masked bit select: gr[ra] = gr[rb] ? gr[rc]: gr[rd] (bitwise) |
slp ra,rb,rc,rd | shift left pair |
srp ra,rb,rc,rd | shift right pair |
slsrl ra,rb,rc,rd | shift left and shift right logical |
slsra ra,rb,rc,rd | shift left and shift right algebraic |
sladd ra,rb,rc,shift | shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift) |
sladdws ra,rb,rc,shift | shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift) |
sladdwz ra,rb,rc,shift | shift left and add: gr[ra] = gr[rb] + (gr[rc] << shift) |
slsub ra,rb,rc,shift | shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb] |
slsubws ra,rb,rc,shift | shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb] |
slsubwz ra,rb,rc,shift | shift left and subtract: gr[ra] = (gr[rc] << shift) − gr[rb] |
slsubf ra,rb,rc,shift | shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift) |
slsubfws ra,rb,rc,shift | shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift) |
slsubfwz ra,rb,rc,shift | shift left and subtract from: gr[ra ] = gr[rb] − (gr[rc] << shift) |
slor ra,rb,rc,shift | shift left and or: gr[ra] = gr[rb] | (gr[rc] << shift) |
slxor ra,rb,rc,shift | shift left and xor: gr[ra] = gr[rb] ^ (gr[rc] << shift) |
srpi ra,rb,rc,shift | shift right pair immediate |
slsrli ra,rb,shift,count | shift left and shift right logical immediate |
slsrai ra,rb,shift,count | shift left and shift right algebraic immediate |
deps ra,rb,shift,count | deposit set: Insert a group of units |
depc ra,rb,shift,count | deposit clear: Insert a group of zeros |
depa ra,rb,shift,count | deposit alter: Change group of bits |
dep ra,rb,rc,shift,pos | deposit: deposit of parts from two registers |
rlmi ra,rb,shift,count,pos | extract a bit field of the given length/position and place it at position pos in the result |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | rc | rd | opx |
The instruction mov2 (move 2 registers) moves two registers at once. It may be used for swapping register values or simply to shorten the code path.
Fused shift-and-add instructions are intended to shorten the critical data path in address calculations. They combine in one machine instruction a left shift (by 0 to 7 bits) with an addition (or subtraction). An open question remains how to handle an overflow that may occur in the intermediate shift result but has no place in the final result.
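For example, the address of the i-th 8-byte element of an array can be formed and loaded in two instructions (a sketch; p, arr, i, x are illustrative register names):

sladd p, arr, i, 3    ; p = arr + (i << 3)
lddz  x, p, 0         ; x = mem8[p]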
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | rc | shift | opx |
The pair shift instructions slp (shift left pair), srp (shift right pair) and srpi (shift right pair immediate) shift a register pair as a whole to the left (right) by count bits and put the low part of the result in the result register. srp takes the count low bits from the second register and the high bits from the first. The first argument is the result register number, the second and third are the numbers of the pair of shifted operand registers, and the fourth is a register or immediate count from 0 to 63 giving the shift amount. These instructions can be used to implement many useful 64-bit operations: rotation by a fixed number of bits, left or right shifts, and extraction of part of a register.
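For example, a rotation can be expressed by passing the same register as both halves of the pair (a sketch; t and src are illustrative register names):

srpi t, src, src, 13    ; rotate src right by 13 bits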
Double shift instructions produce a left and then a right shift (with arithmetic or logical extension). They can be used to extract the bit portion from the register and other manipulations.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | shift | count | opx |
The dep (deposit) instruction combines the count least significant bits from the first operand register with the remaining bits from the second operand register: dep takes the high bits from the second register and the count low bits from the first. The first argument is the result register number, the second and third are the numbers of the combined registers, and the fourth parameter count is an immediate from 0 to 63 giving the size of the portion taken from the first register.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | src | shift | pos |
Direct deposit instructions copy from one register to another while changing a part of the register: deps (deposit set) inserts a block of ones, depc (deposit clear) inserts a block of zeros, depa (deposit alter) inverts a block of bits. The block is count bits long and starts after the first shift bits. If count+shift exceeds the register size (64 bits), the filling or inversion wraps around and continues from the beginning of the register. The first argument is the result register number, the second is the source operand register number, the third and fourth are the immediate values shift and count from 0 to 63.
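For example (a sketch; t and src are illustrative register names):

depc t, src, 12, 4    ; copy src with bits 12…15 cleared
deps t, src, 0, 8     ; copy src with bits 0…7 set to ones
depa t, src, 63, 1    ; copy src with the sign bit inverted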
The rlmi instruction extracts a portion of bits of a given length/position from the register and puts it at the specified position in the result register.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | shift | count | pos |
Conditional move instructions copy data from one of two registers depending on a condition (see the sketch after the table).
Syntax:
NAME ra, rb, rc, rd
Description:
ra = cond(rb) ? rc : rd
Instruction | Condition |
---|---|
cmovlsb | least significant bit is set |
cmovweq | word equal 0 |
cmovwlt | word less than 0 |
cmovwle | word less than or equal 0 |
cmovdeq | doubleword equal 0 |
cmovdlt | doubleword less than 0 |
cmovdle | doubleword less than or equal 0 |
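For example, an absolute value can be computed without a branch (a sketch; r, t, a are illustrative register names):

subfi   t, a, 0       ; t = 0 - a = -a
cmovdlt r, a, t, a    ; r = (a < 0) ? -a : a, i.e. |a|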
The 1st group of general-purpose register load/store instructions uses the base plus offset addressing mode. The first argument is the number of the loaded (stored) register target, the second is the base register number, the third is a 21-bit immediate offset disp. The instructions in this group allow the immediate disp to continue into the next slot of the bundle, forming a dual-slot instruction (63-bit offset). The offset disp, after sign extension, is added to the base register to produce a 64-bit effective address.
EA = gr[base] + sign_extend(disp)
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | disp21 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
continued disp (63 bits instead of 21) |
The 2nd group of general-purpose load/store instructions uses ip-relative addressing. The first argument is the number of the loaded (stored) register target, the second is an unsigned forward offset disp 28 bits long. The instructions in this group allow the immediate disp to continue into the next slot of the bundle, forming a dual-slot instruction (64-bit offset).
EA = ip + zero_extend(disp)
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | uimm28 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | continued disp (64 bits instead of 28) |
The 3rd group of general register load/store instructions uses the scaled indexed addressing mode. The first argument is the number of the loaded or stored register target, the second is the base register number base, the third is the index register index, next is the shift amount scale, and the last is a short offset disp 9 bits long, sign-extended to 64 bits. The instructions in this group allow the immediate disp to continue into the next slot of the bundle, forming a dual-slot instruction (51-bit offset).
EA = gr[base] + (SM(gr[index]) << scale) + sign_extend(disp)
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | index | scale | sm | disp |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
continued disp (51 bits instead of 9) |
The 4th group of load/store instructions uses base addressing with post-update of the base by an immediate stride. Arguments: target register, base register, signed immediate stride (10 bits). The instructions in this group allow the immediate stride to continue into the next slot of the bundle, forming a dual-slot instruction (52-bit stride).
For loads (ld[s]Nmia):
EA = gr[base]
tmp = MEM(EA)
gr[base] = gr[base] + sign_extend(stride)
gr[target] = tmp
For stores (stNmia):
EA = gr[base]
MEM(EA) = gr[target]
gr[base] = gr[base] + sign_extend(stride)
The 5th group of load/store instructions uses base addressing with pre-update of the base by the immediate stride. The arguments are the same as for post-update.
For loads (ld[s]Nmib):
EA = gr[base] + sign_extend(stride)
tmp = MEM(EA)
gr[base] = EA
gr[target] = tmp
For stores (stNmib):
EA = gr[base] + sign_extend(stride)
MEM(EA) = gr[target]
gr[base] = EA
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | stride | opx |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
continued stride (52 bits instead of 10) |
The signed immediate disp is added to the base to form the effective address. The signed immediate stride (non-zero, 10 bits) is added to the base to form the new base. For loads, if target is the same register as base, either the base update doesn't occur or the loaded value replaces the updated base. For stores, if target is the same register as base, the base update occurs after the old value has been stored to memory (see the sketch after the table below).
Size in bytes | Operation | Description, parameters | ||||
---|---|---|---|---|---|---|
1 | 2 | 4 | 8 | 16 | ||
ldbz | ldhz | ldwz | lddz | ldq | load | base with offset addressing:
INSN target,base,disp21 |
ldbs | ldhs | ldws | ldds | load signed | ||
stb | sth | stw | std | stq | store | |
ldbzr | ldhzr | ldwzr | lddzr | ldqr | load | ip-relative addressing:
INSN target,disp28 |
ldbsr | ldhsr | ldwsr | lddsr | load signed | ||
stbr | sthr | stwr | stdr | stqr | store | |
ldbzx | ldhzx | ldwzx | lddzx | ldqx | load | scaled indexed addressing:
INSN target,base,index,scale,disp |
ldbsx | ldhsx | ldwsx | lddsx | load signed | ||
stbx | sthx | stwx | stdx | stqx | store | |
ldbzmia | ldhzmia | ldwzmia | lddzmia | ldqmia | load | base update with immediate stride after memory access:
INSN target,base,stride |
ldbsmia | ldhsmia | ldwsmia | lddsmia | load signed | ||
stbmia | sthmia | stwmia | stdmia | stqmia | store | |
ldbzmib | ldhzmib | ldwzmib | lddzmib | ldqmib | load | base update with immediate stride before memory access:
INSN target,base,stride |
ldbsmib | ldhsmib | ldwsmib | lddsmib | load signed | ||
stbmib | sthmib | stwmib | stdmib | stqmib | store |
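For example, sequential accesses with automatic base update might look as follows (a sketch; t0, t1, p, q are illustrative register names):

lddzmia t0, p, 8    ; t0 = mem8[p], then p += 8 (post-update)
lddzmib t1, p, 8    ; p += 8 first, then t1 = mem8[p] (pre-update)
stdmia  t0, q, 8    ; mem8[q] = t0, then q += 8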
Unconditional branch instructions jump to the effective address. Additionally, the return address can be stored in a general register. Using predication, an unconditional jump can be turned into a conditional one.
Instruction | Operation | Description |
---|---|---|
jmp label | jump relative | ip-relative jump |
jmpr rb,rc | jump register indirect | base-relative jump |
jmpt rb,rc | jump table | jump to table-relative address |
jmptws rb,rc | jump table word signed index | jump to table-relative address |
jmptwz rb,rc | jump table word unsigned index | jump to table-relative address |
The relative branch form is a universal instruction for conditional or unconditional static branches or procedure calls to a relative address.
Relative branch instructions are encoded according to the LDAR rule: after the operation code there is a register for saving a possible return address and a 28-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±2 GiB in both directions from the current position for a one-slot instruction, and all of the available address space for a dual-slot instruction. The jmp instruction allows the immediate offset to continue into the next slot of the bundle, forming a dual-slot instruction.
ip = ip + 16 × simm
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | simm (28 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | extended label (60 bits instead of 28) |
The instruction jmpr (jump register indirect) is used to branch to a base address held in registers. When calculating the target address, jmpr discards the 4 least significant bits of the result, so the target address is always aligned to the beginning of a bundle.
ip = (gr[base] + gr[index]) & mask{63: 4}
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | base | index | 0 | opx |
The jmpt (jump table), jmptws, jmptwz (jump table word indexed) instructions are intended for table-driven select statements (the C switch statement with a continuous distribution of cases, preferably starting from zero). Traditionally, in most architectures, a table-driven switch uses a table of absolute addresses for the entry points of the cases; this table is private to each process (if the load base address differs). If the architecture supports relative addressing, the table of absolute addresses can be replaced by a table of relative offsets, shared by all processes and placed in the read-only data section.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | base | index | 0 | opx |
jmpt: ip = base + mem4[base + 4 × index]
jmptws: ip = base + mem4[base + 4 × sign_extend(index)]
jmptwz: ip = base + mem4[base + 4 × zero_extend(index)]
.text
    ; limit = 7
    bdgti selector, limit, default
    ldafr base, table
    jmpt  base, selector
label_0: ...
label_1: ...
...
label_7: ...
default: ...
.rodata
table:
    dw (label_0 - table)
    ...
    dw (label_7 - table)
Conditional branch instructions compute a condition and, if it is true, jump to the effective address. Traditionally (x86, x64, SPARC), a conditional branch is implemented with two instructions: a comparison (generating flags as the logical result) and a conditional branch (on those flags). However, conditional branches are very common in programs, so POSTRISC uses fused compare-and-branch instructions to compress code and shorten the critical data path (see the sketch after the table).
Instruction | Operation |
---|---|
bdeq ra, rb, label | branch if doubleword equal |
bdne ra, rb, label | branch if doubleword not equal |
bdlt ra, rb, label | branch if doubleword less than |
bdltu ra, rb, label | branch if doubleword less than unsigned |
bdle ra, rb, label | branch if doubleword less than or equal |
bdleu ra, rb, label | branch if doubleword less than or equal unsigned |
bdgt ra, rb, label | branch if doubleword greater than |
bdgtu ra, rb, label | branch if doubleword greater than unsigned |
bdge ra, rb, label | branch if doubleword greater than or equal |
bdgeu ra, rb, label | branch if doubleword greater than or equal unsigned |
bdeqi ra, simm, label | branch if doubleword equal immediate |
bdnei ra, simm, label | branch if doubleword not equal immediate |
bdlti ra, simm, label | branch if doubleword less than immediate |
bdgti ra, simm, label | branch if doubleword greater than immediate |
bdltui ra, uimm, label | branch if doubleword less than unsigned immediate |
bdgtui ra, uimm, label | branch if doubleword greater than unsigned immediate |
bweq ra, rb, label | branch if word equal |
bwne ra, rb, label | branch if word not equal |
bwlt ra, rb, label | branch if word less than |
bwltu ra, rb, label | branch if word less than unsigned |
bwle ra, rb, label | branch if word less than or equal |
bwleu ra, rb, label | branch if word less than or equal unsigned |
bwgt ra, rb, label | branch if word greater than |
bwgtu ra, rb, label | branch if word greater than unsigned |
bwge ra, rb, label | branch if word greater than or equal |
bwgeu ra, rb, label | branch if word greater than or equal unsigned |
bweqi ra, simm, label | branch if word equal immediate |
bwnei ra, simm, label | branch if word not equal immediate |
bwlti ra, simm, label | branch if word less than immediate |
bwgti ra, simm, label | branch if word greater than immediate |
bwltui ra, uimm, label | branch if word less than unsigned immediate |
bwgtui ra, uimm, label | branch if word greater than unsigned immediate |
bbs ra, rb, label | branch if bit set |
bbsi ra, shift, label | branch if bit set immediate |
bbc ra, rb, label | branch if bit clear |
bbci ra, shift, label | branch if bit clear immediate |
bmall ra, uimm, label | branch if mask all bits set |
bmany ra, uimm, label | branch if mask any bit set |
bmnone ra, uimm, label | branch if mask none bit set |
bmnotall ra, uimm, label | branch if mask not all bit set |
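For example, a single fused instruction replaces the usual compare-plus-branch pair (a sketch; x, y, flags and the labels are illustrative names):

bdlti x, 0, negative    ; branch if the doubleword in x is less than 0 (signed)
bweqi y, 42, found      ; branch if the word in y equals 42
bbsi  flags, 5, handle  ; branch if bit 5 of flags is set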
Relative branch instructions are encoded according to the BRC, BRCI, BRCIU, BBIT rules: after the operation code come the first compared register, the second compared register (or an immediate shift amount), and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. For a dual-slot instruction, the maximum distance increases to ±8 GiB on both sides of the current position.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | srcA | srcB | opx | label17 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | label30 |
The instructions bgt (blt), bge (ble), bgtu (bltu), bgeu (bleu) are pseudo-instructions that swap the order of the arguments and reduce to the corresponding «less than» instructions.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src | shift | opx | label17 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | label30 |
Relative branch instructions that use an immediate operand: after the operation code come the compared register, the immediate constant (signed or unsigned, 11 bits), and a 17-bit field encoding the signed offset relative to ip. This gives a maximum distance of ±1 MiB in both directions from the current position. For a dual-slot instruction, the maximum distance increases to ±8 GiB on both sides of the current position.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src | simm11 | label17 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
simm40 | label30 |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src | uimm11 | label17 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
uimm40 | label30 |
The loop control instructions optimize (by shortening the critical execution path) the most common forms of loops with a constant step. A loop control instruction adds the step (1 or −1, according to the loop condition) to the loop counter (the first register argument), compares the counter with the loop bound (the second register argument), and, if the continuation condition is true, makes a relative branch to the effective address (the label argument); see the example after the table below.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst/src | src | opx | label17 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | label30 |
Syntax:
INSTRUCTION_NAME ra, rb, label
A variant of the loop control instructions in which both register numbers are the same is a special case. The architecture specifies that in this case the old value of the register participates in the comparison (as the bound of the counter change). This can be used, for example, for branches taken on counter overflow.
Instruction | Operation |
---|---|
repdlt | Add 1 and branch if doubleword less (signed) |
repdltu | Add 1 and branch if doubleword less (unsigned) |
repdle | Add 1 and branch if doubleword less or equal (signed) |
repdleu | Add 1 and branch if doubleword less or equal (unsigned) |
repdgt | Add -1 and branch if doubleword greater (signed) |
repdgtu | Add -1 and branch if doubleword greater (unsigned) |
repdge | Add -1 and branch if doubleword greater than or equal (signed) |
repdgeu | Add -1 and branch if doubleword greater than or equal (unsigned) |
A similar style of loop implementation with minimal loop-management overhead is found on almost all DSP (digital signal processor) processors. Among general-purpose processors, a limited form (with a special iteration-counter register) is implemented in the IBM PowerPC and Intel Itanium architectures, while universal add-compare-branch instructions on general-purpose registers are available in the HP PA-RISC architecture (addb, addib), the DEC VAX architecture (aobleq, aoblss, sobgeq, sobgtr), and the IBM S/390 architecture (brct, bctr, bxle).
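For example, a simple counted loop summing an array might look as follows (a sketch; i, n, t, sum, arr are illustrative register names, with n assumed to hold the element count, n ≥ 1):

    ldi    sum, 0
    ldi    i, 0
loop:
    lddzx  t, arr, i, 3, 0    ; t = mem8[arr + (i << 3)]
    add    sum, sum, t
    repdlt i, n, loop         ; i += 1; branch back while i < n (signed)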
The instruction ldi (load immediate) loads a constant into a register (the high 64 bits are cleared). The first argument of ldi is the result register number, the second is an immediate value 28 bits long (sign-extended to 64 bits in the short form) or a full 64 bits (in the dual-slot form).
The instruction ldih (load immediate into high 64 bits) loads a constant into the upper part of a 128-bit register (the lower 64 bits remain unchanged). The first argument is the result register number, the second is an immediate value 28 bits long (sign-extended to 64 bits in the short form) or a full 64 bits (in the dual-slot form); see the example after the encoding below.
INSTRUCTION_NAME dst, simm
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | simm28 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | simm (extended to 64 bits) |
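For example (a sketch; t is an illustrative register name):

ldi  t, -1000    ; t{63:0} = 0xFFFFFFFFFFFFFC18 (sign-extended), t{127:64} = 0
ldih t, 7        ; t{127:64} = 7, the low 64 bits keep their value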
The instruction nop (no operation) serves a single purpose: code alignment, that is, filling the unused slots in instruction bundles, and enabling optimal instruction fetch from memory.
For example, if a label must be placed in the code, the compiler should pad the last (incomplete) bundle with dummy instructions if necessary and put the first instruction after the label into a new bundle (since a branch can target only the beginning of a bundle). Likewise, various implementations can gain performance if the target address of a frequently taken jump is aligned on a 32/64/128-byte boundary (not just the beginning of a bundle, but the beginning of a cache line).
This instruction should not be used for any other purpose. The architecture has no software-visible delays for data loads (load delays), conditional branches (branch delays), or pipeline hazards.
The nop instruction is processed at the fetch stage, but need not be fed to the subsequent pipeline stages (issue, retire), and by itself never causes an interrupt (detect stage). This instruction has no read or write dependencies.
The nop instruction is automatically added by the assembler to pad an incomplete instruction bundle when the next instruction must be placed in a new bundle (in the case of a label or a dual-slot instruction). The instruction has one immediate argument (unused).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | simm28 |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | extended simm (64 bits instead of 28) |
Undefined instruction codes are reserved and can be used for future extensions (new instructions). One instruction, undef, is specially defined as reserved forever. It can be added automatically by the assembler to fill an incomplete bundle after an unconditional jump, a function call, or a return from a function. It is also used to fill the tails of code segments. The instruction has one immediate argument (unused).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
Software exceptions support C++-like throw/try/catch exceptions and more general SEH-like exceptions. POSTRISC plans to support deterministic exception handling via frame-based unwinding with sufficient hardware support. Truly zero cost is expected on the no-exception path, with fast unwinding when an exception is thrown.
The 128-bit link register r0 preserves an 18-bit eip offset, which allows an alternate return point in case of an exception. The exception landing pad address should be located after the current return address, no further than 4 MiB away. The return instructions may jump to the usual return address or to the landing pad, depending on the exception state.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
return address | 0 | ri |
preserved caller future | eip offset | out-size | framesize |
alternate_retaddr = retaddr + 16 × ZEXT(eip_offset)
The special register eip always holds the address of the next proper part of the unwinding code. This register is automatically restored on a normal subroutine return and is modified during object construction and destruction. The special register eca holds the thrown value (usually the address of the thrown object).
Two return addresses are saved to the link register during a subroutine call: one for the normal return and one for the exception return. Because registers are 128 bits long, there is enough room for both; but since the frame info and the previous future vector must also be stored, the exception return address is stored as a positive offset from the normal return address. The exception landing pad should therefore be located after the function body, within 4 MiB.
So we don't need to return some optional pair (normal return value and optional exception info), and always do the check after each call for possible software exception. Excepted subroutine finally return directly to the proper next part of unwinding code.
The ehthrow instruction sets the special register eca to the value gr[src] + simm21, which usually should be the address of the exception context. Execution then transfers to the address in eip.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | src | simm21 |
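As a hedged sketch of a throw site (the register choice is arbitrary and the operand order follows the assembler syntax of the other examples in this document; it is an assumption, not a normative definition):

    ; %r20 is assumed to already hold the address of a constructed exception object
    ehthrow %r20, 0      ; eca = r20 + 0; execution transfers to the unwinding code at eip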
The ehadj instruction should be executed after the successful construction of an object that requires destruction. It checks the current eca context and jumps to the current eip if an exception is set. Otherwise, it adjusts the eip register to the new actual unwinding-code address and continues normally to the next instruction.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | simm (28 bits) |
The ehcatch instruction copies the exception context eca to a general register, clears eca, and adjusts eip to ip + offset × 16.
The ehcatch instruction should be executed before a catch block or before an object destructor. For a catch block it adjusts the eip register to the end of the catch block; before an object destructor it adjusts the eip register to the position after the destructor.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | dst | 0 | label17 (30) |
The ehnext instruction should be executed after the object destructor. It restores the exception context saved by ehcatch before the destructor call and checks for a possible double exception fault. If this is a second software exception raised while unwinding the first one, a hardware exception occurs. Otherwise, if this is a normal destructor call during unwinding of the first software exception, execution continues at the new eip address. Otherwise, if this is a normal destructor call without any unwinding, execution continues with the next instruction.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | src | 0 | label17 (30) |
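To show how these instructions work together, here is a hedged sketch of a single scope that constructs one object, calls a function that may throw, and then destroys the object. The function names, labels, and register choices are hypothetical; only the instruction roles follow the descriptions above.

    callr   %r40, construct_X        ; X is now alive
    ehadj   destroy_X                ; from here, unwinding must first destroy X
    callr   %r40, may_throw          ; on a throw, control resumes at eip = destroy_X
    ...                              ; normal path falls through to the shared destructor block
destroy_X:
    ehcatch %r41, after_destroy_X    ; save eca in r41, clear it, point eip past the destructor
    callr   %r40, destroy_X_dtor     ; the destructor runs on both the normal and the unwinding path
    ehnext  %r41, after_destroy_X    ; restore eca; if an exception was pending, continue unwinding
after_destroy_X:
    ...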
Traditionally (in most architectures) the register file is a global resource: all registers are visible to all procedures of a program. If a procedure wants to use a register, the contents of that register must be stored to memory and later restored from memory. The work of saving/restoring registers is usually divided between the procedure that makes the call (caller) and the one that is called (callee). For example, the first 14 of 32 registers might have to be saved by the caller, while the remaining 18 must be saved by the callee. The optimal split depends on the processor architecture (the number of registers and their orthogonality), and for new architectures it is usually determined experimentally by comparing code efficiency for different variants over a statistically large codebase.
Within one procedure, register usage can be optimized well, but across several procedures, especially when they are compiled separately, register usage becomes suboptimal. A typical example of extreme inefficiency is a recursive procedure. Even if a recursive procedure uses only one of the N available registers, each recursive call wants to use exactly that specific register, so this register is repeatedly spilled and filled despite the presence of many unused registers.
Summing up, a significant percentage of all memory accesses are register spill/fill operations that are, in essence, unrelated to useful work. This fraction does not depend much on the total number of registers, because procedure code is bound to specific registers. So increasing the number of registers, although it helps large and complex procedures, does not reduce inter-procedure register saving and restoring at all. This fraction grows as the number of procedures increases and their average size decreases (as is usual for object-oriented programming languages).
The solution to this problem is hardware register rotation. Registers are no longer a global resource: each called procedure gets its own working subset of registers. Saving/restoring registers is not required as long as the working sets of the called procedures fit in the register file.
For example, in the POSTRISC architecture the file of 128 general-purpose registers visible to a procedure is divided into two subsets: up to 120 rotatable (stackable) registers r0 - r119, locally visible only to the current procedure, and 8 static registers g0 - g3, tp, fp, sp, gz, globally visible to all procedures. The register stack mechanism is implemented through circular renaming of registers as a side effect of procedure calls and returns; the renaming mechanism is not visible to the program. There are 128 rotating registers in the hardware circular buffer, a power of two, which makes the cyclic remainder easy to compute. In total, the logical general-purpose register file has 136 (128 + 8) registers, of which at most 128 are simultaneously available to the program.
Static registers must be saved and restored at procedure boundaries in accordance with the calling conventions (ABI). Stackable registers are saved and restored automatically by the corresponding hardware mechanism, without explicit participation of the program. All other register files are visible to all procedures and must be saved/restored programmatically in accordance with the calling conventions.
circular register buffer (128) | Global (8) |
---|---|
local A | not available | global |
not available | local B | not available | global |
not available | local C | not available | global |
local D (cont) | not available | local D | global |
not available | local E | not available | global |
The diagram above shows how the hardware buffer of local rotating registers is used. Five procedures A, B, C, D, E call each other, pass call arguments through the register buffer, and place their local variables in it. When the hardware circular buffer is exhausted (in procedure D), registers are flushed onto the register stack in memory and the buffer is reused from the beginning. Of course, not the entire buffer is flushed, only as many registers as are needed to create the new frame.
clean | dirty | local | invalid | clean |
In general, the register buffer contains the following five parts (order matters):
Part | Description |
---|---|
clean | these registers belong to inactive frames and have already been flushed to the register stack in memory, but have not yet been reused by other frames (this zone appears when registers are spilled to memory eagerly or restored from memory ahead of time) |
dirty | these registers belong to inactive frames and have not yet been flushed to the register stack in memory (they must be flushed to memory before being reused by other frames) |
local | these are the local registers of the active frame |
invalid | garbage left over from past procedure calls, or registers that have never been used (they can be used to expand the current active frame, to create a new active frame, or to expand the clean zone when returning from procedures or when reading registers back from memory ahead of time) |
Local (120) | Global (8) |
---|---|
local A | not available | global |
local B | not available | global |
local C | not available | global |
local D | not available | global |
local E | not available | global |
Each procedure «sees» only its local registers, and the first physical local register is visible under the logical number r0.
The diagram below shows an example of working with the register stack (a short code sketch follows the table). Initially the current function has a register frame of 17 registers (r0 - r16). It uses the last 5 of them (r12 - r16) to place the parameters for calling the next function. On the call, the return address is written into the first parameter register (r12), together with the number of preserved registers and the output frame size (12 and 5 here): these numbers are packed together with the return address into the link register. The register number used for the return address, which is also the boundary between the preserved registers and the output parameters, is specified in the call instruction.
After the call, the second function has at its disposal a register frame of 5 registers. The return address is visible in register r0. The second function then expands its register frame to the number of registers needed for local computation (up to 10 registers here).
On completion, the second function restores the preserved part of the first function's frame and gives the parameter registers back to it. The number of registers to return is specified in the return instruction and, according to the ABI, must match the number of incoming parameter registers.
physical numbering | caller function registers | callee registers immediately after the call (input parameters) | callee extends the register frame | caller registers after returning |
---|---|---|---|---|
0 | are hidden | are hidden | are hidden | are hidden |
1 | ||||
2 | ||||
3 | ||||
4 | r0 | r0 | ||
5 | r1 | r1 | ||
6 | r2 | r2 | ||
7 | r3 | r3 | ||
8 | r4 | r4 | ||
9 | r5 | r5 | ||
10 | r6 | r6 | ||
11 | r7 | r7 | ||
12 | r8 | r8 | ||
13 | r9 | r9 | ||
14 | r10 | r10 | ||
15 | r11 | r11 | ||
16 | r12 | r0 | r0 | r12 |
17 | r13 | r1 | r1 | r13 |
18 | r14 | r2 | r2 | r14 |
19 | r15 | r3 | r3 | r15 |
20 | r16 | r4 | r4 | r16 |
21 | not available | not available | r5 | not available |
22 | r6 | |||
23 | r7 | |||
24 | r8 | |||
25 | r9 | |||
26 | not available | |||
27 | ||||
28 | ||||
29 | ||||
30 |
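A hedged assembly sketch of the scenario in the table above (function names and constants are hypothetical; alloc and ret are described later in this chapter):

caller:
    alloc   17                ; caller frame is r0..r16
    ...
    ldi     %r13, 1           ; first outgoing parameter (the callee will see it as r1)
    ldi     %r14, 2           ; second outgoing parameter (callee r2)
    callr   %r12, callee      ; link info goes to r12; r0..r11 are preserved, r12..r16 become callee r0..r4
    ...                       ; after the return the caller again sees its frame r0..r16

callee:
    alloc   10                ; extend the 5-register input frame to 10 registers (r0..r9)
    ...
    ret                       ; restore the caller frame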
The special register rsc (register stack control) stores information about the state of the circular register buffer and the current active frame of local general-purpose registers.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | ndirty | soc | bof | sof |
Four fields hidden from direct access store the positions and sizes of the register zones of the rotation buffer. Their sizes depend on the implementation (the size of the register ring buffer), except for sof, whose size is always 7 bits. For example, for a buffer of 128 registers each position takes 7 bits, for 256 registers 8 bits. The field sof (size of frame) is the size of the current active frame (possibly empty); the field bof (bottom of frame) is the position in the buffer of the beginning of the active frame and of the border with the dirty section; the field soc (size of clean) is the size of the clean section; the field ndirty is the number of dirty registers.
The special register rsp (register stack pointer) contains the memory address where the next local register will be saved when the hardware circular register buffer overflows. Since the address must be aligned on the 16-byte register size boundary for register spilling/filling, the lower 4 bits of rsp are always zero and writes to them are ignored. A specific implementation may spill/fill registers in aligned groups of 2-16 registers at a time to optimize memory traffic, so additional least significant bits of the register may also be fixed at zero.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
address | 0 | 0 |
In the Berkeley RISC research project, where register rotation was (probably) first applied, only eight of the 64 existing registers were visible to the program. The full set of 64 registers is called the register file, and a portion of eight a register window. The file allows up to eight nested procedure calls with their own register sets. As long as the program does not build a call chain longer than eight calls, registers never have to be stored to RAM, which is far slower than register access. For many programs a chain of six calls is enough.
A direct descendant of the Berkeley RISC project is Sun Microsystems' SPARC (UltraSPARC) architecture. Compared to the prototype, this processor provides simultaneous visibility of four sets of eight registers each (32 simultaneously visible registers). Of these, 8 are global and 24 are windowed. Three sets of eight registers each are implemented as the «register window». The eight registers i0…i7 are the inputs of the current procedure, eight registers l0…l7 are local to the procedure of the current level, and eight registers o0…o7 are the outputs for calling the next-level procedure. When a new procedure is called, the register window shifts by sixteen registers, hiding the old input and local registers and making the output registers of the current procedure the input registers of the new one. Additionally, eight registers g0…g7 are globally visible to procedures at all levels.
Unfortunately, the frame size and the number of output registers are fixed in SPARC. It is also unfortunate that flushing register windows pushed out of the register stack to memory is implemented through interrupts, and that the spill area is not separated from the regular stack of automatic objects.
In the AMD 29000 architecture (64 global and 128 window visible registers), the register rotation design was further refined with variable-sized windows, which helps resource utilization in the general case, when fewer than eight registers are needed to call the procedure. A second separate stack for saving registers was also implemented.
Register rotation was used in the architecture of Intel 80960 (i960) processors for embedded applications (32 visible registers, of which 16 global and 16 windowed, with a fixed rotation step of 16 registers).
The most recent implemented processor known to use register rotation is the Intel Itanium (IA-64 architecture). It has 128 registers, of which 32 are static and 96 are windowed. A frame of any size from 0 to 96 registers with any number of output registers can be set. To spill registers to memory without interrupting the processor, an asynchronous hardware mechanism is implemented. The spill goes to a separate (second) stack, which grows towards the main stack and is not explicitly visible to the user program. Both stacks share the same memory array.
Register rotation is also used in the architecture of the educational processor MMIX, which replaced the legacy MIX processor in the examples for new editions of Donald Knuth's «The Art of Computer Programming». The MMIX architecture has a register file of 256 registers visible to the program, allows a variable window size, and even allows the boundary between the global and rotating registers visible to the program to be changed dynamically, which is usually not done in real architectures.
Because the POSTRISC architecture uses hardware register rotation, the execution of call/return instructions is closely tied to the operation of the circular buffer of local registers. When a procedure is called, the current frame of local registers is partially preserved; on return from the procedure, the previous frame is restored.
POSTRISC may be extended in the future with wide SIMD facilities (256 or even 512 bits) using register pairs/groups for SIMD. Such SIMD register pairs/groups should not cross a register frame boundary. The register frame base (bottom of frame) and the preserved frame size should be a multiple of the register pair/group size (2 or 4) to guarantee SIMD register pair/group alignment. The link info may be stored only in an even (or multiple-of-4) register to guarantee register pair alignment. Currently, only 2-register alignment is required for the frame size.
The procedure call instructions callr, callri, callmi, callmrw, and callplt perform similar actions; they differ only in how the target call address is computed.
The first argument of all call instructions is the register in which the return address and other link info will be stored. All local registers with lower numbers, from r0 up to but not including this register, are hidden after the register window rotation. All currently allocated local registers, starting from the register specified in the instruction, become the initial frame of the new procedure.
Then the branch effective address is calculated (differently for different instructions). The return address along with the current frame info is stored in the link register.
Then the register window is rotated, the frame of local registers is partially preserved, and the branch to the target address is performed. The new procedure always sees its return address and the previous frame info in the first rotated register r0, and the input parameters in the following registers r1, r2, …
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
return address | 0 | ri |
preserved caller future | eip offset | out-size | framesize |
If the call instruction is the last in its bundle, it saves the return address as a pointer to the next bundle, and the stored slot number ri is set to zero. If the call instruction is not the last in the bundle, the current bundle address and the next slot number ri are saved.
In general, returning to the middle of a bundle may be less optimal but saves code size: the processor fetches and executes the whole bundle anyway, but discards the first ri instructions. For better performance, the bundle containing the call instruction may be padded with dummy nop instructions so that the call instruction is shifted to the end of the bundle. There is a corresponding compiler command-line parameter to choose between «dense» and «aligned» calls.
For example, dense calls:
    ldi     %r33, 1234      ; r33 is future r1 (param for myfunc)
    callr   %r32, myfunc    ; r32 is future r0 (link info)
    callr   %r32, myfunc2
    callr   %r32, myfunc3
For example, aligned call:
    ldi     %r33, 1234      ; r33 is future r1 (param for myfunc)
    nop     0
    callr   %r32, myfunc    ; r32 is future r0 (link info)
    ; this is next bundle and aligned return address
    add     %r34, %r12, %r12    ; next instruction after return from myfunc
    sub     %r14, %r22, %r11
Instruction | Description |
---|---|
callr dst,label | ip-relative call |
callri dst,base,index | call register indirect |
callmi dst,base,disp11 | memory-indirect call, base addressing |
callmrw dst,base,disp11 | memory-indirect call, word, base relative addressing |
callplt dst,uimm28 | call procedure linkage table: indirect, relative addressing |
alloc framesize | allocate register stack frame |
allocsp framesize,uimm21 | allocate register stack frame, update SP |
ret | return from the subroutine |
retf uimm21 | return from the subroutine, update SP |
The callr instruction (call relative) makes a procedure call using ip-relative addressing with a 28-bit signed immediate offset. This gives a maximum reach of ±2 GiB from the current position for the one-word form of the instruction. A long form of the instruction is also implemented.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | simm (28 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | simm (60 bits) |
EA = ip + 16 × simm
call (EA)
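A hedged worked example of this computation: if the target label lies 256 bundles (4 KiB) after the current bundle, the assembler encodes simm = 256 and the hardware computes EA = ip + 16 × 256 = ip + 4096.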
The callri instruction (call register indirect) takes the procedure call address from registers: the branch address is calculated as base plus index. The callri instruction discards the 4 least significant bits of the address, so the call target is always aligned to the beginning of a bundle.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | index | 0 | opx |
EA = (gr[base] + gr[index]) & mask {63:4},
call (EA)
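A hedged sketch of an indirect call through a computed address (the register allocation is arbitrary):

    ; %r7 holds a function address, %r8 holds an additional byte offset (for example, a precomputed table offset)
    callri  %r32, %r7, %r8    ; call to (r7 + r8) with the low 4 bits cleared; link info in r32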
The callmi instruction (call memory indirect) takes the callee address from memory using base+displacement addressing. The instruction discards the 4 least significant bits of the loaded value, so the obtained address is always aligned to the beginning of a bundle. The instruction is intended for loading from address tables with additional checks for the finalized state of the virtual page. Vtables should be relocated by the linker and set as finalized to disable future access-rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function pointer tables) with up to 1024 items.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | simm10 | opx |
EA = gr[base] + sign_extend(simm10)
EA = mem8 (EA)
EA = EA & mask {63:4},
call (EA)
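A hedged sketch of a vtable-style dispatch (the register holding the vtable base and the slot offset are hypothetical):

    ; %r8 is assumed to already hold the finalized vtable base address
    callmi  %r32, %r8, 24     ; load the function pointer at vtable+24 (the 4th 8-byte slot) and call it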
The callmrw instruction (call memory indirect relative word) takes the callee's relative offset from memory using base+displacement addressing. This offset is used to compute the callee address relative to the base address. The instruction discards the 4 least significant bits of the computed value, so the obtained address is always aligned to the beginning of a bundle. The instruction is intended for loading from address tables with additional checks for the finalized state of the virtual page. Vtables should be relocated by the linker and set as finalized to disable future access-rights changes (hardware-assisted one-way relro). The 10-bit displacement is enough to support vtables (or other function pointer tables) with up to 1024 items.
EA = gr[base] + sign_extend(simm10)
offset = mem4(EA)
EA = (base + offset) & mask {63:4},
call (EA)
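A hedged worked example, reading base above as gr[base]: if gr[base] holds the address of a table of 4-byte relative offsets and the word loaded at gr[base] + simm10 contains 0x200, the call target is gr[base] + 0x200 with the low 4 bits cleared.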
The callplt instruction (call procedure linkage table) takes the call address from memory using ip-relative addressing. The instruction discards the 4 least significant bits of the loaded value, so the obtained address is always aligned to the beginning of a bundle. The instruction is intended for loading from address tables with additional checks for the finalized state of the virtual page. Import tables should be relocated by the linker and set as finalized to disable future access-rights changes (hardware-assisted one-way relro).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | uimm28 |
EA = ip + zero_extend(uimm28)
EA = mem8 (EA)
EA = EA & mask {63:4},
call (EA)
The ret and retf instructions (return from subroutine) are used to return control from a procedure. They also restore the caller's register window state; retf additionally rolls back a fixed-size stack frame.
Unlike other branch instructions, these instructions may use special hardware structures to predict the branch target address. While the branch target buffer is generally used for branch address prediction, ret instructions can additionally be predicted (with better accuracy) by a hardware return address stack, a short stack of saved return addresses.
While restoring the previous frame state, the ret instructions may load part or all of the previous frame from memory if necessary (when the circular hardware register buffer has overflowed). The instruction may return control before the restore from memory is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet recovered from memory will be delayed until the recovery is performed (via the register scoreboard mechanism).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | 0 | uimm21 (63) |
The link register is an implicit argument of both ret instructions: it is the first local register of the current function and provides the return address and the previous frame info. The argument of retf is the displacement used for the optional stack rollback (it may be 0). The instruction may cause an error if the link register contains broken frame info and there is no room in the local registers for the outgoing and preserved frame parts of the previous procedure, since the maximum frame size is 120 registers.
After a call, the callee procedure obtains the remaining frame part of the calling procedure starting from the link register (the parameters and possibly slightly more). If the callee wants to increase the size of its register frame, it should use the alloc (allocate register stack frame) instruction. The parameter of the instruction is the new frame size; the frame may span r0 up to at most r119.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | framesize | 0 |
If there is not enough free space in the hardware buffer of rotating registers to accommodate the new frame, the alloc instruction flushes registers of previous functions' frames onto the register stack in memory. The instruction may return control before the flush is complete, but the architecture guarantees that attempts by subsequent instructions to use local registers not yet flushed to the stack will be delayed until the flush is done (via the register scoreboard mechanism).
The new eip is set from reip. The reip register should point to a simple universal function epilog consisting of just a ret instruction. This epilog should live in the highest corresponding usermode/kernel region. The reip register should be set up at thread start.
The following minimal program for the virtual processor demonstrates the use of the callr, alloc, and ret instructions.
.text
; at the beginning of the program, the register stack is empty
    alloc   54              ; expand frame to 54 registers
    ehadj   endfunc
    ldi     %r47, 1         ; will be saved when called
    ldi     %r53, 3         ; first argument
    ldi     %r52, 2         ; second argument
    ldi     %r51, 1         ; third argument
    ; func procedure call, all registers up to 50 will be saved,
    ; return address, eip, frame size (50) are saved in r50
    callr   %r50, func
    ; at this point, after returning, the frame will be again 54
    halt
func:
    ; at the starting point, the func procedure has a 4-register frame
    ; their previous numbers are 50, 51, 52, 53, new - 0, 1, 2, 3
    ; extend the frame to 10 registers (plus regs 4,5,6,7,8,9)
    alloc   10
    write   "r0 = %x128(r0)"    ; print packed return info
    write   "r1 = %i64(r1)"     ; print 1st argument
    write   "r2 = %i64(r2)"     ; print 2nd argument
    write   "r3 = %i64(r3)"     ; print 3rd argument
    ret
endfunc:
.end
Result of execution:
r0 = 000000010000c232_fffffffff1230020
r1 = 1
r2 = 2
r3 = 3
Here 0xfffffffff1230020 is the return bundle address, and 0x0000c232 packs the previous frame size (50 registers), the output frame size (3 parameters plus link), and the offset between the return address and the previous eip exception return address (the endfunc label). 0x00000001 is the previous future mask; it is nonzero because callr is the middle of the 3 instructions in its bundle, so we return to the middle of the bundle and skip one instruction.
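As a hedged worked decode of 0x0000c232, assuming 7-bit framesize and out-size fields below the 18-bit eip offset: 0xc232 & 0x7F = 50 (previous frame size), (0xc232 >> 7) & 0x7F = 4 (output size: 3 parameters plus link), and 0xc232 >> 14 = 3, i.e. the endfunc landing pad lies 3 bundles (48 bytes) past the return address.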
The allocsp instruction is introduced for code compression. Its function is similar to alloc, but it additionally allocates a regular stack frame: allocsp adjusts sp downwards by the immediate size.
alloc framesize
allocsp framesize, uimm21
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | framesize | uimm21 (63) |
The type of prolog/epilog depends on whether the function can raise exceptions, whether it allocates local registers, and whether it allocates a fixed-size or variable-size stack frame. The cases are illustrated below.
In the examples below r1…r5 are arguments and r6…r10 are optional local registers. The stack frame grows downwards, toward lower addresses.
The simplest function doesn't allocate local registers (it uses arguments only) and doesn't allocate a stack frame, and no instruction inside it generates software/hardware exceptions (it never touches memory, divides, etc). Then the following is enough:
    insn    # can't fail, can use only args r1..r5
    ...
    insn    # can't fail, can use only args r1..r5
    ret
The next function can generate software/hardware exceptions but doesn't allocate local registers (it uses arguments only) and doesn't allocate a stack frame. Here, in case of an exception, control can be transferred to eip, so a proper eip is needed before execution.
The special register reip is introduced so as not to bloat the code with multiple copies of the standard universal epilog, which consists of only a ret instruction. It stores the address of such an epilog. reip is initialized to an available standard universal epilog at runtime, at thread start.
Each call instruction sets eip from a copy of reip, so we need not worry about a proper eip just after a call. So even if instructions may fail, no additional setup is needed at function start.
    insn    # can fail, can use only args r1..r5
    ...
    insn    # can fail, can use only args r1..r5
    ret
The next function doesn't allocate a stack frame but does allocate local registers. The alloc instruction here performs the local register allocation. The register allocation may trigger register spilling to memory, so it may fail and trigger a hardware exception. But again, because eip stores a copy of reip, we need not worry about eip.
    alloc   11
    insn    # can fail, can use r1..r5 and r6..r10
    ...
    ret
The next function allocates local registers and a fixed-size stack frame. In this case we need to set a new eip, before execution, to the label before the return, for proper traditional stack unwinding. The stack frame should be no bigger than the page size, so that we don't touch the page beyond the stack guard page.
    std     %gz, %sp, -frame_size_immediate   # touch new stack frame
    allocsp 11, frame_size_immediate
    ehadj   before_return                     # immediately after allocsp
    ...
    insn                                      # can fail, can use r1..r10
    ldwz    %r7, %sp, +offset                 # using sp for local frame addressing
    ...
before_return:
    addi    %sp, %sp, frame_size_immediate
    ret
The next function allocates local registers and a fixed-size stack frame. The stack frame is bigger than the page size, so proper guard-page extension via store probing is required.
    # guard page probing for frame size bigger than pagesize
    std     %gz, %sp, -page_size * 1
    std     %gz, %sp, -page_size * 2
    ...
    std     %gz, %sp, -page_size * n
    # allocation only after probing
    allocsp 11, frame_size_immediate
    ehadj   before_return                     # immediately after allocsp
    ...
    insn                                      # can fail, can use r1..r10
    ldwz    %r7, %sp, +offset                 # using sp for local frame addressing
    ...
before_return:
    addi    %sp, %sp, frame_size_immediate
    ret
The before_return block:
    ...
before_return:
    addi    %sp, %sp, frame_size_immediate
    ret
may be changed to one retf instruction:
...
before_return:
retf frame_size_immediate
and, if there is space in the previous bundle, retf may be copied into it, so the before_return block can potentially be amortized over several functions with the same frame size:
    ...
    retf    frame_size_immediate
before_return:
    retf    frame_size_immediate
The next function allocates local registers and a variable-size stack frame (it uses variable-length arrays or the alloca function), possibly with an initial size of more than a page. In this case there are two rollback points: one for a failure in local register allocation and one for a failure in the initial stack allocation. The sp register can't be used to access the local stack frame (because of the variable frame size), so a local temporary register (r6 in the example) is used to save/restore the old sp value, addressed with negative offsets.
    # optional guard page probing for frame size bigger than pagesize
    std     %gz, %sp, -page_size * 1
    std     %gz, %sp, -page_size * 2
    ...
    std     %gz, %sp, -page_size * n
    # allocation only after probing, r6 is allocated on the fly
    allocsp 11, initial_frame_size_immediate
    addi    %r6, %sp, initial_frame_size_immediate
    ehadj   before_return                     # immediately after saving fp in r6
    ...
    insn                                      # can fail, can use r1..r10
    ldwz    %r7, %r6, -offset                 # using r6 for local frame addressing
    # alloca or VLA
    # optional guard page probing for big frame size
    std     %gz, %sp, -page_size * 1
    std     %gz, %sp, -page_size * 2
    ...
    std     %gz, %sp, -page_size * m
    # allocation only after probing
    sub     %sp, %sp, additional_frame_size
    # end of alloca or VLA
    stw     %r7, %r6, -offset                 # using r6 for local frame addressing
    ...
before_return:
    mov     %sp, %r6
    ret
One alloc instruction, along with the procedure call and return instructions, is in principle enough for user programs to handle the register stack. But system software that handles interrupts, returns from interrupts, context switching, and register stack initialization needs a few more instructions.
The parameterless instruction rscover (register stack cover frame) is used to put the last (active) frame of the register stack into the dirty state (registers belonging to inactive procedure frames). After executing this instruction, the size of the active frame of local registers is zero. This instruction prepares the register stack for subsequent disconnection or switching.
The parameterless instruction rsflush (register stack flush) is used to flush all inactive frames of the register stack to memory (transfer from the dirty state to the clean state). After executing this instruction, the register stack can be disabled without fear of data loss.
The parameterless instruction rsload (register stack load) is used to load from memory the last inactive frame of the register stack and make it ready for activation. After executing this instruction, the register stack is ready to work (a group of clean registers appears in it).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
The ABI defines the standard relationship between functions, namely stack frame layout, register usage, and parameter passing.
The standard function call convention applies only to global functions. Local functions (not visible from other object files) may use a different convention, as long as it does not prevent correct recovery after an exception.
The register usage convention for standard function calls divides all the global registers available to the program into two categories: saved (preserved) and non-saved (scratch) registers.
Preserved registers are guaranteed to be saved across a procedure call. The called procedure (callee) guarantees that the contents of such a register survive a normal return: it either doesn't touch the register, or saves its contents somewhere and restores them before returning.
Scratch registers may not be preserved across a procedure call. The calling procedure (caller) must store the contents of such a register in memory (on the stack) or in another, preserved register if it doesn't want to lose them across the call. The callee may use such a register for its own needs without restriction.
The architecture provides 128 general purpose registers of 128 bits, and several 64/128 bit special purpose registers. General purpose registers are divided into global (static) and rotatable. The following table shows how registers are used.
registers | volatility |
---|---|
sp | stack pointer, preserved. The address of the top of the stack must be aligned on a 16-byte boundary. It must always point to the most recently allocated stack frame, growing down toward lower addresses. The word at this address always points to the previously allocated stack frame. If required, it may be decreased by the called function. The stack pointer must be updated atomically, with a single instruction, so that there is no window in which an interrupt could observe a partially updated stack. |
tp | thread pointer, preserved. This register stores the base address of the TDATA segment of the main program module. |
r0 | link register, saved automatically by the register rotation mechanism. |
r1-r32 | used to pass parameters to the called function (not preserved). Registers r1 and r2 hold the return value. |
Static registers g0-g7 must retain their values across function calls. Functions that use these registers must save their values before changing them and restore them before returning.
External signals can interrupt the instruction flow at any time. Functions called during signal processing have no special restrictions on their use of registers. Moreover, when the signal handling function returns control, the process resumes with correctly restored registers. Therefore, programs and compilers are free to use all registers above, except those reserved for system use, without fear that signal handlers will inadvertently change their values.
The operating system provides each thread with its own stack, in which data is placed from both ends. The stack of rotated registers grows from the bottom toward higher addresses; it is managed by hardware and is not visible to the ABI. The usual stack of local software objects grows from the top down toward lower addresses. Each frame corresponds to an activation record of a procedure in the call chain. The stack pointer sp always points to the first byte past the top of the stack. A stack frame must be aligned on a 16-byte boundary and must be a multiple of 16 bytes in size.
The last function in a call chain, which itself calls no one, may have no frame of its own. Such functions are called leaf (terminal) functions in the graph of dependencies between functions. All other functions must have their own stack frame on the dynamic stack. The following figure shows the organization of the stack frame. sp in the figure denotes the stack pointer of the called function after it has executed the code that sets up the stack frame.
Stack frame organization
highest address
    +--> Frame header (return address, gp, rsc)
    |    Register storage area (aligned on a 16-byte boundary)
    |    Local variable space (aligned on a 16-byte boundary)
sp -+--> Header of the next frame (sp + 0)
lowest address
The following requirements apply to the stack frame:
The stack frame header consists of a pointer to the previous frame (link info) and storage areas for rsc, lp, and gp, 32 bytes in total. The link info always contains a pointer to the previous frame on the stack. Before function B calls another function C, it must save the contents of the link register received from function A in the lp storage area of function A's stack frame, and must set up its own stack frame.
Except for the stack frame header and padding for 16-byte alignment, a function should not allocate space for areas it doesn't use. If a function doesn't call other functions and doesn't need anything else from the stack frame, it need not set up a stack frame at all. The parameter save area follows the stack frame header; the register save area should not contain any padding.
For RISC-type machines (with many registers) it is generally more efficient to pass arguments to called functions in registers (floating-point and general-purpose), rather than constructing an argument list in memory or pushing them onto the stack. Since all calculations must be performed in registers anyway, extra memory traffic can be eliminated if the caller computes the arguments into registers and passes them in the same registers to the called function (callee), which can immediately use them for its own computations. The number of arguments that can be passed this way is limited by the number of registers available in the processor architecture.
For POSTRISC, up to 16 parameters are passed in general registers and are visible in the callee's new frame in registers r1…r16. The caller may pass parameters starting from any register; the exact register numbers on the caller side depend on the caller's local frame size.
A parameter storage area, located at a fixed offset of 32 bytes from the stack-top pointer, is reserved in each stack frame for the argument list. A minimum of 8 doublewords is always reserved. The size of this area must be sufficient to hold the longest argument list passed by the function that owns the stack frame. Although not all arguments of a particular call are kept in this storage, consider the argument list as being formed in this area, with each argument occupying one or more doublewords.
If more arguments are passed than can be held in registers, the remaining arguments are stored in the parameter storage area. Values passed on the stack are bitwise identical to those that would have been placed in registers.
For variable argument lists the ABI uses the type va_list, which is a pointer to the memory location of the next parameter. Using the simple va_list type means that variable arguments must always be stored in the same location regardless of type, so that they can be found at runtime. This ABI defines that location as the general registers r8-r18 for the first eight doublewords and the parameter storage area on the stack for the rest. Alignment requirements, such as those for floating-point types, may require the va_list pointer to be aligned before the value is accessed.
Return values: functions must return values of type int, long, long long, enum, short, and char, as well as pointers to any type, in register r1, zero- or sign-extended to 64 bits.
Character arrays up to 8 bytes long, or bit strings up to 64 bits long, are returned in the g8 register, right-justified. Structures or unions of any length, and character strings longer than 8 bytes, are returned in a storage buffer allocated by the caller; the caller passes the address of this buffer as a hidden extra argument.
Functions must return a single floating-point result of type float, double, or long double (quadruple) in register r1, rounded to the required precision. Functions must return complex numbers in registers r1 (real part) and r2 (imaginary part), rounded to the required precision.
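A hedged sketch of returning a simple integer under this convention (register numbers are from the callee's point of view):

callee:
    ...
    ldi     %r1, 42     ; place the 64-bit integer return value in r1 (r2 would hold a second result doubleword)
    ret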
The architecture defines a model in which control passes to the next sequential instruction in memory unless directed otherwise by a branch instruction or an interrupt. The architecture requires that, to the program, the processor appear to execute instructions in the order in which they are located in memory, although internally the order may be changed. The instruction execution model described in this chapter gives a logical view of the steps involved in executing an instruction. The branch and interrupt sections show how control flow can change during program execution.
If the branch direction is predicted incorrectly, the branch instruction causes a pipeline stall. All speculatively issued instructions are flushed, from the fetch stage up to the result write-back stage, i.e. almost the entire length of the pipeline.
Predication is the conditional execution of instructions. Its purpose is to remove poorly predicted branches from the program: any instruction effectively becomes a hardware-executed conditional operator. For example:
if (a) b = c + d.
add (a) b = c, d
The optional argument «a» (the predicate) sets the logical condition: whether to execute the instruction or not. This technique replaces a control dependency with a data dependency and shifts a possible pipeline stall closer to the end of the pipeline. All instructions issued with a false predicate value are discarded at the completion stage (retire) or earlier (down to the decode stage) without interrupts.
Instruction predication may be explicit or implicit. With explicit predication, each instruction contains an additional argument, a one-bit predicate register, and, accordingly, the architecture contains a file of several predicate registers (16 predicates in the 32-bit ARM architecture, 64 in Intel Itanium).
With implicit predication, the architecture contains a special mask register that stores information about the conditionality of future instructions. Before an instruction executes, the first bit of this register is taken as its predicate. Then the register is shifted by one bit and the current bit is lost. The subsequent instruction takes the second bit as its predicate. The register is constantly refilled from the other end with «clean» bits corresponding to unconditionally executed instructions.
Some instructions may write data to this register, thereby canceling the unconditional execution of some future instructions according to the bit mask. These are the so-called nullification instructions. For example, with mask 0b10011, containing three 1-bits, the three instructions (the 1st, 2nd, and 5th) after the nullification instruction are canceled.
The advantage of conditional execution is the elimination of most branches in short conditional calculations, and hence of pipeline stalls. However, this is essentially a brute-force method that boils down to simultaneously issuing instructions from several execution paths into the pipeline under different predicates. In addition, explicit predication needs space in the instruction to encode the extra argument, the predicate register.
Predication suits short conditional calculations. It makes no sense to apply predication to loops, or to conditional statements whose length exceeds the gain from keeping the pipeline running without branches. However, it is the only means of removing stalls for poorly predictable branches (for example, a conditional branch that depends on unpredictable data).
An implicit predication scheme was chosen for the POSTRISC architecture. According to statistics collected for other architectures with predication, approximately 90% of instructions execute without using predication, so spending several bits on a predicate in every instruction is not worthwhile. On the other hand, the remaining 10% of instructions depend on unpredictable data and, without predication, introduce significant delays into pipeline operation. Therefore an architecture without predication would also be suboptimal.
The special field psr.future is used to control the nullification of subsequent instructions. The least significant bit of the field corresponds to the current instruction; the other bits correspond to subsequent instructions. At the end of each instruction a right shift occurs. In the case of a branch, the future mask is completely cleared, thereby canceling all pending nullifications.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | nullification condition | dist-no | dist-yes | opx |
Depending on the nullification condition (which, for different instructions, is built from two registers, a register and a shift amount, or a register and an immediate value), either the next «dist-yes» instructions or the «dist-no» instructions that follow them (after the first block of «dist-yes» instructions) are marked as nullified in psr.future.
The following are examples of removing branches from short conditional statements and the corresponding use of nullification. In all cases a control dependency is converted into a data dependency and a mask for future instructions. Everywhere it is assumed that evaluating the conditions has side effects in the form of possible exceptions and therefore must itself be strictly predicated. If there can be no side effects in the form of exceptions, the evaluation of a complex condition can naturally be done without predication, avoiding unnecessary manipulations.
Conditional statement | Predication |
---|---|
if (c1) {x1; } else {x2; } |
c1 c1yes, c1no x1 (c1no) x2 (c1yes) |
if (c1) { x1; if (c2) x2; else x3; x4; } else { x5; if (c3) x6; else x7; x8; } |
c1 c1yes, c1no x1 (c1no) c2 c2yes, c2no (c1no) x2 (c1no, c2no) x3 (c1no, c2yes) x4 (c1no) x5 (c1yes) c3 c3yes, c3no (c1yes) x6 (c1yes, c3no) x7 (c1yes, c3yes) x8 (c1yes) |
if (c1) {x1; } else if (c2) {x2; } else if (c3) {x3; } else {x4; } |
c1 c1yes, c1no c2 c2yes, c2no (c1yes) c3 c3yes, c3no (c1yes, c2yes) x1 (c1yes) x2 (c2yes) x3 (c3no) x4 (c3yes) |
if (c1 && c2) {x1; } else {x2; } |
c1 c1yes, c1no c2 c2yes, c2no (c1no) x1 (c1no, c2no) x2 (c2yes) |
if (c1 || c2) {x1; } else {x2; } |
c1 c1yes, c1no c2 c2yes, c2no (c1yes) x1 (c2no) x2 (c1yes, c2yes) |
if (c1 || (c2 && c3)) { x1; } else { x2; } |
c1 (p0) p2, p3 c2 (p3) p4, p5 (unc) c3 (p4) p2, p3 x1 (p2) x2 (p3) |
if (c1 && (c2 || c3)) { x1; } else { x2; } |
c1 (p0) p2, p3 c2 (p2) p4, p5 (unc) c3 (p5) p4, p5 (unc) x1 (p4) x2 (p3) |
Nullification instructions record, in the special field psr.future, the fact that the execution of certain subsequent instructions is canceled. They create a mask of 1-bits for the nullified instructions of the if- or else-block and OR it into the current future mask. Nullification instructions assume that the «if»-block precedes the «else»-block.
The following instructions cancel future instructions depending on the result of comparing two registers.
Instruction | Operation |
---|---|
nuldeq | nullify if doubleword equal |
nuldne | nullify if doubleword not equal |
nuldlt | nullify if doubleword less |
nuldle | nullify if doubleword less or equal |
nuldltu | nullify if doubleword less unsigned |
nuldleu | nullify if doubleword less or equal unsigned |
nulweq | nullify if word equal |
nulwne | nullify if word not equal |
nulwlt | nullify if word less |
nulwle | nullify if word less or equal |
nulwltu | nullify if word less unsigned |
nulwleu | nullify if word less or equal unsigned |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | opx | dist-no | dist-yes | opx |
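As a hedged sketch (register allocation and block sizes are hypothetical; distances are given as block lengths, as in the explicit-distance example later in this chapter), the statement if (a == b) x = 1; else x = 2; can be lowered without a branch:

    ; a in %r5, b in %r6, x in %r7
    nuldne  %r5, %r6, 1, 1    ; if a != b, nullify the 1-instruction if-block, otherwise the 1-instruction else-block
    ldi     %r7, 1            ; if-block: survives only when a == b
    ldi     %r7, 2            ; else-block: survives only when a != b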
The following instructions cancel future instructions depending on the result of comparing a register with an 11-bit (40-bit in the long form) immediate value, signed or unsigned. The conditions are the same as for the compare-with-immediate-and-branch instructions.
Instruction | Operation |
---|---|
nuldeqi | nullify if doubleword equal |
nuldnei | nullify if doubleword not equal |
nuldlti | nullify if doubleword less |
nuldlei | nullify if doubleword less or equal |
nuldltui | nullify if doubleword less unsigned |
nuldleui | nullify if doubleword less or equal unsigned |
nulweqi | nullify if word equal |
nulwnei | nullify if word not equal |
nulwlti | nullify if word less |
nulwlei | nullify if word less or equal |
nulwltui | nullify if word less unsigned |
nulwleui | nullify if word less or equal unsigned |
nulmall | nullify if mask all bit set |
nulmany | nullify if mask any bit set |
nulnone | nullify if mask none bit set |
nulnotall | nullify if mask not all bit set |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | imm11 | dist-no | dist-yes | opx |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
imm40 | 0 |
The instructions nulbs (nullify if bit set) and nulbsi (nullify if bit set immediate) cancel future instructions depending on whether a bit in the register is set.
The analogous instructions nulbc (nullify if bit clear) and nulbci (nullify if bit clear immediate) cancel future instructions depending on whether a bit in the register is clear.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | opx | dist-no | dist-yes | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | shift | opx | dist-no | dist-yes | opx |
Floating-point scalar values may also be tested for nullification. Two registers may be compared, or a single register value may be classified (normalized, signed, denormal, NaN, INF, etc).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | opx | dist-no | dist-yes | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | classify | opx | dist-no | dist-yes | opx |
The assembler eliminates the need to set predication distances manually. You can use named markers whose distances are computed automatically. The general syntax is:
INSTRUCTION NAME regular_parameters (pred1, pred2, pred3, ...)
The predicate list indicates for which previously defined nullification predicates this instruction is the last one of the if- or else-block. These predicates must be defined by earlier nullification instructions no more than 31 instructions before the current one for a «yes» predicate and 63 instructions for a «no» predicate.
write "test nullification (explicit distances)" ldi %r10, 0 nuleq %r10, %r10, 5, 4 write "0" ; nullified in 5 write "1" ; nullified in 5 write "2" ; nullified in 5 write "3" ; nullified in 5 write "4" ; nullified in 5 write "5" ; nullified in 4 write "6" ; nullified in 4 write "7" ; nullified in 4 write "8" ; nullified in 4
write "test nullification (predicate names)" ldi %r10, 0 nuleq %r10, %r10, equal, nonequal write "0" write "1" write "2" write "3" write "4" (equal) write "5" write "6" write "7" write "8" (nonequal)
Both variants print «5 6 7 8» (the 4-instruction else-block) and avoid printing «0 1 2 3 4» (the 5-instruction if-block) due to predication.
The else-block may be empty if both the «yes» and «no» distances refer to the same instruction (dist_yes == dist_no). To create a zero-length else-block, the last instruction of the if-block should also be marked as the last of the else-block. To create a zero-length if-block, the nullification instruction itself should be marked as the last of the if-block.
In the next example all 3 subsequent instructions are nullified, because the nullification condition is true, and there is no else-block. Both block-end markers, «(equal)» and «(nonequal)», are on the same instruction.
nuleq r10, r10, equal, nonequal
write "0" ; part of equal-block
write "1" ; part of equal-block
write "2" (equal, nonequal) ; part of equal-block
In the next example all 3 subsequent else-block instructions are executed, because the nullification condition is false and there is no if-block for the true condition. The «(equal)» marker of the if-block is placed on the nullification instruction itself, so the size of the if-block (the distance from the end of the block to the nullification instruction) is zero.
nuleq r10, r12, equal, nonequal (equal)
write "0" ; part of nonequal-block
write "1" ; part of nonequal-block
write "2" (nonequal) ; part of nonequal-block
From the point of view of most applications, memory is a linear array of bytes indexed from 0 to 2^64−1. Each byte is identified by its index, or address, and each byte contains a value. This information is sufficient for programming applications that do not require special features of any system environment. Other objects are constructed as sequences of bytes.
The architecture supports composite data types of sizes 1, 2, 4, 8, and 16 bytes. The following terminology is used in this guide for these data types; the word size is assumed to be 4 bytes.
A byte is 8 contiguous bits starting on an arbitrary addressable byte boundary. Bits are numbered from right to left, from 0 to 7.
A halfword is two contiguous bytes starting on an arbitrary (but two-byte aligned) boundary. Bits are numbered from right to left, from 0 to 15.
A word is four contiguous bytes starting on an arbitrary (but four-byte aligned) boundary. Bits are numbered from right to left, from 0 to 31.
A doubleword is eight contiguous bytes starting on an arbitrary (but eight-byte aligned) boundary. Bits are numbered from right to left, from 0 to 63.
A quadword is sixteen contiguous bytes starting on an arbitrary (but 16-byte aligned) boundary. Bits are numbered from right to left, from 0 to 127.
An octaword (optional) is 32 contiguous bytes starting on an arbitrary (but 32-byte aligned) boundary. Bits are numbered from right to left, from 0 to 255.
This chapter additionally defines physical addressing, physical memory map, physical memory properties, memory ordering.
Extensions of the simple memory model include virtual memory, caches, memory-mapped IO, and multiprocessor systems with shared memory; together with services provided by the operating system, they form the mechanism that allows explicit management of this extended memory model.
A simple sequential execution model allows at most one memory access at a time and requires that all memory accesses appear to execute in program order. Unlike this simple model, a relaxed memory model is defined below. In multiprocessor systems that allow multiple copies of data, aggressive implementations can allow time intervals during which different copies hold different values.
A program accesses memory using the effective address calculated by the processor when it performs a load, store, branch, or cache-management instruction, and when it fetches the next sequential instruction. The effective address is converted to a physical address according to the translation procedures. The physical address is used by the memory subsystem to perform the memory access. The memory model provides the following features:
The architecture allows implementations to gain efficiency from weak ordering of memory accesses between processors, or between processors and external devices.
Memory accesses by a single processor appear to complete sequentially from the point of view of the programming model, but they may not complete in order with respect to their final position in the memory hierarchy. Order is guaranteed at every level of the memory hierarchy only for accesses to the same address from the same processor.
The architecture must provide instructions to allow the programmer to guarantee consistent and ordered state of memory.
The following defines the resources of the operating system for translating virtual addresses to physical addresses, physical addressing, memory sequencing and physical memory properties, status registers to support virtual memory management, virtual memory errors.
The blocks of RAM, ROM, flash, memory mapped IO and other control blocks occupy a common 64-bit physical address space with byte addressing. Accesses to RAM and the IO address ranges can be performed either through virtual addressing, by mapping to a 64-bit physical address space, or directly through physical addressing.
While software should always treat physical addresses as 64-bit, hardware may in fact implement only PALEN (less than 64) bits of the physical address. As shown below, the physical address consists of two parts: unimplemented and implemented bits. At least 40 bits of physical addressing must be implemented.
The system software can determine the specific value of PALEN by reading the PALEN field of the configuration word with the cpuid instruction.
Not all of these available addresses have real devices under them. The hardware at startup maps the available address blocks to the physical memory ranges and notifies the system about mapping. Similarly, the control ranges of the registers of external devices are mapped to physical addresses. Most physical addresses usually remain unused.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | implemented physical address bits |
When a processor model doesn't implement all bits of the physical address, the missing bits must be zero. If software generates a physical address with non-zero unimplemented bits, a runtime error occurs. Fetching instructions from an unimplemented physical address results in the error «unimplemented instruction address». Accessing data at an unimplemented physical address results in the error «unimplemented data address». Any access to an implemented but unused address ends with an asynchronous «machine check abort» when the platform reports an operation timeout. The exact machine check behavior is implementation-dependent.
Memory accesses take a significant performance hit when operands are not aligned on their natural address boundary. A naturally aligned 2-byte number has one zero bit in the low-order bits of its address. A naturally aligned 4-byte number has two zero bits in the low-order bits of its address. A naturally aligned 8-byte number has three, and a naturally aligned 16-byte number has four, zero bits in the low-order bits of its address. In general, a naturally aligned object of size 2^N bytes has N zero bits in the low-order bits of its address.
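The alignment rule above reduces to a simple mask test; the helper below is a generic C sketch (the name is ours, not part of the architecture):

#include <stdbool.h>
#include <stdint.h>

/* An object of size 2^n bytes is naturally aligned when the n low-order
 * address bits are zero; size must be a power of two. */
static bool is_naturally_aligned(uint64_t address, uint64_t size)
{
    return (address & (size - 1)) == 0;
}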
Struct data types must provide natural alignment for all of their fields by inserting padding bytes. Additionally, it should be possible to use structs as array elements, which requires final padding up to a multiple of the strictest alignment among all struct fields.
The following C structure S, containing a set of various scalars and a character array, illustrates how its fields are laid out in memory.
struct {
    int    a;    /* usual 4 bytes */
    double b;    /* usual 8 bytes */
    int    c;    /* usual 4 bytes */
    char   d[7];
    short  e;    /* usual 2 bytes */
    int    f;
} S;
C language rules for mapping structures allow the use of paddings (byte skipping) to align scalars in memory on natural boundaries.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
4 bytes (a) | padding | ||||||
8 bytes (b) | |||||||
4 bytes (c) | d[0] | d[1] | d[2] | d[3] | |||
d[4] | d[5] | d[6] | padding | 2 bytes (e) | padding | ||
4 bytes (f) | final padding |
In the example, the structure is laid out in memory with each scalar aligned on its natural boundary. This alignment adds four padding bytes between a and b, one byte between d and e, and two bytes between e and f. Since the alignment of the double-precision number b is the strictest in this structure, the whole structure must be aligned on an 8-byte boundary; this adds 4 more padding bytes at the end of the struct.
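The documented layout can be verified at compile time with offsetof; the checks below assume a typical LP64 target where int is 4 bytes, double is 8 bytes with 8-byte alignment, and short is 2 bytes (the struct name S_layout is introduced only for this illustration):

#include <stddef.h>

struct S_layout {
    int    a;      /* offset  0                             */
    double b;      /* offset  8, after 4 bytes of padding   */
    int    c;      /* offset 16                             */
    char   d[7];   /* offsets 20..26                        */
    short  e;      /* offset 28, after 1 byte of padding    */
    int    f;      /* offset 32, after 2 bytes of padding   */
};                 /* size 40: 4 bytes of tail padding      */

_Static_assert(offsetof(struct S_layout, b) == 8,  "padding between a and b");
_Static_assert(offsetof(struct S_layout, d) == 20, "d follows c directly");
_Static_assert(offsetof(struct S_layout, e) == 28, "padding between d and e");
_Static_assert(offsetof(struct S_layout, f) == 32, "padding between e and f");
_Static_assert(sizeof(struct S_layout) == 40,      "final padding to an 8-byte multiple");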
Unaligned memory accesses raise the «Unaligned data address» fault. POSTRISC contains no hardware support for unaligned memory accesses; handling is left to the software handler installed for the corresponding interrupt. Therefore, software is required to align all scalar values on their natural boundaries in memory.
Since instruction fetches, aligned loads/stores, and semaphore operations act only on aligned target addresses, they are atomic. An operation is atomic if, for other agents working with memory (other processors, IO devices), the memory access from our processor is an indivisible transaction (and vice versa). If our processor stores data to memory, no other agent can read from memory a mixture of the old data and the newly written data replacing it. Similarly, if our processor reads data, it will never read a mixture of the old data and new data written by another agent. Of course, at the machine-architecture level these rules apply only to memory atoms, that is, correctly aligned objects of 1, 2, 4, 8, or 16 bytes. For arbitrary objects in memory the atomicity of changes is not guaranteed by the architecture, and software techniques must be applied.
If scalars (individual data elements or instructions) were indivisible, then there would be no concept of «byte order». It makes no sense to consider the order of bits or groups of bits within the smallest addressable memory atom, because this order for an atom cannot be observed and determined. The question of order arises only when scalars, which the programmer and processor refer to as indivisible objects, occupy more than one addressable memory atom.
For most existing computer architectures, the smallest addressable memory atom is an 8-bit byte. Other scalars consist of groups of 2, 4, 8, or 16 bytes. When a 4-byte scalar moves from a register to memory, it occupies four consecutive byte addresses. Thus, it becomes necessary to establish the order of byte addresses relative to the scalar value: which byte contains the most significant eight bits of the scalar, which byte contains the next most significant eight bits, and so on.
For a scalar consisting of several memory atoms (bytes), the choice of byte order in memory is essentially arbitrary. There are N! ways to order the N bytes of a long number, but only two of these orderings are actually used.
In the first ordering, the smallest address is assigned to the byte that contains the least significant eight bits of the scalar (the rightmost bits), the next consecutive address to the next eight bits in ascending order of significance, and so on. This order is called little-endian because the least significant bits of the scalar, regarded as a binary number (the bits from the «smaller end»), are the first to go into memory. Intel x86 is an example of an architecture using this byte order.
In a little-endian machine, bytes within a large number are numbered from right to left in decreasing order of byte addresses, so the low byte is stored in memory at the lowest address. This is the direct byte order: a format for storing and transmitting binary data in which the least significant byte is transmitted or stored first.
7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
In the second ordering, the smallest address is assigned to the byte that contains the most significant eight bits of the scalar (the leftmost bits), the next consecutive address to the next eight bits in descending order of significance, and so on. This order is called big-endian because the most significant bits of the scalar, regarded as a binary number (the bits from the «larger end»), are the first to go into memory. IBM PowerPC is an example of an architecture using this byte order.
In a big-endian machine, bytes within a large number are numbered from left to right in ascending order of byte addresses, so the low byte is stored in memory at the highest address. This is the reverse byte order: a format for storing and transmitting binary data in which the most significant byte is transmitted or stored first. The terms little-endian and big-endian come from Jonathan Swift's Gulliver's Travels.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
The following C structure S, containing a set of various scalars and a character array, shows the placement of fields in memory under the two byte-order conventions. The comments give the value of each element of the structure; these values show how the individual bytes that make up each element are mapped into memory.
struct {
    int    a;    /* 0x1112_1314 (4 bytes) */
    double b;    /* 0x2122_2324_2526_2728 (8 bytes) */
    int    c;    /* 0x3132_3334 (4 bytes) */
    char   d[7]; /* "A","B","C","D","E","F","G" bytes array */
    short  e;    /* 0x5152 (2 bytes) */
    int    f;    /* 0x6162_6364 (4 bytes) */
} S;
C language rules for mapping structures allow padding (skipped bytes) to align scalars in memory on the desired (natural) boundaries. In the examples below, the structure is mapped into memory with each scalar aligned on its natural boundary. This alignment adds four padding bytes between a and b, one byte between d and e, and two bytes between e and f. The same amount of padding is present in the big-endian and little-endian mappings.
The contents of each byte, as defined in the structure S, are shown as a hexadecimal number or a character (for string elements). Cell addresses (offsets from the beginning of the structure) are given in the row and column headers of the tables below.
Little-endian byte order:
offsets | +0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 |
---|---|---|---|---|---|---|---|---|
0-7 | 0x14 | 0x13 | 0x12 | 0x11 | padding | padding | padding | padding |
8-15 | 0x28 | 0x27 | 0x26 | 0x25 | 0x24 | 0x23 | 0x22 | 0x21 |
16-23 | 0x34 | 0x33 | 0x32 | 0x31 | «A» | «B» | «C» | «D» |
24-31 | «E» | «F» | «G» | padding | 0x52 | 0x51 | padding | padding |
32-39 | 0x64 | 0x63 | 0x62 | 0x61 | padding | padding | padding | padding |
Big-endian byte order:
offsets | +0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 |
---|---|---|---|---|---|---|---|---|
0-7 | 0x11 | 0x12 | 0x13 | 0x14 | padding | padding | padding | padding |
8-15 | 0x21 | 0x22 | 0x23 | 0x24 | 0x25 | 0x26 | 0x27 | 0x28 |
16-23 | 0x31 | 0x32 | 0x33 | 0x34 | «A» | «B» | «C» | «D» |
24-31 | «E» | «F» | «G» | padding | 0x51 | 0x52 | padding | padding |
32-39 | 0x61 | 0x62 | 0x63 | 0x64 | padding | padding | padding | padding |
For the POSTRISC architecture, the little-endian (direct) byte order is primary. All operations on data in registers and memory follow this order. Implementations may include optional support for big-endian addressing when loading/storing numbers.
The bit numbering within bytes doesn't affect the byte numbering convention (big-endian or little-endian). The byte numbering convention doesn't matter when accessing fully aligned data in memory. However, the convention is important when accessing partial or unaligned data, or when manipulating data in registers, as follows:
Retrieving the 5th byte of an 8-byte number into the low byte of a register requires a right shift of 5 bytes under the little-endian convention, but a right shift of 2 bytes under the big-endian convention (see the C sketch after this list).
The manipulation of data in a register is almost the same for both conventions. Both integers and floating-point numbers keep the sign bit in the leftmost byte and the least significant bit in the rightmost byte, so the same integer and floating-point instructions are used unchanged under both conventions. However, a character string viewed in a register has its first character on the left under big-endian and on the right under little-endian.
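A small generic C illustration of the byte-extraction point above (not POSTRISC-specific code): the memory offset of a byte and the shift needed to isolate it in a register are related differently under the two conventions.

#include <stdint.h>
#include <string.h>

/* Return the byte stored at memory offset `index` within an 8-byte value.
 * Under little-endian, offset 5 holds bits 40..47, so the equivalent register
 * operation is (value >> (8 * 5)) & 0xff; under big-endian, offset 5 holds
 * bits 16..23, i.e. a right shift of only 2 bytes. */
static uint8_t byte_at_offset(uint64_t value, unsigned index)
{
    uint8_t bytes[8];
    memcpy(bytes, &value, sizeof value);  /* reproduces the machine's own byte order */
    return bytes[index];
}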
In addition to little-endian and big-endian, there are other (mixed) options for storing long scalars in memory. For example, some architectures (PDP-11?) store 2-byte numbers in little-endian order, but 4-byte numbers as pairs of 2-byte numbers in big-endian order. It also happens that integers are stored according to one convention and floating-point numbers according to another, for example when a floating-point coprocessor (ARM, TMS320C4x) is added to an integer processor later.
There are several memory-consistency models for SMP systems:
Atomic operations can be reordered with loads and stores.
Instruction fetching is incoherent with data, so self-modifying code cannot be executed without special instruction-cache flush/reload instructions (and possibly jump instructions).
POSTRISC follows the weak memory model. The weak memory model with acquire loads and release stores is also called the release-consistency model. Only the acquire/release atomic instructions are synchronization points.
Architecture | Loads reordered after loads | Loads reordered after stores | Stores reordered after loads | Stores reordered after stores | Atomics reordered with loads | Atomics reordered with stores | Dependent loads reordered | Incoherent instruction cache/pipeline |
---|---|---|---|---|---|---|---|---|
Alpha | + | + | + | + | + | + | + | + |
ARM | + | + | + | + | + | + | + | |
RISC-V WMO | + | + | + | + | + | + | + | |
RISC-V TSO | + | + | ||||||
PA-RISC | + | + | + | + | ||||
POWER | + | + | + | + | + | + | + | |
SPARC RMO | + | + | + | + | + | + | + | |
SPARC PSO | + | + | + | + | ||||
SPARC TSO | + | + | ||||||
x86 | + | + | ||||||
AMD-64 | + | |||||||
IA-64 | + | + | + | + | + | + | + | |
IBM-Z | + | |||||||
Postrisc | + | + | + | + | + | + | + |
Notes: On Alpha the dependent loads can be reordered. If the processor first fetches a pointer to some data and then the data, it might not fetch the data itself but use stale data which it has already cached and not yet invalidated. Allowing this relaxation makes cache hardware simpler and faster but leads to the requirement of memory barriers for readers and writers. On Alpha hardware (like multiprocessor Alpha 21264 systems) cache line invalidations sent to other processors are processed in lazy fashion by default, unless requested explicitly to be processed between dependent loads. The Alpha architecture specification also allows other forms of dependent loads reordering, for example using speculative data reads ahead of knowing the real pointer to be dereferenced.
The processor implementation must follow the program order when executing the instructions of a single-threaded program. But the effects of one thread's actions on memory may be observed by other threads out of that thread's program order. Depending on the guarantees the architecture explicitly gives and the relaxations the implementation is explicitly allowed, one speaks of stricter or weaker memory ordering. POSTRISC is an architecture with weak memory ordering. There are no implicit restrictions on the order in which the current thread's memory actions become visible to other processors or other devices (for example, IO). Similarly, the current thread has no implicit guarantees about the order in which it observes the actions of other agents.
The special instruction «fence» is used as a memory barrier. The supported memory-order (mo) types are acquire, release, acquire-release (acq_rel), and sequentially-consistent (seq_cst).
fence.mo
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | mo | opx |
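The four mo values correspond one-to-one to the C11/C++11 fence orderings; a compiler targeting POSTRISC could plausibly lower atomic_thread_fence to fence.mo (the mapping shown in the comments is illustrative, not a statement about an existing backend):

#include <stdatomic.h>

void barriers(void)
{
    atomic_thread_fence(memory_order_acquire);  /* fence.acquire */
    atomic_thread_fence(memory_order_release);  /* fence.release */
    atomic_thread_fence(memory_order_acq_rel);  /* fence.acq_rel */
    atomic_thread_fence(memory_order_seq_cst);  /* fence.seq_cst */
}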
The special «load-atomic» and «store-atomic» instructions are used to push the visibility of changes from one thread to another: relaxed is a normal operation, acquire acquires changes, release publishes changes, and the stronger orderings combine both. Before working with the shared data, a thread performs an acquire via a load-acquire from the guard variable. Similarly, after modifying the shared data, the thread pushes out its changes by performing a release via a store-release to the guard variable.
Those are the instructions ldab, ldah, ldaw, ldad, ldaq (load), and stab, stah, staw, stad, staq (store).
INSN_MNEMONIC.mo target, base
These instructions are one-way barriers: they do not allow speculative or out-of-order memory operations to move past them. Acquire doesn't allow subsequent instructions to move up before it, and release doesn't allow preceding instructions to drift down after it. With correct (pairwise) use of acquire and release, a closed code section is obtained, locked at the top (acquire) and at the bottom (release).
The possible memory orderings for atomic load: relaxed, acquire, seq_cst (sequentially-consistent). The possible memory orderings for atomic store: relaxed, release, seq_cst (sequentially-consistent).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | 0 | mo | opx |
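The guard-variable protocol described above is the usual release/acquire publication pattern; here it is in portable C11, with comments suggesting how the guard accesses could map onto ldad.acquire and stad.release (the mapping is an illustration, not generated compiler output):

#include <stdatomic.h>
#include <stdint.h>

static int payload;                     /* shared data    */
static _Atomic uint64_t guard = 0;      /* guard variable */

void producer(void)
{
    payload = 123;                                            /* plain store    */
    atomic_store_explicit(&guard, 1, memory_order_release);   /* ~ stad.release */
}

int consumer(void)
{
    while (atomic_load_explicit(&guard, memory_order_acquire) == 0)  /* ~ ldad.acquire */
        ;                                /* spin until the data is published */
    return payload;                      /* guaranteed to observe 123        */
}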
The load-op atomic instructions copy the old value of a variable from memory to a register and store a new value to memory, either taken directly (test-and-set/swap) or computed from the old value (fetch-and-add and analogues). The possible memory orderings are: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).
ea = gr[base]
gr[dst] = mem[ea]
mem[ea] = gr[dst] op gr[src]
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | src | mo | opx |
ld_op_type.mo dst, base, src
Instruction | Description |
---|---|
swap[b|h|w|d|q].mo | swap 1-16 bytes |
ldadd[b|h|w|d].mo | addition, 1-8 bytes |
ldand[b|h|w|d].mo | bitwise AND, 1-8 bytes |
ldor[b|h|w|d].mo | bitwise OR, 1-8 bytes |
ldxor[b|h|w|d].mo | bitwise XOR, 1-8 bytes |
ldsmin[b|h|w|d].mo | signed minimum, 1-8 bytes |
ldsmax[b|h|w|d].mo | signed maximum, 1-8 bytes |
ldumin[b|h|w|d].mo | unsigned minimum, 1-8 bytes |
ldumax[b|h|w|d].mo | unsigned maximum, 1-8 bytes |
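These load-op instructions are the hardware counterparts of the C11 atomic fetch-and-op functions; the relaxed 64-bit fetch-and-add below returns the old value, exactly as the table describes (lowering it to ldaddd.relaxed is a plausible but purely illustrative mapping):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;

/* Returns the counter value before the increment, like a load-op instruction
 * returns the old memory value to its dst register. */
uint64_t bump(uint64_t delta)
{
    return atomic_fetch_add_explicit(&counter, delta, memory_order_relaxed);
}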
The store-op atomic instructions update a value in memory via the corresponding operation. Compared to load-op, they don't return the old variable value from memory to a register, so they may be implemented as one-way communication. The possible memory orderings are: relaxed, acquire, release, acq_rel (acquire-release), seq_cst (sequentially-consistent).
ea = gr[base]
mem[ea] = mem[ea] op gr[src]
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | base | src | mo | opx |
store_op_type.mo base, src
Instruction | Description |
---|---|
stadd[b|h|w|d].mo | addition, 1-8 bytes |
stand[b|h|w|d].mo | bitwise AND, 1-8 bytes |
stor[b|h|w|d].mo | bitwise OR, 1-8 bytes |
stxor[b|h|w|d].mo | bitwise XOR, 1-8 bytes |
stsmin[b|h|w|d].mo | signed minimum, 1-8 bytes |
stsmax[b|h|w|d].mo | signed maximum, 1-8 bytes |
stumin[b|h|w|d].mo | unsigned minimum, 1-8 bytes |
stumax[b|h|w|d].mo | unsigned maximum, 1-8 bytes |
The cas[b|h|w|d|q] instructions (compare-and-swap, 1-16 bytes) are designed for non-blocking interaction in a multi-threaded, multiprocessor environment. They are atomic, indivisible memory operations that cannot be partially performed.
The cas instruction reads an N-byte number from memory at the address from the base register, and compares it with the value in the register dst. If the values match, the instruction saves the new value from the src register at this address. Otherwise, the instruction doesn't save anything at this address. The base address must be aligned at the N-byte boundary. The read value is stored in the register dst.
value = mem[base]
if (value == gr[dst]) {
    mem[base] = gr[src]
}
gr[dst] = value
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | src | mo | opx |
Using the following procedure, a thread can modify the contents of a memory cell even if the thread may be interrupted and replaced by another thread that updates the cell, or if a thread on another processor may modify the cell at the same time. First, the 8-byte number is loaded entirely into a register. Then the updated value is computed and placed in another register, sval. Then the casd instruction is executed with parameters test (the register holding the initial value), base (the base address register), and sval (the register containing the updated value). If the modification completed successfully, the original value is returned in test. If the memory cell doesn't contain the original value (the current thread was interrupted, or a thread on another processor intervened), the update is not performed, and the dst register of the casd instruction (test here) receives the new current value of the memory cell. In that case the thread may retry the procedure using the new current value.
loop:
    ldad.relaxed test, base
    mov save, test
    ...
    addi sval, test, 12 ; some kind of modification
    ...
    casd.relaxed test, base, sval
    bne save, test, loop
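The same retry loop expressed with C11 atomics; atomic_compare_exchange folds the reload and the comparison of the assembly version into the loop condition (a portable sketch, not POSTRISC compiler output):

#include <stdatomic.h>
#include <stdint.h>

/* Atomically add 12 to *cell, mirroring the ldad/casd retry loop above. */
void add_12(_Atomic uint64_t *cell)
{
    uint64_t old = atomic_load_explicit(cell, memory_order_relaxed);
    uint64_t desired;
    do {
        desired = old + 12;              /* some kind of modification */
    } while (!atomic_compare_exchange_weak_explicit(cell, &old, desired,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}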
The casd instruction can be used for controlled sharing of a common data area, including the ability to add messages (to a linked message list) while the common area is in use. To achieve this, an 8-byte number in memory can be used as a control value. A value of zero indicates that the common area is not in use and that no messages exist. A negative value indicates that the area is in use by someone and that no messages exist. A positive value indicates that the shared area is in use and that the value is the address of the most recent message added to the list. Therefore, any number of threads wishing to capture the common area can use casd to update the control value, to indicate that the area is in use or to add messages to the list. Only the single thread that has captured the shared area can also safely use casd to remove messages from the list.
The casq instruction can be used similarly to casd, and it has additional uses. Consider a linked list of messages with a control value holding the address of the first message, as described above. If multiple threads are allowed to delete messages using casd (and not just the single thread that captured the common area), the list may be modified incorrectly. This can happen if, for example, after one thread has read the address of the first message in order to remove it, another thread deletes the first two messages and then adds the first message back to the linked list (the «ABA» issue in IBM terminology). The first thread, continuing its interrupted execution, cannot detect that the list has changed. By extending the control word to a pair of 8-byte numbers, containing the address of the first message and a modification tag (change counter) incremented by 1 on every list modification, and by using casq to update both fields together, the possibility of incorrect list updates can be reduced to an insignificant level. Namely, an incorrect modification can occur only if the first thread was interrupted, the number of changes to the list made in the meantime is exactly a multiple of 2^64, and the last change to the list reuses the original message address.
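A C11 sketch of the tagged control word described above. It assumes a 64-bit target where the pointer-plus-tag pair is exactly 16 bytes (so it could, in principle, be updated with casq), that the compiler provides a 16-byte compare-exchange, and that nodes are never freed while another thread may still dereference them; the names are ours:

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct node { struct node *next; /* message payload follows */ };

/* 16-byte control word: list head plus a modification tag, always updated together. */
struct head { struct node *first; uint64_t tag; };

static _Atomic struct head list_head;

/* Remove the first message; the tag makes the ABA interleaving harmless. */
static struct node *pop(void)
{
    struct head old = atomic_load(&list_head);
    struct head desired;
    do {
        if (old.first == NULL)
            return NULL;
        desired.first = old.first->next;
        desired.tag   = old.tag + 1;     /* bump the tag on every modification */
    } while (!atomic_compare_exchange_weak(&list_head, &old, desired));
    return old.first;
}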
The architecture of any processor needs a simple and effective mechanism for distinguishing ordinary memory accesses from IO operations on devices mapped into the address space. For ordinary memory, data may be cached with write-back, locked semaphore operations are allowed, and reordering of load/store operations and write-combining (coalescing) of stores are available as optimizations. For memory-mapped IO devices, accesses must not be cached (write-through is mandatory), even reads may have side effects, and a strict sequential order without reordering or merging is required.
In addition, a dedicated, fixed address range must exist in the physical address space for the bootloader code, an analog of the PC EPROM/BIOS. It contains the entry point from which execution starts after a system reset, and other embedded code that is implementation-dependent (processor-dependent code, PDC) and platform-dependent (system-dependent code, SDC). It is a read-only memory block, although updates may be permitted.
Memory attributes define speculation, cacheability, ordering, and write policy. If virtual addressing is enabled, the memory attributes of the mapped physical page are provided by the TLB. If physical addressing is used, memory attributes are determined from the physical address itself.
The software must use the correct address subspaces when using physical addressing. Otherwise, incorrect access to IO devices with side effects is possible.
An address range can be either cacheable or non-cacheable. If the range is cacheable, the processor is allowed to allocate a local copy of the corresponding physical memory at any level of the processor cache hierarchy. Allocation can be influenced by cache management instructions.
The cached page is memory coherent, i.e. the processor and memory system guarantee that there is a consistent representation of memory for each processor. Processors support multiprocessor cache coherence based on physical addresses between all processors in the coherence domain (tightly coupled multiprocessors). Coherence doesn't depend on virtual aliases, since they are forbidden.
The processor is not required to maintain coherence between the local instruction and data caches; that is, a local store may not be observed by the local instruction cache. Moreover, multiprocessor coherence is not required of the instruction cache. However, the processor must ensure that the operations of other IO agents, such as Direct Memory Access (DMA), are physically coherent with the data and instruction caches.
For an uncached access, the processor doesn't provide any coherence mechanisms. The memory system must ensure that a consistent memory representation is seen by each processor.
When writing to cached memory with a write-back policy, only the processor-owned local copy of the data cache line changes. The write to a lower level of the cache hierarchy (or to physical memory) occurs when the modified cache line is explicitly (or implicitly) evicted from a higher-level cache. With a write-through policy, data changes reach all levels of the hierarchy immediately.
For non-cached address ranges, write-combining (coalescing) may be enabled, which tells the processor that multiple writes to a limited memory area (typically 32 bytes) can be gathered in a write buffer and performed later as one large combined write. The processor may keep combining writes for an indefinite period of time; several writes may accumulate in the buffer and be merged into one large write. Write-combining is a means of increasing processor efficiency. A processor with multiple write buffers should evict them in a fair order, using the buffers approximately equally, even if some buffers are only partially full.
The processor can flush data from the write buffers to memory in any order; combined writes are not performed in their original order. Write-combining can be both spatial and temporal. For example, a write of bytes 4 and 5 and a write of bytes 6 and 7 are combined into a single write of bytes 4, 5, 6, and 7. Likewise, a write of bytes 5 and 6 is combined with a subsequent write of bytes 6 and 7 into a single write of bytes 5, 6, and 7 (with the first write to byte 6 discarded).
The memory attributes may be defined in several ways.
Memory attributes may be defined via special registers at the level of physical address ranges. In X86 the special memory type range registers (MTRRs) are a set of processor supplementary capability control registers that provide system software with control of how accesses to memory ranges by the CPU are cached. It uses a set of programmable model-specific registers (MSRs) which are special registers provided by most modern CPUs. Possible access modes to memory ranges can be uncached, write-through, write-combining, write-protect, and write-back. In write-back mode, writes are written to the CPU's cache and the cache is marked dirty, so that its contents are written to memory later. Write-combining allows bus write transfers to be combined into a larger transfer before bursting them over the bus to allow more efficient writes to system resources like graphics card memory. This often increases the speed of image write operations by several times, at the cost of losing the simple sequential read/write semantics of normal memory. Additional bits, added in AMD64, allow the shadowing of ROM contents in system memory (shadow ROM), and the configuration of memory-mapped I/O.
Memory attributes may be defined at the level of virtual addresses, via virtual page properties stored as an additional part of the cached translation. Such per-page memory attributes may then override the per-range physical address attributes or restrict them in a compatible manner.
Memory attributes may be determined by physical memory mapping only. In this case, fixed address ranges have specified memory attributes. Memory attributes are set implicitly during the initial physical address ranges mapping at reset and can't be changed further.
In POSTRISC, the last approach is chosen. Memory attributes are determined by the fixed mapping of physical address ranges and cannot be redefined later via special registers or page properties. The physical address space is therefore divided into fixed parts with mmio-like and memory-like address ranges.
Addresses | Use |
---|---|
0 to 1 GiB | mmio-like for compatible devices (not 64-bit ready). |
1 to 4 GiB | memory-like for compatible devices (not 64-bit ready). |
4-256 GiB | mmio-like for 64-bit ready devices |
over 256 GiB | memory-like main space. |
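The fixed partitioning above can be expressed as a simple classification function; the helper below is an illustration of the table, not engine source code:

#include <stdbool.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)

/* Classify a physical address according to the fixed attribute map above. */
static bool is_mmio_like(uint64_t paddr)
{
    if (paddr < 1 * GiB)    return true;    /* mmio-like, compatible devices   */
    if (paddr < 4 * GiB)    return false;   /* memory-like, compatible devices */
    if (paddr < 256 * GiB)  return true;    /* mmio-like, 64-bit ready devices */
    return false;                           /* memory-like main space          */
}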
From the system point of view, the physical address space is a collection of devices, each of which is mapped to a contiguous address range. Everything is a memory-mapped device: RAM units are devices, external IO devices are naturally memory-mapped, and even processor cores are memory-mapped devices.
The bus controller which controls memory mapping is itself a memory-mapped device. The special «device array» device maps all device configuration spaces (similar to a PCI root complex). Each device has at most a 4 KiB configuration space in the device array.
At least one block address in the physical memory map must be architecturally fixed: the starting address in ROM for code execution after reset. The layout of the other blocks may also be fixed, or may be discovered from the ROM code.
start | end | size | description |
---|---|---|---|
0x0000000000000000 | 0x00000000ffffffff | 4GiB | reserved |
0x0000000100000000 | 0x00000001000fffff | 1MiB | chipset control |
0x00000001f0000000 | 0x00000001ffffffff | 256MiB | ROM |
0x0000000200000000 | 0x00000002ffffffff | 4GiB | PCIE ECAMs (16x256MiB) |
0x0000004000000000 | 0x0000004fffffffff | 64GiB | PCIE BARs |
0x0000010000000000 | 0x000003ffffffffff | 2TiB | RAM |
The memory map should be consistent with memory attributes. Chipset control, PCIE config spaces, memory-mapped io: should be mapped to mmio-like ranges. Memory devices: should be mapped to memory-like ranges. ROM devices: may be both, but the startup ROM should be memory-like.
The cache-flush instructions icbf (instruction cache block flush) and dcbf (data cache block flush) evict the contents of any write buffers whose addresses fall within the 32-byte aligned block specified by icbf or dcbf, forcing the data to become visible. The icbf and dcbf instructions may also evict additional write buffers.
The parameterless instruction msync (memory synchronize) is a hint to the processor to speed up the flushing of all pending (buffered) stores, regardless of their addresses. This makes the pending stores visible to other memory agents.
There is no way to know when the eviction of the buffered writes completes. The ordering of combined writes is not guaranteed, so later writes may become visible before earlier ones. To ensure that earlier combined writes become visible before later ones, software must serialize between them.
The processor may flush combined writes to memory at any time and in any order, before software explicitly requires it.
Pages that allow write-combining are not necessarily coherent with the write buffers or caches of other processors, or with the local processor caches. Loads from a write-combining page by a processor do see the results of all previous writes by the same processor to the same page. Memory accesses produced by a combining buffer (such as buffer flushes) have an unordered, non-sequential memory ordering attribute.
The MMGR family includes instructions for working with special registers, barrier instructions, cache management, dynamic procedure calls, interprocess communication, etc.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | s | nt | opx | label28 | |||||||||||||||||||||||||||||||||||||
opcode | s | nt | opx | base | simm21 | ||||||||||||||||||||||||||||||||||||
opcode | s | nt | opx | base | index | scale | sm | disp |
The second register contains the base address (only the address register). The rest of the instruction is reserved for storing the offset, a 9-bit signed number. Formulas to get the effective address:
ip + 16 × sign_extend(label23)
gr [base] + sign_extend(disp)
gr [base] + (gr [index] << scale) + sign_extend(disp)
Instructions ECB (Evict cache block), FETCH (Prefetch data), FETCHM (Prefetch data, modify intent), WH64 (Write hint 64 bytes) regulate the use of cache resources.
FETCH - load the block into the cache for reading N times (if N=0, then free the block).
FETCHM - load a block into the cache for modification N times (if N=0, then push the block out of the cache into memory).
This chapter additionally defines operating system resources to translate 64-bit virtual addresses to physical addresses. The virtual memory model introduces the following key features that distinguish it from the simplified presentation of application programs:
Translation lookaside buffers (TLBs) support high-performance paged virtual memory systems. Software handlers for populating and protecting TLBs allow the operating system to control translation policies and protection algorithms.
A page table (PT) with hardware walking capability is added to increase TLB performance. The PT is an extension of the processor TLB: it is located in RAM and can be walked automatically by the processor. The use of the PT and its size are entirely under software control.
Sparse 64-bit virtual addressing is supported by providing large translation structures (including multi-level hierarchies, like a cache hierarchy), efficient handling of translation misses, pages of different sizes, fixed (non-replaceable) translations, and mechanisms for sharing TLB and page table resources.
The main addressable object in the architecture is the 8-bit byte. Virtual addresses are 64 bits long. An implementation may support a smaller virtual address space. Virtual addresses visible to the program are translated into physical memory addresses by the memory management mechanism.
From an application point of view, the virtual addressing model represents a 64-bit single flat linear virtual address space. General purpose registers are used as 64-bit pointers in this address space.
Less than 64 bits of a virtual address may be implemented in hardware. Unimplemented address bits must be filled with copies of the most significant implemented bit (a sign extension of the implemented part of the address). Addresses in which all unimplemented bits match the most significant implemented bit are called canonical. The implemented virtual address space thus consists of two parts: user and kernel. For N implemented virtual address bits, user addresses range from 0 to 2^(N-1)−1, and kernel addresses range from 2^64−2^(N-1) to 2^64−1.
So, for example, for 48 bits:
0x0000000000000000 - start of user range
0x00007FFFFFFFFFFF - end of user range
0xFFFF800000000000 - beginning of the kernel range
0xFFFFFFFFFFFFFFFF - end of kernel range
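The canonical-address rule is a sign-extension check; the sketch below tests it for N implemented bits and assumes the compiler performs an arithmetic right shift on signed integers (true of common compilers):

#include <stdbool.h>
#include <stdint.h>

/* An address is canonical when its unimplemented high bits are copies of the
 * most significant implemented bit (bit N-1). */
static bool is_canonical(uint64_t va, unsigned implemented_bits)
{
    unsigned drop = 64 - implemented_bits;
    int64_t  sext = ((int64_t)(va << drop)) >> drop;  /* sign-extend bit N-1 upward */
    return (uint64_t)sext == va;
}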
Each virtual address consists of a page table index (1 bit), a virtual page number (VPN), and a page offset. The least significant bits form the page offset; the virtual page number consists of the remaining bits. Page offset bits don't change during translation. The boundary between the page offset and the VPN in the virtual address depends on the page size used in the virtual mapping. In the current implementation, 16 KiB pages are available, and superpages are multiples of 16 KiB (32 MiB and 64 GiB).
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
sign extension | virtual page number | 16 KiB page offset |
Switching between physical and virtual addressing modes is controlled by the privileged special register pta. The mode field sets the page translation mode. After a processor reset this field is zero. Virtual addressing is enabled when pta.mode != 0.
pta.mode | description |
---|---|
0 | without translation (physical addressing) |
1 | reserved |
2 | 2 translation levels |
3 | 3 translation levels |
4 | 4 translation levels |
A variable page size is needed to help software map system resources and to improve TLB utilization. Typically, operating systems choose a small set of page sizes to implement their virtual memory algorithms. Large pages can be statically allocated. For example, large areas of the virtual address space can be allocated to the operating system kernel, frame buffers, or mapped IO regions. Software can also selectively pin these translations by placing them in translation registers.
Page size can be specified in: translation cache, translation registers, and PT. Page size can also be used as a parameter for TLB cleanup instructions.
The page sizes are encoded in a 4-bit field ps (pagesize). The field defines a mapping size of 2^(ps+12) bytes.
Virtual and physical pages must be aligned on their natural boundary. For example, 64 KiB pages are aligned on a 64 KiB boundary, and 4 MiB pages on a 4 MiB boundary.
Processors that use variable virtual page sizes generally require a fully associative TLB implementation in hardware. Processors that use only one page size can get by with a set-associative buffer, although fully associative TLBs are still common.
abbreviation | designation | description |
---|---|---|
r | read | read access with the usual load/store instructions |
w | write | write access with normal load/store instructions |
x | execute | code execution access |
b | backstore | saving/restoring registers from the hardware register stack |
f | finalized | final state, page rights cannot be changed, gives the right to read addresses for indirect call instructions through trusted import tables and virtual function tables |
p | promote | the right to elevate privileges of the current thread to the kernel level |
The software can check page level permissions with the instructions mprobe, mprobef, which check the availability of this virtual page, privilege level, read/write permissions at the page level, and read/write permissions with a security key.
Executable-only pages may be used, to increase privileges on entering operating system code. User level code should usually go to such a page (managed by the operating system) and execute the instruction epc (Enter Privileged Code). When epc has successfully elevated privileges, the subsequent instructions are executed at the target privilege level indicated by the page. A branch can (optionally) lower the current privilege level if the page where the branch is made has a lower privilege level.
Virtual addresses are translated to physical addresses using a hardware structure called the Translation Lookaside Buffer (TLB), or translation cache. Using the virtual page number (VPN), the TLB finds and returns the physical page number (PPN). A processor usually has two architectural TLB buffers: the instruction TLB (ITLB) and the data TLB (DTLB), which translate references to instructions and data respectively. In a simplified implementation, a single (combined) buffer may be used for both types of translation. The term TLB itself refers to the union of the instruction, data, and translation cache structures.
When the processor accesses memory, the TLB is searched for a translation entry with the corresponding VPN. If a matching translation entry is found, the physical page number (PPN) is combined with the page offset bits to form the physical address. In parallel with the translation, the page permissions are checked against the current privilege level and the granted read, write, and execute rights.
If the required translation is not found in the TLB, the processor may itself search the page table in memory and install the entry in the TLB. If the required entry cannot be found in the TLB and/or page table, the processor raises a TLB miss fault so that the operating system can establish the translation. In a simplified implementation, the processor may raise the fault immediately after a TLB miss. After the operating system installs the translation in the TLB and/or page table, the faulting instruction is restarted and execution continues.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
ppn | pl | ma | a | d | 0 | p | ar | v | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
vpn | rv | ps | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
rv | asid |
Translation field | Description |
---|---|
v | Valid bit. If this bit is 1, then translation can be used in the search. |
ar | Global permissions for the virtual page. |
p | Present bit. This bit indicates that the mapped physical page is present in the physical memory and not ejected to disk. |
ma | memory attributes. Describes caching, coherence, writing method, and other attributes of the displayed physical page. |
a | Access bit. This bit may cause an error on access for tracing or debugging purposes. The processor doesn't modify Access bit when referenced. |
d | Dirty bit. There was a write or semaphore instruction on this page. |
pl | Privilege Level or Page Level. |
ppn | Physical page number. |
ps | Page size 2^ps bytes. |
vpn | Virtual Page Number. |
asid | Address Space identifier. |
rv | Reserved (doesn't exist) |
The TLB is a local processor resource: local insertion or removal of translation entries on one processor doesn't affect the TLB of another processor. A global TLB purge is provided to remove translations on all processors within a coherent TLB region in a multiprocessor system.
The Translation Cache (TC) is an implementation-defined structure designed to store a small working set of dynamic translations for memory references. The processor directly controls the entry replacement policy in the TC.
The ptc (purge translation cache) instruction removes the local processor's ITC/DTC entries that match the specified range of virtual addresses. Software must handle the case where the purge has to be extended to all processors in a multiprocessor system. Purging the translation cache doesn't affect fixed TC entries.
The translation cache has at least 16 entries for the ITC and 16 entries for the DTC. An implementation may have additional levels of a TLB hierarchy to increase efficiency.
The translation cache is controlled by both software and hardware. Generally speaking, software cannot assume how long any installed translation will remain in the cache. This lifetime, as well as the replacement (eviction) algorithm, depends on the implementation. A processor may evict translations from the cache at any time for various reasons. TC purges may remove more entries than explicitly required.
Entries in the translation cache must be maintained in a consistent state. When inserting into or purging the TLB, all existing entries that partially or completely overlap the given translation must be removed. In this context, overlap refers to two translations with partially or completely overlapping virtual address ranges. For example: two 64 KiB pages with the same virtual address, or a 128 KiB page at virtual address 0x20000 and a 64 KiB page at address 0x30000.
Translation registers (TRs) are the part of the TLB that contains translations whose replacement policy is controlled directly by software. Each translation cache entry can be pinned, turning it into a software-controlled translation register, or unpinned and returned to the common pool. Fixed translations are not replaced when the TC overflows (but are flushed when they overlap with newly inserted translations). A fixed insertion into a previously unfixed TC entry removes the translation cached in that entry. Software can explicitly place translations into TRs by specifying the entry number in the cache. Translations are removed from a TR when the translation register is purged, but not when the translation cache is purged.
Translation registers allow the operating system to pin critical virtual memory translations into TLB, for example, IO spaces, kernel memory areas, frame buffers, page tables, sensitive interrupt code, etc. The interrupt handler instruction fetching is performed using virtual addressing, and therefore, virtual address ranges containing software translation miss handlers and other critical interrupt handlers should be fixed, otherwise, additional recursive misses in the TLB may occur. Other virtual mappings may be pinned for performance reasons.
An inserted entry is fixed if the insertion is performed with the fix bit set. Once such a translation is in the TLB, the processor will not replace it to make room for other translations. Fixed translations can only be removed by an explicit software TLB purge. Insertions into and purges of translation registers may selectively remove other translations (from the translation cache).
A processor must have at least 8 fixed translation registers for itc and 8 for dtc. An implementation may have additional translation registers to increase efficiency.
On a miss in the TLB hardware translation cache (absence of the required entry), an interrupt occurs and the software miss handler comes into play. It must find the required translation in the page table in memory and place it into the TLB, after which the instruction that caused the interrupt is restarted. However, many systems contain a hardware (or partly hardware) translation walker. In that case, on a TLB miss the hardware unit searches for the translation in memory, and only if it doesn't find the desired translation does an interrupt occur and a system (software) miss handler get called.
If the processor implements an automatic translation walker, then the format of individual translation entries, the format of the translation table as a whole, and the search algorithm over the translation table cease to be a free choice of the operating system. The system (OS-owned) translation structures must instead work in close cooperation with the hardware translation walker.
The Page Table Walker (PTW) is a hardware unit that independently searches for translations in RAM when they are absent from the TLB. The PTW is designed to increase the performance of the virtual address translation system.
The Page Table (PT) is a translation table in memory walkable by the PTW hardware unit (it must be organized according to the requirements of the PTW hardware).
The processor's PTW unit can (optionally) be configured to search for a translation in the PT after a failed search in the TLB for instructions or data. The PTW unit provides a significant performance increase by reducing the number of interrupts (and therefore delays and processor pipeline flushes) caused by TLB misses, and by letting the PTW fill the TLB translation cache in parallel with other processor activity.
To organize a page table in memory, traditionally in different architectures, the following schemes are used with varying success:
top-down is a traditional multi-level translation search scheme based on direct downward parsing of a virtual address, when each level of the table tree is directly indexed by the next portion of the virtual address. The easiest way for a hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for searching for translation is equal to the number of levels (depth of the tree) - 2 for X86, 3 for DEC Alpha, 4 for X64, 5-6 for IBM zSeries. It has problems with sparseness and fragmentation, limited support for variable page sizes. It takes up too much space for translation tables (proportional to the size of virtual memory) and inefficiently uses table space with large fragmentation.
guarded top-down is an improved multi-level translation search scheme based on direct downward parsing of a virtual address, when each level of the table tree is directly indexed by the next portion of the virtual address, and omissions of some levels are possible. Harder for hardware implementation. All tree tables are placed in physical memory. The number of memory accesses for translation search may be less than the maximum number of levels. Reduces problems with sparseness and fragmentation, limited support for variable page sizes.
bottom-up is a scheme of the reverse recursive ascending order of viewing translation tables, when recursive misses are used in one large linear table located in virtual memory. Requires hardware implementation of nested interrupts. The number of memory accesses for searching for a translation depends on the number of recursive misses in the TLB and, at best, is 1, but in the worst case, it is proportional to the top-down method. Has problems with sparseness and fragmentation, limited support for variable page sizes. It takes up too much space for translation tables (in the worst case, it is proportional to the size of virtual memory) and inefficiently uses table space with large fragmentation.
inverted is a hashed page table. Its size is proportional to the size of physical (rather than virtual) memory and doesn't depend on the degree of fragmentation of the virtual space. The number of memory accesses for a translation search doesn't depend on the size of the page table and, if the hash function is chosen correctly and the hash table is sized correctly, is usually 1. It copes well with sparseness and fragmentation, but has limited support for variable page sizes. It caches poorly when looking up translations for neighboring pages.
In the architecture POSTRISC, a multi-level translation search scheme was chosen to implement the page table based on the direct top-down order of viewing the translation tables, when each next level is directly indexed by a new portion of the virtual address. The number of memory accesses for translation search is equal to the number of levels (variable, currently 3 levels). The page table is located in the physical memory space as a multi-level structure of service tables.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
sign extension | 11 bits | 11 bits | 11 bits | 16 KiB page offset |
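For illustration, here is a minimal C++ sketch (names are illustrative, not part of the emulator) that splits a virtual address according to the layout above, assuming 16 KiB pages (14-bit offset) and three 11-bit index fields:

#include <array>
#include <cstdint>

// Hypothetical helper: split a 64-bit virtual address into the page offset and
// the per-level page table indices shown in the diagram above.
struct va_split {
    uint64_t offset;                // bits 13..0, offset within a 16 KiB page
    std::array<unsigned, 3> index;  // 11-bit table index per level
};

inline va_split split_va(uint64_t va) {
    va_split s{};
    s.offset = va & ((uint64_t(1) << 14) - 1);
    for (int level = 0; level < 3; ++level) {
        s.index[level] = unsigned((va >> (14 + 11 * level)) & 0x7FF);
    }
    return s;
}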
In the event of a miss in the TLB hardware translation cache (the required entry is absent), the hardware translation search block (PTW) comes into play; if this block doesn't find the required translation in memory, an interrupt occurs and the software miss handler is called.
Special register page table address (pta) defines the search parameters for translation in memory for the virtual space, describes the location and size of the PT root page in the address space. The operating system must ensure that page tables are aligned naturally.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | ppn | 0 | mod | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reserved | ppn | ma | 0 | s | v | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reserved | ppn | ma | 0 | s | v | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reserved | ppn | ar | g | d | a | p | 0 | v |
Field | Bit | Description |
---|---|---|
mod | 3 | Translation mode: 0 - no translation, 1,2,3 and so on - the number of indexing levels when searching. |
v | 1 | Bit of validity. For intermediate and final formats, if 1 - the page entry is valid, otherwise a search error occurs. |
ppn | varied, 30-50 | Physical page number if p=1, or other system data if p=0. |
s | 1 | Superpage bit, stop the search (final format instead of intermediate). |
p | 1 | The page is in memory |
ma | 4 | Page Physical Attributes. Should be defined per superpage. |
a | 1 | Access Bit |
d | 1 | Dirty bit - indicates whether there were any changes in the page. When a page is pushed into a swap, it may not be saved if the page is already in the swap and has not changed. |
ar | 6 | Permissions |
rv | Reserved (must be zeros) |
The format of the page tables should take into account the mapping of virtual addresses to a physical address space of a total depth of 64 bits.
List of translation instructions. The processor doesn't guarantee that a modification of translation resources is observed by subsequent instruction fetches or data accesses. Software must provide serialization (by issuing a synchronizing barrier instruction): instruction serialization before any dependent instruction fetch, and data serialization before any dependent data access.
Syntax | Description |
---|---|
ptc ra,rb,rc | Purge translations cache |
ptri rb,rc | Purge the instruction translation register. ITR ← gr[rc], ifa |
ptrd rb,rc | Purge the data translation register. DTR ← gr[rc], ifa |
mprobe ra,rb,rc | Returns page permissions for the privilege level gr[rC] |
tpa ra,rb | Translates the virtual address to the physical address |
The ptc instruction invalidates all translations in the local processor cache matching the specified address and ASID. The processor determines the ASID-specific page that contains the address and invalidates all TLB entries for that page. The instruction deletes all translations from both translation caches that intersect with the specified address range. If the paging structures map the linear address using a large page and/or there are multiple TLB entries for that page, the instruction invalidates all of them.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | base | address | asid | 0 | opx |
Translation records can be inserted into fixed translation registers by the instructions mtitr (move to instruction translation register) and mtdtr (move to data translation register). The data for the inserted translation is taken from the first register argument of the instruction and the special register ifa. The translation register number is taken from the second argument register.
Translation records can be deleted from translation registers by instructions ptri (Purge Translation register for Instruction) and ptrd (Purge Translation register for Data). The first argument is the base address register number, the second argument is the register number that stores the translation register number. The instructions also delete all translations from both translation caches that intersect with the specified address range. The instructions only remove translations from the local processor registers.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | base | asid | 0 | opx |
Permissions for a virtual page can be queried by the instructions mprobe (memory probe) and mprobef (memory probe faulting). The mprobe instruction, for a given base address and privilege level, returns the mask of available rights. The privilege level is given as a value in a register. The mprobef instruction doesn't return rights, but tests for the required access rights for a given base address and privilege level. If the rights are absent, the mprobef instruction raises a «Data Access rights fault» error, otherwise the instruction does nothing.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | pl | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | base | pl | 0 | opx |
Privileged instruction tpa (translate to physical address) returns the physical address corresponding to the given virtual address.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | 0 | 0 | opx |
The common TLB/PT search sequence looks like this. If the TLB search fails and the PTW is disabled (pta.v=0), an ITLB/DTLB miss fault occurs. If the PTW is enabled (pta.v=1), the PTW calculates the index into the root page table and tries to find the missing translation in the PT in memory, walking the table tree. If additional TLB misses occur during the PTW operation, the PTW generates an error. If the PTW doesn't find the required translation in memory (that is, the PT doesn't contain it), or the search is interrupted, an instruction/data TLB miss fault occurs. Otherwise, the entry is loaded into the ITC or DTC. The processor may load entries into the ITC or DTC even if the program did not require the translation.
Insertions from the PT into the TC follow the same «clean before insert» rules as program insertions. PT insertion of entries that exist in TR registers is not allowed. Specifically, the PTW may search for any virtual address, but if the address is mapped by a TR, such a translation must not be inserted into the TC. Software should not place into the PT translations that overlap current TR translations. An insertion from the PT may result in an abnormal machine termination if the inserted PT entry overlaps a TR.
After the translation entry is loaded into the TLB, additional translation faults are checked (in priority order): insufficient page access rights, the access bit check, the dirty bit check, and absence of the page in memory.
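The sequence above can be summarized by the following self-contained C++ sketch; the callback parameters are stand-ins for real TLB/PTW state and the names are purely illustrative:

#include <cstdint>
#include <functional>

enum class xlate_result { ok, tlb_miss_fault, post_check_fault };

// Hedged sketch of the lookup order described above, not the emulator's real API.
xlate_result translate(uint64_t va,
                       const std::function<bool(uint64_t)>& tlb_hit,
                       bool ptw_enabled,                              // pta.v
                       const std::function<bool(uint64_t)>& ptw_fill, // walk PT, fill ITC/DTC
                       const std::function<bool(uint64_t)>& post_checks) {
    if (tlb_hit(va))
        return post_checks(va) ? xlate_result::ok : xlate_result::post_check_fault;
    if (!ptw_enabled)           // pta.v = 0: ITLB/DTLB miss fault
        return xlate_result::tlb_miss_fault;
    if (!ptw_fill(va))          // PT has no translation or the walk was interrupted
        return xlate_result::tlb_miss_fault;
    // rights, access bit, dirty bit, page-present checks after the TLB fill
    return post_checks(va) ? xlate_result::ok : xlate_result::post_check_fault;
}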
This chapter describes the floating-point and vector subsystem of the virtual processor instruction set.
The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE 754-1985) defines two floating-point formats – single and double precision – in two groups – basic and extended. The architecture supports all four formats in IEEE terminology: the basic single and double formats and the extended double format. The basic double format simultaneously serves as the extended single format.
The architecture defines the representation of floating-point values in four different fixed-length binary formats. The format can be 16-bit for half-float precision values, 32-bit for single precision values, 64-bit for double precision values, 128-bit for quadruple precision values. Values in each format are composed of three fields: sign bit (S), exponent (E), fractional part or mantissa (F).
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
S | Exp | Fraction |
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
S | Exp | Fraction | |||||||||||||
Fraction |
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
S | Exp | Fraction | |||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction |
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
S | Exp | ||||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction | |||||||||||||||
Fraction |
Single precision numbers occupy four adjacent bytes of memory, starting with an arbitrary address multiple of 4. Double precision numbers occupy eight adjacent bytes of memory, starting with an arbitrary address multiple of 8. Quadruple numbers occupy sixteen contiguous bytes of memory, starting with an arbitrary address multiple of 16.
The values represented within each format are determined by two integer parameters – the format width B and the number of exponent bits P. All other parameters are derived from these two.
Format Options | Half | Single | Double | Quadruple |
---|---|---|---|---|
Format bits B | 16 | 32 | 64 | 128 |
Exponent bits P (P<B) | 5 | 8 | 11 | 15 |
Sign bit S (1) | 1 | 1 | 1 | 1 |
Fraction bits FB: (B−P−1) | 10 | 23 | 52 | 112 |
Fraction significant bits (B−P) | 11 | 24 | 53 | 113 |
Significant decimal digits log10(2^(B−P)) | 3.311 | 7.225 | 15.955 | 34.016 |
Maximum exponent EMAX: (2^(P−1)−1) | 15 | 127 | 1023 | 16383 |
Minimum exponent EMIN: −(2^(P−1)−2) | −14 | −126 | −1022 | −16382 |
Exponent bias (2^(P−1)−1) | 15 | 127 | 1023 | 16383 |
Maximum biased exponent EBMAX: (2^P−1) | 31 | 255 | 2047 | 32767 |
bias adjustment 3×2^(P−2) | 24 | 192 | 1536 | 24576 |
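As a cross-check, the following small C++ program recomputes the derived parameters from B and P using the formulas in the table:

#include <cstdio>

int main() {
    struct { int B, P; } fmt[] = { {16, 5}, {32, 8}, {64, 11}, {128, 15} };
    for (auto f : fmt) {
        int fb = f.B - f.P - 1;                        // fraction bits
        long long emax  = (1LL << (f.P - 1)) - 1;      // also the exponent bias
        long long emin  = -((1LL << (f.P - 1)) - 2);
        long long ebmax = (1LL << f.P) - 1;            // maximum biased exponent
        std::printf("B=%3d P=%2d FB=%3d EMAX=%6lld EMIN=%7lld bias=%6lld EBMAX=%6lld\n",
                    f.B, f.P, fb, emax, emin, emax, ebmax);
    }
    return 0;
}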
The following table shows the exact limit values for the formats in decimal notation:
Limit | Value |
---|---|
Normalized values | (−1)^S × 1.F × 2^(E−EMAX) |
Maximum normalized values | (2.0−2^−FB) × 2^EMAX |
Single absolute maximum | 3.40282347e+38 |
Double absolute maximum | 1.7976931348623158e+308 |
Quadruple absolute maximum | 1.1897314953572317650857593266280070162e+4932 |
Minimum normalized values | 1.0 × 2^EMIN |
Single absolute minimum | 1.17549435e−38 |
Double absolute minimum | 2.2250738585072013e−308 |
Quadruple absolute minimum | 3.3621031431120935062626778173217526026e−4932 |
Subnormalized values | (−1)^sign × 0.fraction × 2^EMIN |
Maximum subnormalized values | (1−2^−FB) × 2^EMIN |
Quadruple maximum subnormal | 3.3621031431120935062626778173217519551×10−4932 |
Minimum subnormalized values | 1.0 × 2^(EMIN−FB) |
Single minimum (subnormal) | 1.401298464324817071e−45 (inaccurate) |
Double minimum (subnormal) | 4.940656458412465442e−324 (inaccurate) |
Quadruple minimum (subnormal) | 6.4751751194380251109244389582276465525×10−4966 |
The following objects are allowed within each format:
NAN is short for «not a number» (Not A Number). A NAN is an IEEE binary floating-point representation of something other than a number. NANs come in two forms: «signaling» NANs and «quiet» NANs.
Arithmetic with infinities is treated as if the operands were arbitrarily large quantities. Negative infinity is less than any finite number; positive infinity is greater than any finite number.
Notation: S is the sign bit, EXP is the biased exponent (i.e. reduced to unsigned form), F is the fractional part or mantissa (fraction), XXXXX is an arbitrary but non-zero sequence of bits, and EBMAX is the maximum biased unsigned exponent. The value of a floating-point number is interpreted as follows.
If EXP = EBMAX (all bits set), then this is a special IEEE value. To recognize the special values, F is examined further. If F is not equal to zero, then it is +NAN or −NAN. In particular, if the first bit of the mantissa is 0, then it is a signaling NAN, and if 1, then it is a «quiet» NAN. If EXP = EBMAX and F = 0, then it is «infinity», +INF or −INF depending on S. If 0 < EXP < EBMAX, then this is a finite normalized number. If EXP = 0 and the mantissa is not equal to zero, then this is a finite unnormalized number. If EXP = 0 and F = 0, then this is +0 or −0 depending on S.
Exponent | Fraction | IEEE value |
---|---|---|
EBMAX | 1XXXXXX | QNAN |
EBMAX | 0XXXXXX | SNAN |
EBMAX | 0 | INF |
0<E<EBMAX | any | Finite (Normalized): (−1)^S × 2^(E−BIAS) × 1.F |
0 | XXXXXXX | Finite (Denormal): (−1)^S × 2^EMIN × 0.F |
0 | 0 | ±0 |
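For the double format, the table corresponds to the following C++ sketch that inspects the raw 64-bit pattern (EBMAX = 0x7FF, 52 fraction bits); the function name is illustrative:

#include <cstdint>

// Interpret a raw 64-bit (double) pattern according to the table above.
const char* ieee_value_kind(uint64_t bits) {
    const uint64_t exp  = (bits >> 52) & 0x7FF;               // biased exponent, EBMAX = 0x7FF
    const uint64_t frac = bits & ((uint64_t(1) << 52) - 1);
    if (exp == 0x7FF) {
        if (frac == 0) return "INF";
        return ((frac >> 51) & 1) ? "QNAN" : "SNAN";          // high fraction bit selects quiet
    }
    if (exp == 0) return frac ? "denormal" : "zero";
    return "normalized";
}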
Floating-point operations can raise arithmetic exceptions for many reasons, including invalid operations, overflow, underflow, division by zero, and an inexact result.
NAN is the abbreviation for «not a number». A NAN is an IEEE binary floating-point representation of something other than a number. These are the values that have the maximum biased exponent and a non-zero fractional part. The sign bit is ignored (a NAN is neither positive nor negative), although it can be inspected. NANs come in two forms: signaling NANs and quiet NANs. If the high bit of the mantissa is zero, then it is a signaling NAN, otherwise a quiet NAN.
A signaling NAN (SNAN) is used to provide values for uninitialized variables and for arithmetic extensions. A signaling NAN reports an invalid operation when it is the operand of an arithmetic operation, and may raise an arithmetic exception. The signaling NAN is used to raise an exception when such a value appears as the operand of a computational instruction.
A quiet NAN (QNAN) provides retrospective diagnostic information about previous invalid or unavailable data and results. Quiet NANs propagate through almost every operation without generating arithmetic exceptions.
A QNAN is used to represent the results of some invalid operations, such as invalid arithmetic operations on infinities or on a NAN, when the invalid-operation exception is masked. Quiet NANs propagate through all floating-point operations except ordered comparisons (LT, LE, GT, GE) and conversions to integer, which report exceptions instead. QNAN codes can thus be preserved through a sequence of floating-point operations and used to carry diagnostic information, helping to identify the consequences of illegal operations.
When a QNAN is the result of a floating-point operation, either because one of the operands is a NAN or because the QNAN was generated due to a masked invalid-operation exception, the following rule determines which NAN (with the high mantissa bit set to 1) is saved as the result. If either operand is an SNAN, that SNAN is returned as the result of the operation. Otherwise, if a QNAN is generated because the invalid-operation exception is masked, that QNAN is returned as the result. If a new QNAN is generated as the result, it has a positive sign, an all-ones exponent, and the most significant bit of the mantissa set to 1 (all other mantissa bits 0). An instruction that generates a QNAN as a result of a masked invalid-operation exception should generate such a QNAN (e.g. 0x7FF8000000000000 for double).
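A short C++ check of the default QNAN pattern for the double format mentioned above (positive sign, all-ones exponent, only the top fraction bit set):

#include <cstdint>
#include <cstring>
#include <cstdio>
#include <cmath>

int main() {
    const uint64_t default_qnan = 0x7FF8000000000000ULL;
    double d;
    std::memcpy(&d, &default_qnan, sizeof d);   // reinterpret the bit pattern as a double
    std::printf("is nan: %d, sign bit: %d\n", std::isnan(d), std::signbit(d));
    return 0;
}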
Floating-point instructions provide a subset of the IEEE standard for binary floating-point arithmetic (ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic). The following describes how to create a full implementation of IEEE.
Four IEEE rounding modes are supported in hardware: to nearest, toward zero (truncation), toward plus infinity, and toward minus infinity. The hardware supports enabling/disabling IEEE software traps for special situations. Addition, subtraction, multiplication, division, conversion between floating-point formats, rounding to an integer in floating-point format, conversion between floating-point and integer formats, comparison, and square root are supported in hardware. The remainder of division and conversion between binary and decimal formats are supported in software. Copying (possibly with a sign change) without changing the format is not considered an operation (non-finite numbers are not checked). Operations on mixed formats are not provided; calculations are performed with the maximum accuracy available for the given vector format.
The precision of conversion between decimal strings and binary floating-point numbers is no less than the requirements of the IEEE standard. Whether the conversion procedures to decimal format treat any excess digits (beyond 9, 17 or 36 digits) as zeros is implementation dependent.
Overflow, underflow, NAN, and INF encountered by the binary-to-decimal conversion software return strings identifying these states.
The hardware supports comparisons of numbers of the same format. You can programmatically compare numbers with a different format. The result of the comparison is true or false. The hardware supports the required six predicates and the predicate of incomparability of numbers. The other 19 optional predicates can be created from comparisons and bitwise operations. Infinity is supported in hardware in comparison instructions.
QNANs provide retrospective diagnostic information. Copying signaling NANs without changing the format doesn't report an invalid-operation exception (the fmerge instructions also do not check for non-finite numbers).
The hardware fully supports negative zero operands and follows the IEEE rules for producing negative zero results. The hardware supports underflow and denormal numbers.
Tininess is detected by hardware after rounding, and loss of accuracy is detected by software as an inexact result.
Universal registers, 128 bits wide each, can store one quadruple-precision float, 2 double-precision, 4 single-precision, or 8 half-precision floats, or a vector of integers with element size 1, 2, 4, or 8 bytes.
register bytes | |||||||||||||||
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
half | half | half | half | half | half | half | half | ||||||||
single | single | single | single | ||||||||||||
double | double | ||||||||||||||
quadruple |
The special register fpcr governs the execution of floating-point and vector operations. It controls the arithmetic rounding mode for all instructions except explicit rounding instructions, indicates the allowed user-level traps, and stores the exceptions that have occurred, both for enabled and for masked traps.
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
IEEE masked flags | IEEE masked traps | IEEE nonmasked traps | control bits | ||||||||||||||||||||||||||||
0 | im | um | om | zm | dm | vm | 0 | i | u | o | z | d | v | 0 | i | u | o | z | d | v | 0 | td | ftz | 0 | rm |
bits | description |
---|---|
v | Invalid Operation |
d | Denormal/Unnormal Operand |
z | Zero Divide |
o | Overflow |
u | Underflow |
i | Inexact result |
td | Traps disabled |
rm | Rounding mode |
ftz | Flush-to-Zero mode (zeroing without underflow) |
The rm (rounding mode) bits control the rounding mode of the results. The rounding mode doesn't affect the execution of explicit rounding instructions, for which only the rounding mode specified directly in the instructions matters.
Rounding mode (RM) | Description |
---|---|
0 | Round to nearest (round) |
1 | Round toward minus infinity (floor) |
2 | Round toward plus infinity (ceil) |
3 | Round toward zero (chopping) |
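These four modes correspond to the standard C/C++ rounding directions; the sketch below demonstrates them with <cfenv>. The mapping of fpcr.rm codes 0-3 to these macros is an assumption made only for illustration:

#include <cfenv>
#include <cstdio>

int main() {
    const int modes[4] = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
    const char* names[4] = { "nearest", "floor", "ceil", "chop" };
    volatile double a = 1.0, b = 3.0;           // volatile: keep the division at run time
    for (int i = 0; i < 4; ++i) {
        std::fesetround(modes[i]);
        std::printf("%-7s 1.0/3.0 = %.20f\n", names[i], a / b);
    }
    return 0;
}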
The masked flags field stores a mask of flags that enable IEEE traps of the corresponding type. The bits of the nonmasked traps and masked traps fields store flags of the exceptions that occurred while the corresponding trap was enabled or, respectively, disabled by the masked flags field.
The fldi instruction is used to load immediate floating-point constants into registers. It allows loading floating-point constants representable in formats up to extended (80 bits) without loss of accuracy. The instruction doesn't allow setting zero values or special values, and has restrictions on the exponent value (6 bits). The instruction stores a 28-bit immediate (or 70 bits for a dual-slot instruction).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | s | exponent | mantissa (high 21 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
mantissa (full 63 bits) |
All computational operations are performed only on registers. The basic operation for maximum performance is the vector (or scalar) fused «multiply-add» operation, MAC (fused multiply-accumulate). Floating-point arithmetic instructions that fuse multiplication with addition, possibly with a sign change, are formed according to the FMAC rule.
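In C++ terms, the four fused forms listed in the table below behave like std::fma with sign changes applied before the single rounding; this is a sketch of the semantics, not of the hardware:

#include <cmath>

// One rounding per operation, as in a fused multiply-accumulate.
double fmadd (double b, double c, double d) { return std::fma( b, c,  d); } //  b*c + d
double fmsub (double b, double c, double d) { return std::fma( b, c, -d); } //  b*c - d
double fnmadd(double b, double c, double d) { return std::fma(-b, c,  d); } // -b*c + d
double fnmsub(double b, double c, double d) { return std::fma(-b, c, -d); } // -b*c - d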
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | src3 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | 0 | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | 0 | rm | opx |
The following table lists these instructions. They exist in the variants indicated by the suffix: half, single, double, or quadruple (h, s, d, q) respectively.
Scalar | Packed | Description |
---|---|---|
Fused instructions | ||
fmadds[h|s|d|q] | fmaddp[h|s|d|q] | a = b × c + d |
fmsubs[h|s|d|q] | fmsubp[h|s|d|q] | a = b × c − d |
fnmadds[h|s|d|q] | fnmaddp[h|s|d|q] | a = − b × c + d |
fnmsubs[h|s|d|q] | fnmsubp[h|s|d|q] | a = − b × c − d |
fmaddap[h|s|d] | ax = bx × cx + dx, ay = by × cy − dy | |
fmsubap[h|s|d] | ax = bx × cx − dx, ay = by × cy + dy | |
binary instructions | ||
fadds[h|s|d|q] | faddp[h|s|d|q] | a = b + c |
faddhp[h|s|d] | ax = bx + by, ay = cx + cy | |
faddcp[h|s|d] | ax = bx + cx, ay = bx − cy | |
fnadds[h|s|d|q] | fnaddp[h|s|d|q] | a = − (b + c) |
fsubs[h|s|d|q] | fsubp[h|s|d|q] | a = b − c |
fsubhp[h|s|d] | ax = bx − by, ay = cx − cy | |
fsubcp[h|s|d] | ax = bx − cx, ay = bx + cy | |
fabsds[h|s|d|q] | fabsdp[h|s|d|q] | a = abs (b − c) |
fnabsds[h|s|d|q] | fnabsdp[h|s|d|q] | a = − abs (b − c) |
fmuls[h|s|d|q] | fmulp[h|s|d|q] | a = b × c |
fdivs[h|s|d|q] | fdivp[h|s|d|q] | a = b/c |
fmins[h|s|d|q] | fminp[h|s|d|q] | a = min (b, c) |
fmaxs[h|s|d|q] | fmaxp[h|s|d|q] | a = max (b, c) |
famins[h|s|d|q] | faminp[h|s|d|q] | a = min (abs (b), abs (c)) |
famaxs[h|s|d|q] | famaxp[h|s|d|q] | a = max (abs (b), abs (c)) |
fcmps[h|s|d|q]oeq | fcmpp[h|s|d]oeq | fp compare ordered and equal |
fcmps[h|s|d|q]one | fcmpp[h|s|d]one | fp compare ordered and not-equal |
fcmps[h|s|d|q]olt | fcmpp[h|s|d]olt | fp compare ordered and less |
fcmps[h|s|d|q]ole | fcmpp[h|s|d]ole | fp compare ordered and less-equal |
fcmps[h|s|d|q]o | fcmpp[h|s|d]o | fp compare ordered |
fcmps[h|s|d|q]ueq | fcmpp[h|s|d]ueq | fp compare unordered or equal |
fcmps[h|s|d|q]une | fcmpp[h|s|d]une | fp compare unordered or not-equal |
fcmps[h|s|d|q]ult | fcmpp[h|s|d]ult | fp compare unordered or less |
fcmps[h|s|d|q]ule | fcmpp[h|s|d]ule | fp compare unordered or less-equal |
fcmps[h|s|d|q]uo | fcmpp[h|s|d]uo | fp compare unordered |
p[s|d]pk | pack two vectors into one | |
Conversion to integer with rounding | ||
fcvtiw2s[h|s|d|q] | fcvtiw2ps | convert signed word to floats |
fcvtuw2s[h|s|d|q] | fcvtuw2ps | convert unsigned word to floats |
fcvts[h|s|d|q]2iw | fcvtps2iw | convert floats to signed word |
fcvts[h|s|d|q]2uw | fcvtps2uw | convert floats to unsigned word |
fcvtid2s[h|s|d|q] | fcvtid2pd | convert signed doubleword to floats |
fcvtud2s[h|s|d|q] | fcvtud2pd | convert unsigned doubleword to floats |
fcvts[h|s|d|q]2id | fcvtpd2id | convert floats to signed doubleword |
fcvts[h|s|d|q]2ud | fcvtpd2ud | convert floats to unsigned doubleword |
fcvtiq2s[h|s|d|q] | convert signed quadword to floats | |
fcvtuq2s[h|s|d|q] | convert unsigned quadword to floats | |
fcvts[h|s|d|q]2iq | convert floats to signed quadword | |
fcvts[h|s|d|q]2uq | convert floats to unsigned quadword | |
Conversion to narrower float with rounding | ||
fcvts[s|d|q]2sh | convert float to half-float | |
fcvts[d|q]2ss | convert float to single float | |
fcvtsq2sd | convert float to double float | |
Extending to wider float instructions | ||
fextsh2ss | extend float to single float | |
fexts[h|s]2sd | extend float to double float | |
fexts[h|s|d]2sq | extend float to quadruple float | |
Rounding instructions | ||
frnds[h|s|d|q] | frndp[h|s|d] | floating-point round |
unary instructions | ||
fnegs[h|s|d|q] | fnegp[h|s|d] | floating-point negate value |
fabss[h|s|d|q] | fabsp[h|s|d] | floating-point absolute value |
fnabss[h|s|d|q] | fnabsp[h|s|d] | floating-point negate absolute value |
frsqrts[h|s|d|q] | frsqrtp[h|s|d] | floating-point reciprocal square root |
fsqrts[h|s|d|q] | fsqrtp[h|s|d] | floating-point square root |
funphp[h|s|d] | unpack the high half of the vector into a wider precision vector | |
funplp[h|s|d] | unpack the low half of the vector into a wider precision vector |
The fcmp instructions are intended for generating predicates from the results of floating-point comparisons. They produce boolean scalars/vectors as the result of a floating-point vector comparison. Floating-point numbers are compared by elementwise comparison of two vectors, with the result recorded in a third vector. All bits of the result elements for which the condition is satisfied are set to 1, the rest to 0. After the comparison, a single predicate bit can be obtained by performing a conjunction or disjunction of all bits of the result vector.
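A C++ sketch of this elementwise mask generation, here for a packed «ordered and less» comparison over four single-precision elements (the element layout and the function name are illustrative):

#include <array>
#include <cmath>
#include <cstdint>

std::array<uint32_t, 4> fcmpps_olt(const std::array<float, 4>& b,
                                   const std::array<float, 4>& c) {
    std::array<uint32_t, 4> result{};
    for (int i = 0; i < 4; ++i) {
        const bool ordered = !std::isnan(b[i]) && !std::isnan(c[i]);
        result[i] = (ordered && b[i] < c[i]) ? 0xFFFFFFFFu : 0u;   // all ones or all zeros
    }
    return result;
}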
For some instructions, the second operand is replaced with the 7-bit immediate value count from 0 to 127, which describes the accuracy of a non-pipelined unary operation, e.g. fsqrt or frcp.
The accuracy of the fsqrt, frcp and frsqrt instructions is indicated by the constant count directly in the instruction. With minimal accuracy, the instruction executes in the same time as a regular MAC, without pipeline delays.
Quadruple scalar | Scalar Double | Scalar Single | Description | ||
---|---|---|---|---|---|
branch if compare is true | |||||
bfsqoeq | bfsdoeq | bfssoeq | ordered and equal | ||
bfsqone | bfsdone | bfssone | ordered and not-equal | ||
bfsqolt | bfsdolt | bfssolt | ordered and less | ||
bfsqole | bfsdole | bfssole | ordered and less-or-equal | ||
bfsqo | bfsdo | bfsso | ordered | ||
bfsqueq | bfsdueq | bfssueq | unordered or equal | ||
bfsqune | bfsdune | bfssune | unordered or not-equal | ||
bfsqult | bfsdult | bfssult | unordered or less | ||
bfsqule | bfsdule | bfssule | unordered or less-or-equal | ||
bfsquo | bfsduo | bfssuo | unordered | ||
branch if classification is true | |||||
bfsqclass | bfsdclass | bfssclass | compare | ||
nullify if compare is true | |||||
nulfsqoeq | nulfsdoeq | nulfssoeq | ordered and equal | ||
nulfsqone | nulfsdone | nulfssone | ordered and not-equal | ||
nulfsqolt | nulfsdolt | nulfssolt | ordered and less | ||
nulfsqole | nulfsdole | nulfssole | ordered and less-or-equal | ||
nulfsqo | nulfsdo | nulfsso | ordered | ||
nulfsqueq | nulfsdueq | nulfssueq | unordered or equal | ||
nulfsqune | nulfsdune | nulfssune | unordered or not-equal | ||
nulfsqult | nulfsdult | nulfssult | unordered or less | ||
nulfsqule | nulfsdule | nulfssule | unordered or less-or-equal | ||
nulfsquo | nulfsduo | nulfssuo | unordered | ||
nullify if classification is true | |||||
nulfsqclass | nulfsdclass | nulfssclass |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src1 | src2 | opx | disp17x16 |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src1 | src2 | opx | dist-no | dist-yes | opx |
The branch-on-floating-point-classification instructions check the class of a floating-point value. The floating-point classification instructions use a 7-bit immediate mask whose flags describe which floating-point value classes meet the condition.
Classification flag | Description | Assembler mnemonic |
---|---|---|
0x01 | Zero | @zero |
0x02 | Negative | @neg |
0x04 | Positive | @pos |
0x08 | Infinity | @inf |
0x10 | Normalized | @norm |
0x20 | Denormalized | @denorm |
0x40 | NaN (Quiet) | @nan |
0x80 | fixme: no place for Signaling NaN | @snan |
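A C++ sketch that builds this mask for a double value using the standard classification functions; how the sign flags combine with zero and NaN here is an assumption for illustration, and there is no separate signaling-NaN flag, as the table notes:

#include <cmath>
#include <cstdint>

uint32_t fp_class_mask(double x) {
    uint32_t mask = 0;
    switch (std::fpclassify(x)) {
        case FP_ZERO:      mask |= 0x01; break;   // @zero
        case FP_INFINITE:  mask |= 0x08; break;   // @inf
        case FP_NORMAL:    mask |= 0x10; break;   // @norm
        case FP_SUBNORMAL: mask |= 0x20; break;   // @denorm
        case FP_NAN:       mask |= 0x40; break;   // @nan (quiet)
    }
    if (!std::isnan(x))
        mask |= std::signbit(x) ? 0x02 : 0x04;    // @neg / @pos
    return mask;
}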
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src | classify | opx | disp17x16 |
The instructions for nullification on floating-point classification nfclsd, nfclsq, nfclss check floating-point value class.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | src | classify | 0 | dist-no | dist-yes | opx |
The instructions for manipulating real registers as bit vectors are independent of the type of data stored in the registers. They are intended for conditional movements, operations on bit masks, generation of predicates.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | 0 | opx |
Name | Description |
---|---|
vsll | shift left |
vsrl | shift right |
vrll | rotate left |
vrrl | rotate right |
p1perm | permute bytes |
lvsr | vector load for shift left (permutation) |
Instruction vsel (vector bitwise select) produces a bitwise selection of two registers based on the contents of the third register, where the bit mask is the preliminarily computed result of a logical operation or a comparison operation.
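Per 64-bit half of a register, the selection amounts to the classic bitwise blend sketched below; which operand is taken where the mask bit is set is an assumption of this sketch:

#include <cstdint>

// For every bit: take it from a where mask is 1, from b where mask is 0.
inline uint64_t vsel64(uint64_t a, uint64_t b, uint64_t mask) {
    return (a & mask) | (b & ~mask);
}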
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | src3 | opx |
Instructions dep16 (vector deposit) and srp16 (vector shift right pair) produce a bitwise selection of two registers. The dep16 instruction takes the first count bits of the result from the first operand register and the remaining bits from the second operand register. The srp16 instruction takes the first count bits of the result from the upper part of the first operand register and the remaining bits from the lower part of the second operand register.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | count | opx |
These are DSP (digital signal processing) instructions for working with multimedia integer data. The instructions are formed according to the FBIN rule (format). The first register is the result; the second and third are operands.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | 0 | opx |
The size of vector elements can be 1, 2, 4, or 8 bytes. Calculations can be performed modulo (wrap-around) or with saturation. Saturation can be signed or unsigned. Modular arithmetic can be truncating or can produce a carry-out.
Name | Description | Element Size |
---|---|---|
vaddc* | add carryout unsigned | 1,2,4,8 |
vaddu* | add unsigned modulo | 1,2,4,8 |
vaddo* | add overflow | 1,2,4,8 |
vaddss* | add signed saturate | 1,2,4,8 |
vaddus* | add unsigned saturate | 1,2,4,8 |
vavgs* | average signed | 1,2,4,8 |
vavgu* | average unsigned | 1,2,4,8 |
vcmpeq* | compare equal | 1,2,4,8 |
vcmplts* | compare less than signed | 1,2,4,8 |
vcmpltu* | compare less than unsigned | 1,2,4,8 |
vmaxs* | maximum signed | 1,2,4,8 |
vmaxu* | maximum unsigned | 1,2,4,8 |
vmins* | minimum signed | 1,2,4,8 |
vminu* | minimum unsigned | 1,2,4,8 |
vmrgh* | merge high | 1,2,4,8 |
vmrgl* | merge low | 1,2,4,8 |
vpkssm* | pack signed as signed modulo | 2,4,8 |
vpksss* | pack signed as signed saturate | 2,4,8 |
vpksum* | pack signed as unsigned modulo | 2,4,8 |
vpksus* | pack signed as unsigned saturate | 2,4,8 |
vpkuum* | pack unsigned as unsigned modulo | 2,4,8 |
vpkuus* | pack unsigned as unsigned saturate | 2,4,8 |
vrol* | rotate left | 1,2,4,8 |
vror* | rotate right | 1,2,4,8 |
vsll* | shift left logical | 1,2,4,8 |
vsra* | shift right algebraic | 1,2,4,8 |
vsrl* | shift right logical | 1,2,4,8 |
vsubb* | subtract carryout unsigned | 1,2,4,8 |
vsubu* | subtract unsigned modulo | 1,2,4,8 |
vsubus* | subtract unsigned saturate | 1,2,4,8 |
vsubss* | subtract signed saturate | 1,2,4,8 |
vupkhs* | unpack high signed | 1,2,4 |
vupkls* | unpack low signed | 1,2,4 |
In the table, the asterisk (*) stands for the size of the vector elements: 1, 2, 4, or 8.
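As an example of the saturating variants listed above, here is a C++ sketch of the vaddss semantics for 1-byte elements over a 16-byte register (element order and names are illustrative):

#include <algorithm>
#include <array>
#include <cstdint>

std::array<int8_t, 16> vaddss_bytes(const std::array<int8_t, 16>& b,
                                    const std::array<int8_t, 16>& c) {
    std::array<int8_t, 16> a{};
    for (int i = 0; i < 16; ++i) {
        const int sum = int(b[i]) + int(c[i]);
        a[i] = int8_t(std::clamp(sum, -128, 127));   // saturate instead of wrapping
    }
    return a;
}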
This chapter describes the extended virtual processor instruction set which was not included in the basic set.
To simplify addressing, several instructions have been introduced that calculate effective addresses without accessing memory. The ldax instruction returns the effective address computed with indexed addressing.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | index | scale | sm | disp |
The instruction ldar (load address relative) calculates an ip-relative base address in the same way as a jump instruction. The first argument is the number of the result register, the second is the distance in instruction bundles from the current position (in assembler, this is a label in the code section, or a label in the immutable data section aligned on a 16-byte boundary). It is used to get the base address of immutable data in a code section, a function address, or a label. The instruction doesn't generate interrupts.
ldar dst, label
This instruction is necessary for position-independent code to get the absolute address of objects stored at a fixed distance from the current position, for example intra-module procedures or immutable local module data. On MAS (Multiple Address Spaces) systems, where a module's private data is stored at a fixed distance from its code section, it can also be used to obtain the absolute base address of the module's private data.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | label (28 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | label (expanding to 60 bits instead of 28) |
The instruction is formed according to the ldar rule. The result register is followed by a 28-bit field encoding the offset relative to the instruction pointer. The data block must be aligned at least on a 16-byte boundary, since the offset expresses a distance in instruction bundles, not bytes. The general formula for obtaining the address:
gr[dst] = ip + 16 × sign_extend(label)
The offset field, 28 bits long (60 bits for a dual-slot instruction, as shown above), after sign extension and a left shift by 4 positions, is added to the contents of the instruction pointer ip to produce a 64-bit effective address. The maximum distance for a one-slot instruction is 2 GiB on either side of the instruction pointer. The ldar instruction allows the immediate value to be continued in the next slot of the bundle, forming a dual-slot instruction.
The ldar instruction might be used to compute the address of static module data, but the dedicated instruction ldafr (load address forward relative) is intended specially for this; it can address any byte, not only 16-byte bundle-aligned locations. It computes the effective address in the same way as all ip-relative load/store instructions. Byte granularity would reduce the maximum available distance 16 times, but since only forward references with an unsigned offset are possible, the reduction is only 8 times. To use ldafr, the distance from the current bundle to the data should not exceed 256 MiB, which is usually enough for any module.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | label (28 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | label (expanding to 64 bits instead of 28) |
gr[dst] = ip + zero_extend(label)
If a signed constant fits in 28 bits, it is more efficient to use the ldi instruction, and if it fits in 56 bits, ldi together with ldan. However, when loading constants in bulk, a single ldar instruction is shared by several load instructions, and then a pair of ldi and ldan instructions is less compact than a single ldw instruction. For loading 8-byte integer constants, floating-point constants, and vector constants, using ldar together with ld8, ld4 and other load instructions is the recommended, and often the only possible, way to load such constants.
Base-plus-offset addressing allows addressing 1 MiB on either side of the base address when using one-slot instructions (21-bit offset). If the object is beyond 1 MiB, dual-slot instructions have to be used. But, by the principle of locality of access, the program will most likely access further objects located near the first one. This fact can be exploited: compute once a base address from which several needed objects lie no further than 1 MiB, and then use one-slot instructions to address them.
The instruction ldan (load address near) computes the nearest base address. It is used to optimize local (in place and time) memory accesses without using dual-slot instructions and long offsets. Another nearest-base-address instruction is ldanrc (load address near relative consistent).
ldan dst, base, simm
ldanrc dst, base, simm
The first argument is the result register number, the second is the base address register number, and the third is an immediate value 21 bits long (or 63 bits for a dual-slot instruction), sign-extended to 64 bits. The instruction allows the immediate value to be continued, up to 63 bits, in the next slot of the bundle, forming a dual-slot instruction.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | base | simm (21 bits) |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | (44 bits instead of 21) |
The target 64-bit address is calculated (for ldan and ldanrc) as:
gr[dst] = gr[base] + (simm << 20)
gr[dst] = ip + gr[base] + (simm << 20)
The following example shows how to use the ldan instruction to access a group of closely spaced (no more than 512 KiB from each other), but far-away data (the distance to the sym object is more than 512 KiB from the base address).
Without using ldan (4 double instructions, 8 slots)
ldsw.l %r1, base, sym + 4
ldw.l %r2, base, sym + 8
std.l %r2, base, sym + 16
ldd.l %r3, base, sym + 32
Using ldan (5 single instructions, 5 slots)
ldan tmp, base, data_hi(sym)     ; put the nearest base address in tmp
ldsw g11, tmp, data_lo(sym) + 4  ; addressing relative to tmp
ldw g12, tmp, data_lo(sym) + 8
std g12, tmp, data_lo(sym) + 16
ldd g13, tmp, data_lo(sym) + 32
For hardware support for long arithmetic, it is advisable to add special instructions. In the general case, for intermediate addition/subtraction of parts of high precision numbers it is required to specify the incoming carry (borrow), two operands, the result and the outgoing carry (borrow).
When all dependencies are coded explicitly and global flags are not used (which is good for parallel/pipelined execution of instructions), 5 parameters are required: the result, two operands, and the input and output carry/borrow. There is not enough space in the instruction for all five parameters. Therefore, the high part of the 128-bit registers is used to return the carry/borrow.
A special instruction mulh (multiply high) was introduced for hardware support for multiplying long numbers calculating the upper half of a 128-bit product of two 64-bit numbers.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | rc | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | ra | rb | rc | rd | opx |
Syntax:
addc ra, rb, rc
addaddc ra, rb, rc, rd
subb ra, rb, rc
subsubb ra, rb, rc, rd
mulh ra, rb, rc
Name | Operation | Description |
---|---|---|
addc | add with carry | ra = carry (rb + rc), sum (rb + rc) |
subb | subtract with borrow | ra = borrow (rb − rc), rb − rc |
addaddc | add and add with carry | ra = carry (rb + rc + rd.high), rb + rc + rd.high |
subsubb | subtract and subtract with borrow | ra = borrow (rb − rc − rd.high), rb − rc − rd.high |
It is assumed that numbers of arbitrary length are already loaded into the registers. For example, the addition of 256-bit numbers will occur as follows:
addc a1, b1, c1         ; sum of lower parts, first carry-out
addaddc a2, b2, c2, a1  ; sum of middle parts and carry-in, next carry-out
addaddc a3, b3, c3, a2  ; sum of middle parts and carry-in, next carry-out
addaddc a4, b4, c4, a3  ; sum of higher parts and carry-in, last carry-out
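The same 256-bit addition can be sketched in C++, with the carry produced by each limb feeding the next one, mirroring how addaddc consumes the carry from the high half of the previous result register (limb order and names are illustrative):

#include <array>
#include <cstdint>

std::array<uint64_t, 4> add256(const std::array<uint64_t, 4>& b,
                               const std::array<uint64_t, 4>& c) {
    std::array<uint64_t, 4> a{};
    uint64_t carry = 0;                        // no carry into the lowest limb (addc)
    for (int i = 0; i < 4; ++i) {
        const uint64_t s = b[i] + c[i];
        const uint64_t carry1 = (s < b[i]);    // carry from b + c
        a[i] = s + carry;
        carry = carry1 + (a[i] < s);           // at most one of the two adds carries
    }
    return a;
}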
The syscall (system call) instruction calls the system kernel to process a system request. The system call number is taken from r1, and the arguments from subsequent registers.
Unlike interrupts, a system call is analogous to a function call and similarly returns to the next bundle. Therefore, after the syscall instruction in assembler, you need to put a label to ensure that the subsequent instructions fall into a new bundle. Future predication bits are cleared.
The register frame is rotated, and the return address is stored in the zero register of the new frame. Subsequent local registers contain the syscall arguments.
The sysret (system return) instruction returns from the system request handler that was called using syscall. The instruction uses the return address and frame state from the zero register.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
The int (interrupt) instruction is provided for programmatically sending interrupts to the current core. The sent interrupt doesn't happen synchronously with the instruction stream; it can be delayed until the moment the corresponding vector is unmasked. For a user-mode program with all interrupts unmasked, the sent interrupt happens synchronously with the instruction stream. The interrupt index is calculated as gr[src] + simm10. The instruction supports both styles of passing the interrupt code: hardcoded codes with the zero register gz, or dynamic code passing.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | src | simm10 | opx |
The rfi instruction (return from interruption) returns from the interrupt handler. It returns to the beginning of the bundle containing the interrupted incomplete instruction (in case of an error), or to a bundle containing the subsequent instruction (in the case of a trap).
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
Name | Operation |
---|---|
aesdec ra, rb, rc | aes decrypt round |
aesdeclast ra, rb, rc | aes decrypt last round |
aesenc ra, rb, rc | aes encrypt round |
aesenclast ra, rb, rc | aes encrypt last round |
aesimc ra, rb | aes inverse mix columns |
aeskeygenassist ra, rb, uimm8 | aes key generation assist |
clmulll ra, rb, rc | carry-less multiply low parts |
clmulhl ra, rb, rc | carry-less multiply high and low parts |
clmulhh ra, rb, rc | carry-less multiply high parts |
crc32c ra, rb, rc, rd | crc32c hash |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src1 | src2 | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | 0 | 0 | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | round constant | opx |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | prev | data | len | opx |
The crc32c instruction computes a crc32c hash. The new hash value is based on the previous hash value «prev». The hashed data is in register «data». The len parameter may be any value; if it is bigger than 16, only the 16 bytes of data in the register are used.
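For reference, the following C++ sketch implements the CRC32C (Castagnoli) checksum with the reflected polynomial 0x82F63B78, which is what the crc32c mnemonic refers to; it illustrates the checksum semantics only, not the instruction's register operand handling:

#include <cstddef>
#include <cstdint>

uint32_t crc32c_bytes(uint32_t crc, const uint8_t* data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));  // reflected polynomial
    }
    return ~crc;
}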
The special random instructions are designed to generate random numbers. Each execution returns the next 64-bit random number. The instruction returns random numbers compliant with the «U.S. National Institute of Standards and Technology (NIST)» standards on random number generators.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | src | 0 | 0 | opx |
The src operand specifies the used random generator.
Instruction | Source | NIST Compliance |
---|---|---|
rdrand(0) | Cryptographically secure pseudorandom number generator | SP 800-90A |
rdseed(1) | Non-deterministic random bit generator | SP 800-90B & C (drafts) |
The numbers returned by rdseed are referred to as "seed-grade entropy" and are the output of a true random number generator (TRNG), or an enhanced non-deterministic random number generator (ENRNG) in NIST-speak. rdseed is intended for use by software vendors who have an existing PRNG, but would like to benefit from the hardware entropy source. With rdseed you can seed a PRNG of any size.
The numbers returned by rdseed have multiplicative prediction resistance. If you use two 64-bit samples with multiplicative prediction resistance to build a 128-bit value, you end up with a random number with 128 bits of prediction resistance (2^128 × 2^128 = 2^256). Combine two of those 128-bit values together, and you get a 256-bit number with 256 bits of prediction resistance. You can continue in this fashion to build a random value of arbitrary width and the prediction resistance will always scale with it. Because its values have multiplicative prediction resistance, rdseed is intended for seeding other PRNGs.
In contrast, rdrand is the output of a 128-bit PRNG that is compliant with «NIST SP 800-90A». It is intended for applications that simply need high-quality random numbers. The numbers returned by rdrand have additive prediction resistance because they are the output of a pseudorandom number generator. If you put two 64-bit values with additive prediction resistance together, the prediction resistance of the resulting value is only 65 bits (2^64 + 2^64 = 2^65). To ensure that rdrand values are fully prediction-resistant when combined together to build larger values, you can follow the procedures in the «DRNG Software Implementation Guide» on generating seed values from rdrand, but it's generally best and simplest to just use rdseed for PRNG seeding.
The decision for which generator to use is based on what the output will be used for. Use rdseed if you wish to seed another pseudorandom number generator (PRNG), use rdrand for all other purposes. rdseed is intended for seeding a software PRNG of arbitrary width. rdrand is intended for applications that merely require high-quality random numbers.
The cpuid instruction is used to dynamically identify which features of POSTRISC are implemented by the running processor. The implemented functional characteristics of the instruction set are recorded in a series of configuration information words. One configuration information word is read each time the cpuid instruction is executed. The number of the configuration information word to be accessed is computed as gr[index] + sext(simm10). The 64-bit configuration information is written into the general register dst.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | dst | index | simm10 | opx |
Syntax:
cpuid ra, rb, simm10
The configuration information word contains a series of configuration bit fields. For example, the PALEN field, which holds the number of supported physical address bits in bits 11 to 4 of configuration word number 1, is written as cpuid.1.PALEN[11:4].
The configuration information accessible by the cpuid instruction is listed in the table below. cpuid access to undefined configuration words causes general protection exception. The reserved fields in the defined configuration words read back zero values.
Word number | Bit field | Description |
---|---|---|
0 | 31:0 | number of implemented configuration words |
1 | 47:32 | vendor |
31:16 | version | |
15:0 | revision | |
1 | 63:0 | capabilities flags |
2 | 63:0 | L1I info |
3 | 63:0 | L1D info |
4 | 63:0 | L2D info |
5 | 63:0 | L3D info |
6 | 63:0 | L1 ITLB |
7 | 63:0 | L1 DTLB |
8 | 63:0 | L2 TLB |
9 | 63:0 | PMR info |
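Once a configuration word has been read by cpuid into a general register and stored, individual fields can be extracted with ordinary shift-and-mask code. A minimal C++ sketch for the cpuid.1.PALEN[11:4] example above (the helper name is illustrative):

#include <cstdint>

// Extract bits hi..lo (inclusive) of a 64-bit configuration word.
inline uint64_t cpuid_field(uint64_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((uint64_t(1) << (hi - lo + 1)) - 1);
}

// usage (word1 previously read by the cpuid instruction):
// uint64_t palen = cpuid_field(word1, 11, 4);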
Currently the OS and standard libraries are not implemented for the virtual processor. Therefore, a few special instructions have been added to provide a minimal emulation of them.
The write instruction is for outputting a formatted string. It uses forward ip-relative addressing to address the format string. An unsigned 28-bit ip-relative offset gives a maximum distance of 256 MiB forward from the current position for a one-slot instruction, and all of the available address space for a dual-slot instruction. The write instruction allows the immediate value to be continued in the next bundle slot, forming a dual-slot instruction. It is assumed that the effective address points to a zero-terminated string. In assembler, you can use both labels of strings in the rodata section and string literals directly (the assembler will place them in the rodata section and insert the offset into the instruction).
ea = ip + zext(disp)
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | opx | disp (28 bits) |
The following formatters are used to display the content of the current core registers. The common syntax is «%formatter(register)» or «%m(command)».
formatter | description |
---|---|
%% | % |
%c | low part of general register as a 1-byte character |
%i8, %i16, %i32, %i64 | low part of general register as a signed decimal value |
%u8, %u16, %u32, %u64 | low part of general register as an unsigned decimal value |
%x8, %x16, %x32, %x64, %x128 | low part of general register as an unsigned hexadecimal value |
%b8, %b16, %b32, %b64 | low part of general register as a binary value |
%f32, %f64, %f128 | low part of general register as a floating-point value |
%vf32, %vf64 | general register as a vector of floating-point values |
%vi8, %vi16, %vi32, %vi64 | general register as a vector of signed decimal values |
%vu8, %vu16, %vu32, %vu64 | general register as a vector of unsigned decimal values |
%vx8, %vx16, %vx32, %vx64 | general register as a vector of hexadecimal values |
%m(dump) | full core state dump |
The halt instruction, without parameters, is intended to turn off the processor core, switching it to the deepest level of sleep without saving state, from which the core may exit only via the reset signal. In the emulator, this instruction shuts down the emulator.
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | 0 | opx |
Notes: The halt instruction is not used in unittests, because it is automatically added for each test source by test scripts.
This chapter gathers information about the ABI – the application binary interface. It covers the runtime and program model, the sections and segments of a program, how a program finds its private data, the available addressing methods, the accepted conventions for relationships between procedures and register preservation, relocation types, and object file formats.
Depending on the execution environment (hardware capabilities of the target architecture, type of operating system), program models are, in historical order of appearance and growing universality:
For the POSTRISC system, a combined SAS/MAS environment with implicit segmentation was selected, when each segment can be configured as SAS or MAS.
The compiler divides the different parts of the generated object code and data into different sections. During linking, when combining object files, sections with the same name come together and consolidate, getting an output file with one instance of each type of section. These sections of the output file are further grouped into several segments, which are processed by the loader as indivisible units.
The purpose of sections is to allow the compiler to generate separate pieces of code and data that can be combined with similar parts from other object files at build time. This provides link-time locality and confidence that the contents of these sections are correctly addressable. The most important section attribute is the access type of the section pages; all data in one section shares the same minimum set of permissions.
The purpose of segments is to allow the linker to group sections into fewer program units. Each segment has its own addressing methods, and sections of one segment are addressed in the same way. The compiler may assume that any two objects in the same segment have a fixed offset relative to each other when the program is executed, but cannot assume the same for two objects in different segments.
The runtime architecture also defines some additional segments that do not get their contents directly from the compiled object file. These segments – the heap, the stack, and shared memory segments – are created at program startup or dynamically at runtime.
segment | section | type of program | Description |
---|---|---|---|
TEXT | header | all | file header |
sectab | all | section heading table | |
shstrtab | all | section names | |
.dynamic | shared | Dynamic linking information Header | |
.liblist | shared | A list of the names of the required shared libraries | |
.rel.dyn | shared | Relocation for DATA process data | |
.rel.tdata | shared | Relocation for TDATA thread data | |
.conflict | shared | Additional dynamic linking information | |
.msym | shared | Additional dynamic linking symbol table | |
.dynstr | shared | Name of linking external functions | |
.dynsym | shared | Link table of external functions | |
.hash | shared | Hash table for quick search in the export table | |
.rconst | all | Read-only constants (no configuration) | |
.rodata | all | Immutable global data (setting at first boot into the system) | |
.lita | nonshared | Literal address pool section | |
.lit | all | Literals (Literal pool section) | |
.tlsinit | all | Initial copy of TDATA data | |
.pdata | all | Exception procedure table | |
.text | all | Main program code (not corrected during loading, it is possible to configure it at the first load in the system) | |
.init | all | Section of the program initialization code | |
.fini | all | Program Termination Code Section | |
.comment | all | Comment Section | |
TEXT but not downloadable | rsrc | all | Compiled resources |
line | all | Debug information | |
debug | all | Debug information | |
unwind | all | Table for stack rollback after exceptional situations | |
unwind_info | all | Blocks of information to roll back the stack after exceptional situations | |
DATA | .data | all | Initialized private process data (setting at boot) |
.xdata | all | Exception scope table | |
.sdata | all | Near-address small data initialized private process data (setting at boot) | |
.got | shared | GOT table (Global offset table) for references to DATA variables of other modules | |
.sbss | all | Small-address (small bss) uninitialized private process data | |
.bss | all | Uninitialized private process data | |
TDATA | .tdata | all | Initialized thread local data (setting at boot) |
.tsdata | all | Near-address small data initialized thread local data (setting at boot) | |
.tgot | shared | Module GOT table for the thread (links to TDATA variables of other modules) | |
.tsbss | all | addressable small (bss) uninitialized thread local data | |
.tbss | all | Uninitialized thread local data |
A program in the POSTRISC architecture consists of a main program module, dynamically loaded libraries (which are themselves program modules), the stacks of the main and other threads, and several heaps. Each program module consists of four types of sections.
The TEXT segment is shared by all processes in the system and is read-only and executable. The addressing within the segment is relative to the instruction pointer. Its CODE section contains program code. Its RODATA section contains immutable data, placed after the CODE section.
The DATA segment contains private process data. The segment is read-write. The addressing within the segment is relative to the instruction pointer. The DATA segment of the main software module, in addition to its private data, contains a table of base addresses for all DATA segments of dynamically loaded libraries.
The TDATA segment contains private thread data. The segment is read-write. After creation, the segment is at an unknown distance from all other segments and is addressed relative to the dedicated base register tp. The TDATA segment of the main program module, in addition to its private data, contains a table of base addresses for the TDATA segments of all dynamically loaded libraries.
There are several data models for binding fundamental integer scalar data types from programming languages to architectural data types.
Data model | 1-byte types | 2-byte types | 4-byte types | 8-byte types |
---|---|---|---|---|
ILP16 | char | short int, int, long int, near pointer | ||
LP32 | char | short int, int, near pointer | long int, far pointer | |
ILP32 | char | short int | int, long int, pointer | long long int |
LLP64 | char | short int | int, long int | long long int, pointer |
LP64 | char | short int | int | long int, long long int, pointer |
ILP64 | char | short int | wchar_t | int, long int, long long int, pointer |
The ILP16 variant was used by very old 16-bit systems; MS DOS used LP32; all 32-bit systems use ILP32; Microsoft chose LLP64 for 64-bit Windows; LP64 is used by 64-bit Linux and most other 64-bit Unix systems; ILP64 is used in some Unix variants.
The choice between LLP64, LP64, and ILP64 is driven by different criteria. If you need to support (without recompiling) an existing body of 32-bit software when migrating to 64-bit systems, LLP64 is the best choice; the disadvantage is that adapting code to 64 bits requires deep modernization of the program. If you want existing code to take advantage of 64-bit addressing with minimal rework, ILP64 is a good fit; the disadvantage is that such a superficial upgrade wastes memory where 64 bits are not needed. If you want a balance between the complexity of converting to 64-bit systems and the need to support existing 32-bit programs, choose LP64. ILP64 was chosen for POSTRISC, with the addition of the new fundamental type long char to describe four-byte characters (wchar_t).
Data Type | Size and alignment | Machine Type |
---|---|---|
signed char | 1 (1) | signed byte |
unsigned char | 1 (1) | unsigned byte |
char | 1 (1) | byte, the sign depends on the compiler |
bool | 1 (1) | unsigned byte, 0 or 1 |
[signed] short int | 2 (2) | signed 2-byte |
unsigned short int | 2 (2) | unsigned 2-byte |
[signed] long char | 4 (4) | signed 4-byte |
unsigned long char | 4 (4) | unsigned 4-byte |
enum | 1,2,4,8 | depends on the range of values |
[signed] int | 8 (8) | signed 8-byte |
unsigned int | 8 (8) | unsigned 8-byte |
[signed] long int | 8 (8) | signed 8-byte |
unsigned long int | 8 (8) | unsigned 8-byte |
[signed] long long int | 8 (8) | signed 8-byte |
unsigned long long int | 8 (8) | unsigned 8-byte |
data pointer: type * | 8 (8) | unsigned 8-byte |
function pointer: type (*) () | 8 (8) | unsigned 8-byte |
float | 4 (4) | IEEE single |
double | 8 (8) | IEEE double |
long double | 16 (16) | IEEE quadruple |
Aggregate data types (structures – struct, class – and arrays) and unions (union) are aligned according to their most strictly aligned member. The size of any object, including aggregates and unions, is always a multiple of the object's alignment. An array uses the same alignment as its elements. Structure and union objects may require padding to meet size and alignment restrictions; the content of any padding is undefined.
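As an illustration of these rules, here is a small C++ sketch. The sizes follow the table above and hold only for a compiler targeting the POSTRISC ILP64 model described here; the ILP64-specific assertions are commented out because a typical LP64 host compiler would fail them, and the struct is invented purely for illustration.

// Illustrative only: encodes the scalar sizes from the table and the aggregate alignment rule.
static_assert(sizeof(char) == 1 && alignof(char) == 1, "1-byte char");
static_assert(sizeof(short) == 2 && alignof(short) == 2, "2-byte short");
static_assert(sizeof(long long) == 8, "8-byte long long");
static_assert(sizeof(float) == 4 && sizeof(double) == 8, "IEEE single and double");
// static_assert(sizeof(int) == 8 && sizeof(long) == 8, "ILP64: int and long are 8 bytes");
// static_assert(sizeof(void*) == 8, "8-byte data pointer");

struct Sample {     // aggregate alignment follows the most strictly aligned member
    char   c;       // 1 byte, followed by 7 bytes of padding
    double d;       // 8 bytes, forces 8-byte alignment of the whole struct
};
static_assert(alignof(Sample) == alignof(double), "aggregate takes member alignment");
static_assert(sizeof(Sample) % alignof(Sample) == 0, "size is a multiple of alignment");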
C structures and unions can contain bit fields, which define integer objects with a specified number of bits. The table shows the permissible bit-field widths for each base type and the corresponding value limits.
Data Type | Field Width W | Limits |
---|---|---|
char, signed char | 1-8 | −2^(W−1) … 2^(W−1) − 1 |
long char, signed long char | 1-16 | |
short, signed short, enum | 1-32 | |
int, signed int | 1-64 | |
long, signed long | 1-64 | |
long long, signed long long | 1-64 | |
unsigned char | 1-8 | 0 … 2^W − 1 |
unsigned long char | 1-16 | |
unsigned short | 1-32 | |
unsigned int | 1-64 | |
unsigned long | 1-64 | |
unsigned long long | 1-64 |
Bit fields whose base type (except for enumerated types) is written without an explicit signed or unsigned qualifier are considered unsigned (fixme). Bit fields of enumerated types are considered signed, unless an unsigned type is needed to represent all constants of the enumeration type. Bit fields obey the same size and alignment rules as other structure or union members, with the following additions:
Bit fields of type int and long (signed and unsigned) usually pack more densely than those of smaller base types, because there are fewer restrictions on crossing base-type boundaries. You can use char and short bit fields to force placement within those types, but int is generally more efficient. See the sketch below.
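A small C++ sketch of these rules; the field names and widths are invented for illustration, and the exact packing is implementation-defined, but the widths respect the limits in the table above.

#include <cstdio>

struct PageEntry {
    // 64-bit base type: fields of up to 64 bits are allowed and can share one 64-bit unit
    unsigned long long present  : 1;
    unsigned long long writable : 1;
    unsigned long long level    : 3;
    unsigned long long frame    : 52;   // 1+1+3+52 = 57 bits fit in one 64-bit unit
    // 8-bit base type: the field width is limited to 1..8 bits
    unsigned char      kind     : 4;
};

int main() {
    PageEntry e{};
    e.present = 1;
    e.frame   = 0x12345;
    std::printf("sizeof(PageEntry) = %zu\n", sizeof(PageEntry));   // layout is compiler-defined
    return 0;
}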
Although all 128 general-purpose registers are physically equal (apart from the difference between global and rotating registers, and a few other differences), the application binary interface reserves several general-purpose registers for special purposes. Unlike real special-purpose registers, these registers are special only in the sense that the program is obliged to use them in an agreed way. The choice of numbers for these registers is (almost) arbitrary and is not part of the architecture.
The initial contents of the registers sp and tp are set by the loader at process/thread start and should be changed by the program only according to the ABI rules. The contents of sp must always correctly reflect the state of the stack and be aligned on the strictest boundary for the base types – 16 bytes. Register r0 must contain the return address when a procedure is called.
Register | Content |
---|---|
r0 | link pointer – return address from the procedure. The called procedure receives the return address in the first register of the new frame of the local registers, register r0. |
sp | stack pointer – pointer to the top of the stack. |
tp | thread pointer – pointer to the beginning of thread local data for the main (static) module. Used by load/store instructions and ldan only inside the main module. |
The code segment must not contain relocations (PIC). To create a PIC, the compiler must:
Position-independent code cannot contain absolute addresses directly in the instruction encoding; it addresses both data and code with offsets relative to the instruction pointer. Data-position-independent code addresses code relative to the instruction pointer, but cannot address private data this way; private data is addressed only relative to base registers.
The Global Offset Table (GOT) stores absolute addresses and is part of the process's private data, which makes absolute addresses accessible without violating the position independence and shareability of the program code. Each program module refers to its GOT in a position-independent manner and extracts absolute addresses from it; in this way, position-independent references are converted to absolute addresses.
Initially, the GOT contains information about relocation points (annotations for the dynamic linker). After the system creates memory segments for the loaded object file, the dynamic linker processes the relocation points, some of which refer to the GOT. The dynamic linker determines the symbolic names associated with them, computes their absolute addresses, and stores the values in the corresponding GOT entries. Although the absolute addresses are unknown to the link editor when it builds the object file, the dynamic linker knows the addresses of all memory segments and can therefore compute the absolute addresses of the objects they contain.
If the program requires direct access to the absolute address of the object, this object will have an entry in the GOT. Since the executable file and each shared object have separate GOTs, the address of a symbolic name may appear in several tables. The dynamic linker processes all GOT relocations before transferring control to the process code, which guarantees the availability of absolute addresses at runtime.
Thanks to the GOT, the system can choose different memory segment addresses for the same shared object in different programs. It can even choose different library addresses for different executions of the same program. At the same time, memory segments do not change addresses once the process image is established: as long as the process exists, its segments reside at fixed addresses.
Short summary: if the program has several data segments (private or shared), they are accessed indirectly through the GOT address table. The GOT is part of one chosen private data segment – DATA. Objects in the DATA segment itself can also be addressed indirectly through the GOT in DATA (for example, if the relative displacement is too large to encode in the instruction).
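The indirection can be pictured with a minimal C++ model. It is purely conceptual: the GOT slot, the fixup step, and the variable are stand-ins, not the real loader data structures.

#include <cstdint>

int g_shared_counter = 42;               // stands in for an object in another module

static uintptr_t got_slot_for_counter;   // one GOT entry, lives in the DATA segment

void dynamic_linker_fixup() {            // done by the dynamic linker before the code runs
    got_slot_for_counter = reinterpret_cast<uintptr_t>(&g_shared_counter);
}

int read_counter() {
    // Position-independent code addresses the GOT slot ip-relatively (here: as
    // ordinary static data) and loads the absolute address stored in it.
    return *reinterpret_cast<int*>(got_slot_for_counter);
}

int main() {
    dynamic_linker_fixup();
    return read_counter();               // 42
}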
A relocation, or unresolved link, is a place in code or static data reserved by the compiler for a value computed later – at link time, or even later at load time. It either contains no data (a field of zero bits) or contains incomplete information (an additional addend may be stored there for computing the resolved link). Usually an immediate value is stored at the relocation place: an absolute address, or an offset relative to a base address or the bundle counter, used for memory accesses or address calculations.
There are as many relocation types as there are distinct ways in the instruction set architecture to put an immediate value into a machine instruction (not counting constants describing shifts and some other constants that are too short and therefore not used for relocation) or into a data object. The link editor uses these unfilled (not fully computed) places at link time to embed into previously compiled code its knowledge of the links between individual segments, sections, object modules, and dynamically linked executable modules.
The compiler creates (for later use by the linker) a table of relocation records as part of the object file. Relocation records describe how the linker (or, later, the loader) should modify an instruction or data field.
The following distinct data relocation types are defined for data sections in the POSTRISC architecture:
For code in the POSTRISC architecture (according to the format of single-slot instructions and their extensions into the second slot), the following distinct code relocation types are defined:
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | simm 28 bits |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | extended simm (64 bits instead of 27) |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | simm 28 bits |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
0 | extended simm (60 bits instead of 28) |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | simm 17 bits |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
other | extended (30 bits) |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | simm 21 bits |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
extended simm (63 bits instead of 21) |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | uimm 21 bits |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
extended uimm (63 bits instead of 21) |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | simm11 | other |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
extended simm (40 bits instead of 11) | other |
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | other | uimm11 | other |
83 | 82 | 81 | 80 | 79 | 78 | 77 | 76 | 75 | 74 | 73 | 72 | 71 | 70 | 69 | 68 | 67 | 66 | 65 | 64 | 63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 |
extended uimm (40 bits instead of 11) | other |
Instruction-dependent basic relocation types further differ in how the embedded constant is formed and which conditions are checked. The set of these methods depends on the program model used and on additional features of the instruction set architecture.
For example, the Intel X86 architecture has only one basic type of code relocation (in 32-bit mode) – a 4-byte field inside the instruction – for which there are only two ways of forming the constant: an absolute address for data or an ip-relative address for code.
The POSTRISC system is oriented towards position-independent code. In addition, the special instructions ldan and ldar, designed to optimize relative addressing, require special support from the linker. Hence the large selection of possible ways of referencing a code or data object, depending on the object's location, its distance from the base of relative addressing, the presence of indirect links through the GOT, and the number of repeated references to the same object or to objects near it.
The method of referencing an object at a relocation place (the method of converting a symbol name into an embedded constant) is usually written in assembler as a call to a special function in the constant operand of an instruction, or as a suffix added to the object name. Note that this is not actually a function call, but a mark for the linker (a hint passed along by the compiler) specifying exactly how to construct the relocation constant from the object name. The set of relocation methods used depends on the machine instruction architecture (whether, for example, long constants are synthesized from parts split between several instructions, or several typical program sizes with different addressing methods are provided) and on the chosen program model (absolute code for the system kernel or a user program). The set of POSTRISC assembler functions below is generally traditional for a 64-bit PIC program model and is found (with some variations) in all 64-bit architectures: DEC Alpha, SGI MIPS, IBM PowerPC, Sun UltraSPARC, Intel Itanium.
Group (scope) | Function (Method) | How the address is obtained from the offset at runtime |
---|---|---|
Absolute addresses (for data only) | symbol | symbol |
expr | symbol+offset | |
got(symbol) | mem8[offset] |
Private process data | pcrel (expr) | ip + offset |
ltoff(expr) | mem8[ip+offset] |
thread local data (main program) | tprel (expr) | tp + offset |
@tprel@got(expr) | mem8 [tp + offset] | |
thread local data (dynamic modules) | dtprel(expr) | mid = mem4[gp + mid_offset]; local_tp = mem8[dtv + mid]; ea = local_tp + offset |
Support for ldan instructions for all data types | data_hi (expr) | base + (offset << 15) |
data_lo(expr) | base+offset | |
Support for the ldar instruction for all data types | text_hi(expr) | ip+16×offset |
text_lo(expr) | base+offset | |
Miscellaneous Functions | segrel(expr) | segbase+offset |
secrel(expr) | secbase+offset |
A bare symbol name symbol means the absolute address of the object symbol. The expression expr means a formula over the absolute address of an object and a constant offset: symbol + offset. Absolute addresses are not computed at runtime; they are used as is. Absolute addresses can be embedded in instruction code only in an absolute program (system kernel, drivers).
The function got(symbol) (global offset table) means the absolute address of the GOT entry used for indirect access to the object symbol. It is also a request to create a GOT entry for the object symbol, if no such entry exists yet.
The got function cannot be used by itself, only together with pcrel or tprel, e.g. @pcrel@got(expr), since the GOT is split in two (according to the locality of the linked objects – process or thread), is part of the DATA and TDATA segments, and therefore must be addressed accordingly.
The function pcrel(expr) (program counter relative) means the offset relative to the instruction pointer. The absolute address of the object is computed at runtime as ip + offset. It is used to access code and/or static data of the same module.
The function tprel(expr) (thread pointer relative) means the offset relative to the base register tp used when addressing thread private data. The absolute address of the object is computed at runtime as tp + offset. The expr object must belong to the TDATA segment of the main module.
The function dtprel(expr) (dynamic thread pointer relative) means the offset of the object expr relative to the beginning of this module's thread private data, dtv[ModID] (taken from the dtv array). The absolute address of the object is computed at runtime as dtv[ModID] + offset. The expr object must belong to the TDATA segment of the module ModID itself.
The function data_lo(offset) describes the lower 15-bit part of the offset: data_lo(offset) = sign_extend(offset, 15). The offset is usually computed for position-independent programs as gprel(expr) or tprel(expr), depending on the location of the expr object. It is used for addressing relative to an intermediate base address computed earlier with the ldan instruction. This intermediate address can be reused for accesses to the object or its nearest neighbors using short load/store instructions (with offsets of minimal length – no more than 16 bits per offset).
The data_hi(offset) function describes the upper part of the offset: data_hi(offset) = (offset − sign_extend(offset, 15)) >> 15. It is used by the ldan instruction to compute an intermediate base address before using short load/store instructions. These instructions compute an absolute address no further than 16 KiB from the relatively addressed object expr. A long (over 32 KiB) offset relative to a base register is split into two parts, offset_hi and offset_lo, so that offset = (offset_hi << 15) + offset_lo; the lower part offset_lo is always placed in a 16-bit signed constant, and the upper part offset_hi becomes the argument of the ldan instruction that computes the intermediate base address.
Objects in the TEXT segment (the RODATA section with read-only data) of a position-independent program should be addressed relative to the instruction pointer ip. However, the ldar instruction can produce only the starting address of a 16-byte bundle, that is, the 16-byte-aligned address closest to the target. Subsequent memory access instructions must account for the short offset (0 to 15 bytes) from this starting address to the object.
The function text_hi(expr) denotes the part of the ip-relative offset in the TEXT segment used by the ldar instruction to compute the 16-byte-aligned absolute address closest to the object. The ldar instruction computes ip + 16 × text_hi(symbol), where text_hi(symbol) is computed by the assembler as ((symbol − text_lo(symbol)) >> 4). The paired function text_lo(symbol) describes the lower part of the ip-relative offset as sign_extend(symbol, 4), that is, the difference between the object's address and the nearest 16-byte boundary. This value is used for direct addressing in load/store instructions after the intermediate address of the 16-byte bundle has been computed with the ldar instruction.
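Both splits can be checked with a short C++ sketch that follows the formulas above literally. It assumes that sign_extend(x, N) means "take the low N bits of x and sign-extend them"; the bit widths 15 and 4 are taken from the text, and everything else is illustrative.

#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of x (an assumption about what sign_extend means here).
int64_t sign_extend(int64_t x, unsigned bits) {
    const int64_t unit = int64_t(1) << bits;
    int64_t low = x & (unit - 1);
    return (low & (unit >> 1)) ? low - unit : low;
}

int64_t data_lo(int64_t off) { return sign_extend(off, 15); }
// The difference is an exact multiple of 2^15, so division matches the ">> 15" in the formula.
int64_t data_hi(int64_t off) { return (off - data_lo(off)) / (int64_t(1) << 15); }

int64_t text_lo(int64_t sym) { return sign_extend(sym, 4); }
int64_t text_hi(int64_t sym) { return (sym - text_lo(sym)) / 16; }

int main() {
    const int64_t samples[] = {0x12345678LL, -0x765432LL, 0x3fffLL};
    for (int64_t v : samples) {
        // ldan part plus short load/store offset reconstructs the full offset.
        assert(data_hi(v) * (int64_t(1) << 15) + data_lo(v) == v);
        // ldar part (16-byte granules) plus short offset reconstructs the full offset.
        assert(16 * text_hi(v) + text_lo(v) == v);
    }
    return 0;
}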
The function segrel(expr) (segment relative) describes the offset of the object expr relative to the start of its segment. This relocation is for data structures that are placed in read-only shared segments but must contain pointers. In this case, the relocation place and the relocation target must be located in the same segment. Applications using such relative pointers must be aware of their relativity and add the segment base address to them at runtime.
The function secrel(expr) (section relative) describes the offset of the object expr relative to the beginning of its section. This relocation is for links from one section to another within the same segment.
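A segment-relative "pointer" can be modeled in a few lines of C++; the struct layout and names are invented for illustration only.

#include <cstdint>

struct SegRelPtr { uint32_t offset; };   // stored in a read-only shared segment

// Resolve a segment-relative pointer at runtime by adding the segment base.
inline const void* resolve(const void* segment_base, SegRelPtr p) {
    return static_cast<const uint8_t*>(segment_base) + p.offset;
}

int main() {
    static const char segment[] = "abcdef";      // pretend this is the mapped segment
    SegRelPtr p{3};                              // "pointer" to segment base + 3
    return *static_cast<const char*>(resolve(segment, p)) == 'd' ? 0 : 1;
}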
As a result, combining the relocation type and the method of linking to an object, we get the complete set of valid unresolved link types that the linker must be able to handle (minus some combinations that never occur).
Group | Name | Relocation Method |
---|---|---|
absolute addressing (data only) | R_ADDR_WORD | sym+addend |
R_ADDR_DWORD | sym+addend | |
relative to ip | R_PCREL_JUMP | pcrel (sym + addend), jump/call |
R_PCREL_JUMP_EXT | pcrel (sym + addend), jump/call | |
R_PCREL_BRANCH | pcrel (sym + addend), compare-and-branch | |
R_PCREL_BRANCH_EXT | pcrel (sym + addend), compare-and-branch | |
R_PCREL_LDAR | text_hi(pcrel(sym + addend)), ldar |
R_PCREL_LDAR_EXT | text_hi(pcrel(sym + addend)), ldar |
section-base relative | R_SECREL_WORD | sym - SC + addend, .mem4 |
R_SECREL_DWORD | sym - SC + addend, .mem8 | |
segment-base relative | R_SEGREL_WORD | sym - SB + addend, .mem4 |
R_SEGREL_DWORD | sym - SB + addend, .mem8 | |
base-relative | R_BASEREL_LDI | L (sym - base + addend) |
R_BASEREL_LDI_EXT | L (sym - base + addend) | |
R_BASEREL_BINIMM | sym - base + addend | |
R_BASEREL_BINIMM_EXT | sym - base + addend | |
dynamic layout? | R_SETBASE | Set base |
R_SEGBASE | Set SB | |
R_COPY | dyn reloc, data copy | |
R_IPLT | dyn reloc, imported PLT | |
R_EPLT | dyn reloc, exported PLT | |
tp -relative | R_TPREL_WORD | tprel (sym + addend), .mem4 |
R_TPREL_DWORD | tprel (sym + addend), .mem8 | |
R_TPREL_LDI | tprel (sym + addend), LDI | |
R_TPREL_LDI_LONG | tprel (sym + addend), LDI | |
R_TPREL_HI_BINIMM | data_hi(tprel(sym+addend)) | |
R_TPREL_HI_BINIMM_EXT | data_hi(tprel(sym+addend)) | |
R_TPREL_LO_BINIMM | data_lo(tprel(sym+addend)) | |
R_TPREL_BINIMM | tprel (sym + addend), load/store | |
R_TPREL_BINIMM_EXT | tprel (sym + addend), load/store |
The assembler syntax must be consistent with the set of types of unresolved references that the linker can handle.
For example, almost no assembler/compiler can treat the difference of two addresses from the same segment as an immediate constant, even though it is one. At compile time this difference is still unknown, and at link time, when it could be computed, the corresponding relocation types are not provided, so the task cannot be delegated to the linker. As a result, the compiler is forced to defer such calculations to load time or run time.
The most «advanced» compilation/linking systems support deferring to link time unresolved links of arbitrary complexity, provided they reduce to a constant result.
Managing thread local storage (TLS), which is private to a thread, isn't as simple as per-process private data. TLS sections cannot simply be loaded from a file into memory and made available to the program. Instead, multiple copies must be created (one per thread), each initialized from the primary image of the TLS section in the program file. New threads may continue to be created dynamically throughout the lifetime of the program.
TLS support should avoid creating TLS data blocks if possible, for example, using deferred memory allocation on the first request (first attempt to access TLS). Most threads will probably never use private data of all dynamic modules at once. Unfortunately, the mechanism of deferred memory allocation requires at least introducing a separate functional level (layer) to control access to TLS objects, which may be too inefficient.
The problem is how to lay out TLS data and access it when many copies of it exist. A TLS variable is characterized by two parameters: a reference to the TLS block of a particular dynamic module and an offset within this block. To get the address of a variable, these two parameters must somehow be mapped to the virtual address space at runtime.
The traditional TLS mapping approach is as follows. One of the general registers (tp, the thread pointer) permanently stores the address of the static TLS data block associated with the current thread. The data block is conditionally divided into two parts: the statically allocated TLS data block of the main module (the exe file), and the dtv vector (dynamic thread vector), which stores the addresses of dynamically (possibly lazily) allocated TLS blocks of dynamically loaded modules. When a dynamic module is loaded into the program, it is allocated one slot (a place to store an address) in the dtv vector.
Knowing its module number mid, a dynamic module can find the beginning of its TLS data for the current thread in dtv[mid], that is MEM(tp + mid + offset), where offset is the position of dtv relative to tp (usually 0). The address of a variable is then dtv[mid] + var_offset, where var_offset is the position of the variable within the dynamic TLS block.
The general dynamic TLS model is the most universal. Code compiled for it can be used anytime, anywhere, and can access TLS variables defined anywhere – for example, from one dynamic module it can access a TLS variable in another dynamic module. By default the compiler generates code for this model, and may use more limited TLS models only when explicitly allowed by compiler options.
In this model, neither the module number (slot) in which a TLS variable resides nor the offset inside that module's TLS block is known at link time (let alone at compile time). The module number (ModuleID) and the offset in the TLS block are determined only at runtime (taken from the GOT, where the loader writes them) and passed to a special function __tls_get_addr (the standard name on many Unix systems), which checks whether the TLS block exists, creates it if it doesn't, and returns the address of the variable for the current thread. The implementation of this function is itself a problem requiring OS assistance.
addr1 = __tls_get_addr(GOT[ModuleID], GOT[offset1])
addr2 = __tls_get_addr(GOT[ModuleID], GOT[offset2])
The code size and runtime cost are such that it is best to avoid this model altogether. If the module number and/or offset are known, optimization or simplification is possible.
The local dynamic TLS model is an optimization of the general dynamic model. The compiler uses this model when it knows that the TLS variables are used in the same module in which they are defined. The variable offsets (at least within this module's own TLS block) are then known at link time; the module number is not. It is still necessary to call __tls_get_addr, but now it can be called only once (with offset 0) to determine the start address of the module's own TLS block. The addresses of individual variables are then obtained simply by adding the known offsets.
addr0 = __tls_get_addr(GOT[ModuleID], 0)
addr1 = addr0 + offset1
addr2 = addr0 + offset2
Dynamic models using the __tls_get_addr function allow lazy allocation of memory for TLS data at the first request to the block.
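To make the dtv lookup and lazy allocation concrete, here is a minimal C++ model of a __tls_get_addr-like resolver. The real runtime layout, names, and locking are OS- and ABI-specific, so this is only a sketch under those assumptions, not the actual library implementation.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct TlsImage {               // initial TLS image of one module (.tdata plus .tbss)
    const uint8_t* init;        // initial data (may be nullptr)
    std::size_t init_size;      // size of the initialized part
    std::size_t total_size;     // initialized part plus zero-initialized tail
};

std::vector<TlsImage> g_tls_modules;          // indexed by module id (mid)
thread_local std::vector<uint8_t*> t_dtv;     // dtv: per-thread block addresses

// Analogue of __tls_get_addr(GOT[mid], GOT[offset]): returns &variable for this thread.
void* tls_get_addr(std::size_t mid, std::size_t offset) {
    if (t_dtv.size() <= mid) t_dtv.resize(mid + 1, nullptr);
    uint8_t*& block = t_dtv[mid];
    if (block == nullptr) {                   // lazy allocation on first access
        const TlsImage& img = g_tls_modules[mid];
        block = new uint8_t[img.total_size]();          // zero-fill the .tbss part
        if (img.init) std::memcpy(block, img.init, img.init_size);
    }
    return block + offset;                    // dtv[mid] + var_offset
}

int main() {
    static const uint8_t init[] = {1, 2, 3, 4};
    g_tls_modules.push_back({init, sizeof(init), 16});  // module 0: 4 init bytes + 12 zero bytes
    return *static_cast<uint8_t*>(tls_get_addr(0, 2));  // 3
}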
The static load TLS model (initial exec) assumes that a certain set of dynamic modules is always loaded together with the main program. The loader can then compute the total size of the TLS blocks of these modules and lay them out in a single TLS block. The TLS blocks of the individual modules are located at fixed distances from the beginning of this single block, which the loader computes and stores in the GOT. Computing the address of a TLS variable no longer requires a call to __tls_get_addr – it is just a read from a GOT entry – and the module number is not needed. The single block is allocated immediately when a new thread starts (no lazy allocation). If this model is used for a dynamic library, the library cannot be loaded dynamically, only statically. Addressing is relative to the dedicated register tp with an offset known at load time (taken from the GOT).
addr = tp + GOT [offset]
The local static TLS model (local exec) is obtained by combining the local dynamic model and the initial exec model: static loading and local references (no dynamically linked modules). The main program module refers to TLS variables defined in it. Addressing is relative to the dedicated register tp with an offset known at link time.
addr = tp + offset
The compiler usually (when compiling object modules separately, or when creating libraries) doesn't have full information about the future program as a whole, and is forced to make the most conservative decisions about its nature. This usually means using the most general mechanisms for addressing private data; for TLS, this is the general dynamic addressing model.
Therefore it is important that the linker, when building the final program, can optimize previously compiled object modules and replace, for some variables, the existing addressing method with another (optimized) one. To do this, the linker must at least know such places (unresolved references to TLS sections), and the compiler must generate the addressing code so that it can be replaced by another. This requires that the different TLS addressing methods be equivalent in code size and in the number and type of registers used, etc.
If the optimized version is shorter than the original, the replacement may leave empty spaces in the program, filled with dummy nop instructions. If the optimized version is longer than the original, the compiler must add the dummies in advance, to allow the linker to later replace the sequence with an optimized addressing variant.
The POSTRISC system is focused on sharing code and translation tables. It should be possible to replace shared libraries without recompiling the applications that depend on them. Any program module can be used by several processes, and there should be no difference between application code and shared library code. Code and global data are addressed relative to the instruction pointer, with software mapping onto the matching regions of private process/thread data.
A single address range is used for mapping the code sections of all program modules. This address range is shared by all processes and is execute-only. For each process, another address range is allocated for the static process data sections of all program modules. For each thread, yet another address range is allocated for the static thread data sections of all program modules. All three address range types have the same size, a power of two, and are aligned on the same boundary.
For each program module, the following three values must be equal: the offset from the beginning of the code range to the beginning of its code section; the offset from the beginning of the process DATA range to the beginning of its DATA section; the offset from the beginning of the thread TDATA range to the beginning of its TDATA section. Knowing only ip and the base address of the private range (stored in the dedicated registers gp and tp for DATA and TDATA, respectively), the location of position-independent private data can always be computed with the formula:
base = gp | ip{gtssize−1:0}
or indirectly
base = mem[gp + ip{gtssize−1:0} >> tgsize]
Private static data can easily be found by library code. It is not necessary to explicitly pass the new gp (global pointer) – the address of the module's data segment – when a call crosses a module boundary or goes through a function pointer. A function pointer becomes just a pointer to a place in the code segment, without additional levels of indirection through a function descriptor.
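A direct C++ transcription of the first formula above, assuming gtssize denotes the number of low ip bits selected (that is, log2 of the region size); the names and the sample values are illustrative only.

#include <cstdint>

// Compute the base of the module's private data region from ip and gp.
// gp is the (aligned) base of the per-process DATA region; the low gts_bits
// bits of ip give the module's offset inside the shared TEXT region.
uint64_t data_base(uint64_t gp, uint64_t ip, unsigned gts_bits) {
    const uint64_t mask = (uint64_t(1) << gts_bits) - 1;   // ip{gtssize-1:0}
    return gp | (ip & mask);
}

int main() {
    const uint64_t gp = 0x00007f0000000000ull;             // invented DATA region base
    const uint64_t ip = 0x0000100012345678ull;             // invented instruction pointer
    // With a (made-up) 1 GiB region, gts_bits = 30: same offset in DATA as in TEXT.
    return data_base(gp, ip, 30) == 0x00007f0012345678ull ? 0 : 1;
}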
Diagram: loaded-modules address map (address subranges and address granules). The global TEXT region, the per-process DATA regions (processes A…E), and the per-thread TDATA regions (threads A1…E3) are divided into the same sequence of address granules; when modules m1…m6 are loaded, each module occupies the same granule offsets in the TEXT region and in every process DATA and thread TDATA region.
When dynamically or statically loading a program module, the loader first checks whether this module is already present among the modules loaded in the system. If it isn't, the loader determines up to three sections for the module: CODE, DATA and TDATA. The loader then looks at the system-wide map of loaded-module regions and searches for an unoccupied address range of sufficient size. According to the rules above, identical unoccupied address ranges exist in all three regions at the same distance from the beginning of each region. Having found such a range, the loader reserves it for use by this program module in all processes and threads.
The module occupies the selected address range until the last process using it terminates. The system can then unload the module and free the address range, so the next time the same module may be loaded at a different address range. While the module is loaded, the base address of its text section is the same for all processes using it.
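Because the three regions mirror one layout, finding a free range reduces to a search in a single occupancy map. A schematic C++ version follows; the data structure is invented for illustration and is not the emulator's loader code.

#include <cstddef>
#include <vector>

// Find the first run of `need` consecutive free granules; returns -1 if none.
// One occupancy map suffices, since TEXT, DATA, and TDATA regions mirror each other.
long find_free_granules(const std::vector<bool>& occupied, std::size_t need) {
    std::size_t run = 0;
    for (std::size_t i = 0; i < occupied.size(); ++i) {
        run = occupied[i] ? 0 : run + 1;
        if (run == need) return long(i + 1 - need);     // index of the first free granule
    }
    return -1;
}

int main() {
    std::vector<bool> map = {true, false, false, true, false, false, false};
    return find_free_granules(map, 3) == 4 ? 0 : 1;     // granules 4..6 are free
}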
The following are examples of using ldar, jmp, and bv to load constants, obtain procedure addresses, and call procedures.
Literals and other local read-only data from the TEXT segment can be loaded using ip-relative addressing. Loading a group of constants:
ldar base, text_hi(_local_data)
ldws gb, base, text_lo(_local_data) + 0
ldwz gc, base, text_lo(_local_data) + 4
lddz gd, base, text_lo(_local_data) + 8
Getting the address of a static procedure (within 64 MiB of the current ip):
ldar base, _myfunc
Getting the address of a static procedure (farther than 64 MiB from the current ip):
ldar.l base, _myfunc
Getting the address of a dynamic procedure:
ldar base, text_hi(_reloc_table)
lddz gt, base, text_lo(_reloc_table) + __imp_myfunc
Call a static procedure (within 8 GB of the current ip):
callr _myfunc
_ret_label:
Call a static procedure (beyond 8 GB of the current ip):
callr.l _myfunc
_ret_label:
Calling a procedure through a pointer (in the addr register):
callri lp, addr, gz
_ret_label:
Calling an explicit dynamic procedure (the call is corrected by the compiler):
ldar base, text_hi(_reloc_table)
lddz addr, base, text_lo(_reloc_table) + __imp_myfunc
callri lp, addr, gz
_ret_label:
Calling an implicit dynamic procedure (the call is corrected by the linker using a stub function):
callr _glu_myfunc
_ret_label:
...
_glu_myfunc:
ldar gt, _reloc_table
ldd gt, gt, _imp_myfunc
bv gt, g0
_glu_ret_label:
Private process data (distance up to 1 MiB):
ldd gt, gp, _local_data
Private process data (distance greater than 1 MiB):
ldan gt1, gp, data_hi(_local_data)
ldd gt2, gt1, data_lo(_local_data)
thread local data (distance less than 1 MiB):
ldd gt, tp, _local_data
thread local data (distance greater than 1 MiB):
ldan g30, tp, data_hi(_local_data1)
ldd g31, g30, data_lo(_local_data1)
ldan g31, tp, data_hi(_local_data2)
ldd g32, g31, data_lo(_local_data2)
An interruption is an action in which the processor automatically stops executing the current instruction thread. The processor usually saves part of the thread context (at least the address at which normal execution of the instruction flow should resume). The machine state switches to a special interrupt-processing mode, and the processor starts executing from the predefined address of the interruption handler routine. Having finished interrupt processing, the handler routine (usually) restores the previous processor state (the context of the interrupted thread) and makes it possible to continue executing the thread from the interrupted (or following) instruction (return from interruption).
An exception is an event that, if enabled, forces the processor to take an interruption. Exceptions are generated by signals from internal and external peripheral devices, by instructions of the processor itself, by the internal timer, by debugger events, or by error conditions. In general, exceptions do not coincide one-to-one with interrupts: different exceptions may generate an interrupt of the same type, and one exception can produce several interrupts.
All interrupts can be classified according to the following independent characteristics: the location of the interrupt service code, synchrony with the context, synchrony with the instruction flow, criticality, and accuracy.
According to the location of the service code, interrupts are divided into two groups. Interrupts of the first group depend on the specific implementation of the processor and/or platform. These are RESET (power up, hardware or «cold» start), INIT (soft or «warm» restart), CHECK (test and, possibly, recovery of the processor and/or platform after a failure), and PMI (a request to the processor/platform for an implementation-specific service). The method for handling such interrupts is unknown to the operating system. The code for processing them is stored in an intermediate layer between the OS and the hardware (PAL). The addresses of the handlers for such interrupts are fixed for a given processor implementation and are tied to the address range of the PAL library. The code, in whole or in part (if the implementation allows PAL updates), resides in the write-protected PAL memory area.
Interrupts of the second group are determined by the architecture (fixed) and do not depend on the specific processor implementation. The method of servicing such interrupts is selected by the operating system. The code for their processing is stored in the interrupt table, the location address of this table and its contents are set by the OS. Interrupts of the second group are also called vector interrupts, since the processor uses the interrupt vector number to select the handler code from the interrupt table.
Synchrony with the context specifies whether the interrupted instruction flow can be continued. For RESET or CHECK, continuing the interrupted execution context is impossible – it either doesn't exist yet or is not restored. A machine check (restart, reset) interruption breaks context synchronization with respect to subsequent instructions. For other interrupt types, the interrupted thread context is usually restored after the interrupt is processed. Such interrupts are also called context-synchronous or recoverable: after the interruption is handled, execution of the interrupted instruction sequence can continue (the execution context is saved and restored). An interrupt can be unrecoverable if the contents of processor registers, cache memory, write buffers, etc. are lost during its generation or processing.
Synchrony with the thread describes the relation of the interruption to the interrupted instruction thread. Interrupts asynchronous to the thread are caused by events that do not depend explicitly on the instructions being executed. For asynchronous interrupts, the address reported to the exception-handling routine is simply the address of the next thread instruction that would have been executed had the asynchronous interrupt not occurred. Interrupts synchronous to the thread are caused directly by the execution or attempted execution of an instruction of the current thread. Synchronous interrupts are processed strictly in program order and, if several interrupts occur for a single instruction, in order of interrupt precedence. Thread-synchronous interrupts are divided into two classes: errors (faults) and traps.
An error, or fault, is an interrupt that occurs before the instruction completes. The current instruction cannot (or should not) be executed, or system intervention is required before it executes. Errors are synchronous with respect to the instruction flow. The processor completes the state changes of the instructions preceding the erroneous one; the erroneous instruction and subsequent instructions have no effect on the machine state. Any intermediate results of executing the instruction are completely cancelled on an error, and after the interrupt is handled the instruction restarts. Fault interrupts accurately indicate the address of the instruction that caused the exception that generated the interrupt.
A trap is an interrupt that occurs after the execution of an instruction. The completed instruction requires system intervention. Traps are synchronous with respect to the instruction flow. The trapping instruction and all previous instructions are complete; the following instructions have no effect on the machine state. The instruction that generated the trap is neither cancelled nor restarted. Trap interrupts accurately indicate the address of the instruction following the one that raised the exception that generated the interrupt.
When executing an instruction causes a trap, or attempting to execute an instruction causes an error, the following conditions must hold at the point of interruption:
Critical interrupts. Some types of interruption require immediate attention even if another interrupt is currently being processed and the machine state (return address and the contents of the machine status registers) has not yet been saved. In addition, the interrupt handler itself may generate an interrupt that requires another handler. For example, if the page table resides in virtual memory, handling a DTLB or ITLB miss may cause another DTLB miss.
According to these requirements, interruptions can be classified by criticality level. To allow a more critical interrupt to be taken immediately after the start of processing a less critical one (that is, before the machine state has been saved), several sets of shadow registers are provided for saving the machine state. Interrupts of each criticality class use their own set of registers.
All interrupts, except machine checks, are ordered within two categories of interrupt criticality, so that only one interrupt of each category is processed at a time, and while it is being processed no part of the program state is lost. Since the register group used to save/restore processor state on interruption is a serially reusable resource shared by all interrupts of the same class, program state may be lost if an unordered interrupt occurs.
Interrupt accuracy is an optional property of thread-synchronous interrupts. Accurate interrupts are raised at a predictable instruction: the place where the instruction thread breaks is exactly the instruction that caused the synchronous event. All previous instructions (in program order) complete before control passes to the interrupt handler, and the instruction address is saved automatically by the processor. When the interrupt handler finishes, it returns to the interrupted program and resumes execution from the interrupted instruction.
Inaccurate interrupts are not guaranteed to be raised at a predictable instruction. Any instruction that had not yet executed when the interrupt occurred could be the place where the thread was interrupted. Inaccurate interrupts can be considered asynchronous, because the instruction that caused the interrupt doesn't necessarily correspond to the interrupted instruction. Inaccurate interrupts lag behind the interrupted thread. Inaccurate interrupts and their handlers usually collect information about the machine state related to the interruption for reporting through the system diagnostic software. The interrupted program usually doesn't restart (cannot be recovered).
PAL code (asynchronous to the thread or inaccurate, critical) | Vector | |||||
---|---|---|---|---|---|---|
Asynchronous to instruction thread, recoverable | Synchronous to the thread | |||||
Inaccurate errors | Accurate, recoverable | |||||
Unrecoverable | Recoverable | Unrecoverable | Recoverable | Errors | Traps | |
RESET, CHECK | INIT, PMI, CHECK | INT (external interrupts) | ? | maybe FPU? | TLB, Access rights | Debug, FPU traps |
Since not all combinations of handler code location, synchrony with the context and/or the flow, criticality, and accuracy exist, it is convenient to divide all interrupts into four types: failures (aborts), asynchronous interrupts (interrupts), and synchronous interruptions, which are further divided into errors (faults) and traps.
Failures. The processor has detected an internal failure, or a processor reset has been requested. A failure is not synchronous with the context or the instruction flow and can leave the current instruction thread in an unpredictable state, with partially modified registers and/or memory. Failures are interrupts serviced by PAL code.
Asynchronous interrupts. An external or independent entity (such as an I/O device, a timer, or another processor) needs attention. These interrupts are asynchronous with respect to the instruction flow but usually synchronous with the context: all previous instructions are complete, and the current and subsequent instructions have no effect on the machine state. They are divided into initialization interrupts, platform management interrupts, and external interrupts. Initialization and platform management interrupts are PAL interrupts; external interrupts are vectored interrupts.
Errors and traps. Always synchronous with context and flow. These are vector interrupts.
Machine check interruptions are a special case of asynchronous interruption. They are usually caused by hardware, by a failure of the memory subsystem, or by an attempt to access an invalid address. A machine check can be triggered indirectly by executing an instruction, if an error caused by that instruction is not recognized in time and turns into a hardware failure. Machine check interrupts cannot be said to be strictly synchronous or asynchronous, nor accurate or inaccurate. They are, however, treated as critical-class interrupts.
In the case of a machine check, the following general rules apply: 1. No instruction after the one whose address is passed to the machine check interrupt routine in the iip register has started execution. 2. The instruction whose address is passed to the machine check interrupt routine in the iip register, and all previous instructions, may or may not have completed successfully. All instructions that are ever going to complete appear to have done so already, within the context that existed before the machine check interruption. No further interruption (other than new machine check interruptions) will occur as a result of those instructions.
When an interrupt occurs, the processor saves in special registers part of the context of the interrupted instruction stream. This is necessary to correctly restore the interrupted stream after interrupt processing completes. These registers are: iip, a copy of ip, and ipsr, a copy of psr.
The processor provides the interrupt handler with a minimum of free registers for intermediate computations, so that the handler can use them for its own purposes. The special register group ifa, cause, iib stores the information about the interrupt needed to recognize and process it.
The special register group (iip, iipa, ipsr, ifa, cause, iib) is used to quickly save part of the machine state during an interruption, service the interrupt, and restore the initial machine state when returning from the interrupt. This group exists in two instances to service interrupts of the two priority (criticality) levels, and forms a file of 2 banks of 16 special registers.
These registers store information during an interruption and are used by interrupt handlers. They can be read or written only while psr.ic=0 (while interrupt processing is in progress); otherwise an «Illegal Operation fault» error occurs. The contents of these registers are guaranteed to be preserved only while psr.ic=0; when psr.ic=1, the processor doesn't preserve their contents.
The special register interruption instruction pointer (iip) saves a copy of the ip register upon interruption and indicates the place to return to from the interrupt. In general, iip contains the address of the instruction bundle containing the instruction that caused the error, or the address of the bundle containing the next instruction to return to after processing a trap. The indicated and following instructions are restarted; the previous ones are not. Outside of the interrupt context, the value of this register is undefined.
The special register interruption instruction previous address (iipa) saves, when an interruption occurs, the address of the last successfully executed (all slots) instruction bundle.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
bundle address | 0 |
The special register interruption processor status register (ipsr) saves a copy of the psr (machine status) register upon interruption, and has the same format and set of fields as psr. ipsr is used to restore the processor state when returning from an interrupt with the rfi (return from interruption) instruction.
The special register interruption extended register (cause), during a non-critical (primary) interruption, stores information about the interruption that occurred. The cause register contains data that differentiates between the different kinds of exceptions that a single interrupt type can generate. When one of these interrupts is raised, the bit or bits corresponding to the particular exception that generated the interrupt are set, and all other bits of cause are cleared. Other interrupt types do not affect the contents of cause. The cause register must not be cleared by software. cause stores information about the nature of the interrupt and is written by the processor on all interrupt events, regardless of psr.ic, except for «Data Nested TLB faults». cause describes the interrupted instruction and its properties, such as read, write, execute, speculative, or non-access. Several bits can be set in cause simultaneously; for example, a faulting semaphore operation can set both cause.r and cause.w. Additional information about the fault or trap is available through cause.code and cause.vector.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 |
reserved | vector | ||||||||||||||||||||||||||||||
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
code | reserved | ei | d | n | a | r | w | x |
Field | Bit | Description |
---|---|---|
r | 1 | Read exception. If 1, then the interrupt is associated with reading data. |
w | 1 | Write exception. If 1, then the interrupt is associated with data recording. |
x | 1 | Execute exception. If 1, then the interrupt is associated with fetch instructions. |
n | 1 | Non-access – translation request instructions (dcbf, fetch, mprobe, tpa). |
d | 1 | Exception Deferral – this bit is set from the TLB exception deferral bit (tlb.ed) of the code page containing the faulting instruction. If no translation exists, or execution of the code page is prohibited, cause.d=0. If 1, then the interrupt is deferred. |
ei | 2 | Excepting Instruction – the slot number within the bundle at which the interrupt occurred. For faults and external interrupts, cause.ei identifies the faulting slot referenced by iip; this is not the case for traps. For traps, cause.ei identifies the slot of the trapping instruction. |
code | 16 | interruption Code is the 16-bit code for additional information about the current interrupt. |
vector | 8 | 8-bit code for additional information about external interrupt. |
Notes: The information in the cause register is not complete. System software may also need to identify the type of instruction that caused the interrupt and examine the TLB entry accessed by the data or instruction memory access in order to fully determine which exception or exceptions caused the interrupt. For example, a data memory interruption can be caused both by protection-violation exceptions and by byte-order exceptions. System software would have to look, besides cause, at the processor status saved in ipsr and at the page protection bits in the TLB entry accessed by the memory access, to determine whether a protection violation has also occurred. The bits of the saved ipsr register can be changed when returning from the interrupt via rfi.
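To make the layout above concrete, here is a minimal C sketch of how a handler might pick the cause fields apart. The exact bit positions (flags in bits 5:0, ei in bits 7:6, code in bits 31:16, vector in bits 39:32) are read off the layout and field table above and should be treated as an assumption of this example, not a normative definition.

#include <stdint.h>

/* Hypothetical decoded view of the cause register; bit positions are
 * taken from the layout above and are an assumption of this sketch. */
struct cause_fields {
    unsigned x, w, r, a, n, d;  /* single-bit access-type/deferral flags */
    unsigned ei;                /* excepting instruction slot (2 bits)   */
    unsigned code;              /* 16-bit interruption code              */
    unsigned vector;            /* 8-bit external interrupt vector       */
};

static struct cause_fields decode_cause(uint64_t cause)
{
    struct cause_fields f;
    f.x      = (unsigned)(cause >>  0) & 0x1;
    f.w      = (unsigned)(cause >>  1) & 0x1;
    f.r      = (unsigned)(cause >>  2) & 0x1;
    f.a      = (unsigned)(cause >>  3) & 0x1;   /* present in the layout, not described in the field table */
    f.n      = (unsigned)(cause >>  4) & 0x1;
    f.d      = (unsigned)(cause >>  5) & 0x1;
    f.ei     = (unsigned)(cause >>  6) & 0x3;
    f.code   = (unsigned)(cause >> 16) & 0xffff;
    f.vector = (unsigned)(cause >> 32) & 0xff;
    return f;
}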
The special register interruption faulting address (ifa) provides, upon interruption, the effective address computed by the interrupted instruction (virtual, or physical if translation is not used). For loads, stores, atomics, or cache management instructions that caused an interrupt while accessing memory, whether due to misalignment, a miss in the data/instruction TLB, or any other reason, ifa contains the faulting data address and points to the first byte of the faulting operand. For other instructions, ifa contains the address of the instruction bundle. For faulting instruction addresses, ifa holds the 16-byte-aligned bundle address of the faulting instruction. ifa is also used to temporarily hold the virtual address of a translation when a translation entry is inserted into the TLB (instruction or data).
The special 128-bit register interruption instruction bundle (iib), upon interruption with psr.ic=1, saves the instruction bundle containing the faulting instruction. The interrupt handler may use iib, if needed, to disassemble the faulting instruction and emulate its execution.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
slot2 | slot1 | tp | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
slot3 | slot2 |
There are two types of exceptions: those caused directly by the execution of an instruction (synchronous to the instruction stream) or caused by asynchronous events. In both cases, an exception can cause one of several types of interrupts.
The architecture requires that all synchronous interrupts be reported to software according to the sequential execution model. The exception to this rule is the case of multiple synchronous interrupts from a single instruction.
For any instruction that attempts to raise several exceptions whose corresponding synchronous interrupt types are enabled, a priority order is defined in which the instruction is allowed to generate a single interrupt. This exception priority mechanism, besides the requirement that synchronous interrupts be generated in program order, also ensures that at any given time only one of the synchronous interrupt types exists for consideration. The exception priority mechanism also suppresses some debug exceptions that occur in combination with other synchronously generated interrupts.
This section doesn't define the permitted setting of multiple exceptions whose corresponding interrupt types are disabled. Raising exceptions whose corresponding interrupt types are disabled has no effect on raising other exceptions whose interrupt types are enabled. Conversely, if a specific exception whose interrupt type is enabled is shown in the following sections to have a higher priority than another exception, it will prevent that other exception from being set, regardless of whether the other exception's interrupt type is enabled or disabled.
The priority of exception types is listed below from highest to lowest. Some types of exceptions can be mutually exclusive and can be considered as exceptions of the same priority. In these cases, the exceptions are listed according to the sequential execution model.
Type | No. | Exception | Description |
---|---|---|---|
Aborts | 1 | Machine reset abort (RESET) | Reboot |
2 | Machine check abort (CHECK) | Processor check | |
External Interrupts | 3 | Initialization interrupt (INIT) | Warm restart |
4 | Platform management interrupt (PMI) | Platform interrupt (chipset, board) | |
5 | External interrupt (INT) | External devices, timer, other processors | |
Runtime errors for the asynchronous register stack (spill-fill faults) | 7 | RS Data debug fault | Address and memory access match with one of the debug registers |
8 | RS Unimplemented data address fault | The presence of non-zero bits in the unimplemented bits of the address | |
10 | RS Data TLB Alternate fault | Miss in TLB data (without HPT) | |
11 | RS Data HPT fault | HPT error | |
12 | RS Data TLB fault | Missing TLB data (after HPT) | |
13 | RS Data page not present fault | Data page is not in physical memory | |
16 | RS Data access rights fault | Accessing a virtual memory page in an unauthorized way, for example, reading from a page for which reading is prohibited | |
17 | RS Data access bit fault | Access to the virtual memory page (first entry) | |
18 | RS Unsupported data reference fault | Data access is not supported by memory attributes | |
Fetch-phase faults | 21 | Instruction TLB Alternate fault | Miss in instruction TLB (without HPT) |
22 | Instruction HPT fault | HPT error | |
23 | Instruction TLB fault | Missing TLB instructions (after HPT) | |
24 | Instruction Page Not Present fault | The instruction page is not in physical memory | |
25 | Instruction Access rights fault | Selection of instructions from the virtual memory page for which execution is not allowed | |
26 | Instruction Access Bit fault | Fetching instructions from the virtual memory page (first fetch) | |
Decode faults | 27 | Illegal operation fault | Reserved instruction |
28 | Privileged operation fault | Privileged instruction | |
29 | Undefined operation fault | Invalid instruction form | |
30 | Disabled floating-point fault | Forbidden FP instruction | |
31 | Unimplemented operation fault | Unimplemented standard instruction (emulation required) | |
32 | Unsupported operation fault | Unimplemented dedicated instruction (emulation required) | |
Execute faults | 33 | Reserved register/field fault | Invalid instruction field value (in particular, a register number) |
34 | Out-of-frame rotated register | Access to the rotated register outside the local frame | |
35 | Privileged register fault | Attempt of an unprivileged program to perform a privileged operation with a privileged register | |
36 | Invalid register field fault | Attempt to write an invalid value to registers, TLB | |
37 | Virtualization fault | Attempted to execute a special instruction in processor virtualization mode | |
38 | Integer overflow fault | Integer overflow | |
39 | Integer divide by zero fault | Integer division by zero | |
40 | floating-point fault | Floating-point error | |
Execute faults (memory access) | 42 | Data debug fault | Address and memory access match one of the debug registers |
43 | Unimplemented data address fault | The presence of non-zero bits in the unimplemented bits of the address | |
44 | Data TLB Alternate fault | Missing TLB data (without HPT) | |
45 | Data HPT fault | HPT error | |
46 | Data TLB fault | Missing data TLB (after HPT) | |
47 | Data page not present fault | Data page not in physical memory | |
48 | Data access rights fault | Accessing the virtual memory page in an unauthorized way, such as reading from a page for which reading is prohibited | |
49 | Data access bit fault | Access to the virtual memory page (first entry) | |
50 | Unaligned data reference fault | Accessing data at an unaligned address | |
51 | Unsupported data reference fault | Data access is not supported by memory attributes | |
Traps | 53 | Lower-Privilege Transfer trap | Debugger, privilege level change |
54 | Taken branch trap | Debugger, taken branch | |
55 | Instruction Debug trap | Debugger, attempt to jump to an address that corresponds to one of the address ranges in debug registers | |
56 | System call trap | Debugger, intercept system call | |
57 | Single step trap | Debugger, trap after each instruction | |
58 | Unimplemented Instruction address trap | Unimplemented address of the next instruction bundle | |
59 | floating-point trap | Floating-point instruction requires intervention | |
60 | software trap | Software trap (trap) instruction |
If an instruction raises multiple debug exceptions and doesn't raise any other exceptions, then it is permissible to generate a single debug interrupt (highest priority).
The start addresses of interrupt handler code can be fixed by the architecture (old ARM, MIPS). But it is desirable to provide the ability to change the entry point (for example, for updating), and also, possibly, to assign different handlers to different processors in a multiprocessor system, since in such a system several interrupts may be processed simultaneously by different processors, and no processor should use shared memory blocks for the needs of its interrupt handler. The special register interruption vector address (iva) determines the position of the system table of interrupt handlers in the virtual address space (or in the physical address space if translation is disabled). The vector table is 64 KiB in size and must be aligned on a 64 KiB boundary, so the lower 16 bits of the register must be zero.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
iva | 0 |
For each of the 64 interrupt types, 1024 bytes of code are allocated in the table (64 bundles, or 192 short instructions). The address of an interrupt handler is obtained by combining the iva register with the interrupt vector number inum (a small address-computation sketch follows the layout below). If some vector is not used, the preceding vector (by number) may use the space reserved for its code. If the interrupt handler still doesn't fit into the table, it should branch outside the vector table.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
IVA base | inum | 0 |
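As a worked example of the address formation just described (a sketch only; the 1024-byte stride follows from the 64-bundle allocation per vector, and the 64 KiB alignment of iva from the text above):

#include <stdint.h>

/* Entry point of the handler for interrupt vector 'inum' (0..63):
 * iva supplies the upper bits, inum*1024 is the offset inside the table. */
static uint64_t handler_address(uint64_t iva, unsigned inum)
{
    return (iva & ~0xffffull) | ((uint64_t)(inum & 0x3f) << 10);
}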
Interrupt handling is implemented as a quick context switch (much simpler than completely changing the context of the process). When an interrupt occurs, the hardware does the following:
interruption_number is a unique integer value assigned to each interrupt type. Vectoring is done by jumping into the interrupt vector table indexed by this integer. The interrupt vector table allocates 1024 bytes (64 instruction bundles) to each interrupt handling routine. The value in the iva register must be aligned on a 64-kilobyte boundary.
Notes: The task of interrupt handlers is to resolve (unmask) external interrupts (by setting the psr.i bit to 1) as soon as possible, to minimize the worst latency for external interrupts.
At the end of the interrupt routine, rfi (return from interruption) is executed, which restores the machine status register psr from ipsr, and normal instruction execution resumes from the address contained in iip.
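The entry/exit sequence described above can be condensed into the following conceptual C sketch. The cpu structure and field names are hypothetical and only restate the register moves already described (iip, ipsr, iipa, psr.ic, the iva table, rfi); clearing psr.i on entry is an assumption based on the note about handlers re-enabling external interrupts.

#include <stdint.h>

/* Minimal, hypothetical CPU state for illustration only. */
struct psr_bits { unsigned ic : 1; unsigned i : 1; /* other fields elided */ };
struct cpu {
    uint64_t ip, iip, iipa, iva;
    uint64_t last_bundle_addr;
    struct psr_bits psr, ipsr;
};

/* Conceptual interrupt entry: save the interrupted context into the
 * interruption registers and jump into the vector table. */
static void deliver_interruption(struct cpu *cpu, unsigned inum)
{
    if (cpu->psr.ic) {                      /* psr.ic=1: context is collected    */
        cpu->iip  = cpu->ip;                /* return/restart bundle address     */
        cpu->ipsr = cpu->psr;               /* copy of the machine status        */
        cpu->iipa = cpu->last_bundle_addr;  /* last completed bundle             */
        /* cause, ifa and iib are also recorded here, per the text above */
    }
    cpu->psr.ic = 0;                        /* stop further context collection   */
    cpu->psr.i  = 0;                        /* assumed: external interrupts off  */
    cpu->ip = (cpu->iva & ~0xffffull) | ((uint64_t)(inum & 0x3f) << 10);
}

/* Conceptual rfi: restore psr from ipsr and resume at iip. */
static void do_rfi(struct cpu *cpu)
{
    cpu->psr = cpu->ipsr;
    cpu->ip  = cpu->iip;
}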
The architecture defines a mechanism for delivering external interrupts to the processor from other devices, external interrupt controllers, other processors; interrupt handling mechanism; a mechanism for sending interrupts to other processors. All this is handled by the processor's embedded interrupt controller.
Traditionally, interrupts are delivered to the processor via a separate serial bus, unlike ordinary data, which is delivered via the system bus. This creates an ordering problem that is traditionally solved by software or by complex bus-matching logic. If a data write is followed by an interrupt, the interrupt may reach the processor before the data write takes effect, causing the processor to see stale data. If only the system bus is used to deliver interrupts, along with normal data, the ordering problem disappears.
The POSTRISC architecture replaces the traditional serial interrupt bus with a system bus interrupt delivery implementation. Therefore, interrupt transfer capabilities are scaled along with the system bus speed. External IO interrupts are delivered directly via the IO bus, which also speeds up delivery to the system bus.
Unlike PCI, where a device raises a shared interrupt signal for everyone, a device can now send a unique vector by writing it to a specific address. For each device in the system, the OS can configure the address of the receiver of its interrupts (possibly one per device) and select up to 32 different vectors per device.
The architecture introduces batch interrupt handling to minimize the number of context switches, unlike the previous approach where each interrupt is processed in its own context. This allows the interrupt handler to service all pending interrupts without changing the processor priority level, which reduces the number of context switches and improves performance.
The architecture rejects interrupts based on individual pins in favor of interrupts delivered as special messages on the system-wide bus. To add more interrupt sources with the pin mechanism, more pins are needed, while the message mechanism places no restriction on the number of interrupt sources on a shared bus.
External interrupts are not related to the execution of the instruction stream (they are asynchronous to it). The processor is responsible for ordering and masking (disabling) interrupts, sending and receiving interprocessor interrupt messages, receiving interrupt messages from external interrupt controllers, and managing local interrupt sources (from the processor itself). External interrupts are generated by four sources in the system:
External interrupt controllers. Interrupt messages from any external source can be sent to any processor by an External Programmable Interrupt Controller (EXTPIC), which collects interrupts from several simple devices, or by an IO device capable of sending interrupt messages directly (with a built-in controller). An interrupt message informs the processor that an interrupt request has been made and specifies the unique vector number of the external interrupt. An interrupt request from a simple device is issued when a steady signal level (level-triggered) or a signal transition (edge-triggered) is detected. Processors and external interrupt controllers communicate over the system bus according to the interrupt message protocol defined by the bus architecture.
Devices attached locally to the processor. Interrupts from these devices are signaled on the processor pins for direct interrupts (LINT, INIT, PMI) and are always directed to the local processor. LINT pins can be connected directly to the local external interrupt controller. LINT pins are programmable as either edge-sensitive or level-sensitive, and for the type of interrupt they generate. If they are programmed to generate external interrupts, each LINT pin has its own vector number. Only LINT pins connected to the processor can directly generate level-sensitive interrupts. LINT pins cannot be programmed to generate level-sensitive PMI or INIT interrupts. The INIT and PMI pins generate their corresponding interrupts. An interrupt on the PMI pin is generated with PMI vector number 0.
Internal processor interrupts. These are, for example, interruptions from the processor timer, from the performance monitor, or interruptions due to machine checks. These interrupts are always routed to the local processor. A unique vector number can be programmed for each interrupt source.
Other processors. Each processor can interrupt any other processor, including itself, by sending an inter-processor interrupt message to a specific target processor. The destination of the interrupt message (one of the processors in the system) is determined by the unique identifier of that processor in the system.
An external interrupt controller (EXTPIC) provides incoming interrupt signal lines, through which devices inject interrupts into the system as a steady signal level (level-triggered) or a signal transition (edge-triggered).
EXTPIC contains a Redirection Table (RT) with an entry for each incoming interrupt line. Each RT entry can be individually programmed to define how interrupts on the line are recognized (edge or level), which vector (and therefore which priority) the interrupt has, and which of the processors should service the interrupt. RT contents are controlled by software (the table is mapped to physical addresses and writable by processors) and receive default values at reset. The table information is used to send messages to the local interrupt controller of the target processor via the system bus.
EXTPIC functionality can be integrated directly into an end device, but any component of the system capable of sending interrupt messages on the IO bus must behave like an EXTPIC and must have EXTPIC functionality.
Name | Address | Description |
---|---|---|
EXTPIC Version register | Base + 0x00 | |
IO eoi register | Base + 0x08 | |
Redirection Table Entry X | Base + 0x10, in steps of 8 |
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | Selected register | ||||||||||||||||||||||||||||||
Window register | |||||||||||||||||||||||||||||||
0 | max RT num | 0 | version | ||||||||||||||||||||||||||||
eoi |
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | pid | ||||||||||||||||||||||||||||||
0 | p | s | m | t | dm | 0 | vector |
Delivery Mode (DM): delivery method. Delivery Status (S): 0 (Idle) or 1 (Pending). Interrupt Input Pin Polarity (P): 0 (High) or 1 (Low). Trigger Mode (T): edge (0) or level (1). Mask (M): mask the interrupt. Processor ID (PID): identifier of the target processor.
From the point of view of other processors and IO devices, the processor itself is a device with a built-in programmable external interrupt controller. The only difference is that the processor itself programs its built-in interrupt controller, and is not programmed by other processors.
The local interrupt controller determines whether the processor should accept interrupts sent via the system bus, provides local registers for pending interrupts, interrupt nesting, and interrupt masking, manages interactions with its local processor, and provides the ability to deliver inter-processor messages to its local processor.
In older architectures, this programming was carried out in the same way as for external controllers, via memory-mapped registers. This required each processor to allocate its own address range to expose its interrupt controller, and made it possible to make strange errors by accessing the controller of another («alien») processor.
Later architectures prefer to implement the registers of the embedded controller as special registers inside the processor, without mapping them into the address space. This removes the need to map controller registers to physical addresses and solves the access problem. Examples are Intel's integrated X2APIC interrupt controller, which replaced XAPIC (the full chronology is: PIC - APIC - XAPIC - X2APIC), and the IA64 SAPIC (streamlined integrated interrupt controller).
The POSTRISC architecture naturally follows this newer approach. Software manages external interrupts by changing special processor registers that control the built-in external interrupt controller. These registers are summarized in the table below; they are used to prioritize and deliver external interrupts, and to assign external interrupt vectors to interrupt sources inside the processor, such as the timer, the performance monitor, and machine checks.
Name | Description |
---|---|
lid | Local Identification register |
tpr | Task Priority register |
irr0…irr3 | Interrupt Request registers (read only) |
isr0…isr3 | Interrupt Service registers |
itcv | interval time counter vector |
tsv | thermal sensor vector |
pmv | performance monitor vector |
cmcv | corrected machine-check vector |
Special task priority register (tpr) controls the forced masking (prohibition) of external interrupts depending on their priority. All external interrupt vectors with a number greater than mip (mask interrupt priority) are masked.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 |
reserved | |||||||||||||||||||||||||||||||
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | mip |
To minimize the cost of handling external interrupts, you need to reduce the total number of context switches (processor interrupts and return from interrupts). It is desirable to be able to batch process interrupts, that is, once interrupting the processor, process all interrupts awaiting processing. To do this, the mechanism for determining which external interrupts are pending should be separated from the processor interrupt mechanism.
The special register group interrupt request registers (irr0-irr3) stores a 256-bit vector of external interrupts awaiting processing (one bit per possible interrupt vector number, from 0 to 255). A bit set to 1 in irr means that the processor has received the corresponding external interrupt. The registers are read-only; writing to them is prohibited (an invalid operation). Vector numbers 1-15 are reserved for internal and local interrupts. Bit zero of register irr0 is always zero; it is the special «spurious» (empty) interrupt vector. Reading from the iv register clears the bit corresponding to the highest-priority pending interrupt and returns its vector index (or the spurious vector if there are no pending interrupts). A small sketch of testing a pending bit follows the layout below.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
bits 63-16 | rv | 0 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
irr1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
irr2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
irr3 |
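A minimal sketch of testing whether a particular vector is pending, assuming the four registers irr0…irr3 have already been read into an array (the privileged register read itself is not shown):

#include <stdint.h>

/* irr[0..3] hold a snapshot of irr0..irr3: one bit per vector 0..255. */
static int vector_pending(const uint64_t irr[4], unsigned vector)
{
    return (int)((irr[(vector & 0xff) / 64] >> (vector % 64)) & 1);
}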
The privileged register interrupt vector (iv), when read, returns the highest-priority unmasked vector number among the external interrupts awaiting processing. If there is such an interrupt, the processor moves its vector from the pending category to the in-service category. All vectors of the same and lower priority are masked until the processor finishes processing this interrupt. If there are no pending external interrupts, or all external interrupts are masked, then iv returns the special value 0 (the spurious interrupt vector).
Completion is indicated by a write to iv (end of interrupt). This signals that software has finished servicing the last high-priority interrupt whose vector was obtained by reading iv. The processor removes that interrupt vector from the in-service category and unmasks interrupts of lower or equal priority.
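The batch-servicing pattern above can be sketched as follows. read_iv(), write_iv_eoi() and dispatch() are hypothetical helpers standing in for the privileged accesses to iv and for the per-vector device handling; they are not part of the architecture.

/* Hypothetical accessors; real code would use privileged register moves. */
extern unsigned read_iv(void);       /* highest-priority pending vector, 0 if none/spurious */
extern void     write_iv_eoi(void);  /* end of interrupt: drop the vector from in-service   */
extern void     dispatch(unsigned vector);

/* Service every pending external interrupt in one pass, as described above. */
void service_pending_interrupts(void)
{
    for (;;) {
        unsigned vector = read_iv();   /* also moves the vector from pending to in-service */
        if (vector == 0)               /* spurious vector: nothing left to service         */
            break;
        dispatch(vector);
        write_iv_eoi();                /* unmask equal and lower priority vectors again    */
    }
}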
The processor itself may generate interrupts that are asynchronous to the current instruction stream and not related to external devices, for example on expiration of a time slice (itc matches itm), itc overflow, performance monitor counter overflow, processor overheating, an internal processor error, and so on.
In this case, it is convenient to treat these interrupts as external and service them according to the same principles. To do this, a dedicated external interrupt vector is mapped to each asynchronous interrupt generated by the processor. Accordingly, some interrupt vectors are mapped to specific types of asynchronous intra-processor interrupts and cannot be used to program external devices.
The interval time counter vector is associated with a match or overflow of the processor interval timer counter (itc). The performance monitoring vector is associated with interrupts from the performance monitor. The corrected machine check vector is associated with interrupts raised by the need to correct machine errors. The thermal sensor vector is associated with interrupts due to processor overheating.
For these types of interrupts in the corresponding register, you can set the number of the interrupt vector or mask them (field m).
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | m | rv | vector |
The special local identification register (lid) contains the processor core identifier. It serves as the physical name of the processor for all interrupt messages (external interrupts, INIT interrupts, PMI platform interrupts). The contents of lid are set by the platform during boot/initialization, based on the physical location of this processor in the system. This value is implementation-dependent and must not be changed by software (the register is read-only). When receiving interrupt messages on the system bus, processors compare their lid with the destination address of the interrupt message. In case of a match, the processor accepts the interrupt and stores it in its queue of pending interrupts.
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | lid |
Each processor can interrupt any other processor, including itself, by sending an inter-processor interrupt message (IPI). Different architectures have different approaches to the organization of interprocessor interrupts and their delivery.
For example, in the x86 architecture each processor implements a special interrupt command register (icr), and a processor generates an IPI by writing to its own copy of this register. The delivery method of the message is not defined in this case, nor is the bus used for it (a separate narrow dedicated interrupt bus may be used). The pid field in the register identifies the target processor to interrupt. The remaining fields are interrupt parameters (interrupt vector number and delivery mode). Hint is an instruction to the external system to deliver the interrupt exactly to the addressee (Hint=0), or to allow load balancing and deliver it, at the system's choice (Hint=1), to another (unoccupied) addressee.
This method, despite the simplicity and universality of the implementation (the interrupt delivery method is not defined by the architecture), also has some problems. Sending an interrupt is usually preceded by data modification operations performed on the shared bus, and there may be situations where an interrupt sent on the interrupt bus overtakes a data change on the shared bus. This requires complex hardware-software synchronization schemes.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 |
target processor id | |||||||||||||||||||||||||||||||
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | h | dm | rv | vector |
Another approach is that each processor behaves like any other IO device mapped to a physical address space. A processor generates IPI for another processor by writing to a specific, architecture-specific area of physical addresses. This removes the problem of synchronizing data and interrupts, since they are sent on the same common bus, and doesn't require the implementation of a separate bus for interrupts. At the same time, however, the load on the common bus increases, but insignificantly, since the interrupt signals make up a small percentage of the total traffic of the common bus. By this principle, IPI is implemented in Intel Itanium and IBM Power architectures.
For example, in the IA64 architecture, a 1 MiB range of physical addresses in the memory-mapped device area is allocated to map processors (16 bytes per processor) and to carry interrupt signals. The base address of this range is naturally aligned and is architecturally fixed at 0xFEE00000. Any address of the form 0xFEENNNN0 is recognized as an interrupt signal for processor 0xNNNN. Writing an 8-byte aligned value to an address in this range sends an interrupt to the corresponding processor. Other kinds of writes are not supported, nor are reads. The pid field of the address identifies the target processor to interrupt. The hint (h) field of the address is a command to the external system to deliver the interrupt exactly to the addressee (Hint=0), or to allow load balancing and deliver it, at the system's choice (Hint=1), to another (unoccupied) destination. The remaining fields (in the written value) are interrupt parameters (interrupt vector number and delivery mode).
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0xFEE | pid | h | 0 |
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | dm | vector |
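For the IA64 scheme just described, the delivery address follows directly from the 0xFEENNNN0 pattern; the sketch below only illustrates that address arithmetic (the hint bit and the written interrupt parameters are left out).

#include <stdint.h>

/* IA64-style IPI target address: 0xFEE00000 with the 16-bit processor
 * number in bits 19:4 (the 0xFEENNNN0 form quoted above). */
static uint64_t ia64_ipi_address(unsigned pid)
{
    return 0xFEE00000ull | ((uint64_t)(pid & 0xffff) << 4);
}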
The POSTRISC architecture uses a generalization of the second method: the interrupt message is sent over the common system bus as a write to a dedicated physical address. Each processor core is mapped as a device into physical memory via a standard PCI-Express config space of 4 KiB. Every processor core can be found in the PCI Express config space of the corresponding chipset/socket. The first byte of each core's block is used to deliver interrupts; the remaining bytes are for remote processor tuning, debugging, and monitoring, or are reserved. The physical addresses 0xPPPPP0000000-0xPPPPPFFFFFFF are reserved for mapping existing processors (up to 65536 cores per PCIe ECAM). In the current emulator implementation, for simplicity, they are mapped to the similar kernel virtual addresses 0xFFFFFFFFE0000000-0xFFFFFFFFEFFFFFFF.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
reserved | ECAM base | Bus-Device-Function | offset |
Writing an 8-byte value in which only the low 8 bits (the vector) are nonzero to address 0xFFFFFFFFENNNN000 sends an interrupt to the processor core with device id NNNN (a sketch follows the table below). A write to any other address of the form 0xFFFFFFFFENNNNXXX, or a load from any address inside the block, results in a platform management interrupt for the sending core.
address | bytes | |||||||
---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
0xFENNNN000 | vector | reserved (0) | ||||||
0xFENNNN008 | timecmp (test stuff) | |||||||
0xFENNNN010 | reserved | |||||||
0xFENNNN018 | reserved | |||||||
... | ... | |||||||
0xFENNNNFF8 | reserved |
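Here is a minimal sketch of sending an IPI under the mapping described above, assuming the kernel virtual window 0xFFFFFFFFE0000000 mentioned earlier and a 4 KiB stride per core; write_u64() is a hypothetical 8-byte MMIO store helper.

#include <stdint.h>

#define CORE_IPI_WINDOW 0xFFFFFFFFE0000000ull  /* kernel virtual mapping of per-core config blocks */

/* Hypothetical strongly ordered 8-byte store to a virtual address. */
extern void write_u64(uint64_t vaddr, uint64_t value);

/* Send external interrupt 'vector' to the core with device id 'core_id'.
 * Offset 0 of the core's 4 KiB block is the interrupt delivery word;
 * only the low 8 bits (the vector) may be nonzero. */
static void send_ipi(unsigned core_id, uint8_t vector)
{
    uint64_t addr = CORE_IPI_WINDOW | ((uint64_t)(core_id & 0xffff) << 12);
    write_u64(addr, vector);
}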
The POSTRISC architecture provides debugging tools that enable hardware and software debugging features, such as step-by-step program execution, instruction breakpoints, and data breakpoints.
Debugging tools consist of a special debugging control register dbscr (debug status and control register), a set of debugging events as a subset of interrupts, special registers for comparing instruction addresses ibr (instruction breakpoint register), special registers for comparing data addresses dbr (data breakpoint register).
Debug registers are available to software, but they are intended for use only by special debuggers and debugging software, not by general application or operating system code.
Monitoring tools include the following resources: special registers and/or bit fields controlling the monitoring, implemented types of counted events, a fixed number of event counters, additional type of interrupts for processing the monitoring counter overflow events.
Debugging tools are based on a special group of debug interrupts built into the general interrupt mechanism. Debug type interrupts can be thrown for various reasons that can be analyzed in the handler of this interrupt. There are seven types of predefined debugging events:
Name | Event Type |
---|---|
IB | «Instruction address match» debug event occurs on an instruction address match. If the address of an instruction bundle matches one of the criteria specified in the ibr debug registers, an instruction debug event is raised (if the instruction is not canceled). One or more ibr debug events occur when instructions are executed at an address that matches the criteria specified in the ibr registers. |
DB | «Data address match» debug event occurs on a data address match, when the address of a data memory access meets one of the criteria specified in the dbr debug registers. Data debug faults are only reported if the qualifying predicate is true. The reported trap code returns the matching state of the first 4 dbr registers that matched during execution of the instruction. Zero, one, or more dbr registers may be reported as matching. |
TR | Software Trap |
TB | «Taken branch» trap occurs on each taken branch instruction received if psr.tb=1. This trap is useful for profiling a program. After the trap, iip and ipsr.ri point to the branch destination instruction, and iipa and cause.ei to the branch instruction that caused the trap. The case of debugging «taken branch» (TB) occurs if psr.tb=1 (that is, debug events «taken branch» are allowed), the branch instruction is executed (i.e., either an unconditional branch, or a conditional branch in which the branch condition is satisfied), and psr.de=1 or dbcr0.idm=0. |
SS | «Single step» trap occurs after each successfully completed instruction if dbscr.ss=1 (step-by-step debugging events are enabled). After the trap, iip and ipsr.ri point to the next instruction to be executed; iipa and cause.ei point to the instruction that was caught. |
lp | An interrupt has occurred. The «interrupt occurred» debug event (IRPT) occurs if dbcr.irpt=1 (that is, the interrupt-occurred debug event is enabled) and any non-critical interruption occurs while dbcr.idm=1, or any critical or non-critical interruption occurs while dbcr.idm=0. Interrupt-accepted debug events may occur regardless of the setting of psr.de. |
IR | Returns from Interrupt |
Debug events include instruction and data breakpoints. These debug events set status bits in the DBSCR debug status and control register. A set status bit in the DBSCR register is considered a debug exception. Debug exceptions, if enabled, cause debug interruptions. The debug status and control register (DBSCR) is used to enable debug events, manage timer operation during debug events, and set the processor debugging mode. It also contains the status of debug events.
The DBE (debug enabled events) group of bits of the DBSCR register is set in supervisor mode and cannot be changed by an application. The DBTE (debug taken enabled event) and DBT (debug taken event) bit groups of the DBSCR register are set by hardware and can be read and cleared by software. The contents of the DBSCR register can be read into a general register with the mfspr instruction.
Debug events are used to force debug exceptions to be registered in the DBSCR debug status register.
To enable a debug event, the corresponding bit from the DBE group of the DBSCR register must be set; for a debug exception of a given type to be raised, the event type must be enabled by the corresponding bit or bits in the dbcr debug control registers. Once a DBSCR status bit is set and debug interrupts are enabled (the corresponding bit from the DBE group is 1), a debug interrupt will be generated.
The bit in the special DBSCR debug control register must be set to 1 to enable the debug interruption corresponding to this bit. Debug events of a given type cannot occur when the corresponding bit in the DBSCR register is 0; in such situations, no debug exception of that type occurs and no status bits of that type are set in the DBSCR register.
If the corresponding bit in the dbscr register is 1 (that is, debug interrupts of this type are enabled) at the time of such a debug exception, the debug interruption will occur immediately (unless there is a higher-priority exception that is allowed to cause an interrupt), execution of the instruction causing the exception will be suppressed, and CSRR0 will be set to the address of that instruction.
If debug interrupts of this type are disabled at the time of a debug exception, the debug interruption will not occur, and the instruction will complete execution (provided the instruction doesn't cause some other exception that generates an enabled interrupt).
Notes: If an instruction is suppressed because it raised some other exception that generates an enabled interrupt, the attempted execution of that instruction doesn't cause an instruction-complete debug event. The trap instruction doesn't fall into the category of instructions whose execution is suppressed: it actually completes execution and then generates the system call interruption. In this case, the instruction-complete debug exception will also be set.
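The enable/report rules in the last few paragraphs can be condensed into the following sketch. The helper functions are hypothetical stand-ins for the per-event enable bit, the status bit in dbscr, and the interrupt-enable check (psr.de plus priority); they only restate the rules above.

#include <stdbool.h>

/* Hypothetical predicates and actions; names are not architectural. */
extern bool debug_event_enabled(int event);     /* corresponding enable bit set in dbscr?     */
extern bool debug_interrupts_enabled(void);     /* e.g. psr.de set, no higher-priority fault  */
extern void dbscr_record_event(int event);      /* set the event's status bit in dbscr        */
extern void raise_debug_interrupt(void);        /* enter the debug interrupt handler          */
extern void suppress_instruction(void);         /* faulting instruction does not complete     */

/* Conceptual handling of a debug exception of type 'event'. */
void on_debug_exception(int event)
{
    if (!debug_event_enabled(event))
        return;                        /* event type disabled: no status bit is set */

    dbscr_record_event(event);         /* exception recorded in dbscr               */

    if (debug_interrupts_enabled()) {
        suppress_instruction();        /* the instruction causing the exception is suppressed */
        raise_debug_interrupt();
    }
    /* otherwise the instruction completes (unless some other enabled exception
     * interrupts it) and the status bit remains in dbscr for later inspection */
}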
A trap debugging event occurs if dbscr.trap=1 (that is, trap debugging events are enabled) and the trap instruction is unconditional or the conditions for the trap are met.
Execution of a trap instruction results in the trap instruction interruption. This interrupt can be used for profiling, debugging, and entering the operating system (although the instruction for entering privileged code (syscall) is recommended, since it has lower overhead).
If dbcr.trap=0 (that is, trap-type debug interrupts are disabled) at the time of a trap debug exception, the debug interruption will not occur; a program interruption of the trap type will occur instead, if the trap condition is met.
The «lower-privilege transfer» trap occurs when psr.lp=1 and the branch taken lowers the privilege level (psr.cpl becomes 1). This trap allows the debugger to keep track of privilege drops, for example, to remove permissions granted to higher-privileged code. After the trap, iip and ipsr.ri point to the effective address of the branch, and iipa and cause.ei to the branch instruction that caused the trap.
When dbcr.idm=1, only non-critical interrupts can trigger the «interrupt occurred» debug event. This is because all critical interrupts automatically clear psr.de, which would always prevent the associated debug interrupt from being raised precisely. Also, debug interrupts are themselves critical-class interrupts, and thus any debug interrupt (for any other debug event) would always end up setting an additional dbsr.irpt exception after entering the debug interrupt handler. At that point, the debug interrupt routine would be unable to determine whether the interruption was a valid debug event related to the initial debug event.
When dbcr.idm=0, both critical and non-critical class interruptions can cause the «interrupt accepted» debug event. In this case, the assumption is that debug events are not used to cause interruptions (software can poll the DBSCR register instead), and therefore it is proper to record the exception in the DBSCR register, even though the critical interrupt that constitutes the accepted debug event will clear psr.de.
The «return from interrupt» debug event occurs if dbcr.ret=1 (that is, debug events on return from interrupt are enabled) and an attempt was made to execute the rfi instruction. When this debug event occurs, dbsr.ret is set to 1 to record the debug exception.
Debug registers are designed to intercept program accesses to specific address ranges for specific purposes (e.g. execution or writing), and allow the debugger to verify the correctness of the program. Their number depends on the implementation. Read/write access depends on the privilege level and processor model. They are used in pairs, with at least 4 pairs for instructions and 4 for data.
The 128-bit instruction breakpoint registers ibr are used for debug comparison of instruction addresses. A debug event can be enabled to occur after an attempt to execute an instruction in an address range specified by an ibr register. Since all instruction addresses must be aligned on a bundle boundary, the four least significant bits of the ibr register are reserved and do not participate in the comparison with the address of the instruction bundle.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
address | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
x | 0 | plm | mask | 0 |
The 128-bit data breakpoint registers dbr are for debug comparing data addresses. A debugging event can be allowed to occur after loads, stores, or atomic instructions to an address range specified by a dbr register.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
address | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
r | w | 0 | plm | mask |
The contents of a dbr register are compared with the address computed by a memory access instruction. A data debug event occurs if it is enabled, a data memory access instruction is executed, and the type, address, and possibly even the value of the memory access match the criteria specified in the dbr.
All load instructions are treated as reads with respect to debug events, while all store instructions are treated as writes. Additionally, cache management instructions and some special cases are handled as described below.
The cmp bits determine whether all or only some bits of the instruction address must match the contents of the debug register, and whether the address must lie inside or outside a specific range specified by an ibr register for a debug event to occur.
There are four modes for comparing instruction addresses.
The high part of each register contains the breakpoint address; the low part contains the offset or breakpoint mask. At least 4 data and 4 instruction registers are implemented on all processor models. The registers are implemented starting from number zero.
The instruction and data memory addresses presented for matching are always in the implemented address space. Programming an unimplemented physical address into ibr/dbr guarantees that the physical addresses presented to ibr/dbr will never match. Similarly, programming an unimplemented virtual address into ibr/dbr guarantees that the virtual addresses presented to ibr/dbr will never match.
Field | Description |
---|---|
Address 63:0 | Matching address – 64-bit virtual or physical breakpoint address. The address is interpreted as virtual or physical depending on psr.dt and psr.it. The data breakpoint address trap occurs on load, store, or semaphore instructions. For instruction fetches, the lower four bits of ibr.addr{3:0} are ignored when comparing addresses. All 64 bits are implemented on all processors, regardless of the number of implemented address bits. |
mask 55:0 | Address mask – determines which address bits in the corresponding address register are compared when determining a breakpoint match. Address bits whose mask bits are 1 must match the breakpoint address; otherwise the address bit is ignored. Address bits {63:56}, which have no corresponding mask bits, are always compared. All 56 bits are implemented on all processors, regardless of the number of implemented address bits. |
plm 59:56 | Mask for all privilege levels – Allows data breakpoints that match the specified privilege level. Each bit corresponds to one of 4 privilege levels. Bit 56 corresponds to privilege level 0, bit 57 to level 1, etc. A value of 1 indicates that debugging comparisons are allowed at this privilege level. |
w 62 | Write – when dbr.w=1, any non-canceled store, semaphore, probe.w.fault, or probe.rw.fault to an address matching the address register causes a breakpoint. |
r 63 | Read – when dbr.r=1, any non-canceled load, semaphore, lfetch.fault, probe.r.fault, or probe.rw.fault to an address matching the address register causes a breakpoint. When dbr.r=1, a PT access that matches the dbr (except for those made by the tak instruction) will cause an «Instruction/Data TLB miss» fault. If dbr.r=0 and dbr.w=0, the data breakpoint register is disabled. |
x 63 | Execute – when ibr.x=1, executing instructions at an address matching the address register causes a breakpoint. If ibr.x=0, the instruction breakpoint register is disabled. Instruction breakpoints are reported even if the instruction is canceled. |
ig 62:60 | Ignored |
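Based on the field descriptions above, a single breakpoint match could be modeled roughly as in the following sketch; the always-compared high byte and the 56-bit mask come from the mask description, the privilege check from plm, and everything else (r/w/x qualification, value matching) is left out.

#include <stdint.h>
#include <stdbool.h>

/* Simplified model of one breakpoint register pair (address word plus
 * the mask/plm control fields from the layouts above). */
struct breakpoint {
    uint64_t address;  /* breakpoint virtual or physical address      */
    uint64_t mask;     /* bits 55:0: 1 = this address bit must match  */
    unsigned plm;      /* 4-bit privilege level mask, bit 0 = level 0 */
};

/* Does 'addr', accessed at privilege level 'cpl', match this breakpoint? */
static bool breakpoint_matches(const struct breakpoint *bp,
                               uint64_t addr, unsigned cpl)
{
    if (!((bp->plm >> (cpl & 3)) & 1))
        return false;                      /* this privilege level is not traced */

    uint64_t diff = addr ^ bp->address;
    if (diff >> 56)
        return false;                      /* bits 63:56 are always compared     */

    /* of bits 55:0, only those with mask=1 must match */
    return (diff & bp->mask & 0x00ffffffffffffffull) == 0;
}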
The registers dbr/ibr can only be accessed at the highest privilege level 0, otherwise, the «privileged operation» error occurs.
Debug register changes are not necessarily observed by subsequent instructions. Software must use data serialization to ensure that modifications to dbr, psr.db, psr.tb, and psr.lp are observed before a dependent instruction is executed. Because changes to the ibr registers and the psr.db flag affect subsequent instruction fetching, software must perform instruction serialization.
In some implementations, a hardware debugger may claim two or more registers for its own use. When a hardware debugger is in use, only 2 dbr and only 2 ibr are available for program use. Software should be able to run with fewer implemented ibr and/or dbr registers if a hardware debugger is present. When a hardware debugger is not present, at least 4 ibr and 4 dbr are available for program use.
The implemented debug registers available to software come first by number (for example, if only 2 dbr are available to software, they are registers dbr[0-1]).
Notes: When a hardware debugger is present and it uses two or more debug registers, the processor doesn't enforce a separation of registers between the program and the hardware debugger; that is, the processor doesn't prevent the program from reading or changing any of the debug registers. However, if the program modifies any of the registers used by the hardware debugger, the operation of the processor and/or hardware debugger may become undefined; the processor and/or hardware debugger may crash.
The instructions mfibr (move from instruction breakpoint register), mtibr (move to instruction breakpoint register), mfdbr (move from data breakpoint register), and mtdbr (move to data breakpoint register) are used to indirectly read/write the instruction/data debug registers. The sum of a general register and simm10 is used as the index of the debug register.
mfibr ra, rb # ra = ibr[rb+imm]
mtibr ra, rb # ibr[rb+imm] = ra
mfdbr ra, rb # ra = dbr[rb+imm]
mtdbr ra, rb # dbr[rb+imm] = ra
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | index | simm10 | opx |
Monitoring registers are designed to count various internal events during execution of an instruction stream. Their number depends on the implementation (minimum 4). Read/write access depends on the privilege level and processor model.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56 | 55 | 54 | 53 | 52 | 51 | 50 | 49 | 48 | 47 | 46 | 45 | 44 | 43 | 42 | 41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
counter for the number of such events | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | event type |
There are at least 8 128-bit performance monitoring registers (mr0…mr7). Unimplemented monitoring registers read as zero; writes to them are ignored. Each monitoring register is associated with a specific event which it counts.
Name | Event |
---|---|
Page access | |
DTLB miss | |
ITLB miss | |
I1-cache miss | |
D1-cache miss | |
D1-cache write-back | |
L2-cache miss | |
L2-cache write-back |
An overflow of a monitor counter raises an asynchronous event.
The instructions mfmr (move from monitor register) and mtmr (move to monitor register) are used to indirectly read/write monitor registers. The sum of a general register and simm10 is used as the index of the monitor register.
mfmr ra, rb # ra = MR[rb+imm]
mtmr ra, rb # MR[rb+imm] = ra
41 | 40 | 39 | 38 | 37 | 36 | 35 | 34 | 33 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
opcode | target | index | simm10 | opx |
In a family of binary-compatible machines, application and operating system developers require that hardware functions be implemented consistently (uniformly). When functions conform to a common interface, code using these functions can run on several different implementations of the architecture without modification.
These can be functions such as: the binary encoding of instructions and data, exception mechanisms, and synchronization primitives. Some of these functions can be implemented cost-effectively in hardware; others are impractical to implement directly in hardware. The latter include low-level hardware features such as translation buffer miss routines, interrupt handling, and interrupt vector control. They also include support for privileged and atomic operations that require long sequences of instructions.
In earlier architectures, these functions were usually provided by microcode. Modern architectures try not to use microcode mechanisms. However, it is still desirable to provide an architectural interface to these functions that is compatible across the entire family of machines. The Privileged Architecture Library (PAL) provides a mechanism for implementing these functions without microcode.
Three main components of Privileged Architecture Library: Processor Abstraction Layer (PAL), System Abstraction Layer (SAL), Extensible Firmware Interface (EFI). PAL, SAL, and EFI together initialize the processor and system before and for loading the operating system. PAL and SAL also provide machine check abort handling and other processor and system functions that may vary from implementation to implementation.
Extensible Firmware Interface (EFI) is the firmware layer that isolates the operating system loader from the details (differences) in the implementation of the platform and processor and organizes the basic functionality for controlling a machine without an OS.
System Abstraction Layer (SAL) is the firmware layer that isolates the operating system, the overlying EFI layer, and other high-level software from the details (differences) in the implementation of the platform.
Processor Abstraction Layer (PAL) is a software layer that abstracts the details of the processor implementation and isolates them from everything else: from the operating system, from the EFI layer, and from the SAL layer. PAL is independent of the number of processors in the system. PAL encapsulates processor functions that are likely to change from implementation to implementation, so that SAL, EFI, and the OS are independent of the processor version. This includes non-performance-critical functions such as processor initialization, configuration, and correction of internal errors. PAL consists of two components:
The PAL address space occupies at most 2 GB of the physical address space. The PAL space spans addresses 0x80000000 through 0xffffffff inclusive. Code execution after reset starts at address 0x80000000.
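As a quick sanity check of that range, a compile-time C++ sketch (the constant names are illustrative, not taken from the sources):

#include <cstdint>

// Hypothetical constants describing the PAL physical window stated above.
constexpr uint64_t kPalBase = 0x80000000;   // also the reset entry point
constexpr uint64_t kPalLast = 0xffffffff;   // last byte of the PAL space
constexpr uint64_t kPalSize = kPalLast - kPalBase + 1;

static_assert(kPalSize == uint64_t{2} * 1024 * 1024 * 1024, "PAL space is 2 GiB");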
PAL should perform the following functions:
The architecture allows these functions to be implemented in standard machine code that is resident in main memory. The PAL library is written in standard machine code with some implementation-specific extensions that provide access to low-level hardware. This allows an implementation to make various design trade-offs based on the hardware technology used. The PAL library abstracts these differences and makes them invisible to system software.
The PAL environment differs from the normal environment in the following ways:
Full control of the machine state makes it possible to manage all of the machine's functions. Disabling interrupts makes it possible to execute sequences of several instructions as an atomic operation. Implementation-specific hardware features give access to low-level system hardware. Disabling memory-management traps on the instruction stream allows PAL to implement memory management functions such as filling the translation buffer.
Special Features Required for PAL
PAL uses the POSTRISC instruction set for most of its operations. A small number of additional functions are required to implement PAL. Some of the free primary and/or extended opcodes can be used for PAL functions. These instructions generate an error if executed outside the PAL environment.
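A possible shape of that check in a decoder, as a sketch only (the function and flag names below are assumed identifiers, not the emulator's real ones):

// Sketch: opcodes reserved for PAL raise a fault when executed outside the PAL environment.
enum class Fault { None, IllegalInstruction };

// Hypothetical decode-time check; the two flags would come from the decoder and the CPU state.
Fault check_pal_reserved(bool opcode_is_pal_reserved, bool cpu_in_pal_mode) {
    if (opcode_is_pal_reserved && !cpu_in_pal_mode)
        return Fault::IllegalInstruction;   // error outside the PAL environment, as described above
    return Fault::None;
}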
The presence of PAL has only one effect on system code: because PAL resides in main memory and keeps privileged data structures there, the operating system code that allocates physical memory cannot use all of it. The amount of memory required by PAL is small, so the loss for the system is negligible.
POSTRISC systems require that PAL can be replaced with a version defined by the operating system. The following functions are implemented in PAL code rather than directly in hardware, in order to make such replacement practical.
Translation buffer fill. Different operating systems may wish to replace the translation buffer (TLB) fill routines. Replacement routines will use different page table data structures. Therefore, no part of the TLB fill mechanism that would change with the page table format may be placed in hardware, unless it can be overridden by PAL code (a simplified sketch of such a routine follows below).
Process structure. Different operating systems may wish to replace the process context switch routines. Replacement routines will use different data structures. Therefore, no part of the context switch mechanism that would change with the process structure may be placed in hardware.
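To illustrate why the fill path has to stay in replaceable PAL code rather than in hardware, here is a heavily simplified C++ sketch of a software TLB-fill routine walking one possible multi-level page table; the table layout, entry format, and the read_phys64/tlb_insert primitives are assumptions about a particular OS design, not part of the POSTRISC specification:

#include <cstdint>
#include <optional>

// One possible page-table organization an OS might choose; PAL hides this choice from hardware.
struct TlbEntry { uint64_t vpn; uint64_t pfn; uint64_t attributes; };

constexpr unsigned kPageShift = 12;              // example: 4 KiB pages
constexpr unsigned kIndexBits = kPageShift - 3;  // 8-byte entries per table, so 9 index bits
constexpr unsigned kLevels    = 4;               // example: four indexing levels

uint64_t read_phys64(uint64_t paddr);            // placeholder: physical memory read
void     tlb_insert(const TlbEntry& entry);      // placeholder: TLB insert primitive

// Hypothetical PAL fill routine: walk the tree rooted at `root` and insert the leaf mapping.
std::optional<TlbEntry> pal_tlb_fill(uint64_t root, uint64_t vaddr) {
    uint64_t table = root;
    for (unsigned level = kLevels; level-- > 0;) {
        unsigned shift = kPageShift + level * kIndexBits;
        uint64_t index = (vaddr >> shift) & ((uint64_t{1} << kIndexBits) - 1);
        uint64_t pte   = read_phys64(table + 8 * index);
        if ((pte & 1) == 0)
            return std::nullopt;                 // not present: the OS handles the page fault
        table = pte & ~uint64_t{0xfff};          // physical address of the next level (or leaf)
    }
    TlbEntry entry{vaddr >> kPageShift, table >> kPageShift, 0};
    tlb_insert(entry);
    return entry;
}

Because a different OS could replace the page-table shape, the entry format, and the fault policy wholesale, none of this logic can be frozen into hardware unless PAL code is allowed to override it.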
PAL consists of three components:
Development of the POSTRISC backend for the LLVM compiler: github.com/bdpx/llvm-project.
How to build/use.
Nullification does not work yet; support is in progress.
Pre/post update addressing is not used.
Currently, only static PIE executables are supported by compiler and emulator.
POSTRISC port for MUSL: github.com/bdpx/musl.
MUSL limitations: doesn't support f128.
POSTRISC port limitations: currently buildable only as a static library.
Here are the results for SQLite 3.33.0 compiled with Clang 10.0.1 on FreeBSD 12.1 with -Os for various architectures (text, data, and bss section sizes are in bytes):
text | data | bss | arch, comments |
---|---|---|---|
445205 | 4576 | 964 | ARMv7-A, thumb mode |
649095 | 4576 | 964 | ARMv7-A, ARM mode (a32) |
588115 | 8280 | 1304 | ARMv8-A (a64) |
641257 | 8320 | 1312 | amd64 |
584276 | 4576 | 952 | i686 |
795319 | 16688 | 1304 | mips64el |
725083 | 4576 | 960 | mipsel |
691715 | 9148 | 960 | ppc |
712559 | 49144 | 1304 | ppc64 |
689035 | 4960 | 959 | rv32g |
509583 | 4960 | 959 | rv32gc (compressed) |
689035 | 4960 | 959 | rv64g |
512500 | 8668 | 1299 | rv64gc (compressed) |
917929 | 8280 | 1304 | s390x |
The clear winner is ARM Thumb, but RISC-V with compressed instructions does well indeed: it is the most space-efficient 64-bit ISA for sure. i686 does a little worse (still the third most compact, after rv32gc and Thumb), and the classic RISC instruction sets are just terrible. The clear loser is z/Architecture (s390x).
Below are the results for (presumably) the same SQLite 3.33.0 (sqlite-chromium-version-3.33.0), compiled with the POSTRISC port of Clang 20.0 on Linux, and, for comparison, for x86-64 with Clang 16.0.6 and gcc 13.2.
text | data | bss | arch, comments |
---|---|---|---|
519367 | 8320 | 1691 | x86_64, Os, clang 16 |
772703 | 8320 | 1691 | x86_64, O2, clang 16 |
430514 | 17032 | 1784 | x86_64, Os, gcc 14 |
705880 | 16864 | 1784 | x86_64, O2, gcc 14 |
757856 | 8280 | 1683 | postrisc, Os, clang 20, dense calls |
772528 | 8280 | 1683 | postrisc, O2, clang 20, dense calls |
801248 | 8280 | 1683 | postrisc, Os, aligned calls |
815792 | 8280 | 1683 | postrisc, O2, aligned calls |
The POSTRISC results are obtained without nullification and without the post-update addressing modes (not yet implemented in the compiler), both of which may improve code density a bit. The main factor for code density is the ability to return into the middle of an instruction bundle (dense calls). Overall, code density for POSTRISC is roughly similar to MIPS, PowerPC, s390, and RISC-V (without compression), and only slightly worse. This is surprising, considering the 128 registers, bundles, nops, and so on. In O2 mode the results are similar to clang on x86-64, and even slightly smaller.
POSTRISC port of Doom-1: github.com/bdpx/postrisc_doom. It uses the MUSL standard library (as a static lib). The Doom generic interface is implemented via additional system calls. It works, with minor graphical artifacts.
The emulator log doom-log.html contains static/dynamic instruction statistics for the Doom Shareware demo scene autoplay (roughly the first three demos). Time: 341.022 seconds. Frames: 16321. Instructions per frame: 851263 (up to the 8-bit indexed image, not counting the emulator's scaling/mapping). Instructions per pixel: 13.301. Frames per second: 48.191.
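A quick arithmetic check of the per-pixel figure, assuming Doom's native 320×200 8-bit indexed framebuffer (my assumption; the instruction count is copied from the log above):

#include <cstdio>

int main() {
    const double instructions_per_frame = 851263;    // from the emulator log
    const double pixels_per_frame = 320.0 * 200.0;   // assumed native Doom framebuffer size
    // 851263 / 64000 = 13.301 instructions per pixel, matching the reported value.
    std::printf("instructions per pixel: %.3f\n", instructions_per_frame / pixels_per_frame);
    return 0;
}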
POSTRISC virtual processor.
Instruction Set Architecture (ISA) reference manual.
Copyright © 2003-2024 by Dmitry Buvaylo.