notes of “Intel64 and IA-32 Architectures Software Developer’s Manual”

Basic execution environment

IA-32 architecture basic operating modes

Protected Mode

This mode is the native state of the processor.
Among the capabilities of protected mode is the ability to directly execute “real-address mode” 8086 software in a protected(virtual-8086 mode).

Real-address mode

This mode implements the programming environment of the Intel 8086 processor with extensions (such as the ability to switch to protected or system management mode).
The processor is placed in real-address mode following power-up or a reset.

System management mode(SMM)

This mode provides an operating system or executive with a transparent mechanism for implementing platform-specific functions such as power management and system security.
The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC).
In SMM, the processor switches to a separate address space while saving the basic context of the currently running program or task.
SMM-specific code may then be executed transparently.
Upon returning from SMM,the processor is placed back into its state prior to the system management interrupt.

Intel 64 architecture adds IA-32e mode

IA-32e has two sub-modes.

Compatibility mode (sub-mode of IA-32e mode)

Compatibility mode permits most legacy 16-bit and32-bit applications to run without re-compilation under a 64-bit operating system.
Legacy applications that run in Virtual 8086 mode or use hardware task management will not work in this mode.

64-bit mode (sub-mode of IA-32e mode)

This mode enables a 64-bit operating system to run applications written to access 64-bit linear address space.

IA-32 Execution Environment

Intel 64 Execution Environment

Basic program execution registers

General-purpose registers

General-Purpose Registers in 64-Bit Mode

In 64-bit mode, there are limitations on accessing byte registers. An instruction cannot reference legacy highbytes(for example: AH, BH, CH, DH) and one of the new byte registers at the same time (for example: the low byte of the RAX register).
However, instructions may reference legacy low-bytes (for example: AL, BL, CL or DL)
and new byte registers at the same time (for example: the low byte of the R8 register, or RBP).
The architecture enforces this limitation by changing high-byte references (AH, BH, CH, DH) to low byte references (BPL, SPL, DIL,SIL: the low 8 bits for RBP, RSP, RDI and RSI) for instructions using a REX prefix.

When in 64-bit mode,
64-bit operands generate a 64-bit result in the destination general-purpose register.
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64-bits.

Segment Registers

When using the flat (unsegmented) memory model, segment registers are loaded with segment selectors that point to overlapping segments, each of which begins at address 0 of the linear address space (see Figure 3-6). These overlapping segments then comprise the linear address space for the program. Typically, two overlapping segments are defined: one for code and another for data and stacks. The CS segment register points to the code segment and all the other segment registers point to the data and stack segment.

When using the segmented memory model, each segment register is ordinarily loaded with a different segment selector so that each segment register points to a different segment within the linear address space (see Figure 3-7). At any time, a program can thus access up to six segments in the linear address space. To access a segment not pointed to by one of the segment registers, a program must first load the segment selector for the segment to be accessed into a segment register.

Each of the segment registers is associated with one of three types of storage: code, data, or stack.

The CS register contains the segment selector for the code segment, where the instructions being executed are stored. The processor fetches instructions from the code segment, using a logical address that consists of the segment selector in the CS register and the contents of the EIP register.The EIP register contains the offset within the code segment of the next instruction to be executed. The CS register cannot be loaded explicitly by an application program. Instead, it is loaded implicitly by instructions or internal processor operations that change program control (such as procedure calls, interrupt handling, or task switching).

The DS, ES, FS, and GS registers point to four data segments. The availability of four data segments permits efficient and secure access to different types of data structures. For example, four separate data segments might be created: one for the data structures of the current module, another for the data exported from a higher-level module, a third for a dynamically created data structure, and a fourth for data shared with another program. To access additional data segments, the application program must load segment selectors for these segments into the DS, ES, FS, and GS registers, as needed.

The SS register contains the segment selector for the stack segment, where the procedure stack is stored for the program, task, or handler currently being executed. All stack operations use the SS register to find the stack segment. Unlike the CS register, the SS register can be loaded explicitly, which permits application programs to set up multiple stacks and switch among them.

The four segment registers CS, DS, SS, and ES are the same as the segment registers found in the Intel 8086 and Intel 286 processors and the FS and GS registers were introduced into the IA-32 Architecture with the Intel386™ family of processors.

In 64-bit mode: CS, DS, ES, SS are treated as if each segment base is 0, regardless of the value of the associated segment descriptor base.This creates a flat address space for code, data, and stack. FS and GS are exceptions.Both segment registers may be used as additional base registers in linear address calculations (in the addressing of local data and certain operating system data structures).

EFLAGS Register

Following initialization of the processor (either by asserting the RESET pin or the INIT pin), the state of the EFLAGS register is 00000002H. Bits 1, 3, 5, 15, and 22 through 31 of this register are reserved. Software should not use or depend on the states of any of these bits.

Some of the flags in the EFLAGS register can be modified directly, using special-purpose instructions (described in the following sections). There are no instructions that allow the whole register to be examined or modified directly.

The following instructions can be used to move groups of flags to and from the procedure stack or the EAX register:LAHF, SAHF, PUSHF, PUSHFD, POPF, and POPFD. After the contents of the EFLAGS register have been transferred to the procedure stack or EAX register, the flags can be examined and modified using the processor’s bit manipulation instructions (BT, BTS, BTR, and BTC).

When suspending a task (using the processor’s multitasking facilities), the processor automatically saves the state of the EFLAGS register in the task state segment (TSS) for the task being suspended. When binding itself to a new task, the processor loads the EFLAGS register with data from the new task’s TSS.

When a call is made to an interrupt or exception handler procedure, the processor automatically saves the state of the EFLAGS registers on the procedure stack. When an interrupt or exception is handled with a task switch, the state of the EFLAGS register is saved in the TSS for the task being suspended.

Status Flags

The status flags (bits 0, 2, 4, 6, 7, and 11) of the EFLAGS register indicate the results of arithmetic instructions,such as the ADD, SUB, MUL, and DIV instructions.

CF (bit 0) Carry flag — Set if an arithmetic operation generates a carry or a borrow out of the mostsignificant bit of the result; cleared otherwise. This flag indicates an overflow condition for
unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.
PF (bit 2) Parity flag — Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
AF (bit 4) Auxiliary Carry flag — Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.
ZF (bit 6) Zero flag — Set if the result is zero; cleared otherwise.
SF (bit 7) Sign flag — Set equal to the most-significant bit of the result, which is the sign bit of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)
OF (bit 11) Overflow flag — Set if the integer result is too large a positive number or too small a negative number (excluding the sign-bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.

Of these status flags, only the CF flag can be modified directly, using the STC, CLC, and CMC instructions. Also the bit instructions (BT, BTS, BTR, and BTC) copy a specified bit into the CF flag.

DF Flag

The direction flag (DF, located in bit 10 of the EFLAGS register) controls string instructions (MOVS, CMPS, SCAS,LODS, and STOS). Setting the DF flag causes the string instructions to auto-decrement (to process strings from high addresses to low addresses). Clearing the DF flag causes the string instructions to auto-increment (process strings from low addresses to high addresses).
The STD and CLD instructions set and clear the DF flag, respectively.

System Flags and IOPL Field

The system flags and IOPL field in the EFLAGS register control operating-system or executive operations. They should not be modified by application programs.

TF (bit 8) Trap flag — Set to enable single-step mode for debugging; clear to disable single-step mode.
IF (bit 9) Interrupt enable flag — Controls the response of the processor to maskable interrupt
requests. Set to respond to maskable interrupts; cleared to inhibit maskable interrupts.
IOPL (bits 12 and 13)
I/O privilege level field — Indicates the I/O privilege level of the currently running program
or task. The current privilege level (CPL) of the currently running program or task must be less
than or equal to the I/O privilege level to access the I/O address space. The POPF and IRET
instructions can modify this field only when operating at a CPL of 0.
NT (bit 14) Nested task flag — Controls the chaining of interrupted and called tasks. Set when the
current task is linked to the previously executed task; cleared when the current task is not
linked to another task.
RF (bit 16) Resume flag — Controls the processor’s response to debug exceptions.
VM (bit 17) Virtual-8086 mode flag — Set to enable virtual-8086 mode; clear to return to protected
mode without virtual-8086 mode semantics.
AC (bit 18) Alignment check (or access control) flag — If the AM bit is set in the CR0 register, alignment
checking of user-mode data accesses is enabled if and only if this flag is 1.
If the SMAP bit is set in the CR4 register, explicit supervisor-mode data accesses to user-mode
pages are allowed if and only if this bit is 1. See Section 4.6, “Access Rights,” in the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 3A.
VIF (bit 19) Virtual interrupt flag — Virtual image of the IF flag. Used in conjunction with the VIP flag.
(To use this flag and the VIP flag the virtual mode extensions are enabled by setting the VME
flag in control register CR4.)
VIP (bit 20) Virtual interrupt pending flag — Set to indicate that an interrupt is pending; clear when no
interrupt is pending. (Software sets and clears this flag; the processor only reads it.) Used in
conjunction with the VIF flag.
ID (bit 21) Identification flag — The ability of a program to set or clear this flag indicates support for
the CPUID instruction.

In 64-bit mode, EFLAGS is extended to 64 bits and called RFLAGS. The upper 32 bits of RFLAGS register is reserved. The lower 32 bits of RFLAGS is the same as EFLAGS.

Instruction Pointer

The instruction pointer (EIP) register contains the offset in the current code segment for the next instruction to be executed.

Operand-size and address-size attributes

When the processor is executing in protected mode, every code segment has a default operand-size attribute and address-size attribute. These attributes are selected with the D (default size) flag in the segment descriptor for the code segment.

When the D flag is set, the 32-bit operand-size and address-size attributes are selected; when the flag is clear, the 16-bit size attributes are selected. When the processor is executing in real-address mode, virtual-8086 mode, or SMM, the default operand-size and address-size attributes are always 16 bits.

The operand-size attribute selects the size of operands. When the 16-bit operand-size attribute is in force, operands can generally be either 8 bits or 16 bits, and when the 32-bit operand-size attribute is in force, operands can generally be 8 bits or 32 bits.

The address-size attribute selects the sizes of addresses used to address memory: 16 bits or 32 bits. When the 16-bit address-size attribute is in force, segment offsets and displacements are 16 bits. This restriction limits the size of a segment to 64 KBytes. When the 32-bit address-size attribute is in force, segment offsets and displacements are 32 bits, allowing up to 4 GBytes to be addressed.

The default operand-size attribute and/or address-size attribute can be overridden for a particular instruction by adding an operand-size and/or address-size prefix to an instruction.

In 64-bit mode, the default address size is 64 bits and the default operand size is 32 bits. Defaults can be overridden using prefixes.

Operand Addressing
Memory Operands

Source and destination operands in memory are referenced by means of a segment selector and an offset. Segment selectors specify the segment containing the operand. Offsets specify the linear or effective address of the operand. Offsets can be 32 bits (represented by the notation m16:32) or 16 bits (represented by the notation m16:16).

In 64-bit mode, a memory operand can be referenced by a segment selector and an offset. The offset can be 16 /bits, 32 bits or 64 bits.

Specifying a Segment Selector

The segment selector can be specified either implicitly or explicitly. The most common method of specifying a segment selector is to load it in a segment register and then allow the processor to select the register implicitly, depending on the type of operation being performed. The processor automatically chooses a segment according to the rules given in Table.

When storing data in memory or loading data from memory, the DS segment default can be overridden to allow other segments to be accessed. Within an assembler, the segment override is generally handled with a colon “:” operator. For example, the following MOV instruction moves a value from register EAX into the segment pointed to by the ES register. The offset into the segment is contained in the EBX register:
MOV ES:[EBX], EAX

In IA-32e mode, the effects of segmentation depend on whether the processor is running in compatibility mode or 64-bit mode. In compatibility mode, segmentation functions just as it does in legacy IA-32 mode, using the 16-bit or 32-bit protected mode semantics described above.

In 64-bit mode, segmentation is generally (but not completely) disabled, creating a flat 64-bit linear-address space. The processor treats the segment base of CS, DS, ES, SS as zero, creating a linear address that is equal to the effective address. The exceptions are the FS and GS segments, whose segment registers (which hold the segment base) can be used as additional base registers in some linear address calculations.

Specifying an Offset

The offset part of a memory address can be specified directly as a static value (called a displacement) or through an address computation made up of one or more of the following components:

Displacement — An 8-, 16-, or 32-bit value.
Base — The value in a general-purpose register.
Index — The value in a general-purpose register.
Scale factor — A value of 2, 4, or 8 that is multiplied by the index value.

The offset which results from adding these components is called an effective address. Each of these components can have either a positive or negative (2s complement) value, with the exception of the scaling factor.

Figure shows all the possible ways that these components can be combined to create an effective address in the selected segment.

The uses of general-purpose registers as base or index components are restricted in the following manner:
• The ESP register cannot be used as an index register.
• When the ESP or EBP register is used as the base, the SS segment is the default segment. In all other cases,the DS segment is the default segment.

Data Types

Fundamental Data Types

The fundamental data types are bytes, words, doublewords, quadwords, and double quadwords A byte is eight bits, a word is 2 bytes (16 bits), a doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits),and a double quadword is 16 bytes (128 bits).

Figure shows the byte order of each of the fundamental data types when referenced as operands in memory.The low byte (bits 0 through 7) of each data type occupies the lowest address in memory and that address is also the address of the operand.

Numeric Data Types

Single-precision (32-bit) floating-point and double-precision (64-bit) floating-point data types are supported across all generations of SSE extensions and Intel AVX extensions. Halfprecision (16-bit) floating-point data type is supported only with F16C extensions (VCVTPH2PS, VCVTPS2PH).

Integers

Some integer instructions (such as the ADD, SUB, PADDB, and PSUBB instructions) operate on either unsigned or signed integer operands. Other integer instructions (such as IMUL, MUL, IDIV, DIV, FIADD, and FISUB) operate on only one integer type.

Unsigned integers are unsigned binary numbers contained in a byte, word, doubleword, and quadword. Their values range from 0 to 255 for an unsigned byte integer, from 0 to 65,535 for an unsigned word integer, from 0 to 232 – 1 for an unsigned doubleword integer, and from 0 to 264 – 1 for an unsigned quadword integer. Unsigned integers are sometimes referred to as ordinals.

Signed integers are signed binary numbers held in a byte, word, doubleword, or quadword. All operations on signed integers assume a two’s complement representation. The sign bit is located in bit 7 in a byte integer, bit 15 in a word integer, bit 31 in a doubleword integer, and bit 63 in a quadword integer.

The integer indefinite is a special value that is sometimes returned by the x87 FPU when operating on integer values.

Floating-Point

The IA-32 architecture defines and operates on three floating-point data types: single-precision floating-point,double-precision floating-point, and double-extended precision floating-point. The data formats for these data types correspond directly to formats specified in the IEEE Standard 754 for Binary Floating-Point Arithmetic.

Half-precision (16-bit) floating-point data type is supported only for conversion operation with single-precision floating data using F16C extensions (VCVTPH2PS, VCVTPS2PH).

For the single-precision and double-precision formats, only the fraction part of the significand is encoded.The integer is assumed to be 1 for all numbers except 0 and denormalized finite numbers. For the double extendedprecision format, the integer is contained in bit 63, and the most-significant fraction bit is bit 62. Here, the integer is explicitly set to 1 for normalized numbers, infinities, and NaNs, and to 0 for zero and denormalized numbers.

1.Integer bit is implied and not stored for single-precision and double-precision formats.
2.The fraction for SNaN encodings must be non-zero with the most-significant bit 0.

The exponent of each floating-point data type is encoded in biased format.The biasing constant is 15 for the half-precision format, 127 for the single-precision format, 1023 for the doubleprecision format, and 16,383 for the double extended-precision format.

When storing floating-point values in memory, half-precision values are stored in 2 consecutive bytes in memory; single-precision values are stored in 4 consecutive bytes in memory; double-precision values are stored in 8 consecutive bytes; and double extended-precision values are stored in 10 consecutive bytes.

The single-precision and double-precision floating-point data types are operated on by x87 FPU, and SSE/SSE2/SSE3/SSE4.1 and Intel AVX instructions.

The double-extended-precision floating-point format is only operated on by the x87 FPU.

Pointer Data Type

Pointers are addresses of locations in memory.In non-64-bit modes, the architecture defines two types of pointers: a near pointer and a far pointer.

A near pointer is a 32-bit (or 16-bit) offset (also called an effective address) within a segment. Near pointers are used for all memory references in a flat memory model or for references in a segmented model where the identity of the segment being accessed is implied.

A far pointer is a logical address, consisting of a 16-bit segment selector and a 32-bit (or 16-bit) offset. Far pointers are used for memory references in a segmented memory model where the identity of a segment being accessed must be specified explicitly.

In 64-bit mode (a sub-mode of IA-32e mode), a near pointer is 64 bits. This equates to an effective address. Far pointers in 64-bit mode can be one of three forms:

16-bit segment selector, 16-bit offset if the operand size is 32 bits
16-bit segment selector, 32-bit offset if the operand size is 32 bits
16-bit segment selector, 64-bit offset if the operand size is 64 bits

Bit Field Data Type

A bit field is a contiguous sequence of bits. It can begin at any bit position of any byte in memory
and can contain up to 32 bits.

String Data Types

Strings are continuous sequences of bits, bytes, words, or doublewords. A bit string can begin at any bit position of any byte and can contain up to 232 – 1 bits. A byte string can contain bytes, words, or doublewords and can range from zero to 232 – 1 bytes (4 GBytes).

Packed SIMD Data Types

Intel 64 and IA-32 architectures define and operate on a set of 64-bit and 128-bit packed data type for use in SIMD operations. These data types consist of fundamental data types (packed bytes, words, doublewords, and quadwords) and numeric interpretations of fundamental types for use in packed integer and packed floating-point operations.

64-Bit SIMD Packed Data Types

The 64-bit packed SIMD data types were introduced into the IA-32 architecture in the Intel MMX technology. They are operated on in MMX registers. The fundamental 64-bit packed data types are packed bytes, packed words, and packed doublewords. When performing numeric SIMD operations on these data types, these data types are interpreted as containing byte, word, or doubleword integer values.

128-Bit Packed SIMD Data Types

The 128-bit packed SIMD data types were introduced into the IA-32 architecture in the SSE extensions and used with SSE2, SSE3 and SSSE3 extensions. They are operated on primarily in the 128-bit XMM registers and memory.The fundamental 128-bit packed data types are packed bytes, packed words, packed doublewords, and packed quadwords (see Figure 4-8). When performing SIMD operations on these fundamental data types in XMM registers,these data types are interpreted as containing packed or scalar single-precision floating-point or double-precision
floating-point values, or as containing packed byte, word, doubleword, or quadword integer values.

BCD And Packed BCD Integers

Binary-coded decimal integers (BCD integers) are unsigned 4-bit integers with valid values ranging from 0 to 9. IA-32 architecture defines operations on BCD integers located in one or more general-purpose registers or in one or more x87 FPU registers.

Real Numbers And Floating-Point Formats

This section describes how real numbers are represented in floating-point format in x87 FPU and
SSE/SSE2/SSE3/SSE4.1 and Intel AVX floating-point instructions. It also introduces terms such as normalized numbers, denormalized numbers, biased exponents, signed zeros, and NaNs. Readers who are already familiar with floating-point processing techniques and the IEEE Standard 754 for Binary Floating-Point Arithmetic may wish to skip this section.

the real-number system comprises the continuum of real numbers from minus infinity (−∞) to plus infinity (+ ∞). Because the size and number of registers that any computer can have is limited, only a subset of the real-number continuum can be used in real-number (floating-point) calculations.

Floating-Point Format

To increase the speed and efficiency of real-number computations, computers and microprocessors typically represent real numbers in a binary floating-point format.In this format, a real number has three parts: a sign, a significand, and an exponent.

The sign is a binary value that indicates whether the number is positive (0) or negative (1).The significand has two parts: a 1-bit binary integer (also referred to as the J-bit) and a binary fraction.The integer-bit is often not represented, but instead is an implied value. The exponent is a binary integer that represents the base-2 power by which the significand is multiplied.

In most cases, floating-point numbers are encoded in normalized form. This means that except for zero, the significand is always made up of an integer of 1 and the following fraction:
1.fff…ff
For values less than 1, leading zeros are eliminated. (For each leading zero eliminated, the exponent is decremented by one.)

In the IA-32 architecture, the exponents of floating-point numbers are encoded in a biased form. This means that a constant is added to the actual exponent so that the biased exponent is always a positive number. The value of the biasing constant depends on the number of bits available for representing exponents in the floating-point format being used. The biasing constant is chosen so that the smallest normalized number can be reciprocated without overflow.

Instruction Set Summary

General-Purpose Instructions

The general-purpose instructions preform basic data movement, arithmetic, logic, program flow, and string operations that programmers commonly use to write application and system software to run on Intel 64 and IA-32 processors.

X87 FPU Instructions

The x87 FPU instructions are executed by the processor’s x87 FPU. These instructions operate on floating-point,integer, and binary-coded decimal (BCD) operands.

MMX Instructions

Four extensions have been introduced into the IA-32 architecture to permit IA-32 processors to perform single instruction multiple-data (SIMD) operations. These extensions include the MMX technology, SSE extensions, SSE2 extensions, and SSE3 extensions.

MMX instructions operate on packed byte, word, doubleword, or quadword integer operands contained in memory,in MMX registers, and/or in general-purpose registers.

MMX instructions can only be executed on Intel 64 and IA-32 processors that support the MMX technology. Support for these instructions can be detected with the CPUID instruction.

SSE Instructions

SSE instructions represent an extension of the SIMD execution model introduced with the MMX technology.

SSE instructions are divided into four subgroups (note that the first subgroup has subordinate subgroups of its own):
1.SIMD single-precision floating-point instructions that operate on the XMM registers.
2.MXCSR state management instructions.
3.64-bit SIMD integer instructions that operate on the MMX registers.
4.Cacheability control, prefetch, and instruction ordering instructions.

SSE2 Instructions

SSE2 extensions represent an extension of the SIMD execution model introduced with MMX technology and the SSE extensions.

These instructions are divided into four subgroups (note that the first subgroup is further divided into subordinate subgroups):
1.Packed and scalar double-precision floating-point instructions.
2.Packed single-precision floating-point conversion instructions.
3.128-bit SIMD integer instructions.
4.Cacheability-control and instruction ordering instructions.

SSE3 Instructions

The SSE3 extensions offers 13 instructions that accelerate performance of Streaming SIMD Extensions technology,Streaming SIMD Extensions 2 technology, and x87-FP math capabilities. These instructions can be grouped into the following categories:
1.One x87FPU instruction used in integer conversion.
2.One SIMD integer instruction that addresses unaligned data loads.
3.Two SIMD floating-point packed ADD/SUB instructions.
4.Four SIMD floating-point horizontal ADD/SUB instructions.
5.Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions.
6.Two thread synchronization instructions.

SSE4 Instructions

SSE4.1 is targeted to improve the performance of media, imaging, and 3D workloads. SSE4.1 adds instructions that improve compiler vectorization and significantly increase support for packed dword computation. The technology also provides a hint that can improve memory throughput when reading from uncacheable WC memory type.

SSE4.1 instructions can use an XMM register as a source or destination. Programming SSE4.1 is similar to programming 128-bit Integer SIMD and floating-point SIMD instructions in SSE/SSE2/SSE3/SSSE3. SSE4.1 does not provide any 64-bit integer SIMD instructions operating on MMX registers.

Five of the SSE4.2 instructions operate on XMM register as a source or destination.

Intel Advanced Vector Extensions

Intel® Advanced Vector Extensions (AVX) promotes legacy 128-bit SIMD instruction sets that operate on XMM register set to use a “vector extension“ (VEX) prefix and operates on 256-bit vector registers (YMM). Almost all prior generations of 128-bit SIMD instructions that operates on XMM (but not on MMX registers) are promoted to support three-operand syntax with VEX-128 encoding.

VEX-prefix encoded AVX instructions support 256-bit and 128-bit floating-point operations by extending the legacy 128-bit SIMD floating-point instructions to support three-operand syntax.

Intel SHA Extensions

Intel® SHA extensions provide a set of instructions that target the acceleration of the Secure Hash Algorithm (SHA), specifically the SHA-1 and SHA-256 variants.

Intel Advanced Vector Extensions 512 (Intel AVX-512)

The Intel® AVX-512 family comprises a collection of 512-bit SIMD instruction sets to accelerate a diverse range of applications. Intel AVX-512 instructions provide a wide range of functionality that support programming in 512-bit,256 and 128-bit vector register, plus support for opmask registers and instructions operating on opmask registers.

Virtual-Machine Extensions

The behavior of the VMCS-maintenance instructions.

Safer Mode Extensions

The behavior of the GETSEC instruction leaves of the Safer Mode Extensions (SMX).

Intel Memory Protection Extensions

Intel Memory Protection Extensions (MPX) provides a set of instructions to enable software to add robust bounds checking capability to memory references.

Intel Software Guard Extensions

Intel Software Guard Extensions (Intel SGX) provide two sets of instruction leaf functions to enable application software to instantiate a protected container, referred to as an enclave. The enclave instructions are organized as leaf functions under two instruction mnemonics: ENCLS (ring 0) and ENCLU (ring 3).

Programming With General-Purpose Instructions

Type Conversion Instructions

The type conversion instructions convert bytes into words, words into doublewords, and doublewords into quadwords.
These instructions are especially useful for converting integers to larger integer formats, because they perform sign extension.

Two kinds of type conversion instructions are provided: simple conversion and move and convert.

Simple conversion — The CBW (convert byte to word), CWDE (convert word to doubleword extended), CWD (convert word to doubleword), and CDQ (convert doubleword to quadword) instructions perform sign extension to double the size of the source operand.

The CBW instruction copies the sign (bit 7) of the byte in the AL register into every bit position of the upper byte of the AX register. The CWDE instruction copies the sign (bit 15) of the word in the AX register into every bit position of the high word of the EAX register.
The CWD instruction copies the sign (bit 15) of the word in the AX register into every bit position in the DX register.
The CDQ instruction copies the sign (bit 31) of the doubleword in the EAX register into every bit position in the EDX register. The CWD instruction can be used to produce a doubleword dividend from a word before a word division,and the CDQ instruction can be used to produce a quadword dividend from a doubleword before doubleword division.

Move with sign or zero extension — The MOVSX (move with sign extension) and MOVZX (move with zero extension) instructions move the source operand into a register then perform the sign extension.
The MOVSX instruction extends an 8-bit value to a 16-bit value or an 8-bit or 16-bit value to a 32-bit value by sign extending the source operand, as shown in Figure 7-5. The MOVZX instruction extends an 8-bit value to a 16-bit value or an 8-bit or 16-bit value to a 32-bit value by zero extending the source operand.

Decimal Arithmetic Instructions

Decimal arithmetic can be performed by combining the binary arithmetic instructions ADD, SUB, MUL, and DIV with the decimal arithmetic instructions. The decimal arithmetic instructions are provided to carry out the following operations:
1.To adjust the results of a previous binary arithmetic operation to produce a valid BCD result.
2.To adjust the operands of a subsequent binary arithmetic operation so that the operation will produce a valid BCD result.

These instructions operate on both packed and unpacked BCD values. For the purpose of this discussion, the decimal arithmetic instructions are divided into subordinate subgroups of instructions that provide:
1.Packed BCD adjustments
2.Unpacked BCD adjustments

Packed BCD Adjustment Instructions

The DAA (decimal adjust after addition) and DAS (decimal adjust after subtraction) instructions adjust the results of operations performed on packed BCD integers. Adding two packed BCD values requires two instructions: an ADD instruction followed by a DAA instruction. The ADD instruction adds (binary addition) the two values and stores the result in the AL register. The DAA instruction then adjusts the value in the AL register to obtain a valid, 2-digit, packed BCD value and sets the CF flag if a decimal carry occurred as the result of the addition.

Likewise, subtracting one packed BCD value from another requires a SUB instruction followed by a DAS instruction. The SUB instruction subtracts (binary subtraction) one BCD value from another and stores the result in the AL register. The DAS instruction then adjusts the value in the AL register to obtain a valid, 2-digit, packed BCD value and sets the CF flag if a decimal borrow occurred as the result of the subtraction.

Unpacked BCD Adjustment Instructions

The AAA (ASCII adjust after addition), AAS (ASCII adjust after subtraction), AAM (ASCII adjust after multiplication),and AAD (ASCII adjust before division) instructions adjust the results of arithmetic operations performed on unpacked BCD values. All these instructions assume that the value to be adjusted is stored in the AL register or, in one instance, the AL and AH registers.

The AAA instruction adjusts the contents of the AL register following the addition of two unpacked BCD values. It converts the binary value in the AL register into a decimal value and stores the result in the AL register in unpacked BCD format (the decimal number is stored in the lower 4 bits of the register and the upper 4 bits are cleared). If a decimal carry occurred as a result of the addition, the CF flag is set and the contents of the AH register are incremented by 1.

Programming With The X87 FPU

X87 FPU Execution Environment

This execution environment consists of eight data registers (called the x87 FPU data registers) and the following special-purpose registers:

1.Status register
2.Control register
3.Tag word register
4.Last instruction pointer register
5.Last data (operand) pointer register
6.Opcode register

The x87 FPU and Intel MMX technology share state because the MMX registers are aliased to the x87 FPU data registers.

x87 FPU Data Registers

The x87 FPU data registers (shown in Figure 8-1) consist of eight 80-bit registers. Values are stored in these registers in the double extended-precision floating-point format.

When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format).
When computation results are subsequently transferred back into memory from any of the x87 FPU registers, the results can be left in the double extended-precision floating-point format or converted back into a shorter floatingpoint format, an integer format, or the packed BCD integer format.

The x87 FPU instructions treat the eight x87 FPU data registers as a register stack.

All addressing of the data registers is relative to the register on the top of the stack. The register number of the current top-of-stack register is stored in the TOP (stack TOP) field in the x87 FPU status word. Load operations decrement TOP by one and load a value into the new top-of-stack register, and store operations store the value from the current TOP register in memory and then increment TOP by one.

x87 FPU Status Register

The 16-bit x87 FPU status register indicates the current state of the x87 FPU.

x87 FPU Control Word

The 16-bit x87 FPU control word controls the precision of the x87 FPU and rounding method used.

x87 FPU Tag Word

The 16-bit tag word indicates the contents of each the 8 registers in the x87 FPU data-register stack (one 2-bit tag per register). The tag codes indicate whether a register contains a valid number, zero, or a special floating-point number (NaN, infinity, denormal, or unsupported format), or whether it is empty. The x87 FPU tag word is cached in the x87 FPU in the x87 FPU tag word register.

Programming With Intel MMX Technology

MMX technology defines a simple and flexible SIMD execution model to handle 64-bit packed integer data.

Although MMX registers are defined in the IA-32 architecture as separate registers, they are aliased to the registers in the FPU data register stack (R0 through R7).

MMX Data Types

MMX technology introduced the following 64-bit data types to the IA-32 architecture.

• 64-bit packed byte integers — eight packed bytes
• 64-bit packed word integers — four packed words
• 64-bit packed doubleword integers — two packed doublewords

MMX instructions move 64-bit packed data types (packed bytes, packed words, or packed doublewords) and the quadword data type between MMX registers and memory or between MMX registers in 64-bit blocks. However, when performing arithmetic or logical operations on the packed data types, MMX instructions operate in parallel on the individual bytes, words, or doublewords contained in MMX registers.

When stored in memory: bytes, words and doublewords in the packed data types are stored in consecutive addresses. The least significant byte, word, or doubleword is stored at the lowest address and the most significant byte, word, or doubleword is stored at the high address. The ordering of bytes, words, or doublewords in memory is always little endian. That is, the bytes with the low addresses are less significant than the bytes with high addresses.

Single Instruction, Multiple Data (SIMD) Execution Model

MMX technology uses the single instruction, multiple data (SIMD) technique for performing arithmetic and logical operations on bytes, words, or doublewords packed into MMX registers.For example, the PADDSW instruction adds 4 signed word integers from one source operand to 4 signed word integers in a second source operand and stores 4 word integer results in a destination operand. This SIMD technique speeds up software performance by allowing the same operation to be carried out on multiple data elements in parallel. MMX technology supports parallel operations on byte, word, and doubleword data elements when contained in MMX registers.

The SIMD execution model supported in the MMX technology directly addresses the needs of modern media,communications, and graphics applications, which often use sophisticated algorithms that perform the same operations on a large number of small data types (bytes, words, and doublewords). For example, most audio data is represented in 16-bit (word) quantities. The MMX instructions can operate on 4 words simultaneously with one instruction. Video and graphics information is commonly represented as palletized 8-bit (byte) quantities.In the figure,one MMX instruction operates on 8 bytes simultaneously.

Programming With Intel Streaming SIMD Extensions(SSE)

SSE extensions expand the SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers.

SSE Programming Environment

All SSE instructions operate on the XMM registers, MMX registers, and/or memory.

XMM Registers

Eight 128-bit XMM data registers were introduced into the IA-32 architecture with SSE extensions.These registers can be accessed directly using the names XMM0 to XMM7; and they can be accessed independently from the x87 FPU and MMX registers and the general-purpose registers (that is, they are not aliased to any other of the processor’s registers

SSE instructions use the XMM registers only to operate on packed single-precision floating-point operands. SSE2 extensions expand the functions of the XMM registers to operand on packed or scalar double-precision floatingpoint operands and packed integer operands.

XMM registers can only be used to perform calculations on data; they cannot be used to address memory.Addressing memory is accomplished by using the general-purpose registers.

Data can be loaded into XMM registers or written from the registers to memory in 32-bit, 64-bit, and 128-bit increments.When storing the entire contents of an XMM register in memory (128-bit store), the data is stored in 16 consecutive bytes, with the low-order byte of the register being stored in the first byte in memory.

MXCSR Control and Status Register

The 32-bit MXCSR register contains control and status information for SSE, SSE2, and SSE3
SIMD floating-point operations.

The state (XMM registers and MXCSR register) introduced into the IA-32 execution environment with the SSE extensions is shared with SSE2 and SSE3 extensions. SSE/SSE2/SSE3 instructions are fully compatible; they can be executed together in the same instruction stream with no need to save state when switching between instruction sets.

XMM registers are independent of the x87 FPU and MMX registers, so SSE/SSE2/SSE3 operations performed on the XMM registers can be performed in parallel with operations on the x87 FPU and MMX registers.

Data Types

SSE extensions introduced one data type, the 128-bit packed single-precision floating-point data type, to the IA-32 architecture. This data type consists of four IEEE 32-bit single-precision floating-point values packed into a double quadword.

This 128-bit packed single-precision floating-point data type is operated on in the XMM registers or in memory.Conversion instructions are provided to convert two packed single-precision floating-point values into two packed doubleword integers or a scalar single-precision floating-point value into a doubleword integer.

SSE extensions provide conversion instructions between XMM registers and MMX registers, and between XMM registers and general-purpose bit registers.

Be the first to reply

发表评论

电子邮件地址不会被公开。 必填项已用*标注

19 − 2 =