Chapter 2.1 - The x86 architecture instructions

 

Table of Contents

  • 2.1  The x86 architecture instructions
    • 2.1.1  Data movement instructions
    • 2.1.2  Type conversion instructions
    • 2.1.3  Binary arithmetic instructions
    • 2.1.4  Decimal arithmetic instructions
    • 2.1.5  Logical instructions
    • 2.1.6  Control transfer instructions
    • 2.1.7  I/O instructions
    • 2.1.8  Strings operations
    • 2.1.9  Flag control instructions
    • 2.1.10  Conditional operations
    • 2.1.11  Miscellaneous instructions
    • 2.1.12  System instructions
    • 2.1.13  FPU instructions
    • 2.1.14  MMX instructions
    • 2.1.15  SSE instructions
    • 2.1.16  SSE2 instructions
    • 2.1.17  SSE3 instructions
    • 2.1.18  AMD 3DNow! instructions
    • 2.1.19  The x86-64 long mode instructions
    • 2.1.20  SSE4 instructions
    • 2.1.21  AVX instructions
    • 2.1.22  Other extensions of instruction set

Chapter 2
Instruction set

 

2.1 The x86 architecture instructions

This section describes both the syntax and the purpose of the assembly language instructions. If you need more technical information, consult the Intel Architecture Software Developer's Manual.

Assembly instructions consist of a mnemonic (the instruction's name) and from zero to three operands. If there are two or more operands, usually the first is the destination operand and the second is the source operand. Each operand can be a register, memory or an immediate value (see 1.2 for details about the syntax of operands). The description of each instruction is followed by examples of different combinations of operands, if the instruction has any.

Some instructions act as prefixes and can be followed by another instruction in the same line, and there can be more than one prefix in a line. Each segment register name is also a mnemonic of an instruction prefix, although it is recommended to use segment overrides inside the square brackets instead of these prefixes.

2.1.1 Data movement instructions

mov transfers a byte, word or double word from the source operand to the destination operand. It can transfer data between general registers, from the general register to memory, or from memory to general register, but it cannot move from memory to memory. It can also transfer an immediate value to general register or memory, segment register to general register or memory, general register or memory to segment register, control or debug register to general register and general register to control or debug register. The mov can be assembled only if the size of source operand and size of destination operand are the same. Below are the examples for each of the allowed combinations:

    mov bx,ax       ; general register to general register
    mov [char],al   ; general register to memory
    mov bl,[char]   ; memory to general register
    mov dl,32       ; immediate value to general register
    mov [char],32   ; immediate value to memory
    mov ax,ds       ; segment register to general register
    mov [bx],ds     ; segment register to memory
    mov ds,ax       ; general register to segment register
    mov ds,[bx]     ; memory to segment register
    mov eax,cr0     ; control register to general register
    mov cr3,ebx     ; general register to control register
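Because mov cannot transfer data from memory to memory, such a copy has to go through a register. A minimal sketch of this, where the labels src_byte and dst_byte are hypothetical variables introduced only for illustration:

```asm
use16
    mov al,[src_byte]   ; memory to general register
    mov [dst_byte],al   ; general register to memory
    jmp $               ; stop here so execution does not run into the data

src_byte db 'A'         ; hypothetical source variable
dst_byte db ?           ; hypothetical destination variable
```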

xchg swaps the contents of two operands. It can swap two byte operands, two word operands or two double word operands. Order of operands is not important. The operands may be two general registers, or general register with memory. For example:

    xchg ax,bx      ; swap two general registers
    xchg al,[char]  ; swap register with memory

push decrements the stack pointer (ESP register), then transfers the operand to the top of stack indicated by ESP. The operand can be memory, a general register, a segment register or an immediate value of word or double word size. If the operand is an immediate value and no size is specified, it is by default treated as a word value if the assembler is in 16-bit mode and as a double word value if the assembler is in 32-bit mode. The pushw and pushd mnemonics are variants of this instruction that store values of word or double word size respectively. If more operands follow in the same line (separated only with spaces, not commas), the assembler will generate a chain of push instructions with these operands. The examples below use single operands:

    push ax         ; store general register
    push es         ; store segment register
    pushw [bx]      ; store memory
    push 1000h      ; store immediate value

pusha saves the contents of the eight general registers on the stack. This instruction has no operands. There are two versions of this instruction, one 16-bit and one 32-bit. The assembler automatically generates the appropriate version for the current mode, but this can be overridden by using the pushaw or pushad mnemonic to always get the 16-bit or 32-bit version. The 16-bit version of this instruction pushes general registers on the stack in the following order: AX, CX, DX, BX, the initial value of SP before AX was pushed, BP, SI and DI. The 32-bit version pushes the equivalent 32-bit general registers in the same order.

pop transfers the word or double word at the current top of stack to the destination operand, and then increments ESP to point to the new top of stack. The operand can be memory, a general register or a segment register. The popw and popd mnemonics are variants of this instruction for restoring values of word or double word size respectively. If more operands separated with spaces follow in the same line, the assembler will generate a chain of pop instructions with these operands.

    pop bx          ; restore general register
    pop ds          ; restore segment register
    popw [si]       ; restore memory
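The multi-operand forms of push and pop described above can be combined to preserve registers around a piece of code. The sketch below assumes 16-bit code; the three mov instructions are only placeholder work that overwrites the registers. Because the stack is last-in, first-out, the pop operands are listed in reverse order.

```asm
use16
    push ax bx cx        ; same as push ax / push bx / push cx
    mov ax,1             ; placeholder work that overwrites the registers
    mov bx,2
    mov cx,3
    pop cx bx ax         ; same as pop cx / pop bx / pop ax
```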

popa restores the registers saved on the stack by pusha instruction, except for the saved value of SP (or ESP), which is ignored. This instruction has no operands. To force assembling 16-bit or 32-bit version of this instruction use popaw or popad mnemonic.

2.1.2 Type conversion instructions

The type conversion instructions convert bytes into words, words into double words, and double words into quad words. These conversions can be done using the sign extension or zero extension. The sign extension fills the extra bits of the larger item with the value of the sign bit of the smaller item, the zero extension simply fills them with zeros.

cwd and cdq double the size of the value in the AX or EAX register respectively and store the extra bits in the DX or EDX register. The conversion is done using the sign extension. These instructions have no operands.

cbw extends the sign of the byte in AL throughout AX, and cwde extends the sign of the word in AX throughout EAX. These instructions also have no operands.

movsx converts a byte to word or double word and a word to double word using the sign extension. movzx does the same, but it uses the zero extension. The source operand can be general register or memory, while the destination operand must be a general register. For example:

    movsx ax,al         ; byte register to word register
    movsx edx,dl        ; byte register to double word register
    movsx eax,ax        ; word register to double word register
    movsx ax,byte [bx]  ; byte memory to word register
    movsx edx,byte [bx] ; byte memory to double word register
    movsx eax,word [bx] ; word memory to double word register
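The difference between the two kinds of extension is easiest to see on a byte whose sign bit is set. The following sketch assumes 32-bit code and uses the arbitrary value 0F0h, with the expected register contents noted in the comments:

```asm
use32
    mov al,0F0h          ; AL = 0F0h (-16 as signed byte, 240 as unsigned)
    movsx ebx,al         ; sign extension: EBX = 0FFFFFFF0h
    movzx ecx,al         ; zero extension: ECX = 000000F0h
    cbw                  ; AX = 0FFF0h
    cwd                  ; DX = 0FFFFh, AX = 0FFF0h
```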

2.1.3 Binary arithmetic instructions

add replaces the destination operand with the sum of the source and destination operands and sets CF if a carry out of the highest bit (an unsigned overflow) has occurred. The operands may be bytes, words or double words. The destination operand can be a general register or memory. The source operand can be a general register or an immediate value; it can also be memory if the destination operand is a register.

    add ax,bx       ; add register to register
    add ax,[si]     ; add memory to register
    add [di],al     ; add register to memory
    add al,48       ; add immediate value to register
    add [char],48   ; add immediate value to memory

adc sums the operands, adds one if CF is set, and replaces the destination operand with the result. Rules for the operands are the same as for the add instruction. An add followed by multiple adc instructions can be used to add numbers longer than 32 bits.
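As an illustration of adding numbers longer than 32 bits, the sketch below adds two 64-bit values held in register pairs. The choice of EDX:EAX and ECX:EBX and the sample values are assumptions made for this example only.

```asm
use32
    ; add the 64-bit value in ECX:EBX to the 64-bit value in EDX:EAX
    mov eax,0FFFFFFFFh   ; low double word of the first number
    mov edx,0            ; high double word of the first number
    mov ebx,1            ; low double word of the second number
    mov ecx,0            ; high double word of the second number
    add eax,ebx          ; EAX = 0, CF = 1 (carry out of the low half)
    adc edx,ecx          ; EDX = 0 + 0 + CF = 1
```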

inc adds one to the operand, it does not affect CF. The operand can be general register or memory, and the size of the operand can be byte, word or double word.

    inc ax          ; increment register by one
    inc byte [bx]   ; increment memory by one

sub subtracts the source operand from the destination operand and replaces the destination operand with the result. If a borrow is required, the CF is set. Rules for the operands are the same as for the add instruction.

sbb subtracts the source operand from the destination operand, subtracts one if CF is set, and stores the result to the destination operand. Rules for the operands are the same as for the add instruction. A sub followed by multiple sbb instructions may be used to subtract numbers longer than 32 bits.

dec subtracts one from the operand, it does not affect CF. Rules for the operand are the same as for the inc instruction.

cmp subtracts the source operand from the destination operand. It updates the flags as the sub instruction, but does not alter the source and destination operands. Rules for the operands are the same as for the sub instruction.

neg subtracts a signed integer operand from zero. The effect of this instruction is to reverse the sign of the operand from positive to negative or from negative to positive. Rules for the operand are the same as for the inc instruction.

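xadd exchanges the destination operand with the source operand, then loads the sum of the two values into the destination operand. The destination operand can be a general register or memory, while the source operand must be a general register. The operands may be bytes, words or double words.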
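A short sketch of the effect of xadd on two registers, assuming 32-bit code and arbitrary sample values:

```asm
use32
    mov eax,5            ; destination
    mov ebx,3            ; source
    xadd eax,ebx         ; EAX = 8 (the sum), EBX = 5 (the old destination)
```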

All the above binary arithmetic instructions update SF, ZF, PF and OF flags. SF is always set to the same value as the result's sign bit, ZF is set when all the bits of result are zero, PF is set when low order eight bits of result contain an even number of set bits, OF is set if result is too large for a positive number or too small for a negative number (excluding sign bit) to fit in destination operand.

mul performs an unsigned multiplication of the operand and the accumulator. If the operand is a byte, the processor multiplies it by the contents of AL and returns the 16-bit result to AH and AL. If the operand is a word, the processor multiplies it by the contents of AX and returns the 32-bit result to DX and AX. If the operand is a double word, the processor multiplies it by the contents of EAX and returns the 64-bit result in EDX and EAX. mul sets CF and OF when the upper half of the result is nonzero, otherwise they are cleared. Rules for the operand are the same as for the inc instruction.
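The following sketch shows where mul leaves its result and how CF and OF reflect the upper half. It assumes 16-bit code and uses arbitrary sample values.

```asm
use16
    mov ax,1000h
    mov bx,20h
    mul bx               ; DX = 0002h, AX = 0000h, CF = OF = 1

    mov al,10
    mov bl,20
    mul bl               ; AX = 00C8h (200), CF = OF = 0 because AH is zero
```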

imul performs a signed multiplication operation. This instruction has three variations. The first has one operand and uses the same registers as the mul instruction. The second has two operands; in this case the destination operand is multiplied by the source operand and the result replaces the destination operand. The destination operand must be a general register, it can be word or double word, and the source operand can be a general register, memory or an immediate value. The third form has three operands: the destination operand must be a general register, word or double word in size, the source operand can be a general register or memory, and the third operand must be an immediate value. The source operand is multiplied by the immediate value and the result is stored in the destination register. All three forms calculate the product to twice the size of the operands and set CF and OF when the upper half of the product contains significant bits, that is, when it is not just the sign extension of the lower half; the second and third forms then truncate the product to the size of the operands. The second and third forms can therefore also be used for unsigned operands because, whether the operands are signed or unsigned, the lower half of the product is the same. Below are the examples for all three forms:

    imul bl         ; accumulator by register
    imul word [si]  ; accumulator by memory
    imul bx,cx      ; register by register
    imul bx,[si]    ; register by memory
    imul bx,10      ; register by immediate value
    imul ax,bx,10   ; register by immediate value to register
    imul ax,[si],10 ; memory by immediate value to register

div performs an unsigned division of the accumulator by the operand. The dividend (the accumulator) is twice the size of the divisor (the operand), the quotient and remainder have the same size as the divisor. If divisor is byte, the dividend is taken from AX register, the quotient is stored in AL and the remainder is stored in AH. If divisor is word, the upper half of dividend is taken from DX, the lower half of dividend is taken from AX, the quotient is stored in AX and the remainder is stored in DX. If divisor is double word, the upper half of dividend is taken from EDX, the lower half of dividend is taken from EAX, the quotient is stored in EAX and the remainder is stored in EDX. Rules for the operand are the same as for the mul instruction.

idiv performs a signed division of the accumulator by the operand. It uses the same registers as the div instruction, and the rules for the operand are the same.
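Since the dividend is twice the size of the divisor, its upper half has to be prepared before dividing: cleared for div, sign-extended for idiv. A sketch assuming 32-bit code and arbitrary sample values:

```asm
use32
    mov eax,100
    xor edx,edx          ; unsigned dividend: upper half in EDX is zero
    mov ecx,7
    div ecx              ; EAX = 14 (quotient), EDX = 2 (remainder)

    mov eax,-100
    cdq                  ; signed dividend: sign-extend EAX into EDX
    mov ecx,7
    idiv ecx             ; EAX = -14 (quotient), EDX = -2 (remainder)
```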

2.1.4 Decimal arithmetic instructions

Decimal arithmetic is performed by combining the binary arithmetic instructions (already described in the prior section) with the decimal arithmetic instructions. The decimal arithmetic instructions are used to adjust the results of a previous binary arithmetic operation to produce a valid packed or unpacked decimal result, or to adjust the inputs to a subsequent binary arithmetic operation so the operation will produce a valid packed or unpacked decimal result.

daa adjusts the result of adding two valid packed decimal operands in AL. daa must always follow the addition of two pairs of packed decimal numbers (one digit in each half-byte) to obtain a pair of valid packed decimal digits as results. The carry flag is set if carry was needed. This instruction has no operands.

das adjusts the result of subtracting two valid packed decimal operands in AL. das must always follow the subtraction of one pair of packed decimal numbers (one digit in each half-byte) from another to obtain a pair of valid packed decimal digits as results. The carry flag is set if a borrow was needed. This instruction has no operands.

aaa changes the contents of register AL to a valid unpacked decimal number, and zeroes the top four bits. aaa must always follow the addition of two unpacked decimal operands in AL. The carry flag is set and AH is incremented if a carry is necessary. This instruction has no operands.

aas changes the contents of register AL to a valid unpacked decimal number, and zeroes the top four bits. aas must always follow the subtraction of one unpacked decimal operand from another in AL. The carry flag is set and AH decremented if a borrow is necessary. This instruction has no operands.

aam corrects the result of a multiplication of two valid unpacked decimal numbers. aam must always follow the multiplication of two decimal numbers to produce a valid decimal result. The high order digit is left in AH, the low order digit in AL. The generalized version of this instruction allows adjustment of the contents of the AX to create two unpacked digits of any number base. The standard version of this instruction has no operands, the generalized version has one operand - an immediate value specifying the number base for the created digits.

aad modifies the numerator in AH and AL to prepare for the division of two valid unpacked decimal operands so that the quotient produced by the division will be a valid unpacked decimal number. AH should contain the high order digit and AL the low order digit. This instruction adjusts the value and places the result in AL, while AH will contain zero. The generalized version of this instruction allows adjustment of two unpacked digits of any number base. Rules for the operand are the same as for the aam instruction.
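Two of the adjustments described above, sketched for 16-bit code with arbitrary sample digits: daa after a packed decimal addition, and aam after multiplying two unpacked digits.

```asm
use16
    mov al,38h           ; packed decimal 38
    add al,45h           ; binary sum: AL = 7Dh
    daa                  ; adjusted: AL = 83h (38 + 45 = 83), CF = 0

    mov al,9             ; unpacked decimal digit 9
    mov bl,7             ; unpacked decimal digit 7
    mul bl               ; binary product: AX = 003Fh (63)
    aam                  ; adjusted: AH = 06h, AL = 03h
```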

2.1.5 Logical instructions

not inverts the bits in the specified operand to form a one's complement of the operand. It has no effect on the flags. Rules for the operand are the same as for the inc instruction.

and, or and xor instructions perform the standard logical operations. They update the SF, ZF and PF flags, and clear CF and OF. Rules for the operands are the same as for the add instruction.

bt, bts, btr and btc instructions operate on a single bit which can be in memory or in a general register. The location of the bit is specified as an offset from the low order end of the operand. The value of the offset is taken from the second operand, which may be either an immediate byte or a general register. These instructions first assign the value of the selected bit to CF. The bt instruction does nothing more, bts sets the selected bit to 1, btr resets the selected bit to 0, and btc changes the bit to its complement. The first operand can be word or double word.

    bt  ax,15        ; test bit in register
    bts word [bx],15 ; test and set bit in memory
    btr ax,cx        ; test and reset bit in register
    btc word [bx],cx ; test and complement bit in memory
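The order of the two effects matters: CF always receives the value of the bit before it is modified. A sketch assuming 16-bit code, with bit 3 chosen arbitrarily:

```asm
use16
    mov ax,0
    bts ax,3             ; CF = 0 (old value of bit 3), AX = 0008h
    bt  ax,3             ; CF = 1, AX unchanged
    btc ax,3             ; CF = 1 (old value of bit 3), AX = 0000h
```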

bsf and bsr instructions scan a word or double word for first set bit and store the index of this bit into destination operand, which must be general register. The bit string being scanned is specified by source operand, it may be either general register or memory. The ZF flag is set if the entire string is zero (no set bits are found); otherwise it is cleared. If no set bit is found, the value of the destination register is undefined. bsf scans from low order to high order (starting from bit index zero). bsr scans from high order to low order (starting from bit index 15 of a word or index 31 of a double word).

    bsf ax,bx        ; scan register forward
    bsr ax,[si]      ; scan memory reverse
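For a value with more than one set bit, the two scans return different indexes. A sketch assuming 16-bit code and an arbitrary sample value with bits 3 and 5 set:

```asm
use16
    mov bx,0000000000101000b ; bits 3 and 5 are set
    bsf ax,bx            ; lowest set bit:  AX = 3, ZF = 0
    bsr cx,bx            ; highest set bit: CX = 5, ZF = 0
```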

shl shifts the destination operand left by the number of bits specified in the second operand. The destination operand can be byte, word, or double word general register or memory. The second operand can be an immediate value or the CL register. The processor shifts zeros in from the right (low order) side of the operand as bits exit from the left side. The last bit that exited is stored in CF. sal is a synonym for shl.

    shl al,1         ; shift register left by one bit
    shl byte [bx],1  ; shift memory left by one bit
    shl ax,cl        ; shift register left by count from cl
    shl word [bx],cl ; shift memory left by count from cl

shr and sar shift the destination operand right by the number of bits specified in the second operand. Rules for operands are the same as for the shl instruction. shr shifts zeros in from the left side of the operand as bits exit from the right side. The last bit that exited is stored in CF. sar preserves the sign of the operand by shifting in zeros on the left side if the value is positive or by shifting in ones if the value is negative.
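The difference between the two right shifts only shows when the highest bit of the operand is set. A sketch with the arbitrary value 0F0h:

```asm
use16
    mov al,0F0h          ; -16 as a signed byte, 240 as unsigned
    sar al,2             ; ones shifted in:  AL = 0FCh (-4)

    mov bl,0F0h
    shr bl,2             ; zeros shifted in: BL = 3Ch (60)
```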

shld shifts bits of the destination operand to the left by the number of bits specified in third operand, while shifting high order bits from the source operand into the destination operand on the right. The source operand remains unmodified. The destination operand can be a word or double word general register or memory, the source operand must be a general register, third operand can be an immediate value or the CL register.

    shld ax,bx,1     ; shift register left by one bit
    shld [di],bx,1   ; shift memory left by one bit
    shld ax,bx,cl    ; shift register left by count from cl
    shld [di],bx,cl  ; shift memory left by count from cl

shrd shifts bits of the destination operand to the right, while shifting low order bits from the source operand into the destination operand on the left. The source operand remains unmodified. Rules for operands are the same as for the shld instruction.
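A common use of these double shifts is shifting a value that spans two registers. The sketch below assumes 32-bit code and treats EDX:EAX as one 64-bit value, which is a convention chosen for this example; the sample values are arbitrary.

```asm
use32
    ; shift the 64-bit value in EDX:EAX left by 4 bits
    mov edx,00000001h
    mov eax,80000000h
    shld edx,eax,4       ; EDX = 00000018h (top 4 bits of EAX moved in)
    shl eax,4            ; EAX = 00000000h

    ; shift the 64-bit value in EDX:EAX right by 4 bits
    shrd eax,edx,4       ; low 4 bits of EDX move into the top of EAX
    shr edx,4            ; then the upper half is shifted on its own
```

The partner register has to be shifted second in both cases, because shld and shrd leave the source operand unmodified.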

rol and rcl rotate the byte, word or double word destination operand left by the number of bits specified in the second operand. For each rotation specified, the high order bit that exits from the left of the operand returns at the right to become the new low order bit. rcl additionally puts in CF each high order bit that exits from the left side of the operand before it returns to the operand as the low order bit on the next rotation cycle. Rules for operands are the same as for the shl instruction.

ror and rcr rotate the byte, word or double word destination operand right by the number of bits specified in the second operand. For each rotation specified, the low order bit that exits from the right of the operand returns at the left to become the new high order bit. rcr additionally puts in CF each low order bit that exits from the right side of the operand before it returns to the operand as the high order bit on the next rotation cycle. Rules for operands are the same as for the shl instruction.

test performs the same action as the and instruction, but it does not alter the destination operand, only updates flags. Rules for the operands are the same as for the and instruction.

bswap reverses the byte order of a 32-bit general register: bits 0 through 7 are swapped with bits 24 through 31, and bits 8 through 15 are swapped with bits 16 through 23. This instruction is provided for converting little-endian values to big-endian format and vice versa.

    bswap edx        ; swap bytes in register
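With a concrete value (arbitrary, chosen so that each byte is recognizable) the effect looks as follows:

```asm
use32
    mov edx,12345678h
    bswap edx            ; EDX = 78563412h
```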

2.1.6 Control transfer instructions

jmp unconditionally transfers control to the target location. The destination address can be specified directly within the instruction or indirectly through a register or memory. The acceptable size of this address depends on whether the jump is near or far (which can be specified by preceding the operand with the near or far operator) and whether the instruction is 16-bit or 32-bit. The operand for a near jump should be word size for a 16-bit instruction or dword size for a 32-bit instruction. The operand for a far jump should be dword size for a 16-bit instruction or pword size for a 32-bit instruction. A direct jmp instruction includes the destination address as part of the instruction (and can be preceded by the short, near or far operator). The operand specifying the address should be a numerical expression for a near or short jump, or two numerical expressions separated with a colon for a far jump, where the first specifies the selector of the segment and the second is the offset within the segment. The pword operator can be used to force the 32-bit far jump, and dword to force the 16-bit far jump. An indirect jmp instruction obtains the destination address indirectly through a register or a pointer variable; the operand should be a general register or memory. See also 1.2.5 for some more details.

    jmp 100h         ; direct near jump
    jmp 0FFFFh:0     ; direct far jump
    jmp ax           ; indirect near jump
    jmp pword [ebx]  ; indirect far jump

call transfers control to the procedure, saving on the stack the address of the instruction following the call for later use by a ret (return) instruction. Rules for the operands are the same as for the jmp instruction, but the call has no short variant of the direct instruction and thus it is not optimized.

ret, retn and retf instructions terminate the execution of a procedure and transfer control back to the program that originally invoked the procedure, using the address that was stored on the stack by the call instruction. ret is equivalent to retn, which returns from a procedure that was executed using the near call, while retf returns from a procedure that was executed using the far call. These instructions default to the size of address appropriate for the current code setting, but the size of address can be forced to 16-bit by using the retw, retnw and retfw mnemonics, and to 32-bit by using the retd, retnd and retfd mnemonics. All these instructions may optionally specify an immediate operand; by adding this constant to the stack pointer, they effectively remove any arguments that the calling program pushed on the stack before the execution of the call instruction.
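The interplay of call, the saved return address and the immediate operand of the return instruction can be sketched as follows. The sketch assumes 32-bit code with a near call; the procedure name add_two and the convention of passing two double word arguments on the stack are hypothetical, chosen only for this example.

```asm
use32
    push 3               ; second argument
    push 2               ; first argument
    call add_two         ; pushes the return address, then jumps to add_two
    jmp $                ; execution resumes here with EAX = 5

add_two:                 ; hypothetical procedure
    mov eax,[esp+4]      ; first argument (just above the return address)
    add eax,[esp+8]      ; second argument
    retn 8               ; return and discard the 8 bytes of arguments
```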

iret returns control to an interrupted procedure. It differs from ret in that it also pops the flags from the stack into the flags register. The flags are stored on the stack by the interrupt mechanism. It defaults to the size of return address appropriate for the current code setting, but it can be forced to use 16-bit or 32-bit address by using the iretw or iretd mnemonic.

The conditional transfer instructions are jumps that may or may not transfer control, depending on the state of the CPU flags when the instruction executes. The mnemonics for conditional jumps may be obtained by attaching the condition mnemonic (see table 2.1) to the j mnemonic, for example jc instruction will transfer the control when the CF flag is set. The conditional jumps can be short or near, and direct only, and can be optimized (see 1.2.5), the operand should be an immediate value specifying target address.

Table 2.1 Conditions

Mnemonic     Condition tested         Description
o            OF = 1                   overflow
no           OF = 0                   not overflow
c, b, nae    CF = 1                   carry, below, not above nor equal
nc, ae, nb   CF = 0                   not carry, above or equal, not below
e, z         ZF = 1                   equal, zero
ne, nz       ZF = 0                   not equal, not zero
be, na       CF or ZF = 1             below or equal, not above
a, nbe       CF or ZF = 0             above, not below nor equal
s            SF = 1                   sign
ns           SF = 0                   not sign
p, pe        PF = 1                   parity, parity even
np, po       PF = 0                   not parity, parity odd
l, nge       SF xor OF = 1            less, not greater nor equal
ge, nl       SF xor OF = 0            greater or equal, not less
le, ng       (SF xor OF) or ZF = 1    less or equal, not greater
g, nle       (SF xor OF) or ZF = 0    greater, not less nor equal
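The conditions built on CF (below, above) are meant for unsigned comparisons, while those built on SF and OF (less, greater) are meant for signed ones. The sketch below compares the same pair of values both ways; it assumes 32-bit code, and all labels are hypothetical.

```asm
use32
    mov eax,-1           ; 0FFFFFFFFh: -1 as signed, 4294967295 as unsigned
    mov ebx,1
    cmp eax,ebx          ; sets the flags as sub would, changes no operand
    jl  signed_less      ; taken: -1 is less than 1 as signed values
    jmp $                ; not reached in this sketch
signed_less:
    cmp eax,ebx
    jb  unsigned_below   ; not taken: 0FFFFFFFFh is not below 1 as unsigned
    ja  unsigned_above   ; taken: CF = 0 and ZF = 0
unsigned_below:
    jmp $
unsigned_above:
    jmp $
```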

The loop instructions are conditional jumps that use a value placed in CX (or ECX) to specify the number of repetitions of a software loop. All loop instructions automatically decrement CX (or ECX) and terminate the loop (don't transfer the control) when CX (or ECX) reaches zero. loop uses CX or ECX depending on whether the current code setting is 16-bit or 32-bit, but it can be forced to use CX with the loopw mnemonic or to use ECX with the loopd mnemonic. loope and loopz are synonyms for the same instruction, which acts as the standard loop, but also terminates the loop when the ZF flag is not set (it keeps looping only while ZF is set). The loopew and loopzw mnemonics force it to use the CX register, while looped and loopzd force it to use the ECX register. loopne and loopnz are synonyms for the same instruction, which acts as the standard loop, but also terminates the loop when the ZF flag is set (it keeps looping only while ZF is clear). The loopnew and loopnzw mnemonics force it to use the CX register, while loopned and loopnzd force it to use the ECX register. Every loop instruction needs an operand that is an immediate value specifying the target address, and it can be only a short jump (in the range of 128 bytes back and 127 bytes forward from the address of the instruction following the loop instruction).

jcxz branches to the label specified in the instruction if it finds a value of zero in CX, jecxz does the same, but checks the value of ECX instead of CX. Rules for the operands are the same as for the loop instruction.
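A counted loop built from these instructions might look like the sketch below, which sums the numbers from 10 down to 1. It assumes 16-bit code, so that loop and jcxz both use CX; the labels are hypothetical. The jcxz guard is there because loop decrements before testing, so entering the loop with CX equal to zero would run it 65536 times.

```asm
use16
    xor ax,ax            ; running sum
    mov cx,10            ; loop counter
    jcxz done            ; skip the loop entirely if the counter is zero
sum_loop:
    add ax,cx            ; add the current counter value
    loop sum_loop        ; decrement CX, jump back while CX is nonzero
done:                    ; AX = 55 here
```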

int activates the interrupt service routine that corresponds to the number specified as an operand to the instruction, the number should be in range from 0 to 255. The interrupt service routine terminates with an iret instruction that returns control to the instruction that follows int. int3 mnemonic codes the short (one byte) trap that invokes the interrupt 3. into instruction invokes the interrupt 4 if the OF flag is set.

bound verifies that the signed value contained in the specified register lies within specified limits. An interrupt 5 occurs if the value contained in the register is less than the lower bound or greater than the upper bound. It needs two operands, the first operand specifies the register being tested, the second operand should be memory address for the two signed limit values. The operands can be word or dword in size.

    bound ax,[bx]    ; check word for bounds
    bound eax,[esi]  ; check double word for bounds

2.1.7 I/O instructions

in transfers a byte, word, or double word from an input port to AL, AX, or EAX. I/O ports can be addressed either directly, with the immediate byte value coded in instruction, or indirectly via the DX register. The destination operand should be AL, AX, or EAX register. The source operand should be an immediate value in range from 0 to 255, or DX register.

    in al,20h        ; input byte from port 20h
    in ax,dx         ; input word from port addressed by dx

out transfers a byte, word, or double word to an output port from AL, AX, or EAX. The program can specify the number of the port using the same methods as the in instruction. The destination operand should be an immediate value in range from 0 to 255, or DX register. The source operand should be AL, AX, or EAX register.

    out 20h,ax       ; output word to port 20h
    out dx,al        ; output byte to port addressed by dx

2.1.8 Strings operations

The string operations operate on one element of a string. A string element may be a byte, a word, or a double word. The string elements are addressed by SI and DI (or ESI and EDI) registers. After every string operation SI and/or DI (or ESI and/or EDI) are automatically updated to point to the next element of the string. If DF (direction flag) is zero, the index registers are incremented, if DF is one, they are decremented. The amount of the increment or decrement is 1, 2, or 4 depending on the size of the string element. Every string operation instruction has short forms which have no operands and use SI and/or DI when the code type is 16-bit, and ESI and/or EDI when the code type is 32-bit. SI and ESI by default address data in the segment selected by DS, DI and EDI always address data in the segment selected by ES. Short form is obtained by attaching to the mnemonic of string operation letter specifying the size of string element, it should be b for byte element, w for word element, and d for double word element. Full form of string operation needs operands providing the size operator and the memory addresses, which can be SI or ESI with any segment prefix, DI or EDI always with ES segment prefix.

movs transfers the string element pointed to by SI (or ESI) to the location pointed to by DI (or EDI). Size of operands can be byte, word, or double word. The destination operand should be memory addressed by DI or EDI, the source operand should be memory addressed by SI or ESI with any segment prefix.

    movs byte [di],[si]        ; transfer byte
    movs word [es:di],[ss:si]  ; transfer word
    movsd                      ; transfer double word

cmps subtracts the destination string element from the source string element and updates the flags AF, SF, PF, CF and OF, but it does not change any of the compared elements. If the string elements are equal, ZF is set, otherwise it is cleared. The first operand for this instruction should be the source string element addressed by SI or ESI with any segment prefix, the second operand should be the destination string element addressed by DI or EDI.

    cmpsb                      ; compare bytes
    cmps word [ds:si],[es:di]  ; compare words
    cmps dword [fs:esi],[edi]  ; compare double words

scas subtracts the destination string element from AL, AX, or EAX (depending on the size of string element) and updates the flags AF, SF, ZF, PF, CF and OF. If the values are equal, ZF is set, otherwise it is cleared. The operand should be the destination string element addressed by DI or EDI.

    scas byte [es:di]          ; scan byte
    scasw                      ; scan word
    scas dword [es:edi]        ; scan double word

stos places the value of AL, AX, or EAX into the destination string element. Rules for the operand are the same as for the scas instruction.

lods places the source string element into AL, AX, or EAX. The operand should be the source string element addressed by SI or ESI with any segment prefix.

    lods byte [ds:si]          ; load byte
    lods word [cs:si]          ; load word
    lodsd                      ; load double word
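
As an illustration of how lods and stos can work together, the following sketch converts a zero-terminated ASCII string to upper case in place. It assumes 32-bit code in which DS and ES select the same segment; the label text and the local labels are hypothetical names introduced only for this example.

    cld                        ; DF = 0, so esi and edi are incremented
    mov esi,text               ; source pointer (hypothetical label)
    mov edi,esi                ; destination is the same buffer
next_char:
    lodsb                      ; al = [esi], esi = esi+1
    cmp al,'a'
    jb store
    cmp al,'z'
    ja store
    sub al,20h                 ; convert lower case letter to upper case
store:
    stosb                      ; [edi] = al, edi = edi+1
    test al,al
    jnz next_char              ; stop after the terminating zero is stored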

ins transfers a byte, word, or double word from an input port addressed by DX register to the destination string element. The destination operand should be memory addressed by DI or EDI, the source operand should be the DX register.

    insb                       ; input byte
    ins word [es:di],dx        ; input word
    ins dword [edi],dx         ; input double word

outs transfers the source string element to an output port addressed by DX register. The destination operand should be the DX register and the source operand should be memory addressed by SI or ESI with any segment prefix.

    outs dx,byte [si]          ; output byte
    outsw                      ; output word
    outs dx,dword [gs:esi]     ; output double word

The repeat prefixes rep, repe/repz, and repne/repnz specify a repeated string operation. When a string operation instruction has a repeat prefix, the operation is executed repeatedly, each time using a different element of the string. The repetition terminates when one of the conditions specified by the prefix is satisfied. All three prefixes automatically decrease the CX or ECX register (depending on whether the string operation instruction uses 16-bit or 32-bit addressing) after each operation and repeat the associated operation until CX or ECX is zero. repe/repz and repne/repnz are used exclusively with the scas and cmps instructions (described above). When these prefixes are used, repetition of the next instruction also depends on the zero flag (ZF): repe and repz terminate the execution when ZF is zero, repne and repnz terminate the execution when ZF is set.

    rep  movsd       ; transfer multiple double words
    repe cmpsb       ; compare bytes until not equal
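
Two common uses of the repeat prefixes can be sketched as follows: copying a block of bytes, and measuring the length of a zero-terminated string. The fragment assumes 32-bit code with DS and ES selecting the same segment; the labels source, destination and string are hypothetical.

    cld                        ; process the strings upwards
    mov esi,source             ; hypothetical source buffer
    mov edi,destination        ; hypothetical destination buffer
    mov ecx,64                 ; number of bytes to copy
    rep movsb                  ; copy ecx bytes from [esi] to [edi]

    mov edi,string             ; hypothetical zero-terminated string
    xor al,al                  ; byte value to search for
    mov ecx,-1                 ; effectively no limit on the scan
    repne scasb                ; scan until a byte equal to al is found
    not ecx
    dec ecx                    ; ecx = length without the terminating zero

The length calculation works because ecx is decreased once for every scanned byte, including the terminating zero.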

2.1.9 Flag control instructions

The flag control instructions provide a method for directly changing the state of bits in the flag register. All instructions described in this section have no operands.

stc sets the CF (carry flag) to 1, clc zeroes the CF, cmc changes the CF to its complement. std sets the DF (direction flag) to 1, cld zeroes the DF, sti sets the IF (interrupt flag) to 1 and therefore enables the interrupts, cli zeroes the IF and therefore disables the interrupts.

lahf copies SF, ZF, AF, PF, and CF to bits 7, 6, 4, 2, and 0 of the AH register. The contents of the remaining bits are undefined. The flags remain unaffected.

sahf transfers bits 7, 6, 4, 2, and 0 from the AH register into SF, ZF, AF, PF, and CF.

pushf decrements esp by two or four and stores the low word or double word of flags register at the top of stack, size of stored data depends on the current code setting. pushfw variant forces storing the word and pushfd forces storing the double word.

popf transfers specific bits from the word or double word at the top of stack, then increments esp by two or four, this value depends on the current code setting. popfw variant forces restoring from the word and popfd forces restoring from the double word.
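
pushf and popf are often paired to preserve the flags around code that changes them. A minimal sketch, assuming 32-bit code that runs with enough privilege to execute cli:

    pushfd                     ; save the flags, including IF
    cli                        ; disable interrupts
    ; ... code that must not be interrupted ...
    popfd                      ; restore the flags, so interrupts are
                               ; enabled again only if they were before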

2.1.10 Conditional operations

The instructions obtained by attaching the condition mnemonic (see table 2.1) to the set mnemonic set a byte to one if the condition is true and set the byte to zero otherwise. The operand should be an 8-bit general register or a byte in memory.

    setne al         ; set al if zero flag cleared
    seto byte [bx]   ; set byte if overflow
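
A typical use is turning the result of a comparison into a 0 or 1 value without a jump. The sketch below assumes 32-bit code; the choice of registers is arbitrary.

    xor ecx,ecx                ; clear the whole register first
    cmp eax,ebx
    setl cl                    ; ecx = 1 if eax < ebx (signed), 0 otherwise

The register is cleared before cmp because xor itself modifies the flags.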

The salc instruction sets all the bits of the AL register when the carry flag is set and zeroes the AL register otherwise. This instruction has no arguments.

The instructions obtained by attaching the condition mnemonic to the cmov mnemonic transfer the word or double word from the general register or memory to the general register only when the condition is true. The destination operand should be general register, the source operand can be general register or memory.

    cmove ax,bx      ; move when zero flag set
    cmovnc eax,[ebx] ; move when carry flag cleared
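
As a sketch of a branchless selection, the following fragment leaves the greater of two signed values in eax. It assumes a processor that supports the conditional move instructions.

    cmp eax,ebx
    cmovl eax,ebx              ; if eax < ebx then eax = ebx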

cmpxchg compares the value in the AL, AX, or EAX register with the destination operand. If the two values are equal, the source operand is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, or EAX register. The destination operand may be a general register or memory, the source operand must be a general register.

    cmpxchg dl,bl    ; compare and exchange with register
    cmpxchg [bx],dx  ; compare and exchange with memory
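
cmpxchg is usually placed in a retry loop. The sketch below increments a shared double word variable atomically; it assumes 32-bit code, the hypothetical variable counter, and the lock prefix described in 2.1.12. It also relies on the fact that cmpxchg sets ZF when the compared values are equal and clears it otherwise.

    mov eax,[counter]          ; expected old value
retry:
    lea edx,[eax+1]            ; desired new value
    lock cmpxchg [counter],edx ; store edx only if [counter] = eax
    jnz retry                  ; otherwise eax holds the current value

For a plain increment a single lock inc would be enough; the loop form is shown because it generalizes to any update computed from the old value.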

cmpxchg8b compares the 64-bit value in EDX and EAX registers with the destination operand. If the values are equal, the 64-bit value in ECX and EBX registers is stored in the destination operand. Otherwise, the value in the destination operand is loaded into EDX and EAX registers. The destination operand should be a quad word in memory.

    cmpxchg8b [bx]   ; compare and exchange 8 bytes

2.1.11 Miscellaneous instructions

nop instruction occupies one byte but affects nothing but the instruction pointer. This instruction has no operands and doesn't perform any operation.

The ud2 instruction generates an invalid opcode exception. It is provided for software testing, to explicitly generate an invalid opcode. This instruction has no operands.

xlat replaces a byte in the AL register with a byte indexed by its value in a translation table addressed by BX or EBX. The operand should be a byte memory addressed by BX or EBX with any segment prefix. This instruction has also a short form xlatb which has no operands and uses the BX or EBX address in the segment selected by DS depending on the current code setting.
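
A classic use of the translation table is converting a value into a character. The sketch below assumes 32-bit code and a hypothetical table named hex_digits located in the segment selected by DS.

    mov ebx,hex_digits         ; address of the translation table
    and al,0Fh                 ; keep a value in the range 0 to 15
    xlatb                      ; al = byte at [ebx+al]

hex_digits db '0123456789ABCDEF'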

lds transfers a pointer variable from the source operand to DS and the destination register. The source operand must be a memory operand, and the destination operand must be a general register. The DS register receives the segment selector of the pointer while the destination register receives the offset part of the pointer. les, lfs, lgs and lss operate identically to lds except that instead of the DS register, the ES, FS, GS and SS registers are used respectively.

    lds bx,[si]      ; load pointer to ds:bx

lea transfers the offset of the source operand (rather than its value) to the destination operand. The source operand must be a memory operand, and the destination operand must be a general register.

    lea dx,[bx+si+1] ; load effective address to dx
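
Because lea only computes an address, it can also serve as a compact arithmetic instruction that leaves the flags unchanged. A short sketch, assuming 32-bit addressing:

    lea eax,[ebx+ebx*4]        ; eax = ebx*5
    lea edx,[eax+ecx-1]        ; edx = eax+ecx-1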

cpuid returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers. The information returned is selected by entering a value in the EAX register before the instruction is executed. This instruction has no operands.
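
As an example, entering zero in EAX requests the vendor identification string, which is returned in EBX, EDX and ECX, in that order. The sketch assumes a processor that supports cpuid and a hypothetical 12-byte buffer named vendor.

    xor eax,eax                ; function 0: vendor identification
    cpuid
    mov dword [vendor],ebx     ; first four characters
    mov dword [vendor+4],edx   ; next four characters
    mov dword [vendor+8],ecx   ; last four characters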

The pause instruction delays the execution of the next instruction by an implementation-specific amount of time. It can be used to improve the performance of spin-wait loops. This instruction has no operands.

enter creates a stack frame that may be used to implement the scope rules of block-structured high-level languages. A leave instruction at the end of a procedure complements an enter at the beginning of the procedure to simplify stack management and to control access to variables for nested procedures. The enter instruction includes two parameters. The first parameter specifies the number of bytes of dynamic storage to be allocated on the stack for the routine being entered. The second parameter corresponds to the lexical nesting level of the routine, it can be in range from 0 to 31. The specified lexical level determines how many sets of stack frame pointers the CPU copies into the new stack frame from the preceding frame. This list of stack frame pointers is sometimes called the display. The first word (or double word when code is 32-bit) of the display is a pointer to the last stack frame. This pointer enables a leave instruction to reverse the action of the previous enter instruction by effectively discarding the last stack frame. After enter creates the new display for a procedure, it allocates the dynamic storage space for that procedure by decrementing ESP by the number of bytes specified in the first parameter. To enable a procedure to address its display, enter leaves BP (or EBP) pointing to the beginning of the new stack frame. If the lexical level is zero, enter pushes BP (or EBP), copies SP to BP (or ESP to EBP) and then subtracts the first operand from ESP. For nesting levels greater than zero, the processor pushes additional frame pointers on the stack before adjusting the stack pointer.

    enter 2048,0     ; enter and allocate 2048 bytes on stack
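
For the lexical level zero the effect of enter can be compared with the equivalent sequence of simpler instructions. The sketch below assumes 32-bit code and shows a procedure body with 16 bytes of local storage.

    enter 16,0                 ; create frame with 16 bytes of locals
                               ; same effect as: push ebp
                               ;                 mov ebp,esp
                               ;                 sub esp,16
    mov dword [ebp-4],0        ; access a local variable
    leave                      ; same effect as: mov esp,ebp
                               ;                 pop ebp
    ret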

2.1.12 System instructions

lmsw loads the operand into the machine status word (bits 0 through 15 of CR0 register), while smsw stores the machine status word into the destination operand. The operand for both those instructions can be 16-bit general register or memory, for smsw it can also be 32-bit general register.

    lmsw ax          ; load machine status from register
    smsw [bx]        ; store machine status to memory

The lgdt and lidt instructions load the values from the operand into the global descriptor table register or the interrupt descriptor table register respectively. sgdt and sidt store the contents of the global descriptor table register or the interrupt descriptor table register in the destination operand. The operand should be 6 bytes in memory.

    lgdt [ebx]       ; load global descriptor table
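
The 6 bytes consist of a 16-bit limit followed by a 32-bit base address. A sketch of how such an operand might be defined, with gdtr, gdt and gdt_end being hypothetical labels and the table assumed to lie in a segment whose base is zero, so that its offset equals its linear address:

    lgdt [gdtr]                ; load limit and base from the 6 bytes at gdtr

gdtr:
    dw gdt_end-gdt-1           ; limit: size of the table in bytes minus one
    dd gdt                     ; linear base address of the table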

lldt loads the operand into the segment selector field of the local descriptor table register and sldt stores the segment selector from the local descriptor table register in the operand. ltr loads the operand into the segment selector field of the task register and str stores the segment selector from the task register in the operand. Rules for operand are the same as for the lmsw and smsw instructions.

lar loads the access rights from the segment descriptor specified by the selector in source operand into the destination operand and sets the ZF flag. The destination operand can be a 16-bit or 32-bit general register. The source operand should be a 16-bit general register or memory.

    lar ax,[bx]      ; load access rights into word
    lar eax,dx       ; load access rights into double word

lsl loads the segment limit from the segment descriptor specified by the selector in source operand into the destination operand and sets the ZF flag. Rules for operand are the same as for the lar instruction.

verr and verw verify whether the code or data segment specified with the operand is readable or writable from the current privilege level. The operand should be a word, it can be general register or memory. If the segment is accessible and readable (for verr) or writable (for verw) the ZF flag is set, otherwise it's cleared. Rules for operand are the same as for the lldt instruction.

arpl compares the RPL (requestor's privilege level) fields of two segment selectors. The first operand contains one segment selector and the second operand contains the other. If the RPL field of the destination operand is less than the RPL field of the source operand, the ZF flag is set and the RPL field of the destination operand is increased to match that of the source operand. Otherwise, the ZF flag is cleared and no change is made to the destination operand. The destination operand can be a word general register or memory, the source operand must be a general register.

    arpl bx,ax       ; adjust RPL of selector in register
    arpl [bx],ax     ; adjust RPL of selector in memory

clts clears the TS (task switched) flag in the CR0 register. This instruction has no operands.

The lock prefix causes the processor's bus-lock signal to be asserted during execution of the accompanying instruction. In a multiprocessor environment, the bus-lock signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. The lock prefix can be prepended only to the following instructions, and only to those forms of the instructions where the destination operand is a memory operand: add, adc, and, btc, btr, bts, cmpxchg, cmpxchg8b, dec, inc, neg, not, or, sbb, sub, xor, xadd and xchg. If the lock prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception may be generated. An undefined opcode exception will also be generated if the lock prefix is used with any instruction not in the above list. The xchg instruction always asserts the bus-lock signal regardless of the presence or absence of the lock prefix.
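
A simple spin lock illustrates the lock prefix together with the pause instruction from 2.1.11. This is only a sketch: it assumes 32-bit code and a hypothetical double word variable lock_var whose bit 0 is set while the lock is held.

acquire:
    lock bts dword [lock_var],0 ; atomically set bit 0, CF = its old value
    jnc acquired               ; bit was clear, so the lock is now ours
wait_free:
    pause                      ; hint that this is a spin-wait loop
    test dword [lock_var],1
    jnz wait_free              ; keep waiting while the lock is held
    jmp acquire                ; looks free, try to take it again
acquired:
    ; ... critical section ...
    mov dword [lock_var],0     ; release the lock

Waiting with a plain test instead of repeating the locked bts avoids asserting the bus lock on every iteration.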

hlt stops instruction execution and places the processor in a halted state. An enabled interrupt, a debug exception, the BINIT, INIT or the RESET signal will resume execution. This instruction has no operands.

invlpg invalidates (flushes) the TLB (translation lookaside buffer) entry specified with the operand, which should be a memory operand. The processor determines the page that contains that address and flushes the TLB entry for that page.

rdmsr loads the contents of a 64-bit MSR (model specific register) of the address specified in the ECX register into registers EDX and EAX. wrmsr writes the contents of registers EDX and EAX into the 64-bit MSR of the address specified in the ECX register. rdtsc loads the current value of the processor's time stamp counter from the 64-bit MSR into the EDX and EAX registers. The processor increments the time stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. rdpmc loads the contents of the 40-bit performance monitoring counter specified in the ECX register into registers EDX and EAX. These instructions have no operands.
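
A common use of rdtsc is a rough measurement of elapsed time as the difference between two readings. The sketch below assumes 32-bit code in which esi and edi are free to use; since the processor may execute instructions out of order, the result should be treated as approximate.

    rdtsc                      ; edx:eax = time stamp counter
    mov esi,eax
    mov edi,edx                ; keep the starting value in edi:esi
    ; ... code being measured ...
    rdtsc
    sub eax,esi
    sbb edx,edi                ; edx:eax = elapsed count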

wbinvd writes back all modified cache lines in the processor's internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated. This instruction has no operands.

rsm returns program control from the system management mode to the program that was interrupted when the processor received an SMM interrupt. This instruction has no operands.

sysenter executes a fast call to a level 0 system procedure, sysexit executes a fast return to level 3 user code. The addresses used by these instructions are stored in MSRs. These instructions have no operands.

2.1.13 FPU instructions

The FPU (Floating-Point Unit) instructions operate on the floating-point values in three formats: single precision (32-bit), double precision (64-bit) and double extended precision (80-bit). The FPU registers form the stack and each of them holds the double extended precision floating-point value. When some values are pushed onto the stack or are removed from the top, the FPU registers are shifted, so ST0 is always the value on the top of FPU stack, ST1 is the first value below the top, etc. The ST0 name has also the synonym ST.

fld pushes the floating-point value onto the FPU register stack. The operand can be a 32-bit, 64-bit or 80-bit memory location or an FPU register; its value is loaded onto the top of the FPU register stack (the ST0 register) and is automatically converted into the double extended precision format.

    fld dword [bx]   ; load single precision value from memory
    fld st2          ; push value of st2 onto register stack

fld1, fldz, fldl2t, fldl2e, fldpi, fldlg2 and fldln2 load commonly used constants onto the FPU register stack. The loaded constants are +1.0, +0.0, log₂ 10, log₂ e, π, log₁₀ 2 and ln 2 respectively. These instructions have no operands.

fild converts the signed integer source operand into double extended precision floating-point format and pushes the result onto the FPU register stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location.

    fild qword [bx]  ; load 64-bit integer from memory

fst copies the value of ST0 register to the destination operand, which can be 32-bit or 64-bit memory location or another FPU register. fstp performs the same operation as fst and then pops the register stack, getting rid of ST0. fstp accepts the same operands as the fst instruction and can also store value in the 80-bit memory.

    fst st3          ; copy value of st0 into st3 register
    fstp tword [bx]  ; store value in memory and pop stack

fist converts the value in ST0 to a signed integer and stores the result in the destination operand. The operand can be 16-bit or 32-bit memory location. fistp performs the same operation and then pops the register stack, it accepts the same operands as the fist instruction and can also store integer value in the 64-bit memory, so it has the same rules for operands as fild instruction.

fbld converts the packed BCD integer into double extended precision floating-point format and pushes this value onto the FPU stack. fbstp converts the value in ST0 to an 18-digit packed BCD integer, stores the result in the destination operand, and pops the register stack. The operand should be an 80-bit memory location.

fadd adds the destination and source operand and stores the sum in the destination location. The destination operand is always an FPU register, if the source is a memory location, the destination is ST0 register and only source operand should be specified. If both operands are FPU registers, at least one of them should be ST0 register. An operand in memory can be a 32-bit or 64-bit value.

    fadd qword [bx]  ; add double precision value to st0
    fadd st2,st0     ; add st0 to st2

faddp adds the destination and source operand, stores the sum in the destination location and then pops the register stack. The destination operand must be an FPU register and the source operand must be the ST0. When no operands are specified, ST1 is used as a destination operand.

    faddp            ; add st0 to st1 and pop the stack
    faddp st2,st0    ; add st0 to st2 and pop the stack

fiadd instruction converts an integer source operand into double extended precision floating-point value and adds it to the destination operand. The operand should be a 16-bit or 32-bit memory location.

    fiadd word [bx]  ; add word integer to st0

The fsub, fsubr, fmul, fdiv and fdivr instructions are similar to fadd, have the same rules for operands and differ only in the performed computation. fsub subtracts the source operand from the destination operand, fsubr subtracts the destination operand from the source operand, fmul multiplies the destination and source operands, fdiv divides the destination operand by the source operand and fdivr divides the source operand by the destination operand. fsubp, fsubrp, fmulp, fdivp and fdivrp perform the same operations and pop the register stack; the rules for operands are the same as for the faddp instruction. fisub, fisubr, fimul, fidiv and fidivr perform these operations after converting the integer source operand into a floating-point value; they have the same rules for operands as the fiadd instruction.
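
The stack-oriented nature of these instructions is easiest to see in a small calculation. The following sketch evaluates (x+y)*z for single precision values; it assumes 32-bit addressing and the hypothetical variables x, y, z and result.

    fld dword [x]              ; st0 = x
    fld dword [y]              ; st0 = y, st1 = x
    faddp                      ; st0 = x+y
    fld dword [z]              ; st0 = z, st1 = x+y
    fmulp                      ; st0 = (x+y)*z
    fstp dword [result]        ; store the result and pop it

After the last instruction the register stack is in the same state as before the first one.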

fsqrt computes the square root of the value in the ST0 register, fsin computes the sine of that value, fcos computes the cosine of that value, fchs complements its sign bit, fabs clears its sign to create the absolute value, and frndint rounds it to the nearest integral value, depending on the current rounding mode. f2xm1 computes the exponential value of 2 to the power of ST0 and subtracts 1.0 from it; the value of ST0 must lie in the range -1.0 to +1.0. All these instructions store the result in ST0 and have no operands.

fsincos computes both the sine and the cosine of the value in the ST0 register, stores the sine in ST0 and pushes the cosine onto the top of the FPU register stack. fptan computes the tangent of the value in ST0, stores the result in ST0 and pushes a 1.0 onto the FPU register stack. fpatan computes the arctangent of the value in ST1 divided by the value in ST0, stores the result in ST1 and pops the FPU register stack. fyl2x computes the binary logarithm of ST0, multiplies it by ST1, stores the result in ST1 and pops the FPU register stack; fyl2xp1 performs the same operation but it adds 1.0 to ST0 before computing the logarithm. fprem computes the remainder obtained from dividing the value in ST0 by the value in ST1, and stores the result in ST0. fprem1 performs the same operation as fprem, but it computes the remainder in the way specified by IEEE Standard 754. fscale truncates the value in ST1 and increases the exponent of ST0 by this value. fxtract separates the value in ST0 into its exponent and significand, stores the exponent in ST0 and pushes the significand onto the register stack. fnop performs no operation. These instructions have no operands.
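
Combined with the constant loading instructions, fyl2x can compute logarithms to other bases. As a sketch, the natural logarithm of a positive double precision value follows from ln x = ln 2 * log₂ x; the variables x and result are hypothetical.

    fldln2                     ; st0 = ln 2
    fld qword [x]              ; st0 = x, st1 = ln 2
    fyl2x                      ; st0 = ln 2 * log2(x) = ln x
    fstp qword [result]        ; store the result and pop it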

fxch exchanges the contents of ST0 and another FPU register. The operand should be an FPU register; if no operand is specified, the contents of ST0 and ST1 are exchanged.

fcom and fcomp compare the contents of ST0 and the source operand and set flags in the FPU status word according to the results. fcomp additionally pops the register stack after performing the comparison. The operand can be a single or double precision value in memory or an FPU register. When no operand is specified, ST1 is used as the source operand.

    fcom             ; compare st0 with st1
    fcomp st2        ; compare st0 with st2 and pop stack

fcompp compares the contents of ST0 and ST1, sets flags in the FPU status word according to the results and pops the register stack twice. This instruction has no operands.

fucom, fucomp and fucompp perform an unordered comparison of two FPU registers. Rules for operands are the same as for fcom, fcomp and fcompp, but the source operand must be an FPU register.

ficom and ficomp compare the value in ST0 with an integer source operand and set the flags in the FPU status word according to the results. ficomp additionally pops the register stack after performing the comparison. The integer value is converted to double extended precision floating-point format before the comparison is made. The operand should be a 16-bit or 32-bit memory location.

    ficom word [bx]  ; compare st0 with 16-bit integer

fcomi, fcomip, fucomi and fucomip compare ST0 with another FPU register and set the ZF, PF and CF flags according to the results. fcomip and fucomip additionally pop the register stack after performing the comparison. The instructions obtained by attaching the FPU condition mnemonic (see table 2.2) to the fcmov mnemonic transfer the specified FPU register into the ST0 register if the given test condition is true. These instructions allow two different syntaxes: one with a single operand specifying the source FPU register, and one with two operands, in which case the destination operand should be the ST0 register and the second operand specifies the source FPU register.

    fcomi st2        ; compare st0 with st2 and set flags
    fcmovb st0,st2   ; transfer st2 to st0 if below
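
Used together, these instructions allow a selection without any jump. The sketch below stores the greater of two double precision values; it assumes that neither value is a NaN and uses the hypothetical variables a, b and max_value.

    fld qword [b]              ; st0 = b
    fld qword [a]              ; st0 = a, st1 = b
    fcomi st1                  ; compare a with b, set ZF, PF and CF
    fcmovb st0,st1             ; if a is below b then st0 = b
    fstp qword [max_value]     ; store the greater value and pop it
    fstp st0                   ; discard the remaining copy of b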

Table 2.2 FPU conditions

Mnemonic  Condition tested     Description
b         CF = 1               below
e         ZF = 1               equal
be        CF = 1 or ZF = 1     below or equal
u         PF = 1               unordered
nb        CF = 0               not below
ne        ZF = 0               not equal
nbe       CF = 0 and ZF = 0    not below nor equal
nu        PF = 0               not unordered

ftst compares the value in ST0 with 0.0 and sets the flags in the FPU status word according to the results. fxam examines the contents of the ST0 and sets the flags in FPU status word to indicate the class of value in the register. These instructions have no operands.

fstsw and fnstsw store the current value of the FPU status word in the destination location. The destination operand can be either a 16-bit memory location or the AX register. fstsw checks for pending unmasked FPU exceptions before storing the status word, fnstsw does not.
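
Storing the status word in AX makes it possible to branch on the result of fcom with the sahf instruction from 2.1.9, because the condition bits C0, C2 and C3 of the status word occupy the positions in AH that sahf transfers into CF, PF and ZF. A sketch, assuming that neither value is a NaN, with hypothetical variables x and y:

    fld dword [x]              ; st0 = x
    fcomp dword [y]            ; compare x with y and pop
    fnstsw ax                  ; ax = FPU status word
    sahf                       ; CF = C0, PF = C2, ZF = C3
    jae not_below              ; jump taken when x is not below y
    ; ... code for x below y ...
not_below: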

fstcw and fnstcw store the current value of the FPU control word at the specified destination in memory. fstcw checks for pending unmasked FPU exceptions before storing the control word, fnstcw does not. fldcw loads the operand into the FPU control word. The operand should be a 16-bit memory location.
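
The control word is typically changed to select a rounding mode. The following sketch converts ST0 to an integer with truncation toward zero and then restores the previous setting; old_cw and new_cw are hypothetical word variables and result is a hypothetical double word variable.

    fnstcw [old_cw]            ; save the current control word
    mov ax,[old_cw]
    or ax,0C00h                ; rounding control (bits 10-11) = 11b
    mov [new_cw],ax
    fldcw [new_cw]             ; select rounding toward zero
    fistp dword [result]       ; convert st0 to integer and pop
    fldcw [old_cw]             ; restore the previous rounding mode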

fstenv and fnstenv store the current FPU operating environment at the memory location specified with the destination operand, and then mask all FPU exceptions. fstenv checks for pending unmasked FPU exceptions before proceeding, fnstenv does not. fldenv loads the complete operating environment from memory into the FPU. fsave and fnsave store the current FPU state (operating environment and register stack) at the specified destination in memory and reinitialize the FPU. fsave checks for pending unmasked FPU exceptions before proceeding, fnsave does not. frstor loads the FPU state from the specified memory location. All these instructions need an operand that is a memory location. For each of these instructions there exist two additional mnemonics that allow the type of the operation to be selected precisely. The fstenvw, fnstenvw, fldenvw, fsavew, fnsavew and frstorw mnemonics force the instruction to perform the operation as in 16-bit mode, while fstenvd, fnstenvd, fldenvd, fsaved, fnsaved and frstord force the operation as in 32-bit mode.

finit and fninit set the FPU operating environment into its default state. finit checks for pending unmasked FPU exceptions before proceeding, fninit does not. fclex and fnclex clear the FPU exception flags in the FPU status word. fclex checks for pending unmasked FPU exceptions before proceeding, fnclex does not. wait and fwait are synonyms for the same instruction, which causes the processor to check for pending unmasked FPU exceptions and handle them before proceeding. These instructions have no operands.

ffree sets the tag associated with specified FPU register to empty. The operand should be an FPU register.

fincstp and fdecstp rotate the FPU stack by one by adding one to or subtracting one from the pointer to the top of the stack. These instructions have no operands.

2.1.14 MMX instructions

The MMX instructions operate on the packed integer types and use the MMX registers, which are the low 64-bit parts of the 80-bit FPU registers. Because of this, MMX instructions cannot be used at the same time as FPU instructions. They can operate on packed bytes (eight 8-bit integers), packed words (four 16-bit integers) or packed double words (two 32-bit integers); the use of packed formats allows operations to be performed on multiple data elements at one time.

movq copies a quad word from the source operand to the destination operand. At least one of the operands must be an MMX register, the second one can also be an MMX register or a 64-bit memory location.

    movq mm0,mm1     ; move quad word from register to register
    movq mm2,[ebx]   ; move quad word from memory to register

movd copies a double word from the source operand to the destination operand. One of the operands must be an MMX register, the second one can be a general register or a 32-bit memory location. Only the low double word of the MMX register is used.

All general MMX operations have two operands: the destination operand should be an MMX register, and the source operand can be an MMX register or a 64-bit memory location. The operation is performed on the corresponding data elements of the source and destination operands and the result is stored in the data elements of the destination operand. paddb, paddw and paddd perform the addition of packed bytes, packed words, or packed double words. psubb, psubw and psubd perform the subtraction of the appropriate types. paddsb, paddsw, psubsb and psubsw perform the addition or subtraction of packed bytes or packed words with signed saturation. paddusb, paddusw, psubusb and psubusw are analogous, but use unsigned saturation. pmulhw and pmullw perform a signed multiplication of the packed words and store the high or low words of the results in the destination operand. pmaddwd multiplies the packed words and adds the four intermediate double word products in pairs to produce a result of two packed double words.

pand, por and pxor perform the logical operations on the quad words; pandn also performs a logical negation of the destination operand before performing the and operation. pcmpeqb, pcmpeqw and pcmpeqd compare packed bytes, packed words or packed double words for equality. If a pair of data elements is equal, the corresponding data element in the destination operand is filled with bits of value 1, otherwise it is set to 0. pcmpgtb, pcmpgtw and pcmpgtd perform a similar operation, but they check whether the data elements in the destination operand are greater (as signed integers) than the corresponding data elements in the source operand.

packsswb converts packed signed words into packed signed bytes, and packssdw converts packed signed double words into packed signed words, using saturation to handle overflow conditions. packuswb converts packed signed words into packed unsigned bytes. Converted data elements from the destination operand are stored in the low part of the destination operand, while converted data elements from the source operand are stored in the high part. punpckhbw, punpckhwd and punpckhdq interleave the data elements from the high parts of the source and destination operands and store the result in the destination operand. punpcklbw, punpcklwd and punpckldq perform the same operation, but the low parts of the source and destination operands are used.

    paddsb mm0,[esi] ; add packed bytes with signed saturation
    pcmpeqw mm3,mm7  ; compare packed words for equality

psllw, pslld and psllq perform a logical shift left of the packed words, packed double words or a single quad word in the destination operand by the amount specified in the source operand. psrlw, psrld and psrlq perform a logical shift right of the packed words, packed double words or a single quad word. psraw and psrad perform an arithmetic shift right of the packed words or double words. The destination operand should be an MMX register, while the source operand can be an MMX register, a 64-bit memory location, or an 8-bit immediate value.

    psllw mm2,mm4    ; shift words left logically
    psrad mm4,[ebx]  ; shift double words right arithmetically

emms makes the FPU registers usable for the FPU instructions, it must be used before using the FPU instructions if any MMX instructions were used.
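
A complete MMX sequence therefore usually ends with emms. As a sketch, the following fragment adds two groups of eight bytes with unsigned saturation; it assumes 32-bit addressing, with esi and edi pointing to hypothetical 8-byte buffers.

    movq mm0,[esi]             ; load eight packed bytes
    paddusb mm0,[edi]          ; add eight bytes with unsigned saturation
    movq [esi],mm0             ; store the result
    emms                       ; make the FPU registers usable again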

2.1.15 SSE instructions

The SSE extension adds more MMX instructions and also introduces the operations on packed single precision floating point values. The 128-bit packed single precision format consists of four single precision floating point values. The 128-bit SSE registers are designed for the purpose of operations on this data type.

movaps and movups transfer a double quad word operand containing packed single precision values from the source operand to the destination operand. At least one of the operands has to be an SSE register, the second one can also be an SSE register or a 128-bit memory location. Memory operands for the movaps instruction must be aligned on a boundary of 16 bytes, operands for the movups instruction don't have to be aligned.

    movups xmm0,[ebx]  ; move unaligned double quad word

movlps moves two packed single precision values between memory and the low quad word of an SSE register. movhps moves two packed single precision values between memory and the high quad word of an SSE register. One of the operands must be an SSE register, and the other operand must be a 64-bit memory location.

    movlps xmm0,[ebx]  ; move memory to low quad word of xmm0
    movhps [esi],xmm7  ; move high quad word of xmm7 to memory

movlhps moves two packed single precision values from the low quad word of the source register to the high quad word of the destination register. movhlps moves two packed single precision values from the high quad word of the source register to the low quad word of the destination register. Both operands have to be SSE registers.

movmskps transfers the most significant bit of each of the four single precision values in the SSE register into low four bits of a general register. The source operand must be a SSE register, the destination operand must be a general register.
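
A short illustration of these register-only forms, with registers picked arbitrarily:

  1.     movlhps xmm0,xmm1  ; low quad word of xmm1 to high quad word of xmm0
  2.     movhlps xmm2,xmm3  ; high quad word of xmm3 to low quad word of xmm2
  3.     movmskps eax,xmm0  ; collect the four sign bits in low bits of eax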

movss transfers a single precision value between source and destination operand (only the low double word is transferred). At least one of the operands has to be an SSE register, the second one can also be an SSE register or a 32-bit memory location.

  1.     movss [edi],xmm3   ; move low double word of xmm3 to memory

Each of the SSE arithmetic operations has two variants. When the mnemonic ends with ps, the source operand can be a 128-bit memory location or an SSE register, the destination operand must be an SSE register, and the operation is performed on four packed single precision values, for each pair of the corresponding data elements separately. The result is stored in the destination register. When the mnemonic ends with ss, the source operand can be a 32-bit memory location or an SSE register, the destination operand must be an SSE register, and the operation is performed on single precision values. Only the low double words of SSE registers are used in this case, and the result is stored in the low double word of the destination register. addps and addss add the values, subps and subss subtract the source value from the destination value, mulps and mulss multiply the values, divps and divss divide the destination value by the source value, rcpps and rcpss compute the approximate reciprocal of the source value, sqrtps and sqrtss compute the square root of the source value, rsqrtps and rsqrtss compute the approximate reciprocal of the square root of the source value, maxps and maxss compare the source and destination values and return the greater one, minps and minss compare the source and destination values and return the lesser one.

  1.     mulss xmm0,[ebx]   ; multiply single precision values
  2.     addps xmm3,xmm7    ; add packed single precision values

andps, andnps, orps and xorps perform the logical operations on packed single precision values. The source operand can be a 128-bit memory location or a SSE register, the destination operand must be a SSE register.

cmpps compares packed single precision values and returns a mask result into the destination operand, which must be a SSE register. The source operand can be a 128-bit memory location or SSE register, the third operand must be an immediate operand selecting code of one of the eight compare conditions (table 2.3). cmpss performs the same operation on single precision values, only low double word of destination register is affected, in this case source operand can be a 32-bit memory location or SSE register. These two instructions have also variants with only two operands and the condition encoded within mnemonic. Their mnemonics are obtained by attaching the mnemonic from table 2.3 to the cmp mnemonic and then attaching the ps or ss at the end.

  1.     cmpps xmm2,xmm4,0  ; compare packed single precision values
  2.     cmpltss xmm0,[ebx] ; compare single precision values

Table 2.3 SSE conditions

Code  Mnemonic  Description
0     eq        equal
1     lt        less than
2     le        less than or equal
3     unord     unordered
4     neq       not equal
5     nlt       not less than
6     nle       not less than or equal
7     ord       ordered
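
The mask returned by these comparisons is commonly combined with the logical instructions to select values without branching. The following is only a sketch of that idea, assuming two vectors held in xmm0 and xmm1. For every element it keeps the value from xmm1 where xmm0 is less than xmm1, and the value from xmm0 otherwise:

  1.     movaps xmm2,xmm0   ; copy first vector
  2.     cmpltps xmm2,xmm1  ; xmm2 = all ones where xmm0 < xmm1, zeros elsewhere
  3.     andps xmm1,xmm2    ; keep elements of xmm1 where mask is set
  4.     andnps xmm2,xmm0   ; keep elements of xmm0 where mask is clear
  5.     orps xmm1,xmm2     ; merge both selections in xmm1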

comiss and ucomiss compare the single precision values and set the ZF, PF and CF flags to show the result. The destination operand must be a SSE register, the source operand can be a 32-bit memory location or SSE register.
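
Since the result is reported in the flags, it can be tested with the conditional jumps used for unsigned comparisons. In the sketch below the labels unordered, greater and equal are hypothetical and would have to be defined elsewhere in the program:

  1.     comiss xmm0,xmm1   ; compare low single precision values
  2.     jp unordered       ; PF set: at least one of the values is NaN
  3.     ja greater         ; CF and ZF clear: xmm0 is greater
  4.     je equal           ; ZF set: values are equal
  5.                        ; otherwise xmm0 is less than xmm1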

shufps moves any two of the four single precision values from the destination operand into the low quad word of the destination operand, and any two of the four values from the source operand into the high quad word of the destination operand. The destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register, the third operand must be an 8-bit immediate value selecting which values will be moved into the destination operand. Bits 0 and 1 select the value to be moved from destination operand to the low double word of the result, bits 2 and 3 select the value to be moved from the destination operand to the second double word, bits 4 and 5 select the value to be moved from the source operand to the third double word, and bits 6 and 7 select the value to be moved from the source operand to the high double word of the result.

  1.     shufps xmm0,xmm0,10010011b ; shuffle double words
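
Two more selector values may help in reading the bit fields of the third operand. Both examples use the same register as source and destination, which is a choice made only for illustration:

  1.     shufps xmm0,xmm0,0         ; copy the lowest value into all four positions
  2.     shufps xmm1,xmm1,00011011b ; reverse the order of the four values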

unpckhps performs an interleaved unpack of the values from the high parts of the source and destination operands and stores the result in the destination operand, which must be a SSE register. The source operand can be a 128-bit memory location or a SSE register. unpcklps performs an interleaved unpack of the values from the low parts of the source and destination operand and stores the result in the destination operand, the rules for operands are the same.

cvtpi2ps converts two packed double word integers into two packed single precision floating point values and stores the result in the low quad word of the destination operand, which should be an SSE register. The source operand can be a 64-bit memory location or an MMX register.

  1.     cvtpi2ps xmm0,mm0  ; convert integers to single precision values

cvtsi2ss converts a double word integer into a single precision floating point value and stores the result in the low double word of the destination operand, which should be a SSE register. The source operand can be a 32-bit memory location or 32-bit general register.

  1.     cvtsi2ss xmm0,eax  ; convert integer to single precision value

cvtps2pi converts two packed single precision floating point values into two packed double word integers and stores the result in the destination operand, which should be an MMX register. The source operand can be a 64-bit memory location or an SSE register, only the low quad word of the SSE register is used. cvttps2pi performs a similar operation, except that truncation is used to round the source values to integers, rules for the operands are the same.

  1.     cvtps2pi mm0,xmm0  ; convert single precision values to integers

cvtss2si converts a single precision floating point value into a double word integer and stores the result in the destination operand, which should be a 32-bit general register. The source operand can be a 32-bit memory location or an SSE register, only the low double word of the SSE register is used. cvttss2si performs a similar operation, except that truncation is used to round the source value to an integer, rules for the operands are the same.

  1.     cvtss2si eax,xmm0  ; convert single precision value to integer
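
The difference between the two conversions shows up when the value has a fractional part. As a sketch, assume that ebx points to the single precision value 2.7 and that the rounding mode in MXCSR is the default round to nearest:

  1.     cvtss2si eax,[ebx]   ; rounds according to MXCSR, eax = 3
  2.     cvttss2si edx,[ebx]  ; always truncates, edx = 2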

pextrw copies the word in the source operand specified by the third operand to the destination operand. The source operand must be an MMX register, the destination operand must be a 32-bit general register (the high word of the destination is cleared), the third operand must be an 8-bit immediate value.

  1.     pextrw eax,mm0,1   ; extract word into eax

pinsrw inserts a word from the source operand in the destination operand at the location specified with the third operand, which must be an 8-bit immediate value. The destination operand must be a MMX register, the source operand can be a 16-bit memory location or 32-bit general register (only low word of the register is used).

  1.     pinsrw mm1,ebx,2   ; insert word from ebx

pavgb and pavgw compute the average of packed bytes or words. pmaxub returns the maximum values of packed unsigned bytes, pminub returns the minimum values of packed unsigned bytes, pmaxsw returns the maximum values of packed signed words, pminsw returns the minimum values of packed signed words. pmulhuw performs an unsigned multiplication of the packed words and stores the high words of the results in the destination operand. psadbw computes the absolute differences of packed unsigned bytes, sums the differences, and stores the sum in the low word of the destination operand. All these instructions follow the same rules for operands as the general MMX operations described in the previous section.

pmovmskb creates a mask made of the most significant bit of each byte in the source operand and stores the result in the low byte of the destination operand. The source operand must be an MMX register, the destination operand must be a 32-bit general register.

pshufw inserts words from the source operand in the destination operand from the locations specified with the third operand. The destination operand must be an MMX register, the source operand can be a 64-bit memory location or an MMX register, the third operand must be an 8-bit immediate value selecting which values will be moved into the destination operand, in a similar way as the third operand of the shufps instruction.

movntq moves the quad word from the source operand to memory using a non-temporal hint to minimize cache pollution. The source operand should be an MMX register, the destination operand should be a 64-bit memory location. movntps stores packed single precision values from the SSE register to memory using a non-temporal hint. The source operand should be an SSE register, the destination operand should be a 128-bit memory location. maskmovq stores selected bytes from the first operand into a 64-bit memory location using a non-temporal hint. Both operands should be MMX registers, the second operand selects which bytes from the first operand are written to memory. The memory location is pointed to by the DI (or EDI) register in the segment selected by DS.

prefetcht0, prefetcht1, prefetcht2 and prefetchnta fetch the line of data from memory that contains the byte specified with the operand to a specified location in the cache hierarchy. The operand should be an 8-bit memory location.

sfence performs a serializing operation on all instructions storing to memory that were issued prior to it. This instruction has no operands.

ldmxcsr loads the 32-bit memory operand into the MXCSR register. stmxcsr stores the contents of MXCSR into a 32-bit memory operand.
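
A common use of this pair is to change a single control bit of MXCSR. The sketch below sets the flush-to-zero bit (bit 15), and it assumes that ebx points to a writable double word:

  1.     stmxcsr [ebx]           ; store current MXCSR to memory
  2.     or dword [ebx],8000h    ; set the flush-to-zero bit
  3.     ldmxcsr [ebx]           ; load modified value back into MXCSR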

fxsave saves the current state of the FPU, MXCSR register, and all the FPU and SSE registers to a 512-byte memory location specified in the destination operand. fxrstor reloads data previously stored with the fxsave instruction from the specified 512-byte memory location. The memory operand for both of these instructions must be aligned on a 16-byte boundary and should be declared without any specified size.

2.1.16 SSE2 instructions

The SSE2 extension introduces the operations on packed double precision floating point values, extends the syntax of MMX instructions, and adds also some new instructions.

movapd and movupd transfer a double quad word operand containing packed double precision values from source operand to destination operand. These instructions are analogous to movaps and movups and have the same rules for operands.

movlpd moves a double precision value between the memory and the low quad word of an SSE register. movhpd moves a double precision value between the memory and the high quad word of an SSE register. These instructions are analogous to movlps and movhps and have the same rules for operands.

movmskpd transfers the most significant bit of each of the two double precision values in the SSE register into low two bits of a general register. This instruction is analogous to movmskps and has the same rules for operands.

movsd transfers a double precision value between source and destination operand (only the low quad word is transferred). At least one of the operands has to be an SSE register, the second one can also be an SSE register or a 64-bit memory location.

Arithmetic operations on double precision values are: addpd, addsd, subpd, subsd, mulpd, mulsd, divpd, divsd, sqrtpd, sqrtsd, maxpd, maxsd, minpd, minsd, and they are analogous to the arithmetic operations on single precision values described in the previous section. When the mnemonic ends with pd instead of ps, the operation is performed on two packed double precision values, but the rules for operands are the same. When the mnemonic ends with sd instead of ss, the source operand can be a 64-bit memory location or an SSE register, the destination operand must be an SSE register and the operation is performed on double precision values, only the low quad words of SSE registers are used in this case.

andpd, andnpd, orpd and xorpd perform the logical operations on packed double precision values. They are analogous to the SSE logical operations on single precision values and have the same rules for operands.

cmppd compares packed double precision values and returns a mask result into the destination operand. This instruction is analogous to cmpps and has the same rules for operands. cmpsd performs the same operation on double precision values, only the low quad word of the destination register is affected, in this case the source operand can be a 64-bit memory location or an SSE register. Variants with only two operands are obtained by attaching the condition mnemonic from table 2.3 to the cmp mnemonic and then attaching pd or sd at the end.

comisd and ucomisd compare the double precision values and set the ZF, PF and CF flags to show the result. The destination operand must be an SSE register, the source operand can be a 64-bit memory location or an SSE register.

shufpd moves any of the two double precision values from the destination operand into the low quad word of the destination operand, and any of the two values from the source operand into the high quad word of the destination operand. This instruction is analogous to shufps and has the same rules for operands. Bit 0 of the third operand selects the value to be moved from the destination operand, bit 1 selects the value to be moved from the source operand, the rest of the bits are reserved and must be zeroed.

unpckhpd performs an unpack of the high quad words from the source and destination operands, unpcklpd performs an unpack of the low quad words from the source and destination operands. They are analogous to unpckhps and unpcklps, and have the same rules for operands.

cvtps2pd converts the packed two single precision floating point values to two packed double precision floating point values, the destination operand must be a SSE register, the source operand can be a 64-bit memory location or SSE register. cvtpd2ps converts the packed two double precision floating point values to packed two single precision floating point values, the destination operand must be a SSE register, the source operand can be a 128-bit memory location or SSE register. cvtss2sd converts the single precision floating point value to double precision floating point value, the destination operand must be a SSE register, the source operand can be a 32-bit memory location or SSE register. cvtsd2ss converts the double precision floating point value to single precision floating point value, the destination operand must be a SSE register, the source operand can be 64-bit memory location or SSE register.

cvtpi2pd converts two packed double word integers into packed double precision floating point values, the destination operand must be an SSE register, the source operand can be a 64-bit memory location or an MMX register. cvtsi2sd converts a double word integer into a double precision floating point value, the destination operand must be an SSE register, the source operand can be a 32-bit memory location or a 32-bit general register. cvtpd2pi converts packed double precision floating point values into two packed double word integers, the destination operand should be an MMX register, the source operand can be a 128-bit memory location or an SSE register. cvttpd2pi performs a similar operation, except that truncation is used to round the source values to integers, rules for operands are the same. cvtsd2si converts a double precision floating point value into a double word integer, the destination operand should be a 32-bit general register, the source operand can be a 64-bit memory location or an SSE register. cvttsd2si performs a similar operation, except that truncation is used to round the source value to an integer, rules for operands are the same.

cvtps2dq and cvttps2dq convert packed single precision floating point values to four packed double word integers, storing them in the destination operand. cvtpd2dq and cvttpd2dq convert packed double precision floating point values to two packed double word integers, storing the result in the low quad word of the destination operand. cvtdq2ps converts four packed double word integers to packed single precision floating point values. cvtdq2pd converts two packed double word integers from the low quad word of the source operand to packed double precision floating point values. For all these instructions the destination operand must be an SSE register, the source operand can be a 128-bit memory location or an SSE register (cvtdq2pd reads only the low quad word, so its memory operand is a 64-bit location).

movdqa and movdqu transfer a double quad word operand containing packed integers from source operand to destination operand. At least one of the operands has to be an SSE register, the second one can also be an SSE register or a 128-bit memory location. Memory operands for the movdqa instruction must be aligned on a 16-byte boundary, operands for the movdqu instruction don't have to be aligned.

movq2dq moves the contents of the MMX source register to the low quad word of destination SSE register. movdq2q moves the low quad word from the source SSE register to the destination MMX register.

  1.     movq2dq xmm0,mm1   ; move from MMX register to SSE register
  2.     movdq2q mm0,xmm1   ; move from SSE register to MMX register

All MMX instructions operating on the 64-bit packed integers (those with mnemonics starting with p) are extended to operate on 128-bit packed integers located in SSE registers. Additional syntax for these instructions needs an SSE register where MMX register was needed, and the 128-bit memory location or SSE register where 64-bit memory location or MMX register were needed. The exception is pshufw instruction, which doesn't allow extended syntax, but has two new variants: pshufhw and pshuflw, which allow only the extended syntax, and perform the same operation as pshufw on the high or low quad words of operands respectively. Also the new instruction pshufd is introduced, which performs the same operation as pshufw, but on the double words instead of words, it allows only the extended syntax.

  1.     psubb xmm0,[esi]   ; subtract 16 packed bytes
  2.     pextrw eax,xmm0,7  ; extract highest word into eax
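
The new shuffle variants take their third operand in the same format as pshufw. A few selector values, with registers chosen arbitrarily, are sketched here:

  1.     pshufd xmm0,xmm1,0           ; fill xmm0 with the low double word of xmm1
  2.     pshuflw xmm0,xmm0,00011011b  ; reverse the four words of the low quad word
  3.     pshufhw xmm0,xmm0,00011011b  ; reverse the four words of the high quad word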

paddq performs the addition of packed quad words, psubq performs the subtraction of packed quad words, pmuludq performs an unsigned multiplication of the low double words from each pair of corresponding quad words and returns the results in packed quad words. These instructions follow the same rules for operands as the general MMX operations described in 2.1.14.

pslldq and psrldq perform logical shift left or right of the double quad word in the destination operand by the number of bytes specified in the source operand. The destination operand should be an SSE register, the source operand should be an 8-bit immediate value.

punpckhqdq interleaves the high quad word of the source operand and the high quad word of the destination operand and writes them to the destination SSE register. punpcklqdq interleaves the low quad word of the source operand and the low quad word of the destination operand and writes them to the destination SSE register. The source operand can be a 128-bit memory location or SSE register.

movntdq stores packed integer data from the SSE register to memory using a non-temporal hint. The source operand should be an SSE register, the destination operand should be a 128-bit memory location. movntpd stores packed double precision values from the SSE register to memory using a non-temporal hint. Rules for operands are the same. movnti stores an integer from a general register to memory using a non-temporal hint. The source operand should be a 32-bit general register, the destination operand should be a 32-bit memory location. maskmovdqu stores selected bytes from the first operand into a 128-bit memory location using a non-temporal hint. Both operands should be SSE registers, the second operand selects which bytes from the first operand are written to memory. The memory location is pointed to by the DI (or EDI) register in the segment selected by DS and does not need to be aligned.

clflush writes and invalidates the cache line associated with the address of the byte specified with the operand, which should be an 8-bit memory location.

lfence performs a serializing operation on all instructions loading from memory that were issued prior to it. mfence performs a serializing operation on all instructions accessing memory that were issued prior to it, and so it combines the functions of the sfence (described in the previous section) and lfence instructions. These instructions have no operands.
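
Non-temporal stores are usually followed by a fence before other code relies on the written data. A minimal sketch, assuming esi and edi hold addresses aligned on a 16-byte boundary:

  1.     movdqa xmm0,[esi]   ; load 16 bytes of packed integers
  2.     movntdq [edi],xmm0  ; store them with a non-temporal hint
  3.     sfence              ; order the store before any later stores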

2.1.17 SSE3 instructions

Prescott technology introduced some new instructions to improve the performance of SSE and SSE2 - this extension is called SSE3.

fisttp behaves like the fistp instruction and accepts the same operands, the only difference is that it always uses truncation, irrespective of the rounding mode.

movshdup loads into the destination operand the 128-bit value obtained from the source value of the same size by filling each quad word with two duplicates of the value in its high double word. movsldup performs the same action, except that it duplicates the values of the low double words. The destination operand should be an SSE register, the source operand can be an SSE register or a 128-bit memory location.

movddup loads the 64-bit source value and duplicates it into high and low quad word of the destination operand. The destination operand should be SSE register, the source operand can be SSE register or 64-bit memory location.

lddqu is functionally equivalent to movdqu instruction with memory as source operand, but it may improve performance when the source operand crosses a cacheline boundary. The destination operand has to be SSE register, the source operand must be 128-bit memory location.

addsubps performs single precision addition of the second and fourth pairs and single precision subtraction of the first and third pairs of floating point values in the operands. addsubpd performs double precision addition of the second pair and double precision subtraction of the first pair of floating point values in the operands. haddps performs the addition of two single precision values within each quad word of the source and destination operands, and stores the results of such horizontal addition of values from the destination operand into the low quad word of the destination operand, and the results from the source operand into the high quad word of the destination operand. haddpd performs the addition of two double precision values within each operand, and stores the result from the destination operand into the low quad word of the destination operand, and the result from the source operand into the high quad word of the destination operand. All these instructions need the destination operand to be an SSE register, the source operand can be an SSE register or a 128-bit memory location.
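
As an illustration of the horizontal addition, applying haddps twice with the same register as both operands leaves the sum of all four values in every double word of that register. The choice of xmm0 is arbitrary:

  1.     haddps xmm0,xmm0   ; sums of the two pairs, repeated in both quad words
  2.     haddps xmm0,xmm0   ; sum of all four values in each double word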

monitor sets up an address range for monitoring of write-back stores. It needs its three operands to be the EAX, ECX and EDX registers, in that order. mwait waits for a write-back store to the address range set up by the monitor instruction. It uses two operands with additional parameters, the first being the EAX and the second the ECX register.

The functionality of SSE3 is further extended by the set of Supplemental SSE3 instructions (SSSE3). They generally follow the same rules for operands as all the MMX operations extended by SSE.

phaddw and phaddd perform the horizontal addition of the pairs of adjacent values from both the source and destination operand, and store the sums into the destination (sums from the destination operand go into the lower part of the destination register, sums from the source operand into the upper part). They operate on 16-bit or 32-bit chunks, respectively. phaddsw performs the same operation on signed 16-bit packed values, but the result of each addition is saturated. phsubw and phsubd analogously perform the horizontal subtraction of 16-bit or 32-bit packed values, and phsubsw performs the horizontal subtraction of signed 16-bit packed values with saturation.

pabsb, pabsw and pabsd calculate the absolute value of each signed packed value in the source operand and store them into the destination register. They operate on 8-bit, 16-bit and 32-bit elements respectively.

pmaddubsw multiplies signed 8-bit values from the source operand with the corresponding unsigned 8-bit values from the destination operand to produce intermediate 16-bit values, and every adjacent pair of those intermediate values is then added horizontally and those 16-bit sums are stored into the destination operand.

pmulhrsw multiplies corresponding 16-bit integers from the source and destination operand to produce intermediate 32-bit values, and the 16 bits next to the highest bit of each of those values are then rounded and packed into the destination operand.

pshufb shuffles the bytes in the destination operand according to the mask provided by the source operand - each byte of the source operand holds the index of the destination byte that is copied into the corresponding position of the result, and a mask byte with its highest bit set zeroes that position instead.

psignb, psignw and psignd perform the operation on 8-bit, 16-bit or 32-bit integers in the destination operand, depending on the signs of the values in the source. If the value in the source is negative, the corresponding value in the destination register is negated, if the value in the source is positive, the corresponding value is left unchanged, and if the value in the source is zero, the value in the destination is zeroed, too.

palignr appends the source operand to the destination operand to form the intermediate value of twice the size, and then extracts into the destination register the 64 or 128 bits that are right-aligned to the byte offset specified by the third operand, which should be an 8-bit immediate value. This is the only SSSE3 instruction that takes three arguments.
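
Two of the SSSE3 instructions in their 128-bit forms, shown only as a sketch with arbitrarily chosen registers:

  1.     pabsd xmm0,xmm1      ; absolute values of four packed double words
  2.     palignr xmm2,xmm3,4  ; bytes 4-15 of xmm3 followed by bytes 0-3 of xmm2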

2.1.18 AMD 3DNow! instructions

The 3DNow! extension adds new MMX instructions to those described in 2.1.14, and introduces operations on the 64-bit packed floating point values, each consisting of two single precision floating point values.

These instructions follow the same rules as the general MMX operations: the destination operand should be an MMX register, the source operand can be an MMX register or a 64-bit memory location. pavgusb computes the rounded averages of packed unsigned bytes. pmulhrw performs a signed multiplication of the packed words, rounds the high word of each double word result and stores them in the destination operand.

pi2fd converts packed double word integers into packed floating point values. pf2id converts packed floating point values into packed double word integers using truncation. pi2fw converts packed word integers into packed floating point values, only the low words of each double word in the source operand are used. pf2iw converts packed floating point values to packed word integers, results are extended to double words using sign extension.

pfadd adds packed floating point values. pfsub and pfsubr subtract packed floating point values, the first one subtracts source values from destination values, the second one subtracts destination values from the source values. pfmul multiplies packed floating point values. pfacc adds the low and high floating point values of the destination operand, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination. pfnacc subtracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and subtracts the high floating point value of the source operand from the low, storing the result in the high double word of destination. pfpnacc subtracts the high floating point value of the destination operand from the low, storing the result in the low double word of destination, and adds the low and high floating point values of the source operand, storing the result in the high double word of destination. pfmax and pfmin compute the maximum and minimum of floating point values. pswapd swaps the high and low double words of the source operand and stores the result in the destination operand.

pfrcp returns an estimate of the reciprocal of the floating point value from the source operand, pfrsqrt returns an estimate of the reciprocal square root of the floating point value from the source operand. pfrcpit1 performs the first step in the Newton-Raphson iteration to refine the reciprocal approximation produced by the pfrcp instruction, pfrsqit1 performs the first step in the Newton-Raphson iteration to refine the reciprocal square root approximation produced by the pfrsqrt instruction, pfrcpit2 performs the second, final step in the Newton-Raphson iteration to refine the reciprocal approximation or the reciprocal square root approximation.

pfcmpeq, pfcmpge and pfcmpgt compare the packed floating point values and set all bits or zero all bits of the corresponding data element in the destination operand according to the result of the comparison. The first checks whether the values are equal, the second checks whether the destination value is greater than or equal to the source value, the third checks whether the destination value is greater than the source value.

prefetch and prefetchw load the line of data from memory that contains byte specified with the operand into the data cache, prefetchw instruction should be used when the data in the cache line is expected to be modified, otherwise the prefetch instruction should be used. The operand should be an 8-bit memory location.

femms performs a fast clear of MMX state. This instruction has no operands.
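
The reciprocal instructions are meant to be used in sequence. The following sketch follows the commonly published pattern for a refined reciprocal and is not taken from this manual. It assumes that ebx points to a single precision value:

  1.     movd mm0,[ebx]     ; load the value into low double word of mm0
  2.     pfrcp mm1,mm0      ; initial estimate of the reciprocal
  3.     punpckldq mm0,mm0  ; duplicate the value into both double words
  4.     pfrcpit1 mm0,mm1   ; first Newton-Raphson step
  5.     pfrcpit2 mm0,mm1   ; final step, mm0 holds the refined reciprocal
  6.     femms              ; clear MMX state when done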

2.1.19 The x86-64 long mode instructions

The AMD64 and EM64T architectures (we will use the common name x86-64 for them both) extend the x86 instruction set for the 64-bit processing. While legacy and compatibility modes use the same set of registers and instructions, the new long mode extends the x86 operations to 64 bits and introduces several new registers. You can turn on generating the code for this mode with the use64 directive.

Each of the general purpose registers is extended to 64 bits, and eight entirely new general purpose registers and eight new SSE registers are added. See table 2.4 for the summary of new registers (only the ones that were not listed in table 1.2). The general purpose registers of smaller sizes are the low order portions of the larger ones. You can still access the ah, bh, ch and dh registers in long mode, but you cannot use them in the same instruction with any of the new registers.

Table 2.4 New registers in long mode

Type  General                 SSE    AVX
Bits  8     16    32    64    128    256
                        rax
                        rcx
                        rdx
                        rbx
      spl               rsp
      bpl               rbp
      sil               rsi
      dil               rdi
      r8b   r8w   r8d   r8    xmm8   ymm8
      r9b   r9w   r9d   r9    xmm9   ymm9
      r10b  r10w  r10d  r10   xmm10  ymm10
      r11b  r11w  r11d  r11   xmm11  ymm11
      r12b  r12w  r12d  r12   xmm12  ymm12
      r13b  r13w  r13d  r13   xmm13  ymm13
      r14b  r14w  r14d  r14   xmm14  ymm14
      r15b  r15w  r15d  r15   xmm15  ymm15

In general any instruction from x86 architecture, which allowed 16-bit or 32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit registers should be used for addressing in long mode, the 32-bit addressing is also allowed, but it's not possible to use the addresses based on 16-bit registers. Below are the samples of new operations possible in long mode on the example of mov instruction:

    mov rax,r8   ; transfer 64-bit general register
    mov al,[rbx] ; transfer memory addressed by 64-bit register

Long mode also uses instruction pointer based addresses. You can specify such an address manually with the special RIP register symbol, but this addressing is also generated automatically by flat assembler, since there is no 64-bit absolute addressing in long mode. You can still force the assembler to use 32-bit absolute addressing by putting the dword size override for the address inside the square brackets. There is one exception where 64-bit absolute addressing is possible: the mov instruction with one of the operands being the accumulator register and the other being the memory operand. To force the assembler to use 64-bit absolute addressing there, use the qword size operator for the address inside the square brackets. When no size operator is applied to the address, the assembler generates the optimal form automatically.

    mov [qword 0],rax  ; absolute 64-bit addressing
    mov [dword 0],r15d ; absolute 32-bit addressing
    mov [0],rsi        ; automatic RIP-relative addressing
    mov [rip+3],sil    ; manual RIP-relative addressing

Only signed 32-bit values are possible as immediate operands for 64-bit operations, the only exception being the mov instruction with a 64-bit general purpose register as the destination operand. Trying to force a 64-bit immediate with any other instruction will cause an error.
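
The rule for immediates can be illustrated with a minimal sketch. The constants are arbitrary, and the use64 directive is included only so that the fragment assembles on its own:

    use64
    mov rax,123456789ABCDEF0h ; full 64-bit immediate, possible only with mov
    add rbx,7FFFFFFFh         ; largest signed 32-bit immediate
    add rbx,rax               ; larger constants have to go through a register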

If any operation is performed on the 32-bit general registers in long mode, the upper 32 bits of the 64-bit registers containing them are filled with zeros. This is unlike the operations on 16-bit or 8-bit portions of those registers, which preserve the upper bits.
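
A short sketch of this difference, with the register contents expected after each instruction noted in the comments (the choice of registers and values is arbitrary):

    use64
    mov rax,-1 ; RAX = 0FFFFFFFFFFFFFFFFh
    mov eax,1  ; RAX = 0000000000000001h, upper 32 bits are zeroed
    mov rbx,-1 ; RBX = 0FFFFFFFFFFFFFFFFh
    mov bx,1   ; RBX = 0FFFFFFFFFFFF0001h, upper 48 bits are preserved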

Three new type conversion instructions are available. The cdqe sign extends the double word in EAX into quad word and stores the result in RAX register. cqo sign extends the quad word in RAX into double quad word and stores the extra bits in the RDX register. These instructions have no operands. movsxd sign extends the double word source operand, being either the 32-bit register or memory, into 64-bit destination operand, which has to be register. No analogous instruction is needed for the zero extension, since it is done automatically by any operations on 32-bit registers, as noted in previous paragraph. And the movzx and movsx instructions, conforming to the general rule, can be used with 64-bit destination operand, allowing extension of byte or word values into quad words.
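
The conversions described above might be used as in the following sketch. The registers are chosen arbitrarily, and RBX is assumed to hold a valid address:

    use64
    cdqe                   ; sign extend EAX into RAX
    cqo                    ; sign extend RAX into RDX:RAX
    movsxd rcx,dword [rbx] ; sign extend double word from memory
    movsxd rdx,esi         ; sign extend double word from register
    mov r8d,r9d            ; zero extension, upper half of R8 is cleared
    movzx r10,byte [rbx]   ; zero extend byte into quad word
    movsx r11,ax           ; sign extend word into quad word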

All the binary arithmetic and logical instructions are promoted to allow 64-bit operands in long mode. The use of decimal arithmetic instructions in long mode is prohibited.

The stack operations, like push and pop, default to 64-bit operands in long mode, and it is not possible to use 32-bit operands with them. The pusha and popa instructions are disallowed in long mode.
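
A few forms of the stack operations in long mode, as a sketch with arbitrarily chosen registers (RBX is assumed to hold a valid address):

    use64
    push rax         ; push 64-bit register
    push qword [rbx] ; push 64-bit memory
    push 1           ; immediate value is stored as quad word
    pop r12          ; pop into one of the new registers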

The indirect near jumps and calls in long mode default to 64-bit operands and it's not possible to use the 32-bit operands with them. On the other hand, the indirect far jumps and calls allow any operands that were allowed by the x86 architecture and also 80-bit memory operand is allowed (though only EM64T seems to implement such variant), with the first eight bytes defining the offset and two last bytes specifying the selector. The direct far jumps and calls are not allowed in long mode.

The I/O instructions, in, out, ins and outs are the exceptional instructions that are not extended to accept quad word operands in long mode. But all other string operations are, and there are new short forms movsq, cmpsq, scasq, lodsq and stosq introduced for the variants of string operations for 64-bit string elements. The RSI and RDI registers are used by default to address the string elements.
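
The new short forms can be used just like their counterparts for smaller elements. The sketch below assumes that RSI, RDI and RCX have already been set up by the surrounding code:

    use64
    cld       ; process strings toward higher addresses
    lodsq     ; load quad word from [RSI] into RAX
    stosq     ; store RAX at [RDI]
    rep movsq ; copy RCX quad words from [RSI] to [RDI]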

The lfs, lgs and lss instructions are extended to accept 80-bit source memory operand with 64-bit destination register (though only EM64T seems to implement such variant). The lds and les are disallowed in long mode.

The system instructions like lgdt which required the 48-bit memory operand, in long mode require the 80-bit memory operand.

The cmpxchg16b is the 64-bit equivalent of the cmpxchg8b instruction; it uses a double quad word memory operand and 64-bit registers to perform the analogous operation.
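
By analogy with cmpxchg8b, the registers involved are RDX:RAX for the compared value and RCX:RBX for the new value. In this sketch RSI is an arbitrary choice and is assumed to hold the address of the 16-byte memory operand:

    use64
    cmpxchg16b [rsi] ; compare RDX:RAX with 16 bytes at [RSI],
                     ; if equal store RCX:RBX there,
                     ; otherwise load the memory into RDX:RAX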

swapgs is the new instruction, which swaps the contents of GS register and the KernelGSbase model-specific register (MSR address 0C0000102h).

syscall and sysret are a pair of new instructions that provide functionality similar to sysenter and sysexit in long mode, where the latter pair is disallowed. The sysexitq and sysretq mnemonics provide the 64-bit versions of the sysexit and sysret instructions.

The rdmsrq and wrmsrq mnemonics are the 64-bit variants of the rdmsr and wrmsr instructions.

2.1.20 SSE4 instructions

There are actually three different sets of instructions under the name SSE4. Intel designed two of them, SSE4.1 and SSE4.2, with the latter extending the former into the full Intel SSE4 set. On the other hand, the implementation by AMD includes only a few instructions from this set, but also contains some additional instructions, which are called the SSE4a set.

The SSE4.1 instructions mostly follow the same rules for operands, as the basic SSE operations, so they require destination operand to be SSE register and source operand to be 128-bit memory location or SSE register, and some operations require a third operand, the 8-bit immediate value.

pmulld performs a signed multiplication of the packed double words and stores the low double words of the results in the destination operand. pmuldq performs two signed multiplications of the low double words in each quad word of the operands, and stores the results as packed quad words into the destination register. pminsb and pmaxsb return the minimum or maximum values of packed signed bytes, pminuw and pmaxuw return the minimum or maximum values of packed unsigned words, and pminud, pmaxud, pminsd and pmaxsd return the minimum or maximum values of packed unsigned or signed double words. These instructions complement the instructions computing packed minimum or maximum introduced by SSE.

ptest sets the ZF flag to one when the result of bitwise AND of both operands is zero, and zeroes ZF otherwise. It also sets the CF flag to one when the result of bitwise AND of the source operand with the bitwise NOT of the destination operand is zero, and zeroes CF otherwise. pcmpeqq compares packed quad words for equality, and fills the corresponding elements of destination operand with either ones or zeros, depending on the result of comparison.
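
A common use of ptest is checking whether a whole SSE register is zero, since testing a register against itself yields ZF set only when no bit is set. The following is a minimal sketch; the all_zero label is hypothetical and the use32 directive is there only to make the fragment assemble on its own:

    use32
    ptest xmm0,xmm0 ; ZF is set only when all 128 bits of XMM0 are zero
    jz all_zero     ; taken when XMM0 contains no set bits
    ; code for the case when at least one bit is set goes here
  all_zero: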

packusdw converts packed signed double words from both the source and destination operand into the unsigned words using saturation, and stores the eight resulting word values into the destination register.

pmovsxbw and pmovzxbw perform sign extension or zero extension of the lowest eight byte values from the source operand into packed word values in destination operand. pmovsxbd and pmovzxbd perform sign extension or zero extension of the lowest four byte values from the source operand into packed double word values in destination operand. pmovsxbq and pmovzxbq perform sign extension or zero extension of the lowest two byte values from the source operand into packed quad word values in destination operand. pmovsxwd and pmovzxwd perform sign extension or zero extension of the lowest four word values from the source operand into packed double words in destination operand. pmovsxwq and pmovzxwq perform sign extension or zero extension of the lowest two word values from the source operand into packed quad words in destination operand. pmovsxdq and pmovzxdq perform sign extension or zero extension of the lowest two double word values from the source operand into packed quad words in destination operand.
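
A few of these extensions as a sketch. The registers are arbitrary, ESI is assumed to hold a valid address, and no size operator is given for the memory operands, so the choice of size is left to the assembler:

    use32
    pmovzxbw xmm0,[esi] ; eight bytes from memory into eight words
    pmovsxbd xmm1,xmm2  ; four lowest bytes into four signed double words
    pmovsxdq xmm3,xmm4  ; two lowest double words into two signed quad words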

phminposuw finds the minimum unsigned word value in source operand and places it into the lowest word of destination operand, stores the index of that word within the source in bits 16-18 of destination, and sets the remaining upper bits of destination to zero.

roundps, roundss, roundpd and roundsd perform the rounding of packed or individual floating point value of single or double precision, using the rounding mode specified by the third operand.

    roundsd xmm0,xmm1,0011b ; round toward zero

dpps calculates the dot product of packed single precision floating point values, that is, it multiplies the corresponding pairs of values from source and destination operand and then sums the products up. The high four bits of the 8-bit immediate third operand control which products are calculated and included in the sum, and the low four bits control into which elements of destination the resulting dot product is copied (the other elements are filled with zero). dppd calculates the dot product of packed double precision floating point values. Bits 4 and 5 of the third operand control which products are calculated and added, and bits 0 and 1 of this value control which elements in destination register get filled with the result.

mpsadbw calculates multiple sums of absolute differences of unsigned bytes. The third operand controls, with the value in bits 0-1, which of the four-byte blocks in source operand is taken to calculate the absolute differences, and with the value in bit 2, at which of the first two four-byte blocks in destination operand to start calculating the multiple sums. Each sum is calculated from four absolute differences between the corresponding unsigned bytes in the source and destination block, and each next sum is calculated in the same way, but taking the four bytes from destination at the position one byte after the position of the previous block. The four bytes from the source stay the same each time. This way eight sums of absolute differences are calculated and stored as packed word values into the destination operand.

dpps, dppd and mpsadbw follow the same rules for operands as the roundps instruction.
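
The control byte of dpps and dppd is perhaps easiest to follow on concrete values. The immediates in the sketch below are chosen only for illustration:

    use32
    dpps xmm0,xmm1,0F1h ; sum all four products, result in lowest element
    dpps xmm2,xmm3,7Fh  ; sum three lowest products, result in all elements
    dppd xmm4,xmm5,31h  ; sum both products, result in low element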

blendps, blendvps, blendpd and blendvpd conditionally copy the values from source operand into the destination operand, depending on the bits of the mask provided by third operand. If a mask bit is set, the corresponding element of source is copied into the same place in destination, otherwise this position in destination is left unchanged. The rules for the first two operands are the same as for general SSE instructions. blendps and blendpd need third operand to be 8-bit immediate, and they operate on single or double precision values, respectively. blendvps and blendvpd require third operand to be the XMM0 register.

    blendvps xmm3,xmm7,xmm0 ; blend according to mask

pblendw conditionally copies word elements from the source operand into the destination, depending on the bits of mask provided by third operand, which needs to be 8-bit immediate value. pblendvb conditionally copies byte elements from the source operand into destination, depending on mask defined by the third operand, which has to be XMM0 register. These instructions follow the same rules for operands as blendps and blendvps instructions, respectively.

insertps inserts a single precision floating point value taken from the position in source operand specified by bits 6-7 of third operand into location in destination register selected by bits 4-5 of third operand. Additionally, the low four bits of third operand control, which elements in destination register will be set to zero. The first two operands follow the same rules as for the general SSE operation, the third operand should be 8-bit immediate.
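
As a sketch of how the three bit fields of the immediate combine (the values are chosen only for illustration):

    use32
    insertps xmm0,xmm1,10h  ; element 0 of XMM1 into element 1 of XMM0
    insertps xmm0,xmm1,0C8h ; element 3 of XMM1 into element 0 of XMM0,
                            ; then element 3 of XMM0 is set to zero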

extractps extracts a single precision floating point value taken from the location in source operand specified by low two bits of third operand, and stores it into the destination operand. The destination can be a 32-bit memory value or general purpose register, the source operand must be SSE register, and the third operand should be 8-bit immediate value.

    extractps edx,xmm3,3 ; extract the highest value

pinsrb, pinsrd and pinsrq copy a byte, double word or quad word from the source operand into the location of destination operand determined by the third operand. The destination operand has to be SSE register, the source operand can be a memory location of appropriate size, or the 32-bit general purpose register (but 64-bit general purpose register for pinsrq, which is only available in long mode), and the third operand has to be 8-bit immediate value. These instructions complement the pinsrw instruction operating on SSE register destination, which was introduced by SSE2.

    pinsrd xmm4,eax,1 ; insert double word into second position

pextrb, pextrw, pextrd and pextrq copy a byte, word, double word or quad word from the location in source operand specified by third operand, into the destination. The source operand should be SSE register, the third operand should be 8-bit immediate, and the destination operand can be memory location of appropriate size, or the 32-bit general purpose register (but 64-bit general purpose register for pextrq, which is only available in long mode). The pextrw instruction with SSE register as source was already introduced by SSE2, but SSE4 extends it to allow memory operand as destination.

    pextrw [ebx],xmm3,7 ; extract highest word into memory

movntdqa loads double quad word from the source operand to the destination using a non-temporal hint. The destination operand should be SSE register, and the source operand should be 128-bit memory location.

The SSE4.2, described below, adds not only some new operations on SSE registers, but also introduces some completely new instructions operating on general purpose registers only.

pcmpistri compares two zero-ended (implicit length) strings provided in its source and destination operand and generates an index stored to ECX; pcmpistrm performs the same comparison and generates a mask stored to XMM0. pcmpestri compares two strings of explicit lengths, with length provided in EAX for the destination operand and in EDX for the source operand, and generates an index stored to ECX; pcmpestrm performs the same comparison and generates a mask stored to XMM0. The source and destination operand follow the same rules as for general SSE instructions, the third operand should be 8-bit immediate value determining the details of performed operation - refer to Intel documentation for information on those details.

pcmpgtq compares packed quad words, and fills the corresponding elements of destination operand with either ones or zeros, depending on whether the value in destination is greater than the one in source, or not. This instruction follows the same rules for operands as pcmpeqq.

crc32 accumulates a CRC32 value for the source operand starting with initial value provided by destination operand, and stores the result in destination. Unless in long mode, the destination operand should be a 32-bit general purpose register, and the source operand can be a byte, word, or double word register or memory location. In long mode the destination operand can also be a 64-bit general purpose register, and the source operand in such case can be a byte or quad word register or memory location.

    crc32 eax,dl          ; accumulate CRC32 on byte value
    crc32 eax,word [ebx]  ; accumulate CRC32 on word value
    crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value

popcnt calculates the number of bits set in the source operand, which can be 16-bit, 32-bit, or 64-bit general purpose register or memory location, and stores this count in the destination operand, which has to be register of the same size as source operand. The 64-bit variant is available only in long mode.

    popcnt ecx,eax ; count bits set to 1

The SSE4a extension, which also includes the popcnt instruction introduced by SSE4.2, at the same time adds the lzcnt instruction, which follows the same syntax, and calculates the count of leading zero bits in source operand (if the source operand is all zero bits, the total number of bits in source operand is stored in destination).
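
Since lzcnt follows the same syntax as popcnt, its use might look like this (the registers are arbitrary and EBX is assumed to hold a valid address):

    use32
    lzcnt eax,ecx      ; EAX = 31 when ECX = 1, EAX = 32 when ECX = 0
    lzcnt dx,word [ebx] ; count leading zero bits of 16-bit memory value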

extrq extracts the sequence of bits from the low quad word of SSE register provided as first operand and stores them at the low end of this register, filling the remaining bits in the low quad word with zeros. The position of bit string and its length can either be provided with two 8-bit immediate values as second and third operand, or by SSE register as second operand (and there is no third operand in such case), which should contain position value in bits 8-13 and length of bit string in bits 0-5.

    extrq xmm0,8,7  ; extract 8 bits from position 7
    extrq xmm0,xmm5 ; extract bits defined by register

insertq writes the sequence of bits from the low quad word of the source operand into specified position in low quad word of the destination operand, leaving the other bits in low quad word of destination intact. The position where bits should be written and the length of bit string can either be provided with two 8-bit immediate values as third and fourth operand, or by the bit fields in source operand (and there are only two operands in such case), which should contain position value in bits 72-77 and length of bit string in bits 64-69.

    insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
    insertq xmm1,xmm0     ; insert bits defined by register

movntss and movntsd store single or double precision floating point value from the source SSE register into 32-bit or 64-bit destination memory location respectively, using non-temporal hint.

2.1.21 AVX instructions

This section has not been written yet.

2.1.22 Other extensions of instruction set

There are a number of additional instruction set extensions recognized by flat assembler, and the general syntax of the instructions introduced by those extensions is provided here. For detailed information on the operations performed by them, check out the manuals from Intel (for the VMX and SMX extensions) or AMD (for the SVM extension).

The Virtual-Machine Extensions (VMX) provide a set of instructions for the management of virtual machines. The vmxon instruction, which enters the VMX operation, requires a single 64-bit memory operand, which should be a physical address of memory region, which the logical processor may use to support VMX operation. The vmxoff instruction, which leaves the VMX operation, has no operands. The vmlaunch and vmresume, which launch or resume the virtual machines, and vmcall, which allows guest software to call the VM monitor, use no operands either.

The vmptrld loads the physical address of current Virtual Machine Control Structure (VMCS) from its memory operand, vmptrst stores the pointer to current VMCS into address specified by its memory operand, and vmclear sets the launch state of the VMCS referenced by its memory operand to clear. These three instructions all require a single 64-bit memory operand.

The vmread reads from VMCS a field specified by the source operand and stores it into the destination operand. The source operand should be a general purpose register, and the destination operand can be a register or memory. The vmwrite writes into a VMCS field specified by the destination operand the value provided by source operand. The source operand can be a general purpose register or memory, and the destination operand must be a register. The size of operands for those instructions should be 64-bit when in long mode, and 32-bit otherwise.
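
Following the operand rules given above, the long mode forms of vmread and vmwrite might be written as in this sketch. The registers are arbitrary, RCX is assumed to contain the encoding of a VMCS field and RBX a valid address:

    use64
    vmread rax,rcx    ; read the VMCS field selected by RCX into RAX
    vmread [rbx],rcx  ; read the same field into memory
    vmwrite rcx,rax   ; write RAX into the VMCS field selected by RCX
    vmwrite rcx,[rbx] ; write the value from memory into that field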

The invept and invvpid invalidate the translation lookaside buffers (TLBs) and paging-structure caches, either derived from extended page tables (EPT), or based on the virtual processor identifier (VPID). These instructions require two operands, the first one being the general purpose register specifying the type of invalidation, and the second one being a 128-bit memory operand providing the invalidation descriptor. The first operand should be a 64-bit register when in long mode, and 32-bit register otherwise.

The Safer Mode Extensions (SMX) provide the functionalities available through the getsec instruction. This instruction takes no operands, and the function that is executed is determined by the contents of EAX register upon executing this instruction.

The Secure Virtual Machine (SVM) is a variant of virtual machine extension used by AMD. The skinit instruction securely reinitializes the processor allowing the startup of trusted software, such as the virtual machine monitor (VMM). This instruction takes a single operand, which must be EAX, and provides a physical address of the secure loader block (SLB).

The vmrun instruction is used to start a guest virtual machine. Its only operand should be an accumulator register (AX, EAX or RAX, the last one available only in long mode) providing the physical address of the virtual machine control block (VMCB). The vmsave instruction stores a subset of processor state into the VMCB specified by its operand, and vmload loads the same subset of processor state from a specified VMCB. The same operand rules as for vmrun apply to those two instructions.

vmmcall allows the guest software to call the VMM. This instruction takes no operands.

stgi sets the global interrupt flag to 1, and clgi zeroes it. These instructions take no operands.

invlpga invalidates the TLB mapping for a virtual page specified by the first operand (which has to be accumulator register) and address space identifier specified by the second operand (which must be ECX register).