Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


assembly - Why can't we move a 64-bit immediate value to memory?

First, I am a little confused about the difference between movq and movabsq. My textbook says:

The regular movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers. This value is then sign extended to produce the 64-bit value for the destination. The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination.

I have two questions about this.

Question 1

The movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers.

so it means that we can't do

movq    $0x123456789abcdef, %rbp

and we have to do:

movabsq $0x123456789abcdef, %rbp

But why was movq designed not to work with 64-bit immediate values? That seems to go against the purpose of the q (quad word) suffix, and needing a separate movabsq just for this seems like a hassle.

Question 2

Since the destination of movabsq has to be a register, not memory, we can't move a 64-bit immediate value to memory like this:

movabsq $0x123456789abcdef, (%rax)

but there is a workaround:

movabsq $0x123456789abcdef, %rbx
movq    %rbx, (%rax)   # the source operand is a register, not an immediate constant, and the destination of movq can be memory

So why is the rule designed in a way that makes things harder?



1 Answer


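Yes: for an immediate that doesn't fit in a sign-extended 32-bit value, mov it to a register and then store that register to memory. (Values like -1, aka 0xFFFFFFFFFFFFFFFF, do fit as a sign-extended imm32, so they can go straight to memory.) The "why" part is an interesting question, though.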


Remember that asm only lets you do what's possible in machine code. Thus it's really a question about ISA design. Such decisions often involve what's easy for the hardware to decode, as well as encoding efficiency considerations. (Using up opcodes on rarely-used instructions would be bad.)

It's not designed to make things harder, it's designed to not need any new opcodes for mov. And also to limit 64-bit immediates to one special instruction format. mov is the only instruction that can ever use a 64-bit immediate at all (or a 64-bit absolute address, for load/store of AL/AX/EAX/RAX).

Check out Intel's manual for the forms of mov (note that it uses Intel syntax, destination first, and so will my answer.) I also summarized the forms (and their instruction lengths) in Difference between movq and movabsq in x86-64, as did @MargaretBloom in answer to What's the difference between the x86-64 AT&T instructions movq and movabsq?.

Allowing an imm64 along with a ModR/M addressing mode would also make it possible to run into the 15-byte upper limit on instruction length pretty easily, e.g. REX + opcode + imm64 is 10 bytes, and ModRM+SIB+disp32 is 6, for a total of 16. So mov [rdi + rax*8 + 1234], imm64 would not be encodable even if there were an opcode for mov r/m64, imm64.
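To make the byte counting concrete, here is a small Python sketch that just adds up the field sizes. The helper name insn_length and its parameters are made up for illustration; the only facts it relies on are the field sizes quoted above and the 15-byte limit.

```python
# Architectural limit on the length of a single x86 instruction.
X86_MAX_INSN_BYTES = 15

def insn_length(rex=0, opcode=1, modrm=0, sib=0, disp=0, imm=0):
    """Sum the sizes (in bytes) of the parts of one x86 instruction.

    Hypothetical helper: it only does arithmetic, it does not encode anything.
    """
    return rex + opcode + modrm + sib + disp + imm

# Hypothetical mov [rdi + rax*8 + 1234], imm64 (no such encoding exists):
# REX.W + opcode + ModRM + SIB + disp32 + imm64.
hypothetical = insn_length(rex=1, opcode=1, modrm=1, sib=1, disp=4, imm=8)

# The real movabs rax, imm64: REX.W + opcode + imm64, with no ModRM byte.
movabs = insn_length(rex=1, opcode=1, imm=8)

print(hypothetical, hypothetical <= X86_MAX_INSN_BYTES)  # 16 False
print(movabs, movabs <= X86_MAX_INSN_BYTES)              # 10 True
```

A smaller displacement (disp8, or none) would fit under the limit, but then whether an instruction is encodable would depend on the addressing mode, which is not how any other x86 instruction behaves.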

And that's assuming they repurposed one of the 1-byte opcodes that were freed up by making some instructions invalid in 64-bit mode (e.g. aaa), which might be inconvenient for the decoders (and instruction-length pre-decoders) because in other modes those opcodes don't take a ModRM byte or an immediate.


movq is for the forms of mov with a normal ModRM byte to allow an arbitrary addressing mode as the destination (or as the source, for movq r64, r/m64). AMD chose to keep the immediate for these as 32-bit, same as with 32-bit operand size (see Footnote 1).

These forms of mov are the same instruction format as other instructions like add. For ease of decoding, this means a REX prefix doesn't change the instruction-length for these opcodes. Instruction-length decoding is already hard enough when the addressing mode is variable-length.

So movq is mov with 64-bit operand-size but otherwise the same instruction formats: mov r/m64, imm32 (where the imm32 becomes a sign-extended immediate, same as in every other instruction which only has one immediate form), and mov r/m64, r64 or mov r64, r/m64.
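As a rough illustration of what "representable as a sign-extended imm32" means, here is a Python sketch. The function names are mine, not from any assembler; it models 64-bit values as unsigned bit patterns.

```python
MASK64 = (1 << 64) - 1

def sign_extend_32(imm32):
    """Sign-extend a 32-bit pattern to a 64-bit pattern (as unsigned ints)."""
    imm32 &= 0xFFFFFFFF
    if imm32 & 0x80000000:          # top bit set: fill the upper 32 bits with ones
        return imm32 | 0xFFFFFFFF00000000
    return imm32                    # top bit clear: upper 32 bits stay zero

def fits_simm32(value):
    """True if the 64-bit value can be produced by sign-extending some imm32."""
    value &= MASK64
    return sign_extend_32(value & 0xFFFFFFFF) == value

print(fits_simm32(-1))                  # True:  imm32 = 0xFFFFFFFF
print(fits_simm32(0x7FFFFFFF))          # True:  largest positive imm32
print(fits_simm32(0x80000000))          # False: would sign-extend to 0xFFFFFFFF80000000
print(fits_simm32(0x123456789abcdef))   # False: needs movabs to a register
```

The 0x80000000 case is the one that tends to surprise people: it fits in 32 bits, but not as a sign-extended 32-bit immediate with 64-bit operand-size.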

movabs is the 64-bit form of the existing no-ModRM short form mov reg, imm32. This one is already a special case (because of the no-modrm encoding, with register number from the low 3 bits of the opcode byte). Small positive constants can just use 32-bit operand-size for implicit zero-extension to 64-bit with no loss of efficiency (like 5-byte mov eax, 123 / AT&T mov $123, %eax in 32 or 64-bit mode). And having a 64-bit absolute mov is useful so it makes sense AMD did that.
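Putting the forms together, picking the shortest way to get a constant into a 64-bit register could be sketched like this in Python. This is only an illustration of the selection logic, not any particular assembler's behaviour, and the byte counts assume one of the low registers (RAX..RDI); R8..R15 need a REX prefix even for the 32-bit form.

```python
MASK64 = (1 << 64) - 1

def shortest_mov_to_reg64(value):
    """Return (form, length_in_bytes) for loading a constant into a 64-bit register.

    Hypothetical helper; lengths assume a register that needs no REX.B bit.
    """
    value &= MASK64
    if value <= 0xFFFFFFFF:
        # Writing a 32-bit register implicitly zero-extends into the full 64 bits.
        return ("mov r32, imm32", 5)            # opcode + imm32
    if value >= 0xFFFFFFFF80000000:
        # Upper 33 bits all ones: representable as a sign-extended imm32.
        return ("mov r/m64, imm32", 7)          # REX.W + opcode + ModRM + imm32
    return ("movabs r64, imm64", 10)            # REX.W + opcode + imm64

print(shortest_mov_to_reg64(123))                # ('mov r32, imm32', 5)
print(shortest_mov_to_reg64(-123))               # ('mov r/m64, imm32', 7)
print(shortest_mov_to_reg64(0x123456789abcdef))  # ('movabs r64, imm64', 10)
```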

Since there's no ModRM byte, it can only encode a register destination. It would take a whole different opcode to add a form that could take a memory operand.


From one POV, be grateful you get a mov with 64-bit immediates at all; RISC ISAs like AArch64 (with fixed-width 32-bit instructions) need more like 4 instructions just to get a 64-bit value into a register. (Unless it's a repeating bit-pattern; AArch64 is actually pretty cool, unlike earlier RISCs like MIPS64 or PowerPC64.)
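A back-of-the-envelope way to see where "more like 4 instructions" comes from: AArch64's movz / movk each supply one 16-bit chunk. The Python sketch below counts chunks under that naive strategy only. It deliberately ignores movn and the bitmask-immediate encodings (the repeating-bit-pattern case), so it is an upper bound rather than what a compiler would actually emit, and the function name is made up.

```python
def movz_movk_count(value):
    """Instructions needed to build a 64-bit constant with movz + movk only.

    Naive model: one movz for the first chunk, one movk per further
    non-zero 16-bit chunk. Ignores movn and bitmask immediates.
    """
    value &= (1 << 64) - 1
    chunks = [(value >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]
    nonzero = sum(1 for chunk in chunks if chunk != 0)
    return max(1, nonzero)          # zero itself still takes one instruction

print(movz_movk_count(123))                 # 1
print(movz_movk_count(0x123456789abcdef))   # 4
```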

If AMD64 was going to introduce a new opcode for mov, mov r/m, sign_extended_imm8 would be vastly more useful to save code-size. It's not at all rare for compilers to emit multiple mov qword ptr [rsp+8], 0 instructions to zero a local array or struct, each one containing a 4-byte 0 immediate. Putting a small non-zero number in a register is also fairly common, and such a form would make mov eax, 123 a 3-byte instruction (down from 5), and mov rax, -123 a 4-byte instruction (down from 7). It would also make zeroing a register without clobbering FLAGS 3 bytes.
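Here is the code-size arithmetic for that hypothetical mov r/m, sign_extended_imm8 as a Python sketch. The imm8 form does not exist; the "with imm8" numbers simply swap the 4-byte immediate for a 1-byte one (and use a ModRM form instead of the short no-ModRM form for the register cases). The 9-byte figure for the store to [rsp+8] is my own count (REX.W + opcode + ModRM + SIB + disp8 + imm32), not a number from the text above.

```python
# (description, bytes today, bytes with a hypothetical mov r/m, imm8 form)
CASES = [
    # today: opcode + imm32               hypothetical: opcode + ModRM + imm8
    ("mov eax, 123",              1 + 4,                 1 + 1 + 1),
    # today: REX.W + opcode + ModRM + imm32   hypothetical: same, but imm8
    ("mov rax, -123",             1 + 1 + 1 + 4,         1 + 1 + 1 + 1),
    # today: REX.W + opcode + ModRM + SIB + disp8 + imm32
    ("mov qword ptr [rsp+8], 0",  1 + 1 + 1 + 1 + 1 + 4, 1 + 1 + 1 + 1 + 1 + 1),
]

def bytes_saved(cases):
    """Map each instruction to how many bytes the hypothetical form would save."""
    return {name: today - with_imm8 for name, today, with_imm8 in cases}

for name, today, with_imm8 in CASES:
    print(f"{name}: {today} -> {with_imm8} bytes")

savings = bytes_saved(CASES)
```

Stores to memory save 3 bytes each under this model, which adds up when a compiler emits a run of them to zero a struct.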

Allowing mov imm64 to memory would be useful rarely enough that AMD decided it wasn't worth making the decoders more complex. In this case I agree with them, but AMD was very conservative with adding new opcodes. There were so many missed opportunities to clean up x86 warts; widening setcc would have been nice, for example. But I think AMD wasn't sure AMD64 would catch on, and didn't want to be stuck needing a lot of extra transistors / power to support a feature if people didn't use it.

Footnote 1:
32-bit immediates in general are pretty obviously a good decision for code-size. It's very rare to want to add an immediate to something that's outside the ±2GiB range. It could be useful for bitwise stuff like AND, but for setting/clearing/flipping a single bit the bts / btr / btc instructions are good (taking a bit-position as an 8-bit immediate, instead of needing a mask). You don't want sub rsp, 1024 to be an 11-byte instruction; 7 is already bad enough.


Giant instructions? Not very efficient

At the time AMD64 was designed (early 2000s), CPUs with uop caches weren't a thing. (Intel P4 with a trace cache did exist, but in hindsight it was regarded as a mistake.) Instruction fetch/decode happens in chunks of up-to-16 bytes, so having one instruction that's nearly 16 bytes isn't much better for the front-end than movabs $imm64, %reg.

Of course if the back-end isn't keeping up with the front-end, that bubble of only 1 instruction decoded this cycle can be hidden by buffering between stages.

Keeping track of that much data for one instruction would also be a problem. The CPU has to put that data somewhere, and if there's a 64-bit immediate and a 32-bit displacement in the addressing mode, that's a lot of bits. Normally an instruction needs at most 64 bits of space for an imm32 + a disp32.


BTW, there are special no-modrm opcodes for most operations with RAX and an immediate. (x86-64 evolved out of 8086, where AX/AL was more special.) It would have been a plausible design for those add/sub/cmp/and/or/xor/... rax, sign_extended_imm32 forms with no ModRM to instead use a full imm64. The most common case for RAX, immediate uses an 8-bit sign-extended immediate (-128..127), not this form anyway, and it only saves 1 byte for instructions that need a 4-byte immediate. If you do need an 8-byte constant, though, putting it in a register or memory for reuse would be better than doing a 10-byte and-imm64 in a loop.

