X264 asm intro
Open common/x86/predict-a.asm, go to predict_4x4_dc_mmxext, git link
This function does the following:
  A B C D
E X X X X
F X X X X
G X X X X
H X X X X
It calculates (A+B+C+D+E+F+G+H+4)/8 and sets all 16 Xs equal to that value. The letters are 8-bit pixels in a 2D array with a stride of FDEC_STRIDE.
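In C, the whole function does something like this (a sketch of my own; `src` and `FDEC_STRIDE` match the asm, the function name is mine):

```c
#include <stdint.h>

#define FDEC_STRIDE 32  /* constant stride of x264's fdec buffer */

/* Reference version of predict_4x4_dc: average the 8 neighbours,
 * round, and fill the 4x4 block with the result. */
static void predict_4x4_dc_c( uint8_t *src )
{
    int sum = 4;  /* +4 for rounding before the /8 */
    for( int i = 0; i < 4; i++ )
        sum += src[i - FDEC_STRIDE]      /* A..D: the row above */
             + src[i * FDEC_STRIDE - 1]; /* E..H: the column to the left */
    uint8_t dc = sum >> 3;
    for( int y = 0; y < 4; y++ )
        for( int x = 0; x < 4; x++ )
            src[y * FDEC_STRIDE + x] = dc;
}
```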
What is FDEC_STRIDE?
- <A> x264 does all its pixel operations on the current macroblock in a temporary buffer of constant stride. It's faster that way, and better on cache. So for example, motion compensation (or intra prediction) will write to this buffer.
What's a stride?
- <A> Stride is the distance in memory between (x,y) and (x,y+1), i.e. how far you move to get from one row to the next.
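So with a stride, a 2D pixel array is just a flat buffer, and pixel (x,y) lives at `pix[y*stride + x]`. A tiny sketch (function name is mine):

```c
#include <stdint.h>

/* With a stride, pixel (x, y) of a 2D image stored in a flat
 * buffer lives at pix[y * stride + x]. */
static uint8_t get_pixel( const uint8_t *pix, int stride, int x, int y )
{
    return pix[y * stride + x];
}
```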
Now that you understand what the function does, let's look at the asm.
cglobal predict_4x4_dc_mmxext, 1,4
cglobal: declares a function accessible from outside of asm
The function's name is x264_predict_4x4_dc_mmxext (the x264_ is auto-added).
The "1" means "we have one argument. Put it in r0.", that argument is uint8_t *src
If we had a second argument, we'd say 2 and the second one would go in r1 and if we had a third, it'd go in r2, etc. So at the start of the function, r0 contains uint8_t *src.
"that argument is uint8_t *src", what does this mean?
- <A> See the comment above: void predict_4x4_dc( uint8_t *src )
what tells the function that it's uint8_t?
- <A> Nothing. It doesn't need to know. Types are a C-ism.
The "4" means we want x264 to give us 4 registers to use. r0, r1, r2, r3. This, of course, includes the r0 used for the parameter. So in short, after the first line:
r0 = src
r1/r2/r3 = free
r4 and up: can't use.
That's x86inc.asm's doing right?
- <A> Yes, but we aren't going into that.
I assume it means we can use them, but if you do, it'll screw around with something you don't want to?
- <A> Yes, which is why you can't use it.
So now, this function as you can see has 4 real steps:
- Sum up A through D
- Sum up E through H
- Do the math to get our final value
- Store it into the 16 output Xs
So let's see how this asm implements these.
First, we'll look at step 1
pxor mm7, mm7
mm7 is a 64-bit register.
xor, as you might know, is a nice way to zero things.
How do you tell how large a register is?
- <A> mm* is 64-bit, xmm* is 128-bit. The mm registers have a fixed size; only the general purpose registers are word-size-dependent on x86.
So, now mm7 is zero.
movd mm0, [r0-FDEC_STRIDE]
This sets mm0 equal to {A,B,C,D,0,0,0,0}
Oh, and how do we know the mm* registers are free?
- <A> They always are.
In x86, b = byte, w = word (16-bit), d = doubleword (32-bit), q = quadword (64-bit), dq = double quadword (128-bit)
So movd = move doubleword = move 32 bits
So movd to mm0 will load data to the first 4 bytes and zero the rest. Thus mm0 is now ABCD0000
[r0-FDEC_STRIDE] is equivalent to *(src-FDEC_STRIDE) in Cstyle. Hence why it points to ABCD.
Are the "A B C D" on top of the "X X X X" or do they start on top of the "E"?
- <A> Former
psadbw mm0, mm7
This is what psadbw does:
uint16_t psadbw( uint8_t in[8], uint8_t out[8] )
{
    uint16_t sum = 0;
    for( int i = 0; i < 8; i++ )
        sum += abs( in[i] - out[i] );
    return sum;
}
Parse that for a moment
where is the sum stored?
- <A> In psadbw X, Y, X is where the output is stored, so X is overwritten. The sum ends up in the low 16 bits of X.
Now, of course, mm7 is zero! So we get abs(A-0) + abs(B-0) + abs(C-0) + abs(D-0) + abs(0-0) ... Or A+B+C+D.
So after psadbw, mm0 is A+B+C+D and mm7 is still zero.
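Plugging a zero operand into the C model above, the absolute differences reduce to a plain byte sum. A quick sketch (the function name is mine, mirroring the model above):

```c
#include <stdint.h>
#include <stdlib.h>

/* C model of psadbw: sum of absolute differences of 8 byte pairs. */
static uint16_t psadbw_model( const uint8_t in[8], const uint8_t out[8] )
{
    uint16_t sum = 0;
    for( int i = 0; i < 8; i++ )
        sum += abs( in[i] - out[i] );
    return sum;
}
```

Against an all-zero second operand, this is exactly A+B+C+D when the input is {A,B,C,D,0,0,0,0}.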
movd r3d, mm0
Now, we move the result to r3d, a general purpose register
is r3d one of the things that come with the 4 registers that are free?
- <A> This is an optimization: on 64-bit, using 32-bit versions of registers results in smaller instruction opcode sizes. So it's really just r3. r0, r1, r2, r3 are the 4 that are free. So we're using r3. Note: the suffix 'd' means the 32-bit version, as opposed to the native-size version.
Let's get moving with part 2 of the algorithm.
Now r0 has our source pointer, and r3 has A+B+C+D. While the CPU is busy doing that, we'll go and do part 2, the E+F+G+H.
Unfortunately, these bytes aren't in a straight line ("straight line" meaning "adjacent in memory"). So we can't just load EFGH and sad them. We'll have to do it the naive/slow way, loading E, F, G, H one at a time. You might notice some preprocessor commands here: %assign, %rep, etc.
So, first step: load E into r1d
movzx r1d, byte [r0-1]
movzx means "move, with zero extend". In C this would be: int r1d = r0[-1];
My C is a bit rusty, what does that do? does it just take the location in memory before r0[0]?
- <A> Yes, [] is just a dereference of a pointer. *(r0-1) = r0[-1] = (r0-1)[0]
what is r0-1 in that ascii matrix?
- <A> E.
So, here's what these 7 lines look like after the macro runs
movzx r1d, byte [r0-1]
movzx r2d, byte [r0+FDEC_STRIDE*1-1]
add r1d, r2d
movzx r2d, byte [r0+FDEC_STRIDE*2-1]
add r1d, r2d
movzx r2d, byte [r0+FDEC_STRIDE*3-1]
add r1d, r2d
in order: load E, load F, add F to E, load G, add G to E, load H, add H to E
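In C, that sequence boils down to summing the column of pixels one to the left of the block (a sketch; the function name is mine):

```c
#include <stdint.h>

#define FDEC_STRIDE 32  /* constant stride of x264's fdec buffer */

/* What the expanded movzx/add sequence computes: E + F + G + H,
 * the four pixels one byte to the left of each row of the block. */
static int sum_left_column( const uint8_t *src )
{
    int sum = src[-1];                   /* E */
    for( int n = 1; n < 4; n++ )
        sum += src[FDEC_STRIDE * n - 1]; /* F, G, H */
    return sum;
}
```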
Where is n stored?
- <A> It isn't. It's a preprocessor variable.
Oh, so it's like a macro?
- <A> It is a macro. Note the pre-processed code above. Everything starting with % in yasm syntax is a macro
Ok, now we have to do step 3: calculating (A+B+C+D+E+F+G+H+4) / 8
lea r1d, [r1+r3+4]
First, let's go over x86 addressing. What you can put inside the brackets is not infinite. Here are the capabilities, specifically: [REG1 + REG2 * {1,2,4,8} + CONST]
A register, plus another register * 1/2/4/8, plus a constant (positive or negative). As you might note, this is pretty useful for accessing things like arrays. E.g. array[n+5], where array is an int array, would be: [array + n*4 + 20]
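The array[n+5] example, spelled out in C (function name is mine; this assumes a 4-byte int, as the example does):

```c
/* &array[n+5] is base + n*4 + 20 when sizeof(int) == 4 --
 * exactly the [REG1 + REG2*4 + CONST] addressing form. */
static int *addr_of_elem( int *array, long n )
{
    return (int *)( (char *)array + n * 4 + 20 );
}
```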
I suppose the [r0+FDEC_STRIDE*n-1] bit gets simplified on assembly to [register + const]?
- <A> Yes, yasm sums up constants for you.
So, as you might note, that's a pretty powerful addressing system. That's more powerful than, say... "add". So why not expose it in an instruction to let us use it for math? So Intel did.
lea X, [expr] sets X equal to the value of expr just as fast as add. So that lea does r1d = r1 + r3 + 4
Wait, how does that work?
- <A> lea runs the [REG1 + REG2 * {1,2,4,8} + CONST] math on its second argument and stores the result in the first. lea doesn't actually access memory; it just calculates the address and stores that, instead of going to memory.
And it's faster than add?
- <A> It's just as fast except that you can do more with it.
Now, technically, you can do more adds per cycle than lea, so you shouldn't go replacing all your adds with lea. But if you can use it to do more than one thing at a time, it's a big win. So this lets us add r3, and add 4, in one op.
shr r1d, 3
There's one that you can probably figure out yourself - shift right.
Why are we shifting right?
- <A> +4 for correct rounding, >> 3 to divide (>>3 = /(2^3) = /8).
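The "+4 then >>3" pattern is the standard round-to-nearest for dividing by 8. A quick sketch (function name is mine):

```c
/* (x + 4) >> 3 divides by 8 rounding to nearest (halves round up),
 * instead of truncating toward zero like x / 8 would. */
static unsigned div8_round( unsigned x )
{
    return ( x + 4 ) >> 3;
}
```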
Now for the final part: storing the results.
imul r1d, 0x01010101
This is called a "splat" and you may have seen it in C as well. We're turning an 8-bit value into 4x that value, e.g. A -> AAAA
how does this work?
- <A> A * 0x01010101 = A A A A
So now we have a 32-bit register, r1d, with one copy of the DC value in each of its four bytes. Now we go ahead and store this 4 times and we're done.
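The splat, sketched in C (function name is mine):

```c
#include <stdint.h>

/* Multiplying an 8-bit value by 0x01010101 copies it into all four
 * bytes of a 32-bit word: v * (1 + 2^8 + 2^16 + 2^24). */
static uint32_t splat4( uint8_t v )
{
    return (uint32_t)v * 0x01010101u;
}
```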
Finally, we RET: x264's x86inc.asm framework will automatically clean up after us.