X264 asm intro
One
Open common/x86/predict-a.asm, go to predict_4x4_dc_mmxext, git link
This function does the following:
A B C D E X X X X F X X X X G X X X X H X X X X
It calculates (A+B+C+D+E+F+G+H+4)/8, and sets all the Xs equal to that value where those are 8-bit pixels in a 2D array with a stride of FDEC_STRIDE.
What is FDEC_STRIDE?
- <A> x264 does all its pixel operations on the current macroblock in a temporary buffer of constant stride. It's faster that way, and better on cache. So for example, motion compensation will write to this buffer (or intra prediction).
What's a stride?
- <A> Stride is the distance between (x,y) and (x,y+1), so to get from one row to the next.
Now that you understand what the function does, let's look at the asm.
cglobal predict_4x4_dc_mmxext, 1,4
cglobal: declares a function accessible from outside of asm
The function's name is x264_predict_4x4_dc_mmxext (the x264_ is auto-added).
The "1" means "we have one argument. Put it in r0.", that argument is uint8_t *src
If we had a second argument, we'd say 2 and the second one would go in r1 and if we had a third, it'd go in r2, etc. So at the start of the function, r0 contains uint8_t *src.
"that argument is uint8_t *src", what does this mean?
- <A> See the comment above: void predict_4x4_dc( uint8_t *src )
what tells the function that it's uint8_t?
- <A> Nothing. It doesn't need to know. Types are a C-ism.
The "4" means we want x264 to give us 4 registers to use. r0, r1, r2, r3. This, of course, includes the r0 used for the parameter. So in short, after the first line:
r0 = src
r1/r2/r3 = free
r4 and up: can't use.
That's x86inc.asm's doing right?
- <A> Yes, but we aren't going into that.
I assume it means we can use them, but if you do, it'll screw around with something you don't want to?
- <A> Yes, which is why you can't use it.
So now, this function as you can see has 4 real steps:
- Sum up A through D
- Sum up E through H
- Do the math to get our final value
- Store it into the 16 output Xs
So let's see how this asm implements these.
First, we'll look at step 1
pxor mm7, mm7
mm7 is a 64-bit register.
xor, as you might know, is a nice way to zero things.
How do you tell how large a register is?
- <A> mm* is 64-bit, xmm* is 128-bit. The mm registers have a fixed size, only the general purpose registers are wordsize-dependant on x86
So, now mm7 is zero.
movd mm0, [r0-FDEC_STRIDE]
This sets mm0 equal to {A,B,C,D,0,0,0,0}
Oh, and how do we know the mm* registers are free?
- <A> They always are.
In x86, b = byte, w = word (16-bit), d = doubleword (32-bit), q = quadword (64-bit), dq = double quadword (128-bit)
So movd = move doubleword = move 32 bits
So movd to mm0 will load data to the first 4 bytes and zero the rest. Thus mm0 is now ABCD0000
[r0-FDEC_STRIDE] is equivalent to *(src-FDEC_STRIDE) in Cstyle. Hence why it points to ABCD.
Are the "A B C D" on top of the "X X X X" or do they start on top of the "E"?
- <A> Former
psadbw mm0, mm7
This is what psadbw does:
uint16_t psadbw( uint8_t in[8], uint8_t out[8] ) { uint16_t sum = 0; for(int i = 0; i < 8; i++) sum += abs(in[i]-out[i]); return sum; }
Parse that for a moment
where is the sum stored?
- <A> psadbw X, Y, X is where the output is stored. So X is overwritten. So it's stored in the low 16 bits of X.
Now, of course, mm7 is zero! So we get abs(A-0) + abs(B-0) + abs(C-0) + abs(D-0) + abs(0-0) ... Or A+B+C+D.
So after psadbw, mm0 is A+B+C+D and mm7 is still zero.
movd r3d, mm0
Now, we move the result to r3d, a general purpose register
is r3d one of the things that come with the 4 registers that are free?
- <A> This is an optimization: on 64-bit, using 32-bit versions of registers results in smaller instruction opcode sizes. So it's really just r3. r0, r1, r2, r3 are the 4 that are free. So we're using r3. Note: the suffix 'd' means the 32-bit version, as opposed to the native-size version.
Get moving with part 2 of the algorithm.
Now r0 has our source pointer, and r3 has A+B+C+D. While the CPU is busy doing that, we'll go and do part 2, the E+F+G+H.
Unfortunately, these bytes aren't in a straight line ("straight line" meaning "adjacent in memory"). So we can't just load EFGH and sad them. We'll have to do it the naive/slow way. So, now we're going to load E, F, G, H. Now you might notice some preprocessor commands here. %assign, %rep, etc are preprocessor commands.
So, first step: load E into r1d
movzx r1d, byte [r0-1]
movzx means "move, with zero extend". In C this would be: int r1d = r0[-1];
My C is a bit rusty, what does that do? does it just take the location in memory before r0[0]?
- <A> Yes, [] is just a dereference of a pointer. *(r0-1) = r0[-1] = (r0-1)[0]
what is r0-1 in that ascii matrix?
- <A> E.
So, here's what these 7 lines look like after the macro runs
movzx r1d, byte [r0-1] movzx r2d, byte [r0+FDEC_STRIDE*1-1] add r1d, r2d movzx r2d, byte [r0+FDEC_STRIDE*2-1] add r1d, r2d movzx r2d, byte [r0+FDEC_STRIDE*3-1] add r1d, r2d
in order: load E, load F, add F to E, load G, add G to E, load H, add H to E
Where is n stored?
- <A> It isn't. It's a preprocessor variable.
Oh, so it's like a macro?
- <A> It is a macro. Note the pre-processed code above. Everything starting with % in yasm syntax is a macro
Ok, now we have to do step 3: calculating A+B+C+D+E+F+G+H+4 / 8
lea r1d, [r1+r3+4]
First, let's go over x86 addressing. What you can put inside the brackets is not infinite. Here's the capabilities, specifically: [REG1 + REG2 * {1,2,4,8} + CONST]
A register, plus another register * 1/2/4/8, plus a constant (positive or negative). As you might note, this is pretty useful for accessing things like arrays. E.g. array[n+5], where array is an int array, would be: [array + n*4 + 20]
I suppose the [r0+FDEC_STRIDE*n-1] bit gets simplified on assembly to [register + const]?
- <A> Yes, yasm sums up constants for you.
So, as you might note, that's a pretty powerful addressing system. That's more powerful than, say... "add". So why not expose it in an instruction to let us use it for math? So Intel did.
lea X, [expr] sets X equal to the value of expr just as fast as add. So that lea does r1d = r1 + r3 + 4
Wait, how does that work?
- <A> lea runs the [REG1 + REG2 * {1,2,4,8} + CONST] math on its second argument and adds to the first. lea doesn't actually address it. It just calculates the result and stores it instead of going to memory.
And it's faster than add?
- <A> It's just as fast except that you can do more with it.
Now, technically, you can do more adds per cycle than lea, so you shouldn't go replacing all your adds with lea. But if you can use it to do more than one thing at a time, it's a big win. So this lets us add r3, and add 4, in one op.
shr r1d, 3
There's one that you can probably figure out yourself - shift right.
Why are we shifting right?
- <A> +4 for correct rounding, >> 3 to divide (>>3 = /(2^3) = /8).
Now for the final part: storing the results.
imul r1d, 0x01010101
This is called a "splat" and you may have seen it in C as well. We're turning an 8-bit value into 4x that value, e.g. A -> AAAA
how does this work?
- <A> A * 0x01010101 = A A A A
So now we have a 32-bit register, r1d with one copy of A in each 8-bit nibble of that register. Now we go ahead and store this 4 times and we're done.
Finally, we RET: x264 will automatically clean up after us.
Two
Ok, next. You may have noticed that psadbw is awesome. It does like 8 things in one. Whereas abs() is typically 4 instructions on x86. psadbw does 8 subtracts, 8 absolute values on those results and then adds them up. That's 8 + 32 + 7 = 47 instructions in one (at least, 47 equivalent).
why is abs() so slow?
- <A> abs() isn't slow, there's just no instructin for it. The typical algorithm is:
int sign = x >> 31; (x ^ sign) - sign;
- This needs a mov on x86, so that's 4 instructions.
So psadbw is pretty awesome. It's very awesome for doing what its name implies you should do with it, that is -- SADs -- sum of absolute differences. So let's open sad-a.asm and hop down to line 95 (git link). Also open common/pixel.c and look at the first function: SAD (git link). This function is pretty simple. You should be able to see how it works. Look only at the C for now.
So as you'll notice, the C SAD has 7 different versions for 16x16, 16x8, 8x16, etc and it's instantiated via a macro. So, for our asm, we also need 7 versions and we also don't want to write the function 7 times, just like in the case of C we didn't.
So in the asm, we define a macro: %macro SAD 2 that means this macro has two paremeters. They are accessed as %1 and %2. We call the macro 7 times, one for each size. The function takes 4 args (as you'd expect) and needs 4 regs (just the args)
Is SAD_INC_2x%1P another macro?
- <A> Yes, it's one of three macros, look above, each one does 2 rows worth of SAD for width 4, width 8, and width 16. It picks the right one based on the width and it %reps it based on the height.
Now, start analyzing the 3 macros above (the sad macros) and trying to figure out how they work. Note mm0 is the accumulator which is why it's zeroed at the start.
The order of args is the same as in the C function?
- <A> Yes
What does punpckldq do?
- <A> Good question! punpck is a set of instructions that interleave their arguments in some fashion.
To start with, it can be l or h, low or high. So punpckl__ ABCD, EFGH will use AB and EF. And punpbkh__ ABCD, EFGH will use CD and GH.
The next two letters are the source size, and destination size. For example, punpcklbw interleaves bytes, to create words. So punpcklbw ABCD, EFGH gives you AEBF (if the letters are bytes). So punpckldq ABCDEFGH, IJKLMNOP gives us ABCDIJKL. So in other words, it stuffs the two sets of 4 bytes we just loaded into one register
So we can do only one SAD, instead of two. punpckldq ABCD0000, EFGH0000 results in ABCDEFGH. So it effectively concatenates mm1 and mm2 for us. If we didn't do this, we'd have to do twice as many sads and adds. We do this because the registers are width 8, but our sad is width 4. So we need to stuff sad information side by side to fill the whole reg.
Why are we punpckldq'ing the [r0+r1] and not [r0]?
- <A> We're concatenating row 0 and row 1 of each input.
lea r0, [r0+2*r1] Why are we doing this step? Doesn't it move r0 over 2*r1?
- <A> We're incrementing the pointer by 2*stride
Now you should understand what SAD_INC_2x4P does, the others work similarly except without the punpck magic because they don't need it.
Why is the lea out of order in SAD_INC_2x8P? By out of order i mean not next to each other.
- <A> No particular reason.
So we rep the SAD for however many times so the 2x%1 is completed?
- <A> Yes, so if it's height 8, we rep it 4 times
Why are strides not hardcoded btw?
- <A> SAD can be called on a reference frame thus variable stride
I don't really get it...
- A<> It's called on frames, as opposed to some temporary block of memory.
Now, for the kicker: the 16x16 SAD function declared here is 15 times faster than C.
What? Why is it so much faster?
- <A> psadbw
Now let's get a bit to how we measure performance. For any asm instruction, there are three things that matter: latency, inverse throughput, and execution units. The first two are represented with notation like this: "3/1". This means a psadbw takes 3 clocks to finish from when it's started and you can do one of them per cycle. Another example is mov. Between two registers it is 1/0.33, takes 1 cycle, and you can do 3 per clock.
Execution unit usage is a bit trickier. Not all execution units can do all instructions. Intel chips have 6 execution units: p0, p1, p2, p3, p4, p5
Wait, what is latency?
- <A> time from start to finish, in clocks
and inverse throughput and execution units?
- <A> Inverse throughput is how many you can do per clock. Execution units are the things in the chip that do stuff.
Of these 6 execution units, three can do math: p0, p1, p5. psadbw, for example, can only use one of these (p1), pxor can use all three.
Generally execution units aren't important until you get into serious optimizing but they can often affect the best instruction choices. For example, if an execution unit is sitting around doing nothing for a whole function.
The instruction tables sheet here http://agner.org/optimize/ has all the information on latency, execution units, and inverse throughput for a wide variety of CPUs.
How about branching? I heard branching ruins things.
- <A> Not generally unless it's unpredictable, branch mispredictions are the cause. We can get to a case of that later.
Now, let's just analyze SAD. Suppose we want to analyze the 8x8 SAD. In this function we do: 8 SADs, 8 adds (accumulates), 16 loads, plus the start, end, and calling overhead. 8 SADs: takes 8 cycles (inverse throughput of 1). 8 adds: takes 8 cycles (inverse throughput of 1), and can run at the same time as SADs. 16 loads: takes 16 cycles, and can run at the same time as the above.
So the loads are the bottleneck. This is an important thing to understand: it's possible for one type of operation to bottleneck a function. Loads are a common example. In this case, SAD is *so fast* that it is effectively free, as we're sitting around waiting for loads the whole time. The actual runtime of the function is about 22 clocks. Which is fitting for 16 + start + end + overhead.
So that's some basic performance analysis for you. How long the function should take in theory, how long each instruction takes in theory, and how you can be bottlenecked.
Is there anything that does this automatically for you?
- <A> Analysis? not really. There are intel performance counters and such on the chip but they're not magic. It might be useful to have some kind of tool to analyze asm functions. In general though, intuition is a powerful tool.