Difference between revisions of "SoC 2008/x264/Speed Optimizations"

Revision as of 23:03, 23 June 2008

This project is part of Google Summer of Code 2008.

Student:	Holger Lubitz
Mentor:	Alex Izvorski

Project Abstract

My goal is to search for speed optimizations in x264 (and, time allowing or if anything seems an obvious port, possibly a bit in vlc and ffmpeg as well). I intend to thouroughly profile the thing and find out where we might do better.

Implementation

Will be mainly on a day to day basis - find something, see if something can be done about it. My main development system so far was an Athlon X2 3800+, so I mainly did 64bit and MMX, some SSE code so far, but it recently lost its case to a new and shiny Q9300 (will be back in service once that case arrives - seems it may take a while). I intend to be learning a lot of SSSE3 and SSE4.1 and will hopefully find some uses for it.

Current Status

Battling with Hardware. Fedora 9 is said to have the latest and greatest of all distributions in regard to supporting the Intel Onboard-Graphics (it's just a little bit behind on the version number and i think most of the newer stuff is already in). However, my G35 mobo does not really like me yet. Some things still to be found out. Expect me to rant a bit on the channel if it was just another really bad day.

Other than that, I hope to get into vtune next week, maybe even try to get the free IDA going somehow. And then we'll see where that gets me.

Week 1

Downloaded free IDA 4.9 and demo version of 5.2. The free version works surprisingly well in current wine. Just some font problems (fixed by selecting another one) and window handling is a bit awkward sometimes. Successfully disassembled some linux programs. Unfortunately the free version doesn't do 64 bit code and newer SSE, and the demo is severely limited in other aspects (no saves, time limit per session).

But then again, so far good old objconv -fasm worked well enough, too. So IDA should be nice to have if needed.

Registered with Intel for non-commercial downloads and got Intel Compiler Collection, Performance Primitives and of course vtune, but vtune officially only supports Fedora up to 7 (and with stock kernel) currently. Will have to fiddle a bit with that in the next days (given the X problems I had even with the current driver, downgrading to 7 is probably not even an option).

Found out about liboil and checked their zigzag implementation. They have an altivec zigzag, but it doesn't look like their idea can be borrowed - ends with a lot of fixup in c. But I think checking there for other ideas cannot hurt. Will keep a copy.

Hardware troubles turned out to be partly hardware, partly software related. Both 4 GB kits have been faulty (either A-Data Vitesta wasn't the best brand to buy or it was just my sheer luck again that both gave errors in memtest86). Returned them, hope the vendor is going to refund the money soon. Running on the 4x1 GB from my previous system for the time being. X currently only works with uncached video memory. Shouldn't affect the encoding measurements, however I noted that h264 playback becomes very jerky with uncached memory. Fedora already has an open bug for the issue - let's hope it can be resolved soon.

In related news: Finished term project paper at least so far that I could present it to my professor. Had minor comments, going to hand it in by Jun 6.

Week 2

Was pretty tied up with finishing my term paper, so no real work on GSoC happened. Things i still managed to do:

Looked into SSEPlus.

Fiddled around with vtune a bit more. Sampling driver doesn't even compile with current kernels and doesn't look like it's easily fixed. The Support Forum seems to indicate people are even having problems after downgrading to the supported kernel on unsupported (newer) Fedora Releases like F8. Response was "not in the list of supported OS, wait for next release". I don't want to waste time fixing bugs in their proprietary software. This will probably mean waiting for the next vtune or preparing a test partition with the supposedly supported F7. For now, oprofile it is.

Had a couple test runs of pengvados SSE4-mubench on my system. Short run got a couple errors, but at least results, long run bombs. My perl is next to non-existent, so I'll have to hope for pengvado or AI4 to fix this.

Got git running the way I want (follow master from x264.git instead of the not updating master in the private repo). Turns out the easy way to do this seems to be:

git remote add x264 git://git.videolan.org/x264.git

and than change "remote = origin" to "remote = x264" for the master branch in .git/config. Did a test push after pulling, which successfully updated the master branch in my private repo, so i think it works the way I want.

Pushed the rest to

Week 3

Some initial profiling next. Dark_Shikari suggested using his Black Pearl clip for that. Open to suggestions for suitable HD material. Going to use Elephants Dream if nothing better comes up. Will also try to find out if there are notable differences with interlaced material (will use DVB-S captures for that, probably a bit of Torchwood).

Ran the profiles (used a sequence of BBB instead of ED). Found a version of gprof2dot that groks oprofile output, managed to get it to work. Produces pretty graphs Media:Oprofile-graph.png, but due to the way call graph profiling in oprofile works, not all cycles are necessarily attributed correctly. Need to find out if gprof is better suited to that.

Week 4

Revisited the zigzag functions. The 8x8 SSE2 performs nicely on my Penryn (cuts runtime from ~97 cycles to ~47), the 4x4 seems to cut ~27 cycles to ~10. (however, I do not know if I can trust the C cycle counts - they used to be lower when we first benched this).

Took a look at pengvados last set of profiles (generated using different settings and on a larger encode), found out that at higher settings satd, sa8d and dct/idct show up high enough to warrant some optimizing. Found out that they could be made faster (around 3%) by doing SUMSUB differently (two paddw and one psubw cannot run in one cycle - now it has a mov, a sub and an add and those do seem to run in less. The mov probably can't pair with the following sub/add due to data dependency, but it seems those can pair with the _next_ mov).

But with the current code organization, each file has its own set of helper macros. Some refactoring would probably be useful before doing this change.

Schedule

Had a presentation on May 23rd (yeah, just the right time to buy and install new hardware). Will need to hand in the write up for all of it soonish. After that, mainly weather dependent. Cold heads code, hot heads need to be near (preferably cold) water. Well. Wasn't coding what nights were made for?

Been to my professors office hour Jun 2. Paper on the presentation (or rather, the term project presented) due Jun 6. Right in time for Euro 2008 ;) No, honestly: After that, mostly clear skies ahead time-wise. (Except maybe a bit of footy plus poker every now and then. Let's see how Germany does :)

Other probable time constraints: Will have to juggle a bit between doing things for GSoC and thinking (and writing) about these things for my Thesis. Switching sometimes should help keep up motivation (and idea flow) for the other. Otherwise, nothing largish. Maybe a weekend trip or two, but nothing long lasting.

May 31/Jun 1: Visit family. (Nice visit. Sister dropped by on her trip back from UK to Australia, and saw my oldest aunt again after >8 years. Sunny weather, unusually hot for the time of year. Usually, we only see >30C in August, if at all. Hope it'll get a bit colder soon.)

@@ Line 59: / Line 59: @@
 == Week 4 ==
-Revisited the zigzag functions. The 8x8 SSE2 performs nicely on my Penryn (cuts runtime from ~97 cycles to ~47), the 4x4 seems to cut ~27 cycles to ~10. (however, I do not know if I can trust the C cycle counts - they used to be lower when we firsted benched this).
+Revisited the zigzag functions. The 8x8 SSE2 performs nicely on my Penryn (cuts runtime from ~97 cycles to ~47), the 4x4 seems to cut ~27 cycles to ~10. (however, I do not know if I can trust the C cycle counts - they used to be lower when we first benched this).
-Took a look on pengvados last set of profiles (generated using different settings and on a larger encode), found out that at higher settings satd, sa8d and dct/idct show up high enough to warrant some optimizing. Found out that they could be made faster (around 3%) by doing SUMSUB differently (two paddw and one psubw cannot run in one cycle - now it has a mov, a sub and an add and those do seem to run in less. The mov probably can't pair with the following sub/add due to data dependency, but it seems those can pair with the _next_ mov).
+Took a look at pengvados last set of profiles (generated using different settings and on a larger encode), found out that at higher settings satd, sa8d and dct/idct show up high enough to warrant some optimizing. Found out that they could be made faster (around 3%) by doing SUMSUB differently (two paddw and one psubw cannot run in one cycle - now it has a mov, a sub and an add and those do seem to run in less. The mov probably can't pair with the following sub/add due to data dependency, but it seems those can pair with the _next_ mov).
 But with the current code organization, each file has its own set of helper macros. Some refactoring would probably be useful before doing this change.

Difference between revisions of "SoC 2008/x264/Speed Optimizations"

Revision as of 23:03, 23 June 2008

Contents

Project Abstract

Implementation

Current Status

Week 1

Week 2

Week 3

Week 4

Schedule

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Help / Documentation

Development

VideoLAN wiki

Tools