Difference between revisions of "X264 TODO"

Revision as of 00:18, 14 September 2010

This page contains an incomplete list of things available in x264 for you to do. It's organized into sections covering various parts of x264.

Some useful resources: Dark Shikari's pile of junk, Pengvado's pile of junk.

If you're interested in doing any of this, drop by #x264dev on Freenode IRC. There are no experience or educational requirements for doing any of this, though you are expected to know how to code.

Motion Estimation

Sequential elimination (SEA), used for exhaustive search, might be more generally applicable to algorithms like UMH, by letting us skip a lot of SADs. The downside is we won't be able to use SAD_X4 anymore.
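
For anyone unfamiliar with the idea, here is a minimal Python sketch of the bound that sequential elimination relies on. It is an illustration only, not x264's code: the names `sad` and `sea_search` and the flat list of candidate blocks are assumptions made for the example. By the triangle inequality, |sum(cur) - sum(ref)| <= SAD(cur, ref), so a candidate whose sum difference is already no better than the best SAD found so far can be skipped without computing its SAD.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equal-length pixel lists."""
    return sum(abs(a - b) for a, b in zip(cur, ref))


def sea_search(cur, candidates):
    """Return (best_index, best_sad, sads_computed) over candidate blocks.

    A candidate is skipped without computing its SAD when the cheap lower
    bound |sum(cur) - sum(ref)| is already no better than the best SAD so far.
    """
    cur_sum = sum(cur)
    best_idx, best_sad, computed = -1, float("inf"), 0
    for idx, ref in enumerate(candidates):
        # A real encoder would read block sums from precomputed running sums;
        # calling sum() here just keeps the sketch short.
        lower_bound = abs(cur_sum - sum(ref))
        if lower_bound >= best_sad:
            continue  # cannot beat the current best: skip the SAD entirely
        cost = sad(cur, ref)
        computed += 1
        if cost < best_sad:
            best_idx, best_sad = idx, cost
    return best_idx, best_sad, computed
```

Because each candidate is accepted or rejected one at a time against the running best, the SADs no longer come in fixed groups of four, which is why this conflicts with SAD_X4.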
(T)ESA is currently wrong for motion searches done on weightp duplicates. This effect is minuscule, but it still should be fixed.
Hierarchical motion estimation might be a useful way to catch very long motion vectors without the cost of UMH or ESA. It might also help regularize motion.
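
A rough Python sketch of what a two-level hierarchical search could look like. Everything here is an assumption made for illustration (the function names, the 2x2 averaging, the single coarse level, the +/-1 refinement); it is not taken from x264. The point is that a small search window at half resolution covers twice the range at full resolution, so long vectors are found with far fewer SADs than a full-resolution exhaustive search.

```python
import random


def downsample(frame):
    """Halve both dimensions by averaging each 2x2 group of pixels."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1] +
              frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1] + 2) // 4
             for x in range(w)] for y in range(h)]


def block_sad(cur, ref, bx, by, size, dx, dy):
    """SAD between the cur block at (bx, by) and the ref block displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def search(cur, ref, bx, by, size, center, radius):
    """Exhaustive search in a square window around `center`; returns the best (dx, dy).

    The caller must keep the whole window inside the reference frame.
    """
    cx, cy = center
    candidates = [(cx + dx, cy + dy)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return min(candidates, key=lambda mv: block_sad(cur, ref, bx, by, size, *mv))


def hierarchical_search(cur, ref, bx, by, size, coarse_radius):
    """Search at half resolution, then refine by +/-1 pixel at full resolution."""
    coarse = search(downsample(cur), downsample(ref),
                    bx // 2, by // 2, size // 2, (0, 0), coarse_radius)
    return search(cur, ref, bx, by, size, (2 * coarse[0], 2 * coarse[1]), 1)


# Demo: a random 64x64 frame and a copy displaced by (dx, dy) = (8, -6).
random.seed(1)
ref_frame = [[random.randrange(256) for _ in range(64)] for _ in range(64)]
cur_frame = [[ref_frame[(y - 6) % 64][(x + 8) % 64] for x in range(64)]
             for y in range(64)]
mv = hierarchical_search(cur_frame, ref_frame, 24, 24, 16, 5)
```

In this demo the coarse pass evaluates 121 candidates and the refinement 9, against 441 for a full-resolution search of radius 10. Whether the coarse level also regularizes motion on real content is the open question.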
Somehow take into account the effect of motion vector decision on future blocks.
- Hierarchical motion estimation
- Approximations from lookahead MVs
- Iterative ME (as per Snow)
- Trellis motion estimation
We don't need to check all 11 predictors all the time for 16x16 fullpel motion search.
- But how do we know which ones we can afford to skip, and when?
- Xvid and libtheora have algorithms for this, but the former's is almost surely 100% useless and the latter doesn't seem impressive either.
libtheora does fullpel motion estimation on the source pixels instead of decoded pixels. Does this give a better starting point for the subpel search and discourage "weird" MVs?
With extremely fast encoding settings (subme 0), can we rip off lookahead MVs instead of doing a real search?
Try sub-8x8 partitions in B-frames. Is it at all useful?
Try bidir motion estimation for fullpel. That is, considering L1's MV when doing L0 (or vice versa). Xvid does this. How much does it help?

Intra Analysis

Make the early terminations smarter. Currently they're just hacks -- some statistical analysis might be useful.
SAD (subme 1) i8x8 vs i4x4 decision is a bit bad. Can it be improved without significant speed loss?

Mode Decision

Can we find more ways to skip more motion searches in multiref?
On extremely fast encoding settings, fast skip is actually kind of slow. But anything dumber (e.g. SAD) is completely useless. Is there some better balance that can be achieved here?
Chroma-aware mode decision for B-frames?
See the TODOs for deblock-aware RD in common/deblock.c.
Is there a faster way than SA8D/SATD to do 8x8dct vs 4x4dct mode decision? At very fast settings, the time this uses is nontrivial.
Is there a faster way than RD to do 8x8dct vs 4x4dct mode decision that's still better than SATD? RD takes over an order of magnitude more time than SATD, so it might be useful to have something in the middle.
Is there some value to swapping the mode decision metric from SATD to SA8D if we think that the macroblock will use 8x8dct? This has been tried before, but only helped if our guess was extremely good (better than we could get in reality).
With trellis 2, can we skip most of CABAC and CAVLC bit cost calculation?
How about a "brute force" mode decision that takes no shortcuts (no early ref termination in p8x8, no SATD thresholds, etc)?

Psy

Psy-RD is a hack. It works, but it's a hack. If you apply QNS with Psy-RD as the metric, it goes way overboard and gives terrible results. This means that Psy-RD only works because normal mode decision is limited in the way it can modify the image to better suit the metric. Is there a way to make it better?
Should RD be linear at all? Perhaps we should weight more heavily against low quality blocks and also try to ignore minuscule distortion that viewers can't see.
Psy-trellis (and maybe psy-RD?) are too strong at very high QPs.
Psy-trellis should be merged with Psy-RD. There are patches for this, but they probably won't be committed until psy-RD itself is fixed.
RD should take into account local variance.
Lambda should be varied on a per-DCT-block basis instead of a per-macroblock basis.
Lambda should be picked independent of quantizer (i.e. with greater precision).
Classic problem: a block is mostly high complexity but has a small area of low complexity. How do we judge whether that area is important? Good example: sharp text on background with film grain; grain gets blurred out because of the text.
- If we think it's important all the time, we ruin the quality of many clips that rely on raising complexity on edges (Touhou).
Should motion estimation lambda be as high as it is at very high quantizers? There's some value to capturing "true motion"...

Lookahead

Lookahead should be multithreaded, either by splitting the frame (sliced threads) or running multiple frame analysis calls at once.
Temporal MV predictors in lookahead? There's a patch for these somewhere, but they biased heavily in favor of B-frames, likely by improving the motion search.
Should lookahead use variable lambda based on quantizer (esp. due to adaptive quant)? If so, should it take into account estimated ratecontrol quantizer, too? If so, how?

Quantization

CAVLC "trellis" is a hack. It works, but it's a hack. Make it better. See the TODOs in encoder/rdo.c.
There's room for something between trellis and deadzone in terms of complexity. libvpx has a good example -- it biases towards zero-runs in its "medium speed" quantizer. This can't be SIMD'd easily, but is still vastly faster than trellis. A nonlinear quantizer (be more likely to round up larger coefficients) might also be useful.
Floyd-Steinberg for quantization? Try pushing quantization error to nearby DCT coefficients. Should this go from high to low or low to high?
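
To make the error-diffusion idea concrete, here is a one-dimensional Python sketch. It is purely hypothetical: the function name, the `carry` fraction and the use of plain scan order are assumptions for the example, and nothing like this exists in x264 today. Each coefficient is quantized after adding the rounding error left over from the previous one, and the direction is a parameter because that is one of the open questions.

```python
def quantize_with_error_diffusion(coefs, qstep, carry=1.0, high_to_low=False):
    """Quantize coefficients in scan order, pushing each rounding error onward.

    coefs       -- transform coefficients in scan order (low to high frequency)
    qstep       -- quantizer step size
    carry       -- fraction of each quantization error passed to the next coefficient
    high_to_low -- process from the highest frequency down instead
    Returns the list of quantized levels in the original order.
    """
    order = range(len(coefs) - 1, -1, -1) if high_to_low else range(len(coefs))
    levels = [0] * len(coefs)
    error = 0
    for i in order:
        target = coefs[i] + carry * error   # coefficient plus diffused error
        levels[i] = round(target / qstep)
        error = target - levels[i] * qstep  # what rounding failed to represent
    return levels
```

With `carry=1.0` the sum of the reconstructed coefficients stays within half a quantizer step of the original sum, and with `carry=0` the function reduces to plain rounding. Whether preserving that sum helps visually, and at what bit cost, would have to be measured.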
Energy-preserving quantizer -- maintain L1 (or maybe L2? I'm not sure) energy. Should we maintain it in the spatial domain (post-iDCT) or residual domain? Probably the former.
Decimation is currently just a ripoff of the JVT recommended algorithm. Can we do this more optimally? With RD?

Transform

Analyze the error characteristics of the fDCT. Is there any way to make it more accurate without much speed loss? Particularly at extremely low quantizers, this might help.
Before forward transform, run a "blocking filter" that acts as the approximate inverse of the deblock filter. See this paper.

Interlacing

Finish adaptive MBAFF. Talk to horlicks about this one.
Make slice-max-mbs/slice-max-size work with interlacing. Pretty much requires a good portion of adaptive MBAFF.
Lookahead currently blend-deinterlaces to get the lowres. Is this a good idea? Is there something better that isn't much slower?
Constrained intra + adaptive MBAFF. Does anyone care about this?

Weighted Prediction

Make weightp work with interlacing. Preferably abuse reference duplication to make it useful for MBAFF.
Make weightp work with chroma. Talk to DylanZA about getting his current patch for this one.
Finish K-means decision for weightp. Talk to DylanZA about getting his current patch for this one.
Add explicit weighting for B-frames, too. This helps in nonlinear fades, among other cases.
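
As a starting point for experiments, here is a small Python sketch of estimating an explicit weight and offset for one reference by matching the mean and standard deviation of the two frames. This moment-matching estimator is an assumption chosen for brevity, and the function name is hypothetical; it is not a description of x264's weightp analysis.

```python
from statistics import mean, pstdev


def estimate_weight_offset(cur, ref):
    """Estimate (weight, offset) so that weight * ref + offset approximates cur.

    cur, ref -- flat sequences of luma samples from the current and reference frames.
    """
    ref_dev = pstdev(ref)
    # A flat reference carries no contrast information; fall back to unit weight.
    weight = pstdev(cur) / ref_dev if ref_dev > 0 else 1.0
    offset = mean(cur) - weight * mean(ref)
    return weight, offset
```

For a B-frame this would be run once per reference list, which is where explicit weights can differ from the implicit, distance-based ones during a nonlinear fade.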

Ratecontrol

VBV might be able to utilize the ability to re-encode a row of the frame for improved accuracy.
- Maybe re-encode everything in case of an underflow that row-reencoding can't fix? This might be better than underflowing.
Current per-frame VBV is a hack. It only adapts per row and is O(N^2), where N is the number of rows. An O(N) solution would be able to react more often and thus be more accurate.
Make the frame size and row size predictors better. They currently are kind of crappy.
Ratecontrol code as a whole is a bit of a mess. It could be improved. There's a lot of cruft left over that is probably not needed now, like qblur.
2-pass VBV is actually more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot. This trust is often misplaced if the first pass was a fast one. This should be improved.
"Emergency mode" for VBV -- handle cases where abiding by VBV is impossible within normal quantizer bounds.
- Just force all blocks to skip?
- Drop frames?
- Allow QPs higher than 51? (Fake QPs -- just results in "denoising" away DCT coefficients)
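
A quick Python sketch of the "fake QP" option. The H.264 quantizer step doubles every 6 QP, so the step can be extrapolated past 51 and used to throw away coefficients. The function names and the truncating quantizer are assumptions for illustration; in particular, how the surviving levels would be rescaled so that a decoder working at QP 51 reconstructs them sensibly is left out.

```python
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]  # H.264 Qstep for QP 0..5


def qstep(qp):
    """Quantizer step size for a QP; doubles every 6 QP, extrapolated past 51."""
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))


def fake_qp_quantize(coefs, qp):
    """Quantize with the extrapolated step; at 'fake' QPs most levels become zero."""
    step = qstep(qp)
    return [int(c / step) for c in coefs]  # truncation toward zero, no rounding offset
```

For example, `qstep(51)` is 224 and `qstep(57)` is 448, so coefficients that still survive at QP 51 are zeroed six steps later.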
2-pass macroblock-tree: if we added the ability to do macroblock-tree on real encoded data, we'd get better results (particularly with repeating patterns and multiref, such as an anime character's mouth moving).
Macroblock-tree: make it more psy-aware. Maybe we should cap how much it lowers the quantizer on extremely static scenes? This might tie into the "just-noticeable error" issue in RD.

GPU

Motion estimation?
- Methods
  - Hierarchical?
  - 2D Wave?
  - Something else?
- "Easy": lookahead motion estimation
  - Extremely high parallelism, hundreds of frame searches (each with thousands of searches) at once.
- "Hard": main motion estimation
  - Difficult synchronization issues, not as heavily parallel in terms of number of macroblocks, but far more partition sizes and refs to search.
  - But potentially more useful...
Other things?

x86 assembly

Finish AVX support for x86inc.asm. Talk to Dark Shikari for a patch.
Optimize more for the Phenom.
Optimize for the upcoming Sandy Bridge.
Work on 10-bit asm. Talk to irock about this.
Convince holger to commit his local patches.
Make a merged SA8D/SATD for the 8x8dct mode decision, since the two share most of their calculation. Hadamard_ac already does this, but slightly differently.
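
The shared calculation in a merged SA8D/SATD can be shown with a short Python sketch. It is a scalar illustration under stated assumptions: the function names are hypothetical, the sums are left unnormalized, and real code would of course be SIMD. The 8x8 Hadamard transform of a block equals one extra 2x2 butterfly applied across the 4x4 Hadamard transforms of its four quadrants, so the 4x4 work done for SATD can be reused for SA8D.

```python
def hadamard4(block):
    """2D 4x4 Hadamard transform (unnormalized) of a 4x4 list of lists."""
    def h1d(v):
        a, b, c, d = v
        s0, s1, d0, d1 = a + b, c + d, a - b, c - d
        return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]
    rows = [h1d(r) for r in block]
    cols = [h1d([rows[y][x] for y in range(4)]) for x in range(4)]
    return [[cols[x][y] for x in range(4)] for y in range(4)]


def satd_and_sa8d(diff8):
    """Return (SATD, SA8D) sums for an 8x8 residual, sharing the 4x4 transforms.

    Both values are unnormalized sums of absolute Hadamard coefficients.
    """
    # Quadrant order: top-left, top-right, bottom-left, bottom-right.
    quads = [hadamard4([row[x:x + 4] for row in diff8[y:y + 4]])
             for y in (0, 4) for x in (0, 4)]
    satd = sum(abs(c) for q in quads for row in q for c in row)
    sa8d = 0
    for y in range(4):
        for x in range(4):
            a, b, c, d = (q[y][x] for q in quads)
            # One extra 2x2 butterfly across the four quadrants yields the
            # four 8x8 Hadamard coefficients that share this (y, x) position.
            sa8d += (abs(a + b + c + d) + abs(a - b + c - d) +
                     abs(a + b - c - d) + abs(a - b - c + d))
    return satd, sa8d
```

The SATD sum is taken before the final butterfly and the SA8D sum after it, so the only cost beyond a plain SA8D is the extra absolute-value accumulation.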

Other assembly

NEON assembly is nowhere near complete.
- Chroma MC needs to be rewritten for NV12 support.
Altivec assembly is very lacking.

Other CPU optimizations

x264 needs more prefetching. How many L1 and L2 cache misses (particularly L1) can we get rid of via smart prefetching in the right places? Warning: this is often hard to benchmark.
Different CPUs take different relative times for some functions. Is this enough (particularly across architectures) to justify different encoding settings for different CPUs?

Other features

VP8 encoding support

x264CLI

Finish audio support. Talk to Kovensky about this one.
Add more filters.
- Deinterlacers (YADIF).
- Denoisers (HQDN3D?).
- IVTC, decomb?
Merge L-SMASH mp4 muxer.
Add TS muxing support using HRD. Talk to kierank about this one.
Add --device support.

@@ Line 82: / Line 82: @@
 * Current per-frame VBV is a hack.  It only adapts per row and is O(N^2), where N is the number of rows.  An O(N) solution would be able to react more often and thus be more accurate.
 * Make the frame size and row size predictors better.  They currently are kind of crappy.
-* Ratecontrol code as a whole is a bit of a mess.  It could be improved.  There's a lot of cruft left over that is probably not needed now, like qcomp.
+* Ratecontrol code as a whole is a bit of a mess.  It could be improved.  There's a lot of cruft left over that is probably not needed now, like qblur.
 * 2-pass VBV is actually more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot.  This trust is often misplaced if the first pass was a fast one.  This should be improved.
 * "Emergency mode" for VBV -- handle cases where abiding by VBV is impossible within normal quantizer bounds.
