x264 TODO

From VideoLAN Wiki
Latest revision as of 08:35, 27 March 2019

This page contains an incomplete list of things available in x264 for you to do. It's organized into sections covering various parts of x264.

Some useful resources: Dark Shikari's pile of junk, Pengvado's pile of junk.

If you're interested in doing any of this, drop by #x264dev on Freenode IRC. There are no experience or educational requirements for doing any of this, though you are expected to know how to code.

Bolded features may have companies willing to sponsor or provide bounties. This is not complete either; just because it's not bolded doesn't mean there aren't resources out there. If your company is interested in offering a bounty, drop by IRC.

Motion Estimation

  • Sequential elimination (SEA), used for exhaustive search, might be more generally applicable to algorithms like UMH, by letting us skip a lot of SADs. The downside is we won't be able to use SAD_X4 anymore.
  • (T)ESA is currently wrong for motion searches done on weightp duplicates. This effect is minuscule, but it still should be fixed.
  • Hierarchical motion estimation might be a useful way to catch very long motion vectors without the cost of UMH or ESA. It might also help regularize motion.
    • I have a patch for this in the lookahead, but it didn't help much, since it only added predictors.
  • Somehow take into account the effect of motion vector decision on future blocks.
    • Hierarchical motion estimation
    • Approximations from lookahead MVs
    • Iterative ME (as per Snow)
    • Trellis motion estimation
  • We don't need to check all 11 predictors all the time for 16x16 fullpel motion search.
    • But how do we know which ones we can afford to skip, and when?
    • Xvid and libtheora have algorithms for this, but the former's is almost surely 100% useless and the latter doesn't seem impressive either.
  • libtheora does fullpel motion estimation on the source pixels instead of decoded pixels. Does this give a better starting point for the subpel search and discourage "weird" MVs?
  • With extremely fast encoding settings (subme 0), can we rip off lookahead MVs instead of doing a real search?
    • This seems to be awful from my testing, but maybe there's something we can do?
  • Try sub-8x8 partitions in B-frames. Is it at all useful?
  • Try bidir motion estimation for fullpel. That is, considering L1's MV when doing L0 (or vice versa). Xvid does this. How much does it help?
  • Fullpel chroma ME?
    • For TESA?
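
For reference, the sequential elimination idea from the first bullet can be sketched as below. This is only an illustration, not x264's code: the names (BLK, block_sum, sea_search) are made up for the example, the block sum is recomputed per candidate instead of coming from precomputed sums (so the sketch shows the skipping logic, not the speedup), and MV cost is ignored. It relies on the bound |sum(cur) - sum(cand)| <= SAD(cur, cand).

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLK 4 /* hypothetical block size, chosen only for this sketch */

/* Sum of all pixels in a BLK x BLK block. */
static int block_sum(const uint8_t *p, int stride)
{
    int s = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            s += p[y * stride + x];
    return s;
}

/* Plain SAD between two BLK x BLK blocks. */
static int block_sad(const uint8_t *a, int a_stride,
                     const uint8_t *b, int b_stride)
{
    int s = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            s += abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return s;
}

/* Exhaustive search over every position where the block fits inside a
 * w x h reference area. Candidates whose sum-difference lower bound
 * cannot beat the best SAD so far are skipped without computing a SAD.
 * Returns the best SAD and reports how many SADs were really computed. */
static int sea_search(const uint8_t *cur, int cur_stride,
                      const uint8_t *ref, int ref_stride, int w, int h,
                      int *best_x, int *best_y, int *sads_computed)
{
    int cur_sum = block_sum(cur, cur_stride);
    int best = INT_MAX;
    *sads_computed = 0;
    for (int y = 0; y + BLK <= h; y++) {
        for (int x = 0; x + BLK <= w; x++) {
            const uint8_t *cand = ref + y * ref_stride + x;
            /* Lower bound: |sum(cur) - sum(cand)| <= SAD(cur, cand). */
            if (abs(cur_sum - block_sum(cand, ref_stride)) >= best)
                continue;
            int sad = block_sad(cur, cur_stride, cand, ref_stride);
            (*sads_computed)++;
            if (sad < best) {
                best = sad;
                *best_x = x;
                *best_y = y;
            }
        }
    }
    return best;
}
```

Note that each surviving candidate gets its own SAD call here, which is exactly why a SAD_X4-style batched SAD stops fitting once candidates are skipped individually.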

Intra Analysis

  • Make the early terminations smarter. Currently they're just hacks -- some statistical analysis might be useful.
    • With the SSSE3-based fast intra analysis, we no longer do any early terminations for different modes, at least in SAD/SATD analysis. But there might still be improvements to be made.
  • SAD (subme 1) i8x8 vs i4x4 decision is a bit bad. Can it be improved without significant speed loss?

Mode Decision

  • Can we find more ways to skip more motion searches in multiref?
    • A while back, I tried using weaker motion searches on older refs. This helped a bit for speed-vs-compression, but is ironically the opposite of what one wants; older refs will be harder to find good MVs in, and therefore really need better searches.
  • On extremely fast encoding settings, fast skip is actually kind of slow. But anything dumber (e.g. SAD) is completely useless. Is there some better balance that can be achieved here?
    • Can we do something smart by analyzing fenc? It's impossible to tell whether a block is motionless by looking at fdec, but looking at the source pixels is useful. There's still complexity such as lower-QP-than-reference though.
  • See the TODOs for deblock-aware RD in common/deblock.c.
    • I tried correcting weightp references for deblock RDO, but it didn't help.
    • I tried chroma, too, and again, it didn't help measurably.
  • Is there a faster way than SA8D/SATD to do 8x8dct vs 4x4dct mode decision? At very fast settings, the time this uses is nontrivial.
    • Doing a merged 4x4/8x8 SATD would help here, but would require new asm.
  • Is there a faster way than RD to do 8x8dct vs 4x4dct mode decision that's still better than SATD? RD takes over an order of magnitude more time than SATD, so it might be useful to have something in the middle.
  • Is there some value to swapping the mode decision metric from SATD to SA8D if we think that the macroblock will use 8x8dct? This has been tried before, but only helped if our guess was extremely good (better than we could get in reality).
  • With trellis 2, can we skip most of CABAC and CAVLC bit cost calculation?
  • How about saving CABAC state between each trellis call, rather than basing them all on the CABAC state at the start of the macroblock?
  • Make subme=11 not do thresholding in qpel RD and bidir RD.
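
For readers unfamiliar with the SATD metric that several of the bullets above refer to, here is a minimal 4x4 sketch: Hadamard-transform the difference block, then sum the absolute coefficients. The function name and the lack of normalization are assumptions of this example (real encoders typically scale the result), and it is not the merged 4x4/8x8 version the bullet asks for.

```c
#include <stdint.h>
#include <stdlib.h>

/* Unnormalized 4x4 SATD: 2-D Hadamard transform of (a - b), followed by
 * the sum of absolute transform coefficients. */
static int satd_4x4(const uint8_t *a, int a_stride,
                    const uint8_t *b, int b_stride)
{
    int d[4][4], t[4][4];
    int sum = 0;

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = a[y * a_stride + x] - b[y * b_stride + x];

    /* Horizontal pass: 4-point Hadamard butterflies on each row. */
    for (int y = 0; y < 4; y++) {
        int s01 = d[y][0] + d[y][1], d01 = d[y][0] - d[y][1];
        int s23 = d[y][2] + d[y][3], d23 = d[y][2] - d[y][3];
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = d01 + d23;
        t[y][3] = d01 - d23;
    }

    /* Vertical pass on each column, accumulating absolute values. */
    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], d01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], d23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum;
}
```

An 8x8 variant repeats the same butterflies with one more stage, which is where the shared work between the 4x4 and 8x8 metrics comes from.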

Psy

  • Psy-RD is a hack. It works, but it's a hack. If you apply QNS with Psy-RD as the metric, it goes way overboard and gives terrible results. This means that Psy-RD only works because normal mode decision is limited in the way it can modify the image to better suit the metric. Is there a way to make it better?
  • Should RD be linear at all? Perhaps we should weight more heavily against low quality blocks and also try to ignore minuscule distortion that viewers can't see.
  • Psy-trellis (and maybe psy-RD?) are too strong at very high QPs.
  • Psy-trellis should be merged with Psy-RD. There are patches for this, but they probably won't be committed until psy-RD itself is fixed.
  • RD should take into account local variance.
  • Lambda should be varied on a per-DCT-block basis instead of a per-macroblock basis.
  • Lambda should be picked independent of quantizer (i.e. with greater precision).
  • Classic problem: a block is mostly high complexity but has a small area of low complexity. How do we judge whether that area is important? Good example: sharp text on background with film grain; grain gets blurred out because of the text.
    • If we think it's important all the time, we ruin the quality of many clips that rely on raising complexity on edges (Touhou).
  • Should motion estimation lambda be as high as it is at very high quantizers? There's some value to capturing "true motion"...
  • Macroblock tree correlates pretty well with visual perception in that its concept of "high complexity" matches well with the visual concept, except for local illumination changes. Talk to Dark Shikari for a patch.

Lookahead

  • Temporal MV predictors in lookahead? There's a patch for these somewhere, but they biased heavily in favor of B-frames, likely by improving the motion search.
  • Should lookahead use variable lambda based on quantizer (esp. due to adaptive quant)? If so, should it take into account estimated ratecontrol quantizer, too? If so, how?
  • B-adapt 1 could be made quite a bit better -- it's important because it's used on all the fast speed modes (and even the defaults). "Harbour 4CIF" is a good example of a clip where it does noticeably badly.

Quantization

  • CAVLC "trellis" is a hack. It works, but it's a hack. Make it better. See the TODOs in encoder/rdo.c.
    • This is doubly important now, as CABAC trellis has been made way faster, but CAVLC hasn't. Many of the CABAC trellis improvements can be backported.
  • There's room for something between trellis and deadzone in terms of complexity. libvpx has a good example -- it biases towards zero-runs in its "medium speed" quantizer. This can't be SIMD'd easily, but is still vastly faster than trellis. A nonlinear quantizer (be more likely to round up larger coefficients) might also be useful.
    • How useful is this with an entropy coder that doesn't really bias towards zero-runs, as in CABAC?
  • Floyd-Steinberg for quantization? Try pushing quantization error to nearby DCT coefficients. Should this go from high to low or low to high?
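  • Energy-preserving quantizer -- maintain L1 (or maybe L2? I'm not sure) energy. Should we maintain it in the spatial domain (post-iDCT) or residual domain? Probably the former.
    • See saintdev's github (https://github.com/saintdev/x264-devel/compare/enquant-base...energy-quant) for one attempt at this.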
  • Decimation is currently just a ripoff of the JVT recommended algorithm. Can we do this more optimally? With RD?
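
To make the deadzone and error-diffusion bullets concrete, here is a rough sketch. Everything in it is an assumption for illustration: the function names, the plain integer step (no H.264 quant tables), and the choice to carry the full error into the next coefficient in whatever order the caller supplies. It says nothing about which scan direction works better.

```c
#include <stdlib.h>

/* Deadzone quantizer: a rounding bias below step/2 widens the zero bin,
 * so small coefficients are more likely to quantize to zero. */
static int quant_deadzone(int coef, int step, int bias)
{
    int level = (abs(coef) + bias) / step;
    return coef < 0 ? -level : level;
}

/* Error-diffusion variant (hypothetical): the quantization error of each
 * coefficient is added to the next coefficient in the given order. */
static void quant_diffuse(const int *coef, int *level, int n,
                          int step, int bias)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int c = coef[i] + carry;
        level[i] = quant_deadzone(c, step, bias);
        carry = c - level[i] * step; /* error left over after dequant */
    }
}
```

With three coefficients of 6, a step of 10 and a bias of 3, the plain deadzone quantizer zeroes all of them, while the diffusing version keeps some of that energy in later coefficients.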

Transform

  • Analyze the error characteristics of the fDCT. Is there any way to make it more accurate without much speed loss? Particularly at extremely low quantizers, this might help.
  • Before forward transform, run a "blocking filter" that acts as the approximate inverse of the deblock filter. See this paper.

Interlacing

  • Lookahead currently blend-deinterlaces to get the lowres. Is this a good idea? Is there something better that isn't much slower?
  • Constrained intra + adaptive MBAFF. Does anyone care about this?
  • PAFF + MBAFF adaptive - PAFF performs better than Adaptive MBAFF on high motion scenes because it can predict from the previous field.

Weighted Prediction

  • Make weightp work with interlacing. Preferably abuse reference duplication to make it useful for MBAFF.
  • Finish K-means decision for weightp. Talk to DylanZA about getting his current patch for this one.
  • Add explicit weighting for B-frames, too. This helps in nonlinear fades, among other cases.
  • Improve weighted prediction analysis to do more searching based on an estimated offset vs scale gradient.
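
As background for the items above, explicit weighted prediction for a single reference applies a scale, a rounding shift and an offset to each predicted sample. The sketch below follows that general form for 8-bit samples; the function names are made up for the example, and it assumes a non-negative scale so that the right shift is well defined.

```c
#include <stdint.h>

/* Clamp to the 8-bit sample range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* dst[i] = clip(((src[i] * scale + round) >> log2_denom) + offset)
 * Assumes scale >= 0 in this sketch. */
static void weight_block(uint8_t *dst, const uint8_t *src, int n,
                         int scale, int log2_denom, int offset)
{
    int round = log2_denom > 0 ? 1 << (log2_denom - 1) : 0;
    for (int i = 0; i < n; i++)
        dst[i] = clip_u8(((src[i] * scale + round) >> log2_denom) + offset);
}
```

A scale of 1 << log2_denom with offset 0 leaves the reference unchanged; a fade is then some mix of a smaller scale and a nonzero offset, which is the tradeoff the last bullet is about.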

Ratecontrol

  • Current per-frame VBV is a hack. It only adapts per row and is O(N^2), where N is the number of rows. An O(N) solution would be able to react more often and thus be more accurate.
  • Make the frame size and row size predictors better. They currently are kind of crappy.
  • Ratecontrol code as a whole is a bit of a mess. It could be improved. There's a lot of cruft left over that is probably not needed now, like qblur.
  • 1-pass ratecontrol often can't adapt fast enough when there are lots of threads (12, 16, 24, etc), especially with smallish VBV buffers. Improve this?
  • 2-pass VBV is actually a bit more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot. This trust is often misplaced if the first pass was a fast one. This should be improved.
    • 2-pass is still better in the case of many threads, due to the above.
  • 2-pass macroblock-tree: if we added the ability to do macroblock-tree on real encoded data, we'd get better results (particularly with repeating patterns and multiref, such as an anime character's mouth moving).
  • Macroblock-tree: make it more psy-aware. Maybe we should cap how much it lowers the quantizer on extremely static scenes? This might tie into the "just-noticeable error" issue in RD.
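
The VBV underflow cases mentioned above can be reasoned about with a simple leaky-bucket model. The sketch below is not x264's ratecontrol; the struct and function names are hypothetical, and it models only a constant fill rate per frame with the buffer capped at its size.

```c
#include <stddef.h>

/* Hypothetical VBV state for this sketch; all quantities are in bits. */
typedef struct {
    double buffer_size;    /* capacity of the buffer */
    double fill;           /* bits currently in the buffer */
    double bits_per_frame; /* bitrate / fps: refill per frame interval */
} vbv_t;

/* Remove one frame from the buffer, then refill for one frame interval.
 * Returns 1 if the frame was larger than what the buffer held. */
static int vbv_update(vbv_t *v, double frame_bits)
{
    int underflow = frame_bits > v->fill;
    v->fill -= frame_bits;
    if (v->fill < 0)
        v->fill = 0;
    v->fill += v->bits_per_frame;
    if (v->fill > v->buffer_size)
        v->fill = v->buffer_size;
    return underflow;
}

/* Count underflows for a sequence of (predicted or real) frame sizes.
 * The state is passed by value, so the caller's copy is not modified. */
static int vbv_count_underflows(vbv_t v, const double *bits, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += vbv_update(&v, bits[i]);
    return count;
}
```

Feeding this first-pass frame sizes and then the sizes actually produced shows how much misplaced trust in a fast first pass can cost: the same plan may be clean on predicted sizes and underflow on real ones.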

GPU

  • Motion estimation?
    • Methods
      • Hierarchical?
      • 2D Wave?
      • Something else?
    • "Easy": lookahead motion estimation
      • Extremely high parallelism, hundreds of frame searches (each with thousands of searches) at once.
    • "Hard": main motion estimation
      • Difficult synchronization issues, not as heavily parallel in terms of number of macroblocks, but far more partition sizes and refs to search.
      • But potentially more useful...
  • Other things?

Other assembly

  • A lot of ARM assembly is done. What's missing is mostly for high bit depth.
  • Altivec assembly is very lacking.

Other CPU optimizations

  • x264 needs more prefetching. How many L1 and L2 cache misses (particularly L1) can we get rid of via smart prefetching in the right places? Warning: this is often hard to benchmark.
  • Different CPUs take different relative times for some functions. Is this enough (particularly across architectures) to justify different encoding settings for different CPUs?
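
As a starting point for the prefetching item, the general pattern is to request a cache line a few iterations before it is needed. The sketch below uses the GCC/Clang `__builtin_prefetch` builtin and falls back to a no-op elsewhere; the function name and the prefetch distance of two rows are arbitrary assumptions, and whether it helps at all has to be measured on the target CPU.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with high temporal locality. */
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Sum all pixels of a plane, prefetching the start of a later row while
 * the current row is being processed. */
static uint64_t sum_plane(const uint8_t *plane, int stride,
                          int width, int height)
{
    const int ahead = 2; /* hypothetical prefetch distance, in rows */
    uint64_t sum = 0;
    for (int y = 0; y < height; y++) {
        if (y + ahead < height)
            PREFETCH(plane + (size_t)(y + ahead) * stride);
        for (int x = 0; x < width; x++)
            sum += plane[(size_t)y * stride + x];
    }
    return sum;
}
```

The result is identical with or without the prefetch, so any benchmark has to look at timing or cache-miss counters rather than output.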

Other features

  • MPEG-2 encoding support
  • Support for SMPTE timecodes
  • Merge speedcontrol
  • Mixed lossless/lossy encoding.
  • Segment re-encoding

x264CLI

  • Finish audio support. Talk to Kovensky about this one.
  • Make the filtering system aware of BT.601 vs BT.709.
  • Use libavfilter instead of duplicating the filters in x264.
  • Add --device support.
  • Add automatic --level restriction support.