Difference between revisions of "X264 TODO"

Latest revision as of 08:35, 27 March 2019

This page contains an incomplete list of things available in x264 for you to do. It's organized into sections covering various parts of x264.

Some useful resources: Dark Shikari's pile of junk, Pengvado's pile of junk.

If you're interested in doing any of this, drop by #x264dev on Freenode IRC. There are no experience or educational requirements for doing any of this, though you are expected to know how to code.

Bolded features may have companies willing to sponsor or provide bounties. This is not complete either; just because it's not bolded doesn't mean there aren't resources out there. If your company is interested in offering a bounty, drop by IRC.

Motion Estimation

Sequential elimination (SEA), used for exhaustive search, might be more generally applicable to algorithms like UMH, by letting us skip a lot of SADs. The downside is we won't be able to use SAD_X4 anymore.
(T)ESA is currently wrong for motion searches done on weightp duplicates. This effect is minuscule, but it still should be fixed.
Hierarchical motion estimation might be a useful way to catch very long motion vectors without the cost of UMH or ESA. It might also help regularize motion.
- I have a patch for this in the lookahead, but it didn't help much, since it only added predictors.
Somehow take into account the effect of motion vector decision on future blocks.
- Hierarchical motion estimation
- Approximations from lookahead MVs
- Iterative ME (as per Snow)
- Trellis motion estimation
We don't need to check all 11 predictors all the time for 16x16 fullpel motion search.
- But how do we know which ones we can afford to skip, and when?
- Xvid and libtheora have algorithms for this, but the former's is almost surely 100% useless and the latter doesn't seem impressive either.
libtheora does fullpel motion estimation on the source pixels instead of decoded pixels. Does this give a better starting point for the subpel search and discourage "weird" MVs?
With extremely fast encoding settings (subme 0), can we rip off lookahead MVs instead of doing a real search?
- This seems to be awful from my testing, but maybe there's something we can do?
Try sub-8x8 partitions in B-frames. Is it at all useful?
Try bidir motion estimation for fullpel. That is, considering L1's MV when doing L0 (or vice versa). Xvid does this. How much does it help?
Fullpel chroma ME?
- For TESA?
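
For readers new to the sequential elimination item at the top of this section, here is a minimal sketch of the idea. By the triangle inequality, the absolute difference between two block sums is a lower bound on their SAD, so a candidate whose bound is already no better than the best SAD so far can be skipped. This is not x264's ESA code. The function names are hypothetical, the MV cost is ignored, and block sums are recomputed for each candidate, where a real implementation would presumably precompute them.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of a w x h block of 8-bit pixels. */
static int block_sum(const uint8_t *p, int stride, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += p[y * stride + x];
    return sum;
}

/* Plain sum of absolute differences between two w x h blocks. */
static int block_sad(const uint8_t *a, int a_stride,
                     const uint8_t *b, int b_stride, int w, int h)
{
    int sad = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sad += abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sad;
}

/* Exhaustive search of every block position inside a ref_w x ref_h area.
 * With use_sea set, a candidate is skipped when |sum(cur) - sum(cand)|,
 * a lower bound on its SAD, is already >= the best SAD found so far.
 * Returns the best SAD; *sads_done counts the full SADs computed. */
static int search_block(const uint8_t *cur, int cur_stride,
                        const uint8_t *ref, int ref_stride,
                        int ref_w, int ref_h, int bw, int bh,
                        int use_sea, int *sads_done)
{
    int cur_sum = block_sum(cur, cur_stride, bw, bh);
    int best = INT_MAX;
    *sads_done = 0;
    for (int y = 0; y + bh <= ref_h; y++) {
        for (int x = 0; x + bw <= ref_w; x++) {
            const uint8_t *cand = ref + y * ref_stride + x;
            if (use_sea) {
                int bound = abs(cur_sum - block_sum(cand, ref_stride, bw, bh));
                if (bound >= best)
                    continue; /* cannot beat the current best */
            }
            int sad = block_sad(cur, cur_stride, cand, ref_stride, bw, bh);
            (*sads_done)++;
            if (sad < best)
                best = sad;
        }
    }
    return best;
}
```

Because the skip decision depends on the running best, candidates are handled one at a time. That is the reason batching four SADs per call no longer fits.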

Intra Analysis

Make the early terminations smarter. Currently they're just hacks -- some statistical analysis might be useful.
- With the SSSE3-based fast intra analysis, we no longer do any early terminations for different modes, at least in SAD/SATD analysis. But there might still be improvements to be made.
SAD (subme 1) i8x8 vs i4x4 decision is a bit bad. Can it be improved without significant speed loss?

Mode Decision

Can we find more ways to skip more motion searches in multiref?
- A while back, I tried using weaker motion searches on older refs. This helped a bit for speed-vs-compression, but is ironically the opposite of what one wants; older refs will be harder to find good MVs in, and therefore really need better searches.
On extremely fast encoding settings, fast skip is actually kind of slow. But anything dumber (e.g. SAD) is completely useless. Is there some better balance that can be achieved here?
- Can we do something smart by analyzing fenc? It's impossible to tell whether a block is motionless by looking at fdec, but looking at the source pixels is useful. There's still complexity such as lower-QP-than-reference though.
See the TODOs for deblock-aware RD in common/deblock.c.
- I tried correcting weightp references for deblock RDO, but it didn't help.
- I tried chroma, too, and again, it didn't help measurably.
Is there a faster way than SA8D/SATD to do 8x8dct vs 4x4dct mode decision? At very fast settings, the time this uses is nontrivial.
- Doing a merged 4x4/8x8 SATD would help here, but would require new asm.
Is there a faster way than RD to do 8x8dct vs 4x4dct mode decision that's still better than SATD? RD takes over an order of magnitude more time than SATD, so it might be useful to have something in the middle.
Is there some value to swapping the mode decision metric from SATD to SA8D if we think that the macroblock will use 8x8dct? This has been tried before, but only helped if our guess was extremely good (better than we could get in reality).
With trellis 2, can we skip most of CABAC and CAVLC bit cost calculation?
How about saving CABAC state between each trellis call, rather than basing them all on the CABAC state at the start of the macroblock?
Make subme=11 not do thresholding in qpel RD and bidir RD.
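
As background for the SATD and SA8D items above, here is a minimal sketch of a 4x4 SATD: take the pixel differences, apply a 4-point Hadamard transform to rows and then columns, and sum the absolute values. This is plain C for illustration, not x264's asm. The function names are hypothetical and the final halving is a normalization choice assumed here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Unnormalized 4-point Hadamard butterfly. */
static void hadamard4(const int in[4], int out[4])
{
    int s01 = in[0] + in[1], d01 = in[0] - in[1];
    int s23 = in[2] + in[3], d23 = in[2] - in[3];
    out[0] = s01 + s23;
    out[1] = s01 - s23;
    out[2] = d01 + d23;
    out[3] = d01 - d23;
}

/* Sum of absolute Hadamard-transformed differences of two 4x4 blocks. */
static int satd_4x4_sketch(const uint8_t *a, int a_stride,
                           const uint8_t *b, int b_stride)
{
    int rows[4][4];

    /* Horizontal pass: transform each row of the difference block. */
    for (int y = 0; y < 4; y++) {
        int diff[4];
        for (int x = 0; x < 4; x++)
            diff[x] = a[y * a_stride + x] - b[y * b_stride + x];
        hadamard4(diff, rows[y]);
    }

    /* Vertical pass: transform each column and accumulate magnitudes. */
    int sum = 0;
    for (int x = 0; x < 4; x++) {
        int col[4], out[4];
        for (int y = 0; y < 4; y++)
            col[y] = rows[y][x];
        hadamard4(col, out);
        for (int y = 0; y < 4; y++)
            sum += abs(out[y]);
    }
    return sum / 2; /* assumed normalization */
}
```

SA8D follows the same pattern with an 8-point transform on an 8x8 block.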

Psy

Psy-RD is a hack. It works, but it's a hack. If you apply QNS with Psy-RD as the metric, it goes way overboard and gives terrible results. This means that Psy-RD only works because normal mode decision is limited in the way it can modify the image to better suit the metric. Is there a way to make it better?
Should RD be linear at all? Perhaps we should weight more heavily against low quality blocks and also try to ignore minuscule distortion that viewers can't see.
Psy-trellis (and maybe psy-RD?) are too strong at very high QPs.
Psy-trellis should be merged with Psy-RD. There are patches for this, but they probably won't be committed until psy-RD itself is fixed.
RD should take into account local variance.
Lambda should be varied on a per-DCT-block basis instead of a per-macroblock basis.
Lambda should be picked independent of quantizer (i.e. with greater precision).
Classic problem: a block is mostly high complexity but has a small area of low complexity. How do we judge whether that area is important? Good example: sharp text on background with film grain; grain gets blurred out because of the text.
- If we think it's important all the time, we ruin the quality of many clips that rely on raising complexity on edges (Touhou).
Should motion estimation lambda be as high as it is at very high quantizers? There's some value to capturing "true motion"...
Macroblock tree correlates pretty well with visual perception in that its concept of a "high complexity" matches well with the visual concept. Except for local illumination changes. Talk to Dark Shikari for a patch.
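
The "mostly high complexity with a small flat area" problem above can be made concrete with a small sketch. It compares the variance of a whole 16x16 macroblock against the smallest variance among its four 8x8 quadrants. This is only an illustration of the measurement, not a proposed fix, and not x264's variance code. The function names and the choice of 8x8 quadrants are assumptions.

```c
#include <stdint.h>

/* Variance of a w x h block, scaled by the pixel count N:
 * sum(x^2) - sum(x)^2 / N, kept in integers. */
static uint32_t block_var(const uint8_t *p, int stride, int w, int h)
{
    uint32_t sum = 0, sqr = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint32_t v = p[y * stride + x];
            sum += v;
            sqr += v * v;
        }
    }
    return sqr - (uint32_t)(((uint64_t)sum * sum) / (uint32_t)(w * h));
}

/* Smallest scaled variance among the four 8x8 quadrants of a macroblock. */
static uint32_t mb_min_quadrant_var(const uint8_t *mb, int stride)
{
    uint32_t min = UINT32_MAX;
    for (int i = 0; i < 4; i++) {
        const uint8_t *q = mb + (i >> 1) * 8 * stride + (i & 1) * 8;
        uint32_t v = block_var(q, stride, 8, 8);
        if (v < min)
            min = v;
    }
    return min;
}
```

A macroblock with three noisy quadrants and one flat one scores high on the whole-block measure but zero on the minimum-quadrant measure. Deciding which number should drive RD is the open question.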

Lookahead

Temporal MV predictors in lookahead? There's a patch for these somewhere, but they biased heavily in favor of B-frames, likely by improving the motion search.
Should lookahead use variable lambda based on quantizer (esp. due to adaptive quant)? If so, should it take into account estimated ratecontrol quantizer, too? If so, how?
B-adapt 1 could be made quite a bit better -- it's important because it's used on all the fast speed modes (and even the defaults). "Harbour 4CIF" is a good example of a clip where it does noticeably badly.

Quantization

CAVLC "trellis" is a hack. It works, but it's a hack. Make it better. See the TODOs in encoder/rdo.c.
- This is doubly important now, as CABAC trellis has been made way faster, but CAVLC hasn't. Many of the CABAC trellis improvements can be backported.
There's room for something between trellis and deadzone in terms of complexity. libvpx has a good example -- it biases towards zero-runs in its "medium speed" quantizer. This can't be SIMD'd easily, but is still vastly faster than trellis. A nonlinear quantizer (be more likely to round up larger coefficients) might also be useful.
- How useful is this with an entropy coder that doesn't really bias towards zero-runs, as in CABAC?
Floyd-Steinberg for quantization? Try pushing quantization error to nearby DCT coefficients. Should this go from high to low or low to high?
Energy-preserving quantizer -- maintain L1 (or maybe L2? I'm not sure) energy. Should we maintain it in the spatial domain (post-iDCT) or residual domain? Probably the former.
- See saintdev's github for one attempt at this.
Decimation is currently just a ripoff of the JVT recommended algorithm. Can we do this more optimally? With RD?
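
To make the deadzone and Floyd-Steinberg items above more concrete, here is a minimal sketch under stated assumptions. It shows a scalar deadzone quantizer and a variant that carries each coefficient's reconstruction error into the next coefficient in scan order. This is one possible reading of "pushing quantization error to nearby DCT coefficients", not x264 code. The function names, the plain integer step size, and the low-to-high direction are all assumptions.

```c
#include <stdlib.h>

/* Deadzone quantizer: a rounding offset below step/2 biases small
 * coefficients toward zero (offset == step/2 is plain rounding). */
static int quant_deadzone(int coef, int step, int offset)
{
    int level = (abs(coef) + offset) / step;
    return coef < 0 ? -level : level;
}

/* Error-diffusing variant: the error left after dequantizing
 * coefficient i is added to coefficient i + 1 before it is quantized. */
static void quant_diffuse(const int *coef, int *level, int n,
                          int step, int offset)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int target = coef[i] + carry;
        level[i] = quant_deadzone(target, step, offset);
        carry = target - level[i] * step; /* error passed to the next one */
    }
}
```

With a step of 10 and an offset of 3, four coefficients of 6 all quantize to zero on their own, while the diffusing variant keeps some of that energy as nonzero levels.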

Transform

Analyze the error characteristics of the fDCT. Is there any way to make it more accurate without much speed loss? Particularly at extremely low quantizers, this might help.
Before forward transform, run a "blocking filter" that acts as the approximate inverse of the deblock filter. See this paper.

Interlacing

Lookahead currently blend-deinterlaces to get the lowres. Is this a good idea? Is there something better that isn't much slower?
Constrained intra + adaptive MBAFF. Does anyone care about this?
PAFF + MBAFF adaptive - PAFF performs better than Adaptive MBAFF on high motion scenes because it can predict from the previous field.

Weighted Prediction

Make weightp work with interlacing. Preferably abuse reference duplication to make it useful for MBAFF.
Finish K-means decision for weightp. Talk to DylanZA about getting his current patch for this one.
Add explicit weighting for B-frames, too. This helps in nonlinear fades, among other cases.
Improve weighted prediction analysis to do more searching based on an estimated offset vs scale gradient.
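
For reference, the scale and offset mentioned above enter H.264 explicit weighted prediction roughly as in the sketch below: each reference pixel is multiplied by a scale, rounded and shifted by a log2 denominator, offset, and clipped. This is a per-pixel illustration for 8-bit, single-reference prediction, not x264's weight functions. The function names are hypothetical and a non-negative scale is assumed.

```c
#include <stdint.h>

/* Clamp an integer to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Weighted prediction of one pixel from one reference pixel.
 * Assumes scale >= 0 so the right shift is well defined. */
static uint8_t weight_pixel(uint8_t ref, int scale, int log2_denom, int offset)
{
    int v;
    if (log2_denom >= 1)
        v = ((ref * scale + (1 << (log2_denom - 1))) >> log2_denom) + offset;
    else
        v = ref * scale + offset;
    return clip_u8(v);
}
```

With a denominator of 32 (log2_denom of 5), a scale of 32 is the identity and a scale of 16 halves the pixel, which is the kind of pairing a fade analysis has to choose together with the offset.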

Ratecontrol

Current per-frame VBV is a hack. It only adapts per row and is O(N^2), where N is the number of rows. An O(N) solution would be able to react more often and thus be more accurate.
Make the frame size and row size predictors better. They currently are kind of crappy.
Ratecontrol code as a whole is a bit of a mess. It could be improved. There's a lot of cruft left over that is probably not needed now, like qblur.
1-pass ratecontrol often can't adapt fast enough when there are lots of threads (12, 16, 24, etc), especially with smallish VBV buffers. Improve this?
2-pass VBV is actually a bit more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot. This trust is often misplaced if the first pass was a fast one. This should be improved.
- 2-pass is still better in the case of many threads, due to the above.
2-pass macroblock-tree: if we added the ability to do macroblock-tree on real encoded data, we'd get better results (particularly with repeating patterns and multiref, such as an anime character's mouth moving).
Macroblock-tree: make it more psy-aware. Maybe we should cap how much it lowers the quantizer on extremely static scenes? This might tie into the "just-noticeable error" issue in RD.
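
As a reminder of what "underflow" means in the VBV items above, here is a toy leaky-bucket model. Bits arrive at a constant rate, each frame removes its size from the buffer, and an underflow happens when a frame is larger than what the buffer holds. It is a simplified sketch, not x264's ratecontrol. The function name, the per-frame fill step, and the order of fill and drain are assumptions.

```c
/* Returns the index of the first frame that underflows the buffer,
 * or -1 if every frame fits. All sizes are in bits. */
static int vbv_first_underflow(const int *frame_bits, int n_frames,
                               int buffer_size, int initial_fill,
                               int bits_per_frame)
{
    int fill = initial_fill;
    for (int i = 0; i < n_frames; i++) {
        /* Bits arriving during one frame interval, capped at capacity. */
        fill += bits_per_frame;
        if (fill > buffer_size)
            fill = buffer_size;

        /* The frame must be fully present when it is removed. */
        if (frame_bits[i] > fill)
            return i;
        fill -= frame_bits[i];
    }
    return -1;
}
```

In this model, better frame size predictions matter because the check can only be run on a predicted size before the frame is encoded.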

GPU

Motion estimation?
- Methods
  - Hierarchical?
  - 2D Wave?
  - Something else?
- "Easy": lookahead motion estimation
  - Extremely high parallelism, hundreds of frame searches (each with thousands of searches) at once.
- "Hard": main motion estimation
  - Difficult synchronization issues, not as heavily parallel in terms of number of macroblocks, but far more partition sizes and refs to search.
  - But potentially more useful...
Other things?

Other assembly

A lot of the ARM assembly is done. What's missing is mostly for high bit depth.
Altivec assembly is very lacking.

Other CPU optimizations

x264 needs more prefetching. How many L1 and L2 cache misses (particularly L1) can we get rid of via smart prefetching in the right places? Warning: this is often hard to benchmark.
Different CPUs take different relative times for some functions. Is this enough (particularly across architectures) to justify different encoding settings for different CPUs?

Other features

MPEG-2 encoding support
- x262
Support for SMPTE timecodes
Merge speedcontrol
Mixed lossless/lossy encoding.
Segment re-encoding

x264CLI

Finish audio support. Talk to Kovensky about this one.
Make the filtering system aware of BT.601 vs BT.709.
Use libavfilter instead of duplicating the filters in x264.
Add --device support.
Add automatic --level restriction support.

@@ Line 1: / Line 1: @@
+{{Lowercase}}
+{{Back to|Category:x264}}
 This page contains an incomplete list of things available in x264 for you to do. It's organized into sections covering various parts of x264.
@@ Line 11: / Line 13: @@
 *Sequential elimination (SEA), used for exhaustive search, might be more generally applicable to algorithms like UMH, by letting us skip a lot of SADs. The downside is we won't be able to use SAD_X4 anymore.
 *(T)ESA is currently wrong for motion searches done on weightp duplicates. This effect is miniscule, but it still should be fixed.
 *Hierarchical motion estimation might be a useful way to catch very long motion vectors without the cost of UMH or ESA. It might also help regularize motion.
+**I have a patch for this in the lookahead, but it didn't help much, since it only added predictors.
 *Somehow take into account the effect of motion vector decision on future blocks.
 **Hierarchical motion estimation
@@ Line 22: / Line 25: @@
 *libtheora does fullpel motion estimation on the source pixels instead of decoded pixels. Does this give a better starting point for the subpel search and discourage "weird" MVs?
 *With extremely fast encoding settings (subme 0), can we rip off lookahead MVs instead of doing a real search?
+**This seems to be awful from my testing, but maybe there's something we can do?
 *Try sub-8x8 partitions in B-frames. Is it at all useful?
 *Try bidir motion estimation for fullpel. That is, considering L1's MV when doing L0 (or vice versa). Xvid does this. How much does it help?
@@ Line 29: / Line 33: @@
 === Intra Analysis ===
 *Make the early terminations smarter. Currently they're just hacks -- some statistical analysis might be useful.
+**With the SSSE3-based fast intra analysis, we no longer do any early terminations for different modes, at least in SAD/SATD analysis.  But there might still be improvements to be made.
 *SAD (subme 1) i8x8 vs i4x4 decision is a bit bad. Can it be improved without significant speed loss?
 === Mode Decision ===
 *Can we find more ways to skip more motion searches in multiref?
-*On extremely fast encoding settings, fast skip is actually kind of slow. But anything dumber (e.g. SAD) is completely useless. Is there some better balance that can be achieved here?
+**A while back, I tried using weaker motion searches on older refs.  This helped a bit for speed-vs-compression, but is ironically the opposite of what one wants; older refs will be harder to find good MVs in, and therefore really need better searches.
-*See the TODOs for deblock-aware RD in common/deblock.c.
+*On extremely fast encoding settings, fast skip is actually kind of slow. But anything dumber (e.g. SAD) is completely useless. Is there some better balance that can be achieved here?
-*Is there a faster way than SA8D/SATD to do 8x8dct vs 4x4dct mode decision? At very fast settings, the time this uses is nontrivial.
+**Can we do something smart by analyzing fenc?  It's impossible to tell whether a block is motionless by looking at fdec, but looking at the source pixels is useful.  There's still complexity such as lower-QP-than-reference though.
+*See the TODOs for deblock-aware RD in common/deblock.c.
+**I tried correcting weightp references for deblock RDO, but it didn't help.
+**I tried chroma, too, and again, it didn't help measurably.
+*Is there a faster way than SA8D/SATD to do 8x8dct vs 4x4dct mode decision? At very fast settings, the time this uses is nontrivial.
+**Doing a merged 4x4/8x8 SATD would help here, but would require new asm.
 *Is there a faster way than RD to do 8x8dct vs 4x4dct mode decision that's still better than SATD? RD takes over an order of magnitude more time than SATD, so it might be useful to have something in the middle.
 *Is there some value to swapping the mode decision metric from SATD to SA8D if we think that the macroblock will use 8x8dct? [http://akuvian.org/src/x264/x264_dct8_guess.diff This has been tried before], but only helped if our guess was extremely good (better than we could get in reality).
 *With trellis 2, can we skip most of CABAC and CAVLC bit cost calculation?
-*How about a "brute force" mode decision that takes no shortcuts (no early ref termination in p8x8, no SATD thresholds, etc)?
+*How about saving CABAC state between each trellis call, rather than basing them all on the CABAC state at the start of the macroblock?
+*Make subme=11 not do thresholding in qpel RD and bidir RD.
 === Psy ===
@@ Line 59: / Line 70: @@
 === Lookahead ===
-*'''Lookahead should be multithreaded, either by splitting the frame (sliced threads) or running multiple frame analysis calls at once.'''
 *Temporal MV predictors in lookahead? There's a patch for these somewhere, but they biased heavily in favor of B-frames, likely by improving the motion search.
 *Should lookahead use variable lambda based on quantizer (esp. due to adaptive quant)? If so, should it take into account estimated ratecontrol quantizer, too? If so, how?
+*B-adapt 1 could be made quite a bit better -- it's important because it's used on all the fast speed modes (and even the defaults).  "Harbour 4CIF" is a good example of a clip where it does noticeably badly.
 === Quantization ===
 *CAVLC "trellis" is a hack. It works, but it's a hack. Make it better. See the TODOs in encoder/rdo.c.
-*There's room for something between trellis and deadzone in terms of complexity. libvpx has a good example -- it biases towards zero-runs in its "medium speed" quantizer. This can't be SIMD'd easily, but is still vastly faster than trellis. A nonlinear quantizer (be more likely to round up larger coefficients) might also be useful.
+**This is doubly important now, as CABAC trellis has been made way faster, but CAVLC hasn't.  Many of the CABAC trellis improvements can be backported.
+*There's room for something between trellis and deadzone in terms of complexity. libvpx has a good example -- it biases towards zero-runs in its "medium speed" quantizer. This can't be SIMD'd easily, but is still vastly faster than trellis. A nonlinear quantizer (be more likely to round up larger coefficients) might also be useful.
+**How useful is this with an entropy coder that doesn't really bias towards zero-runs, as in CABAC?
 *Floyd-Steinberg for quantization? Try pushing quantization error to nearby DCT coefficients. Should this go from high to low or low to high?
 *Energy-preserving quantizer -- maintain L1 (or maybe L2? I'm not sure) energy. Should we maintain it in the spatial domain (post-iDCT) or residual domain? Probably the former.
+**See [https://github.com/saintdev/x264-devel/compare/enquant-base...energy-quant saintdev's github] for one attempt at this.
 *Decimation is currently just a ripoff of the JVT recommended algorithm. Can we do this more optimally? With RD?
@@ Line 91: / Line 105: @@
 === Ratecontrol ===
-*VBV might be able to utilize the ability to re-encode a row of the frame for improved accuracy.
-**Maybe re-encode everything in case of an underflow that row-reencoding can't fix? This might be better than underflowing.
 *Current per-frame VBV is a hack. It only adapts per row and is O(N^2), where N is the number of rows. An O(N) solution would be able to react more often and thus be more accurate.
 *Make the frame size and row size predictors better. They currently are kind of crappy.
 *Ratecontrol code as a whole is a bit of a mess. It could be improved. There's a lot of cruft left over that is probably not needed now, like qblur.
-*2-pass VBV is actually more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot. This trust is often misplaced if the first pass was a fast one. This should be improved.
+*1-pass ratecontrol often can't adapt fast enough when there are lots of threads (12, 16, 24, etc), especially with smallish VBV buffers.  Improve this?
-*2-pass macroblock-tree: if we added the ability to do macroblock-tree on real encoded data, we'd get better results (particularly with repeating patterns and multiref, such as an anime character's mouth moving).
+*2-pass VBV is actually a bit more likely to underflow than 1-pass because it doesn't adapt as aggressively and trusts first pass data a lot. This trust is often misplaced if the first pass was a fast one. This should be improved.
+**2-pass is still better in the case of many threads, due to the above.
+*2-pass macroblock-tree: if we added the ability to do macroblock-tree on real encoded data, we'd get better results (particularly with repeating patterns and multiref, such as an anime character's mouth moving).
 *Macroblock-tree: make it more psy-aware. Maybe we should cap how much it lowers the quantizer on extremely static scenes? This might tie into the "just-noticeable error" issue in RD.
@@ Line 113: / Line 127: @@
 ***But potentially more useful...
 *Other things?
-=== x86 assembly ===
-*Optimize more for the Phenom.
-*Yell at holger to commit his local patches.
-*Make a merged SA8D/SATD for the 8x8dct mode decision, since the two share most of their calculation. Hadamard_ac already does this, but slightly differently.
 === Other assembly ===
-*NEON assembly is nowhere near complete.
+* A lot of ARM assembly is done. Missing is mostly for Hi-Depth bitrate.
-**Chroma MC needs to be rewritten for NV12 support.
+* Altivec assembly is very lacking.
-*Altivec assembly is very lacking.
-*SPARC VIS assembly is only available when high bit-depth is disabled.
 === Other CPU optimizations ===
@@ Line 134: / Line 140: @@
 === Other features ===
 *MPEG-2 encoding support
-*VP8 encoding support
+**[https://github.com/kierank/x262/wiki/TODO x262]
-*'''4:2:2 colorspace support'''
 *Support for SMPTE timecodes
 *Merge speedcontrol
 *Mixed lossless/lossy encoding.
+*Segment re-encoding
 === x264CLI ===
 *Finish audio support. Talk to Kovensky about this one.
-*Make the filtering system aware of fullrange vs TV range.
+*Make the filtering system aware of BT.601 vs BT.709.
-*Make the filtering system aware of BT.601 vs BT.709.
+*Use libavfilter instead of duplicating the filters in x264.
-*Add more filters.
-**Deinterlacers (YADIF).
-**Denoisers (HQDN3D?).
-**IVTC, decomb?
-*Merge L-SMASH mp4 muxer.
-*Add TS muxing support using HRD. Talk to kierank about this one.
 *Add --device support.
 *Add automatic --level restriction support.
-=== SOCIS x264 Profile ===
-CPU: ARM V7 PMNC, speed 0 MHz (estimated)
-Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
-<pre>samples  %        image name               symbol name
+[[Category:x264]]
-     17.8387  x264                     mc_chroma
-      5.7221  x264                     x264_pixel_avg2_w16_neon
-      4.9438  x264                     x264_me_search_ref
-      4.9274  x264                     refine_subpel
-      4.5492  x264                     x264_quant_4x4_trellis
-      3.8166  x264                     x264_pixel_avg2_w8_neon
-      3.6795  x264                     x264_pixel_satd_8x4
-      3.5791  x264                     get_ref_neon
-      2.3915  x264                     x264_pixel_sad_16x16_neon
-      2.0554  x264                     x264_macroblock_encode
-      1.9896  x264                     x264_macroblock_analyse
-       1.4744  x264                     x264_satd_8x4v_8x8h_neon
-       1.4250  x264                     x264_rd_cost_mb
-       1.3227  x264                     x264_satd_8x8_neon
-       1.2405  x264                     x264_pixel_sad_x4_16x16_neon
-       1.2277  x264                     x264_pixel_sad_x4_8x8_neon
-       1.1656  x264                     x264_pixel_satd_4x4_neon
-       1.1583  x264                     x264_macroblock_cache_load_progressive
-       1.1565  x264                     x264_pixel_satd_4x4
-       1.0980  x264                     x264_pixel_sad_8x8_neon
-       1.0432  x264                     x264_quant_4x4_neon
-       0.9646  x264                     x264_slicetype_mb_cost
-       0.9135  x264                     x264_mb_predict_mv
-       0.8989  x264                     x264_pixel_satd_8x8_neon
-       0.8642  x264                     x264_mb_analyse_intra
-       0.8605  x264                     x264_mb_encode_8x8_chroma
-       0.8295  x264                     x264_pixel_sad_x3_16x16_neon
-       0.8258  x264                     x264_macroblock_tree_propagate
-       0.7381  x264                     x264_pixel_sad_x3_8x8_neon
-       0.7217  x264                     x264_satd_16x4_neon
-       0.7052  libc-2.9.so              /lib/libc-2.9.so
-       0.6705  x264                     x264_mb_predict_mv_ref16x16
-       0.6449  x264                     x264_sub8x4_dct_neon
-       0.5645  x264                     x264_macroblock_cache_save
-       0.5353  x264                     x264_hadamard_ac_8x8_neon
-       0.5134  x264                     x264_analyse_update_cache
-       0.5116  x264                     x264_mb_analyse_inter_b8x8_mixed_ref
-       0.5042  x264                     x264_slice_write
-       0.4933  x264                     x264_cabac_encode_decision_c
-       0.4878  x264                     x264_mb_analyse_inter_b16x16
-       0.4622  x264                     x264_cabac_mb_mvd
-       0.4293  x264                     block_residual_write_cabac
-       0.4074  x264                     x264_decimate_score16
-       0.3910  x264                     __aeabi_fdiv
-       0.3818  x264                     x264_pixel_var2_8x8_neon
-       0.3727  x264                     __aeabi_fadd
-       0.3672  x264                     deblock_strength_c
-       0.3599  x264                     x264_pixel_avg2_w20_neon
-       0.3599  x264                     x264_pixel_avg_w16_neon
-       0.3416  x264                     x264_mc_copy_w16_aligned_neon
-       0.3362  x264                     x264_cabac_mb_type
-       0.3307  x264                     load_deinterleave_8x8x2_fenc
-       0.3252  x264                     x264_macroblock_write_cabac
-       0.3234  x264                     x264_mb_mc_01xywh
-       0.3069  x264                     x264_mb_mc_0xywh
-       0.3051  x264                     mc_luma_neon
-       0.3015  x264                     x264_pixel_ssd_16x16_neon
-       0.2941  x264                     x264_quant_dc_trellis
-       0.2905  x264                     x264_pixel_avg_w8_neon
-       0.2850  x264                     x264_mc_copy_w8_neon
-       0.2850  x264                     x264_pixel_satd_16x16_neon
-       0.2832  x264                     x264_mb_analyse_inter_p16x16
-       0.2759  x264                     x264_pixel_ssd_8x8_neon
-       0.2740  x264                     memcpy_aligned_8_16_neon
-       0.2704  x264                     __aeabi_fmul
-       0.2613  x264                     x264_mb_encode_i4x4
-       0.2448  x264                     mbtree_propagate_cost
-       0.2265  x264                     x264_pixel_sad_8x16_neon
-       0.2046  x264                     x264_mc_weight_w16_offsetsub_neon
-       0.2028  x264                     x264_pixel_sad_x4_8x16_neon
-       0.2010  x264                     x264_mb_predict_mv_direct16x16
-       0.1973  x264                     x264_frame_init_lowres_core_neon
-        0.1790  x264                     x264_plane_copy_interleave_c
-        0.1699  x264                     block_residual_write_cabac
-        0.1663  x264                     __floatsisf
-        0.1608  x264                     x264_mb_mc
-        0.1589  x264                     x264_cabac_mb_ref
-        0.1589  x264                     x264_dequant_4x4_neon
-        0.1571  x264                     __aeabi_l2f
-        0.1571  x264                     x264_satd_4x8_8x4_end_neon
-        0.1553  x264                     x264_pixel_var_16x16_neon
-        0.1498  x264                     store_interleave_8x8x2
-        0.1498  x264                     x264_ratecontrol_mb_qp
-        0.1462  x264                     x264_pixel_sad_x4_16x8_neon
-        0.1443  x264                     x264_mb_predict_mv_16x16
-        0.1443  x264                     x264_pixel_sad_16x8_neon
-        0.1389  x264                     x264_mc_weight_w8_offsetsub_neon
-        0.1315  x264                     x264_prefetch_fenc_arm
-        0.1297  x264                     x264_mb_analyse_b_rd
-        0.1279  x264                     x264_add8x4_idct_neon
-        0.1279  x264                     x264_frame_deblock_row
-        0.1279  x264                     x264_pixel_sad_x3_8x16_neon
-        0.1242  x264                     x264_pixel_hadamard_ac_16x16_neon
-        0.1114  x264                     x264_coeff_last16_neon
-        0.1041  x264                     deblock_v_chroma_c
-        0.1023  x264                     x264_predict_16x16_h_c
-        0.1005  x264                     x264_predict_4x4_hd_c
-        0.0987  x264                     x264_predict_8x8_vr_c
-        0.0968  x264                     x264_mb_encode_i16x16
-        0.0968  x264                     x264_predict_4x4_vl_c
-        0.0950  x264                     x264_hpel_filter_c_neon
-        0.0950  x264                     x264_hpel_filter_v_neon
-        0.0950  x264                     x264_predict_4x4_vr_c
-        0.0950  x264                     x264_predict_8x8_filter_c
-        0.0913  x264                     x264_mb_analyse_intra_chroma
-        0.0913  x264                     x264_mb_analyse_p_rd
-        0.0913  x264                     x264_ratecontrol_mb
-        0.0895  x264                     x264_mb_mc_8x8
-        0.0895  x264                     x264_predict_8x8_hd_c
-        0.0859  x264                     memcpy_aligned_16_16_neon
-        0.0859  x264                     x264_mb_mc_1xywh
-        0.0840  x264                     x264_pixel_satd_16x8_neon
-        0.0822  x264                     x264_cabac_encode_terminal_c
-        0.0822  x264                     x264_cabac_mb_mvd
-        0.0822  x264                     x264_pixel_sad_x3_16x8_neon
-        0.0822  x264                     x264_zigzag_scan_4x4_frame_neon
-        0.0804  x264                     deblock_h_chroma_c
-        0.0804  x264                     x264_predict_8x8_vl_c
-        0.0804  x264                     x264_predict_8x8c_p_neon
-        0.0786  x264                     x264_pixel_satd_16x16
-        0.0786  x264                     x264_predict_8x8c_dc_c
-        0.0749  x264                     x264_macroblock_deblock_strength
-        0.0731  x264                     x264_copy_column8
-        0.0731  x264                     x264_memcpy_aligned_neon
-        0.0694  x264                     x264_predict_8x8_ddl_c
-        0.0694  x264                     x264_predict_8x8_ddr_c
-        0.0676  x264                     x264_mb_analyse_inter_b8x16
-        0.0676  x264                     x264_mc_copy_w16_neon
-        0.0658  x264                     x264_deblock_h_luma_neon
-        0.0658  x264                     x264_predict_16x16_v_c
-        0.0658  x264                     x264_predict_4x4_hu_c
-        0.0658  x264                     x264_sub4x4_dct_neon
-        0.0658  x264                     x264_sub8x8_dct_dc_neon
-        0.0639  x264                     x264_intra_satd_x3_4x4
-        0.0639  x264                     x264_mb_analyse_inter_b16x8
-        0.0639  x264                     x264_me_refine_bidir_satd
-        0.0639  x264                     x264_pixel_satd_4x8_neon
-        0.0621  x264                     x264_cabac_mb_type
-        0.0621  x264                     x264_predict_16x16_dc_c
-        0.0603  x264                     x264_hpel_filter_h_neon
-        0.0603  x264                     x264_pixel_satd_8x16_neon
-        0.0585  x264                     x264_frame_expand_border_lowres
-        0.0566  x264                     x264_predict_4x4_ddr_armv6
-        0.0530  x264                     x264_macroblock_probe_skip
-        0.0530  x264                     x264_mc_weight_w8_neon
-        0.0530  x264                     x264_predict_16x16_p_neon
-        0.0512  x264                     memcpy_aligned_8_8_neon
-        0.0512  x264                     x264_mb_predict_mv_pskip
-        0.0512  x264                     x264_predict_8x8_hu_c
-        0.0512  x264                     x264_sub16x16_dct_neon
-        0.0493  x264                     x264_me_refine_qpel_refdupe
-        0.0493  x264                     x264_pixel_avg_w4_neon
-        0.0475  x264                     __fixsfsi
-        0.0475  x264                     x264_add4x4_idct_neon
-        0.0475  x264                     x264_cabac_mb_ref
-        0.0475  x264                     x264_intra_satd_x3_8x8c
-        0.0438  x264                     x264_intra_satd_x3_16x16
-        0.0438  x264                     x264_pixel_satd_8x4_neon
-        0.0438  x264                     x264_quant_2x2_dc_neon
-        0.0420  x264                     x264_weight_cost_luma
-        0.0402  x264                     x264_predict_8x8c_h_c
-        0.0384  x264                     __aeabi_fcmpgt
-        0.0384  x264                     x264_predict_4x4_dc_c
-        0.0365  x264                     x264_ac_energy_mb
-        0.0365  x264                     x264_slicetype_frame_cost
-        0.0347  x264                     x264_cabac_encode_bypass_c
-        0.0347  x264                     x264_deblock_v_luma_neon
-        0.0329  x264                     memcpy_aligned_16_8_neon
-        0.0311  x264                     x264_decimate_score15
-        0.0311  x264                     x264_intra_rd
-        0.0292  x264                     x264_cabac_mb_skip
-        0.0292  x264                     x264_var_end
-        0.0274  x264                     __cmpsf2
-        0.0274  x264                     x264_predict_4x4_h_c
-        0.0256  x264                     x264_pixel_avg_8x8_neon
-        0.0256  x264                     x264_pixel_avg_weight_w16_add_add_neon
-        0.0238  x264                     deblock_v_luma_intra_c
-        0.0238  x264                     x264_frame_expand_border
-        0.0238  x264                     x264_mc_weight_w8_offsetadd_neon
-        0.0238  x264                     x264_predict_4x4_ddl_neon
-        0.0219  x264                     x264_frame_expand_border_filtered
-        0.0219  x264                     x264_memzero_aligned_neon
-        0.0219  x264                     x264_pixel_var_8x8_neon
-        0.0201  x264                     x264_mb_cache_mv_b16x8
-        0.0201  x264                     x264_predict_4x4_v_c
-        0.0183  x264                     x264_cabac_encode_ue_bypass
-        0.0183  x264                     x264_macroblock_cache_load_neighbours_deblock
-        0.0164  x264                     idct_dequant_2x2_dconly
-        0.0164  x264                     x264_mb_analyse_transform_rd
-        0.0164  x264                     x264_pixel_avg_weight_w8_add_add_neon
-        0.0164  x264                     x264_predict_4x4_dc_armv6
-        0.0146  x264                     x264_prefetch_ref_arm
-        0.0128  x264                     x264_coeff_last15_neon
-        0.0128  x264                     x264_prefetch_fenc
-        0.0110  x264                     x264_add8x8_idct_dc_neon
-        0.0110  x264                     x264_add8x8_idct_neon
-        0.0110  x264                     x264_pixel_avg_16x16_neon
-        0.0110  x264                     x264_predict_8x8c_dc_neon
-        0.0110  x264                     x264_weight_scale_plane
-        0.0091  x264                     __aeabi_cfrcmple
-        0.0091  x264                     x264_adaptive_quant_frame
-        0.0091  x264                     x264_macroblock_tree_finish
-        0.0091  x264                     x264_mb_cache_mv_b8x16
-        0.0091  x264                     x264_pixel_avg_4x4_neon
-        0.0091  x264                     x264_predict_8x8c_v_c
-        0.0073  x264                     __aeabi_ui2f
-        0.0073  x264                     deblock_h_chroma_intra_c
-        0.0073  x264                     deblock_h_luma_intra_c
-        0.0073  x264                     x264_fdec_filter_row
-        0.0073  x264                     x264_frame_init_lowres
-        0.0073  x264                     x264_predict_16x16_h_neon
-        0.0073  x264                     x264_predict_4x4_h_armv6
-        0.0055  x264                     __aeabi_cfcmple
-        0.0055  x264                     __divdf3
-        0.0055  x264                     deblock_v_chroma_intra_c
-        0.0055  x264                     x264_dequant_4x4_dc_neon
-        0.0055  x264                     x264_encoder_encode
-        0.0055  x264                     x264_frame_filter
-        0.0055  x264                     x264_nal_escape_c
-        0.0055  x264                     x264_pixel_avg_8x16_neon
-        0.0055  x264                     x264_predict_16x16_dc_neon
-        0.0055  x264                     x264_rc_analyse_slice
-        0.0037  libpthread-2.9.so        /lib/libpthread-2.9.so
-        0.0037  x264                     __subsf3
-        0.0037  x264                     x264_analyse_init_costs
-        0.0037  x264                     x264_coeff_last4_arm
-        0.0037  x264                     x264_encoder_frame_end
-        0.0037  x264                     x264_predict_16x16_dc_top_neon
-        0.0037  x264                     x264_quant_4x4_dc_neon
-        0.0037  x264                     x264_sub8x8_dct_neon
-        0.0037  x264                     x264_weight_cost_init_luma
-        0.0018  libm-2.9.so              /lib/libm-2.9.so
-        0.0018  x264                     __aeabi_d2f
-        0.0018  x264                     __aeabi_f2d
-        0.0018  x264                     __aeabi_fcmplt
-        0.0018  x264                     __aeabi_uidivmod
-        0.0018  x264                     __cmpdf2
-        0.0018  x264                     __divdi3
-        0.0018  x264                     __muldf3
-        0.0018  x264                     __udivdi3
-        0.0018  x264                     bs_write_ue_big
-        0.0018  x264                     hpel_filter_neon
-        0.0018  x264                     optimize_chroma_dc
-        0.0018  x264                     x264_add16x16_idct_dc_neon
-        0.0018  x264                     x264_dct4x4dc_neon
-        0.0018  x264                     x264_frame_copy_picture
-        0.0018  x264                     x264_frame_push_unused
-        0.0018  x264                     x264_free
-        0.0018  x264                     x264_macroblock_cache_mv_4_2
-        0.0018  x264                     x264_macroblock_slice_init
-        0.0018  x264                     x264_pixel_avg_4x8_neon
-        0.0018  x264                     x264_pixel_avg_8x4_neon
-        0.0018  x264                     x264_predict_16x16_v_neon</pre>