[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Ronald S. Bultje
rsbultje at gmail.com
Thu Jul 7 15:51:25 EEST 2016
On Thu, Jul 7, 2016 at 7:38 AM, Michael Niedermayer <michael at niedermayer.cc> wrote:
> On Thu, Jul 07, 2016 at 07:14:43AM -0400, Ronald S. Bultje wrote:
> > Hi,
> > On Thu, Jul 7, 2016 at 7:07 AM, Michael Niedermayer
> <michael at niedermayer.cc>
> > wrote:
> > > On Wed, Jul 06, 2016 at 07:28:27AM -0500, Dan Parrot wrote:
> > > > On Wed, 2016-07-06 at 09:07 +0200, Hendrik Leppkes wrote:
> > > [...]
> > >
> > > >
> > > > One other thing: why didn't this come up when the earlier patch was
> > > > submitted and applied?
> > >
> > > > Community patch review is not a reproducible process: depending on
> > > > who has time and does the review, different things can be found and
> > > > pointed out, and people also have different opinions.
> > > > Real consistency can possibly only be achieved by having an active
> > > > maintainer that does all review ...
> > >
> > > > To be more precise, the other patch was applied due to this comment
> > > IIRC:
> > > "If this patch works (FATE passes on ppc64) and is faster than
> > > the plain c functions then it can be committed as is"
> > How much faster was it?
> There were several benchmarks posted, one is here:
> It also contains some arguments for why the speedup is smaller than on x86.
I don't think these numbers are very convincing...
The arguments, on the other hand, are not facts; they are hunches, so they
are essentially meaningless.
I would suggest reverting the patch (it really didn't go through any solid
review TBH) so that a future contributor who wants to work on #5570 can do
it properly and get real gains. If people want to refer to this thread for
future directions (I can post this in the trac ticket also):
- start with one function. Take a really simple one. Don't do 20 at a time.
Especially if this is your first time writing ppc64 assembly.
- measure speedups on other archs with similar register width. Best
example: measure SSE2 vs. C.
- make sure you're measuring scalar C when measuring the base speed, since
x86 C vs. SSE2 is also scalar C vs. vector SIMD. There might be other
functions being picked up that we don't know about (some altivec is
BE-aware; your compiler might be auto-vectorizing C code).
- optimize your one function. Start with ideas taken from the x86 SSE2
code. Use all things learned from x86 basics (do aligned loads where
possible, limit shuffles/data rearrangements, load constants outside the
loop, etc.).
- measure. Use START/STOP_TIMER, nothing else, around the caller,
with and without -cpuflags 0, and look only at the last reported cycle count.
- make changes. Measure again. Repeat. Do this with all suggestions from
code review also. Your test should be ultra-fast, something that takes 10
seconds but invokes the function millions of times. If unsure, write a test
in checkasm, but usually one invocation from a fate test is good enough.
- if this is your first time writing assembly, you'll get tons of review
comments. This is normal, and we've all been through it. You'll become a
better coder for it, so learn from it, deal with it and keep submitting
patches until it's done. A few years from now, you'll be the expert
reviewer and an even newer contributor will not yet know that he's about to
learn some extremely important lessons from an experienced expert - you.
- once your first few individual functions are in, it may make sense to
submit sets of functions that are somehow related. However, this increases
review load so only do this once we know that you know what you're doing.
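
To make the measure/compare loop above concrete, here is a minimal
standalone sketch. It is not FFmpeg code: START_TIMER/STOP_TIMER live in
libavutil/timer.h and report cycle counts, whereas this sketch assumes a
POSIX system and swaps in clock_gettime() so it builds outside the FFmpeg
tree. The function names (yuy2_to_y_ref, yuy2_to_y_candidate,
outputs_match, ns_per_call, bench_report) and the WIDTH constant are
hypothetical, and the "candidate" is only an unrolled C loop standing in
for a real SIMD version. It checks that both versions agree, then times
many invocations of one simple function.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime(); assumes POSIX */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define WIDTH 1024  /* pixels per row; arbitrary size for this sketch */

typedef void (*row_fn)(uint8_t *dst, const uint8_t *src, int width);

/* Scalar reference: take the luma byte out of each packed YUYV pair. */
static void yuy2_to_y_ref(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[2 * i];
}

/* Hypothetical candidate standing in for a SIMD version: 4x unrolled. */
static void yuy2_to_y_candidate(uint8_t *dst, const uint8_t *src, int width)
{
    int i = 0;
    for (; i + 4 <= width; i += 4) {
        dst[i]     = src[2 * i];
        dst[i + 1] = src[2 * i + 2];
        dst[i + 2] = src[2 * i + 4];
        dst[i + 3] = src[2 * i + 6];
    }
    for (; i < width; i++)  /* leftover pixels */
        dst[i] = src[2 * i];
}

/* Returns 1 when both versions agree on the first `width` pixels. */
static int outputs_match(int width)
{
    static uint8_t src[2 * WIDTH], a[WIDTH], b[WIDTH];

    if (width < 0 || width > WIDTH)
        return 0;
    for (int i = 0; i < 2 * WIDTH; i++)
        src[i] = (uint8_t)(i * 37 + 11);  /* arbitrary test pattern */
    memset(a, 0x00, sizeof(a));
    memset(b, 0xff, sizeof(b));
    yuy2_to_y_ref(a, src, width);
    yuy2_to_y_candidate(b, src, width);
    return memcmp(a, b, (size_t)width) == 0;
}

/* Average wall-clock nanoseconds per call over `calls` invocations. */
static double ns_per_call(row_fn fn, long calls)
{
    static uint8_t src[2 * WIDTH], dst[WIDTH];
    struct timespec t0, t1;
    volatile uint8_t sink = 0;

    assert(fn != NULL);
    if (calls <= 0)
        return 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long n = 0; n < calls; n++) {
        src[0] = (uint8_t)n;  /* vary the input between calls */
        fn(dst, src, WIDTH);
        sink ^= dst[0];       /* keep the result observable */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)calls;
}

/* Verify first, then report both timings. */
static void bench_report(long calls)
{
    if (!outputs_match(WIDTH)) {
        puts("mismatch: fix correctness before measuring");
        return;
    }
    printf("ref:       %.1f ns/call\n", ns_per_call(yuy2_to_y_ref, calls));
    printf("candidate: %.1f ns/call\n",
           ns_per_call(yuy2_to_y_candidate, calls));
}
```

Calling bench_report(1000000) from main() prints one timing line per
version. To keep the baseline scalar, as the third point above asks,
building with GCC's -fno-tree-vectorize is one option; otherwise the
compiler may vectorize both loops and hide the difference. Inside the
FFmpeg tree you would wrap the real call site in START_TIMER/STOP_TIMER
instead of using a harness like this.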