[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2
Ronald S. Bultje
Sun Jul 18 20:21:12 CEST 2010
On Sun, Jul 11, 2010 at 2:47 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
> On Sun, 11 Jul 2010, Michael Niedermayer wrote:
>> On Sun, Jul 11, 2010 at 04:52:04PM +0000, Loren Merritt wrote:
>>> On Sun, 11 Jul 2010, Ronald S. Bultje wrote:
>>>> You'll notice that the sse2 is significantly slower here, my rough
>>>> guess is that this is because of my shitty CPU which pretty much
>>>> emulates xmm-ops through mmx-ops, so it doesn't add a lot of benefit
>>>> other than not having to set up the loop for doing the second 8 pixels,
>>>> combined with the added complexity of an 8x16 transpose before the
>>>> actual filter. I'm betting that on an actual sse2-supporting CPU
>>>> (Jason?), this would still be faster, but we might want to put this
>>>> under a FF_MM_SSE2_NOT_SHITTY flag or something along those lines. If
>>>> you think my code is shitty, comments are welcome also. ;-).
>>> Rather than special-casing most of the functions, we at x264 declared
>>> Core1 doesn't have sse2, and changed the cpuid parser accordingly.
>>> If you want to support the few cases where sse2 is slightly faster than
>>> mmx, I recommend picking a different flag for that and applying it only
>>> when you've tested on Core1, so that FF_MM_SSE2 can be trusted to dwim in
>>> the usual case.
>>> --Loren Merritt
>>>  cpuid.c |   14 +++++++++++++-
>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>> 7ba0916766645e2de9330e9ba8f30d815da14c91  cpuid.diff
>> Do we have any float SSE2 code that this could affect negatively?
>> If not, I am OK with this patch.
Attached patch implements FF_MM_SSE2SLOW and FF_MM_SSE3SLOW for this purpose.
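
For readers without the attachment, here is a rough sketch of the idea Loren
describes above: the cpuid parser stops reporting plain SSE2 on CPUs where it
is usually not faster than MMX, and reports a separate "slow" flag instead, so
individual functions opt in only after being benchmarked. This is not the
attached patch. All DEMO_* names and flag values are made up for illustration,
and the "is this CPU slow" decision is passed in as a parameter rather than
derived from real cpuid family/model checks.

```c
/* Hypothetical flag bits for illustration; NOT FFmpeg's actual constants. */
#define DEMO_MM_MMX      0x0001
#define DEMO_MM_SSE2     0x0010
#define DEMO_MM_SSE2SLOW 0x0100  /* SSE2 present, but usually not faster */

enum demo_impl { DEMO_IMPL_C, DEMO_IMPL_MMX, DEMO_IMPL_SSE2 };

/*
 * Post-process raw cpuid flags. If the CPU is known to run SSE2 slowly
 * (assumption: the caller determined this, e.g. from family/model),
 * hide the normal SSE2 bit and expose the "slow" bit instead.
 */
static int demo_fixup_cpu_flags(int raw_flags, int sse2_is_slow)
{
    if ((raw_flags & DEMO_MM_SSE2) && sse2_is_slow) {
        raw_flags &= ~DEMO_MM_SSE2;
        raw_flags |= DEMO_MM_SSE2SLOW;
    }
    return raw_flags;
}

/*
 * Pick an implementation for one function. The SSE2 version is used on
 * "slow" CPUs only if it was benchmarked there and found to be faster
 * (tested_faster_on_slow != 0); otherwise we fall back to MMX or C.
 */
static enum demo_impl demo_pick_impl(int flags, int tested_faster_on_slow)
{
    enum demo_impl impl = DEMO_IMPL_C;

    if (flags & DEMO_MM_MMX)
        impl = DEMO_IMPL_MMX;
    if (flags & DEMO_MM_SSE2)
        impl = DEMO_IMPL_SSE2;
    else if ((flags & DEMO_MM_SSE2SLOW) && tested_faster_on_slow)
        impl = DEMO_IMPL_SSE2;

    return impl;
}
```

With this layout, code that checks only the plain SSE2 bit never selects SSE2
functions on the slow CPUs, and the few functions that are still faster there
can test the slow bit explicitly.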
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 4935 bytes
Desc: not available