[sldev] Optimizing OpenJPEG (oprofile kicks ass)

Stefan Westerfeld stefan at space.twc.de
Thu Mar 29 10:05:27 PDT 2007


   Hi!

On Thu, Mar 29, 2007 at 10:26:08AM -0500, Callum Lerwick wrote:
> So a bunch of memcpy-ing tops the list (wonder where that's coming
> from), followed by OpenJPEG t1_decode_cblks as expected, then the i915
> drivers, then the OpenJPEG dwt, followed by memset and the i915 drivers
> again.
> 
> Let's take a closer look at t1_decode_cblks:
> 
>                :  /* Changed by Dmitry Kolyadin */
>    673  0.1498 :  for (j = 0; j <= h; j++) {
>  27823  6.1940 :     for (i = 0; i <= w; i++) {
>    144  0.0321 :        t1->flags[j][i] = 0;
>                :     }
>                :  }
>                :
>                :  /* Changed by Dmitry Kolyadin */
>   2103  0.4682 :  for (i = 0; i < w; i++) {
> 156170 34.7666 :     for (j = 0; j < h; j++){
>  52543 11.6971 :        t1->data[j][i] = 0;
>                :     }
>                :  }
> 
> I don't know what Dmitry Kolyadin was trying to accomplish, but for some
> reason that second loop is the opposite way around and you can see how
> it thrashes the cache. And look at what it's doing. The t1 is spending an
> awful lot of time JUST ZEROING ARRAYS! What the hell??
> 
> Let's flip that second loop around and let gcc4's autovectorizer loose on
> it:
> 
> gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=pentium3
> -fasynchronous-unwind-tables -ftree-vectorize
> -ftree-vectorizer-verbose=5 -ffast-math -fPIC -Ilibopenjpeg -c
> libopenjpeg/t1.c -o libopenjpeg/t1.o
> 
> libopenjpeg/t1.c:659: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:659: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:666: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:666: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:1057: note: vectorized 2 loops in function.
> 
> And see what that gets us:
> 
> samples  %        linenr info                 image name               symbol name
> -------------------------------------------------------------------------------
> 1032663  20.3752  (no location information)   libc-2.5.so              memcpy
> 439716    8.6759  t1.c:1001                   libopenjpeg.so.1.0.0     t1_decode_cblks
> 321558    6.3446  intel_tex.c:754             i915_dri.so              intelUploadTexImages
> 271098    5.3490  dwt.c:524                   libopenjpeg.so.1.0.0     dwt_decode_real
> 252458    4.9812  t_vb_lighttmp.h:239         i915_dri.so              light_rgba
> 228712    4.5127  t_vb_lighttmp.h:239         i915_dri.so              light_rgba_material
> 170216    3.3585  dwt.c:181                   libopenjpeg.so.1.0.0     dwt_interleave_v
> 147816    2.9165  dwt.c:285                   libopenjpeg.so.1.0.0     dwt_decode_1_real
> 138798    2.7386  tcd.c:1231                  libopenjpeg.so.1.0.0     tcd_decode_tile
> 99387     1.9610  mct.c:111                   libopenjpeg.so.1.0.0     mct_decode_real
> 88111     1.7385  (no location information)   libc-2.5.so              memset
> 74694     1.4738  light.c:599                 i915_dri.so              _mesa_update_material
> 
>                :  /* Changed by Dmitry Kolyadin */
>   1589  0.5284 :  for (j = 0; j <= h; ++j) {
>   4952  1.6466 :     for (i = 0; i <= w; ++i) {
>  14814  4.9258 :        t1->flags[j][i] = 0;
>                :     }
>                :  }
>                :
>                :  /* Changed by Dmitry Kolyadin */
>   5198  1.7284 :  for (j = 0; j < h; ++j) {
>  21078  7.0086 :     for (i = 0; i < w; ++i) {
>  23117  7.6866 :        t1->data[j][i] = 0;
>                :     }
>                :  }

Using memset() should also do the trick.
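
A minimal sketch of what I mean, assuming the rows are plain int arrays
with a fixed row length. MAX_DIM, t1_sketch_t and t1_sketch_clear are
stand-ins for illustration, not the real OpenJPEG declarations. As long
as only the w x h corner of a larger array is in use, memset() has to go
row by row:

```c
#include <string.h>

#define MAX_DIM 64  /* hypothetical; the real static arrays are much larger */

typedef struct {
    int data[MAX_DIM][MAX_DIM];
    int flags[MAX_DIM + 2][MAX_DIM + 2];
} t1_sketch_t;      /* stand-in for the real t1 struct */

/* Zero the used corner of both arrays with one memset() per row, so that
   memory is touched in row-major order. The row and column counts match
   the loops quoted above: (h+1) x (w+1) for flags, h x w for data. */
static void t1_sketch_clear(t1_sketch_t *t1, int w, int h)
{
    int j;

    for (j = 0; j <= h; j++)
        memset(t1->flags[j], 0, (size_t)(w + 1) * sizeof t1->flags[j][0]);

    for (j = 0; j < h; j++)
        memset(t1->data[j], 0, (size_t)w * sizeof t1->data[j][0]);
}
```

If the arrays were sized exactly to the block, a single memset() over
the whole array would do.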

> Nice. Our hot spot has moved down here:
> 
>     70  0.0233 :  w = tilec->x1 - tilec->x0;
>     51  0.0170 :  if (tcp->tccps[compno].qmfbid == 1) {
>     73  0.0243 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
>   6770  2.2511 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
>    841  0.2796 :           tilec->data[x + i + (y + j) * w] = t1->data[j][i]/2;
>                :        }
>                :     }
>                :  } else {    /* if (tcp->tccps[compno].qmfbid == 0) */
>    447  0.1486 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
>  79057 26.2872 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
>  28888  9.6055 :           if (t1->data[j][i] >> 1 == 0) {
>   2348  0.7807 :              tilec->data[x + i + (y + j) * w] = 0;
>                :           } else {
>    405  0.1347 :              double tmp = (double)((t1->data[j][i] << 12) * band->stepsize);
>   5086  1.6911 :              int tmp2 = ((int) (floor(fabs(tmp)))) + ((int) floor(fabs(tmp*2))%2);
>    626  0.2082 :              tilec->data[x + i + (y + j) * w] = ((tmp<0)?-tmp2:tmp2);
>                :           }
> 
> Which is a bit more sensible. I guess. t1->flags and t1->data are huge
> static 1024x1024 arrays, eating 8 MB(!) of RAM total between them if I'm
> doing my math right. Christ. So, I'm looking into making them
> dynamically allocated, I don't see slviewer ever using more than 64x64
> (33 KB!).

The spec imposes a limit on the total number of flags, not on the
individual dimensions, if I understand it correctly. So you cannot
assume that the flag field will never be, for instance, 32x128, although
it will never be 1024x1024.
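
To illustrate, here is a rough sketch of flat buffers sized by total
area rather than by a fixed width and height. It assumes a limit of
w * h <= 4096 with each dimension at most 1024, which is my reading of
the spec and should be checked against it. All names (cblk_buf_t and
friends) are made up for this example and are not OpenJPEG API:

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: w * h <= 4096 and w, h <= 1024. With a one-entry border on
   every side, (w + 2) * (h + 2) peaks at 6 * 1026 = 6156 (the 4x1024 case). */
#define SKETCH_MAX_AREA  4096
#define SKETCH_MAX_FLAGS 6156

typedef struct {
    int *data;   /* w * h values, row-major */
    int *flags;  /* (w + 2) * (h + 2) values, row-major, with border */
    int w, h;
} cblk_buf_t;    /* hypothetical */

/* Allocate both buffers once, large enough for any allowed shape. */
static int cblk_buf_alloc(cblk_buf_t *b)
{
    b->data = malloc(SKETCH_MAX_AREA * sizeof *b->data);
    b->flags = malloc(SKETCH_MAX_FLAGS * sizeof *b->flags);
    b->w = b->h = 0;
    if (!b->data || !b->flags) {
        free(b->data);
        free(b->flags);
        b->data = b->flags = NULL;
        return -1;
    }
    return 0;
}

/* Set the shape for the next block and zero only the part in use.
   Returns -1 (and changes nothing) if the shape exceeds the assumed limit. */
static int cblk_buf_reset(cblk_buf_t *b, int w, int h)
{
    if (w < 1 || h < 1 || w > 1024 || h > 1024 || w * h > SKETCH_MAX_AREA)
        return -1;
    b->w = w;
    b->h = h;
    memset(b->data, 0, (size_t)(w * h) * sizeof *b->data);
    memset(b->flags, 0, (size_t)((w + 2) * (h + 2)) * sizeof *b->flags);
    return 0;
}

/* Row-major accessors; the flags accessor skips the one-entry border. */
static int *cblk_data_at(cblk_buf_t *b, int x, int y)
{
    return &b->data[y * b->w + x];
}

static int *cblk_flags_at(cblk_buf_t *b, int x, int y)
{
    return &b->flags[(y + 1) * (b->w + 2) + (x + 1)];
}
```

With this layout a 32x128 block and a 64x64 block use the same 16 KB of
data, and nothing close to 1024x1024 is ever reserved.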

> That should eliminate quite a bit of cache thrashing...

Look into my version for changes that do both (and more):

http://space.twc.de/~stefan/quickOpenJPEG/

However, the remaining work done in the T1 code of OpenJPEG is harder
to optimize. My current idea is that in some situations you can skip
whole blocks in the flag field for certain passes: values are only
REFINEd once their SIG bit has been decoded, so the REFINE pass could be
skipped in quite a few cases.

However, I haven't yet found enough time to produce a really good
implementation of this. The problem with block skipping is finding a
representation that lets us easily find out where no work needs to be
done, without being so costly to maintain that using it makes the whole
thing more expensive.


But maybe it would be good if the trivial modification (removing the
cache thrashing) were merged upstream already, without waiting for the
more sophisticated optimizations. There is a Google group for OpenJPEG
optimization these days:

http://groups.google.com/group/openjpeg

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

