AffineTransform Variant

kramulous · ‎05-04-2011

Hi,

I've been speeding up some rendering code of mine and have reduced the problem to a function that takes 98% of the time.

[cpp]    ippsMulC_32f(dataset.xpoints, a, temp1, n);
    ippsAddProductC_32f(dataset.ypoints, b, temp1, n);
    ippsAddProductC_32f(dataset.zpoints, c, temp1, n);
    ippsAddC_32f_I(d, temp1, n);
    
    ippsMulC_32f(dataset.xpoints, e, temp2, n);
    ippsAddProductC_32f(dataset.ypoints, f, temp2, n);
    ippsAddProductC_32f(dataset.zpoints, g, temp2, n);
    ippsAddC_32f_I(h, temp2, n);
    
    ippsMulC_32f(dataset.xpoints, i, temp3, n);
    ippsAddProductC_32f(dataset.ypoints, j, temp3, n);
    ippsAddProductC_32f(dataset.zpoints, k, temp3, n);
    ippsAddC_32f_I(l, temp3, n);
    ippsDivCRev_32f_I(1.0, temp3, n);
    
    
    for (int p = 0; p < n; p++) {
        xtrans[p] = ca + ca*temp1[p]*vd*temp3[p];
        ytrans[p] = cb - cb*temp2[p]*vda*temp3[p];
    }[/cpp]

Eventually I have to get rid of the temporary arrays (temp1, temp2, temp3) ... memory requirements will not allow them for much longer.

Now, I see the closest contender for what I need is "ippmAffineTransform3DH_mva_32f", but it does:

X' = t₁₁*X + t₁₂*Y + t₁₃*Z + t₁₄

Y' = t₂₁*X + t₂₂*Y + t₂₃*Z + t₂₄

Z' = t₃₁*X + t₃₂*Y + t₃₃*Z + t₃₄

W = t₄₁*X + t₄₂*Y + t₄₃*Z + t₄₄

X' = X'/W

Y' = Y'/W

Z' = Z'/W

But I need

X' = t₁₁*X + t₁₂*Y + t₁₃*Z + t₁₄

Y' = t₂₁*X + t₂₂*Y + t₂₃*Z + t₂₄

W = t₄₁*X + t₄₂*Y + t₄₃*Z + t₄₄,

and then the following vector transformation:

X' = X'/W

Y' = Y'/W

So I guess my question is this: are there any internal optimisations that will flag t₃₁, t₃₂, t₃₃ and t₃₄ as zero and not include them in the calculation?

It would be really, really nice to sort this last bit out so I can get balanced parallel execution (multi socket/core) and full vectorisation.

I'm currently rendering 2 billion points at 20 frames per second at 9600x4800 resolution (20x1920x1200). The interaction is just a fraction too slow for fluid manipulation. By the end of the year the resolution is expected to increase a further 6x, so milking every bit of performance is essential.

Pointers from the experts?

kramulous · ‎05-05-2011

Never mind ... found a way to force full vectorisation anyway. Works well.