I remember reading a whitepaper, which I can't find now, arguing that you shouldn't over-align, especially on Pentium 4. Image stride should be a multiple of 64 but not of 128, so that columns don't compete for the same cache lines. Otherwise most of the cache is useless. For example, if a 32 KB cache is 4-way associative with 64-byte lines, and the stride maps every row of a column to the same set, only 4 x 64 = 256 bytes are usable.
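To make the arithmetic concrete, here is a rough sketch of the set mapping. It assumes a simplified cache model (set index taken directly from the address, no prefetching, no virtual/physical distinction), and the function names `sets_touched` and `usable_bytes` are made up for illustration, not taken from any library.

```python
def sets_touched(stride, rows, cache_bytes=32 * 1024, ways=4, line_bytes=64):
    """Count the distinct cache sets hit when reading one byte per row down a column."""
    # Number of sets = cache size / (associativity * line size); 128 for the defaults.
    num_sets = cache_bytes // (ways * line_bytes)
    # Row r of the column starts at byte offset r * stride; its line index modulo
    # the number of sets is the set it lands in (simplified model, see lead-in).
    return len({(row * stride // line_bytes) % num_sets for row in range(rows)})


def usable_bytes(stride, rows, cache_bytes=32 * 1024, ways=4, line_bytes=64):
    """Upper bound on the cache bytes that column walk can actually occupy."""
    touched = sets_touched(stride, rows, cache_bytes, ways, line_bytes)
    return touched * ways * line_bytes


if __name__ == "__main__":
    for stride in (8192, 4096, 128, 64, 8192 + 64):
        print(stride, sets_touched(stride, 256), usable_bytes(stride, 256))
```

Under this model, a stride of 8192 bytes puts all 256 rows in a single set, so only 256 bytes of the cache are usable, while padding the stride to 8256 bytes spreads the rows over all 128 sets. A stride of 128 reaches only every other set, which may be the effect the "multiple of 64 but not 128" rule is meant to avoid.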
iw doesn't take this precaution. Maybe it should? Any pointers to read more on this and why Pentium 4 had it worse than other CPUs?
Do you mean cache bank conflicts? The architecture and its software implications are discussed in general terms in the IA software developer manual: