On Haswell, if data is aligned on a 32-byte boundary and loaded/written using a __m256/i/d register, is the read/write atomic?
I have not seen Intel claim that 32 bytes can be written atomically, even though the fact that Haswell's L1D cache bandwidth is 256 bits would suggest that it is possible. A colleague has also told me that the 256-bit load instructions require a single uop. Does that suggest the access is atomic?
Although it is likely that aligned 256-bit loads are atomic on Haswell, based on past experience I think it unlikely that Intel will commit to this as a supported feature.
The problem with supported features is that there is an expectation that they will be continued indefinitely. The history of computer implementations is full of examples of features that were easy to implement at one time, but which became serious burdens to support after a few generations. (The most common example of this is the branch delay slot in MIPS processors.) The discussion in section 8.1.1 of Volume 3 of the Intel Software Developer's Manual provides some hints about areas that cause problems.
Extending the guarantee of atomicity from 64 bits (fully contained in a cache line) to 128 bits or 256 bits does not provide a lot more capability for software. If one wanted to support this sort of atomicity as a first-order mechanism, then it would have to be extended with other complementary features. I would recommend supporting full cache line (64-byte) atomicity at a minimum, but unless the architecture goes all the way to supporting memory accesses with side effects, I don't think it would be worth the effort.
Although it has been slow getting started, Intel's TSX extensions are a more complete approach to exploiting atomicity.
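Since aligned 256-bit accesses are not documented as atomic, code that needs a consistent 32-byte snapshot should probably not depend on them. One option that relies only on 64-bit atomicity is a sequence lock. The following is a minimal sketch in portable C++11, not anything Intel recommends: the class name Snapshot32 and its store/load members are hypothetical, and it assumes a single writer thread with any number of readers.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: publishes a 32-byte value using only 64-bit atomic
// accesses plus a sequence counter, so it does not depend on 256-bit
// loads or stores being atomic. Assumes exactly one writer thread.
class Snapshot32 {
public:
    using Value = std::array<std::uint64_t, 4>;

    // Writer side (single writer assumed).
    void store(const Value& v) {
        std::uint32_t s = seq_.load(std::memory_order_relaxed);
        assert((s & 1u) == 0u);                        // no write already in progress
        seq_.store(s + 1, std::memory_order_relaxed);  // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        for (int i = 0; i < 4; ++i)
            words_[i].store(v[i], std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);  // even: value is stable
    }

    // Reader side: retry until the counter is even and unchanged
    // across the four 64-bit reads.
    Value load() const {
        Value out;
        std::uint32_t before, after;
        do {
            before = seq_.load(std::memory_order_acquire);
            for (int i = 0; i < 4; ++i)
                out[i] = words_[i].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            after = seq_.load(std::memory_order_relaxed);
        } while ((before & 1u) != 0u || before != after);
        return out;
    }

private:
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<std::uint64_t> words_[4]{};
};
```

A reader never returns a mix of old and new words, because any overlap with a write changes the counter and forces a retry. The cost is that readers can spin while a write is in flight, which is part of why a hardware guarantee (or TSX) looks attractive in the first place.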