Please correct me if I am mistaken, but I guess that this alignment stems from the time when cache lines were 16 bytes long. Aligning the stack to 16 bytes as well gives a better fit between stack data and cache lines.
Today, the cache line size is a multiple of 16 bytes (typically 64 bytes on current x86 processors).
Oh, and with SSE, which uses 128-bit registers, 16-byte alignment is the most natural choice, too.
This is due to a calling convention in x64 which requires the stack to be 16-byte aligned before any call instruction. This is not (to my knowledge) a hardware requirement but a software one. Because the call instruction pushes an 8-byte return address, this guarantees that on entry to a function (that is, right after a call instruction), the value of the stack pointer is always 8 modulo 16. This permits simple data alignment and stores to and loads from aligned locations on the stack.
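To make the "8 modulo 16" arithmetic concrete, here is a small sketch that models the stack pointer as a plain integer. The function names, the example address, and the assumed prologue (`push rbp` followed by a `sub rsp, N` with N rounded up to a multiple of 16) are illustrative assumptions, not something mandated by the calling convention itself.

```python
RETURN_ADDRESS_SIZE = 8  # bytes pushed by `call` on x86-64
STACK_ALIGNMENT = 16     # alignment the convention requires before `call`


def rsp_after_call(rsp_before_call: int) -> int:
    """Model `call`: the stack grows down by one 8-byte return address."""
    if rsp_before_call % STACK_ALIGNMENT != 0:
        raise ValueError("caller violated the 16-byte alignment rule")
    return rsp_before_call - RETURN_ADDRESS_SIZE


def rsp_after_prologue(rsp_at_entry: int, locals_size: int) -> int:
    """Model an assumed prologue: `push rbp`, then `sub rsp, N` (N rounded up to 16)."""
    rsp = rsp_at_entry - 8  # push rbp: back to 0 modulo 16
    rounded = (locals_size + STACK_ALIGNMENT - 1) // STACK_ALIGNMENT * STACK_ALIGNMENT
    return rsp - rounded


entry = rsp_after_call(0x7FFF_FFFF_E000)   # hypothetical aligned stack pointer
frame = rsp_after_prologue(entry, 40)      # 40 bytes of locals, rounded to 48
print(entry % 16, frame % 16)              # prints "8 0"
```

Since the offset at entry is always the same, the compiler can place 16-byte-aligned locals at fixed offsets from the stack pointer without any run-time realignment.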
In particular, 16-byte stack alignment avoids the need to insert conditional code to align SSE objects, both when allocating stack space and when entering SSE loops. This avoids the run-time failures seen on 32-bit systems when a GCC-compiled function is called by one compiled by another compiler. It also avoids the alignment-dependent numerical differences caused by differing loop alignment adjustments.
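As a rough illustration of the "conditional code" mentioned above, the sketch below computes how many scalar iterations a vectorized loop over 4-byte floats would have to peel off before its data pointer reaches a 16-byte boundary. The function name and the addresses are hypothetical, and the address is assumed to be at least 4-byte aligned.

```python
VECTOR_BYTES = 16  # width of one SSE register
FLOAT_BYTES = 4    # size of one single-precision element


def peel_count(address: int, n: int) -> int:
    """Scalar iterations needed before `address` reaches a 16-byte boundary.

    Assumes `address` is a multiple of FLOAT_BYTES; `n` is the element count.
    """
    misalignment = address % VECTOR_BYTES
    if misalignment == 0:
        return 0  # already aligned: no scalar prologue needed
    return min(n, (VECTOR_BYTES - misalignment) // FLOAT_BYTES)


print(peel_count(0x1000, 100))  # prints 0
print(peel_count(0x1008, 100))  # prints 2
```

When the array's alignment is only known at run time, the peel count varies from call to call, so the elements are grouped into vector lanes differently and a floating-point reduction can round differently. With a 16-byte-aligned stack, a compiler can place such arrays on a 16-byte boundary, the count is always 0, and the scalar prologue can be omitted.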
32-bit Linux in the past provided 8-byte alignment (pointers set to 8 modulo 16) so as to handle 64-bit objects efficiently. Certain 32-bit compilers for Windows "optimized" for varying alignment by avoiding the use of 64- and 128-bit moves.