I don't think there is a complete list of instructions and their port mappings anywhere.
As I said in my previous post, the document is incomplete: it does not list all the instructions, and it was not written to carry this information, since it can change from processor to processor. The tool gives the complete listing.
Regarding your performance gap, you may need to look into it in more detail. If you are comfortable sharing code, you can post it here and someone may be able to tell you why there is a gap.
Try a different compiler, maybe the Intel compiler if you are not already using it. That will show whether the gap is due to the compiler.
Assuming those are right, IACA is telling you the optimal throughput, but you can get more out of IACA with -analysis PERFORMANCE or -analysis DATA_DEPENDENCY.
Look for the instructions marked "CP"; these are the instructions on the critical path.
These analyses print the latency at the beginning for each port, which gives you a rough idea of how many cycles one loop iteration takes.
Secondly, looking at your code, your performance is limited by port 1 and also port 0. If you can somehow break that dependency, or choose different instructions, you may get better performance. You may also need to look at the generated assembly: since you are using a lot of registers, there is a chance of registers being spilled to the stack, which adds delay. You can avoid that by reusing registers that are already defined. Compilers usually take care of this, but sometimes the compiler is not clear about the scope and keeps a register alive a little longer than needed.
Another option is SDE, if you have it installed. It has a tool called xed that dumps the disassembly. I believe it also shows whether an instruction is AVX or SSE.
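
To illustrate what I mean by breaking the dependency: this is only a minimal sketch in C, not your code, and the function names sum_single and sum_split are made up for the example. The first loop has one accumulator, so every add has to wait for the previous one. The second splits the work across four independent accumulators, so the adds within an iteration do not depend on each other. Whether it helps in your case depends on your actual loop and on what the compiler generates.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
   so the loop forms a single serial dependency chain. */
double sum_single(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators (hypothetical variant): the four adds in one
   iteration are independent of each other, so they can overlap. */
double sum_split(const double *x, size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Remainder loop: handle the last n % 4 elements. */
    for (; i < n; i++)
        a0 += x[i];

    /* Combine the partial sums once, outside the loop. */
    return (a0 + a1) + (a2 + a3);
}
```

Keep in mind that with floating point the split version adds the values in a different order, so the result can differ in the last bits from the single-accumulator version. Also note that more accumulators means more live registers, so check the assembly for spills.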