I agree with Max that AES New Instructions are not likely to be the root cause of the issue. Without seeing the actual implementation in question, the following list contains some areas for consideration:
Ensure you are not treating the kernel_fpu_begin/end primitives as a substitute for SMP-safe synchronisation primitives such as spin_lock_.... The former only protect the FPU/SIMD register state on the current CPU; they do not serialise access to shared data. You may need both.
When moving to a NUMA system, it is best to configure your BIOS for NUMA, make your memory management code NUMA-aware, and use a NUMA-aware Network Interface Card driver.
Reduce the amount of lock contention on global variables. Consider refactoring the code to use more per-cpu variables with appropriate levels of pre-emption and SMP protection.
Take note of the mapping of logical cores in Linux to physical cores/packages. This mapping could be different to what you may expect between the single and dual processor configurations.
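To check the logical-to-physical mapping mentioned above, a small user-space program can read the topology files that Linux exports under sysfs. The following is a minimal sketch, assuming the usual /sys/devices/system/cpu/cpuN/topology/ layout; the helper names read_topology_value and print_cpu_topology are hypothetical and not taken from any existing tool.

```c
#include <stdio.h>

/* Read one integer from a sysfs topology file for a given logical CPU.
 * "root" is assumed to be /sys/devices/system/cpu on a normal system.
 * Returns -1 if the file is missing or cannot be parsed. */
static int read_topology_value(const char *root, int cpu, const char *name)
{
    char path[256];
    int value = -1;
    FILE *f;

    snprintf(path, sizeof(path), "%s/cpu%d/topology/%s", root, cpu, name);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print "logical CPU -> package, core" for CPUs 0..max_cpus-1.
 * Returns the number of CPUs for which topology data was found. */
static int print_cpu_topology(const char *root, int max_cpus)
{
    int cpu, found = 0;

    for (cpu = 0; cpu < max_cpus; cpu++) {
        int pkg = read_topology_value(root, cpu, "physical_package_id");
        int core = read_topology_value(root, cpu, "core_id");

        if (pkg < 0 || core < 0)
            continue; /* CPU absent, offline, or topology not exported */
        printf("cpu%d: package %d, core %d\n", cpu, pkg, core);
        found++;
    }
    return found;
}
```

Calling print_cpu_topology("/sys/devices/system/cpu", 64) from main on both the single and the dual processor configuration, and comparing the two listings, should show whether consecutive logical CPU numbers land on the same package or alternate between packages.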
To help with debugging it is worth trying the following:
Confirm that the remote side of the VPN tunnel is not silently dropping corrupted packets. This could be another pointer to an implementation issue on the transmitting side.
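Check how well a regular C implementation (one that does not use AES-NI) scales across multiple cores and processors. If it shows the same problem, the cause is unlikely to be in the AES-NI code path.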
Consider using the Linux kernel crypto implementations.