The main factor is the mismatch between the rate at which data is written into the channel and the rate at which it is read out. If you expect these rates to be similar in the two kernels, a shallow depth of a few entries (fewer than 20) will suffice, and it will not use Block RAMs either. If, however, you expect the rates to differ considerably, keep increasing the depth and measure performance at each step until it stops improving.
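To make the "increase the depth until performance stabilizes" procedure concrete, here is a minimal cycle-level sketch in Python. It is only a toy model, not the real hardware or the SDK: the function name `simulate`, the bursty producer (`burst` writes followed by `idle` cycles) and the steady consumer (`read_period`) are all assumptions chosen for illustration.

```python
from collections import deque


def simulate(depth, n_items=1024, burst=16, idle=16, read_period=2):
    """Toy model of two kernels connected by a FIFO channel of a given depth.

    Hypothetical behaviour: the producer writes `burst` items back to back,
    then is busy elsewhere for `idle` cycles; the consumer needs
    `read_period` cycles per item. Returns (cycles, write_stalls, read_stalls).
    """
    fifo = deque()
    written = read = cycles = 0
    write_stalls = read_stalls = 0
    in_burst = 0        # items written so far in the current burst
    idle_left = 0       # remaining idle cycles of the producer
    consumer_busy = 0   # remaining cycles before the consumer can read again

    while read < n_items:
        cycles += 1

        # Consumer side: read if ready and data is available, else stall.
        if consumer_busy > 0:
            consumer_busy -= 1
        elif fifo:
            fifo.popleft()
            read += 1
            consumer_busy = read_period - 1
        else:
            read_stalls += 1

        # Producer side: write if not idle and the channel has room, else stall.
        if written < n_items:
            if idle_left > 0:
                idle_left -= 1
            elif len(fifo) < depth:
                fifo.append(written)
                written += 1
                in_burst += 1
                if in_burst == burst:
                    in_burst = 0
                    idle_left = idle
            else:
                write_stalls += 1

    return cycles, write_stalls, read_stalls


if __name__ == "__main__":
    # Sweep the depth and watch where the cycle count stops changing.
    for depth in (1, 2, 4, 8, 16, 32, 64):
        cycles, w_stalls, r_stalls = simulate(depth)
        print(f"depth={depth:3d} cycles={cycles} "
              f"write_stalls={w_stalls} read_stalls={r_stalls}")
```

In this model the average write and read rates are equal, but the writes arrive in bursts, so a shallow channel makes the producer stall. Once the depth is large enough to absorb a burst, the write stalls disappear and making the channel deeper changes nothing. On real hardware you would do the same sweep by recompiling with different depths and timing the kernels.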
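When you see a similar amount of stalling on both the read and the write side, it means the source of the stalls is not the channel but something else. Likewise, increasing the channel depth will not help if you only have stalls on the read side and none on the write side: a read stalls when the channel is empty, so the writer is simply not producing data fast enough and extra buffer space would stay unused.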
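The read-side-only case can be illustrated with a small Python sketch. Again this is a toy cycle-level model under assumed parameters, not the actual channel implementation: `run`, `write_period` and `read_period` are hypothetical names, and both kernels are assumed to work at a steady rate.

```python
from collections import deque


def run(depth, write_period, read_period, n_items=500):
    """Toy model: steady producer and consumer sharing a FIFO of `depth` entries.

    The producer needs `write_period` cycles per item and the consumer needs
    `read_period` cycles per item (both hypothetical parameters).
    Returns (cycles, write_stalls, read_stalls).
    """
    fifo = deque()
    written = read = cycles = 0
    write_stalls = read_stalls = 0
    producer_busy = consumer_busy = 0

    while read < n_items:
        cycles += 1

        # Consumer: stalls only when it is ready and the channel is empty.
        if consumer_busy > 0:
            consumer_busy -= 1
        elif fifo:
            fifo.popleft()
            read += 1
            consumer_busy = read_period - 1
        else:
            read_stalls += 1

        # Producer: stalls only when it is ready and the channel is full.
        if written < n_items:
            if producer_busy > 0:
                producer_busy -= 1
            elif len(fifo) < depth:
                fifo.append(written)
                written += 1
                producer_busy = write_period - 1
            else:
                write_stalls += 1

    return cycles, write_stalls, read_stalls


if __name__ == "__main__":
    # Slow producer (3 cycles per item), fast consumer (1 cycle per item).
    for depth in (1, 8, 64):
        print(depth, run(depth, write_period=3, read_period=1))
```

With a slow producer and a fast consumer, the model reports read stalls but no write stalls, and the total cycle count is the same for every depth, because the channel never holds more than one item.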
I have no idea. You can check Section 4.3, "Interpreting the Profiling Information", of the Best Practices Guide, which explains in detail how to interpret profiling results and find the sources of bottlenecks.