I ran into a problem when using OpenBLAS and MKL. The problem sizes ranged from 16000 to 18000 with a step of 64 (i.e. 16000, 16064, 16128, ..., 18000). The runs were done on a cluster with 24 nodes of Haswell architecture (two sockets, cache = 30MB). Both applications show the same deep drop in performance when the size is 16384. The cache miss rate also increases significantly at this size, which is why the performance decreases. My question is: why does this happen at exactly this size? I do not have much programming experience, so any thoughts are welcome.
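
One thing that stands out is that 16384 = 2^14, so a possible (unconfirmed) explanation is cache set conflicts caused by a power-of-two stride. Below is a minimal sketch of that arithmetic, not a measurement. It assumes 8-byte doubles, a leading dimension equal to the problem size, and an L1 data cache with 64-byte lines and 64 sets (32 KiB, 8-way). The function name `distinct_sets` and all its parameters are hypothetical and only for illustration.

```python
def distinct_sets(n, elem_bytes=8, line_bytes=64, num_sets=64, count=64):
    """Count the distinct cache sets hit by `count` accesses that are
    each n * elem_bytes bytes apart (one element per row or column of
    an n x n matrix, walked across the leading dimension).

    The cache geometry defaults are assumptions, not measured values.
    """
    stride = n * elem_bytes  # distance in bytes between consecutive accesses
    sets = set()
    for i in range(count):
        line_index = (i * stride) // line_bytes  # which cache line the address is in
        sets.add(line_index % num_sets)          # which set that line maps to
    return len(sets)


if __name__ == "__main__":
    # Compare the problematic size with its two neighbours in the sweep.
    for n in (16320, 16384, 16448):
        print(n, distinct_sets(n))
```

Under these assumptions, a stride of 16384 doubles sends every access to the same set, so only as many lines as the associativity allows can stay in cache at once, while the neighbouring sizes spread over several sets. Real BLAS libraries copy blocks into packed buffers, so the actual cause may differ (TLB effects, for example); padding the leading dimension (e.g. allocating with 16384 + 8 columns) would be one way to test this guess.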