In general, you should first understand at a theoretical level how to parallelise your algorithm and identify the synchronisation points between the different kernels. Of course, you should look for an approach that eliminates such synchronisation points, or at least reduces them to a minimum. Once you have figured this out, you should decide what the role of a work item is inside a specific kernel, how many work items exist inside a work group, and how many work groups are required to solve your problem. Bear in mind that OpenCL provides synchronisation mechanisms (barriers) between work items within a work group, but does not provide a way to synchronise work items from two different work groups, so any cross-group synchronisation usually has to happen at a kernel boundary.
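To make the sizing and synchronisation points concrete, here is a minimal sketch, not a definitive implementation. The kernel name `partial_sum`, its arguments, and the helpers `work_group_count` and `padded_global_size` are all hypothetical names I made up for illustration. The kernel assumes the work-group size is a power of two, and it only ever synchronises inside a work group via `barrier(CLK_LOCAL_MEM_FENCE)`; combining the per-group results would need a second kernel launch or a host-side pass.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C kernel source, kept as a string as it would be handed to
 * clCreateProgramWithSource. Hypothetical example: each work group sums
 * its slice of the input and writes one partial result.
 * Assumes the local size is a power of two. */
const char *const partial_sum_src =
    "__kernel void partial_sum(__global const float *in,\n"
    "                          __global float *group_sums,\n"
    "                          __local float *scratch,\n"
    "                          const unsigned int n)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t lid = get_local_id(0);\n"
    "    /* Padding work items (gid >= n) contribute zero. */\n"
    "    scratch[lid] = (gid < n) ? in[gid] : 0.0f;\n"
    "    /* Sync point: valid only within one work group. */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {\n"
    "        if (lid < s) scratch[lid] += scratch[lid + s];\n"
    "        barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    }\n"
    "    /* One result per group; groups never wait on each other. */\n"
    "    if (lid == 0) group_sums[get_group_id(0)] = scratch[0];\n"
    "}\n";

/* Number of work groups needed to cover n_items when each group
 * holds local_size work items (rounds up). */
size_t work_group_count(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    return (n_items + local_size - 1) / local_size;
}

/* Global work size rounded up to a whole number of work groups,
 * so that the global size is a multiple of the local size. */
size_t padded_global_size(size_t n_items, size_t local_size)
{
    return work_group_count(n_items, local_size) * local_size;
}
```

For example, with 1000 input elements and 64 work items per group, `work_group_count(1000, 64)` gives 16 groups and `padded_global_size(1000, 64)` gives a global size of 1024; the 24 extra work items are the ones the `gid < n` check in the kernel guards against.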
This is all very general stuff and is probably not new to you. If you can provide more information about the dilemmas you have at the moment, my answer could be a bit more specific (and probably more helpful).