Your first loop has unresolvable data dependency problems. Even optimizing unroll may do little for it, in view of the expense of exp(). A lot of work has been done on parallelizing loops a bit like that just to show it might be done, without considering whether there is any gain in it.
The second loop is barely hiding a lot of redundant work. If you would simply reverse it and break when you find the match, it would be a simple search, except that you may need to expand it explicitly since the linear search auto-vectorization of icc may not work backwards. gcc wouldn't auto-vectorize it anyway. There is no point in thinking about parallel until you have optimized it in single thread and determined whether it remains a bottleneck in your application. The search would need to be quite long before a binary search could show an advantage over vectorized linear search. Back in the early days of vectorization (e.g. Cray) it was fairly common to write loops which did not break, as the compilers could not optimize with break, but that was 20-40 years ago, and people sometimes ignored the fact that making a loop 10 times as long might not be justified by vectorization. The computers which lost 99% of performance without vectorization or parallelization were rapid market failures.
Thank you for your answer. I provided more detailed comment to your post on another site.
Actually I am sorry that I listed the above two code samples, as the most replies I get relate to this code when the main problem I like to resolve was and still is:
looking for benchmarks that may not be directly parallelized by using tools for generations of parallel code.
I suggest you examine Elusive Algorithms - Parallel Scan to see if the technique used there can be exploited by your code requirements.
Note: exp(x + y) = exp(x) . exp(y)
There may be some hope for you in parallelizing the code listed in first part of #1
Actually calling 'exp' is not a point - the point is how to resolve the data dependency between "c = exp( ret );" and "ret += a;" ( what if instead of 'exp' I will use another function that doesn't behave "nicely" as exp ). And in the context of my post, the point is "if OpenMP can handle this code".
I wrote autoparallelizer that easily parallelizes this code ( without relying on properties of exp ) - search for "not parallel on OpenMP" on www.dalsoft.com/documentation_auto_parallelization.html.
The article I referenced in #4 addresses how to vectorize and parallelize a class of codes that appear to have loop order dependencies. In the article case and your #1.1 case, they both have carry dependencies.