Solved: Benchmarking threading across 16/32-bit code pages

dgunter · ‎06-16-2009

Has anyone benchmarked or explored mixing 16/32-bit code pages in the same process but in different threads?
Can I run a thread in a 16-bit code page and another in a 32-bit code page in the same process?
If so, will the Intel microcode behind the scenes and the Hyper-Threading technology run my 16-bit code faster or slower than the same code in 32 bits?
Will the 16-bit code suffer from alignment lag when accessing the 3rd and 4th bytes in the upper half of what would be 32-bit alignment?
Finally, what cache and other penalties do I need to be aware of when testing the mixing of code pages?
Reason: We are prototyping a process for decreasing the run-time complexity or energy usage of an application by using unused bit widths for parallel processing, i.e., if you only use the lower 16 bits, use the darn upper 16 bits to process other data in parallel. We are testing on Intel and finalizing on FPGA.
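
To make the "use the upper 16 bits" idea concrete, here is a minimal sketch in C of packing two independent 16-bit values into one 32-bit word and adding both with a single 32-bit add. The function names (pack16x2, add16x2, lane_hi, lane_lo) are hypothetical and only for illustration; this is one common masking trick, not necessarily what the prototype does.

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word:
   hi lane in bits 31..16, lo lane in bits 15..0. */
static uint32_t pack16x2(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* Add both 16-bit lanes using one 32-bit add. Each lane wraps modulo 2^16.
   The top bit of each lane is cleared before the add so that a carry can
   never cross from the lo lane into the hi lane; the true top bit of each
   lane is then restored with XOR. */
static uint32_t add16x2(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80008000u;          /* top bit of each lane */
    uint32_t sum = (a & ~H) + (b & ~H);      /* 15-bit adds, carry lands in bit 15 of its own lane */
    return sum ^ ((a ^ b) & H);              /* fold the original top bits back in */
}

static uint16_t lane_hi(uint32_t w) { return (uint16_t)(w >> 16); }
static uint16_t lane_lo(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
```

The masking is the cost of sharing the register: a plain 32-bit add would let an overflow in the lower lane corrupt the upper one.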

jimdempseyatthecove · ‎06-18-2009

>>Testing on Intel finalize on FPGA

Since your intention is to run on an FPGA, I would assume the "CPU" architecture is not an Intel architecture. You will likely be using your own design or one of the usual architectures for FPGAs, such as ARM. This being the case, any wall-clock benchmarking you perform on an Intel platform will not be suitable for ascertaining the performance on (in) the FPGA. Instead, your best bet would be to write an emulator of your eventual instruction set, including registers, cache, memory and I/O, and then account for the ticks through each path.
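
As a rough illustration of "emulate the instruction set and account for the ticks", here is a minimal sketch in C. Everything in it is an assumption made up for the example: the four opcodes, their cycle costs in COST, the one-tick taken-branch penalty, and the names Instr, Cpu and run. A real emulator would use the costs of the actual FPGA design and would also model cache, memory and I/O.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA; opcodes and cycle costs are invented for illustration. */
typedef enum { OP_LOADI, OP_ADD, OP_JNZ, OP_HALT } Opcode;

typedef struct {
    Opcode  op;
    uint8_t a;      /* destination / tested register */
    uint8_t b;      /* source register */
    int16_t imm;    /* immediate value or branch target */
} Instr;

typedef struct {
    uint16_t reg[4];   /* 16-bit register file */
    size_t   pc;       /* index of the next instruction */
    uint64_t ticks;    /* accumulated cycle count */
} Cpu;

/* Assumed base cost in ticks for each opcode. */
static const unsigned COST[] = {
    [OP_LOADI] = 1, [OP_ADD] = 1, [OP_JNZ] = 2, [OP_HALT] = 1
};

/* Execute until HALT or until pc runs off the end of the program.
   Returns the total number of ticks charged along the executed path. */
uint64_t run(Cpu *cpu, const Instr *prog, size_t n) {
    while (cpu->pc < n) {
        Instr in = prog[cpu->pc++];
        cpu->ticks += COST[in.op];
        switch (in.op) {
        case OP_LOADI:
            cpu->reg[in.a & 3] = (uint16_t)in.imm;
            break;
        case OP_ADD:   /* 16-bit add, wraps modulo 2^16 */
            cpu->reg[in.a & 3] = (uint16_t)(cpu->reg[in.a & 3] + cpu->reg[in.b & 3]);
            break;
        case OP_JNZ:   /* taken branches pay one extra tick (assumed penalty) */
            if (cpu->reg[in.a & 3] != 0) {
                cpu->pc = (size_t)in.imm;
                cpu->ticks += 1;
            }
            break;
        case OP_HALT:
            return cpu->ticks;
        }
    }
    return cpu->ticks;
}
```

Because the tick count depends only on the cost table and the path taken, the result is independent of how fast the host Intel machine happens to run the emulator.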

If you roll your own "CPU", it can be any bit width, and in a large FPGA with a simple processor core you can cram 32 or more cores into one FPGA. Also, in an FPGA, the processor cores need not all be the same. You can have different widths, functionality (FPU/integer/other), instruction sets, and even a blend of digital and analog computation. So benchmarking threading across 16/32-bit code pages is not productive.

Jim Dempsey

gaston-hillar · ‎06-17-2009

Hi dgunter,

I'm not an expert on this topic. Are you talking about mixing 16-bit code with 32-bit code in the same application / in the same process? I don't think that's possible. A process runs in either 16 or 32 bits. BTW, 16-bit code isn't available on 64-bit operating systems; you have to virtualize.
16-bit applications running on modern 32-bit operating systems run really slowly. I don't think it makes sense to test parallelism in 16 bits... It's weird.

As I always say, just my opinion.

gaston-hillar · ‎06-18-2009

As Jim mentioned, you're going to use an FPGA... Now I understand. You're going to work with embedded systems. I hadn't understood that focus, so it seemed weird to run these tests on modern desktop 32- or 64-bit operating systems.

dgunter · ‎06-22-2009

Quoting - Gastón C. Hillar

As Jim mentioned, you're going to use an FPGA... Now I understand. You're going to work with embedded systems. I hadn't understood that focus, so it seemed weird to run these tests on modern desktop 32- or 64-bit operating systems.

Yes Sir, very weird. I meant "simulating" on the Intel... I was attempting to see if I could increase the simulation speed. Or I would, if the Intel would run mixed 16/32-bit code. I just wanted to see how fast 16-bit code is in relation to 32-bit. Apparently, I can't mix different-width code pages :-( on Intel.

Thank you for your time.

dgunter · ‎06-22-2009

Quoting - jimdempseyatthecove

>>Testing on Intel finalize on FPGA

Since your intention is to run on an FPGA, I would assume the "CPU" architecture is not an Intel architecture. You will likely be using your own design or one of the usual architectures for FPGAs, such as ARM. This being the case, any wall-clock benchmarking you perform on an Intel platform will not be suitable for ascertaining the performance on (in) the FPGA. Instead, your best bet would be to write an emulator of your eventual instruction set, including registers, cache, memory and I/O, and then account for the ticks through each path.

If you roll your own "CPU", it can be any bit width, and in a large FPGA with a simple processor core you can cram 32 or more cores into one FPGA. Also, in an FPGA, the processor cores need not all be the same. You can have different widths, functionality (FPU/integer/other), instruction sets, and even a blend of digital and analog computation. So benchmarking threading across 16/32-bit code pages is not productive.

Jim Dempsey

DARN!!!! Thanks, it's what I figured. I hate to run an 8- and 16-bit simulation on a 32-bit CPU; it seems like such a waste of simulation time. There must be a way on the Intel to use the unused bit width to gain a speed advantage from bit-width reduction. Thanks.

jimdempseyatthecove · ‎06-23-2009

Quoting - dgunter

DARN!!!! Thanks, it's what I figured. I hate to run an 8- and 16-bit simulation on a 32-bit CPU; it seems like such a waste of simulation time. There must be a way on the Intel to use the unused bit width to gain a speed advantage from bit-width reduction. Thanks.

Look at the SSE instruction sets (SSE, SSE2, SSE3, SSE4.1 and so on). SSE provides single instruction, multiple data (SIMD) operations, whereby you can manipulate multiple like data objects in one instruction.

Up to:

16 bytes
8 shorts (words)
4 dwords
2 qwords
4 floats
2 doubles
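
As a minimal sketch of the "8 shorts" case, the C function below adds two arrays of 16-bit values eight lanes at a time using SSE2 intrinsics from emmintrin.h (the packed integer operations on 128-bit registers came in with SSE2). The function name add_u16 is hypothetical. It assumes an x86 compiler with SSE2 enabled, which is the default on x86-64; a 32-bit GCC build may need -msse2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* out[i] = a[i] + b[i] for 16-bit elements, wrapping modulo 2^16.
   The main loop handles eight elements per iteration in one 128-bit
   register; a scalar loop finishes any remaining elements. */
void add_u16(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads, so the arrays need no special alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Eight independent 16-bit adds; no carry crosses between lanes. */
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi16(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

Unlike hand-packing two values into a 32-bit integer, the hardware keeps the lanes separate, so no masking is needed to stop carries from leaking between neighbouring values.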

Note that the Intel Atom supports SSE. You might want to consider a design built around one or more Atoms, or a hybrid of an Atom plus a smaller FPGA.

Jim Dempsey