Zenith: My experiences from multi core enabled coding

If you want to exploit the new processor technology, multi-core enabled coding is not the only method.

You can easily exploit the power of multiple cores with MPI, OpenMP, threading and so on. These technologies give good gains not only on dual-core machines but also on machines with more than 8 cores. I have tested on some machines with as many as 20 cores (hyperthreading enabled).
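As a rough illustration of how little code the OpenMP route can need, here is a minimal sketch. It assumes GCC or Clang with the -fopenmp flag; without the flag the pragma is ignored and the loop simply runs on one core. The function name dot_product is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product of two arrays.
 * With -fopenmp the iterations are split across the available cores and
 * the per-thread partial sums are combined by the reduction clause.
 * Without -fopenmp the pragma is ignored and the loop runs serially. */
double dot_product(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Note that a parallel reduction may add the floating-point terms in a different order than the serial loop, so results can differ in the last bits for general data.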

From my personal experience:

I took part in a coding contest at my workplace in which we had to write CPU-AWARE code for a task that required lots of redundant data manipulation. We were given very powerful machines to test our code on. One of them had 40 cores and was based on the Nehalem microarchitecture. At that time I only knew about multi-threading, so I wrote a program that created 40 threads (using pthreads, Unix-style multi-threading). The threads worked independently; my problem required no communication or resource sharing. My program completed the task in about 2 seconds, where the single-threaded code needed about 1 minute.

I was expecting a good rank, but to my surprise I was nowhere near the top. The top entries made extensive use of AVX and SSE instructions along with multi-threading. The person who won was able to complete the task in 0.1 second.
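The contest code itself is not available, so the following is only a sketch of the "independent threads, no sharing" pattern described above, under my own assumptions. The workload (a sum of squares), the names slice_t, sum_slice and parallel_sum_squares, and the MAX_THREADS limit are all invented for illustration. Each thread gets its own slice of the input and writes only to its own result slot, so no locks are needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary upper bound for this sketch */

/* Hypothetical work item: one slice of the input per thread. */
typedef struct {
    const double *data;
    size_t begin;    /* first index of the slice */
    size_t end;      /* one past the last index */
    double partial;  /* written only by the owning thread */
} slice_t;

/* Thread entry point: processes one slice, touches no shared state. */
static void *sum_slice(void *arg)
{
    slice_t *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i] * s->data[i];
    s->partial = acc;
    return NULL;
}

/* Sum of squares of data[0..n) computed by nthreads independent workers. */
double parallel_sum_squares(const double *data, size_t n, int nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int created[MAX_THREADS];

    for (int t = 0; t < nthreads; t++) {
        slice[t].data = data;
        slice[t].begin = n * (size_t)t / (size_t)nthreads;
        slice[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        slice[t].partial = 0.0;
        created[t] = (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) == 0);
        if (!created[t])
            sum_slice(&slice[t]);  /* fall back to the calling thread */
    }

    /* Wait for every worker, then combine the partial results in order. */
    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        if (created[t])
            pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }
    return total;
}
```

On a 40-core machine one would call something like parallel_sum_squares(data, n, 40). On older toolchains the program may need to be built with -pthread.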

In short, there is more to exploit than just multiple cores.

You can refer to the following optimisation manual for more details:

http://www.intel.com/content/www...

The latest microarchitectures from Intel offer great scope for improvements in code if we write CPU-AWARE programs.

These new microarchitectures offer the following innovative features:

Intel Advanced Vector Extensions (Intel AVX)

— 256-bit floating-point instruction set extensions to the 128-bit Intel
Streaming SIMD Extensions, providing up to 2X performance benefits relative
to 128-bit code.
— Non-destructive destination encoding offers more flexible coding techniques.
— Supports flexible migration and co-existence between 256-bit AVX code,
128-bit AVX code and legacy 128-bit SSE code.
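To make the 256-bit width concrete, here is a small sketch that adds two float arrays eight elements at a time with AVX intrinsics. It assumes GCC or Clang on x86: the target attribute and __builtin_cpu_supports are compiler extensions, not part of standard C. The function names add_avx, add_scalar and add_arrays are made up for this example. On a CPU without AVX, or on another architecture, it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX path: one 256-bit register holds 8 floats, twice the 4 floats of a
 * 128-bit SSE register. The target attribute lets this single function
 * use AVX even if the rest of the file is compiled without -mavx. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned load of 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                       /* leftover tail elements */
        out[i] = a[i] + b[i];
}
#endif

/* Portable fallback: one element per iteration. */
static void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* out[i] = a[i] + b[i], using AVX when the running CPU supports it. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        add_avx(a, b, out, n);
        return;
    }
#endif
    add_scalar(a, b, out, n);
}
```

The runtime check matters because executing an AVX instruction on a CPU that lacks it faults. Combining a kernel like this with one thread per core is the kind of multi-threading plus SIMD approach the contest winners used.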

Enhanced front-end and execution engine

— New decoded Icache component that improves front-end bandwidth and
reduces branch misprediction penalty.
— Advanced branch prediction.
— Additional macro-fusion support.
— Larger dynamic execution window.
— Multi-precision integer arithmetic enhancements (ADC/SBB, MUL/IMUL).
— LEA bandwidth improvement.
— Reduction of general execution stalls (read ports, writeback conflicts, bypass
latency, partial stalls).
— Fast floating-point exception handling.
— XSAVE/XRSTORE performance improvements and XSAVEOPT new
instruction.

Cache hierarchy improvements for wider data path

— Doubling of bandwidth enabled by two symmetric ports for memory
operation.
— Simultaneous handling of more in-flight loads and stores enabled by
increased buffers.
— Internal bandwidth of two loads and one store each cycle.
— Improved prefetching.
— High bandwidth low latency LLC architecture.
— High bandwidth ring architecture of on-die interconnect.

System-on-a-chip support

— Integrated graphics and media engine in second generation Intel Core
processors.
— Integrated PCIE controller.
— Integrated memory controller.


Saturday, April 6, 2013
