Concurrency, was Re: Doh! Stupid Programming Mistakes <humor>

Levi Pearson levi at
Fri Oct 27 12:29:34 MDT 2006

On Oct 27, 2006, at 11:36 AM, Michael L Torrie wrote:
> Besides all this, computing is evolving to be distributed nowadays, with
> a non-unified memory architecture.  Nodes do not share memory; they
> communicate with a protocol.  There's a reason why in super-computing
> MPI and other message-passing protocol schemes are king.  Threads
> obviously don't make sense in any kind of distributed architecture.  Now
> I believe that OSes and computing systems will be designed to hide this
> fact from the programs, allowing normal programs to be spread
> dynamically across nodes.  Maybe through some system that emulates
> shared memory and local devices (mapping remote ones).  Even in a system
> that emulates shared memory (say by swapping pages of memory across
> nodes), your threads may think they are not copying memory (accessing it
> directly) but are not.  Besides that fact, I think it's probably a bad
> idea to code with any particular assumptions about the underlying
> machine architecture (vm or not).

There have been efforts to build distributed shared memory systems, but
I think they are fundamentally misguided.  Even with today's high-speed,
low-latency interconnect fabrics, remote memory access is still so much
slower than local access that hiding the difference behind an
abstraction layer is counterproductive.  To predict the performance of
your system, you still need to know exactly when an access is local and
when it is remote.  Considering that the whole point of these systems is
high performance, abstracting away such an important performance factor
is not particularly wise.
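To make that concrete, here is a minimal sketch (illustrative only, not
any real DSM API -- the class and names are made up) of why transparent
remote access is a trap: the caller writes identical code for two reads,
but one of them silently pays interconnect latency.

```python
import time

# Simulated interconnect latency for a "remote" fetch (illustrative).
REMOTE_DELAY = 0.01

class TransparentStore:
    """Looks like local memory, but some keys secretly live 'remotely'."""
    def __init__(self, local, remote):
        self._local = local
        self._remote = remote

    def __getitem__(self, key):
        if key in self._local:
            return self._local[key]   # fast path: local access
        time.sleep(REMOTE_DELAY)      # hidden cost: simulated remote fetch
        return self._remote[key]

store = TransparentStore(local={"a": 1}, remote={"b": 2})

start = time.perf_counter()
x = store["a"]                        # local read
local_cost = time.perf_counter() - start

start = time.perf_counter()
y = store["b"]                        # remote read, same syntax
remote_cost = time.perf_counter() - start

# Identical-looking accesses, wildly different costs -- exactly the
# information the abstraction layer took away from the programmer.
print(x, y, remote_cost > local_cost)
```

The two subscript expressions are indistinguishable at the call site,
which is the whole problem when performance is the point.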

This is especially true the less tightly connected your compute nodes
get.  A multi-processor machine with a HyperTransport bus can probably
get away with hiding the distinction between local and remote memory
access.  In a multi-node cluster connected by an InfiniBand fabric, the
latency difference between local and remote access becomes significant,
but one can typically assume fairly low latency and fairly high
reliability and bandwidth.  A cluster on gigabit Ethernet moves to
higher latency and lower bandwidth, and a grid system whose nodes span
multiple networks makes treating remote operations like local ones
downright insane.

Add these details to the increased difficulty of programming in a
shared-state concurrency system, and it starts to look like a pretty
bad idea.  There are plenty of established mechanisms for concurrency
and distribution that work well and provide a model simple enough to
reason about effectively.  Letting people who are used to writing
threaded C/C++/Java code on platforms with limited parallelism carry
their paradigms over to highly parallel systems is NOT a good idea.
Retraining them to use MPI, tuple spaces, or some other reasonable
mechanism for distributed programming is definitely worth the effort.


More information about the PLUG mailing list