Parallel Programming

647e430ca6b2c38d008dc55b1c3a7ecc
0
karligula 101 Feb 23, 2007 at 12:13

Hi everyone,

With the recent news about Intel showing off its Terascale 80 core chip, and the latest NVidia GPUs having 128 cores, all of which are becoming more general purpose by the week (so it seems), I decided to have a look at the parallel programming literature… seems like nobody has really found a convincing way yet of automatically parallelising an arbritrary program.

I’ve read about stuff like SDKs that provide pre-programmed parallelism, and transactional programming at a hardware level… although both sound useful, neither of these approaches seem to fit the bill when considering systems that may contain thousands or even millions of processing cores.

I’ve also read something somewhere that said progress in computing (at least from a software perspective) could grind to a halt five years from now if nobody can find a solution to this problem. There wouldn’t seem to be much point in providing thousands of processing cores if nobody can figure out how to program them.

So what do you guys think about such things? Do you think it’s insurmountable? Are most programs just inherently serial and can’t be realistically parallelised? Or has someone got some cunning ideas?

17 Replies

Please log in or register to post a reply.

A9102969e779768e6f0b8cb87e864c94
0
dave_ 101 Feb 23, 2007 at 13:15

@karligula

the latest NVidia GPUs having 128 cores

Err right? Where did you hear that? Probably a sloppy journalist.

They have 128 “Stream Processors”. They are not equivalent to a core at all!
Even the groups of them (of which it only has 8 of) aren’t equivalent to a separate core because of the lack of control.

The 80 intel chip isn’t really usable now either, just a proof of concept.

I dont understand why you post these vague (shallow) questions. You’d get more insight reading a book, or looking on wikipedia or just reading the first few search results on google.

647e430ca6b2c38d008dc55b1c3a7ecc
0
karligula 101 Feb 23, 2007 at 14:06

I ask these ‘vague shallow questions’ in a vague shallow attempt to get people’s opinions and provoke people to think about the question, not to get reply like ‘read a book’ or ‘search google.’ OF COURSE I could read a book or search google if I wanted some dry factual information. I wanted your opinion on the topic. Where else am I supposed to go to get an experienced human opinion? What is this forum for if not to ask questions and engage in debate? If it’s purely here as a database of programming techniques then there are a million places on the web where people can get much better information than a forum.

Dave I think we’re just here for different reasons and if you don’t like my questions then please just don’t reply to them.

C24eb7e6aaefba78b94c831ddc7b4d0b
0
donBerto 101 Feb 23, 2007 at 15:03

Someone once said…

give a little, get a little

So in order for you to fish out opinion, first you must ‘seed’ an opinion. True, you have ‘provoked’ some of us with your question(s) but what about you? what do you think is going to happen in 5 years in respect to parallel processing? Do you think such problems are insurmountable?

I don’t think anything is insurmountable, given time. I don’t think everything can be parallel-ized to the point that there would be 1 core per 1 binary instruction. I believe that we’re moving in the right direction with distributed systems, among other ways to achieve parallelism.

A9102969e779768e6f0b8cb87e864c94
0
dave_ 101 Feb 23, 2007 at 15:48

Well put donBerto…

By shallow I mean you haven’t shown any depth in understanding the subject. Not least with the “facts” you presented. The questions you ask aren’t really debatable. They’ve been solved a long time ago. Sure, ‘average’ programmers don’t know to do it but they need to learn.
This isn’t a new topic highly parallel machines have existed for decades. They just weren’t available on the desktop.
Its well know that its difficult to achieve more than 25% efficiency on a parallel machine. The software effort to make things go faster often isn’t worth it. By the time you’ve made it more efficient, you could have just bought a faster computer. So far this hasn’t changed. It might in future.

Here’s a interesting read about valve’s approach

A8433b04cb41dd57113740b779f61acb
0
Reedbeta 168 Feb 23, 2007 at 15:56

@karligula

seems like nobody has really found a convincing way yet of automatically parallelising an arbritrary program.

Nope, and I doubt this will ever (well. for a long time) be possible to do automatically for arbitrary programs…it would require too much ‘intelligence’ on the part of a tool. Programmers will have to write their programs for parallelism, not write them serially and depend on tools to make them parallel. It’s kind of like writing a program in procedural C and wanting a tool to convert to well-designed object-oriented code…not going to happen.

Are most programs just inherently serial and can’t be realistically parallelised?

Many programs that exist today are indeed inherently serial. However, the problems that those programs solve frequently aren’t. It’s possible to write many, many things in a parallel way, but you may have to spend some time thinking about it to see how.

There wouldn’t seem to be much point in providing thousands of processing cores if nobody can figure out how to program them.

You hit the nail on the head here. Most programmers still aren’t used to thinking parallel and this is one of the major obstacles to concurrent program development. I expect concurrent thinking will eventually be taught in intro CS courses and programming books as a basic concept like loops and functions. (And built-in language support for concurrency, like Java has to some extent, will go a long way toward making this easier.)

Of course…most computer users still just run one application at a time and use their computer for word processing, browsing the web, watching videos, and so forth. Multiprocessors are never going to be a great advantage here; they can help a bit, but in these cases the majority of the CPU time is spent waiting for the user anyway. (At least until we get AI operating systems capable of speech and gesture processing!)

647e430ca6b2c38d008dc55b1c3a7ecc
0
karligula 101 Feb 23, 2007 at 17:43

Dave… you say these things were solved a long time ago. Perhaps for just a handful of parallel threads they were, but I was wondering about the difficulties involved when you have thousands of cores available. Also, the NVidia GPU might not be cores in the literal sense right now, but sooner or later they WILL be fully programmable general purpose processing cores, so it seems a bit pedantic to labour that particular point.

DonBerto… give a little get a little…. that’s a good point well taken, I’ll try to elucidate my thinking a bit more.

Right, so given an arbritrary program, what are the difficulties involved in having the compiler scan the code, figuring out which bits are dependent on others, which bits are independent and thus capable of being parallalised, and inserting sequence points to make sure everything stays synchronised?

Would creating programs using a much finer granularity of objects help, where the objects can run independently and the cores would communicate the object mesages? Could that be automatically synchronised?

Perhaps there might be language constructs to help, eg:

parallel for(x=0;x<1000;x++)
{
parallel for(y=0;y<1000;y++)
{
dosomehorrendouslycomplicatedtask(x,y)
}
}

which is a simple example that can obviously be done manually by launching threads but a language construct would make it simpler.

Anyway I’m writing this at the office, my brain is shutting down faster than Windows ever has and it’s time for home… ;-)

19a0c386a729b505b64aa790c15ab60e
0
Cowboy_Coder 101 Feb 23, 2007 at 23:36

I don’t think the issue of increasing numbers of cores is a “problem” at all. It’s more of an opportunity. CPUs are not going to get slower, they are just going to have more cores. Since it’s relatively easy to parallelize large chunks of game engines, more cores = more speed. The challenge will be in how closely you can approach 100% utilization of your multiple cores. That is some some senses an academic issue. If you’ve got 80% of your code running in parallel (assuming it’s fine grained enough), you’ll always get speed-ups by adding more cores. There are many other issues in game development that contribute to the bottom line besides CPU utilization.

I wrote a couple of articles on multi-core for games, but focusing more on the current generation of 2-4 core CPUs.

http://cowboyprogramming.com/2007/01/05/multi-core-processors/

http://cowboyprogramming.com/2007/01/05/multithreading-particle-sytems/

19a0c386a729b505b64aa790c15ab60e
0
Cowboy_Coder 101 Feb 23, 2007 at 23:55

@dave_

Its well know that its difficult to achieve more than 25% efficiency on a parallel machine. The software effort to make things go faster often isn’t worth it. By the time you’ve made it more efficient, you could have just bought a faster computer. So far this hasn’t changed. It might in future.

Your 25% number is a little specious. Game engines generally process a large number of atomic objects, which is a great candidate for parallization, especially in physics and graphics. AI is harder, but you can also do the approach of pipelining when moving from 1 to 2 cores. You could easily get a 50% improvement at least on a game engine in less that a month’s work.

8e9aaea714077ea62a0c7f81cfb7672a
0
Razor 101 Feb 24, 2007 at 01:28

Um. It doesn’t seem that hard to make things automatically parallel, in fact I think functional languages support this already (though I haven’t tried one yet, so I could easily be wrong). Without all those pesky side effects, what’s to stop you running a loop on several processors at once? It’s just like the case mentioned earlier about object orientation. You can’t automatically upgrade a program that wasn’t built for it.

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 Feb 24, 2007 at 03:42

@karligula

Perhaps there might be language constructs to help, eg:

parallel for(x=0;x<1000;x++)
  {
  parallel for(y=0;y<1000;y++)
    {
    dosomehorrendouslycomplicatedtask(x,y)
    }
  }

The DSP compiler I use at work does something like this to some extent. The DSP has 8 alus that can execute code in parallel. Simplified, the compiler detects loops that have no or just simple dependencies between different trips and executes multiple instances of the same loop in the 8 alus. This could be extended to multicore-architectures as well.

However, using this feature has a cost: The compiler needs ages to compile if you enable this optimization feature, and to get good result out of it you have to hint the compiler about the loop in various ways (minimum tripcount, tripcount is always a multiple of x ect).

Nils

19a0c386a729b505b64aa790c15ab60e
0
Cowboy_Coder 101 Feb 24, 2007 at 16:56

OpenMP does something like that.

http://msdn.microsoft.com/msdnmag/issues/05/10/OpenMP/

#pragma omp parallel    
{
#pragma omp for
for(int i = 1; i < size; ++i)
        x[i] = (y[i-1] + y[i+1])/2;
}
66cf72d07de447c933f5354fd382b5cc
0
nassimamar 101 Feb 25, 2007 at 15:45

My input on this, as I recently had to write a paper on the CBE (Cell Processor), is to rework compilers. Along the lines of OOO (Out of Order) or In order compilation, and vectorized programming, they should think about redesigning compilers in a way where it would analyze your code and best divide how each process in your code will best be dealt with that including to which core or processor it’ll be sent to. Ofcourse this is all theoritical…. :\^)

A9102969e779768e6f0b8cb87e864c94
0
dave_ 101 Feb 25, 2007 at 18:10

@Cowboy Coder

Your 25% number is a little specious.

Its a number I was quoted last time I visited a data centre.
Thats 25% of max theoretical through-put.
Even the very fastest super computers in the world with specially crafted software only achieve around 80%.
I wasn’t talking about 1-2 processors I was talking about thousands, like the OP wanted to talk about.

A9102969e779768e6f0b8cb87e864c94
0
dave_ 101 Feb 25, 2007 at 18:33

@karligula

Dave… you say these things were solved a long time ago. Perhaps for just a handful of parallel threads they were, but I was wondering about the difficulties involved when you have thousands of cores available. Also, the NVidia GPU might not be cores in the literal sense right now, but sooner or later they WILL be fully programmable general purpose processing cores, so it seems a bit pedantic to labour that particular point.

The trouble is those facts where what you used to formulate your hypothesis so its not just pedantry.

Well, I’ve done a few courses on parallel processing, even worked on projects (in a paid job) where the server was on a cluster and worked on PS2/PS3, Xbox360, so I do have a little insight.

The trouble is, this is a huge subject. What can I hope to say of any worth in a few sentences here?

With many good books. I’d suggest you read one. For starting I’d recommend Tanenbaum but it is a bit dry. Its a good reference when combined with some other forms/sources of learning.

GPGPU stuff is great… I seriously doubt they’ll become fully programmable cores. If they were why would you not just use a second main CPU? More likely is that main CPU will gain special vector units like 286/386 gained integrated FPU.

I’m sure they’ll keep their specialist pipelines and complement traditional CPUs for many years to come.

However what if we were to take in further and put an FPGA into a PC. Now that would be interesting. You could re-configure it for every app for a special pipeline and get some interesting parallel computation done.

19a0c386a729b505b64aa790c15ab60e
0
Cowboy_Coder 101 Feb 25, 2007 at 19:44

@dave_

Its a number I was quoted last time I visited a data centre.
Thats 25% of max theoretical through-put.
Even the very fastest super computers in the world with specially crafted software only achieve around 80%.
I wasn’t talking about 1-2 processors I was talking about thousands, like the OP wanted to talk about.

Data center code is not the same as game code. And it’s worth noting that even essentially single CORE processor systems like the PS2 don’t ever do anything like their theoretical FLOPS throughput, mostly due to cache issues.

Of course there are problems trying to reach the maximum efficiency. It will vary by application. 25% just seems rather low - perhaps that is reflecting just attempting to scale an existing single threaded application. A rendering engine targeting a 1024+ core machine should be able to get way higher than that.

8e9aaea714077ea62a0c7f81cfb7672a
0
Razor 101 Feb 26, 2007 at 00:47

I don’t claim to know much about this, so take this with a table spoon of salt, but:
@dave_

I seriously doubt they’ll become fully programmable cores.

It looks to me like things are heading that way. You must know about CUDA right? And it wouldn’t make them just another processor by any means. They’re stream processors, with their own strengths and weaknesses.
@dave_

More likely is that main CPU will gain special vector units like 286/386 gained integrated FPU.

You mean other than MMX, 3dnow, SSE1, 2 and 3…?

A9102969e779768e6f0b8cb87e864c94
0
dave_ 101 Feb 26, 2007 at 09:03

@Razor

It looks to me like things are heading that way. You must know about CUDA right? And it wouldn’t make them just another processor by any means. They’re stream processors, with their own strengths and weaknesses.

Yes, its a solves a set of specialist problems well. I’m sure the pipelines will stay specialized. but what do I know, only time will tell.
@Razor

You mean other than MMX, 3dnow, SSE1, 2 and 3…?

Again yes SSE/MMX/3dnow are SIMD floating point (although they were never seperate), but a GPU is best at matrix/vector ops which are much more complicated.