EdEarl Posted June 20, 2016
Phys.org: New chip design makes parallel programs run many times faster and requires one-tenth the code

In the May/June issue of the Institute of Electrical and Electronics Engineers' journal Micro, researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) will present a new chip design they call Swarm, which should make parallel programs not only much more efficient but easier to write, too. In simulations, the researchers compared Swarm versions of six common algorithms with the best existing parallel versions, which had been individually engineered by seasoned software developers. The Swarm versions were between three and 18 times as fast, but they generally required only one-tenth as much code, or even less. And in one case, Swarm achieved a 75-fold speedup on a program that computer scientists had so far failed to parallelize.

I think this is huge.
Sensei Posted June 20, 2016 (edited)
I think they should publish example source code for review, and say whether they mean C/C++ or asm. Intel is not ideal here, as it has relatively few registers. The Motorola 68000, 30 years ago, had 16 data/address registers, 15 of them general purpose. If a CPU has to constantly shuttle data between regular memory/stack and registers just to do a simple calculation on it, then adding dozens or hundreds of registers will shorten the code (in asm; in C/C++ the source stays the same), because all local function variables land in CPU registers instead of memory/stack.
Edited June 20, 2016 by Sensei
Strange Posted June 20, 2016
Swarm intelligence is a fairly old idea (and claims of easier parallel programming are even older). I couldn't find any detail of the thing mentioned in the OP, so I will have to reserve judgment. EdEarl, do you have a link?
EdEarl Posted June 21, 2016 (Author)
No, I don't have a link. Parallel programs process multiple streams of data through multiple pipelines; thus, data is spread through time (the first dimension) and across multiple pipelines (the second dimension). A data conflict occurs when one pipeline is processing data that will shortly be written to memory, and another pipeline asks for that data before it has been stored. It sounds like the timestamps and priority queue ensure the conflict is resolved with accurate data. I need some time to think about it.
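The timestamp-and-priority-queue idea described above can be sketched in software. Below is a minimal Python sketch (my own illustration, not Swarm's actual hardware mechanism): tasks carry virtual timestamps, and a priority queue always runs the earliest-timestamped task next, so a task that writes a value commits before a later-timestamped task reads it, regardless of the order the tasks were submitted.

```python
import heapq

def run_in_timestamp_order(tasks):
    """Execute (timestamp, name, fn) tasks earliest-timestamp-first,
    so a write at time t is committed before any read at time t' > t."""
    heap = list(tasks)
    heapq.heapify(heap)
    log = []
    while heap:
        ts, name, fn = heapq.heappop(heap)
        fn()
        log.append(name)
    return log

memory = {}
tasks = [
    # Submitted first, but timestamped later: must see the write below.
    (2, "read_x", lambda: memory.setdefault("seen", memory.get("x"))),
    (1, "write_x", lambda: memory.__setitem__("x", 42)),
]
order = run_in_timestamp_order(tasks)
# The write (timestamp 1) runs before the read (timestamp 2),
# even though the read was submitted first.
```

In real Swarm hardware the tasks run speculatively in parallel and are rolled back on conflict; this sequential sketch only shows the ordering guarantee the timestamps provide.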
TakenItSeriously Posted June 21, 2016
I've simulated parallel processing using timers in AHK scripts when creating HUDs for online poker, which was kind of cool. It wasn't for efficiency but for continuously scanning for a number of trigger events over multiple tables. It just queued up the commands in a single pipe, though. Just curious: is the number of pipes limited to the number of cores you have available? Also, I don't understand how it could require 1/10 the code.
Strange Posted June 21, 2016
The paper is here: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=7436649
And the news story: https://www.csail.mit.edu/parallel_programming_made_easy
So it doesn't sound like there is anything very novel here in terms of concepts; similar methods have been implemented on standard hardware before. But with modern technology you can put a lot of these processors (and a lot of memory) on one chip.

Quoting TakenItSeriously: "Also, I don't understand how it could require 1/10 the code."
Because in parallel programming there is typically a large overhead for creating tasks, synchronizing and scheduling tasks, communication between tasks, etc. Explicitly parallel programs are also often more complex because they have to handle splitting the work or data up and then recombining the results. Where a processor has hardware support for these things, you can write much simpler code.
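The overhead point is easy to see even in a toy example. The sketch below (Python, my own illustration) parallelizes a simple sum across worker threads: most of the code is partitioning, thread management, and recombining, rather than the actual work.

```python
import threading

def parallel_sum(data, n_workers=4):
    """Sum `data` across threads. Most of this code is overhead:
    splitting the input, creating/joining threads, merging results."""
    chunk = (len(data) + n_workers - 1) // n_workers  # split the work
    partials = [0] * n_workers
    def worker(i):
        partials[i] = sum(data[i * chunk:(i + 1) * chunk])
    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:           # create and launch tasks
        t.start()
    for t in threads:           # synchronize
        t.join()
    return sum(partials)        # recombine the results

# The sequential version of the same computation is one line: sum(data)
```

If hardware handles task creation, scheduling, and conflict resolution, as Swarm claims to, the program shrinks back toward the one-line sequential version, which is plausibly where the "one-tenth the code" figure comes from.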
EdEarl Posted June 21, 2016 (Author)
Quoting TakenItSeriously: "Just curious, are the number of pipes limited to the number of cores you have available? Also, I don't understand how it could require 1/10 the code."
My experience running business data-processing programs on parallel computers suggests about 10% additional code is necessary; reducing that 10% by 1/10th would be significant but not remarkable. Pipes per core applies to Multiple Instruction Multiple Data (MIMD) systems. I think the term "core" was coined because processors became faster than memory, and one processor with two sets of registers can process two streams of data. If one processor works on two streams of data by time slicing, is that two pipes per core or one? A programmer doesn't need to know pipes per core; it is a parameter hardware designers manipulate to maximize processor performance. If a processor can run 10x faster than memory, then it can process ten threads of data per core (effectively ten pipelines per core). In addition, a processor may have one or more asynchronous coprocessors for floating point, which means there may be both integer and floating-point pipelines run by one core. Thus, one core may process several threads and multiple data types in parallel pipelines. Hardware can make parallelism decisions better than people can, because it sees data conflicts in real time, and the added complexity of parallel programming can be too difficult for people.
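The 10x figure above can be checked with a simple cycle-count model (my own sketch, with simplifying assumptions: each thread issues one operation per completed memory access, and the core round-robins across ready threads). A single thread on memory that takes 10 cycles per access keeps the core busy only 1 cycle in 10; ten such threads fill every cycle.

```python
def core_utilization(n_threads, mem_cycles):
    """Fraction of cycles a core does useful work when each thread can
    issue one operation per memory access taking mem_cycles cycles.
    Extra threads fill the cycles a single thread would spend waiting,
    up to the point where every cycle is occupied."""
    return min(1.0, n_threads / mem_cycles)

# One thread, 10-cycle memory: the core is busy 1 cycle in 10.
# Ten threads: the memory waits are fully hidden and the core saturates.
```

This is the same latency-hiding idea behind hardware multithreading (e.g. SMT and GPU warp scheduling), reduced to its simplest arithmetic.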
TakenItSeriously Posted June 21, 2016
Quoting EdEarl: "My experience running business data-processing programs on parallel computers suggests about 10% additional code is necessary [...]"
It actually said it requires 1/10 the code. I could see that being a typo, but given the way it emphasized easier programming, I'm not so sure. I thought it was finding efficiencies by using the same piece of code in 10 pipelines instead of ten separate threads? But really, if it's doing the same process, then all ten threads could call one subroutine. I'm also not sure I understand how a slower memory could provide greater efficiencies for parallel pipelines.
It seems like more of a limitation of effectiveness if the memory can't handle the extra bit rates the parallel pipes can process.
Quoting Strange: "Because in parallel programming there is typically a large overhead for creating tasks, synchronizing and scheduling tasks, communication between tasks, etc. [...] Where a processor has hardware support for these things, then you can write much simpler code."
Oh, I see: it's putting much of the code into the hardware. Thanks.
EdEarl Posted June 21, 2016 (Author)
Slower memory makes everything run slower, but fetch times have often been considerably slower than processor cycle times; that's just the way it is. If they could make multi-gigabyte memory with a short fetch time, they would.