
Posted

With electric signals on silicon, we can make the interconnection matrix full and reacting at the speed of individual data packets. A 500*500 switch matrix, power consumption included, fits on a single silicon chip - only the number of in/out pins limits the chip.

 

Do you know a routing method for light that is as good?

Posted (edited)

With electric signals on silicon, we can make the interconnection matrix full and reacting at the speed of individual data packets. A 500*500 switch matrix, power consumption included, fits on a single silicon chip - only the number of in/out pins limits the chip.

 

Do you know a routing method for light that is as good?

Maybe: Manycore Processor with On-Chip Optical Network

Engineering a Bandwidth-Scalable Optical Layer for a 3D Multi-core Processor with Awareness of Layout Constraints

 

Edited by EdEarl
Posted

Thanks for the links!

 

The first, the so-called ATAC design, uses wavelength division multiplexing to send messages over a shared optical waveguide that joins all nodes.

- It is limited to ~100 nodes by the carrier separation

- Good throughput thanks to the fast modulators

- The authors give a chip area for the modulators and the photodiodes but NOT for the filters, which I feel is just dishonest. WDM filters are typically "racetrack" loops which must be long to separate close wavelengths. Just 64 nodes need 64*64 big filters, and the authors hide this.

- Above 100 nodes, the connection matrix cannot be complete. In contrast, electronic silicon switches connect 1000 nodes with one chip and are expandable.

 

I can't access the paywalled second paper. The abstract mentions "network partitioning options", which sounds like the opposite of a "full connection matrix".

 

The third paper, again on the ATAC design, describes a roughly 1000-core paper design, grouped as 64 clusters of 16 cores - that is, not a flexible full matrix. It's a hierarchical design with all the associated bottlenecks.

 

As compared with these designs, a simple silicon chip with N² switches or gates to route the data is better: uniform routing, throughput, absence of interlocks and constraints...

Posted

Having worked on NonStop computer systems for over ten years, I can attest that none of the software systems I worked on were limited by the hierarchical processor interconnect used therein. On the other hand, some problem sets will be limited by it.

 

There is concern among chip manufacturers that power dissipation will limit the number of cores in microprocessor chips, which is one reason optical interconnects are being investigated. Thus, I think dismissing them as inferior is not necessarily justified; often one must live with engineering trade-offs.

Posted

In a supercomputer, the interconnection IS the limiting factor. Very different from a file server, where the tasks are nearly independent from one another.

 

In that sense, the optical interconnects described in the linked papers, which limit the number of nodes and put constraints on simultaneous communications, are inferior to a silicon chip as I suggest in message #22.

Posted

There is a really good reason that supercomputers are switching to things like Juniper's QFabric. Not only are the interconnect speeds (throughput as well as pps, etc) relevant, latency is just as much of an issue, especially when you start spanning over a datacenter.

Posted

Juniper's dense Ethernet interconnect is very nice for data centers, since data centers handle independent requests.

For a supercomputer, their 5µs latency is unbearable. 50ns would be a strong programming constraint, 5ns a reasonable one, 500ps excellent.

The silicon chip suggested in message #22 would be in the 5ns to 50ns range.

Posted (edited)

For a supercomputer, their 5µs latency is unbearable. 50ns would be a strong programming constraint, 5ns a reasonable one, 500ps excellent.

 

Let's examine this hypothesis:

 

Let's see http://www.theregister.co.uk/2010/05/25/cray_xe6_baker_gemini/?page=2

 

Cray Titan is interconnected with the Cray Gemini interconnect:

 

 

The Gemini interconnect has one third or less the latency of the SeaStar2+ interconnect, taking just a hair above one microsecond to jump between computer nodes hooked to different Gemini chips, and less than one microsecond to jump from any of the four processors talking to the same Gemini. Perhaps more significantly, by ganging up many pipes using the high radix router, the Gemini chip can deliver about 100 times the message throughput of the SeaStar2+ interconnect - something on the order of 2 million packets per core per second, according to Bolding. (The amount will change depending on packet size and the protocol used, of course.)

 

 

So, while the jump between the four processors is less than a microsecond (and in my world of high-performance networking equipment, that statement means it will take up to a microsecond), the Gemini chip by itself introduces around 500 nanoseconds of latency internally (that's 1 second divided by 2 million), and that's without any of the jumps to and from the chips, hardware latencies, table lookups, routing, and the hops in between the cores to get to the correct device. I suspect end-to-end latency to be within 10-20 microseconds, though this is an uninformed guess.

 

Let's look further; let's not even look at Cray.

 

From http://www.hpcwire.com/hpcwire/2013-06-02/full_details_uncovered_on_chinese_top_supercomputer.html

 

Tianhe-2 (that's the fastest supercomputer in the world right now)

 

As Dongarra notes, “This is an optoelectronics hybrid transport technology and runs a proprietary network. The interconnect uses their own chip set. The high radix router ASIC called NRC has a 90 nm feature size with a 17.16x17.16 mm die and 2577 pins.”

He says that “the throughput of a single NRC is 2.56 Tbps. The network interface ASIC called NIC has the same feature size and package as the NRC, the die size is 10.76x10.76 mm, 675 pins and uses PCI-E G2 16X. A broadcast operation via MPI was running at 6.36 GB/s and the latency measured with 64K of data within 12,000 nodes is about 85us.”

 

Just to conclude: they feel that an 85 microsecond delay (end to end) is reasonable for the fastest supercomputer in the world.

Edited by AtomicMaster
Posted

Microseconds, and a 48-port matrix capable of building only a hypertorus - yuck.

An expandable full matrix of 1000 ports with 50ns latency, like in message #22, is obviously better.

This has immediate consequences on the computer's efficiency and the way it can be programmed.

Posted

A fully parallel-connected 32-bit x 8 i/o x 2 op/clock running at 1.25GHz would be even better for supercomputers, especially using optical interconnects; it's just that you can build a much cheaper alternative out of commodity hardware that will, for less money and in a matter of months, perform better...

 

The problem with your approach, by the way, is throughput. With small microcontrollers you may be able to get away with using I2C for messaging, but if you look at new GPUs, for example, interconnecting those at any reasonable rate would need about 256-512 bits of I/O per clock at over a gigahertz - unless you are thinking of using I2C as a job messaging and tracking bus and then utilizing a different backbone for data? In that case you could indeed message faster than traditional systems, but at the cost of higher complexity and new sync issues.

Posted

I2C is for a demo. A supercomputer would use fast links.

 

Presently, parallel computers have serial interconnects, and their network is sparse. Their software can more or less exploit these computers because the parallel tasks exchange far less data than they compute. This is a heavy constraint on software and a known heavy limit on present performance.

 

What I propose in message #22 is a similar throughput, that is serial links at the maximum possible speed, but with a full interconnection matrix. This is feasible for a thousand nodes with one chip, and is expandable, while a parallel bus doesn't allow a good number of nodes.

 

Fortunately, even a serial bus improves over the present networks. A bus with graphics-Ram throughput would be nice, but then the interconnection network is impossible.

Posted

I am confident that one might design a matrix multiplication unit for arrays of 1000² x 1000² (or any other size) that theoretically operates only slightly slower than a single floating-point multiply-add. Matrix multiply is the kind of algorithm that requires the highest amount of data movement among multiple processing units. The communications hardware needed for moving this data requires many transistors, significant energy to operate, and corresponding heat dissipation. These costs are significant, yet irrelevant if one is doing many other algorithms, for example vector add.

 

Consequently, parallel processors are designed in various ways, including single instruction stream with either single pipeline data processing (SISD) or multiple pipeline data processing (SIMD), and multiple instruction stream with either single pipeline data processing (MISD) or multiple pipeline data processing (MIMD) architectures. Each of these architectures efficiently processes a variety of algorithms, and does not efficiently process other algorithms.

 

For cost and efficiency reasons, parallel processors are built to solve a particular set of problems, engineering tradeoffs are often considered, and there is no real low-cost parallel processor to solve all algorithms efficiently. It seems likely that most multiprocessor technology will be tuned for high volume applications, such as client-server. Moreover, as design and manufacturing costs become smaller, more special purpose multiprocessors will be developed for special applications. It seems unlikely that a general purpose multiprocessor will ever be practical.

Posted

It goes without saying, but better if I say it: this is what the full switch matrix can look like.

post-53915-0-82118400-1384703415.png

 

A present-day chip package has 1500 or 2000 pins. Serial data links, differential, without separate signalling, take 2 pins each. 256 data inputs and 256 data outputs fit in one package. The full switch matrix on the chip can be as small as 64k gates plus glue, tiny by present processes. Coupling by capacitance rather than pins may allow more inputs-outputs.

Some data buffers would fluidize the data transit, since several data sources may compete for a destination. Present chips would permit a plethoric 5kb per matrix node for maximum flexibility, but I prefer the data buffers near the inputs and outputs to keep the matrix tiny, since chip area greatly increases the propagation time - including for conflict resolution. A 256² matrix has the size and consumption of a 64b multiplier, and it shall reach the same 3 to 5 GHz for an asymptotic 1Tb/s.
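To make the conflict handling concrete, here is a minimal software model of one cycle of the full matrix; the names and the fixed-priority arbitration are mine, purely illustrative:

#include <stdbool.h>

#define N 256                                  /* matrix inputs = outputs       */

typedef struct { bool valid; int dest; } Flit; /* head of one buffered message  */

/* One cycle: each output accepts at most one competing input; the losers
   simply stay in their input buffer and retry next cycle.                      */
void matrix_cycle(Flit in[N], Flit out[N])
{
    bool taken[N] = { false };

    for (int o = 0; o < N; o++) out[o].valid = false;

    for (int i = 0; i < N; i++) {              /* fixed-priority arbitration    */
        if (!in[i].valid) continue;
        int o = in[i].dest;
        if (!taken[o]) {
            out[o]      = in[i];               /* switch point i -> o closes    */
            taken[o]    = true;
            in[i].valid = false;               /* message leaves its buffer     */
        }                                      /* else: stays buffered, retries */
    }
}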

Trees can spread data to and gather from the matrix to reduce the gate load but not the line length. Repeaters, or regenerative receivers as in a dram, let the propagation time increase as the line length rather than its square. Registers, for instance at the tree branches, allow systolic operation; maybe the conflict resolution can propagate as data does, provided a systolic chain can be stopped.

At 5GHz, 20 destination address bits and 150 data bits then need about 40ns to cross a chip.
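For reference, the 40ns figure decomposes roughly as serialization plus crossing time; the split is my reading of the paragraphs above:

\[ t \approx \frac{20 + 150\ \text{bits}}{5\ \text{GHz}} \approx 34\ \text{ns of serialization, plus a few ns of on-chip propagation and arbitration} \approx 40\ \text{ns} \]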

The adapters shall speak the processing nodes' protocol (often multibit and with separate signal lines, like Pci-E or the interconnect links of existing processors), adapt the data rates, buffer some data, translate to unidirectional signals for the matrix... I wouldn't let them interconnect the processing nodes, as this would bring holes in the switch matrix. But they might manage a small redundancy in the matrix.

The matrix chip's data links could be bidirectional to permit a full matrix for 512 processing nodes, but this multiplies the conflicts, and they get more difficult to signal through special voltages. The links might even speak natively the nodes' protocol if (made) suitable, suppressing the adapters.

A small supercomputer, or maybe an Internet or data server, affordable to many companies, can have one matrix chip and 512 processing nodes... but we want also big supercomputers, don't we?

 

post-53915-0-74070100-1384703490.png

 

More matrix chips make a bigger matrix, as expected. Again, data buffers would fluidize the traffic, especially if propagation times make conflict signalling and resolution slower than the data rate. Again, trees can spread data to and gather from the matrix chips; repeaters, regenerative receivers, systolic operation can apply as well. The propagation time increases little.

256*256=64k matrix chips, each with 256 inputs and outputs, make a full matrix for 64k processing nodes. Since the matrix chips are but a package, we can exaggerate further. One limit is cabling among the chips, but the matrix can split reasonably among several printed circuits.

----------

Even at the matrix chips, the data links could be several bits wide. This reduces the message duration hence propagation time, including for the destination address, but the number of switch points on each chip falls as the width squared: for smaller computers maybe.

To increase the data rate, one better puts more complete matrices as parallel layers. This takes fewer chips, and since the matrix layers connect only at the processing nodes, cabling is easier. Each layer needs a full destination address, so it better routes messages than individual bits.

The matrix, full and with moderate delays, makes a multiprocessor much easier to program efficiently. Would it usefully interconnect many processing nodes within one chip? Maybe: it's a matter of propagation delays within the chip, and these have worsened in recent years.

Presently, processing nodes have few packages with several cores each, connected by a limited bus to a multichip Dram. Fabrication processes begin to mix Dram and cores, claiming to increase the cache size. I imagine instead that the full Dram capacity should be on such mixed chips, to take full advantage of the wider and more reactive Dram bus. Rather than, say, 32 Dram chips feeding one Cpu with 8 Sse or Avx crunchers, we could have 32 mixed chips with Dram and one scalar Mac each, and a good flat switch matrix between all these chips, for a simpler programming model. If chips carry several cores each, these can communicate through the external general matrix.

Marc Schaefer, aka Enthalpy

Posted

Ram modules offer 8GB over 16 chips at the end of 2013, so a future Dram+Cpu chip can contain 4Gb=0.5GB=64MWord, a scalar Cpu and some cache. 0.5GB per scalar Cpu is as much as 4GB per quad Sse core.

The Dram on the same chip can be accessed over 32kb = 4kB = 512Word lines, matching the cache's pages. A 2048b bus at 2GHz would add 8ns. A read or write time of 30ns would provide 1Tb/s = 136GB/s = 17GWord/s, enough to feed a 5GHz multiplier-accumulator: the "golden rule" again permits single-loops at Cpu speed across the whole Dram, not just within a cache. Back to efficient and reasonably simple software. Maybe registers and a single cache level suffice. Only the interprocessor network is slower - and virtual memory if any.
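A quick check of these figures, using only the values in the paragraph above (C for lack of a better notation; rounding explains the small differences):

#include <stdio.h>

int main(void)
{
    double line_bits = 32768;       /* 32 kb Dram line = 4 kB = 512 words */
    double bus_bits  = 2048;        /* on-chip bus width                  */
    double bus_hz    = 2e9;         /* 2 GHz                              */
    double access_s  = 30e-9;       /* Dram read or write time            */

    double transfer_ns = line_bits / bus_bits / bus_hz * 1e9;  /* added by the bus */
    double rate_bps    = line_bits / access_s;                 /* sustained rate   */

    printf("bus adds %.0f ns per line\n", transfer_ns);
    printf("%.2f Tb/s = %.1f GB/s = %.1f GWord/s\n",
           rate_bps / 1e12, rate_bps / 8e9, rate_bps / 64e9);
    return 0;
}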

----------

Important algorithms need a fast access to nonconsecutive but evenly spaced data, for instance this row k in a product of matrices:

 

post-53915-0-51087700-1384894677.png

 

Workarounds exist, especially for this example, but they can be cumbersome and their multiple Dram accesses still inefficient.

Though, if the data spacing and the number of words in the Dram line are mutually prime, every needed word is in a different row of the Dram, allowing its full throughput. The old BSP had 17 memory banks - a prime number - for maximum chances. Description in:

"Parallelism - The Design Strategy for the BSP" on pages A-24 and A-25

Our Dram can also be 521 words wide instead of 512. If some matrix has a multiple of 521 rows, the compiler shall add a dummy one.

Now, each word row in the Dram must compute its scaling product and Euclidean division on the address. Though, far from one multiplier per row, the successive multiples need only one adder per row, still fast if choosing well the paths, and the modulo 521 (0x209) is a narrow conditional subtraction aided by the many zeroes; or a special instruction precomputes all multiples. This extra hardware needs only be faster than the Dram and is common to all registers for scaled indexed access. A Dram+Cpu chip achieves this, separate ones not easily.
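A tiny demonstration of why the prime width works - illustrative C, with the stride taken from the matrix example above:

#include <stdio.h>

#define LINE 521                    /* words per Dram line, a prime number */

int main(void)
{
    int stride = 512;               /* e.g. walking down one column of a 512-wide matrix */
    int hit[LINE] = {0};

    for (int n = 0; n < LINE; n++) {
        int addr = n * stride;      /* scaled index of the n-th element   */
        int col  = addr % LINE;     /* word position within its Dram line */
        hit[col]++;
    }
    for (int c = 0; c < LINE; c++)
        if (hit[c] != 1) { puts("collision"); return 1; }
    puts("521 accesses hit 521 distinct word positions: full Dram width usable");
    return 0;
}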

----------

The BSP was a vector processor, operating to and from the main Ram - but many programmes need local random access, hence from cache memory. With cache pages that map consecutive addresses, and also scaled vector registers, cache coherence is difficult...

With extra hardware:

  • Each word of a vector register could memorize its Dram address and its cache page if any
  • On detecting a collision between cache and vector registers, the complete older one would be flushed to the Dram
  • Or any new word would be written to both the cache and the vector register

Not so good. Writes to the cache while a vector register slowly reads the Dram mean collisions hard to detect. It also forbids to read and write a vector register at different locations. I prefer instead to have a special "scaled copy" instruction:

 

double tmp[N];                          // Prefer N=512
#pragma GiveHintToOptimizer
for (int n = 0; n < N; n++)
    tmp[n] = matrix[n][k];              // the optimizer sniffs the ScaledRead instruction
DoWhatYouNeed(tmp);
for (int n = 0; n < N; n++)
    matrix[n][k] = tmp[n];              // if useful: ScaledWrite instruction

 

Then, changes made to matrix[...][k] before tmp[] is written back are the programmer's bug, not my hardware's fault.

 

The instruction works conceptually at the memory on tmp [], but can target instead a cache page and just mark the Dram range as dirty. Copying long data at once would write the cache to the Dram in tmp [] as usual, with limited overhead, but a smart programmer would instead have a 512 word tmp [] and use the data read from Dram before overwriting it with new scaled reads.

----------

Fourier transformation is a common activity of supercomputers and needs bit-reversed addressing. This too can access data at full Dram width, with the Dram and some hardware on the Cpu chip, like:

 

double tmp [N]; // Prefer N=512
#pragma GiveHintToOptimizer
BitReversedVectorCopy (signal, tmp, wherebegin, howmany); // The library calls the instruction
ButterflyChunk (tmp);

Again with favourable chunk size, and maybe alignment, chosen by a smart programmer.
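In software, the address permutation behind such an instruction is just a bit reversal; a minimal C model follows (the instruction name above is from the post, the helper below is mine):

/* what BitReversedVectorCopy would do, modelled word by word in software */
static unsigned bit_reverse9(unsigned i)    /* 9 bits because N = 512 = 2^9 */
{
    unsigned r = 0;
    for (int b = 0; b < 9; b++) {
        r = (r << 1) | (i & 1);             /* shift the lowest bit in, reversed */
        i >>= 1;
    }
    return r;
}

/* tmp[n] = signal[wherebegin + bit_reverse9(n)] for n = 0 .. 511 */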

----------

The old Cyber-205 and a few others offered computed addressing, where an address vector defines where to read or write a data vector in Ram:

 

double DataInRam [many]; int where [N]; double tmp [N];
BizarreAddressComputation (where);
VectorPeek (DataInRam, where, tmp, N);
ArbitraryProcessing (tmp);
VectorPoke (tmp, where, DataInRam, N);

 

This gives zero guarantee against access collisions among the words in Dram, so either the Dram access can last very long, or it must fill what it can in one Dram cycle and mark the meaningful data... Then BizarreAddressComputation () could keep the pending requests and complete where [] with new ones while the Alu/MAdd processes the available data.
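A minimal software model of that fill-and-mark behaviour, with all names hypothetical and assuming one word per Dram word row per cycle:

#include <stdbool.h>
#include <string.h>

#define N    512                    /* vector length                   */
#define LINE 521                    /* word rows across the Dram line  */

/* One Dram cycle of VectorPeek: serve the requests that fall in distinct
   word rows, mark them meaningful, and leave the rest pending for the
   next cycle.                                                           */
void VectorPeekOneCycle(const double *DataInRam, const int *where,
                        double *tmp, bool *meaningful, bool *pending)
{
    bool row_busy[LINE];
    memset(row_busy, 0, sizeof row_busy);

    for (int n = 0; n < N; n++) {
        if (!pending[n]) continue;          /* already served earlier       */
        int row = where[n] % LINE;          /* which word row it needs      */
        if (!row_busy[row]) {
            tmp[n]        = DataInRam[where[n]];
            meaningful[n] = true;           /* mark the meaningful data     */
            pending[n]    = false;
            row_busy[row] = true;           /* that row is used this cycle  */
        }                                   /* else: collided, retry next cycle */
    }
}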

Complicated, and this isn't the full picture. The Alu/MAdd or sequencer must propagate the "meaningfuldata" flag, possibly wait before writing back when the cache is full - and what about multitasking, async interrupts...? Computed addressing can be useful, but not everyday, and is an effort. Worth it?

----------

As quad-core Sse processors presently draw <130W with caches, busses and graphics, 20W per Dram+Cpu chip must suffice: easy to cool and supply. Without a Ram bus nor Pci-E, the package can be small, like three dozen pairs of serial links if building a hypertorus, and take little routing area on the printed circuit board.

This suggests that a small supercomputer or server with 512 processing nodes fits on one printed circuit board around the lone switch matrix, with some busbars and heat pipes.

Putting a single Sse or Avx core together with the Dram on one chip is tempting, but I prefer several scalar processing nodes then, passing through the general interconnection network, to keep the flat model that makes software easier and efficient. The small power and pin count suggest multi-chip packages.

Marc Schaefer, aka Enthalpy

Posted

Scientific models usually need 64-bit floats, but signal processing can often run on 32-bit floats, and an Internet server may want just bytes. Mmx-like hardware capability and instructions are more efficient for them, but these need to shuffle data - not often, but an application would stall without the corresponding hardware capability.

 

This shuffle is often proposed only within one processor width, but shuffles spanning several words would be useful when that's the data's nature. Offering it within a data vector, not just a register, would be more flexible and would accelerate the access to Dram in such cases.

 

This could be one more Read and Write set of instructions that operate between a complete Dram access and a cache page. In addition to ScaledRead (including on 32 bits etc.) and BitReversedRead, the machine could offer various ByteShuffleRead and Write instructions, to be used by the programmer on explicitly distinct variables before operations on the reshuffled data.

 

Well, if this shuffle mechanism can provide efficient operation on 64 bits with a 128 bits or 256 bits processor, then it could make sense to integrate one Sse or Avx core on a chip with the Dram, not just one 64 bits MAdd. Of course, the Dram access width and the cache pages must increase accordingly, in order to deliver the throughput.

 

Marc Schaefer, aka Enthalpy

Posted

For a connector or a chip socket conveying many differential signals, this layout reduces crosstalk because each pair sees its neighbours at their middle or from its middle:

post-53915-0-10758900-1385324570.png

I didn't check if it's already known. The area is well used: if the contact spacing is d, this layout uses 0.93*d² per contact, compared with 1.00*d² for a square pattern and 0.87*d² for a hexagonal one.

Wiring is a brutal limit to supercomputers. I find it hard to wire just the boards of a hypercube with one million nodes, and this decides the shape and ground area of the computer - so wiring may be an obstacle to a full switch matrix, more than the number of switching points in silicon.

Marc Schaefer, aka Enthalpy

Posted

As a small supercomputer, a file or Internet server, a database... I had suggested 512 Cpu+Ram chips and one switch chip on one board. We can increase it without difficult wiring.

Existing Pci-E 3.0 connectors pass 16 lane pairs at 5Gb/s, so daughter cards can carry 16 Cpu+Ram chips, exceeding a hypercube design. A main board of ~0.7m*0.5m can carry 32*4 daughters on a single side, and the 4000 differential signals keep reasonable to route. The computer totals 2048 Cpu+Ram chips for 20TFlops peak, already a funny toy.

Cpu+Ram chips directly on the main board wouldn't quite achieve this density, especially because repairs then need sockets. Daughters on both faces of the main board would improve further; the connectors can be offset or, to ease routing, be well-held Smd. Connectors denser than Pci-E would be immediately useable.

Pessimistic 20W per Cpu+Ram chip let the machine draw 40kW. The well spread dissipation still permits air cooling if inlets and outlets alternate closely, but deserves liquid cooling.

The main board can have 16*16=256 switch chips of 128 inputs and 128 outputs for a full matrix. Fewer matrix in-out per chip permit a cheaper package with additional functions at each switch chip: Sata and Sca raid controller, Ethernet, Internet point...

Marc Schaefer, aka Enthalpy

Posted

Fibre optics seems to use one connector per fibre still: too bulky for a supercomputer with millions of signals. To carry many signals between two boards, we may use flexible printed circuits, or if you prefer flexible flat cables, in order to save some volume over cables of twisted pairs.
http://en.wikipedia.org/wiki/Flexible_electronics
http://en.wikipedia.org/wiki/Flexible_flat_cable

For several hundred signals, a cable would comprise several layers, glued to one another at the ends and at some points, but kept separate in between to remain flexible. I'd have ground planes against cross-talk, and even one on each side, since cables cross one another with any orientation. Ground planes are so efficient within a ribbon as well that differential signals are unnecessary: a 125µm film and 750µm signal spacing leave around 16pH/m coupling inductance, so a 25mA/100ps edge induces 4mV over 1m. Differential signals, 95% symmetric, would achieve this with 500µm spacing but are themselves wider and must be twisted. The unsymmetric cable better carries a waveform symmetric around zero, with symmetric power supplies for the transmitter and the receiver.
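For reference, the 4mV figure is just the coupling inductance times the current slope, with the values quoted in the paragraph above:

\[ V \approx M \, \ell \, \frac{dI}{dt} = 16\,\mathrm{pH/m} \times 1\,\mathrm{m} \times \frac{25\,\mathrm{mA}}{100\,\mathrm{ps}} = 4\,\mathrm{mV} \]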

 

post-53915-0-99309700-1386006183.png

 

The insulating film can be a fluoropolymer if PI and PET are too lossy at a few Gbit/s. Copper conducts over a 1.2µm skin depth at 5Gb/s taken as 3GHz, introducing some 80ohm/m loss resistance for an estimated 40ohm wave impedance; this attenuates the voltage by an acceptable factor of 6.6 at 2m, but by 10,000 at 10m, so long cables carry integral repeaters in stiff, enclosed sections - easy with a flexible printed circuit; such repeaters can also spread signals to several directions, or merge them, preferably with data buffers then.
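As a sanity check with the quoted loss resistance and wave impedance, the voltage attenuation of a matched line grows exponentially with length, which is why the factor explodes between 2m and 10m (same order of magnitude as the figures above):

\[ \frac{V_{\mathrm{in}}}{V_{\mathrm{out}}} \approx e^{R\ell/(2Z_0)}, \qquad e^{80 \cdot 2/(2 \cdot 40)} \approx 7 \ \text{at 2 m}, \qquad e^{80 \cdot 10/(2 \cdot 40)} \approx 2\times 10^{4} \ \text{at 10 m} \]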

Superconductors would bring a lot to (super-)computer wiring, at room temperature if available some day, in liquid nitrogen meanwhile: imagine compact cables with tiny conductors which keep signal shape and strength over any distance... Easier and more desirable than the often-cited uses in electricity production, transport and storage.

At 750µm spacing, a ribbon just over 48mm wide carries 64 signals, so one 8-layer cable carries 512 signals and is little more than 3mm thick.

----------

The connection I propose with a printed circuit board presses each contact firmly and individually as usual - but I'd press the flexible printed circuit directly against the printed circuit board. This keeps the signals shielded (including from their neighbours), minimizes the impedance mismatch and its length, and saves parts. One ground plane stops on the drawing: no consequence to shielding over this distance, and wider tracks there keep the wave impedance; the planes contact one another at the cable sides.

 

post-53915-0-54303400-1386006215_thumb.png


 

Signal layers are separated at the contact zone to be flexible. They are soldered to an additional layer that puts the contacts at accurate positions versus the position index. This additional layer is thin to be flexible, but of a material sounder than the fluoropolymer, say polyimide or polyester. If flexibility demands it, the layers of UHF/EHF materials can stop before the contacts to the printed circuit board, and the additional layer carries the signals over the short distance. At the contacts, anything less than gold would be bad, as usual.

To assemble the plug, the additional layer can be laid or sucked on a very flat surface, and the signal layers soldered and glued to it. Then the signal layers are curved (possibly held curved by an additional item), the force spreader inserted, the signal layers glued together, and the additional layer glued to them at the front. Then the position index item is drilled or inserted (or milled if pushing against the board, directly or not), with reference to the contacts or to a former centerhole in the additional layer. Late in the assembly, if the contacts at the additional layer can be melted, the very flat surface brings them to an even height - if any better; deposit gold after.

A deformable force spreader distributes press force on each contact despite residual misalignments. It can be an elastomer that doesn't set over time, like natural rubber, preferably stopped at the sides; I'd prefer very flexible communicating bellows or cushions of reinforced elastomer or maybe nickel. Soft silicone rubber can replace the liquid if any better.

Indirect sliding wedges could press the contacts through the spreaders, but if bellows are used, I'd rather bring pressure in them, say with a plunging screw. Both versions are manoeuvrable from a position aligned with the printed circuit board, nice when boards are stacked. Callipers resist the strong push of hundreds of contacts. Two plugs at the board's opposite faces can share the callipers if manoeuvred simultaneously, or one plug can contact both of the board's faces: such a plug would protect its contacts better.

Two flexible signal layers are drawn, and then the plug could be designed with just a flexible loop, or even simpler for a single signal layer, but a big computer needs many signals. With 5mm length for the contact zone of each layer, 512 signals in 8 flexible layers fit in a reasonable area. The thin connector permits 15mm board spacing, maybe 10mm.

----------

How well a connector works is experimental... This one has a single moving contact per signal. For cables carrying hundreds of signals of the same nature, I'd introduce redundant contacts at the connectors, and redundant lanes at the cable, to be selected by the transmitting and receiving electronics.

Marc Schaefer, aka Enthalpy

Posted

The dense cables and connectors ease heavy wiring. A full switch matrix between 16384 processors shall illustrate it.

Boards 500mm wide and 600mm long carry 2*16*16=512 processor+Dram chips or 2*8*8=128 matrix chips of 256 inputs and outputs each. 32 processor boards and 32 matrix boards fit stacked in a person-sized bay.

The 50mm wide, around 4mm thin, unidirectional cables carry 512 signals in 8 layers of 64 tracks. Each processor board has one cable input and one cable output. Each matrix board has 8 cable inputs and 4 cable outputs.

All the wiring between the boards consists of 32 cables to, and 32 from, one processor board each. Each cable from a processor board spreads actively to 8 matrix boards. Each cable to a processor board merges data from 4 matrix boards - actively, with collision detection and data buffers.

Can we exaggerate a full matrix further? It would be easy, but unwise. Take just 64k processors on 128 boards: a full switch matrix would take 512 boards. In the same volume, one gets 256k processors in an acceptable hypercube or hypertorus network - already a better choice I'd say.

Marc Schaefer, aka Enthalpy

Posted

The hypercube networks I suggest now remain feasible where a full matrix is too big.
http://en.wikipedia.org/wiki/Hypercube
With less wiring than a full matrix, they offer about the same throughput. Their latency may be worse, or maybe not, if smart serial links manage the destination addresses and the collisions.
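As an illustration of how a smart link could route by destination address, here is the classic dimension-order (e-cube) rule in a few lines of C; this is my sketch, not a claim about any particular machine:

/* forward along the lowest dimension in which the current node address
   still differs from the destination; each hop fixes one address bit    */
unsigned next_hop(unsigned current, unsigned destination)
{
    unsigned diff = current ^ destination;   /* caller ensures current != destination     */
    int dim = 0;
    while (!((diff >> dim) & 1)) dim++;      /* lowest differing bit = dimension to cross */
    return current ^ (1u << dim);            /* the neighbour across that dimension       */
}

/* the number of hops equals the number of differing address bits, at most the
   hypercube's dimension; collisions still need buffering on top of this rule  */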

A hypertorus would have been possible as well; 3 processing nodes per loop outperform a hypercube, 5 save wiring.
http://en.wikipedia.org/wiki/Hypertoroid#n-dimensional_torus
Present supercomputers are hypertori with 100 nodes per loop, and users complain about data transfers.

----------

131k processors for 1 petaflops fit in 4 person-sized cabinets. Each cabinet holds, stacked with 30mm spacing, 64 boards of 512 = 2*16*16 processors+Dram chips. The chips have at least 17+17 serial links out and in, of which 9+9 serve within the board and 8+8 go to connectors and, via cables, to other boards.

The hypercube (or a hypertorus) lets one group, on a board, one link from each of the 512 processors into a 512-signal connector and have one pair (for full duplex) of 512-signal cables between two boards. Each of the 256 cards bears 2*8 connectors and communicates directly with 8 network neighbours - which can be geometrically remote, since we have 3 dimensions to arrange a board hypercube of dimension 8. There are no network boards, no cable spread nor confluence, and here no integral repeaters.

The 25mm*10mm cables have 17 layers of 32 signals. The boards are >400mm wide, double the cumulated cable width, to accommodate the connectors and permit cables to cross one another. At the worst height, 84 vertical cables run within a cabinet, so a 20% filling factor needs 0.26m of wiring thickness. 2*2*64 = 256 horizontal cables run between the 1.9m tall cabinets, taking 0.17m of wiring thickness. The 1m cables cost 5ns, as much as transmitting the destination address twice over a serial link.

Connectors facing one another at one board's edge could carry the in and out signals to one other board, and then half of the boards could be upside down depending on the parity of their hypercube position. Or we could route the up- and down-bound cables clockwise, the right- and left-bound ones anticlockwise, with all boards up, and their outbound connectors up left and down right. Arranging the boards in binary sequence according to their position in the hypercube seems better than in Gray sequence. Connector positions on a board would better not correspond to one dimension of the hypercube, but rather be allocated on each board to ease interboard wiring; the processors could sense it at powerup and adapt. The destination address of a message could begin with the chip address within a board, but with the board address in the cables. All this mess is imperfectly clear to me, but must be known if big hypercubes have been built.

At 3.8GHz, each 64b scalar Cpu+Dram chip shall draw 6W, because six 256b 3.6GHz cores need 130W presently - without the Dram but with wide fast interfaces. This sums to 800kW for the computer, 200kW per cabinet whose supply at the top can receive three-phase mains and distribute 48Vdc through 50mm*10mm aluminium bars. Quiet liquid cooling, managed at the bottom, is easy; transformer oil suffices. Signal cables can run from the boards' rear edge, while power and cooling can connect to the front edge and run at the cabinet's sides.

----------

1M processors for 7 petaflops fit in 32 similar cabinets, here as 4 rows of 8.

The 2048 boards widened to >550mm have now 11 pairs of connectors. 8 cabinets side-by-side need now 5*2*64 = 640 horizontal cables, or 0.34m thickness filled to 25%. 2*2*8*64 = 2048 cables plunge straight to the floor and run in it between the rows; they take 0.29m thickness filled to 40%; integral repeaters look necessary for these cables. This sums to 0.92m wiring thickness for 0.5m long boards.

At 3.5GHz, each scalar Cpu+dram swallows 5W, and the computer 5MW.

----------

Tianhe-2 brings 55 petaflops: 8M scalar processors at 3.2GHz make the equivalent.

A different arrangement accommodates the wiring: the boards are vertical, 1024 pieces in a row, and 8 pairs of rows achieve 16k boards. Personnel have alleys between the row pairs, power can run at the middle of the row pairs, cooling circuits below, and all wiring passes over the board rows and, between the rows, overhead.

Rows are 31m long at 30mm board spacing, and the boards (gasp) 700mm wide for 2*14 connectors. Two rows, power, and an alley take 2.2m, so the machine is 18m long. Many cables have an integral repeater. 100ns in some cables is as much as 50 bytes of data, or 320 Cpu cycles: special messages, like scaled indexed or bit-reversed access to 2kB of remote data, can be useful.

Within each row, 682 cables fill 1.2m of height to 20%. Start from hip height, add an access gap, and you reach the adequate height for the 10240 overhead cables, which fill 0.21m of thickness to 40%.

4W per 3.2GHz scalar Cpu+dram let the computer draw 33MW, gasp. More Cpu running slower would improve; scalar ones are better used, but the network swells. A full matrix only at the boards would save Cpu pins, emulate a hypercube if wanted, and access disks if any useful; between the boards, a hypertorus would save cables, say as 510 nodes.

----------

How does the hypercube of scalar Cpu+dram compare with existing supercomputers, say at 55PFlops?

The floor area is similar. The power consumption is almost double at 3.2GHz. The Dram capacity is similar. If a Cpu+Dram costs a bit more than a Dram chip in a module, then the computer price is similar. But what really changes:
- 300PB/s Ram throughput (peta = 10^15) for easy programming. Addressing modes.
- 2.5PB/s network throughput between any machine halves.
- 300ns network latency. Existing supercomputers have workstation boards as nodes, but Ethernet is not a supercomputer network.
- Scalar Cpu are more often efficient. For instance plain Lisp programmes are supermassively multitask but can't use the Avx.
- Simple programming model! This machine would be just task-parallel, with no other limit nor difficulty.

I'd say: easier to use well by more varied programmes. In short, more capable.
Marc Schaefer, aka Enthalpy

Posted

The cables of flexible printed circuits can carry more than 5Gb/s in each lane. For instance 20Gb/s would just halve the maximum distance between repeaters. Crosstalk voltage increases with the edge steepness, nothing tragic with two ground planes. Accepting a reduced wave impedance, thinner insulating layers reduce crosstalk, or permit signal lines closer to one another: the resulting narrower cables would permit narrower boards at the 8M-node hypercube.

 

A supercomputer draws a huge current that induces ground noise. To protect itself, the asymmetric transmission can for instance connect its cable ground at the transmitter side, let it float at the receiver side where it receives just the line terminations, and have the receiver detect the difference between the signal and the cable's ground.

 

Package inductance is always a worry. Asymmetric transmitters pollute their internal ground when many outputs make the same transition. With one ground contact per signal output, ball grid array packages look numerically good at 20Gb/s.

 

Optical cables would be fantastic... The current technology is Vcsel emitters at 850nm, multimode 50µm/125µm fibers, GaAs detectors. 5mA*2V suffice to transmit >10Gb/s over 300m. It's just that connectors for a single fibre are still as big as a finger, and presently people try to develop hardware, including collective connectors, for twelve (12) fibres at once. As soon as one can have 500 signals in a transmitter, connector, cable, connector, receiver - all that for 10,000 cables, shooting horizontally from the boards, 10mm thin - they'll be perfect.

 

Marc Schaefer, aka Enthalpy

  • 1 year later...
Posted

The memory throughput of video cards presently mismatches their computing capability. For instance, the R9 290X (the GTX Titan Black is similar) computes 2816 billion multiply-adds per second (2816Gmadd/s) on 32-bit numbers, but its Dram outputs "only" 80 billion numbers per second; depending on how many new numbers each Madd takes, that's 70 to 140 times less, and it gets worse with newer models.

The Gpu (video processors) have faster internal caches. Maybe the emulation of directX 9 instructions for my older games on a directX 10 Gpu doesn't use the caches well, but my tests show that the Dram throughput determines 90% of a video card's speed, and the computation power 10%. Hence, here's how spreading the processing cores among the Dram - often in several chips - would improve this, compared at similar computing capability.

Take 1024 multiply-adders (Madd) at 2GHz for 2000Gmadd/s with individual 4MB of Dram close on the chip. The small Dram units can cycle in 15ns, so accessing 512 bytes (4kb or 128 words) for each transfer to and from the cache provides 4.2 words per Madd cycle, totalling pleasant 35000GB/s. An 8kB cache would store 16 transfer chunks for 3% of the Dram's size.

A hypercube shall connect the 2^10 nodes since I haven't found advantages to hypertori and others. Package pins limit to 64 nodes per chip; 4+4 serial monodirectional chip-to-chip links per node take already 512 signals (and the 6+6 links within the chip take none). At 5GHz like Pci-E 2.0 or Gddr5, the hypercube transports 640GB/s through any equator, twice the Dram of existing cards.
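A quick check of the Dram and network figures in the two paragraphs above; the values are taken from the post, the code is just arithmetic:

#include <stdio.h>

int main(void)
{
    /* per-node Dram throughput */
    int    nodes    = 1024;
    double madd_hz  = 2e9;          /* Madd clock                      */
    double cycle_s  = 15e-9;        /* local Dram cycle                */
    double chunk_B  = 512;          /* bytes per Dram<->cache transfer */

    double words_per_madd = (chunk_B / 4) / (cycle_s * madd_hz);   /* 32-bit words */
    double total_GBps     = nodes * chunk_B / cycle_s / 1e9;

    /* hypercube equator: cutting one dimension severs nodes/2 links each way */
    double link_bps     = 5e9;
    double equator_GBps = 2 * (nodes / 2) * link_bps / 8e9;

    printf("%.1f words per Madd cycle, %.0f GB/s total Dram\n", words_per_madd, total_GBps);
    printf("%.0f GB/s through any hypercube equator\n", equator_GBps);
    return 0;
}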

In this example we have 2*64 links between chip pairs. A few redundant links could replace faulty chip i/o or Pcb lanes.

With 3 billion Dram transistors and 64 nodes, the chips are "small". As extrapolated from both 2800 Madd Gpu or quad-core Avx Cpu, they would dissipate <10W per package, far easier than presently, and permitting the 2GHz. Though, wiring 16 large packages per card needs denser Pcb than present video cards; package variants with mirrored pinout would help. Cards with less computing power and memory can have fewer chips. The cards may have a special chip for the Pci-E and video interfaces.

While the very regular programming model with few restrictions must run many algorithms well, it differs from dX9, dX10 and followers, so whether existing software can be emulated well remains to be seen. How much the data must be duplicated in the many Dram, as well.

How to exaggerate further?

  • The nodes can exceed 2GHz, to compute faster or to reduce the number of nodes and links.
  • Transmit faster than 5GHz: Pci-E 3.0 does it.
  • The links could have been bidirectional or, less drastically, several nodes could share some links.
  • If general enough for a Gpu, the nodes can multiply quadrivectors instead of scalars.

Marc Schaefer, aka Enthalpy

Posted

Enthalpy, please attach the source code of your benchmarking applications so we will be able to compile it ourselves and check our gfx cards.

Posted

This is a transmission line termination without steady losses. Other people may well have known it, but I've seen no description up to now. Related to connections in computers, but not especially to supercomputers nor to processors spread among the memory.

post-53915-0-15447900-1420675550.png

The data transmitter is a push-pull, on the sketch a Cmos. The line's opposite end has no resistor, but two fast clampers. There, the incident wave rebounds and exceeds the logic levels Vs+ or Vs-, just slightly thanks to the clampers. The small reflected wave comes back to the transmitter where it approximately extinguishes the transmitted current. Some small oscillations happen but are acceptable. After that, current flows nowhere and no power is lost; the only consumed energy charges the line, which is optimum.

As fast low-voltage clampers, silicon Schottky diodes spring to mind. For 0/5V logic they can divert the clamping current directly into the supply rails, but at 1V logic they had better have their own clamping rails Vc+ and Vc-, a bit within the logic levels, to minimize the overshoot. The chip's built-in electrostatic protection is too slow for this use, and the Schottky's low voltage avoids turning it on. Bipolars (chosen strong against base-emitter breakdown) can avoid the additional rails, as sketched; if the diodes conduct at 0.4V and the bipolars at 0.7V, they clamp at 0.3V outside the supply.

A bidirectional bus with tri-state buffers can use this scheme, as sketched. Additional clampers on the way would neither serve nor hurt. Pin capacitance has the usual effects.

A differential line could use this scheme; though, push-pull buffers inject noise in the supplies.

Marc Schaefer, aka Enthalpy

Posted

Some details about the video card suggested on 07 January.

The individual 4MB Dram must cycle faster than 15ns, like 4ns. The transfer size to the cache can then be smaller, say 128B.

The Sram cache is fast, and it can be smaller than 8kB. The E8600's 32kB takes 3 cycles at 3.33GHz, so 8kB must achieve 1 cycle at 2GHz. A single cache level suffices, either for a load-store architecture or for operands in the cache, which could optionally be a set of vector registers. Prefetch instructions with the already described scaled indexed and butterfly memory addressing remain interesting for Gpgpu.

The network within a chip is easy. Two layers for long-range signals, of 200nm thick copper and insulator, diffuse the data at roughly 25µs/m². The lines can have two 15ps buffers per 1.5mm long Dram+Macc node, plus some flipflops; they transport >10Gb/s each and cross a 12mm chip in 500ps - just one Macc cycle. One 20% filled layer accommodates 90 lines per node over the full chip length, so

  • Each path can have many lanes;
  • The on-chip network isn't restricted to a hypercube;
  • Each node can reach any position and any bonding pad on the chip.

So all nodes in one chip can group their links to one other chip at some subset of the bonding pads, and routing the links on the board is easy - here a sketch with a 2^4 hypercube. Better, some permanent or adaptive logic can re-shuffle the links before the pads to

  • Change the pinout among the chips to ease the board;
  • Replace faulty pads or board lanes with redundant ones;
  • Allocate more links to the more needy Macc;
  • Spread optimally the data volume over the cube's multiple paths.

Now the number of Dram+Macc nodes per chip depends less on the number of pins.

post-53915-0-30470800-1421001841.png

As a model, the 131mm² Core i3-2100 has 1155 pads, which permits ~250 bidirectional links and needs for instance 3 signal layers with 4 lanes/mm and 4 ground or supply layers just to reach the pads. A ceramic board could do it, but the Lga-1155 package achieves it with "resin and fibre".

100µm wide lanes would introduce 20ohm losses over 100mm distance at 10GHz (~20Gb/s), so a 4 to 32 chips module could have one single ceramic or printed circuit board (Pcb) substrate with 4 lanes/mm, needing few external pins if it concentrates all nodes of a video card.

Alternately, a first fine-pitch substrate can carry one or a few chips in a package and a coarser board connect the packages. A Pcb with 2 lanes/mm on 3 signal layers and 4 ground or supply layers can connect a 32mm*32mm Bga and connect 16 chips on a video card as sketched, for 1TB/s between hypercube halves. 5 signal layers would save space and 7 would permit 32 chips. This Pcb with 200µm wide lanes transports 20Gb/s over >200mm and is cheaper.

The same Pcb could even stack finer layers used near the chips over coarser ones for longer distance. Probably more expensive than Lga distinct from the Pcb.

Beyond video cards, parallel computers too can have several computing nodes per chip, as long as the Dram suffices, and the proposed solutions apply to them as well.

Marc Schaefer, aka Enthalpy

 
