Enthalpy Posted October 25, 2015

A database or artificial intelligence engine on a Pci-E board with crossbar can resemble the previous number cruncher, with adapted computing packages. 3 stacked chips in each of the 352 packages combine Dram and data processing for 528GB Ram per board. This defines the quickly searchable volume of the database - text, not pictures - or the working memory of an artificial intelligence program.

At least the database gets Flash memory too. Slc is more reliable and swifter, and present 32GB chips read and write 200MB/s. One chip stacked with the computing chips sums to 11TB and 70GB/s over the board. The images of a typical Web page take 20 times as much room as its text, but Mlc would quadruple the capacity, and more chips can fit in the (optionally separate) packages.

Databases and artificial intelligence have similar needs: badly predicted branches and data accesses, many conditional jumps, simple integer operations, but few multiplications or float operations. This fits neatly with many simple low-power nodes, each with a smaller hence faster Dram, a slower clock and a short pipeline, while complicated operations can take several cycles.

By artificial intelligence here, I mean programs that combine fixed knowledge in myriads of attempts, often in Lisp or Prolog. These are naturally many-task and fit nicely the parallel machine. Other artificial intelligence programs, especially machine learning, look much more like scientific computing and fit rather the previous number cruncher.

To estimate the power consumption of a node, I compare the number of complex gates (full adder equivalents) with existing chips which are mainly float multipliers: not quite accurate. The reference is a 28nm process graphics chip, whose 1000+ gates draw 67mW per 32b multiplier. A 14nm Cpu process would supply the sequencer too, and everything, from that power, but the present machine emphasizes Dram, whose process is suboptimal for a Cpu.

Shift-add would multiply too slowly to my taste, hence the 32b*8b multiplier-accumulator. Useful by itself, it also makes the bigger products in several steps, quite decently on 32b. It consumes more, but for random jumps and data accesses I want anyway the condensed flow control hardware (add, compare and jump) and multiple scaled indexed addressing, so the node has 3 of the other execution units.

The 3 Alu can operate on distinct data or be chained at will within one cycle, like s ^= (a << 3) & b. This nearly keeps the cycle time defined by the other units. The 3 cascade bitlogic units each do a 10-bit arbitrary operation I proposed here on 26 February 2015 and are independent. The 3 find-string units search for substrings at any byte offset; their outputs can be coupled like "react if any finds something".

The sequencer shall permit flow control and addressing in parallel with computations, but stall for instance the Alu when the multiplier draws power.

36mW each permit 12670 computing nodes at 1GHz, as 36 nodes per package on 3 chips. The board's Alus and cascade logic provide a peak 38Tips and 380T logic operations/s (PC ×10³ and ×10⁴), and the multipliers a decent 2.5TMacc/s on 32b floats (PC ×10¹).

Each node has an exclusive 40MB Dram. The small units shall operate in 4ns. Exchanging 16 eight-byte words with the cache provides a mean 4 words per cycle, while 17 banks avoid collisions at the butterfly and scaled data moves described here on 19 November 2013, useful to process one field from a database.
The Drams cumulate 400TB/s and 3T random accesses per second (PC ×10⁴ and ×10⁵). A Flash chip reads or writes one Dram in 0.2s so swapping becomes conceivable, but I'd prefer one Flash port per node.

The network keeps one link per package, dimension and direction as for the number cruncher: it carries 530GB/s peak unidirectional between network halves.

Marc Schaefer, aka Enthalpy
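The 17-bank choice deserves a tiny illustration. Here is a minimal sketch, my own and not the actual address decoder, of why a prime bank count like 17 lets the 16 words fetched per cycle land in distinct banks for any power-of-2 stride, as in the butterfly and scaled accesses:

#include <stdio.h>

/* Illustration only: map word addresses onto 17 banks and count how many
   distinct banks the 16 words fetched per cycle hit, for power-of-2
   strides. Because 17 is prime, no power-of-2 stride ever folds two of
   the 16 words onto the same bank. */
#define NBANKS 17

int main(void)
{
    for (unsigned stride = 1; stride <= 4096; stride *= 2) {
        unsigned used = 0;                   /* bitmask of touched banks      */
        for (unsigned i = 0; i < 16; i++)    /* 16 eight-byte words per cycle */
            used |= 1u << ((i * stride) % NBANKS);
        int distinct = 0;
        for (unsigned b = 0; b < NBANKS; b++)
            if (used & (1u << b)) distinct++;
        printf("stride %4u : %2d distinct banks out of 16 accesses\n",
               stride, distinct);
    }
    return 0;
}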
Enthalpy Posted October 26, 2015

An update to the node's computing capabilities for a database or artificial intelligence engine.

Outside signal processing and linear algebra, most programs spend their time in small loops that contain one or more conditions, so a second flow control unit is a must. To save power, this one can compare and branch but not add. Consuming 86 gate equivalents more than the previous 800 is extremely profitable: a 5% slower clock but 1 cycle less in most loops.

An instruction "Subtract One and Branch" (if no underflow) can often replace more complicated loop controls. Special hardware, used by the compiler when possible instead of the heavy flow control, would save M$ of electricity on a giant machine.

During a lengthy multiplication, flow control instructions are less useful. Excluding them gives power for a bigger 32x14 multiplier that saves 4-6 cycles on big products.

Marc Schaefer, aka Enthalpy
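For illustration, here is the kind of counted loop a compiler could map onto such an instruction. A minimal sketch: the SOB mnemonic and register name in the comment are my own placeholders, not a defined instruction set.

#include <stdio.h>

/* A counted loop written in count-down form. With a "Subtract One and
   Branch if no underflow" instruction, the whole loop control below
   (decrement, test, conditional jump) collapses into roughly:
       loop:  <body>
              SOB  Rcount, loop        ; hypothetical mnemonic
   instead of separate subtract, compare and branch steps. */
int main(void)
{
    int sum = 0;
    for (int count = 1000; count != 0; count--)
        sum += count;
    printf("%d\n", sum);   /* 500500 */
    return 0;
}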
Enthalpy Posted November 1, 2015

A Pc's coprocessor installed in a worker's office should fit in a box and draw <2kW - blow the heat outside. A Pci-e board in the Pc can launch the links to the separate box.

---------- Number cruncher

5184 nodes with one 0.4W double precision mul-acc at 1GHz provide 10TFlops and 1TB from 200MB at each node. Exchanging with the cache sixty-four 64b words in 15ns from 127 Dram banks provides 4 words per cycle to the processor, or a cumulated 177TB/s.

A two-dimensional crossbar of 72² nodes is still feasible on a single Pcb. The 72x72 switch components fit in a (4+4)*40 Bga, and the network has 2*72 of them. A 12mm*12mm package can host two nodes, on one chip or two. Line amplifiers every 200mm enable 20Gb/s on a link, to transport 5.2TB/s peak unidirectional between network halves.

The network brings one word in 4 cycles to a node. Imagine an algorithm where a node imports N words to make just N*log2(N) complex mul-add on 32+32b in as many cycles: N=16 chunks suffice to keep the node busy - that's comfort.

Two nodes per package at both Pcb sides need only 36*36 sites, split in four 18*18 sectors (one sketched here) by the switches, cut again by line amplifiers. The rows can connect complete packages and stay at one Pcb side, while some lines connect the A nodes of packages at both Pcb sides and the other lines the B nodes.

Nearer to the Pcb centre, the line and row pitch widens to route the more abundant links. The last crossing passes 144 links (from both Pcb sides) in 25mm pitch over 4 signal layers, plus 4 layers for the cross direction, so <12 signal and 6 supply layers must suffice. At the switches, a minimum 16mm pitch permits to route 72 links in 8 signal layers over the 10mm available width. The Pcb is 0.85m*0.85m big.

Supplying an estimated 0.6V * 3456A isn't trivial, but with converters on all 4 edges, 3+3 supply layers, 105µm thick, drop 8.5+8.5mV. Busbars can do the same, at one Pcb side per direction. Or aluminium heat spreaders, as big as the board and 10mm thick including the fins, can bring Vss at one side and Vdd at the other, dropping 5+5mV if fed at one edge. Or a flat supply can hold the Pcb by many spacers and screws that inject the current; in that case, stacking two chips usefully brings all computing packages to the unique heat spreader at the opposite side.

---------- Database and artificial intelligence engine

Keeping the network and Pcb designs, each computing package can host 24 simpler nodes consuming 30mW instead of 2*0.4W. Two stacked computing chips hold 12 nodes each with individual 40MB, while an Slc flash chip adds nonvolatile 32GB, for a total of 62k nodes, 2.5TB Dram at 2PB/s and nonvolatile 83TB at 500GB/s (many ports would improve this). A 5ns Dram latency would permit 12T random accesses per second, and three units for cascaded logic 2P bit operations per second.

Though, the network now keeps 2 in + 2 out links for 24 nodes per package, and while this should suffice for a database, I feel it would restrict the style of artificial intelligence programs, so a special machine would look rather like the multi-board ones I plan to describe.

Marc Schaefer, aka Enthalpy
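A back-of-envelope check of that chunk size, under my reading of the figures above (one imported word per 4 cycles, N*log2(N) operations on N imported words):

#include <stdio.h>
#include <math.h>

/* Compute/transfer balance sketch: a node receives one word every
   4 cycles and spends about N*log2(N) cycles of complex mul-add on a
   chunk of N imported words. The chunk suffices once computing it takes
   at least as long as importing the next one. */
int main(void)
{
    const double cycles_per_word = 4.0;        /* network: 1 word in 4 cycles */
    for (unsigned N = 2; N <= 64; N *= 2) {
        double compute  = N * log2((double)N); /* cycles spent computing     */
        double transfer = N * cycles_per_word; /* cycles to import the chunk */
        printf("N = %2u : compute %4.0f, transfer %4.0f -> %s\n", N, compute,
               transfer, compute >= transfer ? "node busy" : "node starves");
    }
    return 0;
}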
Enthalpy Posted November 6, 2015

This is a 2kW machine for artificial intelligence - and databases if much network throughput is wanted. Without awaiting the drawings, because the spooks have already stolen this idea from me, codeword "en même temps".

The 3D crossbar accessed individually by the 65k nodes consuming 30mW each makes the machine bigger than the nodes alone would need. Each compute package holds just one chip of 12 nodes optimized for AI & DB with 40MB Dram per node, the compute boards have packages on one side only, and the compute boards hold only 12*12 compute packages or 1728 nodes; 52W per board are easy to supply and air-cool.

At 18mm pitch, 6 compute packages send 72 and receive 72 links to the 72x72 switch component; the 2*12 switch components are at the two midlines of the array as previously. A standard 2D crossbar would have approximately 2*42 switch components of 21x21 capacity, but the 72x72 ease routing and offer more paths. The board has 6+6 signal layers plus 3 Gnd and 3 Vdd - more layers would shrink the machine. A compute board is 250mm high, a bit wider for the supply at one end.

The compute boards bear 3 data vessels (see 08 February 2015) per line of compute chips, or 36 vessels carrying 48 links in and 48 out each. The vessels are 7mm wide, so 8 signal layers and 8 power planes suffice. Capacitive components (see 22 February 2015) couple the links at the vessels' ends, for instance 2*3 chips of 16 signals each, or more for redundancy.

The machine comprises 38 vertical compute boards and 36 independent horizontal crossboards that make the matrix' third dimension, which switches homologous nodes from each compute board.

Each crossboard has 48+48 links with each of the 38 compute boards. It can carry 24 switch components of 38x38 capacity - or for instance 12 components of 76x76 capacity, which can be the same small chips as for the compute boards, repackaged if useful. The D=200mm crossboards could have 12 to 16 layers, more difficult to estimate.

Round crossboards surrounded by compute boards are more difficult to service, but they ease routing, shorten the latency and avoid repeaters here too. One fan can cool the spectacular design.

With 2GB/s links, the network carries 65TB/s between machine halves. 4 cycles per 64b word for each node: that's comfort for Lisp-Prolog-inference engines.

----------

If someone feels a 2kW database needs this network throughput, fine. The 2.7TB Dram capacity can then increase if useful, with 6-node chips on both sides of the compute boards, or by stacking three 2-node chips per package for 16TB... Plus the Flash chips, necessary for a big database but optional for AI.

Or an AI+DB machine can keep this network but have 12 nodes per chip and stack 3 chips per package on both board sides. The 394k nodes consume 12kW, are rather liquid-cooled, but offer 16TB Dram and compute 1.2Pips. The machine size is ill-defined, limited neither by the power nor by the connections.

A number cruncher with two 256MB nodes per chip would stack 3 chips per package at each board side, since I feel one word in 4 cycles is enough from the network. The liquid-cooled 65k-node machine draws 26kW and offers 130TFlops, 16TB Dram: 1000 times a Pc in a small box.

Marc Schaefer, aka Enthalpy
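A quick check of the 65TB/s and 4-cycle figures above, my own arithmetic from the stated 2GB/s links, 38 boards of 1728 nodes and the 1GHz clock; the assumption that one link per node crosses the machine's equator is mine:

#include <stdio.h>

/* Rough network arithmetic for the 2kW machine described above. */
int main(void)
{
    const double link_GBps = 2.0;          /* per node and direction       */
    const double nodes     = 38 * 1728;    /* 38 compute boards, 1728 each */
    const double clock_Hz  = 1e9;

    double halves_TBps     = nodes / 2 * link_GBps / 1000;
    double words_per_s     = link_GBps * 1e9 / 8;          /* 64b words    */
    double cycles_per_word = clock_Hz / words_per_s;

    printf("traffic between machine halves: %.0f TB/s\n", halves_TBps);
    printf("per node: %.0f cycles per 64b word\n", cycles_per_word);
    return 0;
}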
MigL Posted November 6, 2015

Haven't discussed crossbars and parallel computing since the late 80s, when the Transputer platform was going to be a co-processor for the Atari ST system. Wonder what ever happened to that?
Enthalpy Posted November 6, 2015

Hi MigL, thanks for your interest! Answering soon.

---------- Drawing for the 2kW AI machine (and database, web server and so on).

It's compact: D=0.8m, h=0.25m, plus casing, supply, fan.

For a 2kW machine, 35µm supply planes distribute the current, and a 0.2m/s air speed cools the machine. But a 26kW number cruncher better has liquid cooling, and the pipes distribute the current as well (see 01 March 2015); two data vessels per line fit more naturally then.

When power isn't a limit, the cross boards can easily grow to accommodate more compute boards. The compute boards can grow too, but they lose density.

Marc Schaefer, aka Enthalpy
Enthalpy Posted November 7, 2015

Haven't discussed crossbars and parallel computing since the late 80s, when the Transputer platform was going to be a co-processor for the Atari ST system. Wonder what ever happened to that?

The Transputer was too advanced for its time - both versus the designers' habits and versus the technology needs. 30 years later, its choices would be the good ones.

----------

Back then, external Ram delivered data at the processor's pace and without delay, so the Transputer's integrated Ram wasn't seen as an advantage. Quite the opposite: it was felt as a loss of flexibility, and as an alliance of fabrication processes that would make neither a good Ram nor a good Cpu.

Meanwhile, Cpu have accelerated muuuuuch more than the Ram, which delivers neither a decent access time (150 cycles for a word!) nor even the throughput. A four-core Avx256 that just accumulates at 3.5GHz many products from two vectors needs 112G words/s, while a four-channel Pc4-27200 delivers only 109GB/s, eight times less.

On-chip caches improve the latency, but they give throughput only if the program repeatedly accesses the same small data set. If linear algebra software (which is already a very favourable case) is programmed the traditional way, going linearly through the data, then it keeps stepping out of every cache level and gets limited by the Dram throughput.

To get performance, the programmer must cut each loop into smaller ones and reorganize the computation to reuse many times the data already in L3, and possibly do the same at L2 size, L1 and the registers. The old "golden rule" of equal throughput at every memory level has been abandoned, and this has consequences.

Then you have applications like databases, which have to sweep through the Dram, and not even at consecutive locations. They just stall on the present memory organization.

Add to it that each core needs vector instructions to run as it can, and that the many cores need multitask programming, and you get the present situation at the Pc, where very few applications (or OS) make a somewhat decent use of the available capability. Some video games, image and video editing software, a few scientific applications - that's all, more or less.

Present supercomputers are worse, as their network is horribly slower than the Cpu, so multitasking among the sockets must be organized accordingly. Also, they often add hardware accelerators that are binary-incompatible with the general Cpu and introduce their own hard limits on data interchange. You get a many-million-dollar toy that is extremely difficult to use. I believe something new is needed - and users seem to push too.

Then you have the emerging market of databases, web servers, proxies, search engines, big data. Present Pc are incredibly inadequate for them, but the servers are only a bunch of Pc put loosely together. These may be the first customers to wish different machines, and to accept changing the software accordingly, even more so than supercomputer users - that's why I describe such machines every time. Remember the worldwide eBay fitting in an Atx tower.

----------

One design feature here is to put the Dram and the Cpu on the same chip. The Dram fabrication process may be suboptimal for the Cpu, it may need some development, but this is the way to an access width like 500 bytes that provides the throughput for just one scalar core.
As well, a smaller individual 250MB Dram for a 1GHz scalar core is faster (shorter lines) than a 1GB chip shared among 4 Avx cores; its protocol is also simpler. And once on the chip, the Dram easily offers much-needed access capabilities like scaled and butterfly addressing at full speed.

I also suppose (to be confirmed) that Dram manufacturers feel presently no pressure to reduce latencies. Within the L1, L2, L3, Dram scheme, their role is to provide capacity, which is also the (only) figure customers care about. But it also means that progress can be quick if a computer design puts the emphasis on Dram latency; maybe it suffices to cut the Dram area into smaller parts, accepting to waste a little bit of chip area.

Intel has recently put a separate Dram chip in the package of a mobile Cpu. The benefit isn't huge, because the Cpu fabrication process made a small and power-hungry Dram serving only as L4. Intel has also revealed plans to integrate Micron Dram on the socket; this solves the capacity and power issues, but not the throughput nor the latency, which need the bus on the chip. Also, the present organization with several vector cores on one chip needs several Dram chips to achieve the capacity per MFlops.

My proposal instead is to have just 1 or 2 scalar Cpu per chip, and then all the corresponding Dram fits on one chip. The equivalent Dram capacity and processing power fit on the same number of chips, just better spread. While Intel can certainly develop a Dram process, alone or with a partner, going back to a single scalar Cpu seems to imply a loss of compatibility, so I'm not sure (but would be glad) that the combination I propose will evolve from the 8086 architecture. Mips and Arm are contenders - or something new.

----------

The Transputer had 4 serial links - if I remember - like present server Cpu, to connect 5 Cpu directly, or make a 16-node hypercube, etc. Later Inmos proposed a 32x32 crossbar among the serial links for a 32-node machine with a good network. One could have made a 32⁴ = 1M multidimensional crossbar from these components, but I didn't read such a description then.

One possible obstacle then was that big machines had slice processors that cleanly outperformed microprocessors. The already older Cray-1 provided 80M mul-acc per second from one pipelined Cpu; this was achieved just from vector programming, already on short data. As opposed, microprocessors took dozens of cycles at 10MHz just for a double multiply. Why would users of linear algebra have gone to heavy multitask parallelism then? The Dec Microvax was the first (monster) microprocessor to outperform MSI circuits in a minicomputer (not a supercomputer), and it appeared years after the Transputer.

Presently the only way is complete Cpu integrated on a chip, and the only way to get more processing power is to put more processors. Possibly vector ones, with a faster or slower clock for consumption, but basically there is no choice. And because data transfers are slow among the chips, extensive parallelism results only from loose multitasking - no choice either.

That's why I propose that parallelism results from loose multitasking only. Not from small vector capability, not from a complicated scheme of caches shared or not with difficult consistency. Instead, have just scalar Cpu with a good Dram each, and a good network, so that massive multitasking is the only programming difficulty.

----------

What shall the network topology be? Well, anything feasible and efficient.
Because in a machine with one million or half a billion nodes, some hardware will be broken at any time and the tasks reallocated to other nodes, the application can't depend on topology details: the application defines a bunch of tasks, the hardware or the OS defines where they run and finds which node has the data another seeks. So the conceptual ease of use by the programmer isn't important; only the performance is.

Though, a dense network is difficult to build for many million nodes. Supermachines are essentially a matter of cabling - hence the thread's title. 1D crossbars are excluded. Hypercubes have been built but are difficult at big machines. Present supercomputers have a fat tree - which is a multidimensional crossbar that abandoned its ambition - or more often a hypertorus, which wouldn't be bad by itself, but is always far too long because of cabling constraints.

One hard limit of present supercomputers is that they use fibre optics, and even Ethernet, for connections. Because the connectors are huge for just a single signal, the networks offer a ridiculous throughput. Pc technology is just inadequate. That's why I propose:

The cables made of flexible printed circuits and their connectors (see 02 December 2013) that transport 512 signals;
The compact capacitive connectors among crossed printed circuits (see 22 February 2015);
The signal vessels (see 02 February 2015) to drain data to the board interfaces.

I too would love to have cables with 1000 fibres and connectors 40mm wide and 10mm high, but up to now I've found credible solutions for electric cables only.

With such routing hardware, topologies become possible beyond the cramped hypertorus of a few fibres. With hypercubes I could go nearly to the Tianhe-2 size, and very cleanly (printed circuits only) to 2PFlops. With multidimensional crossbars I hope to go farther, as they need fewer contacts at the chips and boards, but the routing density of printed circuits limits the machine's compactness. At Tianhe's capacity, a good network is possible. For the exaflops machine, let's try.
Enthalpy Posted November 9, 2015 (edited)

Here's a one-cabinet machine with crossbar. It carries 240 000 computing packages, so with a two-node chip per package it crunches 0.96PFlops in 120TB Dram, while the 3D crossbar carries 480TB/s peak between machine halves.

As a DB-AI-Web machine, it depends on how many nodes shall share the limited network. Three 12-node chips and 32GB Flash per package (sharing two link pairs per dimension) offer 360TB and 280PB/s Dram, 26Pips, 260P bitlogic/s, 7PB and 48TB/s Flash.

The cabinet is about 1.5m wide, tall and long. No repeater is drawn here, nor are the data vessels. At the computing boards' edge opposite to the cross boards, the cooling pipes are insulated and converge to the board's fluid feed at the lower corner, while the low voltage injected in the pipes is made near this edge from 48Vdc received at the upper corner.

The compute boards carry two network dimensions. At 20mm*24mm pitch it takes ~16 layers. A board is 1.2m tall and 0.9m long without the supply, matrices, repeaters. Strong Mylar foils protect the components when inserting and extracting the boards.

Two 4mm*7mm data vessels per package line carry 80 signal pairs each in 14+15 layers. 2*6 chips on each vessel exchange the data capacitively with the cross boards.

The cabinet has two crossboards per package line. They're 1.2m long, and 25 layers make them 0.3m wide. Each face carries 40 matrices, for instance in two rows.

Through their 1.2m*0.3m section, the crossboards carry 240k pairs of quick signals. That's 0.75mm² per signal, and improvement is desired. Fibres would outperform that with dense wavelength multiplexing, but then the filters shall be tiny, thanks.

The hypercube of 01 March 2015 brings 2PFlops, but the crossbar machine could have 2GHz clocks too. Instead of 0.9MW, the crossbar machine takes 200kW and is smaller, with better latency and fewer collisions.

Marc Schaefer, aka Enthalpy

Edited November 9, 2015 by Enthalpy
Strange Posted November 11, 2015

Haven't discussed crossbars and parallel computing since the late 80s, when the Transputer platform was going to be a co-processor for the Atari ST system. Wonder what ever happened to that?

It was redesigned as an embedded processor for application-specific microcontrollers. For many years, the vast majority of set-top boxes were based on the transputer/ST20 - as well as applications in DVD players, GPS, etc. I don't think it is in production any more.
Enthalpy Posted November 12, 2015 (edited)

Chips can transmit signals capacitively, as I described on 22 February 2015 (second message). These chips can also transmit differential signals if any better, and optionally use the pattern I described on 24 November 2013.

I proposed that my data vessels use capacitive coupling; the cables made of flexible printed circuits I proposed on 02 December 2013 can also use capacitive coupling instead of electric contacts. As capacitances accept some distance between the electrodes, the connectors for hundreds of contacts are easier to make and more reliable. I'd keep active redundancy. The ends of the cable can be stiff: a stack glued together, or a piece of stiff printed circuit or ceramic hybrid circuit there.

On 23 February 2015, I proposed capacitive coupling chips that are adaptive, to transmit hundreds of signals over a small area and accept imperfect alignment. Such chips look useful at connectors for many-signal cables, and maybe at data vessels too.

Marc Schaefer, aka Enthalpy

==============================================================================

Integrating the Dram and the processors on one chip is the good method, but adapting the process may be costly. As an intermediate step, two chips made by different existing processes can be stacked in one package.

For the machines I describe, the processor(s) are smaller than the Dram. If Intel or Amd add one Dram cache chip to their processors, this Dram will be smaller.

This needs many signals. For instance, supplying 4* 64b at 1GHz from a 15ns Dram needs to access 480 bytes at a time. Stacking the chips passes many signals more easily than cabling them side-by-side on the package.

The signals can be faster than 15ns. At 1GHz "only" 256b per node are necessary. Intel and Micron have such plans, for several Dram chips per package. Two nodes with 256 signals plus Gnd still need 1.3mm*1.3mm at 40µm pitch. Twelve nodes need more. Possible but not pleasant.

The signals can pass capacitively. A PhD thesis exists on the Web on this topic. This won't ease the alignment of the chips, but it improves the production yield and reliability as it avoids making hundreds of contacts simultaneously.

My scheme for adaptive connections between two chips, described on 23 February 2015 (drawing reproduced above), does ease the alignment of the chips, hence permits denser connections. It can use capacitances, or contacts similar to flip-chip methods, with reflow material brought on the lands (pass the Gnd elsewhere).

Between two chips, have redundant signal paths and choose them after assembly or after reset.

Marc Schaefer, aka Enthalpy

Edited November 12, 2015 by Enthalpy
Enthalpy Posted November 14, 2015

[...For Intel], going back to a single scalar Cpu seems to imply a loss of compatibility [...].

Intel can keep the compatibility while shrinking the consumption. A single core without hyperthreading can run multitasked software as well. A scalar core can run Sse and Avx instructions, just in several cycles. Or even, the Cpu could have Avx hardware but stall the execution for some cycles to limit the consumption when vector instructions are executed. The wide registers don't cost power; only decoding the complicated instruction does, a bit.
Enthalpy Posted November 15, 2015

How to build an exaflops supercomputer? By interconnecting a thousand petaflops cabinets.

The compute cabinets follow the 09 November 2015 description, but with horizontal compute boards, vertical crossboards, and additional connectors for the two new dimensions. The clock climbs to 1.04GHz for 1PFlops. More later.

512 switch cabinets in each new dimension contain the corresponding matrix chips and shuffle the signals. Cables as described on 02 December 2013 connect the cabinets, possibly with capacitive connectors as on 12 November 2015. More later. Feel free to let the cables superconduct in their floor.

0.5G Cpu in 0.25G chips and packages carry 125PB and 16EB/s Dram, and the network transports 500PB/s peak unidirectional between machine halves. Though, present supercomputers have 6 times less GB/GFlops than this Pc ratio, so if reducing the network throughput too, the machine can be smaller just with more nodes and chips per package.

A DB-server-AI machine with 1.5GB, 36 nodes and 32GB Flash per package would have offered 370PB, 300EB/s, 28Eips, 280E bitlogic/s, 8EB and 50PB/s, but the Web must cumulate some 50PB of text and index data is smaller, so a Web search engine or proxy would have one dimension less.

On a single floor, the cabinets would spread over 150m*150m, so several floors would improve the latency and be more convenient.

The usual Cpu figures let the machine consume 200MW. Though, more recent processes improve that, as the Xeons already show. And if the MFlops/W improve further below 1GHz, it must be done. The supercomputer better resides near a supplying dam owned by the operator. The OS must ramp the Cpu up and down in several seconds so the turbines can react.

Power distribution will follow the usual scheme for the usual reasons, for instance 200kV, 20kV, 400V three-phase to the cabinets, 48Vdc to the boards, <1V to the chips. The transformers can sit all around the lighter building, but this wastes some power.

Cooling by liquid, as on the previously described boards, makes the machine fast as hell and silent as well. With horizontal compute boards, the cabinets' design must limit the spills - more later. The fluid can be a hydrocarbon for insulation, long-chained for a high flash point, similar to phytane or farnesane; hints for synthesis there:
http://www.chemicalforums.com/index.php?topic=56069.msg297847#msg297847

A half Tianhe-2 would have only one new dimension, with <32 compute cabinets around 16 switch cabinets, like a Web search engine.

Marc Schaefer, aka Enthalpy
Enthalpy Posted November 22, 2015

Petaflops cabinets already connect their cores together, over two dimensions on each compute board and a third dimension over the cross boards. A Google server or Tianhe-2 must connect through matrix chips the homologous nodes of all compute cabinets, adding one crossbar dimension: if 48 cabinets contain 0.5M nodes each, then the matrix chip xyz in a switch cabinet connects the 48 nodes located on board x, row y, column z of all cabinets, and there are 0.5M matrix chips in several switch cabinets.

The signals, cables and matrix chips can be grouped in the big dimension without constraint from the three smaller ones. For instance, since the cross boards already connect the (horizontal) compute boards, the compute boards at height x in all compute cabinets can connect to a board at height x in the switch cabinet through cables running between height x only (and that's my plan), or one could have several switch boards per compute board or conversely.

The nodes on a compute board can also be chosen at will to run their signals in a common cable to a remote switch board, provided that all compute boards choose the same nodes and the target switch board carries enough switch matrices and means to route the links. It is my plan to group 500 links in and 500 out per cable, have 500 switch matrices per switch board, and run the links from the 500 compute nodes of 5 half-rows towards a connector at the side edge of a compute board.

The exaflops computer does the same with two additional dimensions of 32 compute cabinets each, as sketched on 15 November 2015. For each group of 500 nodes in every compute cabinet in a cabinet line or row, a cable connects to the common switch board, and since the compute boards carry 16 groups of 500 nodes, there are 16 switch cabinets per line and row of compute cabinets. Now, two big dimensions need two connectors and cables per group of 500 nodes. The figures below refer to this option.

The 40 rows of compute chips on a board (see 09 November 2015), grouped by 5 half-rows, converge to 8 pairs of connectors per side edge, which the operators must plug or unplug when moving a board. Room was available at these edges, and some routing space too, because the onboard row dimension runs more links at the board's centre while the big dimensions run more links at the edges.

Routing the compute board goes nicely for the Google server or Tianhe-2 with one big dimension, adding no layer to the printed circuit. Two big dimensions for the exaflops need more links near the edges, which demand 8 more signal layers plus power, climbing to an uncomfortable 28+. Printed circuits denser than 2 lines per mm, or even bigger than 1.3m, would need fewer layers.

The connectors would better be capacitive. The clamping force is small, reliability is insensitive to small displacements, and the chips can manage redundancy. I prefer one electrode per link, printed on Fr4 and with several smaller chips per connector (22 February 2015), over the adaptive connector chip: the interface with the compute board or the cable defines the size anyway, and a socket for 1000 links is expensive.

2*500 links over a 40mm*40mm connector leave 1.2mm pitch per electrode, easy to align by keys and accepting some play. A sketch may follow.
20 flexible printed circuits, loosely stacked in a protective sleeve, 40mm wide to carry 50 links each, can make one cable.

The boards can have the upper side's connectors for one big dimension and the lower side's for the other big dimension. All cables can bend to climb to the routing floor; 60 stacked compute boards let run 120 superimposed cables, which at 6mm thickness makes a 0.72m thick bunch; or better, have a second routing floor beneath, and send there the cables from the 30 lower compute boards. Expect 10m+ of cabling thickness on each cabling floor; repair by laying a new cable. A Google server is much easier.

The bunch of cables isn't flexible enough to take or give length to one cable, so when plugging and unplugging the cables to move a board, I plan to slip them to the side. The cables and connectors are 40mm wide to permit that in a 100mm pitch at the compute boards' edges.

A description of the switch cabinets should follow.

Marc Schaefer, aka Enthalpy
Enthalpy Posted November 27, 2015

Here's a cut through a computing or switch board, cables and connectors. Not to scale, nor with the actual numbers of components and flex circuits.

Coupling is capacitive as suggested on 22 February 2015. Several smaller chips ease the sockets for 500 links in and 500 out per 40mm*40mm connector, but redundancy eases with chips not too small. A connector can consist of several sockets too; if flatness were a worry, a flat tool can hold the sockets during reflow. At the connectors, the cables can have a piece of stiff printed circuit for flatness.

The flexible printed circuits are loose in a protective sleeve, except at the connectors where the soldered contacts hold the flex together, helped by glue put by vacuum impregnation or in advance on the flex and reflown.

Some shape at the board and connectors holds these at an accurate position. A small clamping force keeps the connectors in place. The connectors can be completely smooth and have a protective insulating layer.

Marc Schaefer, aka Enthalpy
Enthalpy Posted November 29, 2015

Switch cabinets for the exaflops computer contain 60 stacked switch boards like the compute cabinets do. From the 32 compute boards at the same height in one line or row of compute cabinets, sets of 500 nodes converge their links over 32 cables to one switch board with 32 connectors.

A Google server or Tianhe-2 equivalent would instead have one big dimension of 48+ compute cabinets, and the switch board 48+ connectors. Numbers here refer to the exaflops computer.

The switch boards shuffle subsets of 32 link pairs from nodes at identical positions in the 32 compute cabinets to send them to one of the five hundred 32x32 switch matrices - or rather, they send 96 link pairs to one of a hundred and sixty-eight 96x96 switch matrices, as these exist already.

The main board carries the 16,000 link pairs to switch vessels. At 100mm pitch, two connectors or 2000 links take >10 signal and 10 supply layers. In the other direction, 84 switch vessels carry 192 link pairs each to two switch components; the vessels are 13mm tall and comprise >10+10 layers.

The line amplifiers on the main board and switch vessels are not sketched. Switch boards for the Tianhe-2 equivalent are 1.2m*1.2m instead of 0.8m.

Two stacked or opposed switch vessels could lie flat on the main board. Boards with 30+ layers would avoid the switch vessels. More cables with fewer links each would ease the switch boards. Maybe the links of a few compute boards can be partly shuffled elsewhere before landing on smaller switch boards.

Marc Schaefer, aka Enthalpy
Enthalpy Posted December 27, 2015 (edited)

---------- Bytes-to-Flops update

Up to now I estimated it from PC figures, but supercomputers have a smaller Dram for a given computing capability.

Ratio  Bytes  Flops  Machine
-----------------------------------------
0.026  1.4P   55P    Tianhe-2
0.026  0.7P   27P    Titan
0.094  1.5P   16P    Sequoia
0.077  0.8P   10P    Mira
-----------------------------------------
0.032  512M   16G    8 nodes 1GHz
-----------------------------------------
0.14   16G    112G   4-core PC
-----------------------------------------

The top-4 machines could have addressed more Ram, so their capacity is a designer's choice. >512MB Dram chips hence permit 8 nodes each, needing 4x fewer chips than previously estimated. >64MB per node made Windows Nt4 comfortable. A (hyper-) crossbar eases the packages over a cube or torus.

---------- Dram update

Ddr4 chips offer 1GB=8Gb capacity in 2015, which didn't stall as I alleged on 13 September 2015. Organizing the Dram for speed reduces the density a bit, so I take 512MB+ per chip in this thread since 13 September 2015.

A picture of a 1GB chip is there; in the 10mm*6mm, I believe I see 32 subgroups of 32MB, possibly 8192b*32768b.
http://www2.techinsights.com/l/8892/2015-03-27/kbhjk

Take this with mistrust, since capacities were 1000x smaller when I was in the job:

Bit and word lines, of 25nm*75nm tungsten, explain a 40ns access time very easily by their resistance and capacitance over 2mm. This access time scales as the length squared, or as the subgroup capacity.

17 banks (easy Euclidean division) at each of the 8 nodes are 8.5x smaller than the 32 subgroups of the pictured 1GB Dram, hence respond in 5 cycles at 1GHz. An L1 suffices. Accessing 16 of them delivers 3+ words per cycle, enough for a number cruncher.

A database-Lisp-Prolog-Inference engine would have 6 to 12x more nodes (slower mult and float save power, as on 25 October 2015) per Dram. Then, 17 banks respond in 1 cycle and deliver 16 words per cycle, enough for three Alu.

---------- Cpu update

14nm Finfet consume far less than my estimate on 31 January 2015. The Knights Landing
https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing
has 144 vector processors at 1.3GHz that mul-acc 512b = 8 doublefloats per cycle, or 1152 mul-acc on 64b per cycle. It's said to consume 160-215W, or 0.19W per 64b mul-acc at 1.3GHz
https://en.wikipedia.org/wiki/Xeon_Phi
which extrapolates as F² to 0.11W per 64b mul-acc at 1GHz.

Since power saving is vital to supercomputers, but Finfet and Dram processes are highly specialized, stacked chips made by different processes are better, as I describe on Nov 12, 2015. A number cruncher chip whose 8 nodes each exchange three 64b words with the Dram at 1GHz pace needs 1536 signals, and a database engine more: the small adaptive connection I describe on Nov 12, 2015 achieves it, be it by contact (sketch; electrochemical deposition can make small bumps) or capacitive. By the way, capacitive coupling is feasible (with effort) between separately packaged chips pressed against one another, permitting the Dram to evolve.

The compute chip is small, estimated under the 10mm*6mm of a 1GB Dram. Comparing with 32mm*21mm for Intel's Xeon E5-2699v3:
http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz?q=E5-2699%20v3
https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Xeon_E5-16xx_v4_.28uniprocessor.29
http://www.chiploco.com/intel-haswell-ep-xeon-e5-2600-36020/ (die picture)

The L3 vanishes.
2.2G transistors over 5.6G. The L2 vanishes too. Let's say, area *0.60.

Fewer registers are necessary because the Dram provides the throughput to feed the execution unit. Faster context switching.

The E5-2699v3 has 18 Avx2 cores that mul-acc 4 double floats per cycle. That's 9x more than 8 scalar cores. A 14nm process could slash the area by 2 over the 22nm Haswell - maybe. The scalar sequencer is simpler, but each mul-acc has one. Neutral?

Is hyperthreading needed? It obtains 30% more from the same execution units but doubles the registers, the caches, and more than doubles the sequencer. 30% more cores is simpler; maybe it saves power and money.

The 30x smaller compute chip then takes 22mm². A database-Lisp-Prolog-Inference chip has more sequencers and caches, hence is bigger. The tiny dissipation would enable more nodes per package, which makes sense only if abandoning some network throughput. Two Dram and compute chips fit without through-silicon vias (Tsv), or a bigger database chip with Tsv can carry several Dram.

More companies offer 14nm or 16nm Finfets besides Intel:
http://www.globalfoundries.com/technology-solutions/leading-edge-technology/14-lpe-lpp
http://www.tsmc.com/english/dedicatedFoundry/technology/16nm.htm
http://www.samsung.com/global/business/semiconductor/file/media/Samsung_Foundry_14nm_FinFET-0.pdf
Samsung also manufactures Dram, Flash and offers the Arm Cortex A7.

---------- Supercomputer update

The exaflops machine draws 55MW "only", not 200MW - excellent news.

If one accepts 1/4 the previous network throughput, then the whole machine is 4x smaller. That's 16 cycles per double float. Depending on how much remote data an algorithm needs: if it makes 2*N²*log2(N) mul-acc (2*32b complex) from N² data points, chunks of N² = 256² points = 0.5MB keep the execution unit busy. But N³ mul-acc from N² data points need just chunks of N² = 16² points = 2kB.

Marc Schaefer, aka Enthalpy

Edited December 27, 2015 by Enthalpy
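The chunk sizes above can be rechecked with a few lines; this is my own sketch of the arithmetic, assuming one 64b word arrives from the network every 16 cycles and that points are 2*32b complex values:

#include <stdio.h>
#include <math.h>

/* Chunk-size estimate: the network delivers one word per 16 cycles, so a
   chunk of n*n points must offer at least 16 operations per imported
   point to keep the execution unit busy. Each point is 8 bytes. */
int main(void)
{
    const double cycles_per_word = 16.0;

    /* FFT-like: 2*n^2*log2(n) mul-acc on n^2 points -> 2*log2(n) ops/point */
    unsigned n = 2;
    while (2.0 * log2((double)n) < cycles_per_word)
        n *= 2;
    printf("FFT-like: n = %u, chunk = %.0f kB\n", n, n * n * 8.0 / 1024);

    /* n^3 mul-acc on n^2 points -> n ops/point */
    unsigned m = 2;
    while ((double)m < cycles_per_word)
        m *= 2;
    printf("n^3 case: n = %u, chunk = %.0f kB\n", m, m * m * 8.0 / 1024);
    return 0;
}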
nec209 Posted December 29, 2015

i am MS student... I wana do my final year thesis in optcal computers.. I just wana know is it a hot research area can I proceed with this topic.. I need some starti up help material to know what are major areas in optical computer to do work .. please help me... regardz

I'm sorry but it will be long time before optical computers replace circuit boards probably 20 to 40 years out or more. Well lasers are getting better every year there is long way to go before a working prototype optical computer is built in lab.

ok ... if this topic is out of fashioned then let me know what is going on new in software engineering field and operating system?? and please mention little bit easy work..

optical computers and quantum computers is one of those things that border on fringe science. There is long way to go before optical computers and quantum computers start to hit computer store and go mainstream. sorry it not going to happen before 2050.
Enthalpy Posted December 30, 2015

Compute chips are even smaller than my last figure. The 14nm Knights Landing has no on-chip L3 and carries 1152 mul-acc on an estimated 35mm*25mm from pictures, or 0.76mm² per scalar mul-acc. If the common complicated vector sequencer and the L2 pay for individual simple sequencers, then an 8-scalar-node chip takes 6mm².
Enthalpy Posted January 1, 2016

I'm sorry but it will be long time before optical computers replace circuit boards probably 20 to 40 years out or more.

30 years ago some research papers claimed to have made optical "gates", but these were only light subtracters, a dark fringe if you wish. They were linear, which doesn't make a logic gate. I don't know whether things have progressed meanwhile, but one doesn't need a hot topic to do research.

Is optical computing still interesting at all, since electric components have improved so much? We can have a metal line every 30nm, but optical lines that close would interfere horribly, or rather, they wouldn't even carry light for being too narrow. Making an optical circuit 30² times less dense would need serious reasons.

The power consumption looks bad too. Presently one electron has 0.6eV energy, one photon rather 1eV - but once you have charged 500 electrons at a node, the bit remains without consuming power except for leaks. In contrast, light may need 100 photons right from the beginning of the transition, and more and more photons all the time.

That's why I suggested instead - as everyone does - to concentrate on optical transmissions rather than optical computing. Beware, though, that electric transmissions are good: for instance 7Gb/s over 500 closely spaced pins between a graphics processor and its Ram. That is the kind of figure that needs a strong improvement, so showing a fibre with a 10mm connector that transmits 40Gb/s won't bring anything. It's the reason why I propose printed circuits for the described supercomputers. Including all air, the crossboards transport a 20Gb/s signal in 0.75mm².

To be serious competitors, fibres must be very closely packed including all connectors, transmitters and receivers, or each carry >100 quick links but have tiny filters. To my knowledge, this isn't solved, and it is a current research topic - for telecoms too. Computers need <100m range hence would accept more diverse wavelengths; presently they even use multimode fibres outside the transparency window; maybe it helps.
Enthalpy Posted January 1, 2016 (edited)

Microprocessors, and previously mainframes, execute several instructions per cycle. The sequencer may decide it during the first pass in a loop or on the fly, the instruction cache may store the original instructions or the resulting "macro-op": many variants exist. It gains speed but complicates the sequencer.

I suggest instead that the compiler does this job, using special explicit macro-op instructions that are defined dynamically at execution and always executed in one cycle. Example:

ClrMacro 0
DefMacro 0  R0 = R1 + R2
DefMacro 0  R2 = R1
DefMacro 0  R1 = R0
DefMacro 0  Blt R4++, Maxloop, Macro 0
Macro 0

would define Macro 0 then execute it as:

#pragma HeyCompilerMacroThat
do {
  R0 = R1 + R2;
  R2 = R1;
  R1 = R0;
} while (r4++ < Maxloop);

The sequencer can be dumb. It can just stall the execution if data isn't available yet. The machine still needs conditional instructions, maybe register renaming.

The defined macros must be saved at context switching, so there are very few macros. A loop can contain several macros, and a macro can contain several passes of the source code loop. Each execution and branch unit can memorize locally the part of a macro definition it contributes and recognize its call. Branching to a macro, not only to addresses, helps the hardware.

A complicated sequencer must run faster when some instruction unexpectedly lags, say a memory access, but this is infrequent at the suggested database - web server - AI engine, where the tiny unit Dram responds in very few cycles. Advantageously, several loop passes are easier to run in a cycle when the compiler decides it.

I suppose it was done long ago, at a mainframe, a Dsp... Here at a database - web server - AI engine, whose execution units are tiny (multicycle floats and mult), it reduces the consumption and size of the sequencer and keeps it efficient.

Marc Schaefer, aka Enthalpy

Edited January 1, 2016 by Enthalpy
Enthalpy Posted January 2, 2016

If you didn't understand the example, it's because I botched the parallel version. Better:

BeginMacro 0
R0 = R0 + R1
R1 = R0
Blt R4++, Maxloop, Macro 0
EndMacro

Macro 0

It shows as well that the macro-op can't be executed sequentially during its definition. Making a block of the definition is then better. Compilers are already smart enough to rearrange the registers as needed.

-----

A processor with explicit macro-ops and several execution units can do as much as a vector processor. It doesn't need extra vector instructions, and it combines different instructions and registers more flexibly in one cycle. The sequencer is much simpler because the compiler does the job.
Enthalpy Posted January 3, 2016

The varied flavours of Gamestation can also benefit from spreading the Cpu among the Dram, from stacking Cpu and Dram chips, and from the networks I've described. I take as an example the Playstation 4
https://en.wikipedia.org/wiki/PlayStation_4
https://en.wikipedia.org/wiki/PlayStation_4_technical_specifications
with the same 8GB of Dram and 1152 single-float Cpu, on the 28nm Amd process like the parent Radeon HD 7870 Pitcairn XT.

Each of the 16 compute packages now contains a compute chip with 72 Cpu, is 18mm² big because of the individual sequencers, and dissipates 7W at 800MHz. The Jaguar Cpu drops away. The package also contains, stacked, a 72*8MB Dram chip.

One Cpu accesses 3 Dram banks in a cycle to exchange three 32b words. With the addresses, 360 contacts in 0.25mm² by my adaptive connection leave a 26µm*26µm pitch.

The compute chip has an internal switch matrix among the 72 Cpu and to the 32 in and 32 out external shared links accessed by through-silicon vias. This accepts 170-ball compute packages and switch packages. 16 packages of 32x32 switch matrices operate in parallel, with 2*2 links at 2GB/s to each compute package.

The printed circuit can have 7+6 layers and be 120mm*120mm small, plus some power converters.

Each compute package can have 8GB of Mlc Flash at its side to load the games 16x faster than an Ssd would.

The no-latency Dram provides 3 words per cycle to each Cpu and L1, or 11TB/s. The network transfers 500GB/s between any machine halves, three times as much as the PS4's Gddr5; the nodes can simultaneously receive a word each in 4 cycles. The design is easy to cool and to scale up.

This design can spread among gamestations earlier than among graphics cards because fewer companies must adapt.

Marc Schaefer, aka Enthalpy
dating.forbeer Posted January 6, 2016

Backed up by good references and research, I think you're on the right track.
Enthalpy Posted February 14, 2016 (edited)

Intel reveals the Knights Landing, a new and much improved Xeon Phi meant for supercomputing.
https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing
https://en.wikipedia.org/wiki/Xeon_Phi

Its 144 vector processors each mul-acc 8 double floats in one cycle at 1.3GHz, so one huge chip offers 3TFlops. Better: this edition accesses directly <=384GB of Dram to remove one bottleneck and ease programming.

The Knights Landing smashes Gpu on double floats, consumes less and is much easier to program, so it's a clear choice for supercomputers.

Since the architecture I propose deduces its performance from the Knights Landing, the number of mul-acc is the same and the consumption too, except that I assumed a maybe suboptimum 1GHz clock. Though, my architecture is obviously waaaaay better - but then, why?

----- Computer size

One Knights Landing and its six Dram modules occupy 150mm*150mm*45mm, or 880mm³ of rack volume per mul-acc. The boards I described on 09 November 2015, putting 2 chips in 24mm*20mm*20mm but with 8 nodes per chip, occupy 600mm³ per node. That's equivalent, unless one accepts to keep the network density and put more chips, which my design and the much easier cooling allow.

Both options add a cabled network, which is bulkier in my design because its throughput is much bigger.

----- Chip production cost

Again, as many mul-acc for the same computing power, but my design needs no L2, no L3 nor "near memory", and it spreads the mul-acc over smaller chips. Scalar processors need more sequencers, but fewer instructions make these simpler. The snappy Dram makes hyperthreading unnecessary, and the explicit macro-op instructions I described here on 01 and 02 January 2016 can replace out-of-order execution if desired.

More silicon foundries can make the smaller chips, possibly for cheaper than the huge Knights Landing. I explained here on Nov 12 and Dec 27, 2015 how to keep the Finfet and Dram chips separate.

----- Memory throughput

At algorithms needing 2 double floats per operation, the Avx512 demands 128 bytes per cycle: for each core of the Knights Landing, that's 166GB/s, which only the registers and L1 can provide. The L2 is 4x, the near Dram 48x, the far Dram 260x too slow for simple programming.

When a bigger problem lets an algorithm make more computations per data, the solution is to decompose the problem into varied subchunks that fit in the successive caches. Here is a simple example for matrix multiplication; some algorithms accept less memory throughput, but others like the Fourier transform demand more, and at least this one can be planned and it accesses consecutive data (one matrix is transposed in advance).

The units are doublefloats (1W = 8B = 64b) and their mul-acc (fma). I suppose the algorithm makes few writes and loads complete submatrices at once - partial loads would be even more complicated.

The program can't sweep through both source matrices. It must load 1024*1024 chunks in the near Dram, 64*64 in the L2, 16*16 in the L1, 8*8 in the registers, to make successive uses of a bigger chunk. That's a set of loops for each cache size. The compiler may manage the registers, but not the L1, L2, near Dram.

Each cache performance has a small margin in speed or chunk size, but it's eaten by hyperthreading, and one same executable isn't optimum on the Knights Landing's predecessor and successor. Add mentally the vector instructions and alignment.
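To make those "Russian dolls" of loops concrete, here is a minimal sketch of cache blocking with a single tiling level; the 64-wide tile and the matrix size are arbitrary choices of mine, and a real Knights Landing version would nest further levels for L1 and the registers, plus vectorization and alignment:

#include <stdio.h>
#include <stdlib.h>

#define N  512   /* matrix size, illustration only         */
#define T   64   /* tile edge, chosen to fit a cache level */

/* C += A * Bt, where Bt is B transposed in advance so that both source
   matrices are swept along consecutive addresses. One tiling level only. */
static void matmul_tiled(const double *A, const double *Bt, double *C)
{
    for (int i0 = 0; i0 < N; i0 += T)
        for (int j0 = 0; j0 < N; j0 += T)
            for (int k0 = 0; k0 < N; k0 += T)          /* reuse loaded tiles */
                for (int i = i0; i < i0 + T; i++)
                    for (int j = j0; j < j0 + T; j++) {
                        double s = C[i * N + j];
                        for (int k = k0; k < k0 + T; k++)
                            s += A[i * N + k] * Bt[j * N + k];
                        C[i * N + j] = s;
                    }
}

int main(void)
{
    double *A  = calloc(N * N, sizeof *A);
    double *Bt = calloc(N * N, sizeof *Bt);
    double *C  = calloc(N * N, sizeof *C);
    if (!A || !Bt || !C) return 1;
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; Bt[i] = 2.0; }
    matmul_tiled(A, Bt, C);
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * N);
    free(A); free(Bt); free(C);
    return 0;
}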
In addition, every supercomputer is loosely multitasked through a slower network. Mess.

While feasible for matrix multiplication (imagine an Fft or a database), the Russian-doll data chunks make the program difficult to write, read and port, and more prone to bugs. Ancient computer designers wanted instead the same throughput at every cache level and called this the Golden Rule, which my proposals follow.

My architecture blanks out the Knights Landing here.

----- Network

A 3TFlops Knights Landing chip has two Pci-E 3.0 16x buses. Each carries 16GB/s in and out. Imagine a 2D matrix of 578² Knights Landing: through any network equator, the chip sends and receives 16GB/s. It computes 750 mul-acc in the time it receives one doublefloat. The transport of 1024*1024 chunks is 19x slower than their bidimensional Fourier processing.

Well, the chip could do that at most, but planned supercomputers foresee hyperdull fibre hypertori.

The chips and networks I propose carry 2GB/s from and to each 2GFlops node in each dimension. A node computes 4 mul-acc while receiving one doublefloat. Even with complex 2*32b mul-acc, 4*4 chunks occupy the execution units with a 2D Fourier transform. If slashing the network by 4, 256*256 chunks suffice. Clean win here too.

Marc Schaefer, aka Enthalpy

Edited February 14, 2016 by Enthalpy
Enthalpy Posted April 9, 2016

A banal PC too suffers from limited Dram bandwidth. The OS and many applications fit in some cache, but databases, Internet servers, file servers, scientific computing and more would need to access quickly a memory range that exceeds the caches, and they stall or need intricate programming.

Many scalar Cpu with a small private Dram, as previously described, solve it but change the programming model, so the general PC market would be unlikely to adopt them.

But for instance a quad-core Avx Cpu can be stacked with a common Dram, as described there
http://www.scienceforums.net/topic/78854-optical-computers/page-5#entry898348
to provide the good bandwidth and keep existing software.

The Dram, in the same package as the Cpu, has a limited and frozen capacity. At end 2015, a 1GB Ddr4 chip measured 9.7mm*5.7mm, hence 8GB take 20mm*23mm. This is half the Knights Landing area, so mounting and thermal expansion are already solved, and Dram accepts redundancy.

Reasonable programming wants 2 reads/cycle at least, so a 3.5GHz quad Avx256 needs 112GW/s of 64b (900GB/s). If the Dram's smaller banks react in 18ns, then 2047 banks, with 64b wide access, suffice. Favour latency over density. The previously described scaled indexed and butterfly data transfers are desired. Cascaded logic, of course.

Transfers every 18ns between the chips need 131,000 contact pairs. At 5µm*5µm pitch that's 1.8mm*1.8mm plus redundancy. Routing in 4*1.8mm width takes 2 layers of 100nm pitch. A 1ns transfer cycle eases it.

No L3 should be needed, maybe a common L2 and private L1 pairs. Many-socket database or math machines don't need a video processor per socket either. The computing chip shrinks. More cores per chip are possible, but more sockets seem better. I've already described the interconnections.

Flash chips deliver 200MB/s as of end 2015. They should have many fast direct links with the compute sockets.

Marc Schaefer, aka Enthalpy
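A short sketch of the arithmetic behind those figures, my own check using only the numbers quoted above:

#include <stdio.h>

/* Back-of-envelope check: word rate demanded by a quad-core Avx256 doing
   2 reads per cycle, the number of 18ns Dram banks kept busy by it, and
   the chip-to-chip contact count for 2047 banks of 64b width. 2047 =
   2^11 - 1, presumably for the same easy Euclidean division as the 17
   and 127 bank counts elsewhere in this thread. */
int main(void)
{
    const double clock_hz  = 3.5e9;
    const int    cores     = 4;
    const int    words_avx = 4;     /* doublefloats per Avx256 access */
    const int    reads     = 2;     /* reads per cycle                */
    const double bank_s    = 18e-9; /* bank response time             */

    double words_per_s = clock_hz * cores * words_avx * reads;
    double busy_banks  = words_per_s * bank_s;
    double contacts    = 2047.0 * 64;

    printf("demand: %.0f Gwords/s = %.0f GB/s\n",
           words_per_s / 1e9, words_per_s * 8 / 1e9);
    printf("banks busy at any instant: %.0f  -> take 2047\n", busy_banks);
    printf("chip-to-chip contact pairs: %.0f\n", contacts);
    return 0;
}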