Enthalpy Posted January 24, 2015

Already the Service Pack 1 for the Gpu... The new front view shows 1024 Macc and the side view 2048 Macc on two linked cards.

A video card's height leaves no room around the Bga. The green lines must pass elsewhere. Between two Bga of 64Macc (in one or two chips) plus the rear side are 256 lanes, not 128. The pink and blue dimensions need 4 layers, varied signals 1, plus as many ground or supply planes: the Pcb has already 10 layers. To align 4 Bga, the green dimension passes twice through the card's middle. To route them, the Pcb has 7 layers there, one for other signals, plus as many ground or supply planes: 16, ouch.

Or add flexible printed circuits to the stiff board to route the green signals between the card's right and left ends - shown as thin green lines on the side view. 512 lanes fit in four flex with one signal and one ground layer each. The stiff board keeps 10 layers. If passing green signals under the Bga, the orange dimension's lands are distributed to limit the obstruction. Heat pipes (brown in the side view) shall take the heat from the rear side. Copper pillars through the Pcb would have taken much room.

Flexible printed circuits can link two cards in a full hypercube. 2048 signals fit in 4+4 wide flex that, at 20Gb/s, transport 4TB/s. The flex are permanently soldered on the Pcb and demand a strong mechanical protection. Some elastic means holds the Pcb permanently together.

Can we exaggerate? As described, each Macc node can exchange 20Gb/s in any dimension, or two words in 8 cycles. I feel a Gpu accepts less, and because the links are reshuffled at the chips, we can have fewer links and then more chips, or more Macc per chip.

Marc Schaefer, aka Enthalpy
Enthalpy Posted January 25, 2015

One or two Pci-E cards can be a data cruncher in a PC to outperform the Cpu (170GFlops and 50GB/s for a six-core Avx). Some existing applications have been ported to a video or video-derived card (1300GFlops, 6GB and 250GB/s), and some other chips are meant for them, but: databases need quick access to all the Dram; scientific computing too, and a Gpu is difficult to program when the function isn't supplied in a library; artificial intelligence runs better on multitask machines, as do most applications. Scalar 64b float Cpu, with the wide access to the whole Dram resulting from same-chip integration, linked more loosely by a network, fit these applications better than many processing elements separated from the Dram by a limited bus. So here's how to cable a bunch of them on a Pci-E card.

256MB Dram per scalar Cpu fit usual applications better than 8GB for 2000 processors. Dram chips have now 256MB or 512MB capacity, so each chip shall carry 1 Cpu, or exceptionally 2. The broad and low Dram responding in 15ns reads or writes 512B at once: that's 3 words per processor cycle. An 8kB Sram caches 16 accesses and responds in 1 processor cycle, the registers do the rest. 512MB Dram chips fit in an 11.5mm*9mm*1.1mm Bga, so a 256MB+Cpu package shall measure 9mm*6mm. I estimate it to draw 1W at 1.5GHz and 0.5W at 1.1GHz.

----------

256 chips fit on one card, better on both sides, for a peak 768GFlops and 8.7TB/s at 1.5GHz. A full switch matrix takes one previously described switch chip. A hypercube is less easy to route; by cabling first 4 chips in a package, the card's Pcb needs about 9 layers.

512 Cpu fill a card much - easier with 2 Cpu per chip - for a peak 1126GFlops and 17TB/s at 1.1GHz. A switch matrix then takes 4 big packages and the Pcb about 8 layers. A hypercube would host and route 16 chips of 2 Cpu per 600-ball Bga package, with 8 packages per side on the 10-layer Pcb.

A 1024 Cpu hypercube might have 32 chips of 2 Cpu per 1300-ball Bga package, by stacking two chips or populating both sides, and then the card looks like the recently described Gpu with 1024Macc - or double in Sli. If feasible, the card would bring 2252GFlops and 34TB/s, an Sli the double - still at 1.1GHz, needing 500W or 1000W power and cooling.

Modules perpendicular to the Pcb hold more chips and are easily cooled. An So-dimm connector with 260 contacts suffices for matrix switching only. A low two-sided So-dimm card hosts 40 Cpu; 24 So-dimm in two rows or around a central fan total 960 Cpu. The switch matrix takes 16 packages (32 per card if Sli) on a thick Pcb, and six sheets of flexible printed circuit can connect two cards. At 1.1GHz, this card offers adaptable 2212GFlops (64b), 240GB, 33TB/s and the double in Sli.

----------

Customers may adopt a new data cruncher more easily than a video card. It needs no DirectX standard, can integrate an existing Risc Cpu, and many applications are easy to program efficiently for it. The public pays 250€ for 16GB Dram now, so a 1000 Cpu card might sell for 10k€, which many companies can afford.

Marc Schaefer, aka Enthalpy
Enthalpy Posted January 31, 2015

Here's a better estimate of the consumption of a 64b Cpu, including when slowed down.

Intel's 2700K, a 32nm Sandy Bridge, draws 95W at 3.5GHz for 4 Avx cores making each 4 Mult-Add per cycle. That's 5.9W per 64b scalar Cpu and includes the refined sequencer, all registers and caches, bus, pins... Intel has a 22nm tri-gate process (5.5W at 4.0GHz) but the combined chips need a process good at Dram.

To estimate how much power underclocking saves when undervolting, I refer to
http://www.anandtech.com/show/5763/undervolting-and-overclocking-on-ivy-bridge
http://www.silentpcreview.com/article37-page1.html
http://www.hardware.fr/articles/897-12/temperatures-overclocking-undervolting.html
and a mean value is: P scales as F^2.4 down to F/2 and rather as F^2.0 below. Having more of slower Cpu saves power, as known from Gpu and Cpu.

So this is the estimated consumption of a scalar 64b float Cpu:
3.5GHz 5.9W
3.0GHz 4.1W
2.0GHz 1.6W
1.5GHz 0.8W
1.0GHz 0.4W

----------

How much does the Dram consume? The 5W for Intel's L3 is very bad or very wrong, since four 8-chip memory modules don't dissipate 160W - already the Ddr2 modules ran cool - so here's my estimate. The read amplifiers preload the data lines of 500fF each through 5kohm and 10kohm Mos sharing 1.2V; 2048 of them working 1/5 of the time draw 40mW. One 5mm 4*64b 2GHz 0.8V bus to the cache draws 80mW. That would be 0.12W when the Dram works at full speed, quite less than the 1.6W for the Cpu.

----------

A Cpu with one-cycle 64b float M-Add excels at scientific computing. Though, databases need fast integer comparisons rather than float multiplications, possibly with more Cpu per gigabyte; Lisp's needs resemble databases plus a quick stack and a fast Dram random access.

Maybe binary-compatible chips with different optimizations target these uses better, say one for float operations and the other for quicker, simpler integer operations. More Cpu per chip need more contacts at the package.
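As a cross-check of that scaling law, here is a minimal C sketch (my own assumed function names); with the 2700K reference of 5.9W per scalar Cpu at 3.5GHz it reproduces the table above within rounding:

#include <math.h>
#include <stdio.h>

/* Hypothetical model of the figures above: power scales as F^2.4 down to
   half the reference clock, then as F^2.0 below. Reference: 5.9W at 3.5GHz. */
double cpu_power_w(double f_ghz, double f_ref_ghz, double p_ref_w)
{
    double f_knee = f_ref_ghz / 2.0;                    /* 1.75GHz here */
    if (f_ghz >= f_knee)
        return p_ref_w * pow(f_ghz / f_ref_ghz, 2.4);
    double p_knee = p_ref_w * pow(f_knee / f_ref_ghz, 2.4);
    return p_knee * pow(f_ghz / f_knee, 2.0);
}

int main(void)
{
    double clocks[] = { 3.5, 3.0, 2.0, 1.5, 1.0 };
    for (int i = 0; i < 5; i++)    /* prints ~5.9, 4.1, 1.6, 0.8, 0.4 W */
        printf("%.1fGHz  %.1fW\n", clocks[i], cpu_power_w(clocks[i], 3.5, 5.9));
    return 0;
}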
Enthalpy Posted January 31, 2015

On 02 December 2013 I proposed connectors and cables that run from one board to another and carry many or all signals between two boards. One alternative would be to choose identical subsets of nodes on each board and have, for instance, one stiff printed circuit make all connections across each subset. The nature of a hypercube needs no connection between the cross printed circuits, as the processing boards make them, and all cross printed circuits are identical. The subsets can be sub-cubes of the boards or not.

A good stack of multilayer cross printed circuits accommodates many signals. This setup typically needs more connectors with fewer signals each for a constant total. It provides a cleaner routing than many big flexible cables and is compact. Numerical examples should follow, as well as connector examples - the sketch doesn't show every signal.

The cross printed circuits could have processing boards at two edges, four, more, be circular... but then they're difficult to service. I prefer to have connectors and maybe cables between separate sets of cross printed circuits, as long as the number of signals permits it.

As another variant, the cross printed circuits could be processing boards and vice versa. The computer would then have two sets of boards, say vertical and horizontal, each set completing the interconnects for the other. This needs bigger connectors.

Cables remain necessary for the biggest computers. They can complement the cross printed circuits at longer distances. The links from each node heading to a small distance travel with their neighbours in a subset, while the links heading to a long distance travel with other links from the board that have the same destination.

Marc Schaefer, aka Enthalpy
Enthalpy Posted February 2, 2015

The connections between the processing boards and the cross printed circuits must pass many signals but be narrow and thin to enable close packing. Here the boards and the cross circuits have slits to slip into one another, and the signals use the overlapping length on four sectors to pass.

Big printed circuits are flexible, so the slits define the relative positions and must have accurate widths and positions. Added plastic parts would help, gliding easily and maybe with some elasticity.

To carry thousands of signals per board, electric contacts would demand a huge insertion force. I consider instead optical links here, with photodiodes and Leds or laser diodes. Sketched for vertically-emitting diodes, but side emitting looks simpler. Plastic parts at each board carry the light, here with an angle that could also result from an elliptic profile; the separator to the adjacent beam absorbs light instead of reflecting it, so the remaining light has a small divergence at the gap between the parts attached to the processing board and the cross circuits.

The light carriers could consist of 0.6mm transparent plastic sheet alternating with 0.4mm black one, all glued before the profile is made. At 1mm pitch between the light signals, this would accept, in a direction where printed circuits are stiff, nearly 0.4mm misalignment before the links interfere. Some trick shall index the parts' positions at production.

Optical chips are small and don't integrate electronics well, so they shall piggyback on a bigger electronics chip, say 9mm long and 3mm wide if shared among adjacent slits. 10Gb/s is present technology. The electronic chips recover the clocks, reshape the pulses, drive the optics or transmission lines, and detect faulty optical links to replace them with redundancy, having for instance 9 optical sets for 8 data links, since light emitters aren't so reliable presently. The mounting method on the printed circuits should allow servicing.

A 90mm overlap carries 160 signals in and 160 out per slit. A slit every 10mm*10mm is feasible.

Other designs are possible.

Marc Schaefer, aka Enthalpy
Enthalpy Posted February 8, 2015

Another design using cross boards has vessels (as for blood) made of printed circuit that gather the signals from the computing chips and scatter them back; they help the computing boards carry the many signals to the cross boards. They add circuit area and volume, and can be dense while the bigger computing boards don't have to be for their other goals; they can be perpendicular to the computing boards and parallel to the cross boards, which may help pass the signals. As an example, the W=5mm h=2.4mm vessels can have 12 signal and 13 ground layers to carry >112 links each. 4 vessels then carry all links for 2*16 nodes to and from 127 other computing boards. Note that the cross boards have short notches here, leaving room to spread and carry the links, while the computing boards have deep notches that may almost swallow the cross boards.

----------

To let the links jump between the vessels and the cross boards, the facing packages can have photodiodes and Vcsel (vertically emitting laser diodes). A kind of eyelashes, at one or both sides, can dim parasitic light between the links and stay soft enough. One package can be 5mm*14mm to have nearly 70 contact balls and carry 28 signals, so 4 packages per example vessel suffice. This leaves 2.5mm² to let each link jump.

----------

Alternately, capacitances can pass the links between the packages at the vessels and the cross boards. Just d=0.5mm electrodes facing one another at 0.5mm distance make 3.5fF for each link (and even less because ground tracks run at both packages between the electrodes of different links to reduce crosstalk): not much, but well enough for an integrated receiver at 20Gb/s situated <3mm away.

I prefer the capacitive coupling over the optical one: it's more reliable, it's cheaper, it saves power. With light or capacitances, very little silicon is required to process the links, so a small die in a bigger package that makes interconnections looks preferable. With light, the many emitters and receivers would be small, and to the sides of the processing silicon rather than on it; with capacitances, the package just provides additional electrodes, which a polymer or ceramic thin film can protect.

----------

If the emitter and receiver for capacitive coupling are close enough, their electrodes can be much smaller, so two bigger chips pass as many signals as the package contacts can, like >500. Then, the silicon chips (or GaN - something for signal processing) can carry the electrodes directly, for instance on their back side through vias if they're flip mounted, with a durable nitride protective layer.

Alignment of the emitting and receiving packages stays easy if the chips have more electrodes than they pass links and map the best electrodes to each link. This is done at power up in a very quiet environment, or rather dynamically. The transmitting chip can have fine electrode subdivisions in the N-S direction and the receiving one in the E-W to limit complexity. The undriven electrode subdivisions can serve as ground to reduce crosstalk.

The many-links packages don't fit the vessels so well, but can make massive connections between computing boards parallel to one another.

Light beams too could benefit from plethoric receivers (and maybe emitters) chosen dynamically to carry many links, over a distance with some optics. Though, it gets reasonable only if the processing electronics is on the optical chip.
Silicon Schottky maybe, or very dissimilar epitaxy.

Examples of computers are to come.

Marc Schaefer, aka Enthalpy
Enthalpy Posted February 21, 2015

If many links must pass from a board to a parallel one, for instance if using the previously suggested vessels and cross boards, short optical links are an option. A pair of light emitting and receiving modules may look like this:

For instance, 6mm*5mm modules can have 36 balls to pass 16 monodirectional links each. 17 or 18 light emitters and receivers carry the links between the boards, with signal chips that choose the healthy opto chips and drive them or process the signals. Future big opto chips may integrate several emitters or receivers; presently, GaAs photodiodes can integrate a preamplifier. The module's fine printed circuit board can interconnect the signal and opto chips, or a big signal chip carries the opto chips.

The opto chips have a wide divergence and angle of view. Lenses make the light parallel between the modules to tolerate misalignment, say 0.5mm; they can be molded after the opaque polymer. Soft lashes absorb stray light.

Without lenses, much power would achieve only a noisy signal. With lenses, one D=8µm Vcsel consumes 4mA*3V half of the time, so 13 links per node draw 80mW per computing node at each jump. At 10Gb/s and 850nm, one third of the emitted 0.75mWo hits the 0.5A/W photodiode to produce 0 or 12fC per bit. A low-power bipolar transimpedance amplifier can have 2nV/sqrt(Hz) noise, the photodiode 2pF: over 5GHz, the noise is 0.3fC. 20Gb/s still offers some margin.

Marc Schaefer, aka Enthalpy
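A quick back-of-envelope check of that optical link budget, as an assumed C sketch (numbers taken from the paragraph above; it lands on the 0/12fC signal and ~0.3fC noise quoted):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Optical link budget at 10Gb/s, 850nm, from the figures above. */
    double p_emit_w   = 0.75e-3;          /* Vcsel optical output                */
    double p_recv_w   = p_emit_w / 3.0;   /* one third reaches the photodiode    */
    double resp_a_w   = 0.5;              /* photodiode responsivity, A/W        */
    double bit_time_s = 1.0 / 10e9;       /* 10Gb/s                              */
    double q_signal   = p_recv_w * resp_a_w * bit_time_s;  /* charge per '1' bit */

    double e_noise    = 2e-9;             /* amplifier noise, V/sqrt(Hz)         */
    double c_pd       = 2e-12;            /* photodiode capacitance              */
    double bandwidth  = 5e9;              /* Hz                                  */
    double q_noise    = e_noise * c_pd * sqrt(bandwidth);  /* rough noise charge */

    printf("signal %.1f fC, noise %.2f fC, ratio %.0f\n",
           q_signal * 1e15, q_noise * 1e15, q_signal / q_noise);
    return 0;
}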
Enthalpy Posted February 22, 2015

To pass links from a board to another, I had also suggested capacitive transmission modules; here are sketches.

A transmitting or a receiving module has only small silicon or SiGe signal chips, cheap and power-saving, that live longer than light emitters. Four chips per 16 links make shorter and easier tracks. Thin polymer or ceramic should protect the electrodes.

At 0.5mm gap and 0.4mm offset, a 0/0.8V signal passes 1.2fC peak through roughly 3fF. The receiving pad's capacitance is 50fF and the transimpedance amplifier's 200fF, so 3nV/sqrt(Hz) and 10GHz bandwidth make 75aC noise: not a limit to the throughput. 50+200fF at the transmission pad cost 0.8mW per monodirectional link at 20Gb/s.

Marc Schaefer, aka Enthalpy
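The same kind of sanity check for the capacitive link, again as an assumed C sketch; it reproduces the ~75aC noise and ~0.8mW drive figures:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Capacitive link budget from the figures above. */
    double e_noise   = 3e-9;           /* amplifier noise, V/sqrt(Hz)            */
    double c_in      = 250e-15;        /* receiving pad 50fF + amplifier 200fF   */
    double bandwidth = 10e9;           /* Hz                                     */
    double q_noise   = e_noise * c_in * sqrt(bandwidth);   /* ~75 aC             */

    double v_swing   = 0.8;            /* 0/0.8V signal                          */
    double bit_rate  = 20e9;           /* b/s                                    */
    /* random data: a 0->1 transition on roughly one bit in four */
    double p_drive   = c_in * v_swing * v_swing * bit_rate / 4.0;

    printf("noise %.0f aC, drive %.1f mW per link\n", q_noise * 1e18, p_drive * 1e3);
    return 0;
}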
Enthalpy Posted February 22, 2015

I've suggested bigger chips in direct contact to pass many links by capacitance. A 2000-ball Bga package measures about 45mm*45mm to transmit 1000 signals - fewer and smaller is possible. 1cm² chips then offer ~200µm*200µm per link; a 20µm air gap makes 18fF while each pad and preamplifier load with 200fF if a 10µm oxide or polymer layer is purposely produced, so the signal is huge. A nitride layer would protect the electrodes well.

The boards are not aligned to <200µm: the chips have several electrodes per link, and the electronics chooses which ones are the best. This is done at boot time and maintained permanently; compensating mechanical movements is easy at many Gb/s. The misalignment and movements can exceed the size of a link.

If the position of a link on the chips is adjusted to, for instance, 1/5 of the link's size, the receiver could have one electrode per link and the transmitter 5*5; I prefer to have 5 electrodes per link at the transmitter and the receiver, which suffices if they're elongated and crossed as on the sketch.

Routing such modules would be difficult at the end of signal vessels; they fit better near the centers of two superimposed computing boards to connect them, or spread among the computing chips, or possibly as combined computing and connection chips.

Marc Schaefer, aka Enthalpy
Enthalpy Posted February 25, 2015

This example is a personal number-crunching machine with cross boards that fits on a desktop. It has the size of a PC, the dissipation of an electric heater, is air-cooled, and can connect to a PC host via one or two Pci-E cards running external cables.

Each computing chip bears one 64b float Cpu that multiplies-adds at 1GHz - for power efficiency - and 256MiB of Dram delivering 32GB/s. The 9mm*11.5mm Bga78 packages occupy a 30mm*20mm pitch.

Each vertical computing board comprises 16*4*2=128 computing chips. The boards are 500mm tall, 150mm long, spaced by 15mm. Less than 2 layers route the 160 vertical links, less than 1 the 64 horizontal ones. 1 ground plane and 1 supply plane distribute 64A*0.8V from the board's insertion edge, where the local converter receives 48Vdc for instance.

32 computing boards make a 500mm*500mm*250mm machine. The 4096 nodes dissipate 1700W; a mean 0.2m/s air flow through 500mm*500mm cools them.

One in and one out data vessels connect each row of 4*2 computing chips to their homologues at 5 neighbour computing boards. Each vessel carries 40 links in 5mm height and 4+ signal layers plus grounds and supplies. 3 capacitive modules pass data between each vessel and each cross board.

16 cross boards interconnect cleanly the 32 computing boards as on the 01 February 2015 sketch. A cross board passes 352 links at the densest section, so 60mm width suffices with 3+ signal layers plus ground and supply.

The cross boards could have computing boards on two or four sides, but the bigger machine heats too much for a desktop, and I prefer bigger computing boards then.

Total 8.2TFlops aren't much more than one Tesla Pci-E card (1.3TFlops on 64b) but the rest is:
- 131TB/s Dram instead of 0.25 feed the Cpus for easy algorithms;
- 1TiB Dram instead of 6GB meshes the Earth at 10 numbers per element with 200m pitch rather than 2.6km, or meshes an 80m*80m*40m airliner flow with 30mm pitch rather than 150mm. A disk per board, or a few flash chips per Cpu, can store slow data, say at each time step of the simulation;
- At 20Gb/s, the network transmits 4TB/s through any equator. Its latency is 6ns (6 Cpu cycles, less than the Dram) if the chips are good and it carries one word in 4ns more, one 512B Dram access in 262 cycles - so the scaled, indexed and butterfly access to Dram at a remote chip is interesting. And even if a node requested individual remote Dram floats randomly over one link, 32b in 15+6+2 cycles would feed two Macc per cycle for a 2048-point Fft.

Marc Schaefer, aka Enthalpy

============================================

The same box makes a small cute 1TB database machine - or an inference, Prolog or Lisp engine, or a web server. Though, this use emphasizes other capabilities: quick floating point is secondary, even integer multiplies aren't so frequent; snappy random access to the whole Dram is paramount; unpredicted branches must bring minimal penalty; logic operations occur often.

Different versions of the computing chips, meant as float crunchers or instead as database runners etc, can run the same executables but optimize the hardware differently: no 64b*64b float multiplier. A 1b*64b multiplier, or 4b*16b, takes 64 times less power. But cheap provision for exponent, denormalization, renormalization, conversion, division step is welcome. The 64b instructions run, just slower. We could have raised the clock, widened the words, have many scalar integer execution units...
But I prefer to keep 1GHz, shorten the pipeline for quick branches, and have 16 cores per chip (not 64, as the sequencer and registers draw power). The 256MiB Dram cut in 16 must be ~16 times faster, or 2ns access. The access and cache width shrink to 8 words. A tiny Bga98 package limits to 2 links in each direction between two chips, shared among the cores, easy for the boards. Bigger packages would occupy more area to no benefit. Fortunately, a database doesn't need much network throughput.

Faster logic operations cost little hardware and opcodes, which are useful on any machine: several logic registers, destination tables, compare-and-branch, compare-and-store, compare-and-combine, compare-combine-and-branch... µops fusion achieves it anyway, but opcodes that simplify the sequencer are welcome.

One cycle per logic operation is a perfect waste. Programmable logic could combine tens of bits per cycle to filter database items; though, its configuration takes time even if loaded from the Dram, and this at each context switch, not so good. As a compromise, I suggest a cascade logic operation from an implicit source register to an implicit destination register working stackwise: And, Or, Eq, Gt, Lt and their complements, Push, Swap, DuplicateOrNop - coded on 4 bits, the source bit position on 6 bits, this 5 times in a 64b instruction. The operations and source bits can also be read from a register, and the source and destination registers can be more flexible. Feel free to execute several such adjacent instructions at once. Useful on many machines.

Some string support wouldn't hurt any machine, especially the search for a variable number of bytes in a register at any byte shift. A fast stack for Lisp is nice everywhere.

In the database or AI version, this micro machine offers 65*10^6 Mips and 320*10^12 logic operations per second. It could dumb-parse the 1TB in 2ms, but indexing will naturally gain a lot.

Marc Schaefer, aka Enthalpy
Enthalpy Posted February 26, 2015

Maybe the new cascaded logic operation deserves an illustrative example. Imagine a seller proposes many graphic cards memorized in a database where some attributes are stored as bit fields:

enum {
  Ddr=0x0, Ddr2=0x1, Ddr3=0x2, Ddr5=0x3, Gddr3=0x4, Gddr4=0x5, Gddr5=0x6,
  Amd=0x0, nVidia=0x8,
  Used=0x00, Refurbished=0x10, NewWithoutBox=0x20, NewInBox=0x30,
  Agp=0x00, PciE=0x40,
  OnlyMonitor=0x000, VideoIn=0x080, VideoOut=0x100, VideoInOut=0x180,
  NoLed=0x000, GreenLed=0x200, RedLed=0x400, YellowLed=0x600,
  Cash=0x0000, CreditCard=0x0800, Transfer=0x1000, Cheque=0x1800
};

and a customer seeks only such items:
only Gddr5 => bits 1 and 2
only nVidia => bit 3
new or refurbished => bits 4 or 5
only PciE => bit 6
video in and out => bits 7 and 8
led colour => don't care
credit card or transfer => bits 11 xor 12

Then the database runtime reads to register 2 the current item's attributes, fills register 5 with the source bit positions (coded on 6 bits each)
1 2 3 4 5 6 7 8 11 12
fills register 8 with the operations (coded on 4 bits each)
push and and push or chain-and and and and push xor chain-and nop nop
then performs
CascLog R2, R5, R8, R13
and checks in R13 if the current item satisfies the customer's wish.

This differs slightly from the previously suggested operation with immediate bit positions and operations. It offers runtime flexibility (to adapt to the customer's wish here), operates on 10 bits at once, and can leave several result bits for following instructions.

Marc Schaefer, aka Enthalpy
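To make the semantics concrete, here is a minimal C sketch of one plausible software emulation of CascLog. The field layout (4-bit op plus 6-bit source bit position), the op numbering, and the convention that chain-and pops and combines the top two stack bits without consuming a source position are my assumptions for illustration, not a fixed specification:

#include <stdint.h>
#include <stdio.h>

/* Assumed op codes, 4 bits each, packed LSB-first in the "ops" register. */
enum { NOP = 0, PUSH = 1, AND = 2, OR = 3, XOR = 4, CHAIN_AND = 5, CHAIN_OR = 6 };

/* Emulates CascLog R_src, R_pos, R_ops -> returned result.
   Assumptions: up to ten 6-bit bit positions packed LSB-first in pos;
   ops that consume a source bit (PUSH, AND, OR, XOR) advance the position
   stream, CHAIN_* and NOP don't; the result is the stack top.
   Malformed programs (stack underflow) are not checked.                  */
uint64_t casclog(uint64_t src, uint64_t pos, uint64_t ops)
{
    int stack[17] = {0}, top = 0, p = 0;

    for (int i = 0; i < 16; i++) {
        unsigned op = (unsigned)(ops >> (4 * i)) & 0xF;
        int b = 0;
        if ((op == PUSH || op == AND || op == OR || op == XOR) && p < 10) {
            b = (int)((src >> ((pos >> (6 * p)) & 0x3F)) & 1);
            p++;
        }
        switch (op) {
        case PUSH:      stack[++top]    = b;                  break;
        case AND:       stack[top]     &= b;                  break;
        case OR:        stack[top]     |= b;                  break;
        case XOR:       stack[top]     ^= b;                  break;
        case CHAIN_AND: stack[top - 1] &= stack[top]; top--;  break;
        case CHAIN_OR:  stack[top - 1] |= stack[top]; top--;  break;
        default:        break;   /* NOP and unused codes */
        }
    }
    return (uint64_t)stack[top];
}

int main(void)
{
    /* The customer's query above: bits 1 2 3 4 5 6 7 8 11 12, ops
       push and and push or chain-and and and and push xor chain-and. */
    int      bits[]   = { 1, 2, 3, 4, 5, 6, 7, 8, 11, 12 };
    unsigned oplist[] = { PUSH, AND, AND, PUSH, OR, CHAIN_AND,
                          AND, AND, AND, PUSH, XOR, CHAIN_AND };
    uint64_t pos = 0, ops = 0;
    for (int i = 0; i < 10; i++) pos |= (uint64_t)bits[i]   << (6 * i);
    for (int i = 0; i < 12; i++) ops |= (uint64_t)oplist[i] << (4 * i);

    /* A matching item: Gddr5, nVidia, NewInBox, PciE, VideoInOut, GreenLed, CreditCard */
    uint64_t item = 0x6 | 0x8 | 0x30 | 0x40 | 0x180 | 0x200 | 0x0800;
    printf("item matches: %llu\n", (unsigned long long)casclog(item, pos, ops));
    return 0;
}

In hardware the same dataflow would be a short chain of multiplexers and gates, so the whole cascade can fit in one cycle; the C loop only mimics it bit by bit.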
Enthalpy Posted March 1, 2015

Here's the reasonably biggest hypercube routed by printed circuit boards without cables. At half a million 2GHz single-core chips, it consumes 0.9MW, offers 128TB and 34PB/s Dram, 1PB/s network through any equator, and 2TFlops on 64b - a database machine with the 16 cores per chip and the 10-bit cascaded logic instruction would offer 167*10^15 logic operations per second.

It has at most four cabinets of 64 vertical computing boards. Each computing board carries 64*16 computing chips per side. Each cabinet has 64 horizontal cross boards to interconnect its own computing boards and connect with the neighbour cabinet.

The 256MB and 64b float chips measure 14mm*9mm as a Bga98 and have 19 link pairs. At 18mm*18mm pitch, they make a computing board slightly over 1m high and 0.55m long (0.16m interleaving, 0.29m computing, rest power and fluids). Vertical links take 3 board layers of 2 lines/mm plus repeaters every 0.25m for 20Gb/s, horizontal ones take <1 layer.

The alloyed copper cooling pipes distribute the current as well: 1V*51A per two-sided row, 3.3kW per board received as 48Vdc.

The data vessels serve adjacent rows of computing chips, with 1 input and 1 output vessel for 2*16 chips connected to 8 computing chips that are on neighbour boards on the hypercube. That's 256 links per 4mm*7mm vessel with ~20 signal layers plus ground and supply planes. 16 capacitive chips per vessel connect with the cross boards.

Each cross board carries 2688 links within a cabinet. Using 300mm width and 1 line/mm (for 2 repeaters per metre at 20Gb/s) this takes 9+ signal layers and as many grounds. It also carries 4096 links to the homologue cross board in the cabinets at its right and left, which takes 3+3 signal layers and 3 grounds; this sketch should follow. Capacitive chips interconnect the cross boards between the cabinets; their packages can be bigger at the cross boards to ease routing. The cross boards also are 1m+ big.

The worst latency between hypercube neighbours is 13ns, similar to the Dram. Through the worst hypercube diagonal, the latency is ~60ns without collisions: as much as transmitting 120 bytes.

Marc Schaefer, aka Enthalpy
Enthalpy Posted March 2, 2015

And this is the sketch of the machine. At 2PFlops (not TFlops! My mistake) it isn't the world's fastest (55PFlops in 4Q2014), but it's cleanly routed without cables, cute and compact, occupying just 2.2m*2.2m.

If all sectors shared the cross boards, servicing one would mean extracting all computing boards first. I prefer these interleaved cross boards, where the cabinets hosting the computing boards can be separated. Strong adjustable struts hold the cross boards and cabinets at the proper distance.

Marc Schaefer, aka Enthalpy
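A quick consistency check of the counts and rates claimed in the previous two posts, as an assumed C sketch (the ~64GB/s Dram per chip at 2GHz is my reading of the earlier 512B-per-access figure); it lands on the 0.5M chips, 128TB, ~34PB/s and ~2PFlops quoted:

#include <stdio.h>

int main(void)
{
    /* Big cable-less hypercube of the 01 March 2015 post. */
    double chips   = 4.0 * 64 * 2 * 64 * 16;    /* cabinets * boards * sides * 64*16 chips per side */
    double dram_b  = chips * 256.0 * (1 << 20); /* 256MiB per chip                                  */
    double dram_bs = chips * 64e9;              /* assumed ~64GB/s per chip at 2GHz                 */
    double flops   = chips * 2 * 2e9;           /* one 64b mul-add per cycle at 2GHz                */

    printf("%.0f chips, %.0f TiB Dram, %.0f PB/s Dram, %.1f PFlops\n",
           chips, dram_b / (1ULL << 40), dram_bs / 1e15, flops / 1e15);
    return 0;
}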
Enthalpy Posted March 3, 2015

Heavy numerical tasks are well-known, so I've sought instead examples of databases to compare with the machines' capability.

The worldwide eBay seems big: 200M items for sale, each one visited ~5 times a week, and users make supposedly 5 searches before viewing 5 items. That would make 1700 searches per second. Items have <200 bit properties and small attributes, plus a few words like price and dates, a 50-byte title, a description text seldom searched estimated at 500 bytes long as a mean, and images not searchable hence put on disks.

The items base occupies 10GB (bits and numbers) plus 10GB (titles) plus 100GB (descriptions) - plus some disks.

The PC-sized micromachine hosts all searchable data in its 1TB Dram - some can even be replicated so not every chip must receive every request. Even if the program compared each query with each item, the 16 cores * 4096 chips would still have 200 cycles per comparison. That's more than it takes, because the cores check 10 bits in one cycle and can rule out most items after one cycle. Not even the 7-word title takes 200 cycles of string instructions to parse and most often discard early.

Then, precompute index tables to go faster, sure, especially for the description texts.

Amazon and Alibaba must have similar search volumes, other sale sites and banks smaller ones. Puzzling idea, all of eBay fitting in one single small box.
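The search-rate and cycle-budget arithmetic above, spelled out as an assumed C sketch; it reproduces the ~1700 searches/s and ~200 cycles available per brute-force item comparison:

#include <stdio.h>

int main(void)
{
    double items       = 200e6;                             /* items for sale                */
    double visits_s    = items * 5.0 / (7 * 24 * 3600);     /* ~5 visits per item per week   */
    double searches_s  = visits_s;                          /* ~5 searches per 5 items viewed */

    double cores       = 16.0 * 4096;                       /* database micromachine         */
    double cycles_s    = cores * 1e9;                       /* 1GHz                          */
    double cyc_per_cmp = cycles_s / (searches_s * items);   /* brute-force budget            */

    printf("%.0f searches/s, %.0f cycles per item comparison\n", searches_s, cyc_per_cmp);
    return 0;
}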
Enthalpy Posted May 1, 2015

For scientific computing, mass memory is easy. Tianhe-2 has only 9 bytes of disks per byte of Dram. For the machine sketched here on 03 March 2015, this equals one 4TB mechanical disk per computing board. As one Ethernet link per disk suffices, a separate file server can provide Raid functions, route the requests to relieve the main network, provide peak throughput, broadcast the executable - all from commercial technology and hardware. Good.

But since it takes 0.5h to read 1/9 of a disk, the file server can't host a paging file. Computing nodes don't access it simultaneously, but the speed discrepancy is just too big. If simulating a galactic collision, the disks can store the successive collision epochs, but the Dram must hold all the data between two epochs.

----------

Also, a database needs bigger disks. The micromachine suggested here on 03 March 2015 hosts in its 1TB Dram all attributes, titles and text from eBay's current items, but five 200kB pictures for every fifth item take 40TB or 40 times the Dram, while expired items may need slow 2PB. Google hosts 30T web pages of each 500kB (?) including the images, that's 5M disks possibly split over several sites; a page must be 10-50 times bigger than the searchable text, suggesting the disk/Dram ratio.

Disk data must be the biggest throughput in a database machine, so the main network shall carry it and the computing chips shall run the storage work too, then with each disk connected to a computing board. Each computing board of the machine sketched here on 03 March 2015 has 500GB Dram, so 20TB or 5 mechanical disks per board seems a good ratio.

Ethernet links to a remote box are still a clean option; alternately, blades can hold the disks in front of the boards (for two boards if 3.5" disks) and be toppled to access the boards. Sketched here with two disks; room suffices for the ten.

For deeper pockets, Flash disks are faster. One blade per board can host up to forty 500GB Ssd in four rows, again 20TB mass storage for 0.5TB Dram - other uses like paging accept less. At this scale, all the flash chips for one blade can reside on a board, rather than be packaged in disks, which also makes their organization more flexible; one (or a few) multilane cable is also more convenient between the computing and mass storage boards and permits more lanes.

----------

Flash chips on the computing boards are an option for less populated boards but only keep the performance. One or several flash and computing chips in one package, possibly stacked, can save room and accelerate the mass memory. Flash chips presently store 128GB in an 18mm*14mm Bga, so 16GB (Dram*64) would leave room for more parallelism. 160MB/s and 0.3ms would bear a useful paging file, while the 0.5M chips machine would offer 8PB at 80TB/s (32PB with 64GB chips) to a database, beating remote disks. If the chips optimized for a database have 16 simpler nodes, stacked chips must enable one Flash access per node.

Marc Schaefer, aka Enthalpy
Enthalpy Posted June 13, 2015

I proposed on 26 February 2015 machines for databases, artificial intelligence and more, with more Cpu per chip obtained by accepting a slower float multiplication, and a cascaded logic instruction.

Since the instruction decoder-scheduler consumes more than the cascaded logic, we can put several cascaded logic units per Cpu. Existing Cpu designs have several Alu to execute more computations per cycle; that will be good for cascaded logic too.

A database search routine could then process in parallel, through its data subset, several queries with varied logic operations. This accepts the same Dram throughput, and with each simultaneously treated query giving one instruction in a loop, it's still easy for the programmer and the compiler-optimizer.

Each treated query needing two own source registers (telling which bits and what operations) and one result register, a database-oriented Cpu could treat some 8 cascaded logic operations per cycle, 5*10^15 logic operations per second from the micromachine, and 1.3*10^18 from the 2m*2m machine. Good for cryptanalysis too.

Could we exaggerate further? It has been done in the past, with many processing units treating all queries in parallel from the common flow from the dataset. It wouldn't be a universal design, though: the configuration and context switching need much more information. By contrast, a Cpu with 8 cascaded logic units is still interoperable with a float cruncher - just better at different tasks.

----------

After suggesting to integrate an "existing Risc Cpu" with the Dram chips, I read "Mips" on many occasions... I have absolutely nothing against a Mips64 core, provided it accepts some extensions. The cascaded logic is so useful, and the demand for data servers so obvious, that this instruction belongs in any Cpu (hi there).

Marc Schaefer, aka Enthalpy
Enthalpy Posted September 13, 2015

The US government issued an order in August 2015 creating the National Strategic Computing Initiative to build an exaflops computer
https://www.whitehouse.gov/the-press-office/2015/07/29/executive-order-creating-national-strategic-computing-initiative
and other groups (notably in India) already had such ambitions, so here are my two cents of thoughts.

----------

Tianhe-2 leads for abnormally long at 55PFlops, but my equivalent already consumes a power I'd call unreasonable: 33MW.

Newer silicon processes save little power. For the same 4 cores of 256b Avx drawing 65W, Haswell at 22nm offered 3.2GHz, Broadwell at 14nm 3.3GHz and Skylake at 14nm 3.4GHz
https://en.wikipedia.org/wiki/Skylake_(microarchitecture)
Though, the integrated graphics coprocessor may conceal a power improvement at the cores. Let's await the Avx512.

Many more nodes running slower must bring some efficiency improvement. 3.2GHz and 8M nodes achieve 55PFlops as I suggested here on 07 December 2013; slowing to 800MHz improves the Flops/Watt by ~5.6 but an even slower clock brings little. 1EFlops then demands 100MW and 600M nodes, where dissipation permits several nodes per chip.

Scientific computing demands 64b floats, but data processing often accepts 32b: seismic, images... 64b need much power, or a slower clock, to propagate the carry. Provided that the subtle sequencer doesn't draw too much, 1-cycle 32b and 4-cycle 64b can be considered for a less specialized machine, since this permits more MHz and nodes per watt and chip.

----------

The Dram density too has almost stalled at 512MB/chip. Present improvements stack several chips to save footprint area, and do a bit more than cumulate the old throughput per chip and pin.

The previous 40MB/GFlops were decent. If nodes are slower but a package contains many more within the same dissipation, a way to keep the MB/GFlops is to stack chips in the package.

One reason to keep many MB per node, in addition to the application, is that the fully symmetrical machines described here, which are easier to program and more efficient on more varied applications like databases and AI, must run on every node at least a part of the OS to switch between the tasks. Win95 ran on 8MB, but software has progressed so much.

More people consider that present heterogeneous machines are difficult to program:
https://en.wikipedia.org/wiki/Tianhe-2 section "criticism"
explaining why I cling to identical scalar nodes, Cpu+Dram chips, and a uniform network.

Flash memory still progresses in size and throughput. Stacking some in the package gets more important with less Dram per node, as paging and swapping become more necessary and efficient.

----------

With slower Cpu clocks and smaller on-chip Dram per node, the Dram latency takes fewer cycles. Soon it will be below one cycle, and then a Complex Instruction Set architecture becomes good again, like the Vax-11 that offered three elaborate operand addresses per operation. For databases and often scientific computing, most data throughput is from and to the Dram anyway - hence my quest for full throughput.
While present sequencers manage to overlap the computations with the Dram accesses well and flexibly, Cisc instructions can simplify this aspect of the sequencer and reduce the code size.

----------

The very efficient hypercube network I described on 07 December 2013 can't expand; printed circuits like on 03 March 2015 only alleviate the cables' burden on a big machine.

It's legitimate to share the same network throughput among N times more nodes that would be N times slower... but here we'll have N^2.2 times more nodes for the same consumption, and even more nodes with more consumption. So, unless someone finds an even more compact network - fibres and present connectors are not! - some compromise must be made, gasp.

That is, several nodes must share a smaller group of links. Or the network must be less dense than a hypercube. Or both.

The good news is that the 2.5PB/s network I described on 07 December 2013 is 100x better than existing supercomputers
https://wiki.ci.uchicago.edu/pub/Beagle/SystemSpecs/Gemini_whitepaper.pdf

Marc Schaefer, aka Enthalpy
Enthalpy Posted September 15, 2015

In January 2015 I suggested a video card and a derived Pci-E data cruncher with a hypercube connecting the chips that combine Dram and Cpu. A different network looks better: simpler, more scalable.

The chips fit in smaller 170-ball grid arrays like Gddr5 does. This permits ~35 links in, 35 out, and the package contains 256MB of Dram split among, as an example, 64 scalar 32b Cpu at ~1GHz. The video card hosts 36 packages totalling 2304 Cpu. Each package has one link to and from each other package.

Now all Cpu in a package must share the links, but this is globally better. The previous hypercube had 512 pairs of lines through any cube equator. Now there are 18^2=324 pairs through any equator, but a message occupies one line instead of a mean 2 successive ones for the 4-D cube: the throughput improves. The latency improves and is better predictable. The packages are cheaper and easier to cool. At 7W each, they can populate both board sides.

Say ¡Adios! to the luminous routing clarity of the hypercube. This network is a holy mess. Routing takes less than one signal layer per package, oh good. But how much less? Putting the packages on a hexagonal pattern and routing in six directions can help. The packages could be hexagonal too, as well as their ball pattern. The chips must allocate the lines to the balls dynamically to help routing.

4MB Dram per Cpu have 1 cycle latency at 1GHz or nearly. Then, a 5*32b wide access feeds the Cpu well. Decent 64b Cpu capability (four cycles) would qualify the card for general data processing.

A Pci-E number cruncher would put only four 64b Cpu at 800MHz among 256MB Dram on a 1W chip. Stacking 8 computing chips and a 128GB Flash in a package brings, for 36 packages on a board, 1800GFlops, 74GB and 26TB/s 5-cycle Dram, 4TB Flash, for 290W.

A Pci-E database and AI engine would carry per chip sixty-four 64b Cpu at 1GHz with the cascaded logic instruction but multi-cycle integer multiply and float operations. The 256GB Flash chip stacks with 8 computing chips in a package. 18Tips, 180T logic/s, 74GB and 590TB/s 1-cycle Dram, 9TB Flash.

Marc Schaefer, aka Enthalpy
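The port and equator counts of that full mesh, as an assumed C sketch (it only restates the counting above, following the post's reasoning that one-hop messages offset the lower line count):

#include <stdio.h>

int main(void)
{
    int packages = 36;

    /* Full mesh: every package has one link to and from each other one,
       fitting the ~35+35 i/o of a 170-ball Bga.                          */
    int links_out = packages - 1;
    /* Cut the board into two halves of 18 packages: 18*18 link pairs cross. */
    int equator   = (packages / 2) * (packages / 2);

    printf("%d links out per package, %d pairs through the equator\n",
           links_out, equator);
    printf("hypercube of January 2015: 512 pairs, but ~2 hops per message\n");
    return 0;
}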
Enthalpy Posted September 19, 2015

One known network is the hypertorus, less dense than the hypercube hence enabling more nodes
http://en.wikipedia.org/wiki/Hypertoroid#n-dimensional_torus
In a ring of R nodes, each node connects only to its two neighbours, but the hypertorus has d dimensions, so it offers shorter and redundant paths to a network of R^d nodes: R=4 rings make a hypercube, with just nodes and dimensions called differently, and R=2 doubles each link. R=3 would be denser and shorter than a hypercube where useful, while R>4 saves on cabling but worsens the latency and throughput.

A reasonable ring length R doesn't slow much, since the hypertorus has fewer dimensions, so a message crosses a moderate mean distance, but with longer rings the log(R) gain in dimensions brings little. Though, many computers use quite long rings to save on cabling; they may have 30*30 packages per board, 30 boards in a cabinet and 30*30 cabinets, for 30^5 = 24M nodes. Intelligible and clear cabling, but the latency increases, and even with several cables between the cabinets the throughput drops - in the previous example, a true hypertorus demands 30^3 cables between cabinet pairs, which is never done. (The dimension d is not rounded here. 20 million is arbitrary anyway.)

The physical propagation delay speaks for the hypertorus. At chip size, diffusion slows the signals as L^2, so regenerative amplifiers accelerate the propagation, and one per node is often good. At supercomputer size, 2m electrical or optical cables contain 100 bits on their way, so a short stop at all neighbours doesn't slow a message much. So long-haul cables aren't quite useful at these sizes... seemingly! Because the "short stop" needs to forward the bits as they arrive, while hypertori tend to receive and store a whole long message before forwarding it, yuk - and because collisions slow all messages.

----------

Less banal about hypertori

Almost identical to a hypertorus: put all R^d nodes on a big circle, connect each node with the nearest neighbour and also to nodes R^1, R^2... positions away for bigger jumps. Seducing? Well, 30,000 boards spaced by 20mm take D=200m, so folding in every dimension has merits.

At short rings, odd R lengths are slightly better as they avoid the farthest element. A radix-5 torus is almost as short as a cube.

The number of ports is often the limit at a package, a board... Then unidirectional rings take half as many ports, which lets cut the ring length R to its square root by doubling the number of dimensions d - for instance two unidirectional rings of 10 nodes take a mean 10 hub delays while one bidirectional ring of 100 nodes takes 25. This is a possible upgrade to some existing machines, just with longer cables.

Fitting complete torus dimensions in natural groups is simple: one dimension in the package, two more on the board, one more in the cabinet, two more in the room... But cutting instead each dimension saves ports! In this sketch with 8 nodes in a group (extrapolate to 97 in your mind):
- A 8*1*1 line would take 34 ports, or 32 if the 8 elements make a ring;
- A 4*2*1 surface would take 28 ports;
- A 2*2*2 volume takes 24 ports.
A collective failure (cable cut, cabinet uncooled) has different consequences but the hypertorus circumvents them too.

Marc Schaefer, aka Enthalpy
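The port counts in that last list can be checked mechanically; here is a small assumed C sketch that counts the links leaving a group of a*b*c nodes embedded in a larger 3D bidirectional torus, reproducing 34, 28 and 24:

#include <stdio.h>

/* Links leaving a group of a*b*c nodes embedded in a larger 3D torus.
   Each node needs two links per dimension; links between group members
   stay internal.  A dimension that closes on itself inside the group
   (a_closed != 0) needs no external link along a.                       */
int external_ports(int a, int b, int c, int a_closed)
{
    int pa = a_closed ? 0 : 2 * b * c;   /* links leaving along the a dimension */
    int pb = 2 * a * c;                  /* along b */
    int pc = 2 * a * b;                  /* along c */
    return pa + pb + pc;
}

int main(void)
{
    printf("8*1*1 line:    %d ports (%d as a closed ring)\n",
           external_ports(8, 1, 1, 0), external_ports(8, 1, 1, 1));
    printf("4*2*1 surface: %d ports\n", external_ports(4, 2, 1, 0));
    printf("2*2*2 volume:  %d ports\n", external_ports(2, 2, 2, 0));
    return 0;
}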
Enthalpy Posted September 26, 2015

Another known network is the multidimensional crossbar. When a full crossbar is impractical with thousands or millions of nodes, one can split them into dimensions which are themselves fully connected. This illustrates just 3 dimensions of 4 packages each.

Every computing package connects to all others along every dimension. For instance, a 170-ball grid array permits some 36 links out and 36 in, so 2 dimensions of 18+1 elements make a network of 361 packages, 9 dimensions of 4+1 elements enable 2M packages, and 36 dimensions are the 69G hypercube.

Or separate switch matrix components, boards and cabinets can connect the dimensions. This reduces the number of dimensions, hence the latency and the collisions at a big network. Just a 170-Bga package can route 36+0 computing packages along one dimension, so 5 dimensions suffice for 20M packages compared with >12 dimensions needed by direct links between the packages - and with matrices, the computing packages can have just 5 links out and 5 in. The matrices also keep the useful throughput, because each message passes through fewer dimensions.

The computing packages are more flexible if they can communicate both with switch matrix components and among themselves.

The multidimensional crossbar offers multipath routing and redundancy. It's conceptually simple. Boards and cabinets may contain complete subsets along fewer dimensions, and then the remaining dimensions among the boards or cabinets only connect homologue computing packages.

----------

The crossbar wins the comparison with the torus cleanly. At identical network dimensions and i/o counts as the bidirectional torus, the crossbar has 2 parallel links out and 2 in per package and dimension, a shorter latency, more throughput. If the switch components have big packages and as long as the board connectors suffice, each dimension can even grow to 500 members without drawback, so fewer dimensions reduce the latency.

As compared with a cube, the crossbar reduces the latency. The throughput is less obvious since each package has fewer dimensions to send messages in, but since the mean message crosses proportionally fewer packages, the crossbar is as good - or better, if fewer dimensions permit parallel links.

----------

5 dimensions or fewer route the biggest machine, taking only 20 balls at the computing packages. So what can the other balls do usefully?

Implicitly, I've arranged the computing packages as an elementary cubic crystal here. It could have been a body-centered cubic or a face-centered cubic, with nearest neighbours in 4 or 6 directions for three dimensions instead of 3 directions for the elementary cube, and the added links would reduce the distances, not just increase the throughput. But... rectangular boards, brick cabinets and well-used switch components need to fold all directions. And then, who can imagine a face-centered cubic lattice in 5 dimensions and predict the number of connector contacts?

Less ambitious: keep the elementary cube, add switch matrices in every <11111> direction. There are 16 unoriented ones (to be folded): easy for the packages, hard for the board connectors.

My feeling is that reliable engineering achieves only elementary cubes, with extra i/o just put in parallel along the <10000> directions.
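A tiny assumed C sketch of that reach trade-off: with a budget of ~36 links out per package, d dimensions of k fully connected neighbours each reach (k+1)^d packages directly, and separate 36-way matrices stretch a dimension to 36 members for a single link pair:

#include <math.h>
#include <stdio.h>

/* Packages reachable with direct links only: d dimensions, each a full
   crossbar of k+1 packages, using d*k of the ~36 links per package.    */
static double reach(int d, int k) { return pow(k + 1, d); }

int main(void)
{
    printf(" 2 dims of 18+1: %.3g packages\n", reach(2, 18));   /* 361  */
    printf(" 9 dims of  4+1: %.3g packages\n", reach(9, 4));    /* ~2M  */
    printf("36 dims of  1+1: %.3g packages\n", reach(36, 1));   /* ~69G */

    /* With separate 36-way switch matrices, one link pair per dimension
       suffices and each dimension holds 36 computing packages.         */
    printf(" 5 dims through matrices: %.3g packages\n", pow(36, 5)); /* >20M */
    return 0;
}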
Enthalpy Posted October 10, 2015

Here are outlines of a Pci-e video card with crossbars.

For 500W, the board runs 6000 nodes at 1GHz to achieve 12TFlops single precision, as resulting from a 28nm fabrication process, and scales up easily with a power-saving process.

The 80 computing chips each contain 75 nodes. Nodes are complete processors with a 32b float Macc and Alu, sequencer, registers, single-level cache, and 6MB Dram. I expect a single-cycle latency to the individual Dram, so four 32b transfers with the cache feed the Macc over all the Dram, and five 32b wide banks avoid collisions; increase that if the latency is worse. The 6000 Drams cumulate 96TB/s, 200x the present video cards.

Each chip has four 1GB/s links in and four out (shared by the 75 nodes over a switch) to four crossbar chips operating in parallel, each with eighty inputs and eighty outputs for as many computing chips. The crossbar 31mm*31mm ball-grid arrays have about 500 contacts; just two chips with 160 in and out would be possible. Through any equator, the network transports 320GB/s unidirectional peak, as much as the present Gddr throughput on video cards.

Both board faces support crossbar chips and computing chips. Individual 6W at the thin face flow through a thin plate and metal pillars to the heat sink at the thick face. Routing takes an estimated 6 signal layers and 3 supply planes. The Ramdac, Bios, Pci-E lanes are not sketched and could be the crossbars' function.

Supplying 500W around 1V needs an excellent layout, many phases, and good components. From the horizontal central line as on the sketch, 70µm supply planes drop only 15mV, while a custom 12V busbar lets put the tall electrolytic capacitors away from the heat sink.

While the cumulated 36GB Dram exceed present video cards and offer the proper throughput, individual 6MB may not always contain the wanted piece of data, even if duplicating it. But with a mean 50MB/s network throughput available per node, or 75 cycles per 32b word, fetching the data is reasonable. If an algorithm transfers 3*N words over the network to compute N^2 Mul-acc, then N=225 chunks keep the node busy and take 3kB in the Dram.

Presently the Dram throughput defines a video card's capability, at least when I run dX9 games on a dX11 card, and the architecture I describe improves that. But would existing dX9 and dX11 games benefit from it too, that is, will the driver bridge the architecture difference effectively?

The compute power (30* an eight-core Avx256), Ram and network throughput (1000* and 3* a four-channel Ddr4), and the simple multitask programming model make such a card highly desirable for general computing too, so decent Alu and 64b capabilities are worth it.

Marc Schaefer, aka Enthalpy
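The chunk-size argument in the second-to-last technical paragraph, as a hypothetical C sketch: with ~75 cycles of network cost per word and 3*N words moved to compute N^2 mul-acc, the break-even chunk is N ≈ 225 and occupies ~3kB of the 6MB Dram:

#include <stdio.h>

int main(void)
{
    double net_node_bs  = 320e9 / 6000;               /* ~53 MB/s of network per node     */
    double cyc_per_word = 1e9 / (net_node_bs / 4);    /* 32b words, 1GHz node: ~75 cycles */

    /* An algorithm moving 3*N words to compute N*N mul-acc (1 per cycle)
       keeps the node busy when N*N >= 3*N*cyc_per_word.                  */
    double n_breakeven = 3 * cyc_per_word;
    double chunk_bytes = 3 * n_breakeven * 4;

    printf("%.0f cycles/word, break-even N = %.0f, chunk = %.1f kB\n",
           cyc_per_word, n_breakeven, chunk_bytes / 1e3);
    return 0;
}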
Enthalpy Posted October 11, 2015

A Pci-E board can also host a number cruncher or a database machine. More chips give them Dram capacity. Daughter boards resembling widened So-dimm modules carry the many chips and plug into the outrageously thick side of the Pci-E board. A two-dimensional crossbar network, where only one dimension leaves the daughter modules, reduces the number of contacts and brings a good throughput. Details should come.

Marc Schaefer, aka Enthalpy
Enthalpy Posted October 17, 2015

Oops: the Pci-E standard allows 375W per board with two additional connectors, not 500W. This limits the Pci-e graphics card to 70 nodes per chip at 1GHz giving 11 TFlops on 32b - or the power supply must offer more cables and current. Also, the network of the Pci-e graphics card transports only 160GB/s between the two halves of the board.
Enthalpy Posted October 18, 2015

Here's a number cruncher on a Pci-e board. Differing from 25 January 2015, it has a 2D-crossbar network and more nodes at a lower frequency.

375W permit 1056 nodes at 750MHz for 1.6TFlops on 64b, and more as processes improve. A more general machine could instead have had 32b nodes at 1300MHz that take several cycles on 64b floats. Alternately, signal processing would enjoy a 32b+32b complex mul-acc at 750MHz, which reuses about the same hardware and power as the 64b real mul-acc: winning combination? The cores must be universal, and existing simple ones like the Mips or Arm are seducing.

3 nodes fit in one computing chip and package, with exclusive 170MB Dram per node, 180GiB in total. This, and simple multitask parallelism, makes this board easier to program and more flexible than competing solutions; stack three single-node chips in a package if it doesn't suffice. An estimated 15ns = 12 cycles latency lets 48-word wide accesses deliver the full mul-acc throughput over a complete Dram sweep; 53 banks of 64b avoid collisions in the scaled and butterfly datamoves described here on 19 November 2013, and cumulate 25TB/s over a board.

The network is just a 2D crossbar of 33*32 nodes, but room and routing let assemble them less naturally. One 33x33 matrix chip and 11 computing chips of 3 nodes each make a group of 33 nodes. 4 groups fit on a module, and the Pci-e board holds up to 8 modules, feeds them and connects them to 33 matrix chips there used as 32x32.

The module connector passes 132 links in and 132 out to the 4 groups, needing some 560 contacts. It's 2.2* as wide as a 260-contact So-dimm and shall pass 10Gb/s per link, less than the wider Pci-e 3.0. If this combination of speed and density were impossible, I'd instead stack computing chips in packages sitting on the front side of the Pci-e board.

Each module routes 132 links side-by-side within the independent four groups and some more links towards the connector. This needs 4 signal layers plus two supply layers over the available 30mm. Trying to ease routing, the matrix chips have 4+4 rows of 23 balls (not all drawn) in a long package and the computing chips have the balls at the short sides.

The Pci-e board passes ~1000 links side-by-side between its matrices and the modules. This needs 4 signal layers plus two supply layers over 150mm. Modules oriented as drawn ease it.

The network passes 530GB/s peak unidirectional: as much as the best competitor's Ram, with less latency than a hypercube and on an easier Pcb.

Each module receives 47W through extra side connectors, probably with screws - or feed it through the module connector if busbars fit at the Pci-e board or in the connectors. A common clock with phases staggered among all computing nodes would make the power quieter.

Marc Schaefer, aka Enthalpy
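A quick tally of the board above, as an assumed C sketch; it reproduces the 1056 nodes, ~1.6TFlops, ~180GB and ~25TB/s figures (the next post then refines the clock and Dram access):

#include <stdio.h>

int main(void)
{
    /* Structure of the Pci-e number cruncher above. */
    int    nodes   = 33 * 4 * 8;              /* 33 nodes/group, 4 groups/module, 8 modules */
    double flops   = nodes * 2 * 750e6;       /* 1 mul-add per cycle at 750MHz              */
    double dram_b  = nodes * 170e6;           /* 170MB exclusive Dram per node              */
    double dram_bs = nodes * 48 * 8 / 15e-9;  /* 48 words of 64b per 15ns sweep             */

    printf("%d nodes, %.1f TFlops, %.0f GB Dram, %.0f TB/s Dram\n",
           nodes, flops / 1e12, dram_b / 1e9, dram_bs / 1e12);
    return 0;
}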
Enthalpy Posted October 19, 2015

And Oops! for the number cruncher too... 1056 nodes can run at 950MHz, achieving 2.0TFlops double precision, and each accesses 56 eight-byte words out of 59 Dram banks in 14 cycles, so the cumulated Dram throughput is 32TB/s.