Enthalpy Posted July 6, 2016 (edited)

I propose a bit transposition capability, to take advantage of a processor's ability to perform logic operations on 32, 64 or more bits at a time.

Some applications, especially databases, must compute bitwise logic operations on many items, but the homologous information bits are often spread as one per item (example: an item for sale is new with tag, new without box, refurbished, or used), while bitwise logic needs them grouped in one word. So the instruction would perform

Destination [word i, bit j] = Source [word j, bit i]

on 64 words of 64 bits for instance, prior to the sequence of bitwise logic operations that filters the items through a set of conditions, and possibly after it.

The bit transposition would combine nicely with the "scaled copy" I already described:
http://www.scienceforums.net/topic/78854-optical-computers/page-2#entry778467

The scaled copy extracts the bit fields of a group of items and packs them in consecutive words; the bit transposition then puts the homologous bits extracted from varied items in one word, ready for parallel bitwise processing, after which a bit transposition and a scaled copy can write the processed bits back to the items when necessary. One operation could combine the scaled copy with the bit-reversed copy, but I don't feel it vital.

The execution unit could hardly shuffle so much data, but the Dram or the data cache can. It could then be a transposed copy instruction similar to the already described "scaled copy" and "bit-reversed copy", done between address areas corresponding to variables named by the source program, which would then take care of data coherency, as it has the necessary information for that while the hardware has not. Here too, the cache or Dram must receive and execute more detailed instructions than read and write, which is easier when they're linked at manufacture.

Transposing 32 words * 32 bits or 64 words * 64 bits means 128 or 512 bytes, an efficient size for data transfer between the Dram and the cache. For words of 256 or 512 bits, do it as you can.

Data areas don't always start and stop at multiples of 64 words, and memory page protection will meddle in too. I happily leave these details to other people. Solutions exist already for vector computers.

Marc Schaefer, aka Enthalpy

==========

I suggest a wide cascaded logic operation that operates on many independent bits in parallel, for instance the word width of the execution unit, and computes at once a small chain of logic operations over a few registers. It can be similar and complementary to the cascaded logic operation I already described:
http://www.scienceforums.net/topic/78854-optical-computers/page-3#entry854751 and following

The reverse Polish notation and an implicit stack simplify the execution. Four bits can encode the operations And, Or, Eq, Gt, Lt and complements, plus Push, Swap, Nop... Three or four bits can designate the successive source registers (rather than the source bits, as in the previous cascaded logic).

The logic expression to parse a database isn't known at compilation time, so the operation follows computed data rather than a set of opcodes. For healthy context switching, some general registers can hold this information about the source registers and the chain of operations.

For instance, one 64b register can indicate 7 source registers among 32, coded on 5 bits, plus 7 operations coded on 4 bits. This seems enough for a simple low-power processor that specializes in databases.
Or one 64b register can indicate 16 source registers among 16, coded on 4 bits, and another register can indicate 16 operations coded on 4 bits. This fits the consumption and complexity of a number cruncher better. The throughput of the Dram and cache can be a limit.

The "wide cascaded logic" operation combines nicely with the "transposed copy" and "scaled copy" operations. As the resulting bitfield can indicate which database items to process further, and the wide cascaded logic screens the items in less than one cycle each, one more instruction could usefully indicate where the first 1 bit is in the result and discard it. Some Cpu have that already.

Marc Schaefer, aka Enthalpy

Edited July 6, 2016 by Enthalpy
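To fix ideas, here is a minimal C sketch of what the transposed copy and the wide cascaded logic of the two posts above would compute together, in software; the opcode set, encodings and names are illustrative assumptions, not the proposed hardware format.

#include <stdint.h>

/* Transposed copy: dst[word i, bit j] = src[word j, bit i],
   gathering the homologous flag bits of 64 items into single words. */
static void bit_transpose_64x64(const uint64_t src[64], uint64_t dst[64])
{
    for (int i = 0; i < 64; i++) {
        uint64_t w = 0;
        for (int j = 0; j < 64; j++)
            w |= ((src[j] >> i) & 1) << j;  /* bit i of item j -> bit j */
        dst[i] = w;
    }
}

/* Wide cascaded logic: an RPN chain over whole registers, screening
   64 items per step. Opcodes and stack depth are assumed here. */
enum { PUSH, AND, OR, XOR, NOT, END };

static uint64_t wide_cascade(const uint64_t col[64],
                             const uint8_t op[], const uint8_t src[])
{
    uint64_t stk[8];
    int sp = 0, s = 0;
    for (int k = 0; op[k] != END; k++)
        switch (op[k]) {
        case PUSH: stk[sp++] = col[src[s++]];  break;
        case AND:  sp--; stk[sp-1] &= stk[sp]; break;
        case OR:   sp--; stk[sp-1] |= stk[sp]; break;
        case XOR:  sp--; stk[sp-1] ^= stk[sp]; break;
        case NOT:  stk[sp-1] = ~stk[sp-1];     break;
        }
    return stk[0];  /* bit i set = item i passes the filter */
}

For instance, after a bit_transpose_64x64 of the item words, the filter new & (with_tag | refurbished) over 64 items at once is op = {PUSH, PUSH, PUSH, OR, AND, END} with src = {NEW, TAG, REFURB}.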
Enthalpy Posted July 8, 2016 (edited)

These are the technology developments for the architectures I have described so far, with the date of the first suggestion and, where possible, a search keyword.

---------- Processors

Explicit macro-op instructions, 01 January 2016, search "macro-op".
Meaningful for any Cpu, most useful for the simplest ones with the lowest consumption. The compiler must be adapted.
This needs thoughts and paperwork first, possibly at a university or elsewhere.

Cpu optimized for integer add and compare at low power.
31 January 2015. Hints on 26 February 2015, 12:44 AM. Description on 25 October 2015.
Useful for databases, web servers, web search machines, artificial intelligence, as noted on 31 January 2015 and since.
This can begin as paperwork before switching to a semiconductor company.

Cascaded logic instruction, 26 February 2015, two messages, search "cascade", and
wide cascaded logic instruction, 06 July 2016, 6:34 pm, search "wide cascade".
Mostly for databases, web servers, web search machines, hence belongs in any Cpu.
Small hardware development immediately at the silicon design company. Extensions to the optimizing compiler by its editor; may benefit from theoretical thoughts at a university.

Find string instruction, 26 February 2015, search "string".
For any Cpu. Some existing Cpu must already offer it.
Hardware development at the silicon design company. Extensions to the optimizing compiler by its editor.

Complex 32b mul-acc, 18 October 2015, 01:25 PM, search "complex".

Multiple flow control units, 26 October 2015, search "flow control".
Subtract-one-and-branch instruction, 26 October 2015, search "subtract one".
Desired everywhere outside linear algebra. May well exist already.
This can begin as paperwork before switching to a semiconductor company.

---------- Memory

Common point: at most one Dram chip per computing node, with enough throughput to feed the processor.

Dram chip with many accesses, 19 November 2013.

Full-speed scaled indexed access, 19 November 2013, search "scaled".
For any Cpu, Gpu, gamestation, database, web server, number cruncher.
Needs clear thoughts first, possibly at a university. Then, easy silicon.

Full-speed butterfly access, 19 November 2013, search "butterfly".
For number crunchers, especially signal processing.
Needs clear thoughts first, possibly at a university. Then, easy silicon.

Bit transposed copy, 06 July 2016, 5:05 pm, search "transposed copy".
Mostly for databases.
Needs thoughts first. Then, easy silicon.

Flash chips with big throughput, 01 May 2015.
A need, not a solution. Flash chips close to the computing nodes, with ports faster than Usb and Sata.
For server machines and also number crunchers.
A task for the Flash company.

---------- Stacked chips

Adaptive connections accept misalignment, 08 and 23 February 2015, search "adaptive" (not only chips).
For any Cpu, Gpu, gamestation, database, web server, number cruncher.
Hardware development at the silicon design company. Optional upsized proof-of-concept at any electronics lab.

Capacitive connections, 12 November 2015, 2:47 am, search "capacit" (not only chips).
For any Cpu, Gpu, gamestation, database, web server, number cruncher. A PhD thesis exists already.
No fine lithography, hence feasible by any semiconductor lab, including an equipped university.

Connections by reflow accept misalignment, search "reflow".
12 November 2015, 2:47 am, details on 27 December 2015.
No fine lithography, hence feasible by any semiconductor lab, including an equipped university.
---------- Software

OS subset for small unit Dram, 13 September 2015.
A need, not a solution. Database machines can have a unit Dram even smaller than present supercomputers.
Usually done by the computer manufacturer; could be a collective effort too.

Lisp and Prolog interpreters, inference engine for small unit Dram, 07 December 2013.
A need, not a solution.

---------- Boards

Data vessels, 08 February 2015, search "vessel".
By any electronics lab, or the Pcb company.

Pcb with more signals: denser or more layers (or bigger).
A task for a Pcb company.

---------- Machines

Crossboards, 01 February 2015.
For most machines with several boards.
Electrical engineers, possibly with an (opto-)chip company, shall bring the throughput.

Optical board-to-board connectors, 02 February 2015 and 22 February 2015, 12:47 AM, search "optical link".
For most machines with several boards - but consider capacitive connectors.
Needs some optics and some fast electronics, possibly together with an optochip company.

Capacitive board-to-board connectors, 08 February 2015, search "capacitive".
For most machines with several boards.
Proof-of-concept by a skilled and equipped electronics lab, then chips by a semiconductor designer and manufacturer.

Flexible cables with repeaters for many fast signals, 02 December 2013.
For every machine except the smallest ones.
By a skilled and equipped electronics lab together with a manufacturer of flexible printed circuits.

Cable-to-board connectors by contact, 02 December 2013.
For most machines with several boards - but consider capacitive connectors.
Needs skills in mechanical engineering and fast electronics.

Wider so-dimm-like connector, 18 October 2015, 01:25 PM.
For Pci-E boards, maybe others.
Development by the connector manufacturer; needs skills in fast electronics.

Capacitive cable-to-board connectors, 12 November 2015, 1:49 am, search "capacitive".
For most machines with several boards.
Almost the same development as the capacitive connections between chips.

Superconducting data cables, 02 December 2013, search "superconduct".
For many-cabinet machines.
Needs skills in superconductors, (printed?) cables and fast electronics.

Insulating coolant, 15 November 2015.
Chemistry lab.

Edited July 8, 2016 by Enthalpy
Enthalpy Posted December 16, 2018

Some chips progressed from mid-2016 to end 2018.

Processors have stalled: 14nm finfet then, 12nm finfet now. Intel's Xeon Phi got only a minor update to the Knights Landing (the Knights Mill). nVidia's Titan V provides as many 64b GFlops using as many watts as the Knights Landing. Waiting for 7nm processors.

Dram chips still have 1GB capacity. Their throughput increased thanks to parallelism and new bus standards, not to faster access to the cells.

But Flash memory did improve. Imft (Intel-Micron) made a chip, nonvolatile but not flash, that is fast but draws too much power for my goal here. Samsung brought its Z-Nand (Slc V-Nand) to around 20µs read and write latency. The chips are not documented, but as deduced from the Ssd, they could each hold 25GB and read or write 200MB/s.

This is fantastic news for a database machine, and also for a number cruncher with good virtual memory. Better: this throughput seems achievable on one single 4kB page, so each processor could have its own 200MB/s access to its part of the Flash chip. 50* the Dram capacity, read or write the complete Dram to virtual memory in 0.3s: that looks sensible again. How much does a chip cost? How much does it consume?
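A back-of-envelope check of the Z-Nand figures; a minimal sketch, where the 16 parallel page streams per node are my assumption to reach the stated 0.3s, not a documented chip feature.

#include <stdio.h>

int main(void)
{
    const double page    = 4096;   /* bytes per Z-Nand page           */
    const double latency = 20e-6;  /* read/write latency, seconds     */
    const double dram    = 1e9;    /* 1GB Dram per computing node     */
    const double streams = 16;     /* assumed parallel page streams   */

    double per_stream = page / latency;  /* one page at a time */
    printf("per-stream: %.0f MB/s\n", per_stream / 1e6);          /* ~205 */
    printf("node swap:  %.2f s\n", dram / (streams * per_stream)); /* ~0.31 */
    return 0;
}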
Enthalpy Posted May 29, 2019

I had suggested an instruction copied from the Vax 11's Subtract One and Branch on 26 October 2015; the x86 family already has one.

==========

One computation I have quite often in my programmes is

if (|x-ref| < epsilon)

The operation on floating-point numbers is lighter and faster than a multiplication, hence easily done in one cycle. It's often in inner loops of heavy computations. Processors that don't provide this operation should.

Depending on hardware timing, the instruction set could provide variants, preferably the most integrated one:

|x-ref|
Compare |x-ref| with epsilon
Branch if |x-ref| < epsilon

Simd processors (Sse, Avx...) could compute on each component and group the comparisons by a logical operation, possibly with a mask, in the same instruction or a following one. Inevitably with Simd, it makes many combinations (see the sketch after this post).

==========

Another computation frequent in scientific programmes is

if (|x-ref|² < epsilon)

It looks about as heavy as a multiply-accumulate, but denormalizations take more time. If it fits in a cycle, fine; with the comparison and the branch, better; but it's obviously not worth a slower cycle. Here too, Simd machines could group the comparisons by a logical operation.

I feel the square is less urgent than the previous absolute value, which can often replace it. Also, a test is often done on the sum of the squared components of the difference instead, and such a sum is also useful alone, without a test nor a branch.

Marc Schaefer, aka Enthalpy
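For comparison, a minimal C sketch of how this test is written today, in scalar code and with standard AVX intrinsics; the proposed fused instruction would collapse each function to roughly one operation.

#include <immintrin.h>
#include <math.h>

/* Scalar form: what the proposed single instruction would replace. */
static inline int near_scalar(double x, double ref, double eps)
{
    return fabs(x - ref) < eps;  /* subtract, clear sign, compare, branch */
}

/* AVX form: 4 doubles at once, the comparisons grouped into one mask. */
static inline int near_avx(__m256d x, __m256d ref, __m256d eps)
{
    const __m256d signbit = _mm256_set1_pd(-0.0);
    __m256d d  = _mm256_andnot_pd(signbit, _mm256_sub_pd(x, ref)); /* |x-ref| */
    __m256d lt = _mm256_cmp_pd(d, eps, _CMP_LT_OQ);
    return _mm256_movemask_pd(lt);  /* one bit per lane; 0xF = all close */
}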
Enthalpy Posted July 6, 2019

Small corporate file servers, web servers, database servers and number crunchers are commonly built of a few dozen blades, each holding a pair of big, recent, expensive processors, with a rather loose network between the blades and a Dram throughput that follows the Cpu needs less and less. I propose to use the better networks already described here and to assemble more numerous, old-fashioned, cheap used processors. This shall cumulate more throughput from the Dram, the network, and optionally the disks.

==========

Here I compare old and recent processors. All from Intel, as they made most servers recently, but I have nothing against Amd, Arm and the others. I checked a few among the 1001 variants and picked subjectively some for the table.

North- and southbridges are soldered on existing mobos, hence not available second-hand nor usable, and new ones should remain expensive even when old-fashioned. So I checked only Cpu that access the Dram directly and create many Pci-E links to make the network. Chips on Pci-E links shall make ports for Ethernet and the disks, like on add-on cards; I like Pci-E disks, but they rob too much mobo area here. I didn't check what the Bios and monitoring need. One special card shall connect a screen, keyboard etc.

I excluded Pci-E 2 Cpu for throughput and Pci-E 4 Cpu for price in 2019. Pci-E 3 offers 8GT/s per lane and direction, so a good x16 network provides ~16GB/s to each Cpu.

Line e is a desktop Cpu, less cheap. Lines f and g are modern big Cpu that make a server expensive for a medium company. Lines a, b, c, d are candidates for my proposal; in 2019 these have Avx-256 and Ddr3. None integrates a Gpu that would draw 15W.

  | # GHz 64b  | # MT/s | W   | W/GHz     | Cy/T      |
===========================================================
a | 4 2.8 (4)  | 3 1333 | 80  | 7.1 (1.8) | 2.8 (11)  |
b | 6 2.9 (4)  | 4 1600 | 130 | 7.5 (1.9) | 2.7 (11)  |
c | 8 2.4 (4)  | 4 1600 | 95  | 4.9 (1.2) | 3.0 (12)  |
d | 6 2.4 (4)  | 4 1600 | 60  | 4.2 (1.0) | 2.3 (9.0) | <<==
===========================================================
e | 6 3.3 (4)  | 4 2133 | 140 | 7.1 (1.8) | 2.3 (9.3) |
===========================================================
f | 72 1.5 (8) | 6 2400 | 245 | 2.3 (0.3) | 7.5 (60)  |
g | 24 2.7 (8) | 6 2933 | 205 | 3.2 (0.4) | 3.7 (30)  |
===========================================================

a = Sandy Bridge-EN Xeon E5-1410. LGA1356, DDR3, 24 lanes, 20€.
b = Sandy Bridge-EP Xeon E5-2667. LGA2011, DDR3, 40 lanes, 40€.
c = Sandy Bridge-EP Xeon E5-4640. LGA2011, DDR3, 40 lanes, 50€.
d = Ivy Bridge-EP Xeon E5-2630L v2. LGA2011, DDR3, 40 lanes, 40€.
e = Haswell-E Core i7-5820K. LGA2011-3, DDR4, 28 lanes, 130€.
f = Knights Landing Xeon Phi 7290. LGA3647, DDR4, 36 lanes, ++++€.
g = Cascade Lake-W Xeon W-3265. LGA3647, DDR4, 64 lanes, 3000€.

First # is the number of cores, GHz the base clock, 64b is 4 for Avx-256 and 8 for Avx-512. Next # is the number of Dram channels, MT/s how many 64b words a channel transfers in a µs. W is the maximum design consumption of a socket, called TDP by Intel. W/GHz deduces an energy per scalar or (Simd) computation; it neglects the small gains in cycle efficiency of newer Core architectures. Cy/T compares the Cpu and Dram throughputs in scalar and (Simd) mode - ouch!
It's the number of cycles the cores shall wait to obtain one 64b or (Avx) word from the Dram running ideally. The sketch after this post reproduces the derived columns for line d. The price is for second hand, observed on eBay for reasonable bargains on small quantities in 2019.

While an OS runs well from cache memory, most scientific programmes demand Dram throughput or are difficult to write for a cache. For databases, the Dram latency decides, and it hasn't improved for years. A Dram easy to use would read two data and write one per core and cycle, or Cy/T=0.33, which no present computer achieves. I favoured this ratio when picking the list. More cores make recent Cpu worse, wider Simd even worse. Assembling many old Cpu cumulates more Dram channels.

Process shrinks improve the consumption per computation. But if an old-fashioned Cpu draws 60W and a recent one saves half of that by finishing faster, then over 1/3 activity and 5 years the gain is 438kWh, less than 100€, which doesn't buy a fashionable Cpu. If the Dram stalls the recent Cpu more often, the gain vanishes.

==========

Each Cpu with its Dram, Ethernet and disk ports shall fit on a daughter board plugged by Pci-E x16 (or two x16 if at all possible) into a big mobo that makes the network. But if Pci-E 3 signals can cross a Ddr4 connector, then carry 32+ lanes.

The network comprises 16 independent planes where chips make a full crossbar or, if needed, a multidimensional crossbar. Can a Cpu on the mobo make a 40*40 crossbar? It takes many Cpu there, and software makes slow communications. At least one crossbar Asic exists for Pci-E connections among Cpu. If that Asic isn't available, make another. A 32*32 full matrix chip fits in a Dram chip package and can connect 1024 Cpu in a 2D crossbar. 15*16=240 Cpu with 8+8 lanes each take (15+16)*8=248 matrix chips. A chip can serve several smaller planes. The routing density needs a mobo with many layers. Repeaters may be necessary.

A big machine with 480 Cpu connects any two Cpu in two hops and transfers 2TB/s through any equator in any direction. Better than a few fibres as a hypertorus. Many small Cpu again outperform a few big ones.

==========

Liquid cooling takes a few mm over the Cpu. Some alkanes are insulators, good coolants, and hard to light:
Low-freezing rocket fuels

New Dram chips soldered directly on the daughter boards, like on graphics cards, would enable 12.7mm spacing. A few tens of euros buy 16GB presently. Used Dram modules would be bigger and more flexible; tilted connectors exist, for Ddr4 at least:
au.rs-online.com
and horizontal connectors exist for So-dimm. Or have a minimum Pcb to hold a second Dimm connector making the angle.

Daughter boards need local regulators for the Cpu, Dram etc. Like graphics cards, they can receive 12V from (many) Pc power supplies with minimum recabling. As the cabinet's sides are usefully reserved for Ethernet, the Sata disks and power supplies could reside in the doors.

Using an Ivy Bridge-EP Xeon E5-2630L v2 or similar, each daughter board might sell for 250€. A small cabinet with 30 daughter boards would sell for 10k€ and cumulate 3.5TFlops on doubles, 1.5TB/s from the Dram, 240GB/s through the network's equator. A big cabinet with 480 daughter boards would sell for 160k€: 56TFlops, 24TB/s, 1.9TB/s. 400MB/s disks, one per board, would cumulate 190GB/s. Drawings may come. Perhaps.

==========

While not a competitor to the clean-sheet architectures I proposed previously, such machines assemble existing hardware.
As Pci-E is fully compatible across generations, the number and nature of the daughter boards can evolve, the size of the mobo too, and the boards can serve in successive machines at different customers. As they depend on available second-hand processors, the daughter boards would be diverse within a machine, and the software must cope with small variations in the instruction set.

With reasonable capital, a startup company can buy used Cpu on eBay, Alibaba, etc. - or rather complete old-fashioned servers with Dram and Ssd - and reorganize the components better around the superior network.

Marc Schaefer, aka Enthalpy
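The derived columns of the table can be checked with a few lines of C; a minimal sketch for line d, with the formulas reconstructed from the column definitions given above.

#include <stdio.h>

int main(void)
{
    /* Line d: Ivy Bridge-EP Xeon E5-2630L v2 */
    double cores = 6, ghz = 2.4, simd = 4;  /* 4 doubles per Avx-256 op */
    double chan = 4, mts = 1600;            /* Dram channels, MT/s each */
    double tdp = 60;                        /* socket design power, W   */

    double w_per_ghz = tdp / (cores * ghz);                    /* 4.2  */
    double cy_per_t  = cores * ghz * 1e9 / (chan * mts * 1e6); /* 2.3  */

    printf("W/GHz: %.1f (%.1f Simd)\n", w_per_ghz, w_per_ghz / simd);
    printf("Cy/T:  %.1f (%.1f Simd)\n", cy_per_t, cy_per_t * simd);
    return 0;
}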
Enthalpy Posted September 1, 2019

This shall illustrate a computing node with old-fashioned server Cpu to make up a parallel machine. In this case a socket 2011, because the E5-2630L v2 and other Ivy or Sandy Bridge Xeon are fast and cheap in 2019. As per the last message, each Cpu socket communicates over a Pci-E 3 x16 link and has 4 Ddr3 channels. Liquid cooling enables 20mm stacking, like Atx extension boards. The drawing suggests 125mm side spacing.

As reasonable used bargains on eBay, the drawn 1600MHz Ddr3 So-dimm cost 13€/4GB or 22€/8GB. They could occupy both board sides. Desktop modules are more common, cost 9€/4GB and can offer Ecc, but at ~160mm they widen the board and need a small Pcb to make the angle, as all Ddr3 connectors seem straight. If cheap, new chips would instead land directly on both sides of the board.

A server node needs disk and Ethernet links, provided by cheap small chips on the Cpu's Pci-E lanes. Alas, it may also need a boot Rom and sensors, which connect over the South bridge on the mobos I checked. However, new Chinese mobos sell for 85€ on eBay, with an X79 South bridge, many connectors and added functions, and even for 50€ with 2*Ddr, so a South bridge is feasible.
ebay

The node board, simpler than the Chinese mobos, may sell for 40€. An E5-2630L v2 costs 40€, 16GB of Dram 52€, a used 120GB Mlc Ssd 20€, 1/8 of a used power supply 5€. Without connection board nor cabinet, one equipped node can sell for 250€ and provide 115GFlops, with Dram and network throughputs better than present servers.

The cabinet may also contain standard Pci-E boards: Ssd, graphics... If not, the nodes can use the Pci-E power contacts for 2 more lanes and carry the components on the top side.

Marc Schaefer, aka Enthalpy
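A quick check of the 115GFlops and parts-cost figures; a minimal sketch, where the 8 flops per core and cycle (one 4-wide Avx add plus one 4-wide Avx multiply issued per cycle on Sandy/Ivy Bridge) is my reading of the architecture, not stated in the post.

#include <stdio.h>

int main(void)
{
    /* E5-2630L v2: 6 cores at 2.4GHz, 8 double flops per core and cycle. */
    double gflops = 6 * 2.4 * 8;                        /* 115.2 */
    /* Node bill of materials, eBay prices 2019, from the post above. */
    double parts = 40 /*board*/ + 40 /*Cpu*/ + 52 /*16GB Dram*/
                 + 20 /*Ssd*/ + 5 /*power supply share*/;
    printf("%.0f GFlops, parts %.0f euro, sale ~250 euro\n", gflops, parts);
    return 0;
}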
Enthalpy Posted September 8, 2019

Here's a possible aspect of a computer that assembles many old-fashioned Xeon processors with the nodes described in the last message. Recent processors too would benefit from the better network and be faster, but when cheap older ones suffice, they're easier to program, since their Dram throughput matches the Cpu needs less badly. Amd or Arm, Mips, Risc, Sparc and other processors can be assembled that way too.

This supercomputer uses cheap hardware and needs no costly development. It runs existing compilers, OS and applications. That would make a very nice university project, both to develop and to use.

This design's backbone uses a dense Pcb, 1m*1m big, to connect up to 48*8*2=768 computing nodes. More processors per node would share the throughput, yuk. A hypothetical bigger Pcb would expand the computer.

The example Xeons E5-2630L v2 then cumulate 39TB/s Dram throughput and peak at 88TFlops double precision - as much as >85 recent Xeon W-3265, or 1/13th of the smallest supercomputer in the Top 500 list of June 2019.

Old-fashioned used 120GB Mlc Ssd cost 20€ and read 250MB/s. 4 Ssd per node can reside in trays away from the precious backbone Pcb, at the end of 1m S-ata cables. 3072 Ssd cumulate 368TB and 768GB/s. Or the X25-E with Ncq cumulate 175Miops. Flash chips on the node Pcb, or mini-Pci-E, are alternatives, as are Pci-E Ssd replacing some computing nodes.

One 1GB/s fibre per node gives the Ethernet 768GB/s.

==========

Network

It's a bidimensional matrix as described on 26 September 2015.

The vertical dimension groups adjacent pairs of columns of 8 nodes on the 2 sides of the backbone, or 32 nodes, that connect over a 32*32 single-lane full switch matrix. The horizontal dimension groups every second node of one backbone side, or 24 nodes per row, over a switch chip of the same kind, as the vertical dimension completes the job. Pci-E 3 over x16 connectors offers 8 bidirectional lanes per node and dimension. Each dimension has 8 independent sub-networks and switch matrix chips, for throughput.

Rows use 8*2 matrix chips per backbone side on 128mm height, so all matrix chips can reside at the backbone's centreline, just staggered a bit for 8mm vertical spacing. Pairs of columns use 8 matrix chips spread over 20mm per backbone side, so 5mm horizontal chip spacing lets the backbone's centreline accommodate them with little staggering. The matrix chips input and output 32 differential lanes needing about 4 balls each; >256 balls need packages bigger than a Dram chip's.

Each dimension of the backbone Pcb transports 8*768/2=3072 bidirectional differential lanes. The 12288 copper lines spread over 1m take >6 Pcb layers, maybe 9 per dimension, plus the ground and power planes. A 36-layer big Pcb is a challenge, but affordable for a 250k€ machine.

Does Pci-E 3 reach 0.5m? The backbone Pcb can host repeater chips, say at 1/3 and 3/4 of each dimension. Drams need no differential signals, so I'd convert the Pci-E signals to asymmetric and back, possibly at the repeaters, or near the Pci-E connectors, or if possible on the nodes. This saves on the matrix packages and on the Pcb, or lets route more lanes through the connectors and Pcb.

8GT/s Pci-E 3 lets the backbone transport 3TB/s through any equator in each direction, for easier programming. Top that with fibres between server blades!

==========

Cheaper network

Pci-E switch matrix chips exist already, but are they available commercially?
A less ambitious network can save this development and ease the backbone Pcb, but its throughput and latency are worse. And can Xeons interconnect directly through Pci-E?

For instance a hypertorus can group rows of 15 nodes in 2 dimensions as 3*5, and columns of 45 nodes as 3*3*5. Instead of 2, the biggest distance is then 7 hops, as odd numbers ease it. All rings can be tripled, one quadrupled. This hypertorus transports 0.8TB/s through any equator.

Details about the cabinet may come later.
Marc Schaefer, aka Enthalpy
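A minimal sketch of the two-hop property of the full 2D switch matrix described above; the (row, column) addressing is my simplification of the backbone geometry, not its exact layout.

#include <stdio.h>

/* Nodes addressed as (row, col) on a grid served by one full switch
   matrix per row and per column: any two nodes in at most two hops. */
typedef struct { int row, col; } Node;

static int hops(Node a, Node b)
{
    if (a.row == b.row && a.col == b.col) return 0;
    if (a.row == b.row || a.col == b.col) return 1; /* one shared switch */
    return 2;  /* column switch to (b.row, a.col), then row switch */
}

int main(void)
{
    Node a = {0, 0}, b = {23, 15};   /* opposite corners of a 24*16 grid */
    printf("%d hops\n", hops(a, b)); /* prints: 2 hops */
    return 0;
}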
Strange Posted September 8, 2019

! Moderator Note
The OP's questions were answered long ago. The current posts are increasingly off-topic and look more suitable for a blog.