NP4P is a single-purpose, non-programmable implementation of 4P.
For the 23 city Traveling Salesman Problem, the NP4P implementation will beat the world's fastest supercomputer using a device that fits inside a standard PC case and runs from a standard outlet.
NP4P has no real use outside supercomputer applications because it cannot be reprogrammed for any task other than the one it was specifically designed for. That being said, NP4P has overwhelming processing power for computing a single problem. Since supercomputers typically run about 30 types of problems, it makes sense to consider building an NP4P for each problem type in place of a single supercomputer.
After 4P is used to solve a problem, an NP4P solution can be obtained by stripping away the programmable part, leaving a very efficient ASIC solution.
The goal of the 23 City NP4P TSP implementation is to reach a solution faster than the world's fastest supercomputer using a device that fits in a standard PC case. The "contest" would require the following specifications:
The design shown above is functional in the sense that it produces correct output. Due to funding constraints, the permutation algorithm has not yet been completed; it has been replaced by simple indexing so that the remainder of the circuit can be verified.
Note that every step between the existing design, shown above, and the final 23 city design rests on established engineering practice. Hence, there are no open theoretical problems to solve, only the practical application of known principles, as demonstrated by a functioning prototype.
A Single Clock Pulse is advanced every 32 clocks, setting the Enable Storage Line for the next 32 bits of storage.
The Enable Storage Line selects the 32 Bits of Storage that store the Data Input. D Flip Flops constantly cycle the storage around 2 lines of 16 bits each.
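The storage scheme above can be sketched as a recirculating shift-register model. This is a behavioral sketch only, assuming the netlist's word count and naming; the helper names are illustrative, not taken from the actual design.

```python
# Behavioral sketch of the recirculating distance storage (assumed model).
WORD_BITS = 32

def make_storage(words):
    """Each stored word circulates through two 16-bit lines of D Flip Flops."""
    return [[0] * WORD_BITS for _ in range(words)]

def tick(storage, enable_word, data_in_bit):
    """One clock: every word shifts one bit; only the enabled word takes Data Input."""
    for w, word in enumerate(storage):
        new_bit = data_in_bit if w == enable_word else word[-1]  # recirculate others
        word.insert(0, new_bit)
        word.pop()

def load_word(storage, word_index, value):
    """Serially clock a 32-bit value into one word, LSB first."""
    for i in range(WORD_BITS):
        tick(storage, word_index, (value >> i) & 1)

storage = make_storage(4)
load_word(storage, 2, 11)
# Decode: after 32 shifts the newest bit (bit 31) sits at position 0.
assert sum(b << (31 - j) for j, b in enumerate(storage[2])) == 11
```

Words that are not enabled recirculate their own bits, so their contents are preserved while one word is being loaded.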
Refer to the main image to see the distances stored between cities.
Each city is joined to 2 other cities, so each group has a pair of index sets that are added together, with 0 or 1 elements selected from each index set.
The index set on the right has a constant selection; the index set on the left has a simple iterator. These are used for testing the remainder of the design. The final design will have a permutation algorithm, which has not yet been completed due to budgetary constraints.
Two distances are added serially with the carry bit fed back. City #5 only has a single distance stored. With the exception of #5, all the other cities add the two possible index values from their stored values as shown in the main image. City #5 is delayed so its output is aligned with the rest, whose additions are pairwise added in succession. Another delay is necessary since 6 is not a power of 2. The final result contains the summation of all selected cities. Refer to the main image to see the layout.
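The bit-serial adder tree can be sketched as follows. The per-city partial sums are placeholder values, not real distances; the real design streams 32-bit values LSB first through full adders with a single fed-back carry per adder.

```python
# Sketch of the bit-serial adder tree (placeholder distance values).
WORD_BITS = 32

def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams with the carry bit fed back each clock."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)
        carry = s >> 1
    return out

def to_bits(v):
    return [(v >> i) & 1 for i in range(WORD_BITS)]

def to_int(bits):
    return sum(b << i for i, b in enumerate(bits))

# Six per-city partial sums; city #5's single value joins after a delay.
city_sums = [to_bits(v) for v in (12, 7, 9, 4, 15, 6)]

# Pairwise reduction 6 -> 3 -> 2 -> 1; the odd value out is delayed
# one level because 6 is not a power of 2.
level1 = [serial_add(city_sums[i], city_sums[i + 1]) for i in (0, 2, 4)]
level2 = serial_add(level1[0], level1[1])
total = serial_add(level2, level1[2])
assert to_int(total) == 12 + 7 + 9 + 4 + 15 + 6  # 53
```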
The final Adder produces the result of the current iteration: the entire path of all cities added together for the current permutation. The result is run through a 32 bit Data Reverser to feed the Most Significant Bit first. The Data Reverser and Less Than operator are shown in more detail below.
A Single pulse is cycled through 32 D Flip Flops. The 32 Bit reverser loads the 32 bit result. The single pulse signals the reverser to unload the 32 Bits with the Most Significant Bit first as required by the Less Than Operation.
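A minimal behavioral sketch of the Data Reverser, assuming a parallel load triggered by the single pulse followed by a serial MSB-first unload:

```python
# Sketch of the 32-bit Data Reverser (behavioral model, not the netlist).
WORD_BITS = 32

def reverse_stream(lsb_first_bits):
    """Load the LSB-first result, then unload it MSB first."""
    register = list(lsb_first_bits)  # parallel load on the single pulse
    return register[::-1]            # serial unload, Most Significant Bit first

result = [(1234 >> i) & 1 for i in range(WORD_BITS)]  # LSB-first adder output
msb_first = reverse_stream(result)
assert msb_first[-1] == 1234 & 1   # LSB now streams out last
```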
The Less Than operation is initialized to all ones until the first value has passed through the Less Than operator. This ensures that the first value will be stored. After that, the stored value is compared with the new value and the smallest is stored.
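The Less Than stage behaves as a running minimum. For MSB-first fixed-width streams, the comparison is decided at the first differing bit, which a lexicographic list comparison models directly (values below are illustrative):

```python
# Sketch of the Less Than stage as a running minimum over MSB-first streams.
WORD_BITS = 32

def to_msb_bits(v):
    return [(v >> (WORD_BITS - 1 - i)) & 1 for i in range(WORD_BITS)]

def running_min(results):
    stored = [1] * WORD_BITS  # initialized to all ones, so the first value wins
    for value in results:
        candidate = to_msb_bits(value)
        if candidate < stored:  # decided at the first differing bit, MSB first
            stored = candidate
    return sum(b << (WORD_BITS - 1 - i) for i, b in enumerate(stored))

assert running_min([70, 41, 55, 41, 93]) == 41
```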
All Outputs begin streaming after running the specified number of iterations (see input).
Input is passed serially to each TSPM one after the other over the entire chip/wafer.
After all TSPMs complete all calculations, they stream their results to Accumulators that pass the smallest value forward along with the corresponding UID and Iteration Count.
The final output pin streams off the chip/wafer through a USB interface to a PC, where the UID and Iteration Count are used to determine the path.
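The accumulator behavior described above can be modeled as a running minimum over tagged results. The tuple layout (path length, UID, iteration count) is an assumption for illustration:

```python
# Sketch of the Accumulator chain: each stage compares the value arriving
# from upstream with its own TSPM's best and passes the smaller forward,
# keeping the UID and Iteration Count attached (field order assumed).
def accumulate(tspm_results):
    """tspm_results: iterable of (path_length, uid, iteration) tuples."""
    best = None
    for result in tspm_results:
        if best is None or result[0] < best[0]:
            best = result
    return best  # smallest path length, with its UID and iteration count

length, uid, iteration = accumulate([(812, 0, 5), (799, 1, 61), (803, 2, 17)])
assert (length, uid, iteration) == (799, 1, 61)
```

On the PC side, the winning UID identifies the TSPM and the iteration count identifies the permutation, which together recover the path.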
Multiple Clock Speeds:
Here we make a rough yet conservative estimate of how much hardware will be required to reach 200 petaflops. Since the permutation iteration algorithm has not yet been completed, we make a general, conservative estimate rather than relying on the exact transistor count of the existing 6 city prototype shown.
We use 100,000,000 transistors per square mm for all calculations.
We use a clock speed of 1 GHz. Higher clock speeds can result in a smaller hardware footprint.
The transistor count for the 23 city TSP will be dominated by the storage requirement since the Data Reverser and Less Than operator occur once per TSPM. For 253 distances times 32 bits per distance times 8 transistors per D Flip Flop we get
253 * 32 * 8 = 64,768 transistors per TSPM for storage.
Since the storage will dominate the transistor count, we simply round up to 100,000 and then double to 200,000. This should provide an absolute maximum transistor count per TSPM.
Each TSPM will produce 22 additions every 32 clocks. At 1 GHz, this is 22/32 = 0.6875 billion operations per second per TSPM.
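The per-TSPM arithmetic above can be checked numerically under the stated assumptions:

```python
# Storage and throughput arithmetic per TSPM, checked numerically.
DISTANCES = 23 * 22 // 2       # 253 city-pair distances for 23 cities
BITS_PER_DISTANCE = 32
TRANSISTORS_PER_DFF = 8

storage_transistors = DISTANCES * BITS_PER_DISTANCE * TRANSISTORS_PER_DFF
assert storage_transistors == 64_768   # transistors per TSPM for storage

CLOCK_HZ = 1_000_000_000               # 1 GHz
adds_per_second = 22 / 32 * CLOCK_HZ   # 22 additions every 32 clocks
assert adds_per_second == 0.6875e9     # 0.6875 billion ops/s per TSPM
```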
For efficiency, we are interested in using the entire wafer uncut with pinouts located on the edge. We are presently verifying this design with manufacturers. Using chips cut from wafers will impose additional inefficiencies due to lost silicon.
At 70,000 square mm per wafer, 200,000 transistors per TSPM, and a transistor density of 100,000,000 transistors per square mm, we get
70,000 * 100,000,000 / 200,000 = 35,000,000 TSPMs per wafer
35,000,000 TSPMs per wafer * 0.6875 billion calculations per second per TSPM = 24,062,500 billion calculations per second per wafer
The goal is 200,000,000 billion calculations per second (200 petaflops).
200,000,000 / 24,062,500 = 8.312
Round up to 9 wafers for a factor of safety of about 8%.
Round up to 10 wafers for a factor of safety of about 20%.
Factor of safety includes manufacturing defects and edge losses for cutting if the chips must be cut.
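The wafer-count arithmetic can be verified under the stated assumptions; with these figures the margin over the goal works out to roughly 8% at 9 wafers and 20% at 10:

```python
# Wafer-count arithmetic from the figures above.
WAFER_MM2 = 70_000
DENSITY = 100_000_000            # transistors per square mm
TRANSISTORS_PER_TSPM = 200_000   # conservative per-TSPM budget
OPS_PER_TSPM = 0.6875e9          # operations per second per TSPM
GOAL = 200e15                    # 200 petaflops

tspms_per_wafer = WAFER_MM2 * DENSITY // TRANSISTORS_PER_TSPM
assert tspms_per_wafer == 35_000_000

ops_per_wafer = tspms_per_wafer * OPS_PER_TSPM   # 24,062,500 billion ops/s
wafers_exact = GOAL / ops_per_wafer
assert round(wafers_exact, 3) == 8.312

# Margin over the 200-petaflop goal at the rounded wafer counts.
assert round(9 * ops_per_wafer / GOAL - 1, 2) == 0.08
assert round(10 * ops_per_wafer / GOAL - 1, 2) == 0.20
```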
10 wafers will fit inside a standard PC box along with a Raspberry Pi, power supply, cooling system, and a control board with USB IO to communicate with the wafers. Note that running at a higher clock speed or using fewer transistors per TSPM will reduce the number of wafers proportionally.