A Supercomputer built on 4P will be 10 to 100 times smaller with the same performance

NP4P is single purpose, non programmable implementation of 4P.

For the 23 city Traveling Salesman Problem the NP4P implementation will beat the worlds fastest supercomputer in a device that fits inside a standard PC case using a standard outlet.

NP4P has no real use outside supercomputer applications because it cannot be reprogrammed for any task other than the one it was specifically designed for. That being said, NP4P has overwhelming processing power for computing a single problem. Since supercomputers typically run about 30 types of problems, it makes sense to consider building an NP4P for each problem type in place of a single supercomputer.

After 4P is used to solve a problem, a NP4P solution can be obtained by stripping away the programmable part leaving a very efficient ASIC solution.

The goal of the 23 City NP4P TSP implementation is to reach a solution faster than the world's fastest supercomputer using a device that fits in a standard PC case. The "contest" would require the following specifications:

- The input data format is 32 bit integers containing the distances between every pair of cities.
- The input data will be generated randomly immediately prior to the contest.
- The input data will be presented in a file formatted according to a prior agreement by both parties.
- The input data will be limited to distances that do not produce overflow of 31 bits when adding the largest path.
- Both contestants must use brute force to arrive at the solution.
- The first contestant to iterate through every path and reach the correct solution wins.

- Technically examine the existing SPICE model for 6 cities (shown above) in detail
- Describe the generalizations and additions needed to engineer the 23 city design into a functional ASIC
- SPICE files available on request

The design shown above is a functional in the sense that it produces correct output at the result. Due to funding constraints the permutation algorithm has not yet been completed but has been replaced by simple indexing for verifying the remainder of the circuit.

Note that every step between the existing design, shown above, and the final 23 city design is solidly founded upon engineering application. Hence, there are no theoretical open problems to solve, only practical application of known principles to build a technically accurate implementation described by a functioning prototype.

A Single Clock Pulse is incremented every 32 clocks setting theEnable Storage Line for the next 32 bits of storage.

The Eneble Storage Line Selects the 32 Bits of Storage that stores the Data Input. D Flip Flops constantly cycle the storage around 2 lines of 16 bits each.

Refer to the main image to see the distances stored between cities.

Each city is joined to 2 other cities so each group has a pair of index sets that are added together with 0 or 1 elements from each index set selected.

The index set on the right has a constant selection. The index set on the left has a simple iterator. These are used for testing the remainder of the design. The final design will have a permutation algorithm implemented and has not yet been completed due to budgetary constraints.

Two distances are added serially with the carry bit fed back. City #5 only has a single distance stored. With the exception of #5, all the other cities add the two possible index values from its stored values as shown in the main image. City #5 is delayed so its output is aligned with the rest whose additions are pairwise added in succesion. Another delay is neccesary since 6 is not a power of 2. The final result contains the summation of all selected cities. Refer to the main image to see the layout.

The final Adder produces the result of the current iteration for the entire path of all cities added together for the current permutation. The result is run through a 32 bit to feed the Most Significant Bit first. The Data Reverser and Less Than operator are shown in more detail below.

A Single pulse is cycled through 32 D Flip Flops. The 32 Bit reverser loads the 32 bit result. The single pulse signals the reverser to unload the 32 Bits with the Most Significant Bit first as required by the Less Than Operation.

The Less Than operation is initialized to all ones until the first value has passed through the Less Than operator. This ensures that the first value will be stored. After that, the stored value is compared with the new value and the smallest is stored.

Input:

- Wake bit. The first 1 starts the timer.
- 15 distances of 32 bits each

Output:

- Result: 32 bit minimum distance constantly streamed

Input:

- Wake bit.
- 253 distances of 32 bits each. Same values for all modules.
- N bits specifying the number of iterations to run. N is the smallest number of bits needed to hold the iteration count. Same value for all modules.
- 253*K bits, where K is a constant to be determined, specifying the starting permutation for each module
- 32 bits specifying the UID for each module
- 4 bits specifying clock speed for each module
- 1 bit enable flag

Output:

All Outputs begin streaming after running the specified # of iterations (see input).

- Result: 32 bit minimum distance.
- UID: for TSPM
- Iteration Count corresponding to result

Input is passed serially to each TSPM one after the other over the entire chip/wafer.

After all TSPMs complete all calculations, they stream their results to Accumulators that pass the smallest value forward along with the corresponding UID and Iteration Count.

The final output pin streams off the chip/wafer through a USB to a PC where the UID and Iteration Count are used to determine the path.

Multiple Clock Speeds:

- Manufacturing Irregularities may require clock reduction in some TSPMs.
- The maximum clock speed for each TSPM will be determined during hardware validation testing
- A 4 Bit, programmable variable will instruct each TSPM what clock speed to run
- The starting permutations will be externally calculated taking into account each clock speed to ensure that the entire set of permutations is calculated
- The number of iterations determine when all of the TSPMs will release their results
- All results will be released running on the slowest clock

Defective TSPMs:

- TSPMs that are determined to be defective during testing, have their enable flag set to 0.
- A TSPM with a 0 enable flag streams 1s for their result to ensure that they do not affect the final result.
- Critical paths, such as program load or result output, are over designed or redundant.

Here we make a rough, yet conservative, estimation on how much hardware will be required to reach 200 petaflops. Since the purmutation iteration algorithm has not yet been completed we make a general, conservative estimation rather than being obscured in the details of the existing, exact, transistor count for the 6 city prototype shown.

We use 100,000,000 transistors per square mm for all calculations.

We use a clock speed of 1 GHz. Higher clock speeds can result in a smaller hardware footprint.

The transistor count for the 23 city TSP will be dominated by the storage requirement since the Data Reverser and Less Than operator occur once per TSPM. For 253 distances times 32 bits per distance times 8 transistors per D Flip Flop we get

253 * 32 * 8 = 64,768 transistors per TSPM for storage.

Since the Storage will dominate the transistor count we will simply round up to 100,000 then double to 200,000. This should provide an absolute maximum transistor count per TSPM.

Each TSPM will produce 22 additions every 32 clocks. At 1 GHz, this is 22/32 = .6875 Billion operations per second per TSPM.

For efficiency, we are interested in using the entire wafer uncut with pinouts located on the edge. We are presently verifying this design with manufacturers. Using chips cut from wafers will impose additional inefficiencies due to lost silicon.

At 70,000 square mm per wafer and 200,000 transistors per TSPM and a transistor density of 100,000,000 we get

70,000 * 100,000,000 / 200,000 = 35,000,000 TSPMs per wafer

35,000,000 TSPMs per wafer * .6875 Billions of calculations per Second per TSPM = 24,062,500 Billions of Calculations per Second per Wafer

Goal is 200,000,000 Billion calculations per second.

200,000,000 / 24,062,500 = 8.312

Round up to 9 with a 7% factor of safety.

Round up to 10 with a 19% factor of safety.

Factor of safety includes manufacturing defects and edge losses for cutting if the chips must be cut.

10 wafers will fit inside a standard PC box along with a Raspberry PI, Power supply, Cooling System and control board with USB IO to communicate with the wafers. Note that running at a higher clock speed or using less transistors per TSPM will reduce the number of wafers proportionality.