What is NP4P?

NP4P is Non-Programmable 4P

NP4P is single purpose, non programmable implementation of 4P.   

NP4P is powerful

For the 23 city Traveling Salesman Problem the NP4P implementation will beat the worlds fastest supercomputer in a device that fits inside a standard PC case using a standard outlet.

NP4P is a solution for supercomputing applications

NP4P has no real use outside supercomputer applications because it cannot be reprogrammed for any task other than the one it was specifically designed for.  That being said, NP4P has overwhelming processing power for computing a single problem.  Since supercomputers typically run about 30 types of problems, it makes sense to consider building an NP4P for each problem type in place of a single supercomputer.

NP4P is derived from a 4P program

After 4P is used to solve a problem, a NP4P solution can be obtained by stripping away the programmable part leaving a very efficient ASIC solution.

Non-Programmable 4P (NP4P) Design Description

The NP4P 23 City Traveling Salesman Problem (TSP) Problem Outline

The goal of the 23 City NP4P TSP implementation is to reach a solution faster than the world's fastest supercomputer using a device that fits in a standard PC case.  The "contest" would require the following specifications:

  1. The input data format is 32 bit integers containing the distances between every pair of cities.
  2. The input data will be generated randomly immediately prior to the contest.
  3. The input data will be presented in a file formatted according to a prior agreement by both parties.
  4. The input data will be limited to distances that do not produce overflow of 31 bits when adding the largest path.
  5. Both contestants must use brute force to arrive at the solution.
  6. The first contestant to iterate through every path and reach the correct solution wins.


Outline for describing the 23 City TSP NP4P Design

  1. Technically examine the existing SPICE model for 6 cities (shown above) in detail  
  2. Describe the generalizations and additions needed to engineer the 23 city design into a functional ASIC
  3. SPICE files available on request


The design shown above is a functional in the sense that it produces correct output at the result.  Due to funding constraints the permutation algorithm has not yet been completed but has been replaced by simple indexing for verifying the remainder of the circuit.     


Note that every step between the existing design, shown above, and the final 23 city design is solidly founded upon engineering application.  Hence, there are no theoretical  open problems to solve, only practical application of known principles to build a technically accurate implementation described by a functioning prototype.

Input Data Storage, indexing and permutations

Line Selector

A Single Clock Pulse is incremented every 32 clocks setting theEnable Storage Line for the next 32 bits of storage.

32 Bit Integer Storage

The Eneble Storage Line Selects the 32 Bits of Storage that stores the Data Input.  D Flip Flops constantly cycle the storage around 2 lines of 16 bits each.


Refer to the main image to see the distances stored between cities. 

Indexing & Purmutation


Each city is joined to 2 other cities so each group has a pair of index sets that are added together with 0 or 1 elements from each index set selected.


The index set on the right has a constant selection.  The index set on the left has a simple iterator.   These are used for testing the remainder of the design.  The final design will have a permutation algorithm implemented and has not yet been completed due to budgetary constraints.

Distance adding & result accumulation

Standard Adder

Two distances are added serially with the carry bit fed back.  City #5 only has a single distance stored.  With the exception of #5, all the other cities add the two possible index values from its stored values as shown in the main image.  City #5 is delayed so its output is aligned with the rest whose additions are pairwise added in succesion.   Another delay is neccesary since 6 is not a power of 2.  The final result contains the summation of all selected cities.  Refer to the main image to see the layout.

Iteration Result, Bit Reverser & Less Than

The final Adder produces the result of the current iteration for the entire path of all cities added together for the current permutation.   The result is run through a 32 bit to feed the Most Significant Bit first.  The Data Reverser and Less Than operator are shown in more detail below.

32 Bit Reverser

A Single pulse is cycled through 32 D Flip Flops.  The 32 Bit reverser loads the 32 bit result.  The single pulse signals the reverser to unload the 32 Bits with the Most Significant Bit first as required by the Less Than Operation.

Result for 6 city model

Less Than & Result Storage

The Less Than operation is initialized to all ones until the first value has passed through the Less Than operator.  This ensures that the first  value will be stored.  After that, the stored value is compared with the new value and the smallest is stored.  

Additions from 6 city prototype to 23 city production model

Input Output: 6 City TSPM

Input:

  1. Wake bit.  The first 1 starts the timer.
  2. 15 distances of 32 bits each

Output:

  1. Result: 32 bit minimum distance constantly streamed

Input Output: 23 City TSPM

Input:

  1. Wake bit.
  2. 253 distances of 32 bits each.  Same values for all modules.
  3. N bits specifying the number of iterations to run.  N is the smallest number of bits needed to hold the iteration count.  Same value for all modules.
  4. 253*K bits, where K is a constant to be determined, specifying the starting permutation for each module
  5. 32 bits specifying the UID for each module
  6. 4 bits specifying clock speed for each module
  7. 1 bit enable flag

Output:

All Outputs begin streaming after running the specified # of iterations (see input).

  1. Result: 32 bit minimum distance.  
  2. UID: for TSPM
  3. Iteration Count corresponding to result

Multiple TSPMs

Input is passed serially to each TSPM one after the other over the entire chip/wafer.


After all TSPMs complete all calculations, they stream their results to Accumulators that pass the smallest value forward along with the corresponding UID and Iteration Count.  


The final output pin streams off the chip/wafer through a USB to a PC where the UID and Iteration Count are used to determine the path.



Multiple Clock Speeds & Defective TSPMs


Multiple Clock Speeds:

  • Manufacturing Irregularities may require clock reduction in some TSPMs.  
  • The maximum clock speed for each TSPM will be determined during hardware validation testing
  • A 4 Bit, programmable variable will instruct each TSPM what clock speed to run
  • The starting permutations will be externally calculated taking into account each clock speed to ensure that the entire set of permutations is calculated 
  • The number of iterations determine when all of the TSPMs will release their results
  • All results will be released running on the slowest clock


Defective TSPMs:

  • TSPMs that are determined to be defective during testing, have their enable flag set to 0.   
  • A TSPM  with a 0 enable flag streams 1s for their result to ensure that they do not affect the final result.     
  • Critical paths, such as program load or result output, are over designed or redundant.

Calculations for required hardware

Introduction

Here we make a rough, yet conservative, estimation on how much hardware will be required to reach 200 petaflops.  Since the purmutation iteration algorithm has not yet been completed we make a general, conservative estimation rather than being obscured in the details of the existing, exact, transistor count for the 6 city prototype shown.

Transistor Density and Clock Timings

We use 100,000,000 transistors per square mm for all calculations.  


We use a clock speed of 1 GHz.  Higher clock speeds can result in a smaller hardware footprint.

Storage

The transistor count for the 23 city TSP will be dominated by the storage requirement since the Data Reverser and Less Than operator occur once per TSPM.  For 253 distances times 32 bits per distance times 8 transistors per D Flip Flop we get 


253 * 32 * 8 = 64,768 transistors per TSPM for storage.

Conservatively Rounding

Since the Storage will dominate the transistor count we will simply round up to 100,000 then double to 200,000.  This should provide an absolute maximum transistor count per TSPM.  

Calculations Per TSPM

Each TSPM will produce 22 additions every 32 clocks.  At 1 GHz, this is 22/32 = .6875 Billion operations per second per TSPM.  

TSPMs per 300 mm wafer

For efficiency, we are interested in using the entire wafer uncut with pinouts located on the edge.  We are presently verifying this design with  manufacturers.   Using chips cut from wafers will impose additional inefficiencies due to lost silicon.


At 70,000 square mm per wafer and 200,000 transistors per TSPM and a transistor density of 100,000,000 we get

70,000 * 100,000,000 / 200,000 = 35,000,000 TSPMs per wafer

Calculations per wafer

35,000,000 TSPMs per wafer * .6875 Billions of calculations per Second per TSPM = 24,062,500 Billions of Calculations per Second per Wafer

Number of wafers needed

Goal is 200,000,000 Billion calculations per second.


200,000,000 / 24,062,500  = 8.312

Round up to 9 with a 7% factor of safety.

Round up to 10 with a 19% factor of safety.


Factor of safety includes manufacturing defects and edge losses for cutting if the chips must be cut.

10 wafers will fit inside a standard PC box along with a Raspberry PI, Power supply, Cooling System and control board with USB IO to communicate with the wafers.  Note that running at a higher clock speed or using less transistors per TSPM will reduce the number of wafers proportionality.