4P Simplified

A short overview of 4P in layman's terms

4P using non-technical analogy

It all started with the PSP

That PSP is made out of 14 transistors.  Yet it can provide all the calculation power for any software ever written on any digital hardware ever built.  How?  It creates a hardware equivalent in source code.

Learn about the PSP

Same Power 100x Smaller: The Calculations

A supercomputer that is 100x smaller is a big claim.  Here is how we calculated that number.

The Calculations

4P Hardware Fundamentals

  • Once the design philosophy is understood, the 4P IC layout is very simple: it is highly repetitive, symmetrical and conceptually elementary.

  • 4P is not a single IC layout, but rather an architectural framework that will ultimately result in many specific IC designs based upon the 4P paradigm.  For example, one 4P design trade-off is faster program load times at the cost of lower memory cell density.

  • 4P is built on commercial silicon etching technology that has been successfully powering computers for decades.  The implementation is very basic, almost trivial, once engineers understand the underlying fundamental principles of how it works and, more importantly, why it works.  

  • 4P provides a revolutionary foundation for computation.  Instead of traditional CPU/GPU cores, 4P uses simple, tiny processors that enable massively parallel, programmable computation areas over the entire silicon wafer surface.  Using 400-2000 fully etched, uncut 300 mm wafers at 1 GHz, 4P will reach 100 petaflops.  Then we optimize to redefine computation.
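As a rough sanity check on the figures quoted above, the short Python sketch below works out what each wafer would have to deliver.  It assumes the 100 petaflops is spread evenly across the wafers, which the text does not state; the function names are illustrative only.

```python
# Figures quoted in the text above.
TARGET_FLOPS = 100e15        # 100 petaflops
CLOCK_HZ = 1e9               # 1 GHz
WAFER_COUNTS = (400, 2000)   # range of wafer counts quoted

def per_wafer_flops(total_flops: float, wafers: int) -> float:
    """Throughput each wafer must supply if the load is split evenly (assumption)."""
    return total_flops / wafers

def ops_per_clock(flops: float, clock_hz: float) -> float:
    """Floating-point operations a wafer must complete on every clock."""
    return flops / clock_hz

for wafers in WAFER_COUNTS:
    flops = per_wafer_flops(TARGET_FLOPS, wafers)
    print(f"{wafers} wafers: {flops / 1e12:.0f} teraflops per wafer, "
          f"{ops_per_clock(flops, CLOCK_HZ):,.0f} operations per clock")
```

Under that even-split assumption, each wafer would need to sustain between 50 and 250 teraflops, or 50,000 to 250,000 floating-point operations on every clock.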

4P Architecture Details

4P: Computational Canvas

4P hardware is basically an empty canvas for developers to code.  The TSP video shows one program, but this canvas can be used for anything from graphics ray tracers to credit card embedded chips to low-latency, real-time audio filtering.  Anything that can be digitally computed can be run on a 4P canvas, and every program is treated the same.  Above is an image of an operating system responding to a user request to run an online app.  The operating system is first loaded through a secure connection (0).  The app is run as follows:

  1. The User clicks an app to download 
  2. The Allocation Module within the OS takes the following actions:
    1. Determines if the App is allowed to run, if not, notifies the user
    2. If the App is allowed to run, calculates the App space requirements
    3. Determines if there is enough room within available Free 4P Canvas Area
    4. If there is enough room, determines where the App will run 
    5. If there is not enough room, makes room by clearing out old Apps or notifies the user
  3. The App is loaded then run 
  4. The App writes to the screen as needed
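The allocation steps above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the actual 4P operating system: the `App`, `Canvas` and `allocate` names are hypothetical, canvas space is tracked as a single area number rather than a physical placement, and "clearing out old Apps" is assumed to mean removing the oldest app first.

```python
from dataclasses import dataclass, field

@dataclass
class App:
    name: str
    area: int             # canvas area the app needs (hypothetical unit)
    allowed: bool = True  # outcome of the "is the App allowed to run" check

@dataclass
class Canvas:
    total_area: int
    loaded: list = field(default_factory=list)  # oldest app first

    def free_area(self) -> int:
        return self.total_area - sum(app.area for app in self.loaded)

def allocate(canvas: Canvas, app: App) -> str:
    """Follow the Allocation Module steps and return a status message."""
    # Step 2.1: refuse apps that are not allowed to run.
    if not app.allowed:
        return f"{app.name}: not allowed to run"
    # Step 2.5 (notify): an app larger than the whole canvas can never fit.
    if app.area > canvas.total_area:
        return f"{app.name}: not enough room"
    # Steps 2.3 and 2.5 (make room): clear out the oldest apps until it fits.
    while canvas.free_area() < app.area:
        canvas.loaded.pop(0)
    # Steps 2.4 and 3: place the app, then load and run it.
    canvas.loaded.append(app)
    return f"{app.name}: loaded and running"

canvas = Canvas(total_area=100)
print(allocate(canvas, App("mail", 60)))
print(allocate(canvas, App("game", 70)))   # "mail" is cleared out to make room
print(allocate(canvas, App("huge", 500)))
```

A real allocator would also have to decide where on the canvas the app lives, since each Module occupies a physical area; the sketch leaves that out.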

Notice that the Modules shown are not simply logical relationships but rather designate the physical area occupied by each Module (not to scale).  Furthermore, in an actual 4P simulator, each Module could be independently viewed down to its component parts similar to the TSP video.

4P operates in two distinct modes:  

  1. Program Load 
  2. Program Run

Program Load

A program is loaded from a secure input line.  For a physical chip, this corresponds to the control line.  Note that some chips can be pre-loaded for security purposes.  For example, a simple operating system on a drone may be physically etched into hardware so that no one can ever reprogram it.

The control line for a physical, programmable 4P chip guarantees that the owner has complete control over the chip.  The control line can reprogram any and all portions of the canvas at any time, ensuring complete security.  This control line will typically be connected to a ROM or a highly secure network during program load.  The control input can grant any other input pin the authority to act as a local control pin over the entire canvas or any sub-portion of the canvas.

Regardless of prior programs or any possible state, however, the single control input pin always retains 100% control over the entire chip.
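The authority rules described above can be modeled with a small Python sketch.  It is a toy model, not a description of the real chip: the `Canvas` class, the `CONTROL_PIN` name and the cell numbering are all hypothetical, and the `revoke` method is an assumed consequence of the control pin keeping full control.

```python
CONTROL_PIN = "control"  # hypothetical name for the single control input

class Canvas:
    def __init__(self, size: int):
        self.cells = [None] * size  # module programmed into each canvas cell
        self.local_pins = {}        # pin name -> set of cells it may program

    def delegate(self, granting_pin: str, pin: str, cells) -> None:
        """Only the control pin may hand out (or replace) local authority."""
        if granting_pin != CONTROL_PIN:
            raise PermissionError("only the control pin can delegate authority")
        self.local_pins[pin] = set(cells)

    def revoke(self, granting_pin: str, pin: str) -> None:
        """The control pin can withdraw delegated authority at any time."""
        if granting_pin != CONTROL_PIN:
            raise PermissionError("only the control pin can revoke authority")
        self.local_pins.pop(pin, None)

    def program(self, pin: str, cell: int, module: str) -> None:
        """Write a module into a cell if the pin has authority over it."""
        allowed = pin == CONTROL_PIN or cell in self.local_pins.get(pin, set())
        if not allowed:
            raise PermissionError(f"{pin} may not program cell {cell}")
        self.cells[cell] = module

canvas = Canvas(8)
canvas.delegate(CONTROL_PIN, "pin3", range(4, 8))  # pin3 controls cells 4..7
canvas.program("pin3", 5, "app")                   # allowed: inside its region
canvas.program(CONTROL_PIN, 5, "os")               # control pin overrides anything
```

In this model nothing a delegated pin does can remove the control pin's authority, which mirrors the guarantee stated above.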

Program Run

The video for the TSP shows a program that has already been loaded.  The program is "pinned" into location and will remain there unless reprogrammed by the control line.  In this way, programs are deterministically separated from the data they process in a way that is simply not possible with conventional architectures.  The data passes through the program and is processed but has no way to affect the underlying program without first entering through the control pin.  As such, very secure systems can be built and validated through simple inspection.

Architecture: Conventional Vs 4P

Conventional Processors: Discrete processors

Conventional processors such as CPUs and GPUs have discrete processors using instruction fetch/decode and arithmetic logic units.  Data is addressed per 32-bit or 64-bit word.  Data and instructions are streamed from RAM or VRAM into layers of caches that hold a copy of each word of data.  The inefficiency of maintaining multiple copies of data is significant, as is the amount of silicon used for the addresses, caches and associated support logic.

Worse still is the effect that the multiple copies of data have on parallel processing.  Data consistency is a major hurdle in multi-threaded design.  GPUs mitigate these effects through SIMD design, which enables parallel computation with fewer synchronization issues.  As a result, however, GPUs, ASICs, FPGAs and systolic arrays are not suitable for general purpose programming such as operating systems.

4P gives the best of both worlds by allowing the developer to use only as much processing area as each program needs.

4P: Massively Parallel or General Purpose Code

While the massive performance increases of 4P hardware will fuel the headlines, it is important to note how well suited 4P is to more basic programming tasks.  The amazing thing about the 4P canvas is that you use what you need.  If you want to fill up the 4P canvas with an array of TSP modules and use it like a supercomputer, you can.  If you want to run a series of applications at the same time on your desktop, you can.  You can use the processing power however you wish and the 4P canvas is happy to oblige.

Supercomputers will be the first to benefit, as a 4P supercomputer on a desk will outperform current systems requiring the floor space of a football end zone.  Note that 4P allows supercomputers at different locations to stream data back and forth as if they were a single machine.  As far as 4P is concerned, any number of supercomputers can form a single module, no different from the modules created from several 4P chips in a single system.

PCs will naturally evolve as slimmed-down supercomputers, followed by phones, routers, credit card chips and everything else.  The streaming connectivity is what makes 4P so general and simple.  For example, if you want a very simple 4P chip with no operating system, you don’t need one.  The hardware is no different.  Just program it with the module you design and connect the pins to run it like any other piece of hardware.

Security: Conventional vs 4P

Conventional Security

Conventional hardware recognizes programs running at different security levels as shown in the image above.  The problem with conventional security is the reality of centralized addressing.  User code downloaded from an untrusted location often requires making system calls at elevated levels of privilege.  A malicious attacker can use these calls to exploit code vulnerabilities that are notoriously difficult to notice until after they are exploited.  With millions of lines of software running at elevated privilege, modern security is often reactive rather than deterministic.

For example, a simple printf() statement can be, and has been, exploited through a buffer overflow to gain access to protected kernel address space.  Once the attacker can execute instructions at high privilege, the entire system is vulnerable.  With regular OS and device driver updates, it is next to impossible to ensure that all system code is free from security risks.  Often the truest security is the sheer complexity an attacker must unravel in the code to exploit a security flaw.  As a result, many modern security experts agree that any machine connected to the Internet has security by obscurity rather than actually being secure.

4P Security

In the age of the Internet, there is no easy answer to system security, just like there is no easy answer to home security.  4P is not a magic answer, but its natural design makes it much simpler to stay secure with significantly less effort.  For example, for your home security would you prefer constant video monitoring with motion alerts or a daily 500-page report?

In 4P, security is visual, not text-based.  You SEE the data moving into and out of areas of high security.  More importantly, each program lives in its own physical location in hardware instead of sharing a single processor.  The physical Control Input pin that programs the Operating System is highly guarded using a secure ROM, USB or secure network connection.  For extremely high security, such as military drones, the Control Input can actually be destroyed, thereby ensuring that the OS is NEVER altered.

In conventional systems, security diagrams are just that: diagrams.  In reality, every program runs through the same CPU and moves data out of the same central RAM.  In contrast, 4P programs live in physical areas on the physical silicon.  Layers of security are directly analogous to military perimeter security.  Data lines are validated like visitors entering checkpoints.  

In 4P you don’t see diagrams that logically correlate to hardware execution that is largely unprovable.  You see the actual data physically moving through the real program right where it lives on the hardware.

Latency: Conventional Vs 4P

Conventional Cache Misses Kill Latency

In a conventional processor, when data has not been recently used, accessing it can result in a cache miss that destroys latency.  Deterministic clock timings must either use special hardware, include the cache miss delays, or guarantee that data is in the cache.  Including cache miss delays greatly reduces the clock resolution, while keeping data in the cache reduces the availability of the processor to do other tasks.

Worse still, even when including the cache miss delay or keeping the data in the cache, there is typically still an “engineering approach” of using what works rather than a logical design that guarantees deterministic clock timings.  While low-latency applications suffer the most, this non-determinism has a cascading effect on all levels of software.

Thus, software timing is often a matter of trying software on a multitude of hardware and dialing back the load if it bogs down.  This approach leads to one of computer users' greatest frustrations: an unresponsive system.

4P = Exact clock timings

Any 4P module can guarantee clock timings.  For a fixed data size, processing timings can be guaranteed down to a single clock.  This level of clock resolution is naturally present and easily witnessed in 4P.  Run the TSP demo.  The result appears on exactly the same clock every time because there is no guessing and hoping that your data is in the cache.  Trying to accomplish this in conventional processors is sheer fantasy.  In 4P you SEE it.   

Imagine an operating system guaranteeing simple state query responses in under 1000 clocks.  And those occur in parallel for hundreds of thousands of queries from thousands of services.  4P finally makes it possible and practical to guarantee basic, significant levels of responsiveness with every click.

Example: 4P Network Packet Processing

We envision an Internet where packet routers use 4P to provide an extra service, typically used for video games, network presentations and Voice over IP (VoIP).  Low Latency Low Bandwidth (LLLB) packets will be specifically designated by clients and separately processed by routers.  This service may incur extra charges that certain clients will be happy to pay.

LLLB packets will have a small fixed size, likely under 1K, although routers may provide different tiers.  When a router receives an LLLB packet, it is separated from regular packets and processed very quickly in deterministic time.  A direct extension of LLLB processing would be Medium Latency High Bandwidth (MLHB).  MLHB would typically be used for video, where latency is less important than constant throughput.
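The separation of packet classes described above might look something like the following Python sketch.  Everything concrete in it is an assumption made for illustration: the `service` field that clients would set, the 1024-byte limit standing in for "under 1K", demoting oversized LLLB packets to the regular lane, and serving LLLB before MLHB before regular traffic.  It models only the queueing decision, not the deterministic timing that 4P hardware would provide.

```python
from collections import deque
from dataclasses import dataclass

LLLB_MAX_BYTES = 1024  # assumption: one possible reading of "under 1K"

@dataclass
class Packet:
    payload: bytes
    service: str = "regular"  # hypothetical client-set class: "LLLB", "MLHB" or "regular"

class Router:
    LANES = ("LLLB", "MLHB", "regular")  # assumed service order

    def __init__(self):
        self.queues = {lane: deque() for lane in self.LANES}

    def receive(self, packet: Packet) -> str:
        """Place the packet in a lane and return the lane's name."""
        lane = packet.service if packet.service in self.queues else "regular"
        # Assumption: an LLLB packet that exceeds the size limit loses its fast lane.
        if lane == "LLLB" and len(packet.payload) >= LLLB_MAX_BYTES:
            lane = "regular"
        self.queues[lane].append(packet)
        return lane

    def next_packet(self):
        """Return the next packet to forward, or None if all lanes are empty."""
        for lane in self.LANES:
            if self.queues[lane]:
                return self.queues[lane].popleft()
        return None

router = Router()
router.receive(Packet(b"v" * 4096, "MLHB"))   # video frame
router.receive(Packet(b"g" * 64, "LLLB"))     # game input
print(router.next_packet().service)           # the LLLB packet is forwarded first
```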

Example: 4P Real Time Audio Processing

Real-time audio mixing is particularly affected by latency due to the high frequencies involved.  Additionally, the human ear notices even a very small time delay between a guitar string being strummed and the sound being played on the speaker.  Since 4P can produce output deterministically, real-time mixing, which would typically require dedicated hardware, is possible on a general 4P processor.

High-end robotics systems would be similarly affected.  It is important to note that, as military combat inevitably becomes ever more automated, these very small timing differentials could make the difference between victory and failure in actual combat.

4P Algorithms

Less Than Module (LTM)

The Less Than Module (LTM) is a streaming version of the standard less-than operation typical in modern programming languages.  The TSP directly implements the LTM described here.  The LTM has 2 streaming input lines, Data #1 and Data #2.  The data on each of these lines must be fed Most Significant Bit (MSB) first.  The Reset input must receive a single True/1 bit on the same clock as the first bit of data.  The Reset input is connected to a timer that sends a single True/1 bit every 32 bits.  Thus, the LTM can be used to compare data of any size by altering the timer connected to the Reset input.

After Reset, the LTM compares the input bits from Data #1 and Data #2.  As long as both inputs are the same, the shared bit is copied to the Result Output.  On the first clock that the inputs are different, the input with the False/0 bit is locked; because the data arrives MSB first, that input carries the smaller value.  The locked input is then copied directly to the Result Output until the Reset Input receives a True/1.  Note that, due to the internals of the LTM, the Result Output is 6 clocks behind the Data Inputs.
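The behavior described above can be simulated bit by bit in Python.  This is a minimal behavioral sketch, not the transistor-level design: it assumes unsigned integers, and it does not model the 6-clock delay between input and output.  The helper names (`to_bits`, `from_bits`, `ltm_stream`, `ltm_words`) are made up for the example.

```python
WORD_BITS = 32  # reset period used in the text

def to_bits(value, width=WORD_BITS):
    """Unsigned integer -> list of bits, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]

def from_bits(bits):
    """List of bits (MSB first) -> unsigned integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def ltm_stream(data1, data2, reset):
    """Yield one Result Output bit per clock (pipeline delay not modeled)."""
    locked = None  # None until the inputs differ, then 1 or 2
    for a, b, r in zip(data1, data2, reset):
        if r:
            locked = None                # a True/1 on Reset starts a new comparison
        if locked is None and a != b:
            locked = 1 if a == 0 else 2  # lock the input that holds the 0 bit
        yield b if locked == 2 else a    # equal bits, or Data #1 locked: copy Data #1

def ltm_words(words1, words2, width=WORD_BITS):
    """Stream two lists of words through the LTM and regroup the output."""
    bits1 = [bit for word in words1 for bit in to_bits(word, width)]
    bits2 = [bit for word in words2 for bit in to_bits(word, width)]
    reset = [int(i % width == 0) for i in range(len(bits1))]  # the timer
    out = list(ltm_stream(bits1, bits2, reset))
    return [from_bits(out[i:i + width]) for i in range(0, len(out), width)]

print(ltm_words([5, 9, 7], [9, 3, 7]))  # prints [5, 3, 7]
```

Changing `width` plays the role of altering the timer on the Reset input, so the same module compares words of any size.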

Less Than Switch Module (LTSM)

The Less Than Switch Module (LTSM) is very similar to the LTM except that both Data Inputs are streamed to 2 Data Outputs.  The smaller value is always written to the lower Data Output #2.
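A behavioral sketch of the LTSM in Python is shown below.  It assumes unsigned integers fed MSB first with a Reset bit on the first bit of each word, as for the LTM, and it leaves out any pipeline delay.  The function names are invented for the example.

```python
WORD_BITS = 32

def to_bits(value, width=WORD_BITS):
    """Unsigned integer -> list of bits, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]

def from_bits(bits):
    """List of bits (MSB first) -> unsigned integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def ltsm_stream(data1, data2, reset):
    """Yield (output1_bit, output2_bit) per clock; Output #2 gets the smaller value."""
    swap = None  # None until the inputs differ after a reset
    for a, b, r in zip(data1, data2, reset):
        if r:
            swap = None
        if swap is None and a != b:
            swap = (a == 0)  # Data #1 is smaller, so it must be routed to Output #2
        yield (b, a) if swap else (a, b)

def ltsm_words(words1, words2, width=WORD_BITS):
    """Stream two lists of words through the LTSM; return (output1, output2) words."""
    bits1 = [bit for word in words1 for bit in to_bits(word, width)]
    bits2 = [bit for word in words2 for bit in to_bits(word, width)]
    reset = [int(i % width == 0) for i in range(len(bits1))]
    pairs = list(ltsm_stream(bits1, bits2, reset))
    chunks = [pairs[i:i + width] for i in range(0, len(pairs), width)]
    out1 = [from_bits([p[0] for p in chunk]) for chunk in chunks]
    out2 = [from_bits([p[1] for p in chunk]) for chunk in chunks]
    return out1, out2

print(ltsm_words([5, 9, 7], [9, 3, 7]))  # prints ([9, 9, 7], [5, 3, 7])
```

While the two inputs agree, it does not matter which input feeds which output, so the switch decision can safely wait until the first differing bit.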

4P Equivalent of Bubble Sort is O(N)

The 4P rough equivalent to Bubble Sorting N 32-bit integers takes exactly 32*N+6*N clocks to produce a sorted array from an unsorted array.   A streamlined version runs in 6N clocks and eliminates the need for the second storage array.

The 32 is directly based upon the bit size.  Each compare/switch operation uses a Less Than Switch Module (LTSM).  A clock resets the LTSM every 32 bits to match the 32-bit integer word size, just like the TSP.  The LTSM requires 6 clocks to generate output after being reset.  This cost occurs only once, since each Less Than operation is continuously streamed during the Bubble Sort.

4P Bubble Sort requires 2 storage arrays similar to the one used to store the data in the TSP.  Each step consists of 32 clocks corresponding to the 32 bits in each integer.  In the first step, adjacent pairs are compared using an LTSM for each pair.

On odd steps, each LTSM compares two numbers and routes the smaller number to the lower position.  Since this is done in parallel, every pair is compared and, where needed, switched at the same time as the numbers are written into the second storage array.  On even steps, alternate pairs of numbers are compared as shown and written back to the original array.  In total, N steps are required.
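The stepping scheme described above matches what is generally known as odd-even transposition sort, which the Python sketch below simulates at the word level.  It is only an illustration: each LTSM is replaced by an ordinary comparison, "lower position" is assumed to mean the lower array index, and the returned clock count covers only the 32 clocks per step, not the LTSM output delay.

```python
WORD_BITS = 32  # clocks per step, one per bit of each integer

def compare_switch(a, b):
    """Word-level stand-in for one LTSM: return (smaller, larger)."""
    return (a, b) if a <= b else (b, a)

def streaming_bubble_sort(values):
    """Sort with N alternating steps; return (sorted list, clocks spent streaming)."""
    source = list(values)
    n = len(source)
    for step in range(1, n + 1):
        target = list(source)              # the other storage array
        start = 0 if step % 2 == 1 else 1  # odd steps: pairs (0,1),(2,3)...; even: (1,2),(3,4)...
        # On hardware every pair in a step is handled in parallel; here we loop.
        for i in range(start, n - 1, 2):
            target[i], target[i + 1] = compare_switch(source[i], source[i + 1])
        source = target                    # the arrays swap roles for the next step
    return source, n * WORD_BITS

result, clocks = streaming_bubble_sort([5, 1, 4, 2, 3])
print(result, clocks)  # prints [1, 2, 3, 4, 5] 160
```

The inner loop is the part that 4P performs in parallel, which is why the step count grows with N rather than with N squared.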

Note that the second storage array can be replaced by N columns of LTSMs.  The result would then be produced in N+6 clocks since the output of each LTSM would be fed into the next one on the next clock rather than waiting for the entire 32 bits to be processed and stored.

Finding the max and min of N 32-bit integers takes 32*Ln(N) clocks.  We expect similar results for a variety of basic computing tasks.  Furthermore, although it has not been proven, we anticipate that every conventional, serial Order N-squared operation will be solvable in Order N with 4P.
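One way to picture the max/min figure is as a tree of pairwise comparisons, where every level halves the number of candidates.  The Python sketch below counts those levels.  It assumes that Ln(N) in the text refers to the number of levels in such a tree (the base-2 logarithm of N, rounded up), which is an interpretation rather than something the text states.

```python
def tree_min_max(values):
    """Return (minimum, maximum, levels) using pairwise comparisons per level."""
    lows = highs = list(values)
    if not lows:
        raise ValueError("need at least one value")
    levels = 0
    while len(lows) > 1:
        # On hardware, every pair in a level would be compared in parallel.
        lows = [min(lows[i:i + 2]) for i in range(0, len(lows), 2)]
        highs = [max(highs[i:i + 2]) for i in range(0, len(highs), 2)]
        levels += 1
    return lows[0], highs[0], levels

print(tree_min_max([7, 2, 9, 4, 5, 1, 8, 3]))  # prints (1, 9, 3)
```

For 8 values the tree has 3 levels, so under the assumption above the comparison would take 32*3 clocks.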

4P Variations

Mixing 4P canvas with conventional DRAM

4P is not a single design but a family of designs that will be specialized for various applications.  For example, credit card chips will be small with few physical pins while supercomputer chips will be entire wafers.  Display driver chips will have many physical pins to feed the physical display.  Yet none of those cases really change the overall 4P canvas hardware structure.  

Here we look at a modification that will significantly alter the 4P canvas: mixing the 4P canvas with standard DRAM.  Note that this can be done regardless of the size of the chip.  A 4P chip designed to replace a PC CPU/GPU, for example, will need significant access to a large amount of memory.  While 4P itself provides memory, this memory is costly in terms of the amount of silicon used.  By mixing 4P with standard DRAM, we get high memory density combined with 4P streaming programmable architecture.    

The image shown is simply for demonstration rather than a practical application.  The point is that some layout that combines the two will be needed.  The layout will be chosen by running a variety of applications on 4P, studying their usage patterns, and finding the “sweet spot” in terms of proportions and geometric arrangement.  Certain applications will likely have different usage patterns and benefit from different layouts, but any 4P chip can run any program (provided enough processing area exists to physically hold the program).  In the end, market forces will ultimately determine how specialized 4P will get to serve customers having specific hardware requirements.