PCIe 3.0 vs 4.0 vs 5.0 vs PCIe 6.0 - What's the difference?

This article explains the differences between PCIe 3.0, 4.0, 5.0, and 6.0.

Note: If you buy something from our links, we might earn a commission. See our disclosure statement.

Demand for higher speed, lower cost, and robust interconnectivity keeps growing. To meet it, PCI-SIG continues to enhance the well-established PCIe architecture, which targets 64 GT/s in its upcoming release.

The architecture started out as a local bus interface (PCI) for every type of I/O device in the PC industry. It has since evolved into a point-to-point, link-based interface (PCI Express). This evolution serves the I/O requirements of the cloud, enterprise, artificial intelligence, PC, embedded, IoT, automotive, and mobile market segments.

Source: https://pcisig.com/

This is possible because the PCI Express (PCIe) architecture delivers cost-effective, power-efficient solutions suited to high-volume manufacturing (HVM). These solutions offer high bandwidth and low latency across six generations of technology development. Each generation doubles the data rate while maintaining full backward compatibility with all earlier generations, which protects customer investments.

One notable feature of the PCIe specification is that, although it supports multiple widths and data rates to match the performance needs of different devices across a broad range of usage models, devices of different widths and data rates interoperate with each other.

As a result, both platform and silicon developers can design and validate to a single specification. Although various form factors (such as M.2, U.2, CEM, and several types of SFF) have been developed to meet the requirements of different systems across the compute range, all of them use the same silicon components based on a common PCIe base specification.

PCI technology has proved to be a successful and ubiquitous I/O interconnect. This is because it is an open industry standard backed by a strong compliance program that ensures interoperability between devices from different companies. PCI-SIG, a consortium of 800+ member companies worldwide, owns and manages the PCI specifications and runs the compliance program. PCI-SIG expects PCIe technology to keep evolving to meet the varied I/O needs of the whole compute range for years to come.

This article takes a detailed look at PCIe technology and its evolution, with a focus on the fourth through sixth generations.

The First Three PCIe Generations at 2.5, 5.0, and 8.0 GT/s:

PCIe technology debuted in 2003 at a data rate of 2.5 GT/s. It supported widths of x1, x2, x4, x8, and x16 for different bandwidth levels, and the supported widths have remained unchanged through all six generations of the PCIe architecture. In 2006, the PCIe 2.0 specification doubled the data rate to 5.0 GT/s. The first two generations used 8b/10b encoding, which sends 10 bits on the wire for every 8 bits of data, an encoding overhead of 25%. This encoding was needed to maintain DC balance (avoiding drift or offset) and to provide the additional encodings needed by the physical layer and the link-training handshake.

For the PCIe 3.0 specification, a deliberate decision was made to raise the data rate to 8.0 GT/s. This was paired with a new 128b/130b encoding scheme so that bandwidth per pin still doubled over PCIe 2.0. The new encoding also preserved high reliability by retaining a fault model that detects up to three random bit flips. In addition, it introduced several innovative techniques for physical-layer framing of packets without changing the packet format delivered from the upper layers.

The choice of 8.0 GT/s, rather than a straight doubling to 10.0 GT/s, was backed by data from extensive analysis. The goal was to ensure that PCIe 3.0 could run on existing channels with the silicon and platform capabilities expected when the specification was introduced in 2010.
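To see why 8.0 GT/s with 128b/130b encoding still roughly doubles the usable bandwidth of PCIe 2.0, the short Python sketch below works through the arithmetic using only the data rates and encodings quoted above. The function name and table layout are illustrative, and the figures are per lane, per direction, before any packet overhead.

```python
# Raw signaling rate (GT/s) and line encoding (payload bits, total bits on the wire)
# for the first three generations, as quoted in the text.
GENERATIONS = {
    "PCIe 1.0": (2.5, 8, 10),
    "PCIe 2.0": (5.0, 8, 10),
    "PCIe 3.0": (8.0, 128, 130),
}

def lane_bandwidth_gbps(rate_gt_s, payload_bits, total_bits):
    """Usable Gb/s on one lane in one direction after line encoding."""
    return rate_gt_s * payload_bits / total_bits

for name, (rate, payload, total) in GENERATIONS.items():
    gbps = lane_bandwidth_gbps(rate, payload, total)
    # Divide by 8 to convert Gb/s to GB/s for a x16 link.
    print(f"{name}: {gbps:.2f} Gb/s per lane, {gbps * 16 / 8:.2f} GB/s for x16")

# Ratio of PCIe 3.0 to PCIe 2.0 usable bandwidth per lane.
gen3_over_gen2 = lane_bandwidth_gbps(8.0, 128, 130) / lane_bandwidth_gbps(5.0, 8, 10)
print(f"PCIe 3.0 / PCIe 2.0 = {gen3_over_gen2:.2f}x")
```

The raw rate only rises by 1.6x, but dropping the 8b/10b overhead brings the usable bandwidth ratio close to 2x.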

The decision also had to satisfy projected power and cost constraints. The PCIe 3.0 specification introduced backchannel equalization to mitigate the effects of channel loss across the range of platforms and silicon components. The encoding and equalization mechanisms of the PCIe 3.0 architecture proved robust enough to carry through the speed increases of subsequent generations.

Beyond higher speed, PCIe evolved during this period to support features such as I/O virtualization and device sharing, following the emerging trend of running many independent virtual machines and containers on a single platform. Further features were added to meet the performance demands of accelerators, including atomics, caching hints, and lower-latency accesses through enhanced transaction bypass semantics.

The low-power states were extended with deeper low-power states to support handheld segments such as tablets and smartphones. In these states, devices retain their state for a faster resumption of traffic while power consumption stays in the single-digit microwatts.

The combination of low idle power, best-in-class power efficiency in the active state (approx. 5 pJ/bit), and a fast transition between the two made the PCIe architecture the interconnect of choice across both low-power and high-performance segments.

October 2017: PCIe 4.0 Specification at 16.0 GT/s:

Doubling the data rate from 8.0 GT/s to 16.0 GT/s took longer, to make sure that silicon and platform components could evolve in a power-efficient and cost-effective way and so keep the technology transition seamless. The channel loss budget was raised to 28 dB.

Routing materials kept improving, with newer and affordable materials such as Megtron-2, -4, and -6 offering better loss characteristics. Together with advances in packaging technology, this made the transition viable within the cost and power constraints of platforms with hundreds of lanes.

Even so, this was not enough to cover longer channels such as 20 inches with two connectors. With the packaging technology and board materials available in these systems, about 15 inches of board trace through one connector and an add-in card could be supported.

Retimers were therefore formally specified as channel-extension devices. A retimer implements the entire physical layer and doubles the supported channel loss. A link may contain at most two retimers, which enables longer-reach channels with the PCIe architecture.
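As a rough illustration of how retimers extend reach, the Python sketch below assumes that each retimer splits the link into independent segments, each with the full 28 dB budget mentioned above. That per-segment model and the function name are simplifying assumptions for illustration, not wording taken from the specification.

```python
MAX_RETIMERS = 2            # limit stated in the text
PCIE4_LOSS_BUDGET_DB = 28   # PCIe 4.0 channel loss budget stated in the text

def end_to_end_loss_budget_db(retimers, per_segment_db=PCIE4_LOSS_BUDGET_DB):
    """Total loss a link can tolerate if every segment gets the full budget (assumed)."""
    if not 0 <= retimers <= MAX_RETIMERS:
        raise ValueError(f"a link may contain 0 to {MAX_RETIMERS} retimers")
    segments = retimers + 1  # each retimer adds one more segment
    return segments * per_segment_db

for n in range(MAX_RETIMERS + 1):
    print(f"{n} retimer(s): up to {end_to_end_loss_budget_db(n)} dB end to end")
```

Under this model, one retimer doubles the tolerable loss and two retimers triple it.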

PCIe 4.0 technology allows more outstanding transactions to sustain the ever-increasing bandwidth, through scaled flow-control credit mechanisms and extended tags. It also enhanced the reliability, availability, and serviceability (RAS) features to enable the migration to direct-connected PCIe storage devices through downstream port containment. In addition, systems can run non-destructive lane margining without interrupting system operation. As always, these enhancements are designed to last through several generations of bandwidth increases.

At this point, a natural question arises: how did systems cope with I/O bandwidth demands when the transition from the PCIe 3.0 to the PCIe 4.0 specification took much longer than usual because of the inflection points that had to be addressed? The answer is nuanced.

Platforms that debuted with the PCIe 3.0 architecture had about 40 PCIe lanes per CPU socket. Just before the transition to the PCIe 4.0 specification, lane counts per CPU socket grew sharply, reaching up to 128 lanes per socket on a few HEDT platforms. So although per-slot bandwidth did not increase, aggregate I/O bandwidth in platforms roughly tripled, both in lane count and in measured I/O bandwidth.

Storage tends to be an aggregate bandwidth driver, since each storage device connects to the system through a narrow link (such as x2 or x4). Growing storage demand was therefore met by increasing the lane count. Networking, in contrast, is a single-slot usage. During this period it moved from 10 Gb/s to 100 Gb/s and dual 100-Gb/s network interface cards (NICs). NICs absorbed the extra bandwidth by moving from x4 to x16 width (2 x16 for dual 100-Gb/s NICs), a reasonable tradeoff in terms of power, cost, and performance. Accelerators and GPGPUs evolved as well, making data transfers more efficient through protocol hints and appropriate transaction sizing.

Overall, the slower speed evolution from the PCIe 3.0 to the PCIe 4.0 architecture was offset by an increase in width, thanks to the flexibility of the PCIe specification. The ecosystem evolved naturally, so the speed transition could be made cost-effectively and with an eye on power consumption.

May 2019: PCIe 5.0 Specification at 32.0 GT/s:

Image: https://pcisig.com/

The last several years have brought a significant shift in the computing landscape. Cloud computing, edge computing, and applications such as machine learning, artificial intelligence, and analytics have driven demand for faster processing and data movement. Because compute and memory capability grows exponentially, I/O bandwidth must keep doubling at an accelerated cadence to keep up with the performance of emerging applications.

For example, 400 Gb (or dual 200 Gb) networking requires an x16 PCIe link at 32.0 GT/s to sustain its bandwidth. This called for a fully backward-compatible PCIe 5.0 interface within two years of the release of the PCIe 4.0 architecture, a noteworthy achievement for a standard.
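The 400 Gb claim can be checked with a quick calculation. The Python sketch below compares the one-direction bandwidth of an x16 link at each 128b/130b data rate with a NIC's line rate. It ignores packet and protocol overhead, so it is a lower bound on what is needed rather than a sizing tool, and the function names are illustrative.

```python
# Data rates (GT/s) of the generations that use 128b/130b encoding, per the text.
RATES = {"PCIe 3.0": 8.0, "PCIe 4.0": 16.0, "PCIe 5.0": 32.0}

def link_bandwidth_gbps(rate_gt_s, lanes=16):
    """One-direction link bandwidth in Gb/s after 128b/130b encoding."""
    return rate_gt_s * lanes * 128 / 130

def first_generation_for(nic_gbps, lanes=16):
    """Return the first generation whose link can carry nic_gbps, or None."""
    for name, rate in RATES.items():
        if link_bandwidth_gbps(rate, lanes) >= nic_gbps:
            return name
    return None

for nic in (100, 200, 400):
    print(f"{nic} Gb/s NIC on x16: needs at least {first_generation_for(nic)}")
```

An x16 link at 16.0 GT/s carries about 252 Gb/s, which falls short of 400 Gb/s, while 32.0 GT/s provides about 504 Gb/s.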

The move from the PCIe 4.0 to the PCIe 5.0 specification was primarily a speed upgrade. The 128b/130b encoding, the protocol support needed to scale bandwidth to higher data rates, was already built into the PCIe 3.0 and PCIe 4.0 specifications. The channel loss budget was extended to 36 dB, along with connector enhancements to reduce loss over the wider frequency range. Thanks to improvements in board materials and packaging technology, the channel reach is similar to that of PCIe 4.0 technology, and retimers can be used to extend it.

One of the improvements introduced with the PCIe 5.0 architecture is built-in support for alternate protocols. Because PCIe technology has become the highest-bandwidth, most power-efficient, and most widely deployed interface, some usages need additional protocols, such as coherency and memory, to run on the same pins as the PCIe architecture. For instance, certain accelerators and smart NICs may cache system memory and map their own memory into the system memory space for efficient data exchange and atomics, in addition to using PCIe protocols. Likewise, system memory is moving to the PCIe PHY because of the high-bandwidth, low-latency, and power-efficient solution it offers.

Other protocols are being deployed as well; for example, symmetric cache coherency between components over the PCIe PHY. Support for alternate protocols over the PCIe PHY meets these needs while avoiding the fragmentation of the ecosystem into different PHYs for different usages.

Targeted for 2021 Release: PCIe 6.0 Specification at 64 GT/s:

Source: https://pcisig.com/

PCIe 6.0 Specification Features Overview

A data rate of 64 GT/s, doubling the 32 GT/s data rate of the PCIe 5.0 specification
A move to PAM4 (Pulse Amplitude Modulation with 4 levels) encoding
Low-latency Forward Error Correction (FEC) with additional mechanisms to improve bandwidth efficiency
Backward compatibility with all previous specification generations

The accelerated drive continues: the bandwidth doubles once again in two years, in a backward-compatible way. Applications such as machine learning, AI, visual computing, gaming, storage, and networking keep demanding more bandwidth, in a virtuous cycle where higher throughput enables new capabilities that in turn drive further demand.

Devices such as accelerators, GPUs, coherent interconnects, high-end networking (800 Gb/s), and memory expanders continue to demand more bandwidth at an accelerated pace. Applications in constrained form factors that cannot increase width also need higher frequency to deliver performance.

Doubling the data rate beyond 32.0 GT/s with NRZ (non-return-to-zero) signaling poses significant challenges because of channel loss. PCIe 6.0 therefore adopts PAM4 (pulse-amplitude modulation, 4-level) signaling, which networking standards widely adopted as they moved to data rates of 56 Gb/s and higher. PAM4 uses four voltage levels to encode two bits in the same unit interval (UI). This keeps the PCIe 6.0 UI (and Nyquist frequency) the same as in the PCIe 5.0 architecture.
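The Python sketch below illustrates the idea: NRZ sends one bit per UI using two levels, while PAM4 sends two bits per UI using four levels, so doubling the bit rate does not change the symbol rate. The Gray-coded level mapping and the function names are assumptions chosen for illustration, not the specification's exact definition.

```python
# Gray-coded mapping from a bit pair to one of four levels (assumed for illustration).
# Adjacent levels differ in only one bit, so a one-level slip corrupts a single bit.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def nrz_encode(bits):
    """One symbol (level 0 or 1) per bit."""
    return list(bits)

def pam4_encode(bits):
    """One symbol (level 0..3) per pair of bits."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def nyquist_ghz(data_rate_gt_s, bits_per_ui):
    """Nyquist frequency is half the symbol rate."""
    return data_rate_gt_s / bits_per_ui / 2

bits = [0, 0, 0, 1, 1, 1, 1, 0]
print("NRZ symbols: ", nrz_encode(bits))    # 8 UIs for 8 bits
print("PAM4 symbols:", pam4_encode(bits))   # 4 UIs for the same 8 bits
print("Nyquist at 32 GT/s NRZ: ", nyquist_ghz(32.0, 1), "GHz")
print("Nyquist at 64 GT/s PAM4:", nyquist_ghz(64.0, 2), "GHz")
```

Both configurations come out at a 16 GHz Nyquist frequency, which is why the channel sees a similar frequency range in both generations.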

Key metrics for PCIe 6.0 development
Metrics	Requirements
Data rate	64 GT/s, PAM4 (double the bandwidth per pin every generation)
Latency	<10 ns adder for transmitter + receiver over 32.0 GT/s (including FEC)
Bandwidth inefficiency	<2% adder over PCIe 5.0 across all payload sizes
Reliability	0 < FIT << 1 for a x16 link (FIT = failures in time, the number of failures in 10^9 hours)
Channel reach	Similar to PCIe 5.0 under a similar setup for retimer(s) (maximum 2)
Power efficiency	Better than PCIe 5.0
Low power	Similar entry/exit latency for the L1 low-power state; addition of a new power state (L0p) to support scalable power consumption with bandwidth usage without interrupting traffic
Plug and play	Fully backward compatible with PCIe 1.x through PCIe 5.0
Others	HVM-ready, cost-effective, scalable to hundreds of lanes in a platform

PAM4 reduces channel loss by running at half the frequency (with two bits per UI). However, it is more susceptible to errors from various noise sources because of the reduced voltage ranges between levels. This shows up as a higher bit error rate (BER), several orders of magnitude above the 10^-12 BER of the PCIe 1.0 through PCIe 5.0 specifications.

Another side effect is the correlation of errors, caused by correlated error sources such as power-supply noise and by error propagation within the same lane through the decision feedback equalizer (DFE). These effects are countered with a forward-error-correction (FEC) scheme. FEC has a drawback: it reduces link efficiency because of the FEC bits, and it adds latency for encoding and decoding. The stronger the FEC, the worse these performance characteristics become, although the effective bit error rate improves through correction.

For example, some existing standards incur a bandwidth loss of 11% and an FEC latency of more than 100 ns. That does not meet the bandwidth and latency requirements of a load-store interconnect such as PCIe technology. PCIe 6.0 specification development follows guardrails for the key metrics listed in the table above. These are difficult goals that have not been met before, but they must be met for PCIe to remain an efficient interconnect.

FEC requires a fixed-size FLIT (flow control unit) on which to apply the correction. With a fixed FLIT size, it also makes sense to run the error-detection mechanism (cyclic redundancy check, CRC) on the FLIT. PCIe defines data-link-layer packets (DLLPs) and transaction-layer packets (TLPs) of varying sizes, so the payload is mapped onto FLITs. A FLIT can therefore carry multiple TLPs and DLLPs, and a TLP or DLLP may span multiple FLITs.
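A minimal Python sketch of this packing idea follows. It fills fixed-size FLITs with variable-size TLPs, so that small TLPs share a FLIT and a large TLP spills into the next one. The 236-byte TLP area per FLIT is an illustrative assumption (the text above does not give a size), the function name is hypothetical, and DLLPs, CRC, and FEC bytes are left out.

```python
FLIT_TLP_BYTES = 236  # illustrative assumption: bytes of each FLIT available for TLPs

def pack_into_flits(tlp_sizes, flit_bytes=FLIT_TLP_BYTES):
    """Pack TLPs (given by size in bytes) back to back into fixed-size FLITs.

    Returns one list per FLIT, holding (tlp_index, bytes_carried) fragments.
    """
    flits = [[]]
    free = flit_bytes
    for index, size in enumerate(tlp_sizes):
        remaining = size
        while remaining > 0:
            if free == 0:            # current FLIT is full, start a new one
                flits.append([])
                free = flit_bytes
            chunk = min(remaining, free)
            flits[-1].append((index, chunk))
            remaining -= chunk
            free -= chunk
    return [flit for flit in flits if flit]  # drop the empty FLIT when there is no input

# Two 16-byte (4DW) requests followed by a 272-byte TLP.
for number, flit in enumerate(pack_into_flits([16, 16, 272])):
    print(f"FLIT {number}: {flit}")
```

In this example the first FLIT carries both small TLPs plus the first 204 bytes of the large one, and the remaining 68 bytes land in the second FLIT.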

Because the FLIT carries the CRC in this new mode, TLPs and DLLPs no longer carry their own CRC bytes as they did in earlier generations. And since FLITs have a fixed size, no PHY-layer framing token is needed for each TLP or DLLP. These savings improve efficiency enough to offset the FEC overhead.

Packet efficiency in the PCIe 6.0 architecture exceeds that of earlier generations for payloads up to 512 bytes. For example, a 4DW request TLP (DW stands for double word, which is 4 bytes) has a TLP efficiency of 0.92 with FLIT-based encoding, compared to 0.62 in prior generations. This yields roughly a threefold improvement in effective throughput (2x from the data rate increase and about 1.5x from the better TLP efficiency). As TLP size grows, the efficiency advantage shrinks: for a 4-KB data payload, the efficiency relative to prior generations falls to about 0.98, in line with the bandwidth inefficiency metric in the table.
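The threefold figure follows directly from the two efficiency numbers quoted above. The small Python sketch below multiplies the data rate ratio by the efficiency ratio; the function name is illustrative.

```python
def effective_throughput_gain(rate_ratio, new_efficiency, old_efficiency):
    """Combined gain from a higher data rate and a better packet efficiency."""
    return rate_ratio * new_efficiency / old_efficiency

# 4DW request TLP: 0.92 with FLIT-based encoding vs. 0.62 in prior generations.
efficiency_gain = 0.92 / 0.62
total_gain = effective_throughput_gain(2.0, 0.92, 0.62)

print(f"Efficiency gain: {efficiency_gain:.2f}x")
print(f"Total gain:      {total_gain:.2f}x")
```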

Figure 5 shows the tradeoffs between the raw burst error rate on the wire (an error that propagates to multiple bits counts as one error) and the effectiveness of different FEC schemes in handling that error. A single-symbol error-correcting code (ECC) corrects one error burst, while a double-symbol ECC corrects up to two error bursts.

Burst length follows a probability distribution function, and the ECC code is defined so that the probability of a burst exceeding the ECC capability is small. Silicon data and simulations are used to weigh the tradeoff between the nature of the bursts, the error rate, the silicon capability, and the channel constraints. PCIe 6.0 targets a burst error probability of 10^-6, which leads to a FLIT retry probability of approx. 10^-6.

Because PCIe has a low-latency link-level retry mechanism, there is no need for a strong FEC that would increase bandwidth overhead and latency. A retry probability of 10^-6 is a reasonable tradeoff that results in an FEC latency adder of 1-2 ns in each direction. When a retry does occur (with probability 10^-6), the FLIT is delayed by ~100 ns because of the round-trip retry mechanism. This is a reasonable tradeoff compared to adding 100+ ns to every FLIT with a strong FEC and paying a high bandwidth cost. Details of the PCIe 6.0 specification are available to members on the PCI-SIG website.
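The tradeoff can be made concrete by comparing the average added latency per FLIT under the two approaches. The Python sketch below uses the figures quoted above (a 2 ns FEC adder, a 10^-6 retry probability, and a ~100 ns retry penalty) and assumes, for simplicity, that the strong FEC never needs a retry. The function name and that simplification are illustrative.

```python
def expected_added_latency_ns(fec_ns, retry_probability, retry_penalty_ns):
    """Average latency added per FLIT: FEC cost plus the expected retry cost."""
    return fec_ns + retry_probability * retry_penalty_ns

# Lightweight FEC backed by link-level retry (figures from the text).
light_fec = expected_added_latency_ns(fec_ns=2.0, retry_probability=1e-6, retry_penalty_ns=100.0)

# Strong FEC at 100 ns per FLIT, assumed never to need a retry.
strong_fec = expected_added_latency_ns(fec_ns=100.0, retry_probability=0.0, retry_penalty_ns=0.0)

print(f"Light FEC + retry: {light_fec:.4f} ns per FLIT on average")
print(f"Strong FEC:        {strong_fec:.4f} ns per FLIT")
```

With these numbers, the rare 100 ns retry adds only about 0.0001 ns on average, so the lightweight approach stays close to the 2 ns FEC adder.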

Conclusion:

PCI-SIG has a strong, three-decade track record of managing multiple technology transitions in a backward-compatible way. It is therefore well placed to keep the evolving computing landscape moving forward.

The commitment and strength of this open standards organization, backed by the combined innovation capability of 800+ member companies, make the technology innovative, agile, cost-effective, scalable, power-efficient, and multi-generational. It remains relevant across all market segments and usage models for the foreseeable future.
