i7-8665UE USB 3.1Gen2 aggregate BW?

mikoz · ‎07-06-2020

What is the total available BW from the Whiskey Lake i7-8665UE PCH controller to the USB3.1 Gen2 USB ports?
In other words, suppose I had two enumerated USB3.1 Gen2 devices, each capable of simultaneous 9Gbs Tx and Rx (real bulk-packet data rates; think of an NVMe drive without the overhead of file transfers). If both were connected at once and I wished to send and receive data on both ports simultaneously, what speed can be realized through the PCH of this design?
Assume output from these USB devices goes to /dev/null and input to them is from DDR4-2400 (which has more than enough BW to supply 2*9Gbs streams).

n_scott_pearson · ‎07-06-2020

The processor and the chipset (PCH component) are connected by the DMI Bus. In SoC designs (like this Core i7-8665UE processor), the OPI Bus replaces the DMI Bus, but the performance of the two is essentially the same (unless OPI is configured to run at half speed to save power). The DMI Bus is equivalent in performance to 4 PCIe lanes and thus achieves 32GT/s in aggregate (4 x 8GT/s), which is roughly 4GB/s. Ignoring the minor interfaces (SMBus, I2C, SPI, etc.), the bandwidth is spread across 16 HSIO lanes, and these lanes are shared by the exposed (up to 12) PCIe lanes, (up to 3) 6Gb/s SATA ports, (up to 6) 5Gb/s USB 3.0 ports and a Gb/s Ethernet port. While many of these interfaces do not saturate the PCIe lanes they are assigned to, you can easily see that saturation of the DMI/OPI bus is going to occur in many circumstances (even more if the OPI is running at half speed). Because of this, it is impossible to predict what the throughput of any individual lane will be at any point in time.
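
To put rough numbers on the saturation argument above, here is a minimal sketch in Python. It assumes the standard line encodings (128b/130b for PCIe Gen3, 8b/10b for SATA 6Gb/s and USB 3.0) and a purely hypothetical split of the 16 HSIO lanes; the actual split is a board design decision, and the "up to" counts quoted above cannot all be reached at once. All rates are per direction.

```python
# Rough oversubscription estimate for the DMI/OPI uplink (per direction).
# The lane split below is a hypothetical allocation of the 16 HSIO lanes.

def payload_gbps(line_rate_gbps, data_bits, total_bits, count=1):
    """Usable Gb/s after line encoding for `count` identical links."""
    return line_rate_gbps * data_bits / total_bits * count

# Uplink: equivalent to four PCIe Gen3 lanes (8 GT/s each, 128b/130b).
uplink_gbps = payload_gbps(8.0, 128, 130, count=4)

# Hypothetical split: 8 PCIe + 2 SATA + 5 USB 3.0 + 1 GbE = 16 lanes.
downstream_gbps = {
    "PCIe Gen3 x8": payload_gbps(8.0, 128, 130, count=8),
    "SATA 6Gb/s x2": payload_gbps(6.0, 8, 10, count=2),
    "USB 3.0 x5": payload_gbps(5.0, 8, 10, count=5),
    "GbE x1": 1.0,
}

peak_demand_gbps = sum(downstream_gbps.values())
oversubscription = peak_demand_gbps / uplink_gbps

print(f"uplink       : {uplink_gbps:6.2f} Gb/s (~{uplink_gbps / 8:.2f} GB/s)")
for name, rate in downstream_gbps.items():
    print(f"{name:<13}: {rate:6.2f} Gb/s")
print(f"peak demand  : {peak_demand_gbps:6.2f} Gb/s")
print(f"ratio        : {oversubscription:.2f}x")
```

The ratio only describes the worst case where every interface peaks at the same moment; it says nothing about typical load.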

I didn't mention USB 3.1 Gen2. It takes the full bandwidth of a PCIe lane to support a USB 3.1 Gen2 port. The Controller IC that supports two USB 3.1 Gen2 ports thus consumes two PCIe lanes. In many cases, because of the bandwidth bottlenecks in the PCH, implementations will use processor PCIe lanes to support USB 3.1 Gen2 and thus guarantee 10Gb/s bandwidth per port.

Clear as mud?

...S

mikoz · ‎07-06-2020

Thanks. However, what if there's minimal or no internet traffic or SATA traffic and the other PCIe lanes are largely idle as well? Would this create a more predictable set of circumstances?

Am I also understanding you correctly that if I were to take 2 of those PCIe Gen3 lanes per USB port that I require (I need 2 USB ports, so I'd need 4 lanes) and use an external USB controller, I would stand a much better chance of meeting the speed requirements, even if the other interfaces like GbE and SATA are largely idle during this time? Said another way, there are two ways I could TRY to get to my speed requirements. I could:

(a) use the native ports connected to the PCH

(b) take Gen3x2 PCIe lanes for each port (with a total of 4 Gen3 lanes) and use two PCIe->USB controller ICs available commercially.

It sounds like (b) would yield superior and more deterministic results. Is this what you are also confirming?

It seems like you're overall saying that it's just not possible to get a deterministic and optimal speed situation from the USB ports connected through the PCH, under the conditions I outlined, due to the limitations of the PCH itself. Is this an accurate statement?

n_scott_pearson · ‎07-06-2020

No. While I said the USB 3.1 Gen2 Controller IC consumes two PCIe lanes, it supports two USB 3.1 Gen2 ports. A PCIe Gen3 lane can support 8GT/s, or roughly 8Gb/s (about 1GB/s), which is close to what you need to support a single USB 3.1 Gen2 port. If the system is relatively idle otherwise (i.e. not using SATA lanes, (other) PCIe lanes, GbE or USB 3.0 too heavily), the USB 3.1 Gen2 ports should be able to approach their theoretical speed. The question is whether these other ports remain idle enough to allow this for any appreciable period of time.

I am not sure that you can blame the PCH for this bottleneck. It is the DMI/OPI bus that is saturating. Intel walked a razor's edge in determining how far to take the DMI bus performance. Costs (and power utilization) would have gone up considerably if they had, say, doubled the width and transfer capabilities of this bus. They chose not to do that - knowing that the bottlenecks would exist - by (also) looking at the aggregate performance needed. If everything peaks at the same time, then bottlenecking occurs, but do they all peak at the same time? No.

...S

mikoz · ‎07-06-2020

Thanks for your 2nd reply.

1. To confirm, are you saying that two 10Gbs USB3.1Gen2 USB ports are serviced by only two PCIe Gen3 lanes total (which means an aggregate BW of 16Gbs, minus obviously protocol and coding overhead)? Obviously, if it's only two lanes for two Gen2 USB ports, you'd never be able to get > 8Gbs on both ports at the same time no matter what's going on with this bus. I wasn't clear on your reply here.

2. I am not purposefully picking on the PCH; there could be some other bus such as DMI/OPI that causes the restriction. I am just mentioning the PCH because the USB ports stem from it on the block diagrams. If this other bus is the bottleneck, I'll adjust my verbiage. Is there any documentation available on this, or is it something a bit mysterious on purpose?

3. Depending on your answer to Q1 above, it would seem that if I used 4 Gen3 lanes and brought out two PCIe to USB controllers (one for each USB port, consuming TWO pcie Gen3 lanes), I'd stand a much better chance of meeting my speed objectives, correct? Would this DMI/OPI bus still be an issue or would the "spare" bandwidth of the 16Gbs of the two PCIe lanes servicing an IC which converts the 2 Gen3 lanes into a USB port and does away with the native ports with a requirement of only 10Gbs max (9Gbs in real life) be enough to overcome such bottlenecks?
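
The lane arithmetic behind questions 1 and 3 can be sketched as follows. This is a back-of-the-envelope Python calculation that assumes only line-encoding overhead (128b/130b for PCIe Gen3, 128b/132b for USB 3.1 Gen2) and ignores packet and protocol overhead, so real figures will be lower. The function name `headroom_gbps` is made up for illustration.

```python
# Per-direction payload rates after line encoding only.
PCIE_GEN3_LANE_GBPS = 8.0 * 128 / 130    # ~7.88 Gb/s per lane
USB31_GEN2_PORT_GBPS = 10.0 * 128 / 132  # ~9.70 Gb/s per port

def headroom_gbps(lanes, ports, target_per_port_gbps):
    """PCIe supply minus USB demand, in Gb/s (negative means a shortfall)."""
    supply = lanes * PCIE_GEN3_LANE_GBPS
    demand = ports * target_per_port_gbps
    return supply - demand

# Two ports sharing two lanes versus one port with two dedicated lanes.
shared = headroom_gbps(lanes=2, ports=2, target_per_port_gbps=9.0)
dedicated = headroom_gbps(lanes=2, ports=1, target_per_port_gbps=9.0)

print(f"USB 3.1 Gen2 port ceiling : {USB31_GEN2_PORT_GBPS:.2f} Gb/s")
print(f"2 lanes shared by 2 ports : {shared:+.2f} Gb/s")
print(f"2 lanes for a single port : {dedicated:+.2f} Gb/s")
```

Under these assumptions the shared case falls short of 2 x 9Gbs before any other traffic is considered, while the dedicated case has spare lane bandwidth. Whether that spare bandwidth survives the trip across DMI/OPI is a separate question.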

n_scott_pearson · ‎07-06-2020

1. Correct (but see my answer to #3).
2. Check out this article: Choosing the Right SSD for a Skylake-U System. It is not directly on topic but does talk about the bottleneck issue.
3. Well, this will depend upon what ICs there are out there that offer this capability. I just know about the one that they have used a couple of times on the NUC platforms, and which has two PCIe lane connections in and two USB 3.1 Gen2 ports out.

...S

mikoz · ‎07-07-2020

Is there any way to tell whether this bus is running at half speed, or are there any BIOS settings typically available to prevent that from happening?

Also, are the user-accessible PCIe lanes attached directly to the CPU, or do they go through the bus? The picture of the Whiskey Lake architecture shows PCIe lanes both on the CPU and attached to the PCH. Do we know which is correct?

CarlosAM_INTEL · ‎07-08-2020

Hello, @mikoz:

Thank you for contacting Intel Embedded Community.

You need to check the Port Status and Control USB3 (PORTSCXUSB3) register, specifically the Port Speed (PortSpeed) bit values, to determine the speed at which the cited ports are operating. This information and more details are given in section 21.2.20, on page 838 of the Cannon Lake PCH-LP External Design Specification (EDS) Volume 2, document #565870. It can be found when you are logged into your Resource & Design Center (RDC) privileged account on the following website:

http://www.intel.com/cd/edesign/library/asmo-na/eng/565870.htm

The RDC Account Support form is the channel to process your account update request or report any inconveniences with the provided site. It can be found at:

https://www.intel.com/content/www/us/en/forms/support/my-intel-sign-on-support.html

Also, you can use any of the tools listed at the following website for the cited purpose:

https://www.usb.org/usb32tools
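
As a rough software-side alternative to reading the port speed register, the negotiated link speed of each enumerated device can be read from Linux sysfs. The sketch below is an illustration only: it assumes a Linux host, reads the `speed` attribute (in Mb/s) under `/sys/bus/usb/devices`, and does not access PORTSCXUSB3 itself. The function name `usb_speeds` is made up. It reports the enumerated rate, not the bandwidth actually delivered to the port.

```python
from pathlib import Path

def usb_speeds(sysfs_root="/sys/bus/usb/devices"):
    """Return {device_name: negotiated speed in Mb/s} from Linux sysfs."""
    speeds = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return speeds  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        speed_file = dev / "speed"
        if not speed_file.is_file():
            continue  # interface entries have no speed attribute
        try:
            # Values look like "480", "5000", "10000"; low speed is "1.5".
            speeds[dev.name] = float(speed_file.read_text().strip())
        except (OSError, ValueError):
            continue
    return speeds

if __name__ == "__main__":
    for name, mbps in usb_speeds().items():
        print(f"{name:<12} {mbps:>8.1f} Mb/s")
```

A device that enumerated at Gen2 would show 10000 here.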

Best regards,

@CarlosAM_INTEL.

mikoz · ‎07-09-2020

I don't know/think that will help. I already know the enumerated status (HS, Gen1, Gen2), what I am after is the data rate allocated to the ports themselves. You can have a Gen2 USB device throttled to some really low number because of the controller or the BW allocated to it. Do these tools help me understand that?

What I really need is someone who can explain the following in sufficient detail:

SETUP:

I have a USB endpoint that's capable of 9Gbs Tx and Rx simultaneous transmission across a USB Gen2 port. I have two of them, I wish to run them concurrently on the USB ports of the Whiskey Lake PCH and hopefully observe the same throughput. Given that they get their data from the DDR4-2400 and, for this example, the code driving them will ignore the received data, there should be no other bottlenecks but the PCH or the xHCI controller. Also, given that option2 gives me this data rate, clearly the setup *can* work at such data rates, but not with the Intel xHCI controller.

OPTION1:

Use the native PCH ports with the xHCI controller. Results: I get poor performance compared to option 2 (which I'll list below). Even if I use only one USB device, I get poor performance compared to option 2.

OPTION2:

Use two ASMedia 3142 chips, each taking 2 Gen3 PCIe lanes that are also attached to the PCH. I get far superior results. I have each device connected to one ASMedia 3142. An ASMedia 3142 uses 2x Gen3 lanes (16Gbs) of BW maximum per chip. I won't give exact numbers but the difference is very large, over 1 Gbs of throughput per device is observed.

Since both options take data from the PCH and both scenarios are using the same code and the same physical USB devices why is option2 so much better? Can someone explain this?

Is it:

1. the intel xHCI controller is very poor? Meaning it has some BW limitations or ...? Or the ASMedia chipset is so much better?

2. the # of controllers makes a difference. But this won't hold water because even if I just use one device there's a large performance difference.

Given that both the native xHCI ports and the ASMedia chipset both use the PCH lanes which both go through the OPI bus, why is the performance with the intel controller setup so poor?

I just need an explanation to the behavior at a level sufficient for someone who would develop code and the HW for such a setup, in other words, fairly low level details.

Finally, I cannot get access to the first page you mentioned and I don't see how the second page helps me get access to the first.

CarlosAM_INTEL · ‎07-10-2020

Hello, @mikoz:

Thanks for your update.

So that we are on the same page, could you please clarify whether the affected project is your own design or a third-party one?

If it is a third-party device, could you please provide the manufacturer's name, the model, the part number, and where its documentation can be found?

Also, could you please let us know how many units of this project have been manufactured, how many are affected, and what the failure rate is?

Finally, could you please list the sources that you used to design it and whether it has been verified by Intel?

We are waiting for your answer to these questions.

@CarlosAM_INTEL.

mikoz · ‎07-10-2020

It comes down to the question that I posed: why are the USB ports of the PCH slower than USB ports provided by an ASMedia 3142 chipset that also uses lanes from the PCH?

It’s not a “failure rate” issue. The Intel PCH-based xHCI USB Gen2 ports work; they’re just slow compared to an alternative, and the reason is not clear to me.
We have observed this on several different Whiskey Lake embedded processors, even from different manufacturers; it’s not an issue with one unit.

Try this experiment:

1. Obtain a fast NVMe drive such as a Samsung 970 Pro and buy a USB3.1 Gen2 enclosure to host it.
2. Plug the enclosure into a native PCH port of the Whiskey Lake system.
3. Run various speed tests using dd or hdparm and record the read and write benchmark numbers.
4. Now use a commercially available PCIe board that has an ASMedia 3142 chipset and plug it into the PCIe slot of a carrier board for the same computer. This PCIe board will use two Gen3 lanes. The lanes will also be funneled through the PCH.

Here is one such board:

StarTech.com PCIe USB 3.1 Card - 2X USB C 3.1 Gen 2 10Gbps - PCIe Gen 3 x4 - ASM3142 Chipset - USB Type C PCI Express Card (PEXUSB312C3) https://www.amazon.com/dp/B087G7T234/ref=cm_sw_r_cp_tai_4zkcFbBYQK7EJ

5. Move the NVMe enclosure to one USB port of the ASMedia chipset-based board. In other words, unplug the drive and move it to the board mentioned in step 4.
6. Repeat the speed tests. Observe that the PCIe-based ASMedia controller significantly outperforms the Intel xHCI-based ports.

why is this? Can we explain this? Since both the USB ports and the pcie lanes used by the asm3142 both funnel through the Pch, why is the performance so much different?

The problem, meaning the intel controller speed difference, gets more severe if the experiment above is replicated twice over, meaning you double everything mentioned above...2 drives, 2 pcie cards, etc. In that scenario, if the speed tests are done in parallel the intel controller gets destroyed by the asmedia based boards.
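
For anyone wanting to reproduce the single-drive and doubled-up tests without dd or hdparm, a minimal sequential-read sketch in Python follows. The function names and the example device paths are hypothetical. Reading a raw block device normally needs root. This sketch does not bypass the page cache, so for real measurements read far more data than fits in RAM or use a direct-I/O tool such as dd with iflag=direct.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_throughput(path, chunk_size=4 * 1024 * 1024):
    """Sequentially read `path` to the end; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start

def run_parallel(paths):
    """Read every path concurrently; return (aggregate MB/s, per-path results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        results = list(pool.map(read_throughput, paths))
    elapsed = time.perf_counter() - start
    total_bytes = sum(n for n, _ in results)
    return total_bytes / elapsed / 1e6, results

if __name__ == "__main__":
    import sys
    # Example (hypothetical device names): python bench.py /dev/sda /dev/sdb
    if len(sys.argv) > 1:
        mbps, per_path = run_parallel(sys.argv[1:])
        for path, (n, secs) in zip(sys.argv[1:], per_path):
            print(f"{path}: {n / secs / 1e6:.1f} MB/s")
        print(f"aggregate: {mbps:.1f} MB/s")
```

Passing one path gives the single-drive number; passing two gives the parallel case, where a shared bottleneck shows up as an aggregate well below the sum of the single-drive results.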

I need someone to explain why the Intel native ports are so bad in terms of throughput. If both the native USB ports and the PCIe lanes feeding the ASM3142 chipset go through the PCH, why are the native Intel ports so slow?

Note finally the speed difference is not due to power. We have made sure that there’s sufficient 3amps 5vdc power to each nvme in all cases.

CarlosAM_INTEL · ‎07-13-2020

Hello, @mikoz:

Thanks for your reply.

In order to help you, we have contacted you via email.

Best regards,

@CarlosAM_INTEL.

AndrewG_Intel · ‎07-07-2020

Hello mikoz

Thank you for posting on the Intel® communities.

We noticed that your inquiries are regarding Intel® Core™ i7-8665UE Processor, which is for "Embedded" targets (Vertical Segment, https://ark.intel.com/content/www/us/en/ark/products/193554/intel-core-i7-8665ue-processor-8m-cache-up-to-4-40-ghz.html)

We hope that the information provided by the community has been useful. Also, we would like to inform you that we have a forum for those specific issues and inquiries regarding Embedded Processors so we have moved this thread to the Embedded Intel® Core™ Processors Forum so you can get answered more quickly.

Best regards,

Andrew G.

Intel Customer Support Technician

mikoz · ‎07-07-2020

Ok, sorry, actually it was an Intel rep on the phone who suggested I use this thread. But thanks for moving it.

I still have some open questions:

1. The Whiskey Lake architecture pictures show the PCIe lanes attached to both the CPU and the PCH. Which is correct? If I plug in a PCIe device, are the lanes attached to the CPU or do they go through the PCH? Here's an example of such a picture; notice the PCIe lanes are shown in 2 places:

https://www.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/whiskey-lake/overview.html

2. Can I determine if the DMI/OPI bus is operating at peak speed and not being throttled?

3. If the answer to #1 is that the PCIe lanes are still connected through the PCH, and if I had a chipset which used 2 dedicated PCIe lanes per USB port (meaning 16Gbs minus protocol and coding overheads), could this still allow me to get closer to 10Gbs? Said another way, if I need 2 USB Gen2 ports, and for each one I use 2 dedicated lanes (meaning 4 total lanes) and a dedicated chipset to host the USB ports, even if it's through the DMI/OPI bus, is this something that still could get us closer to my goal of 9Gbs per USB port with 2 USB ports running simultaneously?

n_scott_pearson · ‎07-07-2020

My responses by the numbers...

1. Both can be correct. This is a design decision of the board manufacturer. It may not affect your cases, but there are also plenty of boards that utilize PCIe bridge devices that simply take the resources of one PCIe lane and spread it across multiple downstream PCIe lanes. This is common, for example, in boards that need to support multiple PCIe graphics cards.
2. I don't know of any specific way to do so. It can at least be inferred by testing downstream device performance (yea, I know; that's not the kind of answer you are looking for, but it's all I can provide).
3. I don't think so. Making additional bandwidth available does not stop real-time factors and "protocol and coding overheads" from limiting the overall bandwidth possible (because they certainly will).
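
On the second point, one indirect check on Linux is to compare the negotiated link speed and width of each downstream PCIe device against its maximum. The sketch below is only an illustration: it assumes a Linux host whose kernel exposes the `current_link_speed`, `current_link_width`, `max_link_speed` and `max_link_width` attributes under `/sys/bus/pci/devices`. The function name `pcie_links` is made up. It does not reveal how the OPI link itself is configured; it only shows whether a downstream link (for example, the one feeding an add-in USB controller) trained below its capability.

```python
from pathlib import Path

ATTRS = ("current_link_speed", "current_link_width",
         "max_link_speed", "max_link_width")

def pcie_links(sysfs_root="/sys/bus/pci/devices"):
    """Return {bus_address: {attribute: raw string}} for PCIe devices."""
    links = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return links  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        info = {}
        for attr in ATTRS:
            try:
                info[attr] = (dev / attr).read_text().strip()
            except OSError:
                break  # attribute missing or unreadable: skip this device
        else:
            links[dev.name] = info
    return links

if __name__ == "__main__":
    for addr, info in pcie_links().items():
        note = ""
        if (info["current_link_speed"] != info["max_link_speed"]
                or info["current_link_width"] != info["max_link_width"]):
            note = "  <- below maximum"
        print(f"{addr}: {info['current_link_speed']} x{info['current_link_width']}"
              f" (max {info['max_link_speed']} x{info['max_link_width']}){note}")
```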

...S

mikoz · ‎07-07-2020

As to #1, I guess I'll contact the board maker.

As to #3, maybe I wasn't clear. I understand that protocol and coding overhead is *always* there. I was just trying to say that you never get to 8Gbs on a PCIe Gen3 lane no matter what, given the protocol and coding overheads. What I am saying is that you know my goal of getting 2x 9Gbs Tx and Rx streams running simultaneously on 2 USB 3.1Gen2 ports, and, given that, if the data is being funneled through the bus to the PCH, would I stand a *better* chance of getting 2 USB ports' worth of 9+Gbs if I used TWO LANES of PCIe per USB port using a commercially available chipset (*) per USB port, as opposed to using the native USB ports which share two PCIe lanes over the PCH?

Is that clear?

(*) these do exist, you can take 2 PCIe Gen3 lanes to service ONE USB 3.1Gen2 port.

n_scott_pearson · ‎07-07-2020

No, I understood your question - and No, I do not believe that the extra bandwidth will make a significant difference (certainly not get you to 9Gb/s). Why? Because most transfers are transactional in nature and thus will always have an effect in real-time. That is, a new transaction is not going to be requested while the previous is in flight, essentially wasting most of the additional bandwidth that you have provided. Yes, you'll get a bit, but not as much as you would expect.
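
The transactional argument can be illustrated with a deliberately simple model, sketched here in Python. Every number in it is an assumption chosen for illustration (64KiB transfers, a 50 microsecond fixed turnaround per transfer), not a measurement of any controller. With only one transfer in flight, each transfer costs its wire time plus the turnaround, so doubling the link rate shortens only part of the cycle.

```python
def effective_gbps(link_gbps, transfer_bytes, round_trip_us, in_flight=1):
    """Throughput when at most `in_flight` transfers are outstanding.

    Each transfer costs its wire time plus a fixed turnaround delay.
    The result is capped at the link rate.
    """
    bits = transfer_bytes * 8
    wire_us = bits / (link_gbps * 1e3)        # 1 Gb/s equals 1e3 bits per microsecond
    per_transfer_us = wire_us + round_trip_us
    rate_gbps = in_flight * bits / per_transfer_us / 1e3
    return min(link_gbps, rate_gbps)

# Assumed workload: 64 KiB transfers with a 50 microsecond turnaround.
one_lane = effective_gbps(8.0, 64 * 1024, 50.0)
two_lanes = effective_gbps(16.0, 64 * 1024, 50.0)
queued = effective_gbps(8.0, 64 * 1024, 50.0, in_flight=4)

print(f" 8 Gb/s link, 1 in flight: {one_lane:.2f} Gb/s")
print(f"16 Gb/s link, 1 in flight: {two_lanes:.2f} Gb/s")
print(f" 8 Gb/s link, 4 in flight: {queued:.2f} Gb/s")
```

In this model the gain from doubling the link is well under 2x, while keeping several transfers outstanding recovers the full link rate. How many transfers a given host controller and driver actually keep in flight is not something this thread establishes.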

...S