Hi,
I am currently trying to use the Data Streaming Accelerator (DSA) to (1) read two inputs, (2) do XOR, and (3) store the outputs.
According to the “Intel® Data Streaming Accelerator Architecture Specification (v3.0)”, this kind of compute operation appears to be supported by the hardware.
However, when using the Intel Data Movement Library (DML) or when using DSA directly via idxd.h, I cannot find any interface for XOR operations -- only memory copy, memory fill, ...
Are the compute operations described in the DSA architecture spec currently unavailable in software stacks?
If they are available, is there any guideline, header, or example for using them?
Thanks.
Hi asdasf,
Greetings for the day!
Thank you for reaching out to Intel Support. We acknowledge receipt of your concern and would like to assure you that assisting you is our top priority.
To assist you further, we require some additional information from your end.
Kindly provide the system details and the processor model for which you are seeking the necessary information.
This will help us review the complete details and assist you further.
We appreciate your understanding!
Best regards,
Poojitha N
Intel Customer Support Technician
Thank you for your prompt response.
Regarding your request, please find the system details below:
Product / Platform
Intel Xeon Platinum 8558 (2× sockets)
OS / Kernel / Drivers
OS: Ubuntu 25.04
Kernel: 6.14.0-1007-intel
DSA driver: idxd v1.0
Library: Using the linux/idxd.h UAPI headers for descriptor submission
accel-config version: accel-config 4.1.8+
Issue summary
We are attempting to submit a descriptor for a Reduce/XOR operation through the user-space write(fd, &desc, …) submission path using /dev/dsa/wq*. The device returns completion status 0x10 (DSA_COMP_BAD_OPCODE). The same descriptor pipeline works for DSA_OPCODE_MEMMOVE.
This raises the question of whether Reduce/XOR opcodes are currently supported on this CPU/driver combination or require a newer DSA specification / microcode / driver stack.
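For readers decoding the completion record, a minimal Python sketch of mapping the raw status byte to a name follows. The numeric values are recalled from enum dsa_completion_status in linux/idxd.h, and the table is deliberately partial; treat both as assumptions and check them against the header on your system. describe_status is a hypothetical helper, not part of any Intel library.

```python
# Completion-status values as recalled from linux/idxd.h
# (enum dsa_completion_status). ASSUMPTION: verify against your own header.
DSA_COMP_STATUS_MASK = 0x7F  # assumed mask for the status code bits

# Partial table: only a handful of codes are listed here.
DSA_COMP_NAMES = {
    0x00: "DSA_COMP_NONE",
    0x01: "DSA_COMP_SUCCESS",
    0x02: "DSA_COMP_SUCCESS_PRED",
    0x10: "DSA_COMP_BAD_OPCODE",
    0x11: "DSA_COMP_INVALID_FLAGS",
    0x12: "DSA_COMP_NOZERO_RESERVE",
    0x13: "DSA_COMP_XFER_ERANGE",
}


def describe_status(raw):
    """Return a readable name for the status byte of a completion record."""
    code = raw & DSA_COMP_STATUS_MASK
    return DSA_COMP_NAMES.get(code, "unknown (0x%02x)" % code)


print(describe_status(0x10))
```

With the value reported above, this prints DSA_COMP_BAD_OPCODE, which matches the name given in the post.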
If you need additional traces, PCIe capability dumps, DSACAP registers, or accel-config dumps (accel-config list -i), I will gladly provide them.
Thank you again for your assistance, and please let me know if further details are required.
Best regards,
Juntaek
Hello asdasf,
Thank you for providing the detailed issue summary regarding the DSA Reduce/XOR descriptor submission failure.
To proceed with our analysis, could you please share the following details from the affected system:
1) DSA capability registers
2) Full DSA configuration and work queue information:
3) PCIe capability and device information for the DSA device:
4) Kernel log messages related to DSA initialization:
Regards
Pujeeth_Intel
Here are the results of several commands that may contain the details you are looking for. Please let me know if you need any additional information.
$ sudo accel-config list -i
[
{
"dev":"dsa0",
"read_buffer_limit":0,
"max_groups":4,
"max_work_queues":8,
"max_engines":4,
"work_queue_size":128,
"numa_node":0,
"op_cap":"00000000,00000000,00000000,00000000,00000000,00000000,00000001,003f027d",
"gen_cap":"0x40915f0107",
"version":"0x100",
"state":"enabled",
"max_read_buffers":96,
"max_batch_size":1024,
"configurable":1,
"pasid_enabled":1,
"cdev_major":509,
"clients":0,
"groups":[
{
"dev":"group0.0",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96,
"grouped_workqueues":[
{
"dev":"wq0.0",
"mode":"dedicated",
"size":64,
"group_id":0,
"priority":1,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"cdev_minor":0,
"type":"user",
"name":"swq",
"driver_name":"user",
"threshold":0,
"ats_disable":0,
"state":"enabled",
"clients":0
}
],
"grouped_engines":[
{
"dev":"engine0.0",
"group_id":0
},
{
"dev":"engine0.1",
"group_id":0
},
{
"dev":"engine0.2",
"group_id":0
},
{
"dev":"engine0.3",
"group_id":0
}
]
},
{
"dev":"group0.1",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
},
{
"dev":"group0.2",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
},
{
"dev":"group0.3",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
}
],
"ungrouped workqueues":[
{
"dev":"wq0.1",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.2",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.3",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.4",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.5",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.6",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq0.7",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
}
]
},
{
"dev":"dsa1",
"read_buffer_limit":0,
"max_groups":4,
"max_work_queues":8,
"max_engines":4,
"work_queue_size":128,
"numa_node":2,
"op_cap":"00000000,00000000,00000000,00000000,00000000,00000000,00000001,003f027d",
"gen_cap":"0x40915f0107",
"version":"0x100",
"state":"disabled",
"max_read_buffers":96,
"max_batch_size":1024,
"configurable":1,
"pasid_enabled":1,
"cdev_major":509,
"clients":0,
"groups":[
{
"dev":"group1.0",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
},
{
"dev":"group1.1",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
},
{
"dev":"group1.2",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
},
{
"dev":"group1.3",
"read_buffers_reserved":0,
"use_read_buffer_limit":0,
"read_buffers_allowed":96
}
],
"ungrouped workqueues":[
{
"dev":"wq1.0",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.1",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.2",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.3",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.4",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.5",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.6",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
},
{
"dev":"wq1.7",
"mode":"shared",
"size":0,
"priority":0,
"block_on_fault":0,
"max_batch_size":32,
"max_transfer_size":2097152,
"type":"none",
"name":"",
"driver_name":"",
"threshold":0,
"ats_disable":0,
"state":"disabled",
"clients":0
}
],
"ungrouped_engines":[
{
"dev":"engine1.0"
},
{
"dev":"engine1.1"
},
{
"dev":"engine1.2"
},
{
"dev":"engine1.3"
}
]
}
]
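Since the listing above is JSON, it can be filtered with a few lines of Python rather than read by eye. The sketch below assumes the output was captured into a string and uses only key names visible in the listing. enabled_wqs is a hypothetical helper, and the embedded sample is a trimmed copy of the listing, not new data.

```python
import json


def enabled_wqs(listing):
    """Return (device, wq, mode, size) for every enabled grouped work queue."""
    found = []
    for dev in json.loads(listing):
        for group in dev.get("groups", []):
            for wq in group.get("grouped_workqueues", []):
                if wq.get("state") == "enabled":
                    found.append((dev["dev"], wq["dev"], wq["mode"], wq["size"]))
    return found


# Trimmed copy of the `accel-config list -i` output shown above.
sample = """
[
  {"dev": "dsa0", "state": "enabled",
   "groups": [
     {"dev": "group0.0",
      "grouped_workqueues": [
        {"dev": "wq0.0", "mode": "dedicated", "size": 64, "state": "enabled"}
      ]},
     {"dev": "group0.1"}
   ],
   "ungrouped workqueues": [
     {"dev": "wq0.1", "mode": "shared", "size": 0, "state": "disabled"}
   ]},
  {"dev": "dsa1", "state": "disabled", "groups": [{"dev": "group1.0"}]}
]
"""

for entry in enabled_wqs(sample):
    print(entry)
```

Only grouped work queues are inspected, on the assumption that a work queue has to belong to a group before it can be enabled.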
$ lspci -nn | grep -Ei 'data streaming|I/O Accel|idxd|0b25'
6a:01.0 System peripheral [0880]: Intel Corporation Device [8086:0b25]
e7:01.0 System peripheral [0880]: Intel Corporation Device [8086:0b25]
$ sudo lspci -vvv -s 6a:01.0
6a:01.0 System peripheral: Intel Corporation Device 0b25
Subsystem: Intel Corporation Device 0000
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
NUMA node: 0
IOMMU group: 4
Region 0: Memory at afffff20000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at afffff00000 (64-bit, prefetchable) [size=128K]
Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, IntMsgNum 0
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag+ RBE+ FLReset+ TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
Capabilities: [80] MSI-X: Enable+ Count=9 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [90] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq+ ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [160 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Capabilities: [170 v1] Virtual Channel
Caps: LPEVC=1 RefClk=100ns PATEntryBits=1
Arb: Fixed+ WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=1 ArbSelect=Fixed TC/VC=02
Status: NegoPending- InProgress-
Capabilities: [200 v1] Designated Vendor-Specific: Vendor=8086 ID=0005 Rev=0 Len=24 <?>
Capabilities: [220 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [230 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv+, Max PASID Width: 14
PASIDCtl: Enable+ Exec- Priv+
Capabilities: [240 v1] Page Request Interface (PRI)
PRICtl: Enable+ Reset-
PRISta: RF- UPRGI- Stopped+ PASID+
Page Request Capacity: 00000200, Page Request Allocation: 00000200
Kernel driver in use: idxd
Kernel modules: idxd
$ sudo dmesg | grep -Ei 'dsa|idxd'
[ 13.462756] idxd 0000:6a:01.0: enabling device (0144 -> 0146)
[ 13.476012] idxd 0000:6a:01.0: failed to attach device pasid 1, domain type 4
[ 13.476325] idxd 0000:6a:01.0: No in-kernel DMA with PASID. -22
[ 13.528386] idxd 0000:6a:01.0: Intel(R) Accelerator Device (v100)
[ 13.528513] idxd 0000:e7:01.0: enabling device (0144 -> 0146)
[ 13.542617] idxd 0000:e7:01.0: failed to attach device pasid 1, domain type 4
[ 13.543174] idxd 0000:e7:01.0: No in-kernel DMA with PASID. -22
[ 13.559001] idxd 0000:e7:01.0: Intel(R) Accelerator Device (v100)
[153908.260496] idxd dsa0: attribute deprecated, see max_read_buffers.
[153908.260631] idxd dsa0: attribute deprecated, see read_buffer_limit.
Hi asdasf,
Greetings for the day!
Our check shows that this is a tray processor. Please contact your Intel account representative or the place of purchase for further assistance with this query.
Thanks for your understanding
Regards
Jerome
Intel Customer Support Technician
Hi asdasf,
Greetings for the day!
Meanwhile, we will check with our internal resources regarding the requested details and will provide an update once available.
We appreciate your understanding!
Best regards,
Poojitha N
Intel Customer Support Technician
Hello asdasf,
This is regarding the ongoing issue. After a careful review, we would like to share our findings.
The Intel® Data Streaming Accelerator (DSA) primarily supports data movement and transformation operations such as memory copy, fill, compare, CRC, DIF, delta, and flush. XOR and similar compute operations are not listed among the operations exposed by the current software stack, including the Intel Data Movement Library (DML) and idxd.h.
If XOR operations are described in the architecture specification but are not available in the software stack, these operations may not be implemented in the current software, or they may require specific configurations or updates. You can refer to the Intel® DSA Architecture Specification and User Guide for further details and guidelines:
- Intel® DSA Architecture Specification: Link
- Intel® DSA User Guide: Document #759709 (publicly available in the Intel Resource and Documentation Center; search by document ID)
- https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/data-streaming-accelerator.html
For further clarification or updates on the availability of XOR operations, you may need to consult development forums.
Also, based on the provided logs, here is a basic analysis:
Intel DSA Configuration and Capabilities
1. Device Configuration:
- The logs show two DSA devices (dsa0 and dsa1) with configurations for work queues, engines, and groups.
- dsa0 is enabled, while dsa1 is disabled. This indicates that only one device is actively configured for operations.
2. Work Queue Details:
- dsa0 has one dedicated work queue (wq0.0) enabled, with a size of 64 and a maximum transfer size of 2 MB. This work queue is configured for user mode operations.
- Other work queues are in shared mode but are disabled, which limits the operational capacity of the device.
3. PASID and Virtualization:
- PASID (Process Address Space ID) is enabled for dsa0, which supports virtualization and user-level Shared Virtual Memory (SVM). However, the logs indicate issues with PASID attachment for in-kernel DMA operations (No in-kernel DMA with PASID).
4. Operational Capabilities:
- The op_cap field is a bitmask of the opcodes the device accepts. The reported value covers memory move, fill, and compare, as well as transformation tasks such as CRC generation and DIF. No XOR operation is reported as supported.
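To make the op_cap reading above checkable, here is a minimal sketch that expands the string into the list of opcode numbers whose bits are set. It assumes the comma-separated words are printed most significant first and that bit N corresponds to opcode N. The opcode names are recalled from enum dsa_opcode in linux/idxd.h and should be verified against your header. parse_op_cap and supported_opcodes are hypothetical helper names.

```python
# Opcode numbers as recalled from linux/idxd.h (enum dsa_opcode).
# ASSUMPTION: verify against your own header; the table may be incomplete.
DSA_OPCODES = {
    0x00: "NOOP", 0x01: "BATCH", 0x02: "DRAIN", 0x03: "MEMMOVE",
    0x04: "MEMFILL", 0x05: "COMPARE", 0x06: "COMPVAL", 0x07: "CR_DELTA",
    0x08: "AP_DELTA", 0x09: "DUALCAST", 0x10: "CRCGEN", 0x11: "COPY_CRC",
    0x12: "DIF_CHECK", 0x13: "DIF_INS", 0x14: "DIF_STRP", 0x15: "DIF_UPDT",
    0x20: "CFLUSH",
}


def parse_op_cap(text):
    """Join the 32-bit hex words (most significant first) into one integer."""
    return int(text.replace(",", ""), 16)


def supported_opcodes(text):
    """Return the opcode numbers whose bit is set in op_cap."""
    mask = parse_op_cap(text)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Value copied from the accel-config listing for dsa0 and dsa1.
op_cap = ("00000000,00000000,00000000,00000000,"
          "00000000,00000000,00000001,003f027d")

for opcode in supported_opcodes(op_cap):
    print("0x%02x %s" % (opcode, DSA_OPCODES.get(opcode, "not in table")))
```

For the value reported here, the set bits are 0, 2 through 6, 9, 16 through 21, and 32. Under the assumptions above, a descriptor whose opcode bit is clear would be expected to complete with DSA_COMP_BAD_OPCODE. On a live system the same string can usually be read from the device's op_cap attribute in sysfs instead of being pasted in.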
Error Analysis
1. PASID Attachment Issues:
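The message "failed to attach device pasid 1, domain type 4", followed by "No in-kernel DMA with PASID. -22", indicates that the kernel driver could not attach a PASID for its own in-kernel DMA use. The accel-config listing still reports pasid_enabled as 1 and the user work queue wq0.0 as enabled, so this message concerns in-kernel clients rather than user-space submission. It could be due to hardware or software limitations in the current setup.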
2. Deprecated Attributes:
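The dmesg lines "attribute deprecated, see max_read_buffers" and "attribute deprecated, see read_buffer_limit" report that an older attribute name was accessed; max_read_buffers and read_buffer_limit are the replacements. The warning does not by itself indicate a fault, but newer tooling or driver updates may be needed to avoid it.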
3. PCIe Error Handling:
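The lspci output includes the device's Advanced Error Reporting (AER) registers. In this dump the uncorrectable (UESta) and correctable (CESta) status bits are all clear, so no PCIe error is currently logged. Errors raised at other times could still affect the stability and performance of the DSA.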
Recommendations
1. Driver and Firmware Updates:
Update the DSA driver and firmware to address PASID attachment issues and deprecated attributes.
2. Configuration Review:
Enable additional work queues and optimize their configurations for specific workloads. Ensure that the device is properly configured for virtualization and SVM.
3. Error Mitigation:
Investigate PCIe error handling and ensure that the kernel is configured to handle DSA-related errors effectively.
4. Documentation Reference:
Refer to the Intel® Data Streaming Accelerator Architecture Specification and User Guide for detailed configuration and optimization guidelines.
Please note that this analysis is provided on a best-effort basis by the Xeon hardware break-fix team, as the issue relates more to software and OS configuration.
Hope this information helps and if there is anything else that we may assist you with, please feel free to write back to us.
Happy Troubleshooting!
Regards,
Subhashish_Intel.