Device information

root@wang-1604:~# lspci -d 1e36: -vv
01:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd T10 [CloudBlazer] (rev 01)
        Subsystem: Shanghai Enflame Technology Co. Ltd T10 [CloudBlazer]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at a2000000 (32-bit, non-prefetchable) [size=16K]
        Region 1: Memory at a1000000 (32-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=16G]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00000000
                PBA: BAR=0 offset=00001000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [148 v1] #19
        Capabilities: [178 v1] #26
        Capabilities: [1a8 v1] #27
        Capabilities: [1f0 v1] #22
        Capabilities: [1fc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
        Capabilities: [2fc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [334 v1] #25
        Kernel driver in use: dtu
        Kernel modules: enflame

lspci shows that the device exposes three BAR spaces, with physical base addresses 0xa1000000, 0xa2000000, and 0x4000000000:

Region 0: Memory at a2000000 (32-bit, non-prefetchable) [size=16K]
Region 1: Memory at a1000000 (32-bit, non-prefetchable) [size=16M]
Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=16G]

The corresponding physical addresses also appear in /proc/iomem:

root@wang-1604:~# cat /proc/iomem |grep 0000:01
  a1000000-a20fffff : PCI Bus 0000:01
    a1000000-a1ffffff : 0000:01:00.0
    a2000000-a2003fff : 0000:01:00.0
  4000000000-43ffffffff : PCI Bus 0000:01
    4000000000-43ffffffff : 0000:01:00.0

Normal memory and device memory

Normal memory: normal memory is what we usually call RAM. Accesses to it have no side effects; the n-th access is indistinguishable from the (n+1)-th. (This is easiest to appreciate by contrast with the side-effect behavior of device memory described below.)

Normal memory can be further subdivided through memory attributes; the attributes of a virtual memory region are encoded in its page-table descriptors.

  1. Shareability:

    • shareable: accessible by multiple PEs (Processing Elements), further divided into inner shareable and outer shareable domains.
    • non-shareable: accessible only by a single PE.
  2. Cacheability:

    • write-through cacheable: writes go to the cache and to memory at the same time;
    • write-back cacheable: writes go to the cache only, and memory is updated later at a suitable time;
    • non-cacheable: no caching; reads and writes go straight to memory.

Device memory: "The Device memory type attributes define memory locations where an access to the location can cause side-effects, or where the value returned for a load can vary depending on the number of loads performed. Typically, the Device memory attributes are used for memory-mapped peripherals and similar locations." (Arm Architecture Reference Manual)

In short, device memory is the physical address space backing peripherals. Accesses to it can have side effects, for example:

  • some status registers are read-to-clear;
  • some registers require writes in a specific order (otherwise the write does not take effect);
  • a device FIFO sits at a fixed address, but every access shifts the next element out of the internal shift register[1], so the n-th and the (n+1)-th access to the same address return different results.

Device memory can likewise be subdivided through memory attributes:

  1. Gathering: may accesses be merged?
    • non-Gathering (nG): the processor must perform exactly the accesses the code specifies and may not merge multiple accesses into one. For example, if the code reads the same address twice, the processor must issue two read operations.
    • Gathering (G): the processor may merge memory accesses.
  2. Reordering: may accesses be reordered?
    • non-Reordering (nR): the processor may not reorder memory-access instructions; program order must be followed strictly;
    • Reordering (R): the processor may reorder memory-access instructions.
  3. Early Write Acknowledgement (EWA): may a write be acknowledged before reaching its destination? (E or nE)

A PE's memory accesses are request/response pairs (transactions): for a write, the PE needs a write acknowledgement to complete the transaction. To speed writes up, intermediate agents in the system (caches, for instance) may provide write buffers. nE means the write acknowledgement must come from the final destination, not from an intermediate write buffer.

In summary: the device's BAR0 and BAR1 are mostly registers and are device memory; BAR2 is HBM and can be treated as normal memory.

I/O memory

At run time the physical addresses of a peripheral's I/O memory resources are known; they are fixed by the hardware design. The CPU, however, has no predefined virtual address range for them, so a driver cannot access I/O memory through its physical address directly: it must first map the resource into the kernel virtual address space (via the page tables), and only then access it through the resulting kernel virtual addresses with ordinary load/store instructions.

Depending on the platform and the bus, I/O memory may or may not be accessed through page tables. When it is, the architecture uses memory-mapped I/O (e.g. PowerPC); when it is not, it uses a separate port address space (e.g. Intel's I/O ports). If access goes through page tables, the kernel must first make the physical addresses visible to the driver, which usually means calling ioremap before doing any I/O. If no page tables are involved, the I/O memory region behaves much like I/O ports and can be read and written with the appropriate port functions.

The ioremap family has five interfaces, which by implementation fall into three groups:

  • ioremap & ioremap_nocache: identical implementations, used to map device-memory-type regions;
  • ioremap_cached: used to map normal-memory-type regions, with caching enabled on the resulting virtual mapping;
  • ioremap_wc & ioremap_wt: essentially the same implementation on some architectures, used to map normal-memory-type regions without the ordinary cacheable attribute (write-combining and write-through, respectively).

Once an I/O memory resource's physical address has been mapped to a kernel virtual address, in principle we could read and write it just like RAM. **To keep drivers portable across platforms, however, we should access I/O memory through the dedicated Linux accessor functions rather than by dereferencing pointers to kernel virtual addresses.** On x86, for example, the I/O read/write functions were defined as follows:

#define readb(addr) (*(volatile unsigned char *) __io_virt(addr))
#define readw(addr) (*(volatile unsigned short *) __io_virt(addr))
#define readl(addr) (*(volatile unsigned int *) __io_virt(addr))

#define writeb(b,addr) (*(volatile unsigned char *) __io_virt(addr) = (b))
#define writew(b,addr) (*(volatile unsigned short *) __io_virt(addr) = (b))
#define writel(b,addr) (*(volatile unsigned int *) __io_virt(addr) = (b))

#define memset_io(a,b,c) memset(__io_virt(a),(b),(c))
#define memcpy_fromio(a,b,c) memcpy((a),__io_virt(b),(c))
#define memcpy_toio(a,b,c) memcpy(__io_virt(a),(b),(c))

These I/O functions are implemented differently on each platform. Unlike the plain memcpy/memset, they take alignment into account and move data through the I/O read/write primitives. This is why cross-platform code is advised to use the I/O accessors for I/O-type memory.

X86:

#define memcpy_fromio memcpy_fromio
#define memcpy_toio memcpy_toio
#define memset_io memset_io

void memcpy_fromio(void *to, const volatile void __iomem *from, size_t n)
{
	if (unlikely(!n))
		return;

	/* Align any unaligned source IO */
	if (unlikely(1 & (unsigned long)from)) {
		movs("b", to, from);
		n--;
	}
	if (n > 1 && unlikely(2 & (unsigned long)from)) {
		movs("w", to, from);
		n-=2;
	}
	rep_movs(to, (const void *)from, n);
}
EXPORT_SYMBOL(memcpy_fromio);

void memcpy_toio(volatile void __iomem *to, const void *from, size_t n)
{
	if (unlikely(!n))
		return;

	/* Align any unaligned destination IO */
	if (unlikely(1 & (unsigned long)to)) {
		movs("b", to, from);
		n--;
	}
	if (n > 1 && unlikely(2 & (unsigned long)to)) {
		movs("w", to, from);
		n-=2;
	}
	rep_movs((void *)to, (const void *) from, n);
}
EXPORT_SYMBOL(memcpy_toio);

void memset_io(volatile void __iomem *a, int b, size_t c)
{
	/*
	 * TODO: memset can mangle the IO patterns quite a bit.
	 * perhaps it would be better to use a dumb one:
	 */
	memset((void *)a, b, c);
}

arm:

#define memset_io(c,v,l)	__memset_io((c),(v),(l))
#define memcpy_fromio(a,c,l)	__memcpy_fromio((a),(c),(l))
#define memcpy_toio(c,a,l)	__memcpy_toio((c),(a),(l))

/*
 * Copy data from IO memory space to "real" memory space.
 */
void __memcpy_fromio(void *to, const volatile void __iomem *from, size_t count)
{
	while (count && !IS_ALIGNED((unsigned long)from, 8)) {
		*(u8 *)to = __raw_readb(from);
		from++;
		to++;
		count--;
	}

	while (count >= 8) {
		*(u64 *)to = __raw_readq(from);
		from += 8;
		to += 8;
		count -= 8;
	}

	while (count) {
		*(u8 *)to = __raw_readb(from);
		from++;
		to++;
		count--;
	}
}
EXPORT_SYMBOL(__memcpy_fromio);

/*
 * Copy data from "real" memory space to IO memory space.
 */
void __memcpy_toio(volatile void __iomem *to, const void *from, size_t count)
{
	while (count && !IS_ALIGNED((unsigned long)to, 8)) {
		__raw_writeb(*(u8 *)from, to);
		from++;
		to++;
		count--;
	}

	while (count >= 8) {
		__raw_writeq(*(u64 *)from, to);
		from += 8;
		to += 8;
		count -= 8;
	}

	while (count) {
		__raw_writeb(*(u8 *)from, to);
		from++;
		to++;
		count--;
	}
}
EXPORT_SYMBOL(__memcpy_toio);

/*
 * "memset" on IO memory space.
 */
void __memset_io(volatile void __iomem *dst, int c, size_t count)
{
	u64 qc = (u8)c;

	qc |= qc << 8;
	qc |= qc << 16;
	qc |= qc << 32;

	while (count && !IS_ALIGNED((unsigned long)dst, 8)) {
		__raw_writeb(c, dst);
		dst++;
		count--;
	}

	while (count >= 8) {
		__raw_writeq(qc, dst);
		dst += 8;
		count -= 8;
	}

	while (count) {
		__raw_writeb(c, dst);
		dst++;
		count--;
	}
}

write combine

For modern CPUs the performance bottleneck is memory access: a CPU is typically at least two orders of magnitude faster than main memory. That is why CPUs added L1 and L2 caches, with higher-end parts adding an L3. This technique immediately raises the next question:

If the data a CPU needs is in no cache, it must fetch it from main memory over the memory bus. What does the CPU do while waiting for the data to come back (a wait long enough to execute hundreds or thousands of instructions, at least two orders of magnitude)? It keeps executing other eligible instructions. Given a sequence instruction 1, instruction 2, instruction 3, ..., if instruction 1 accesses main memory, the CPU will continue with later "independent" instructions that have no logical dependency on instruction 1, generally judging independence by the memory references between instructions (see each CPU's documentation for the details). This is one of the roots of out-of-order execution.

That is how the CPU compensates for read latency. Writes are somewhat more involved:

When the CPU executes a store, it first tries to write the data into the nearest cache, L1. On an L1 miss it goes to the next level. L1 roughly keeps pace with the CPU; everything below is markedly slower. L2 is roughly 20-30x slower than the CPU, and an L2 miss costs yet more cycles to reach main memory. In fact, after an L1 miss the CPU uses a separate set of buffers, the write-combining store buffers; the technique is called write combining. While the request for ownership of the L2 cache line is still outstanding, the CPU writes the pending data into a write-combining buffer, which is the size of one cache line, typically 64 bytes. The buffer lets the CPU continue executing other instructions while the buffer is being written or read, which softens the performance hit of a cache miss on the store path.

These buffers become really interesting when subsequent writes touch the same cache line: later writes can be merged into the buffer before it is committed to L2. Each 64-byte buffer maintains a 64-bit field, with one bit set per byte written, indicating which bytes are valid when the buffer is handed to the outer cache. And if the program reads data that has already been written into such a buffer, the buffer is consulted before the cache.

Eventually the buffer's contents are still written out to the outer cache (L2) at some later moment. If we can fill a buffer as completely as possible before it is transferred, we improve the efficiency of the buses at every level, and with it program performance.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static const int iterations = INT_MAX;
static const int items = 1 << 24;
static int mask;
static int arrayA[1 << 24];
static int arrayB[1 << 24];
static int arrayC[1 << 24];
static int arrayD[1 << 24];
static int arrayE[1 << 24];
static int arrayF[1 << 24];
static int arrayG[1 << 24];
static int arrayH[1 << 24];

double run_one_case_for_8() {
    double start_time;
    double end_time;
    struct timeval start;
    struct timeval end;
    int i = iterations;
    gettimeofday(&start, NULL);
    while (--i != 0) {
        int slot = i & mask;
        int value = i;
        arrayA[slot] = value;
        arrayB[slot] = value;
        arrayC[slot] = value;
        arrayD[slot] = value;
        arrayE[slot] = value;
        arrayF[slot] = value;
        arrayG[slot] = value;
        arrayH[slot] = value;
    }
    gettimeofday(&end, NULL);
    start_time = (double)start.tv_sec + (double)start.tv_usec / 1000000.0;
    end_time = (double)end.tv_sec + (double)end.tv_usec / 1000000.0;
    return end_time - start_time;
}

double run_two_case_for_4() {
    double start_time;
    double end_time;
    struct timeval start;
    struct timeval end;
    int i = iterations;
    gettimeofday(&start, NULL);
    while (--i != 0) {
        int slot = i & mask;
        int value = i;
        arrayA[slot] = value;
        arrayB[slot] = value;
        arrayC[slot] = value;
        arrayD[slot] = value;
    }
    i = iterations;
    while (--i != 0) {
        int slot = i & mask;
        int value = i;
        arrayG[slot] = value;
        arrayE[slot] = value;
        arrayF[slot] = value;
        arrayH[slot] = value;
    }
    gettimeofday(&end, NULL);
    start_time = (double)start.tv_sec + (double)start.tv_usec / 1000000.0;
    end_time = (double)end.tv_sec + (double)end.tv_usec / 1000000.0;
    return end_time - start_time;
}

int main() {
    mask = items - 1;
    int i;
    printf("test begin---->\n");
    for (i = 0; i < 3; i++) {
        printf(" %d, run_one_case_for_8: %lf\n", i, run_one_case_for_8());
        printf(" %d, run_two_case_for_4: %lf\n", i, run_two_case_for_4());
    }
    printf("test end");
    return 0;
}

Test results:

root@wang-1604:~/Desktop# ./wc-test 
test begin---->
 0, run_one_case_for_8: 27.442992
 0, run_two_case_for_4: 12.841221
 1, run_one_case_for_8: 27.068290
 1, run_two_case_for_4: 12.692629
 2, run_one_case_for_8: 27.159333
 2, run_two_case_for_4: 12.675539

Why: the write-combining buffers mentioned above sit very close to the CPU. At 64 bytes each they are small (and presumably expensive), and limited in number. On my CPU there are 4; the count depends on the CPU model, and an Intel CPU can use only 4 of them at any one moment.

So in run_one_case_for_8, which writes 8 distinct locations per iteration, the CPU must stall once 4 writes have filled the write-combining buffers, waiting for them to drain into L2. In run_two_case_for_4, which writes only 4 distinct locations per iteration, the buffers are used well and stalls caused by full buffers drop dramatically (the same holds when fewer than 4 locations are written per iteration). Even though the second version runs the loop twice and pays for the extra loop-counter updates (you might object that i is also written to memory, but in fact i lives in a register), the performance gap is still large.

**With write combining enabled, writes are effectively gathered into aligned full-line bursts, which is why on ARM memset can run correctly on such a mapping even without the dedicated I/O functions.**

The ARM memset problem

ARM reads and writes aligned data noticeably faster than unaligned data, so its memset implementations, both in the kernel and in user mode, are optimized around alignment.

First, kernel mode: if kernel code memsets I/O memory directly instead of going through the dedicated I/O functions, the alignment behavior makes a crash inevitable:

[  196.090332] Unable to handle kernel paging request at virtual address ffff00001fb20000
[  196.090377] Mem abort info:
[  196.090393]   ESR = 0x96000061
[  196.090412]   Exception class = DABT (current EL), IL = 32 bits
[  196.090440]   SET = 0, FnV = 0
[  196.090457]   EA = 0, S1PTW = 0
[  196.090474] Data abort info:
[  196.090490]   ISV = 0, ISS = 0x00000061
[  196.090509]   CM = 0, WnR = 1
[  196.090527] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 00000000fae502d6
[  196.090560] [ffff00001fb20000] pgd=00000381fffe0803, pud=00000381fffe0803, pmd=00000381fffd0803, pte=0068082000610f07
[  196.090610] Internal error: Oops: 96000061 [#1] SMP
[  196.090633] Modules linked in: dtu(OE+) dtu_kcl(OE) mpt3sas ast igb ttm raid_class rtc_ds1307
[  196.090677] Process kworker/0:3 (pid: 737, stack limit = 0x0000000031d7cfb7)
[  196.090712] CPU: 0 PID: 737 Comm: kworker/0:3 Kdump: loaded Tainted: G           OE     4.19.90-17.ky10.aarch64 #1
[  196.090756] Hardware name:  /, BIOS RELEASE 5.6 Aug 17 2020
[  196.090789] Workqueue: events work_for_cpu_fn
[  196.090811] pstate: 40000005 (nZcv daif -PAN -UAO)
[  196.090837] pc : __memset+0x16c/0x188
[  196.090904] lr : dtu_bp_create+0x84/0xc0 [dtu]
[  196.091530] sp : ffff80814d7ffab0
[  196.092128] x29: ffff80814d7ffab0 x28: 0000000000000000 
[  196.092723] x27: 0000000000000000 x26: ffff0000024b0000 
[  196.093301] x25: 0000000000000000 x24: 0000000000000002 
[  196.093859] x23: ffff0000024689c0 x22: ffff808107001680 
[  196.094403] x21: 0000000000000001 x20: 0000000000000000 
[  196.094948] x19: ffff83810d6bb000 x18: 0000000000000020 
[  196.095494] x17: ffff808107000110 x16: ffff808107000110 
[  196.096032] x15: ffff0000091de000 x14: 5f766564202c3239 
[  196.096566] x13: 3138203a657a6973 x12: 202c6f62206c656e 
[  196.097110] x11: 0000000000000000 x10: 0000000000000004 
[  196.097656] x9 : 0000000000000000 x8 : ffff00001fb20000 
[  196.098204] x7 : 0000000000000000 x6 : 000000000000003f 
[  196.098746] x5 : 0000000000000040 x4 : 0000000000000000 
[  196.099271] x3 : 0000000000000004 x2 : 0000000000001fc0 
[  196.099777] x1 : 0000000000000000 x0 : ffff00001fb20000 
[  196.100268] Call trace:
[  196.100742]  __memset+0x16c/0x188
[  196.101270]  dtu_bp_init+0x68/0x1c0 [dtu]
[  196.101790]  leo_ih_init+0x5c/0x68 [dtu]
[  196.102291]  dtu_ip_instance_init+0x14c/0x298 [dtu]
[  196.102787]  dtu_device_ips_init+0x174/0x648 [dtu]
[  196.103266]  dtu_device_init.part.3+0x51c/0xd68 [dtu]
[  196.103727]  dtu_device_init+0x48/0xa0 [dtu]
[  196.104167]  dtu_pci_probe+0x22c/0x8f0 [dtu]
[  196.104536]  local_pci_probe+0x3c/0xb8
[  196.104898]  work_for_cpu_fn+0x18/0x28
[  196.105254]  process_one_work+0x1f0/0x3c8
[  196.105607]  worker_thread+0x26c/0x4d0
[  196.105961]  kthread+0x128/0x130
[  196.106308]  ret_from_fork+0x10/0x18
[  196.106652] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[  196.107045] SMP: stopping secondary CPUs
[  196.107898] Starting crashdump kernel...
[  196.108259] Bye!

For the UMD, if a virtual address obtained from mmap (backed by an ioremap-style mapping) is passed straight to memset, the assembly implementation uses the DC ZVA instruction, which raises an alignment fault on device memory. This is where the SIGBUS errors come from.

"If the memory region being zeroed is any type of Device memory, these
instructions give an alignment fault which is prioritized in the same way
as other alignment faults that are determined by the memory type."
— from the Arm Architecture Reference Manual

The fix is to map the pages with the pgprot_writecombine attribute; see https://qiita.com/ikwzm/items/3216f907ff8fc41866c4 for details.

An analysis of the ARM memset implementation: https://www.reexpound.com/2021/05/26/arm性能优化三-memset-优化/

/* Copyright (C) 2012-2019 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.
   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.
   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library.  If not, see
   <http://www.gnu.org/licenses/>.  */
#include <sysdep.h>
#include "memset-reg.h"
#ifndef MEMSET
# define MEMSET memset
#endif

/* Assumptions:
 *
 * ARMv8-a, AArch64, unaligned accesses
 *
 */
ENTRY_ALIGN (MEMSET, 6)
        DELOUSE (0)
        DELOUSE (2)
        dup        v0.16B, valw
        add        dstend, dstin, count
        cmp        count, 96
        b.hi        L(set_long)
        cmp        count, 16
        b.hs        L(set_medium)
        mov        val, v0.D[0]
        /* Set 0..15 bytes.  */
        tbz        count, 3, 1f
        str        val, [dstin]
        str        val, [dstend, -8]
        ret
        nop
1:      tbz        count, 2, 2f
        str        valw, [dstin]
        str        valw, [dstend, -4]
        ret
2:     cbz        count, 3f
        strb        valw, [dstin]
        tbz        count, 1, 3f
        strh        valw, [dstend, -2]
3:      ret
        /* Set 17..96 bytes.  */
L(set_medium):
        str        q0, [dstin]
        tbnz        count, 6, L(set96)
        str        q0, [dstend, -16]
        tbz        count, 5, 1f
        str        q0, [dstin, 16]
        str        q0, [dstend, -32]
1:      ret

        .p2align 4
        /* Set 64..96 bytes.  Write 64 bytes from the start and
           32 bytes from the end.  */
L(set96):
        str        q0, [dstin, 16]
        stp        q0, q0, [dstin, 32]
        stp        q0, q0, [dstend, -32]
        ret
        .p2align 3
        nop
L(set_long):
        and        valw, valw, 255
        bic        dst, dstin, 15
        str        q0, [dstin]
        cmp        count, 256
        ccmp        valw, 0, 0, cs
        b.eq        L(try_zva)
L(no_zva):
        sub        count, dstend, dst        /* Count is 16 too large.  */
        sub        dst, dst, 16                /* Dst is biased by -32.  */
        sub        count, count, 64 + 16        /* Adjust count and bias for loop.  */
1:      stp        q0, q0, [dst, 32]
        stp        q0, q0, [dst, 64]!
L(tail64):
        subs        count, count, 64
        b.hi        1b
2:      stp        q0, q0, [dstend, -64]
        stp        q0, q0, [dstend, -32]
        ret

L(try_zva):
#ifdef ZVA_MACRO
        zva_macro
#else
        .p2align 3
        mrs        tmp1, dczid_el0
        tbnz        tmp1w, 4, L(no_zva)
        and        tmp1w, tmp1w, 15
        cmp        tmp1w, 4        /* ZVA size is 64 bytes.  */
        b.ne         L(zva_128)
        /* Write the first and last 64 byte aligned block using stp rather
           than using DC ZVA.  This is faster on some cores.
         */
L(zva_64):
        str        q0, [dst, 16]
        stp        q0, q0, [dst, 32]
        bic        dst, dst, 63
        stp        q0, q0, [dst, 64]
        stp        q0, q0, [dst, 96]
        sub        count, dstend, dst        /* Count is now 128 too large.        */
        sub        count, count, 128+64+64        /* Adjust count and bias for loop.  */
        add        dst, dst, 128
        nop
1:      dc         zva, dst
        add        dst, dst, 64
        subs       count, count, 64
        b.hi        1b
        stp        q0, q0, [dst, 0]
        stp        q0, q0, [dst, 32]
        stp        q0, q0, [dstend, -64]
        stp        q0, q0, [dstend, -32]
        ret

        .p2align 3
L(zva_128):
        cmp        tmp1w, 5        /* ZVA size is 128 bytes.  */
        b.ne        L(zva_other)
        str        q0, [dst, 16]
        stp        q0, q0, [dst, 32]
        stp        q0, q0, [dst, 64]
        stp        q0, q0, [dst, 96]
        bic        dst, dst, 127
        sub        count, dstend, dst        /* Count is now 128 too large.        */
        sub        count, count, 128+128        /* Adjust count and bias for loop.  */
        add        dst, dst, 128
1:      dc        zva, dst
        add        dst, dst, 128
        subs        count, count, 128
        b.hi        1b
        stp        q0, q0, [dstend, -128]
        stp        q0, q0, [dstend, -96]
        stp        q0, q0, [dstend, -64]
        stp        q0, q0, [dstend, -32]
        ret

L(zva_other):
        mov        tmp2w, 4
        lsl        zva_lenw, tmp2w, tmp1w
        add        tmp1, zva_len, 64        /* Max alignment bytes written.         */
        cmp        count, tmp1
        blo        L(no_zva)
        sub        tmp2, zva_len, 1
        add        tmp1, dst, zva_len
        add        dst, dst, 16
        subs        count, tmp1, dst        /* Actual alignment bytes to write.  */
        bic        tmp1, tmp1, tmp2        /* Aligned dc zva start address.  */
        beq        2f
1:      stp        q0, q0, [dst], 64
        stp        q0, q0, [dst, -32]
        subs        count, count, 64
        b.hi        1b
2:      mov        dst, tmp1
        sub        count, dstend, tmp1        /* Remaining bytes to write.  */
        subs        count, count, zva_len
        b.lo        4f
3:      dc        zva, dst
        add        dst, dst, zva_len
        subs        count, count, zva_len
        b.hs        3b
4:      add        count, count, zva_len
        sub        dst, dst, 32                /* Bias dst for tail loop.  */
        b        L(tail64)
#endif
END (MEMSET)
libc_hidden_builtin_def (MEMSET)