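Kernel emulation of floating-point operations
- Investigation
- Solution
Mismatched .so problem
- Solution
  - Patching the binary directly with Python
Pointer casts and the compiler's strict aliasing rule
- Detailed explanation
  - Code
  - Compilation
- References
cannot find crt1.o problem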

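1. Kernel emulation of floating-point operations

After a toolchain upgrade, a MIPS board became very slow when running a vendor-supplied SDK. Spoiler: the cause was related to floating point.

1.1. Investigation

Based on a comparison of the versions and on experience from earlier upgrade problems, -fno-strict-aliasing was ruled out as the cause (see "Pointer casts and the compiler's strict aliasing rule" below). The investigation went as follows:

Startup was 30 seconds slower. The slow code block was located, and the cmpPorts function in it was the hot spot.
This function was analyzed with both the old and the new toolchain. The generated assembly differed somewhat, but execution time and results were the same.
That essentially ruled out this function, but more debug prints were needed to look at the whole code block.
perf showed many calls to floating-point emulation functions. Was hardware floating point perhaps not enabled?
Hardware floating point was confirmed to be enabled, so why were emulation functions still being called?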

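1.2. Solution

At this point the root cause was close: the slowdown came from floating-point emulation.
Hardware floating point was correctly enabled, so why was emulation still needed?
Why did a test program not reproduce the problem, while the SDK code did?
The key question: under what circumstances is floating-point emulation used?

The answer: when the data accessed by the floating-point load instruction ldc1 is misaligned, the instruction raises an exception. The datasheet does require 64-bit alignment for this instruction.
The kernel catches the exception and emulates the operation in software.

The next question was whether the SDK code really had misaligned data.
C requires malloc to return memory suitably aligned for any object type (8-byte alignment is needed here), but the SDK did not follow this.

The fix was to enable an existing predefined macro in the SDK that makes malloc return 8-byte-aligned memory.

Why did the old toolchain not show the problem? Because it did not emit this floating-point instruction.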
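
The alignment requirement can be made concrete with a small sketch. The helper names is_aligned and align_up are hypothetical (they are not from the SDK), and the address is a made-up illustrative value. The code only shows the arithmetic an allocator could use to hand out 8-byte-aligned addresses, which is what ldc1 needs in order to stay off the exception path.

```python
def is_aligned(addr, alignment=8):
    """Return True if addr is a multiple of alignment (a power of two)."""
    return (addr & (alignment - 1)) == 0


def align_up(addr, alignment=8):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


# Illustrative address only: 4-byte alignment is not enough for ldc1.
misaligned = 0x1004
print(is_aligned(misaligned))            # False
print(hex(align_up(misaligned)))         # 0x1008
print(is_aligned(align_up(misaligned)))  # True
```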

2. Mismatched .so problem

confd fails to start on the board and prints:

/run # confd --start-phase0
Internal error: Failed to load NIF library: '/lib/confd/lib/core/util/priv/syslog_nif.so: cannot open shared object file: No such file or directory'
Daemon died status=19

But the .so file actually exists:

~ # ls -lh /lib/confd/lib/core/util/priv/syslog_nif.so
-rwxr-xr-x 1 root root 6.3K Sep 9 2019 /lib/confd/lib/core/util/priv/syslog_nif.so

readelf shows the flags the file was built with:

readelf -h `find . -name '*.so'` | grep Flag

On the octeon3 series, the flags normally look like this:
Flags: 0x808e0827, noreorder, pic, cpic, abi2, octeon3, mips64r2

while /lib/confd/lib/core/util/priv/syslog_nif.so has:
Flags: 0x808e0227, noreorder, pic, cpic, abi2, fp64, octeon3, mips64r2
Note the extra fp64. Because of a toolchain bug, this flag prevents the .so from being opened.
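
To see exactly which bits differ between the two flag words, a quick sketch in Python 3 helps. The two constants are the values printed by readelf above; the function name differing_bits is a hypothetical helper introduced only for this illustration.

```python
GOOD_FLAGS = 0x808e0827  # typical octeon3 .so, from the readelf output above
BAD_FLAGS = 0x808e0227   # syslog_nif.so, from the readelf output above


def differing_bits(a, b):
    """Return the list of single-bit masks that differ between a and b."""
    diff = a ^ b
    return [1 << i for i in range(32) if (diff >> i) & 1]


print([hex(bit) for bit in differing_bits(GOOD_FLAGS, BAD_FLAGS)])
# ['0x200', '0x800']
```

Only bits 0x200 and 0x800 differ: the old .so has 0x200 set (which the new readelf reports as fp64), while objects from the new toolchain have 0x800 set instead.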

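In the buildroot output directory, look for every .so marked fp64:

for f in `find . -name '*.so'`; do echo $f; readelf -h $f | grep Flag | grep fp64; done > so.log

Only the .so files under ./build/confd-7.1.1.5/confd/lib/confd carry the fp64 flag.

Conclusion: the confd .so files are shipped prebuilt with confd; they were not built by the new toolchain.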

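2.1. Solution

The root cause: gcc 7.3 and binutils from Cavium are not compatible with the old gcc 4.7.

confd comes from a third party, who are unwilling to rebuild it with the new gcc.
So the only option is to patch the old .so files directly and change the flags in the ELF header.

2.1.1. Patching the binary directly with Python

First mmap the .so, then modify the relevant field.
This is the same idea as editing the binary by hand in UltraEdit. The code: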

#!/usr/bin/env python2
# Cavium (Marvell) toolchain based on gcc 7.x introduced a change in ELF flags
# such that objects from old toolchain and new toolchain are not compatible.
# Specifically, the following change was made in glibc to align with binutils
# values:
#
# -#define EF_MIPS_HARD_FLOAT 0x00000200
# -#define EF_MIPS_SINGLE_FLOAT 0x00000400
# +#define EF_MIPS_HARD_FLOAT 0x00000800
# +#define EF_MIPS_SINGLE_FLOAT 0x00001000
#
# The normal solution to this problem is to make sure to recompile all objects
# with the new toolchain. However, in case of ConfD, the recompilation needs to
# happen by the third-party vendor (TailF) which is not open to taking in the
# new toolchain.
# We can workaround this problem by patching the flag values manually.
# Note: below we only change the 'HARD_FLOAT' value (0x200 -> 0x800), we don't
# seem to encounter the SINGLE_FLOAT case.


import mmap
import os
import struct
import sys
ELF32_HEADER_SIZE = 0x34
ELF64_HEADER_SIZE = 0x40
ELF_EI_MAG0 = 0x0
ELF_EI_MAG1 = 0x1
ELF_EI_MAG2 = 0x2
ELF_EI_MAG3 = 0x3
ELF_EI_CLASS = 0x4
ELF_EI_DATA = 0x5
ELF_E_MACHINE = 0x12 #size 2
ELF32_E_FLAGS = 0x24 #size 4
ELF64_E_FLAGS = 0x30 #size 4
if len(sys.argv) < 2:
    sys.stderr.write('Error: please pass filenames as arguments\n')
    sys.exit(1)
for filename in sys.argv[1:]:
    if not os.path.exists(filename):
        sys.stderr.write('Error: file %s does not exist\n' % filename)
        sys.exit(1)
for filename in sys.argv[1:]:
    with open(filename, "r+b") as f:
        mm = mmap.mmap(f.fileno(), ELF64_HEADER_SIZE)
        # check it's actually an ELF file
        if not (mm[ELF_EI_MAG0] == '\x7f' and
                mm[ELF_EI_MAG1] == 'E' and
                mm[ELF_EI_MAG2] == 'L' and
                mm[ELF_EI_MAG3] == 'F'
        ):
            sys.stderr.write('Error: not an ELF file: %s\n' % filename)
            sys.exit(1)
        # ELF32 or ELF64?
        class_ = struct.unpack('B', mm[ELF_EI_CLASS])[0]
        if class_ == 1:
            # ELF32
            flags_offset = ELF32_E_FLAGS
        elif class_ == 2:
            # ELF64
            flags_offset = ELF64_E_FLAGS
        else:
            sys.stderr.write('Error: invalid ELF class in file: %s\n' % filename)
            sys.exit(1)
        # endianness
        endianness = struct.unpack('B', mm[ELF_EI_DATA])[0]
        if endianness != 2:
            sys.stderr.write('Error: expected big-endian ELF file: %s\n' % filename)
            sys.exit(1)
        # MIPS only
        machine = struct.unpack('>H', mm[ELF_E_MACHINE:ELF_E_MACHINE+2])[0]
        if machine != 0x0008:
            sys.stderr.write('Error: not a MIPS ELF file: %s\n' % filename)
            sys.exit(1)
        # Patch hard-float flags
        flags = struct.unpack('>L', mm[flags_offset:flags_offset+4])[0]
        if flags & 0x200:
            newflags = (flags & ~0x200) | 0x800
            print('File %s: changing flags from %x to %x' % (filename, flags, newflags))
            mm[flags_offset:flags_offset+4] = struct.pack('>L', newflags)
        else:
            print('File %s already has suitable flags: %x' % (filename, flags))
    mm.close()

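Notes:

mmap maps a file into memory; the mm object returned here behaves like an array.
struct is a Python module for converting between binary data and Python values. For example, struct.unpack(fmt, string) takes a format string fmt that describes how to do the conversion.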
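
A short illustration of how the format string drives the conversion, written for Python 3. The header below is a synthetic 64-byte buffer built purely for this example (it is not a real .so); the offsets 0x12 and 0x30 are the ELF_E_MACHINE and ELF64_E_FLAGS offsets used by the script above.

```python
import struct

# Synthetic ELF64 big-endian header, for illustration only.
header = bytearray(0x40)
header[0:4] = b"\x7fELF"
header[4] = 2                                     # EI_CLASS: ELF64
header[5] = 2                                     # EI_DATA: big-endian
struct.pack_into(">H", header, 0x12, 0x0008)      # e_machine: MIPS
struct.pack_into(">L", header, 0x30, 0x808e0227)  # e_flags

# '>' = big-endian, 'H' = unsigned 16-bit, 'L' = unsigned 32-bit.
machine = struct.unpack_from(">H", header, 0x12)[0]
flags = struct.unpack_from(">L", header, 0x30)[0]
print(hex(machine), hex(flags))   # 0x8 0x808e0227

# The same four bytes read with the wrong byte order give a different value.
wrong = struct.unpack_from("<L", header, 0x30)[0]
print(hex(wrong))                 # 0x27028e80
```

This is why the script checks EI_DATA before unpacking: the '>' prefix is only correct for big-endian files.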

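3. Pointer casts and the compiler's strict aliasing rule

With gcc 7.3, marvell.user fails at runtime.

The relevant code:

Explanation:

At line 135, the argument cfg_swap is NULL.
The target is big-endian, so default_swap should be 1.
The cfg_swap structure passed into hxctl_cfg_reg_write is all zeros. Why? See below.
As a result the Marvell chip is configured in little-endian mode when it should be big-endian.
The fix is to add -fno-strict-aliasing.

Notes:

According to the references, with strict aliasing enabled the compiler assumes that pointers to different types cannot point to the same memory address (character types are an exception), and it optimizes on that assumption. The code must obey the same rule: pointers of different types must not point to the same memory. A pointer cast breaks this rule, and with -fstrict-aliasing on the behavior is undefined. To benefit from compiler optimizations, avoid such pointer casts in the code.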
-fstrict-aliasing is on by default in newer compilers (GCC enables it at -O2 and above). In this upgrade from gcc 4.7 to gcc 7.3, the option is on by default in 7.3.

3.1. Detailed explanation

3.1.1. Code

#include <stdio.h>

typedef unsigned char __uint8_t;
typedef unsigned short int __uint16_t;
typedef unsigned int __uint32_t;
typedef __uint8_t uint8_t;
typedef __uint16_t uint16_t;
typedef __uint32_t uint32_t;

struct configuration {
    uint8_t value[4];
};

int bar(uint32_t data);
int baz(struct configuration x);

int foo(struct configuration *cfg_swap)
{
    struct configuration default_swap = {
        .value[0] = 0xaa,
        .value[1] = 0xbb,
        .value[2] = 0xcc,
        .value[3] = 0xdd,
    };
    int rc;

    if (!cfg_swap) {
        //printf("foo\n"); // adding the print fixes the problem too
        cfg_swap = &default_swap;
    }

    rc = bar(* (uint32_t *)cfg_swap); // the cast here triggers the issue
    //rc = baz(*cfg_swap); // passing cfg_swap without cast works correctly

    return rc;
}

3.1.2. Compilation

Compile the code:

A: gcc -fpic -O2 -fno-strict-aliasing -c test.c -o test.o
Disassembly:

00000000 <foo>:
 0:    27bdffe0 addiu    sp,sp,-32
 4:    3c02aabb lui    v0,0xaabb
 8:    03a4200a movz    a0,sp,a0
 c:    3442ccdd ori    v0,v0,0xccdd
10:    ffbc0010 sd    gp,16(sp)
14:    3c1c0000 lui    gp,0x0
18:    ffbf0018 sd    ra,24(sp)
1c:    0399e021 addu    gp,gp,t9
20:    afa20000 sw    v0,0(sp)
24:    279c0000 addiu    gp,gp,0
28:    8f990000 lw    t9,0(gp)
2c:    0320f809 jalr    t9
30:    8c840000 lw    a0,0(a0)
34:    dfbf0018 ld    ra,24(sp)
38:    dfbc0010 ld    gp,16(sp)
3c:    03e00008 jr    ra
40:    27bd0020 addiu    sp,sp,32

B: gcc -fpic -O2 -c test.c -o test.o

00000000 <foo>:
 0:    27bdffe0 addiu    sp,sp,-32
 4:    ffbc0010 sd    gp,16(sp)
 8:    03a4200a movz    a0,sp,a0
 c:    ffbf0018 sd    ra,24(sp)
10:    3c1c0000 lui    gp,0x0
14:    0399e021 addu    gp,gp,t9
18:    279c0000 addiu    gp,gp,0
1c:    8f990000 lw    t9,0(gp)
20:    0320f809 jalr    t9
24:    8c840000 lw    a0,0(a0)
28:    dfbf0018 ld    ra,24(sp)
2c:    dfbc0010 ld    gp,16(sp)
30:    03e00008 jr    ra
34:    27bd0020 addiu    sp,sp,32

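Assembly background

Explanation:

In case B, the argument to bar is loaded by lw a0,0(a0) (in the jalr delay slot) from the address held in a0, and that address is set earlier by movz a0,sp,a0, i.e. if a0 is 0, sp is copied into a0. This corresponds to: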
```
  if (!cfg_swap)
      cfg_swap = &default_swap;
```
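When foo is called with NULL, a0 takes the value of sp. sp points at the variable default_swap, inside the 32-byte stack frame the function reserves. So the if (!cfg_swap) branch is taken, cfg_swap points at default_swap on the stack, and bar(* (uint32_t *)cfg_swap) reads the stack variable default_swap through a uint32_t pointer.
The compiler has strict aliasing on by default and assumes that a uint32_t * and struct configuration *cfg_swap cannot refer to the same memory. That is the strict aliasing rule, and it also means the code should not cast pointers this way.
So the compiler reasons that, since the final access goes through a uint32_t * and not a struct configuration *, the stores that initialize default_swap (of type struct configuration) are never read, and it removes them.
The disassembly confirms this: there is no store of aabbccdd.
So default_swap has stack space but is never initialized, and its value is whatever happens to be on the stack. In the real Marvell SDK code the value should be 1 (big-endian), but for the reason above it is usually 0 (little-endian), which leads to the wrong configuration.

In case A, -fno-strict-aliasing turns off the compiler's strict aliasing assumption, and the compiler dutifully initializes default_swap to aabbccdd.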
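
The difference between the two listings can also be checked mechanically. The sketch below (Python 3) decodes the instruction words copied from the disassembly above and tracks constants assembled by lui/ori pairs. The function name constants_built is a hypothetical helper, and it deliberately models only these two opcodes and only the low 32 bits of each register.

```python
# Instruction words copied from the two disassembly listings above.
CASE_A = [0x27bdffe0, 0x3c02aabb, 0x03a4200a, 0x3442ccdd, 0xffbc0010, 0x3c1c0000,
          0xffbf0018, 0x0399e021, 0xafa20000, 0x279c0000, 0x8f990000, 0x0320f809,
          0x8c840000, 0xdfbf0018, 0xdfbc0010, 0x03e00008, 0x27bd0020]
CASE_B = [0x27bdffe0, 0xffbc0010, 0x03a4200a, 0xffbf0018, 0x3c1c0000, 0x0399e021,
          0x279c0000, 0x8f990000, 0x0320f809, 0x8c840000, 0xdfbf0018, 0xdfbc0010,
          0x03e00008, 0x27bd0020]

OP_ORI, OP_LUI = 0x0d, 0x0f  # MIPS I-type opcodes
REG_V0 = 2                   # register number of v0


def constants_built(words):
    """Track constants assembled with lui/ori, keyed by register number."""
    regs = {}
    for w in words:
        opcode, rs, rt, imm = w >> 26, (w >> 21) & 0x1f, (w >> 16) & 0x1f, w & 0xffff
        if opcode == OP_LUI:                     # lui rt, imm
            regs[rt] = imm << 16
        elif opcode == OP_ORI and rs in regs:    # ori rt, rs, imm
            regs[rt] = regs[rs] | imm
    return regs


print(hex(constants_built(CASE_A)[REG_V0]))  # 0xaabbccdd
print(REG_V0 in constants_built(CASE_B))     # False
```

Case A builds 0xaabbccdd in v0 and then stores it with sw v0,0(sp); in case B the constant never appears, which matches the explanation that the initialization was optimized away.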

3.2. References

https://blog.csdn.net/dbzhang800/article/details/6720141
https://xania.org/200712/cpp-strict-aliasing
https://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html

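4. cannot find crt1.o problem

4.1. Symptom

Simply replacing gcc 4.7 with gcc 7.3 produces a "cannot find crt1.o" error:

make dtc-dirclean
make -j1 dtc V=1

Compilation succeeds, but the link step cannot find crt1.o. I have seen this file before; crt is short for C runtime, and crt1.o is one of its startup object files.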

/repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Os -g2 -fPIC -I libfdt -I . -o convert-dtsv0 srcpos.o util.o convert-dtsv0-lexer.lex.o
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crt1.o: No such file or directory
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crti.o: No such file or directory
collect2: error: ld returned 1 exit status

4.2. Analysis

/repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc is the "wrapper" provided by buildroot; its source is buildroot/toolchain/toolchain-wrapper.c

yingjieb@FNSHA190 /repo1/yingjieb/ms/buildroot
$ ll /repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc
lrwxrwxrwx 1 yingjieb systemintegration 17 Aug 12 13:57 /repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc -> toolchain-wrapper

Running the failing command on its own, with the environment variable BR2_DEBUG_WRAPPER set, reproduces the error:

yingjieb@FNSHA190 /repo1/yingjieb/ms/buildroot/output/build/dtc-1.4.7
$ BR2_DEBUG_WRAPPER=1 /repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc -mabi=n32 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Os -g2 -fPIC -I libfdt -I . -o convert-dtsv0 srcpos.o util.o convert-dtsv0-lexer.lex.o
Toolchain wrapper executing: CCACHE_BASEDIR='/repo1/yingjieb/ms/buildroot/output' '/repo1/yingjieb/ms/buildroot/output/host/bin/ccache' '/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/mips64-octeon-linux-gnu-gcc' '--sysroot' '/repo1/yingjieb/ms/buildroot/output/host/mips64-buildroot-linux-gnu/sysroot' '-mabi=n32' '-mnan=legacy' '-EB' '-pipe' '-mno-branch-likely' '-march=octeon3' '-mabi=n32' '-D_LARGEFILE_SOURCE' '-D_LARGEFILE64_SOURCE' '-D_FILE_OFFSET_BITS=64' '-Os' '-g2' '-fPIC' '-I' 'libfdt' '-I' '.' '-o' 'convert-dtsv0' 'srcpos.o' 'util.o' 'convert-dtsv0-lexer.lex.o'
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crt1.o: No such file or directory
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crti.o: No such file or directory

This shows that the wrapper ends up calling /repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/mips64-octeon-linux-gnu-gcc and passes some default arguments: --sysroot /repo1/yingjieb/ms/buildroot/output/host/mips64-buildroot-linux-gnu/sysroot -mabi=n32 -mnan=legacy -EB -pipe -mno-branch-likely -march=octeon3 -mabi=n32
Reproducing these arguments on the command line:

yingjieb@FNSHA190 /repo1/yingjieb/ms/buildroot/output/build/dtc-1.4.7
$ /repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/mips64-octeon-linux-gnu-gcc --sysroot /repo1/yingjieb/ms/buildroot/output/host/mips64-buildroot-linux-gnu/sysroot '-mabi=n32' '-mnan=legacy' '-EB' '-pipe' '-mno-branch-likely' '-march=octeon3' '-mabi=n32' '-D_LARGEFILE_SOURCE' '-D_LARGEFILE64_SOURCE' '-D_FILE_OFFSET_BITS=64' '-Os' '-g2' '-fPIC' '-I' 'libfdt' '-I' '.' '-o' 'convert-dtsv0' 'srcpos.o' 'util.o' 'convert-dtsv0-lexer.lex.o'
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crt1.o: No such file or directory
/repo1/yingjieb/ms/buildroot/output/host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/7.3.0/../../../../mips64-octeon-linux-gnu/bin/ld: cannot find crti.o: No such file or directory

This shows that the problem lies with the new GCC 7.3 itself, not with the wrapper.

4.3. buildroot

The compiler lives in buildroot/output/host/opt/ext-toolchain/mips64-octeon-linux-gnu
buildroot copies the compiler's sysroot to buildroot/output/host/mips64-buildroot-linux-gnu/sysroot, but keeps only lib

4.3.1. octeon3 uses lib32-fp

4.4. gcc 4.7 and gcc 7.3 use different search paths

strace makes this visible:

strace -o ld47.log /repo/yingjieb/ms/buildrootmlt/output/host/bin/mips64-octeon-linux-gnu-gcc -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Os -g2 -fPIC -I libfdt -I . -o convert-dtsv0 srcpos.o util.o convert-dtsv0-lexer.lex.o
strace -o ld73.log /repo1/yingjieb/ms/buildroot/output/host/bin/mips64-octeon-linux-gnu-gcc -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Os -g2 -fPIC -I libfdt -I . -o convert-dtsv0 srcpos.o util.o convert-dtsv0-lexer.lex.o

Comparison:

# The first few paths are similar; the search is under the toolchain directory
host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/4.7.0/n32/octeon3/crt1.o
host/opt/ext-toolchain/bin/../lib/gcc/n32/octeon3/crt1.o
host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/4.7.0/../../../../mips64-octeon-linux-gnu/lib/mips64-octeon-linux-gnu/4.7.0/n32/octeon3/crt1.o
host/opt/ext-toolchain/bin/../lib/gcc/mips64-octeon-linux-gnu/4.7.0/../../../../mips64-octeon-linux-gnu/lib/../lib32-fp/crt1.o
# From here on the search is under the sysroot, which is passed in as an argument
host/mips64-buildroot-linux-gnu/sysroot/lib/mips64-octeon-linux-gnu/4.7.0/n32/octeon3/crt1.o
# gcc 4.7 looks in lib/../lib32-fp
host/mips64-buildroot-linux-gnu/sysroot/lib/../lib32-fp/crt1.o
# whereas gcc 7.3 looks in lib64/../lib32-fp
host/mips64-buildroot-linux-gnu/sysroot/lib64/../lib32-fp/crt1.o

host/mips64-buildroot-linux-gnu/sysroot/usr/lib/mips64-octeon-linux-gnu/4.7.0/n32/octeon3/crt1.o
# gcc 4.7 finds it here
host/mips64-buildroot-linux-gnu/sysroot/usr/lib/../lib32-fp/crt1.o
# Here is the problem: lib32-fp does contain crt1.o, but the path goes through the lib64 directory, and buildroot did not copy lib64.
host/mips64-buildroot-linux-gnu/sysroot/usr/lib64/../lib32-fp/crt1.o

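4.5. Cause

When buildroot copies the sysroot it trims it, and as a result gcc 7.3 cannot find crt1.o.

4.6. Solution

In /repo1/yingjieb/ms/buildroot/output/host/mips64-buildroot-linux-gnu/sysroot/usr
# add a lib64 symlink
ln -s lib lib64

This resolves the problem with the GCC 7.3 upgrade.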
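
Why a path through a missing lib64 fails even though lib32-fp/crt1.o exists can be reproduced in a throwaway directory. This is a minimal sketch in Python 3, assuming a POSIX system; the layout imitates the trimmed sysroot/usr and is created in a temporary directory, not in the real buildroot tree.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as sysroot:
    usr = os.path.join(sysroot, "usr")
    os.makedirs(os.path.join(usr, "lib"))
    os.makedirs(os.path.join(usr, "lib32-fp"))
    open(os.path.join(usr, "lib32-fp", "crt1.o"), "w").close()

    # os.path.join keeps the ".." component, like the paths seen in strace.
    gcc47_path = os.path.join(usr, "lib", "..", "lib32-fp", "crt1.o")
    gcc73_path = os.path.join(usr, "lib64", "..", "lib32-fp", "crt1.o")

    found_47 = os.path.exists(gcc47_path)
    found_73_before = os.path.exists(gcc73_path)  # lib64 is missing, lookup fails

    os.symlink("lib", os.path.join(usr, "lib64"))  # same as: ln -s lib lib64
    found_73_after = os.path.exists(gcc73_path)

print(found_47, found_73_before, found_73_after)  # True False True
```

The kernel resolves each path component in order, so lib64/.. cannot be traversed until lib64 itself exists; the symlink is enough to make the gcc 7.3 style path resolve.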