
我的征尘是星辰大海。。。

The dirt and dust from my pilgrimage forms oceans of stars...

-------当记忆的篇章变得零碎,当追忆的图片变得模糊,我们只能求助于数字存储的永恒的回忆

作者:黄教授

二〇二五


一月二日 等待变化等待机会

  1. 终于成功配置了wireguard,赶紧记录一下配置:
    1. 服务器端配置
      
      [Interface]
      # 服务器私钥
      PrivateKey = ...
      # 服务器在Wireguard网络中的IP地址
      Address = 10.0.0.1/24
      ListenPort = 5182
      PostUp= iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
      PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
      
      # 定义客户端
      [Peer]
      # 客户端公钥
      PublicKey =...
      # 允许客户端访问的IP地址范围
      AllowedIPs = 10.0.0.2/32
      
      routing的配置:(这个应该是默认的)
      
      $ ip route
      default via 172.31.16.1 dev eth0 proto dhcp src 172.31.29.227 metric 100 
      10.0.0.0/24 dev wg0 proto kernel scope link src 10.0.0.1 
      172.31.0.2 via 172.31.16.1 dev eth0 proto dhcp src 172.31.29.227 metric 100 
      172.31.16.0/20 dev eth0 proto kernel scope link src 172.31.29.227 metric 100 
      172.31.16.1 dev eth0 proto dhcp scope link src 172.31.29.227 metric 100 
      
      DNS解析配置应该没有改变。
    2. 客户端配置
      
      [Interface]
      # 客户端私钥
      PrivateKey = ...
      # 客户端在Wireguard网络中的IP地址
      Address = 10.0.0.2/24
      
      DNS=172.31.29.227
      [Peer]
      # 服务器公钥
      PublicKey = ...
      # 服务器的公网IP地址
      Endpoint = 54.146.251.127:5182
      AllowedIPs = 0.0.0.0/0
      
      这里我虽然设置了服务器的内网地址DNS=172.31.29.227作为DNS服务器,但是我随后又在系统的DNS配置里把DNS服务器设成了wireguard网络里的服务器IP(10.0.0.1),所以这个设置其实是不需要的!gemini对这一点非常肯定,以至于我一开始将信将疑。
      
      $ resolvectl 
      Global
               Protocols: -LLMNR -mDNS DNSOverTLS=opportunistic DNSSEC=no/unsupported
        resolv.conf mode: foreign
      Current DNS Server: 10.0.0.1
             DNS Servers: 10.0.0.1
      
      Link 2 (enp0s31f6)
          Current Scopes: DNS
               Protocols: +DefaultRoute +LLMNR -mDNS DNSOverTLS=opportunistic DNSSEC=no/unsupported
      Current DNS Server: 218.85.157.99
             DNS Servers: 218.85.152.99 218.85.157.99 fe80::1%23667
      
      Link 3 (wlp109s0)
      Current Scopes: none
           Protocols: -DefaultRoute +LLMNR -mDNS DNSOverTLS=opportunistic DNSSEC=no/unsupported
      
      Link 766 (wg0)
      Current Scopes: none
           Protocols: -DefaultRoute +LLMNR -mDNS DNSOverTLS=opportunistic DNSSEC=no/unsupported
      
      而routing的配置是这样子的:
      
      $ ip route
      0.0.0.0/1 via 10.0.0.1 dev wg0 
      default via 192.168.1.1 dev enp0s31f6 proto dhcp metric 100 
      10.0.0.0/24 dev wg0 proto kernel scope link src 10.0.0.2 
      54.146.251.127 via 192.168.1.1 dev enp0s31f6 
      169.254.0.0/16 dev enp0s31f6 scope link metric 1000 
      192.168.1.0/24 dev enp0s31f6 proto kernel scope link src 192.168.1.6 metric 100 
      
      这里我仅仅是按照wireguard官网的说明额外让所有的流量都走服务器。而54.146.251.127 via 192.168.1.1 dev enp0s31f6 这一条并不是wg-quick加的,而是我为了还能用ssh直接访问aws/ec2服务器而自己额外添加的。
  2. 经过我的实验,大概有两点重要:一个是必须设置全局的DNS解析,也就是说要修改/etc/resolv.conf,把wireguard的服务器地址设定为DNS服务器,即nameserver 10.0.0.1。其次是路由表设置要保证所有的流量都走wireguard:
    
    sudo ip route add 0.0.0.0/1 dev wg0
    sudo ip route add 128.0.0.0/1 dev wg0
    
    最后是服务器端必须要设置NAT和forward,这是VPN客户端经由服务器访问internet所必备的。我以前以为这是vpn自动要去实现的,实际上wireguard作为一个极其简单的vpn软件仅仅实现了最重要的部分,也就是加密通道的部分,其余都要自己去做。这里是谷歌的人工智能gemini的解释,堪称经典:

    Is wireguard server "Forwarding" mainly for firewall?

    That's a very good question that gets at the heart of how networking works.

    While it's true that forwarding is a core function of a firewall, it's not exclusively a firewall function. In the context of a WireGuard server (or any router), "forwarding" refers to the basic network process of taking a packet received on one network interface and sending it out another network interface.

    Here's a breakdown:

    Forwarding in General:
    • Core Routing Function: At its most fundamental level, forwarding is what routers (including your WireGuard server) do. When a packet arrives at a router, it examines the destination IP address. If the destination is on a different network than the incoming interface, the router "forwards" the packet to the appropriate outgoing interface.
    • Essential for Inter-Network Communication: Without forwarding, devices on different networks (like your home network and the internet) could not communicate with each other.
    Forwarding in the Context of a WireGuard Server:
    • Connecting Private and Public Networks: Your WireGuard server acts as a bridge between the private WireGuard network (e.g., 10.8.0.0/24) and the public internet.
    • iptables -A FORWARD -i wg0 -j ACCEPT: This specific iptables rule tells the Linux kernel (which is the operating system of your server) to allow packets to be forwarded from the WireGuard interface (wg0) to other interfaces (and vice versa). This is absolutely essential for your WireGuard clients to be able to reach the internet.
    Relationship to Firewalls:
    • Firewalls Use Forwarding: Firewalls use forwarding as a mechanism to inspect and filter traffic. A firewall can have rules that say "forward packets from this IP to that IP only if they are on port 80 (HTTP)" or "block all forwarding from this network to that network."
    • Forwarding Exists Independently of Firewalls: The basic act of forwarding is a lower-level network function that exists regardless of whether you have a complex firewall setup. Even with no explicit firewall rules, the kernel still needs to know whether to forward packets or not.
    Analogy:

    Think of a post office:

    • Forwarding: The basic act of sorting mail and sending it to the correct destination is like forwarding.
    • Firewall: A security guard at the post office who checks packages for suspicious contents and decides whether to allow them to be delivered is like a firewall.

    The post office needs to forward mail even without a security guard. The security guard (firewall) uses the forwarding system to do its job of filtering.

    In summary:
    • Forwarding is a fundamental network function that allows devices on different networks to communicate.
    • WireGuard servers rely on forwarding to connect clients on the private VPN network to the public internet.
    • Firewalls use forwarding as a mechanism to filter traffic, but forwarding itself is not solely a firewall function.
    Therefore, even if you don't have a complex firewall configuration, you still need to enable forwarding on your WireGuard server for it to function correctly. The iptables -A FORWARD ... rule is essential for this.
  3. 我和gemini的对话看起来比豆包要靠谱一些,它强调了设立DNS服务器的重要性,就是说wireguard服务器还是首先配置好named:比如/etc/bind/named.conf.options需要
    
    listen-on { 10.0.0.1; }; // Listen on the WireGuard server's internal IP
    allow-query { 10.0.0.0/24; }; // Allow queries from the WireGuard network
    forwarders { 8.8.8.8; 1.1.1.1; }; // Forward unresolved queries (Google and Cloudflare DNS)
    //listen-on-v6 { any; }; #Comment this line to disable ipv6 listening
    
    我感觉这样子用listen-on { 10.0.0.1; };就不会干扰服务器上原本的systemd-resolved的运作了。

一月二十三日 等待变化等待机会

  1. 浴景遐思
    
    翠树华亭掩香汤,
    霓裳才解玉肌凉。
    纤纤葱笋扬春水,
    袅袅胴体隔水香。
    暖雾氤氲思旧事,
    柔波潋滟梦情长。
    待得浴罢添新韵,
    浅笑嫣然映景光。
    
  2. 我发现gnome-sound-recorder录制mp3格式会失败,于是就想用ffmpeg做一个最最简单的录音机:
    
    ffmpeg -f alsa -i hw:1 /tmp/output.mp3
    
    效果不好的原因是我没有找到把麦克风关闭的方法。

一月二十六日 等待变化等待机会

  1. 我怀疑很多程序不能正常运行,尤其是录屏软件,可能是因为使用了wayland而不是X11系统,我尝试在/etc/gdm3/custom.conf里禁止它。但是这个似乎引起了很多的问题,至少wine的程序不能正确运行了。
  2. apt的key的设定也是一个头疼的问题。豆包说的安装winehq的做法比较新,因为旧的做法(直接添加https://dl.winehq.org/wine-builds/Release.key)已经被所谓的keyring取代了。
    
    $ sudo wget -O /etc/apt/keyrings/winehq-archive.key https://dl.winehq.org/wine-builds/winehq.key
    $ sudo wget -NP /etc/apt/sources.list.d/ https://dl.winehq.org/wine-builds/ubuntu/dists/jammy/winehq-jammy.sources
    
  3. 我试图把录屏的视频加快播放,就是压缩一下
    
    $ ffmpeg -i deepseek-2025-01-26_17.20.48.mp4 -vf "setpts=0.5*PTS" wine-fix.mp4
    
    实际上不能靠帧数fps的改动来达到目的因为这样子时间是不变的,只有靠时间戳的改变。

二月二日 等待变化等待机会

  1. 我发现ollama是一个非常好的开源项目,想对大模型有点上手的体会,用这个工具最合适。比如使用llama-tokenize可以直观地看到不同的模型处理相同的输入所得到的token序列。而且顺便看到了deepseek的这个大模型就是用qwen的模型蒸馏得到的,这个是它自己说的。
    
    $ ./llama-tokenize -m /usr/share/ollama/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93  --prompt "how are you" 2>&1 | grep general.name
    general.name str              = DeepSeek R1 Distill Qwen 32B
    print_info: general.name     = DeepSeek R1 Distill Qwen 32B
    

二月十五日 等待变化等待机会

  1. 据说这个ktransformers项目可以提高运行效率。这里我看到了huggingface上的deepseek-r1的文宣
  2. 我写的科幻小说《新生 AI 的第一课》

二月十九日 等待变化等待机会

  1. 仅仅一个修改ollama默认模型路径就费了我不少力气!看来下面这个是唯一行得通的办法吧?就是说用户名很重要:
    On Ubuntu, Ollama saves models by default to /usr/share/ollama/.ollama/models.
    
    sudo mkdir -p /etc/systemd/system/ollama.service.d; sudo nano /etc/systemd/system/ollama.service.d/override.conf
    
    这个内容是重要的,我之前没有把User改掉就不行
    
    [Service]
    User=nick
    Group=nick
    Environment="OLLAMA_MODELS=/home/nick/workspace/OLLAMA/.ollama/models"
    
    之前我尝试使用软链接,可是不行。
  2. 尝试增加swapfile的大小:先关掉swap,重新生成更大的文件,再格式化并启用
    
    sudo swapoff -a
    sudo dd if=/dev/zero of=/swapfile bs=1M count=8192
    sudo mkswap /swapfile
    sudo swapon /swapfile
    

二月二十四日 等待变化等待机会

  1. 我的简单实践显示只有llama.cpp适合我的使用,因为其他都太复杂,哪怕是ollama我都因为编译需要用到的go的版本号不同而不知所措,再加上无数的其他语言的障碍,我只能接受llama.cpp

三月四日 等待变化等待机会

  1. 买了一个二手服务器来玩一下,发现还真的是闹了笑话。
  2. 首先,就是VGA显示转HDMI大都是不兼容的。我买了两个都不行。
  3. 其次,我发现也许是Dell R720XD的iDRAC的重置键不起作用吧,我是没有办法重置iDRAC的设置。
  4. ubuntu等等的确是默认没有使用DHCP来分配IP吧?但是最关键的是启动以后卡在RAID控制器那里等待你重新配置,这就是卡在启动的问题吧。
  5. 学习Dell的服务器BIOS配置,有很多的功能,比之前用过的Quanta来的好。可以在BIOS里直接使用iDRAC配置为所谓的LOM1(Lan Of Motherboard1)来访问,因为我的网线不够,就不使用dedicated的模式了。
  6. 而最丢人的是我居然没有怎么用过LVM,以至于在df里看不到磁盘的容量而抱怨,实际上,可以很轻松的把未分配的磁盘加入VG。首先查看VG
    
    $ sudo vgdisplay 
      --- Volume group ---
      VG Name               ubuntu-vg
      System ID             
      Format                lvm2
      Metadata Areas        1
      Metadata Sequence No  3
      VG Access             read/write
      VG Status             resizable
      MAX LV                0
      Cur LV                1
      Open LV               1
      Max PV                0
      Cur PV                1
      Act PV                1
      VG Size               <1.82 TiB
      PE Size               4.00 MiB
      Total PE              476018
      Alloc PE / Size       476018 / <1.82 TiB
      Free  PE / Size       0 / 0   
      VG UUID               fvQTI5-0C8Q-9eVF-VkQ0-fmgh-w5XH-GewwFP
    
    而我的LV的情况是
    
    $ sudo lvdisplay 
      --- Logical volume ---
      LV Path                /dev/ubuntu-vg/ubuntu-lv
      LV Name                ubuntu-lv
      VG Name                ubuntu-vg
      LV UUID                3LexV3-Imjh-NB0K-oGf0-gVbz-PiF8-3S8Non
      LV Write Access        read/write
      LV Creation host, time ubuntu-server, 2025-02-22 07:55:25 +0000
      LV Status              available
      # open                 1
      LV Size                100.00 GiB
      Current LE             25600
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     256
      Block device           253:0
    
    扩容后LV Size变成了最大值1.82 TiB。具体步骤是
    1. 首先要创建物理卷(PV)
      
      $ sudo pvcreate /dev/sda3
      
    2. 然后把物理卷加入VG
      
      $ sudo vgextend ubuntu-vg /dev/sda3
      
    3. 然后扩容
      
      $ sudo lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
      
    4. 然后调整文件系统大小
      
      $ sudo resize2fs /dev/ubuntu-vg/ubuntu-lv
      
    5. 检验一下
      
      $ df -h
      Filesystem                         Size  Used Avail Use% Mounted on
      tmpfs                              101G  1.9M  101G   1% /run
      /dev/mapper/ubuntu--vg-ubuntu--lv  1.8T   12G  1.8T   1% /
      tmpfs                              504G     0  504G   0% /dev/shm
      tmpfs                              5.0M     0  5.0M   0% /run/lock
      /dev/sda2                          2.0G  246M  1.6G  14% /boot
      /dev/sda1                          1.1G  6.1M  1.1G   1% /boot/efi
      tmpfs                              101G  4.0K  101G   1% /run/user/1000
      

三月五日 等待变化等待机会

  1. 这里是一个标准的创建服务的流程
    1. Create a user and group for Ollama:
      
      sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
      sudo usermod -a -G ollama $(whoami)
      
      这里的好处是服务的用户不是当前的用户,可以杜绝一些服务非法访问。比如我们看看ollama这个用户的权限:
      
      $ cat /etc/passwd | grep ollama
      ollama:x:998:998::/usr/share/ollama:/bin/false
      
      它的home我们限制死了,而且是不可登陆的。注意这里的/bin/false等效于/sbin/nologin。而作为ollama用户的home directory:
      
      $ sudo ls -dal /usr/share/ollama/
      drwxr-x--- 2 ollama ollama 4096 Mar  4 22:11 /usr/share/ollama/
      
      也只能在group里添加才能访问。
    2. Create a service file in /etc/systemd/system/ollama.service
      
      [Unit]
      Description=Ollama Service
      After=network-online.target
      
      [Service]
      ExecStart=/usr/bin/ollama serve
      User=ollama
      Group=ollama
      Restart=always
      RestartSec=3
      Environment="PATH=$PATH"
      
      [Install]
      WantedBy=default.target
      
      这里如果要添加什么新的环境启动变量,并不推荐直接修改service文件,而是修改一个所谓的override,就是创建 /etc/systemd/system/ollama.service.d/override.conf然后添加。
    3. Then start the service:
      
      sudo systemctl daemon-reload
      sudo systemctl enable ollama
      

三月十一日 等待变化等待机会

  1. DeepSeek-r1:671b的参数,这个是在llama.cpp命令行下的输出:
    
    build: 4858 (1e2f78a0) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
    main: llama backend init
    main: load the model and apply lora adapter, if any
    llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
    llama_model_loader: - kv   1:                               general.type str              = model
    llama_model_loader: - kv   2:                         general.size_label str              = 256x20B
    llama_model_loader: - kv   3:                      deepseek2.block_count u32              = 61
    llama_model_loader: - kv   4:                   deepseek2.context_length u32              = 163840
    llama_model_loader: - kv   5:                 deepseek2.embedding_length u32              = 7168
    llama_model_loader: - kv   6:              deepseek2.feed_forward_length u32              = 18432
    llama_model_loader: - kv   7:             deepseek2.attention.head_count u32              = 128
    llama_model_loader: - kv   8:          deepseek2.attention.head_count_kv u32              = 128
    llama_model_loader: - kv   9:                   deepseek2.rope.freq_base f32              = 10000.000000
    llama_model_loader: - kv  10: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
    llama_model_loader: - kv  11:                deepseek2.expert_used_count u32              = 8
    llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 3
    llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 129280
    llama_model_loader: - kv  14:            deepseek2.attention.q_lora_rank u32              = 1536
    llama_model_loader: - kv  15:           deepseek2.attention.kv_lora_rank u32              = 512
    llama_model_loader: - kv  16:             deepseek2.attention.key_length u32              = 192
    llama_model_loader: - kv  17:           deepseek2.attention.value_length u32              = 128
    llama_model_loader: - kv  18:       deepseek2.expert_feed_forward_length u32              = 2048
    llama_model_loader: - kv  19:                     deepseek2.expert_count u32              = 256
    llama_model_loader: - kv  20:              deepseek2.expert_shared_count u32              = 1
    llama_model_loader: - kv  21:             deepseek2.expert_weights_scale f32              = 2.500000
    llama_model_loader: - kv  22:              deepseek2.expert_weights_norm bool             = true
    llama_model_loader: - kv  23:               deepseek2.expert_gating_func u32              = 2
    llama_model_loader: - kv  24:             deepseek2.rope.dimension_count u32              = 64
    llama_model_loader: - kv  25:                deepseek2.rope.scaling.type str              = yarn
    llama_model_loader: - kv  26:              deepseek2.rope.scaling.factor f32              = 40.000000
    llama_model_loader: - kv  27: deepseek2.rope.scaling.original_context_length u32              = 4096
    llama_model_loader: - kv  28: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
    llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
    llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = deepseek-v3
    llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
    llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
    llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 0
    llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 1
    llama_model_loader: - kv  36:            tokenizer.ggml.padding_token_id u32              = 1
    llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = true
    llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
    llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
    llama_model_loader: - kv  40:               general.quantization_version u32              = 2
    llama_model_loader: - kv  41:                          general.file_type u32              = 15
    llama_model_loader: - type  f32:  361 tensors
    llama_model_loader: - type q4_K:  606 tensors
    llama_model_loader: - type q6_K:   58 tensors
    print_info: file format = GGUF V3 (latest)
    print_info: file type   = Q4_K - Medium
    print_info: file size   = 376.65 GiB (4.82 BPW) 
    load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    load: special tokens cache size = 818
    load: token to piece cache size = 0.8223 MB
    print_info: arch             = deepseek2
    print_info: vocab_only       = 0
    print_info: n_ctx_train      = 163840
    print_info: n_embd           = 7168
    print_info: n_layer          = 61
    print_info: n_head           = 128
    print_info: n_head_kv        = 128
    print_info: n_rot            = 64
    print_info: n_swa            = 0
    print_info: n_embd_head_k    = 192
    print_info: n_embd_head_v    = 128
    print_info: n_gqa            = 1
    print_info: n_embd_k_gqa     = 24576
    print_info: n_embd_v_gqa     = 16384
    print_info: f_norm_eps       = 0.0e+00
    print_info: f_norm_rms_eps   = 1.0e-06
    print_info: f_clamp_kqv      = 0.0e+00
    print_info: f_max_alibi_bias = 0.0e+00
    print_info: f_logit_scale    = 0.0e+00
    print_info: n_ff             = 18432
    print_info: n_expert         = 256
    print_info: n_expert_used    = 8
    print_info: causal attn      = 1
    print_info: pooling type     = 0
    print_info: rope type        = 0
    print_info: rope scaling     = yarn
    print_info: freq_base_train  = 10000.0
    print_info: freq_scale_train = 0.025
    print_info: n_ctx_orig_yarn  = 4096
    print_info: rope_finetuned   = unknown
    print_info: ssm_d_conv       = 0
    print_info: ssm_d_inner      = 0
    print_info: ssm_d_state      = 0
    print_info: ssm_dt_rank      = 0
    print_info: ssm_dt_b_c_rms   = 0
    print_info: model type       = 671B
    print_info: model params     = 671.03 B
    print_info: general.name     = n/a
    print_info: n_layer_dense_lead   = 3
    print_info: n_lora_q             = 1536
    print_info: n_lora_kv            = 512
    print_info: n_ff_exp             = 2048
    print_info: n_expert_shared      = 1
    print_info: expert_weights_scale = 2.5
    print_info: expert_weights_norm  = 1
    print_info: expert_gating_func   = sigmoid
    print_info: rope_yarn_log_mul    = 0.1000
    print_info: vocab type       = BPE
    print_info: n_vocab          = 129280
    print_info: n_merges         = 127741
    print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
    print_info: EOS token        = 1 '<|end▁of▁sentence|>'
    print_info: EOT token        = 1 '<|end▁of▁sentence|>'
    print_info: PAD token        = 1 '<|end▁of▁sentence|>'
    print_info: LF token         = 201 'Ċ'
    print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
    print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
    print_info: FIM MID token    = 128802 '<|fim▁end|>'
    print_info: EOG token        = 1 '<|end▁of▁sentence|>'
    print_info: max token length = 256
    load_tensors: loading model tensors, this can take a while... (mmap = true)
    
    我准备等补充了64GX8=512G内存再来尝试,因为量化一方面损失精度,又占用硬盘空间。以上包含了很多deepseek的有趣的设定参数。
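    
    顺便粗略验证一下文件大小和BPW是对得上的(我自己的换算,4.82是日志里已经四舍五入过的数字):
    
    $\dfrac{671.03\times 10^{9}\ \text{参数}\times 4.82\ \text{bit/参数}}{8\ \text{bit/byte}\times 2^{30}\ \text{byte/GiB}}\approx 376.6\ \text{GiB}$
    
    和print_info报告的376.65 GiB基本一致。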
  2. llama.cpp里有很多基本的东西我需要学习一下。就是说这些都是很普通、很基本的手法,以前看到别人这么做,但是真的要自己做还是做不出来。首先是正确地设定各种debug手段:
    
    #define GGML_ABORT(...) ggml_abort(__FILE__, __LINE__, __VA_ARGS__)
    #define GGML_ASSERT(x) if (!(x)) GGML_ABORT("GGML_ASSERT(%s) failed", #x)
    
    不要小看这个宏,这个是大家都这么做的,因为使用的时候可以添加额外的输出信息:
    
    GGML_ASSERT(tokenizer && "Tokenizer not initialized. Call llama_vocab::init_tokenizer() first.");
    
    这里的&&保证了它作为GGML_ABORT的输出的附加部分,展开就是:
    
    if (!(tokenizer && "Tokenizer not initialized. Call llama_vocab::init_tokenizer() first."))
    ggml_abort("/home/nick/workspace/llama.cpp/src/llama-vocab.cpp", 2330, "GGML_ASSERT(%s) failed",
               "tokenizer && \"Tokenizer not initialized. Call llama_vocab::init_tokenizer() first.\"")
    
    这里我们再来学习一下ggml_abort,这也是经典的C语言里展开可变参数的标准做法。注意这里是输出到stderr而不是stdout,这是大家约定俗成的惯例。
    
    void ggml_abort(const char * file, int line, const char * fmt, ...) {
        fflush(stdout);
    
        fprintf(stderr, "%s:%d: ", file, line);
    
        va_list args;
        va_start(args, fmt);
        vfprintf(stderr, fmt, args);
        va_end(args);
    
        fprintf(stderr, "\n");
    
        ggml_print_backtrace();
        abort();
    }
    
    而这里的ggml_print_backtrace又是一个经典的例子,我记得我在十几年前看到Paypal里输出backtrace是非常佩服的,不过要把输出结果放到服务端,因为是remote gdb的debug吧?总之那个更加的麻烦。
    
    static void ggml_print_backtrace(void) {
        const char * GGML_NO_BACKTRACE = getenv("GGML_NO_BACKTRACE");
        if (GGML_NO_BACKTRACE) {
            return;
        }
        char attach[32];
        snprintf(attach, sizeof(attach), "attach %d", getpid());
        int pid = fork();
        if (pid == 0) {
            // try gdb
            execlp("gdb", "gdb", "--batch",
                "-ex", "set style enabled on",
                "-ex", attach,
                "-ex", "bt -frame-info source-and-location",
                "-ex", "detach",
                "-ex", "quit",
                (char *) NULL);
    //后面是其他平台的做法
    
    这里都是经典的手法:首先用getpid()得到当前进程的pid,然后fork()出子进程去调用gdb来attach这个pid,这里还可以学习gdb的--batch模式。不过美中不足的是,在ubuntu下这个操作是受限的,需要sudo的权限,所以我一开始不知道怎么才能得到权限来做这个。下面是gemini解释的做法。在/proc/sys/kernel/yama/ptrace_scope里定义了几种模式:
    1. 0 (classic ptrace permissions): Anyone can ptrace any process, provided they have the necessary permissions (e.g., same user ID). This is the least secure setting.
    2. 1 (restricted ptrace): A process can only be ptraced by a parent process or a process running as root. This is the default setting in many modern Linux distributions, including Ubuntu 22.04.
    3. 2 (restricted ptrace admin): Only root can ptrace processes.
    4. 3 (no ptrace): No process can be ptraced, even by root.
    所以,临时改动一下
    
    echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
    
    永久改动/etc/sysctl.conf
    
    kernel.yama.ptrace_scope = 0
    
    再执行sudo sysctl -p。这样子就可以使用backtrace了。而更进一步的做法是直接使用glibc里的backtrace/backtrace_symbols,这些函数不需要用到ptrace系统调用,也就没有权限的烦扰:
    
    The backtrace function in the C/C++ standard library (specifically, in the GNU C Library, glibc) works by inspecting the call stack of the current process. It does not use the ptrace system call.
    Here's how backtrace typically works:
    
    1. Stack Frame Inspection:
    
        The function walks up the call stack by inspecting the frame pointer (or base pointer) registers.
        Each stack frame contains:
            The return address (the address of the instruction to execute after the function returns).
            Local variables and function arguments.
            A pointer to the previous stack frame.
        backtrace uses these frame pointers to trace the sequence of function calls that led to the current point in the program.
    
    2. Symbol Resolution (Optional):
    
        The backtrace_symbols function (which often accompanies backtrace) takes the addresses obtained from backtrace and attempts to resolve them to function names, source file names, and line numbers.
        This symbol resolution typically involves:
            Reading debugging information from the executable file (e.g., DWARF information).
            Using symbol tables to map addresses to function names.
    3. No ptrace Involvement:
    
        backtrace operates within the context of the current process. It inspects the process's own memory (the call stack).
        ptrace is used for debugging other processes or for controlling the execution of a child process. backtrace doesn't need to do that.
        Therefore, backtrace does not require any special privileges or capabilities.
    
    Key Points:
    
        backtrace is a self-inspection mechanism.
        It relies on the structure of the call stack, which is a fundamental part of how programs execute.
        Symbol resolution (using backtrace_symbols) enhances the output of backtrace by providing human-readable information.
    
    In summary:
    
    backtrace is a valuable tool for debugging and profiling, and it works by directly examining the process's call stack, without the need for the ptrace system call.
    
    这个就是神奇的工具函数!
    
    /* Store up to SIZE return address of the current program state in
       ARRAY and return the exact number of values stored.  */
    extern int backtrace (void **__array, int __size) __nonnull ((1));
    
    
    /* Return names of functions from the backtrace list in ARRAY in a newly
       malloc()ed memory block.  */
    extern char **backtrace_symbols (void *const *__array, int __size)
         __THROW __nonnull ((1));
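
    为了更直观一点,下面是我自己写的一个最小示意程序(不是glibc或llama.cpp里的代码),演示这两个函数的用法;想在输出里看到函数名,链接时一般还要加上-rdynamic:
    
    // 编译: gcc -rdynamic bt_demo.c -o bt_demo
    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    static void dump_stack(void) {
        void *addrs[64];
        int n = backtrace(addrs, 64);               // 收集至多64个返回地址
        char **names = backtrace_symbols(addrs, n); // 返回malloc出来的字符串数组
        if (names == NULL) {
            perror("backtrace_symbols");
            return;
        }
        for (int i = 0; i < n; i++) {
            fprintf(stderr, "#%d %s\n", i, names[i]);
        }
        free(names);
    }
    
    static void inner(void) { dump_stack(); }
    
    int main(void) {
        inner();
        return 0;
    }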
    
    注意这里有一个宏__nonnull
    
    /* The nonnull function attribute marks pointer parameters that
       must not be NULL.  This has the name __nonnull in glibc,
       and __attribute_nonnull__ in files shared with Gnulib to avoid
       collision with a different __nonnull in DragonFlyBSD 5.9.  */
    #ifndef __attribute_nonnull__
    # if __GNUC_PREREQ (3,3) || __glibc_has_attribute (__nonnull__)
    #  define __attribute_nonnull__(params) __attribute__ ((__nonnull__ params))
    # else
    #  define __attribute_nonnull__(params)
    # endif
    #endif
    #ifndef __nonnull
    # define __nonnull(params) __attribute_nonnull__ (params)
    #endif
    
    而这个宏的作用是编译期的检查,gemini解释说这个是一个extension
    
    1. __nonnull ((1)) Attribute:
    
        Purpose: This attribute tells the compiler that the first parameter (__array) of the backtrace function must not be a null pointer.
        Mechanism: It leverages the __attribute__ ((__nonnull__ params)) GCC compiler extension.
        params: The (1) within the __nonnull ((1)) specifies that the first parameter (parameter index 1) is the one that should not be null. If there were multiple non-null parameters, you could list their indices (e.g., __nonnull ((1, 3))).
    
    2. Macro Expansion:
    
        __nonnull(params): This macro simply expands to __attribute_nonnull__ (params).
        __attribute_nonnull__(params): This macro checks if the compiler supports the __nonnull__ attribute:
            If using GCC 3.3 or later, or if the glibc header indicates __nonnull__ support, it expands to __attribute__ ((__nonnull__ params)).
            Otherwise, it expands to nothing (effectively removing the attribute).
    
    3. Compiler's Role:
    
        Compile-Time Warnings: When the __nonnull__ attribute is active, the compiler can generate warnings if it detects that the function is called with a null pointer as the first argument.
        Optimization: The compiler might also use this information to optimize code, as it can assume that the parameter is never null.
    
    所以,这里的宏是可以使用从1开始计数的参数位置来强制编译器检查参数不为空!神奇啊!一天学习一点就不得了了,何况学习了这么多!
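
    为了直观感受一下,下面是我自己写的一个小实验(my_check这个函数名是随便取的,不是glibc里的),用同样的属性就能让gcc/clang在编译期对传入NULL给出警告:
    
    // 编译: gcc -Wall -c nonnull_demo.c 会在caller()处报-Wnonnull警告
    #include <stddef.h>
    
    // 声明:第1个参数不允许为NULL(GCC扩展,效果等同于glibc的__nonnull((1)))
    extern int my_check(void **array, int size) __attribute__((__nonnull__(1)));
    
    int my_check(void **array, int size) {
        (void)array;
        return size;
    }
    
    int caller(void) {
        return my_check(NULL, 16);   // 这里就是编译器要抓的空指针实参
    }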

三月十三日 等待变化等待机会

  1. 我把服务器内存加到最大的64G*24=1536G,但是llama.cpp始终不能使用hugepages。我设了一些断点,发现似乎还是mmap的问题:尽管我分配了hugepage,但是/proc/meminfo里显示始终没有被使用。直到我看到manpage里mmap有一个flag是MAP_HUGETLB,可是添加这个标志后始终报错说是参数有误。这里是内核关于hugetlbpage的文档。我为了验证这个问题写了一个极其简单的测试例。这里还遇到一个令人耻笑的问题:如果不包含最基本的string.h头文件,那么strerror就会被编译器当成隐式声明的外部函数而认为返回int,这是写惯了C++的人忘记了C语言的古老规矩。然后gemini就各种分析,先是说我需要mount一个hugetlbfs的mountpoint,而且我的文件必须在这个文件系统里:sudo mount -t hugetlbfs none /mnt/hugepages,这倒是说得通。可是我连写文件都会失败,哪怕为了对齐2M使用sudo dd if=/dev/zero of=/mnt/hugepages/testfile bs=2M count=YOUR_COUNT
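    
    事后我把这个测试的意思重写成下面这个示意(不是当时的原文件),用它可以区分问题到底出在哪:匿名映射加MAP_HUGETLB在预留了hugepage之后应该成功;而对普通文件(非hugetlbfs)加MAP_HUGETLB,按我后来的理解内核会直接返回EINVAL,也就是我看到的"参数有误":
    
    // 前提: 已经预留2M hugepage,例如 echo 64 | sudo tee /proc/sys/vm/nr_hugepages
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>      // 少了它,strerror会被当成隐式声明、返回int
    #include <sys/mman.h>
    #include <unistd.h>
    
    int main(void) {
        size_t len = 2UL * 1024 * 1024;   // 2M,必须是hugepage大小的整数倍
    
        // 1) 匿名映射 + MAP_HUGETLB:有预留的hugepage时应当成功
        void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        printf("anonymous MAP_HUGETLB: %s\n",
               a == MAP_FAILED ? strerror(errno) : "ok");
    
        // 2) 普通文件(非hugetlbfs) + MAP_HUGETLB:预期得到EINVAL
        int fd = open("/tmp/not-hugetlbfs", O_CREAT | O_RDWR, 0666);
        if (fd < 0) { perror("open"); return 1; }
        void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_HUGETLB, fd, 0);
        printf("regular file + MAP_HUGETLB: %s\n",
               b == MAP_FAILED ? strerror(errno) : "ok");
    
        if (a != MAP_FAILED) munmap(a, len);
        if (b != MAP_FAILED) munmap(b, len);
        close(fd);
        return 0;
    }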
  2. 内核文档指向了一个库libhugetlbfs,我想从中寻找一些启发。可是第一步编译就抱怨链接时-lgcc找不到,在排除了需要直接链接libgcc_s.so.1(也就是libgcc-s1这个包)之后,我意识到这个库默认是面向多平台编译的,我不应该费心去搞什么multiarch之类的,就只编译本平台:make BUILDTYPE=NATIVEONLY V=1

三月十四日 等待变化等待机会

  1. 很多时候就是一个再学习的过程。weak_alias这种GCC的属性我以前也接触过,但是总感觉很复杂,也不太理解真实的用途,只知道可以避免链接过程中的符号冲突,但它真正的用途其实是一种保持兼容的手段。这个例子就很简明扼要:
    
    // Original function
    void real_function() {
        // ... implementation ...
    }
    
    // Weak alias
    void old_function() __attribute__ ((weak, alias ("real_function")));
    
    • In this example, old_function is a weak alias for real_function.
    • If a user defines a strong symbol named old_function, that definition will be used.
    • Otherwise, real_function will be used.
  2. Dell的R720XD服务器的RAID卡和我之前遇到的asus的BIOS有相似之处就是有些功能是不能在utility里做的,就是说要在BIOS的菜单才行,似乎这个Ctrl+R呼唤起来的RAID utility程序有些功能是做不了的,比如创建所谓的Virtual Disk之类的,或者把Physical Disk添加到VD里,这个让我想起了以前的工作,很多的硬件开发是非常困难的,而软件的不一致性让人非常的抓狂。而做一个RAID0的时间也许需要三天!而PD仅仅是2T,这个简直把人逼疯了。我还有几块旧的SATA硬盘要加上去真的不知道要怎么办?
  3. 我实验了这个库libhugetlbfs似乎不是很成熟,至少glibc/kernel也许不是很买账吧,因为我在内核log里看到很多的crash,我也不知道要怎么办?我很怀疑豆包/gemini给出的建议是否有实用性,也许压根没有被普遍的使用过?也许是服务器有超大内存时候才这么使用的?我再一次阅读它的命令行工具结合gemini的讲解,似乎有一点点开窍了,它的用法是这样子的:
    
    The Core Misunderstanding:
    
        hugetlbfs Is Not a Regular File System:
            This is the most crucial point. hugetlbfs is fundamentally different from regular file systems like ext4 or xfs.
            It's a memory-backed file system, meaning that files within it reside entirely in RAM (specifically, HugePages).
            You do not copy files to hugetlbfs in the same way you copy them to a disk-based file system.
    
    The Correct Usage of hugetlbfs:
    
        Allocation Within hugetlbfs:
            The primary purpose of hugetlbfs is to provide a mechanism for applications to allocate memory directly from HugePages.
            You create files within hugetlbfs to reserve and manage regions of HugePages.
            The files themselves do not contain data in the traditional sense; they represent memory regions.
        mmap and hugetlbfs:
            The typical usage pattern is to:
                Create a file within hugetlbfs with a size that matches the desired HugePages allocation.
                mmap that file with the MAP_HUGETLB flag.
            This establishes a direct mapping between the application's virtual memory and the allocated HugePages.
        Data Population:
            After mmap, you can then populate the mapped memory region with data.
            This data might come from another file, from network input, or be generated by the application.
            The important thing is that the data resides in RAM, specifically in the HugePages allocated through hugetlbfs.
    
    Why Copying Fails:
    
        No Disk Storage:
            hugetlbfs does not have an underlying disk storage mechanism.
            When you try to copy a file, the cp command attempts to write data to disk, which is not possible.
        Memory Allocation, Not File Copying:
            The kernel interprets file creation and writes within hugetlbfs as memory allocation requests.
            If the allocation fails (e.g., due to insufficient or non-contiguous HugePages), you'll get an error.
    
    那么这样意味着我如果要使用这个功能,我需要先把巨大的模型文件搬运到那个mount里,而且还不是简单文件拷贝,这个还真的不容易啊!
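    
    按照这个说法,大致的用法就是在hugetlbfs里"建文件+mmap",然后再把普通文件的数据read()进这块映射。下面是一个示意(假设/mnt/hugepages已经mount成hugetlbfs,并且预留了足够的2M hugepage):
    
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst-in-hugetlbfs>\n", argv[0]);
            return 1;
        }
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_CREAT | O_RDWR | O_EXCL, 0666); // dst必须位于hugetlbfs挂载点下
        if (src < 0 || dst < 0) { perror("open"); return 1; }
    
        struct stat st;
        if (fstat(src, &st) != 0) { perror("fstat"); return 1; }
        size_t huge = 2UL * 1024 * 1024;                            // 假设2M hugepage
        size_t len = ((size_t)st.st_size + huge - 1) & ~(huge - 1); // 向上对齐到hugepage
    
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_HUGETLB, dst, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
    
        // 把数据"填"进hugepage;严格说应该循环读满,这里示意从简
        if (read(src, p, (size_t)st.st_size) < 0) { perror("read"); return 1; }
    
        munmap(p, len);
        close(src);
        close(dst);
        return 0;
    }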

三月十五日 等待变化等待机会

  1. 说来惭愧,我白在存储公司呆过,结果连RAID0是怎么回事都理解错了,添加新硬盘当然会导致原有数据丢失啊,因为它是分片写到多块磁盘啊!结果就是要重装系统,我费了很大劲才明白,后面的SD卡插槽不是可以被引导的,只是用来存储报告或者更新固件之类的,外接USB也是如此,只有主板上的USB才能被引导,而且没有插上前BIOS也不会把这个选项暴露给你。这个和iDrac直接给你设置原理也是不同,应该是ipmi的boot next option之类的。总而言之,我以前天天干这个就忘了,而且现在也没有地方去找java1.6导致我无法正确运行javaws的.jnlp文件。这个远程安装还无法做。

三月十六日 等待变化等待机会

  1. 安装操作系统如果不是把/boot创建一个单独的分区往往会失败的原因在哪里?这个问题,我以前似懂非懂,表面上理解实际上还是有些模糊。就是说UEFI首先查找的是EFI分区,找到所谓的/EFI目录下的各个boot manager,比如ubuntu下有自己的efi实现和相应的grub.cfg。
    
    $ ls /boot/efi/EFI/ubuntu/
    BOOTX64.CSV  grub.cfg  grubx64.efi  mmx64.efi  shimx64.efi
    $ cat /boot/efi/EFI/ubuntu/grub.cfg
    search.fs_uuid f00dad3b-90a9-4155-aa92-0077432099f6 root hd0,gpt2 
    set prefix=($root)'/grub'
    configfile $prefix/grub.cfg
    $ blkid
    /dev/mapper/ubuntu--vg-lv--0: UUID="8e86f709-3b0d-40a2-8f9f-e9878a74a1f2" BLOCK_SIZE="4096" TYPE="ext4"
    /dev/mapper/ubuntu--vg-ubuntu--lv: UUID="34d3a3d3-29b1-4075-bf5e-707faa840b24" BLOCK_SIZE="4096" TYPE="ext4"
    /dev/sda2: UUID="f00dad3b-90a9-4155-aa92-0077432099f6" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="723d5e62-c97d-44a0-80b7-dd18f67eab4b"
    /dev/sda3: UUID="FrVw7a-ePu5-VBY8-H1MY-z2aC-3ZNO-PcRCm2" TYPE="LVM2_member" PARTUUID="df45327f-5439-4cba-8347-065ce38e7ce2"
    /dev/sda1: UUID="DE16-179A" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="8f734327-9435-42a0-8d1e-e810171e6f76"

三月十七日 等待变化等待机会

  1. 我发现还是gemini厉害,编程的水平高!比如我的这个简单的程序还是有不少的错误,都是在gemini的帮助下完成的。 这里有几个要点:
    1. 首先,现代操作系统默认就支持大文件,也就是说stat返回的文件大小off_t默认就是64bit,而我还抱着十几年前的老经验,认为需要定义一个宏来打开这个选项。但是这里有一个关键:off_t的类型并不是我想当然的int64_t而是long int,这让我吃惊并引发了一系列的混乱,尽管gemini正确地指出了这一点,我始终认为是大文件开关需要打开。这是不对的,因为为了兼容老代码,系统故意把off_t做成了long int,但是背地里它就是8个字节:
      
      static_assert(sizeof(off_t) == 8);
      
      这个放在代码里让我安心!
    2. 其次是一个关键点,就是mmap映射的内存的属性必须是MAP_SHARED|MAP_HUGETLB,我之前错误地以为可以使用MAP_PRIVATE,这是违反逻辑的,因为做这个映射就是为了共享,否则就没有意义了,原因是这是一个文件系统上的映射,而不是简单的进程内私有内存的使用。
    3. 然后这个是基本功的缺失:创建文件的标志是什么?我不想覆盖现有文件,那么就应该是O_EXCL;而我要创建新文件就要O_CREAT;这个文件我必须要读写,那么就是O_RDWR。可是创建好的文件给谁用呢?这里涉及umask,我居然忘记了,所以mode应该是0666
      
      tgt_fd = open(argv[2], O_CREAT|O_RDWR|O_EXCL, 0666);
      ...
      void* ptr = mmap(NULL, tgtSize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, tgt_fd, 0);
      
    4. 对于大多数程序员来说也许read返回0或者小于要求的数目就应该判断为错误,可是这个是不对的,因为
      
      		On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by
             this number.  It is not an error if this number is smaller than the number of bytes requested; this may happen  for
             example  because  fewer  bytes are actually available right now (maybe because we were close to end-of-file, or be‐
             cause we are reading from a pipe, or from a terminal), or because read() was interrupted by  a  signal.   See  also
             NOTES.
      
             On  error,  -1  is returned, and errno is set appropriately.  In this case, it is left unspecified whether the file
             position (if any) changes.
      
      不过我目前不想去写一个while loop去把这部分读的过程完善。以后再写吧。
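      
      这里先放一个最简单的"读满"循环的草稿(我自己的示意,处理短读和EINTR):
      
      // 把count个字节从fd读进buf;返回0表示成功,-1表示出错或提前遇到EOF
      #include <errno.h>
      #include <unistd.h>
      
      static int read_full(int fd, void *buf, size_t count) {
          char *p = (char *)buf;
          while (count > 0) {
              ssize_t n = read(fd, p, count);
              if (n < 0) {
                  if (errno == EINTR) continue;   // 被信号打断,重试
                  return -1;                      // 真正的错误
              }
              if (n == 0) return -1;              // 提前遇到EOF
              p += n;
              count -= (size_t)n;
          }
          return 0;
      }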
    5. 这个代码几乎是纯C的,所以,g++/gcc都无所谓的编译。

三月十八日 等待变化等待机会

  1. 创建一个简单的repo
    
    echo "# hugecp" >> README.md
    git init
    git add README.md
    git commit -m "first commit"
    git branch -M main
    git remote add origin https://github.com/nickhuang99/hugecp.git
    git push -u origin main
    
  2. 给llama.cpp提建议 这个patch其实非常的简单,但是要集成加一个参数开关,我本来想提交后来感觉挺啰嗦的,因为这个功能也许很多人都不需要吧?
  3. 制作了一个粗糙的英文视频和中文视频,这里是B站中文镜像和B站英文镜像
  4. 要求增加libhugetlbfs一个功能
  5. 录屏功能还是Kazam好用,simplescreenrecorder的麦克风噪音太大了!

三月十九日 等待变化等待机会

  1. 总是忘记stack overflow的教训:当你在stack上定义一个大数组,比如我把一个1G的hugepage定义成局部数组,那stack当然overflow了。
  2. 发现ollama的deepseek-r1:671b大模型原来是已经4bit量化过的。
    
    Yes, the 671B model in your llama.cpp output is indeed quantized. Here's how we can tell:
    
        File Type: The line print_info: file type = Q4_K - Medium indicates that the model has been quantized to the Q4_K format, specifically the Medium variant.   
    
  3. 这篇文章值得学习。
  4. 录了一个演示的视频。这个是B站的视频镜像
  5. 在huggingface上只能使用ssh来下载git clone。我必须先上传ssh-key。但是下载有问题
    
    GIT_LFS_SKIP_SMUDGE=1 git clone git@hf.co:deepseek-ai/DeepSeek-R1
    
  6. 这个是询问deepseek的关于模型加载的一些数据
    
    ### 关键时间消耗点
    1. **模型加载时间(主要瓶颈)**:
       - 若使用 **4-bit量化**(约335GB):
         - **NVMe SSD(3.5GB/s)**:约 **2分钟**
         - **SATA SSD(500MB/s)**:约 **11分钟**
         - **HDD(150MB/s)**:约 **38分钟**
       - 若使用 **未量化FP16模型(1.3TB)**:
         - 您的内存容量(1.5TB)刚好满足,但需要更长时间加载
    
    2. **内存初始化与预热**:
       - 约 **1-5分钟**(取决于CPU并行化能力)
    
    3. **上下文初始化**:
       - 约 **30秒-2分钟**
    
    ### 总预估时间范围
    - **最佳场景**(NVMe SSD + 4-bit量化):**3-7分钟**
    - **典型场景**(SATA SSD + 4-bit量化):**12-18分钟**
    - **最差场景**(HDD + FP16模型):**40分钟+**
    
    ### 优化建议
    1. **强制量化**:务必使用 `--n-gpu-layers 0` 禁用GPU加速
    2. **内存预分配**:添加 `--mlock` 参数锁定内存
    3. **并行加载**:设置 `--threads N`(N=物理核心数)
    4. **使用mmap加速**:通过 `--mmap` 启用内存映射
    
  7. 但是

三月二十日 等待变化等待机会

  1. 我按照玄武实验室的说法只用一个CPU和邻近的内存来跑
    
    $ sudo numastat -p 7188
    Per-node process memory usage (in MBs) for PID 7188 (llama-cli)
                               Node 0          Node 1           Total
                      --------------- --------------- ---------------
    Huge                    386048.00            0.00       386048.00
    Heap                        40.40            0.00           40.40
    Stack                        0.04            0.00            0.04
    Private                  19582.44            4.27        19586.71
    ----------------  --------------- --------------- ---------------
    Total                   405670.88            4.27       405675.15
    
    内存也是使用numa来绑定的
    
    $ numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
    node 0 size: 774008 MB
    node 0 free: 318183 MB
    node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
    node 1 size: 774086 MB
    node 1 free: 1564 MB
    node distances:
    node   0   1 
      0:  10  20 
      1:  20  10 
    
    可是效果并不明显。
  2. 让gemini写了一个进度条的函数加载时候可以看看进度。
    
    void update_progress(int progress) {
            int bar_length = 40; // Modify this to change the bar's length
            int filled_length = (int)(bar_length * progress / 100.0);
            char bar[bar_length + 1]; // +1 for the null terminator
            for (int i = 0; i < bar_length; i++) {
                    if (i < filled_length) {
                            bar[i] = '=';
                    } else {
                            bar[i] = '-';
                    }
            }
            bar[bar_length] = '\0'; // Null-terminate the string
            printf("\r[%s] %d%%", bar, progress);
            fflush(stdout); // Ensure output is written immediately
    }
    
    另一个好处是我现在如果要绑定numa,我直接从现有的hugetlbfs里拷贝速度飞快。
    
    $ ~/Downloads/hugecp /mnt/hugepages/deepseek-r1-numa-node0 /mnt/hugepages/deepseek-r1-no-numa
    [===-------------------------------------] 8%
    

三月二十一日 等待变化等待机会

  1. 这么长时间才知道ollama的DS模型是Q4_K量化的模型,只好来huggingface下载Q8的模型,至少要好一些。但是huggingface对中国是处于封锁状态的,无意中发现有一个中国的镜像。这是下载的全部步骤,第一步git clone一定要用GIT_LFS_SKIP_SMUDGE=1跳过LFS大文件,否则就一直结束不了。
    
    GIT_LFS_SKIP_SMUDGE=1 git clone https://cnb.cool/ai-models/deepseek-ai/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0.git
    cd DeepSeek-R1-Q8_0/
    git lfs pull
    
    当然前提是你要安装git lfs。 而它的原生的没有量化的模型是在https://cnb.cool/ai-models/deepseek-ai/DeepSeek-R1.git,我也想实验一下这个。
  2. 反复看到这个YaRN,不明白什么意思,下载一篇论文《YaRN: Efficient Context Window Extension of Large Language Models》看看。首先是这个perplexity的概念,我发现DeepSeek的解释更好一些,因为我可以拷贝原文

    Perplexity is a measurement used primarily in the fields of natural language processing (NLP) and information theory to evaluate how well a probability model predicts a sample. In simpler terms, it assesses how "perplexed" or uncertain a model is when making predictions.

    Key Points:

    1. In NLP: Perplexity is commonly used to evaluate language models. It measures how well the model predicts a sequence of words. A lower perplexity indicates that the model is more confident and accurate in its predictions, while a higher perplexity suggests greater uncertainty.

    2. Mathematical Definition: For a language model, perplexity is defined as the exponential of the cross-entropy loss. If the model assigns a probability $P(w_1, w_2, ..., w_n)$ to a sequence of words, the perplexity is:

      $\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i | w_1, w_2, ..., w_{i-1})}$

      Here, $N$ is the number of words in the sequence.

    3. Interpretation: A perplexity of $k$ means that the model is as confused as if it had to choose uniformly and independently among $k$ possibilities. For example, a perplexity of 1 means the model is perfectly confident (ideal case), while higher values indicate worse performance.

    4. Applications: Perplexity is widely used to compare different language models, tune hyperparameters, and assess the quality of text generation systems.

    In summary, perplexity is a useful metric for understanding the performance of probabilistic models, especially in NLP, where it quantifies how well a model predicts a given sequence of words.

    这里对这个公式简单理解一下:如果你每次对下一个词预测的概率都是100%,那么Perplexity就是1,因为指数上的总和是0;而信心来自于概率大小,比如每次都只有50%的话,指数上的平均值就是1,perplexity就是2。所以说perplexity越大,信心越差。另一个概念是sinusoidal position,以下是DeepSeek的解释:
    "Sinusoidal position" typically refers to a method of encoding positional information in sequences, commonly used in neural networks, particularly in transformer models like those in natural language processing (NLP). This technique is known as sinusoidal positional encoding.

    Key Points:

    1. Purpose:

      • In models like transformers, the input data (e.g., words in a sentence) is processed in parallel rather than sequentially. This means the model doesn't inherently know the order of elements in the sequence.

      • Sinusoidal positional encoding is used to inject information about the position of each element in the sequence into the model, helping it understand the order of elements.

    2. How It Works:

      • For each position $pos$ in the sequence and each dimension $i$ of the embedding, a unique sinusoidal function is applied:

        $\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$
        $\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$

        Here, $d$ is the dimensionality of the embedding, and $pos$ is the position in the sequence.

      • This creates a unique pattern for each position, allowing the model to distinguish between different positions in the sequence.

    3. Properties:

      • Uniqueness: Each position has a unique encoding.

      • Relative Position: The sinusoidal nature of the encoding allows the model to learn relative positions more easily, as the encoding for a position $pos + k$ can be derived from the encoding for position $pos$ using trigonometric identities.

      • Smoothness: The encoding varies smoothly across positions, which helps the model generalize better.

    4. Applications:

      • Sinusoidal positional encoding is widely used in transformer-based models like BERT, GPT, and others for tasks such as machine translation, text generation, and more.

    Example:

    For a sequence of words in a sentence, each word's embedding is combined with its sinusoidal positional encoding to provide the model with both the word's meaning and its position in the sentence.

    In summary, sinusoidal positional encoding is a technique to provide positional information to models that process sequences in parallel, ensuring they can understand and utilize the order of elements in the sequence.

  3. 读AI的论文真的是很难,因为每一个段落都有一大堆的基本概念要学习。而每一个基本概念又是一大堆的数学公式。这里列举了一系列的相对位置编码 Currently, the most popular relative positional encodings are T5 Relative Bias [30], RoPE [34], XPos [35], and ALiBi都附有论文,如果要去读的话。感觉这个RoPE似乎见过,不妨下载《ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING》看看。

三月二十二日 等待变化等待机会

  1. 我不确定deepseek的模型加载函数以及执行的模型计算图是否就是这个所谓的llm_build_deepseek2,我决定在它的callback里添加gperftools的CPU profiler,这是谷歌的工具:
    
    sudo apt install libgoogle-perftools-dev
    
    在代码里需要头文件#include <gperftools/profiler.h>于是我在graph_get_cb的最后添加
    
    	ProfilerStop();
    	ProfilerStart(name);
    
    因为每一个计算函数都在最后调用回调函数,希望这个能够给我一个概念。 但是编译过程需要添加新的链接,我尝试在cmake里添加这些变量cmake .. -DCMAKE_LINK_LIBRARY_FLAG="-lprofiler" -DCMAKE_SHARED_LINKER_FLAGS="-lprofiler",似乎不行,我准备再添加
    
    CMAKE_EXE_LINKER_FLAGS,CMAKE_MODULE_LINKER_FLAGS,CMAKE_STATIC_LINKER_FLAGS
    
    为了debug,我需要打印所有的编译命令,就使用-DCMAKE_VERBOSE_MAKEFILE=on。询问DeepSeek才意识到问题,因为我以前总是遇到这种问题一直也没有很好的解决方案。
    1. 比如llama.cpp里有一个动态库libllama-mmap.so,而我在其中添加了ProfilerStart,虽然在编译这个动态库的时候我有使用-lprofiler,可是链接器作为优化(as-needed)有可能认为不需要这个依赖。于是在ldd libllama-mmap.so里你看不到libprofiler.so的依赖,那么接下来其他链接libllama-mmap.so的llama.cpp可执行程序就会报错,因为libprofiler不在依赖的列表里。
    2. 解决的办法是强制的需求,这个是DS的建议:-Wl,--no-as-needed -lprofiler -Wl,--as-needed,这样子不管编译器如何优化先加进去再说。注意这个是动态库的语法,静态库不能这样子。
      
      $ cmake .. -DCMAKE_LINK_LIBRARY_FLAG="-lprofiler" -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--no-as-needed -lprofiler -Wl,--as-needed" -DCMAKE_STATIC_LINKER_FLAGS="-lprofiler" -DCMAKE_EXE_LINKER_FLAGS="-lprofiler" -DCMAKE_VERBOSE_MAKEFILE=on
      
    3. 不去修改源代码的CMakeLists.txt的话,我添加的都是所谓的"global flag",而它们是放在命令开头的,链接器却是从左到右按照需求来处理的,于是我的flag被忽视了。怎么办呢?
    4. 另一个思路是使用CXXFLAGS,但是因为是cmake产生的Makefile,这个必须在cmake命令行里添加:cmake .. -DCMAKE_CXX_FLAGS="-lprofiler",可是这个做法依旧是把标志放在第一位的global的flag。
    5. 最后只能去修改造成问题的那个可执行文件的CMakeList.txt就是examples/main/CMakeLists.txt在其中的添加了profiler
      
      target_link_libraries(${TARGET} PRIVATE common llama profiler ${CMAKE_THREAD_LIBS_INIT})
      
      注意这里是动态库的名字,系统会根据平台编译器产生正确的标志。
    但是当我开始运行这个profiler的时候,我却发现它的CPU profile输出不断打印,真的是很讨厌,也关不掉。一怒之下决定放弃,因为我看llama.cpp自身也有性能检测,不如使用原生的。白忙了一上午。

三月二十三日 等待变化等待机会

  1. 花了非常多的精力才提交这个hugepage的PR
    1. 使用llama.cpp的命令行参数实在是太复杂了,要改很多地方,我改到一半就放弃了。
    2. 它的这个参数还是很好的一个设计 那么具体怎么使用呢? 这个做法非常的值得学习。
    3. 那么在CMakeLists.txt里如何添加一个新的开关呢?
      
      if (GGML_USING_HUGE_PAGE_1G)
          message(STATUS " GGML_USING_HUGE_PAGE_1G is set in cmake")
          target_compile_definitions(llama PRIVATE GGML_USING_HUGE_PAGE_1G=1)
      endif()
      
      这样子在运行cmake .. -DGGML_USING_HUGE_PAGE_1G=1就可以打开这个开关,使用message(STATUS " GGML_USING_HUGE_PAGE_1G is set in cmake")可以来调试。target_compile_definitions需要指定目标文件,这里的llama是定义的动态库。
    4. 在代码里也可以使用#pragma message("这个可以用来调试宏")
    5. 在make里使用VERBOSE=1来查看编译命令,或者现在有新的项目直接产生compile_commands.json
  2. 关于pagesize对齐的问题,我觉得不加思索地使用(size + page_size - 1) & ~(page_size - 1)是会害死人的,因为它只在page_size是2的幂时才成立!DeepSeek肯定了这一点!

    Why the Formula Works for Powers of Two

    The formula (size + page_size - 1) & ~(page_size - 1) works because:

    1. page_size is a power of two (e.g., 4096, 8192).

    2. ~(page_size - 1) creates a bitmask with all higher bits set to 1 and the lower bits set to 0.

    3. The bitwise AND (&) operation effectively rounds down to the nearest multiple of page_size.

    For example, if page_size = 4096 (0x1000 in hex):

    • page_size - 1 = 4095 (0xFFF in hex).

    • ~(page_size - 1) = 0xFFFFF000 (all bits set except the lower 12 bits).

    • The formula rounds up to the nearest multiple of 4096.
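
    用一个小例子验证一下,顺便说明为什么说它"害死人":page_size不是2的幂时位运算的结果悄悄就错了,更保险的是用除法取整(下面是我自己的演示代码):
    
    #include <stdio.h>
    #include <stddef.h>
    
    // 位运算版:只有page_size是2的幂时才正确
    static size_t align_up_mask(size_t size, size_t page_size) {
        return (size + page_size - 1) & ~(page_size - 1);
    }
    
    // 除法版:对任意正的page_size都正确
    static size_t align_up_div(size_t size, size_t page_size) {
        return ((size + page_size - 1) / page_size) * page_size;
    }
    
    int main(void) {
        // 4096是2的幂:两种写法一致,5000向上对齐到8192
        printf("%zu %zu\n", align_up_mask(5000, 4096), align_up_div(5000, 4096));
        // 3000不是2的幂:位运算版给出5128这种不是3000倍数的结果,除法版给出6000
        printf("%zu %zu\n", align_up_mask(5000, 3000), align_up_div(5000, 3000));
        return 0;
    }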


三月二十四日 等待变化等待机会

  1. 更新了我的hugecp工具,但是我测试后才意识到如果是safetensor文件我还需要先把它转化为pytorch文件格式或者说GGUF。DeepSeek说
    
    pip install torch safetensors transformers
    
  2. 我当前从llama.cpp的github里fork了一个repo,但是我本机依然想要更新代码,要怎么做呢? DeepSeek说我可以指向upstream
    
    git remote add upstream https://github.com/ggml-org/llama.cpp
    $ git remote -vv
    origin	https://github.com/nickhuang99/llama.cpp (fetch)
    origin	https://github.com/nickhuang99/llama.cpp (push)
    upstream	https://github.com/ggml-org/llama.cpp (fetch)
    upstream	https://github.com/ggml-org/llama.cpp (push)
    
    这样子的话,我可以更新代码$ git pull upstream master

三月二十五日 等待变化等待机会

  1. 我偶然发现一个有趣的项目llama.c,它背后是一个小故事(TinyStories),真的是有趣的小故事。
  2. 作者提到了这篇论文《Training Compute-Optimal Large Language Models》
  3. 总是coredump怎么办,第一步是编译debug类型:cmake .. -DCMAKE_BUILD_TYPE=Debug,然后要激活ptrace,就要echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

三月二十六日 等待变化等待机会

  1. 我差点忘了提到使用1G的hugepage似乎和2M有很大的不同,因为在内核启动参数必须明确这三个参数,而且是缺一不可的,
    
    default_hugepagesz=1G hugepagesz=1G hugepages=671
    
    也就是说,对于默认的2M的hugepage,你完全可以使用sysctl达到动态分配。
  2. 手机上安装的termux,我想和手机上的其他应用共享文件,需要先执行termux-setup-storage,这样就创建了所谓的storage/shared,可以访问所有其他程序共享的目录了。
  3. 浏览别人的PR,就有一种自惭形秽的感觉,这个改进完全够得上是一篇复杂的wiki文章了!更不要说理解它的作用了,我根本不明白,真正的不明觉厉!
  4. 昨天提交的bug,大神一晚上就修正了。我是一点也不明白为什么这么改,目前仅仅理解了在类似训练场景下batch的数目不能少于context。至于为什么也不理解。我觉得我应该还是回到llama.c来学习,因为这个更加的简单。
  5. 我一开始测试一个5G的模型文件使用我的2M的hugepage,结果发现似乎毫无改进,本来以为有什么bug,后来才想明白:因为5G不到的内存用2M的hugepage也就需要两三千个page,TLB的加成是很少的;而一个4K的页表页可以容纳大约500个页表项,那么最多十几个页表页的查找很可能是无关痛痒的,只有当page数达到几十万个才会有很大的改进吧?而且和内存的速度也有关系吧?对于老的机器,内存速度慢,于是内存访问反而成为瓶颈,所以有人认为大模型推理是一个memory-bound的程序,计算反而不突出了。这是我的解释。
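    
    粗略算一下页数的数量级(以这个5G左右的小文件和前面那个376.65 GiB的Q4_K大模型为例,只是我自己的估算):
    
    $\dfrac{5\ \text{GiB}}{2\ \text{MiB}}\approx 2.6\times 10^{3}\ \text{页},\qquad \dfrac{376.65\ \text{GiB}}{4\ \text{KiB}}\approx 9.9\times 10^{7}\ \text{页},\qquad \dfrac{376.65\ \text{GiB}}{2\ \text{MiB}}\approx 1.9\times 10^{5}\ \text{页},\qquad \dfrac{376.65\ \text{GiB}}{1\ \text{GiB}}\approx 377\ \text{页}$
    
    页数相差好几个数量级,TLB未命中的代价也就相应地被放大或缩小,这和上面"几千个page看不出效果"的观察是一致的。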

三月二十七日 等待变化等待机会

  1. llama.c的最最基本的维度就是这几个参数:
    
    typedef struct {
        int dim; // transformer dimension
        int hidden_dim; // for ffn layers
        int n_layers; // number of layers
        int n_heads; // number of query heads
        int n_kv_heads; // number of key/value heads (can be < query heads because of multiquery)
        int vocab_size; // vocabulary size, usually 256 (byte-level)
        int seq_len; // max sequence length
    } Config;
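    
    顺手算一下由这些维度派生出来的几个量(head_size和kv_dim的算式是按我对llama.c里run.c的理解写的,具体数字只是随便举例):
    
    #include <stdio.h>
    
    int main(void) {
        // 随手举的一组小数字,不针对任何具体模型
        int dim = 288, n_heads = 6, n_kv_heads = 6, n_layers = 6;
        int head_size = dim / n_heads;               // 每个注意力头的宽度
        int kv_dim = (dim * n_kv_heads) / n_heads;   // key/value的总宽度(多查询时会小于dim)
        printf("head_size=%d kv_dim=%d\n", head_size, kv_dim);
        // 每个token在KV cache里占用的float个数(key和value各kv_dim,乘以层数)
        printf("KV floats per token = %d\n", 2 * kv_dim * n_layers);
        return 0;
    }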
    

三月二十九日 等待变化等待机会

  1. pip install torch总是失败,问了豆包可以使用国内的镜像:
    
    pip install torch torchvision torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple
    
    结果安装版本找不到,只好强制版本:
    
    $ pip install torchvision==0.21.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
    
  2. 豆包就是我的良师益友,一个无比耐心的导师,对我的任何问题都是有问必答,这个是关于量化的算法的讨论实验,我最后很高兴的发现其实我的想法和llama.cpp的做法是一模一样的。看来我并非是胡思乱想,但凡有一丁点想法的人都能想到这个。

三月三十日 等待变化等待机会

  • 原来llama.cpp背后是一个ggml,这个是一个不大的库
    1. 学习计算机二十几年居然不了解浮点数的存储格式是不是丢人呢?所以,我今天第一次听说了谷歌Brain团队定义的bf16格式的浮点数(brain float16),它的好处倒是很容易理解:用损失一些尾数精度来换取和fp32一样的指数范围,从而不容易溢出。
      
      /**
       * Converts brain16 to float32.
       *
       * The bfloat16 floating point format has the following structure:
       *
       *       ┌sign
       *       │
       *       │   ┌exponent
       *       │   │
       *       │   │      ┌mantissa
       *       │   │      │
       *       │┌──┴───┐┌─┴───┐
       *     0b0000000000000000 brain16
       *
       * Since bf16 has the same number of exponent bits as a 32bit float,
       * encoding and decoding numbers becomes relatively straightforward.
       *
       *       ┌sign
       *       │
       *       │   ┌exponent
       *       │   │
       *       │   │      ┌mantissa
       *       │   │      │
       *       │┌──┴───┐┌─┴───────────────────┐
       *     0b00000000000000000000000000000000 IEEE binary32
       *
       * For comparison, the standard fp16 format has fewer exponent bits.
       *
       *       ┌sign
       *       │
       *       │  ┌exponent
       *       │  │
       *       │  │    ┌mantissa
       *       │  │    │
       *       │┌─┴─┐┌─┴──────┐
       *     0b0000000000000000 IEEE binary16
       *
       * @see IEEE 754-2008
       */
      
      明白了这个存储就理解它的转化方式的代码了,使用union确实是最好的一个方式,只不过我现在不太清楚这个和big endian/little endian会不会有影响?
      
      typedef struct { uint16_t bits; } ggml_bf16_t;
      static inline float ggml_compute_bf16_to_fp32(ggml_bf16_t h) {
          union {
              float f;
              uint32_t i;
          } u;
          u.i = (uint32_t)h.bits << 16;
          return u.f;
      }
      
      从高精度到低精度,损失是必然的,代码也很简单;可是没想到反过来把fp16转回fp32是这么的复杂,原来就是因为浮点数有所谓的非规格化数(subnormal)这种超小的小数。简而言之,当指数位全是0时有一种特殊的含义,用来表达非常小的小数。总之这一段代码是看不明白的:
      
          static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
              const uint32_t w = (uint32_t) h << 16;
              const uint32_t sign = w & UINT32_C(0x80000000);
              const uint32_t two_w = w + w;
      
              const uint32_t exp_offset = UINT32_C(0xE0) << 23;
          #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
              const float exp_scale = 0x1.0p-112f;
          #else
              const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
          #endif
              const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;
      
              const uint32_t magic_mask = UINT32_C(126) << 23;
              const float magic_bias = 0.5f;
              const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;
      
              const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
              const uint32_t result = sign |
                  (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
              return fp32_from_bits(result);
          }
      
      我也懒得研究了,不过好消息是这个复杂的转化过程居然是用查表来解决的!因为fp16说到底可以看作一个16bit的整数,总共就是65536个取值,所以直接创建一个表!而查询的时候就把它看做整数。这的确是一个高明的技巧。
      
      struct ggml_context * ggml_init(struct ggml_init_params params) {
          static bool is_first_call = true;
      
          ggml_critical_section_start();
      
          if (is_first_call) {
              // initialize time system (required on Windows)
              ggml_time_init();
      
              for (int i = 0; i < (1 << 16); ++i) {
                  union {
                      uint16_t u16;
                      ggml_fp16_t fp16;
                  } u = {i};
                  ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
              }
      
              is_first_call = false;
          }
      
          ggml_critical_section_end();
      ...
      
      用查表的方式,我本来想应该使用union来把浮点数转为整数,可是暴力拷贝是不是有big endian/little endian的问题呢?也许正好相反吧这个反而是避免了endian的问题?我没有这方面的实践经验总是想不清楚。我问了豆包说使用union是更加好的做法,能够避免endian的困扰。
      
      inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
          uint16_t s;
          memcpy(&s, &f, sizeof(uint16_t));
          return ggml_table_f32_f16[s];
      }
      
      感觉这个代码有问题,忍不住要改一下
    2. 这个转换就比较正常了。
      
      /**
       * Converts float32 to brain16.
       *
       * This is binary identical with Google Brain float conversion.
       * Floats shall round to nearest even, and NANs shall be quiet.
       * Subnormals aren't flushed to zero, except perhaps when used.
       * This code should vectorize nicely if using modern compilers.
       */
      static inline ggml_bf16_t ggml_compute_fp32_to_bf16(float s) {
          ggml_bf16_t h;
          union {
              float f;
              uint32_t i;
          } u;
          u.f = s;
          if ((u.i & 0x7fffffff) > 0x7f800000) { /* nan */
              h.bits = (u.i >> 16) | 64; /* force to quiet */
              return h;
          }
          h.bits = (u.i + (0x7fff + ((u.i >> 16) & 1))) >> 16;
          return h;
      }
      
      原来安静NaN(quiet NaN)和信号NaN(signaling NaN)是真的存在的两种浮点数啊!两者都是NaN,其指数位都全为1,而尾数位都不全为0,区别是前者的尾数最高位为1,而后者的尾数最高位为0。前者可以来自0除以0或者负数开平方根这类无效运算,而后者通常用来标记未初始化的数据之类。
    3. 我的这个feature request还是立刻就有人响应啊!
    4. 在把DeepSeek-V3大模型转为gguf中总是遇到这个错误。找了好久最后还是问gemini才问到结果。就是The weight model.layers.0.mlp.down_proj.weight_scale_inv in the DeepSeek-v3 model, and similar weights found in some other transformer architectures, is related to a technique called Scaled ReLU or SwiGLU activation functions.这里是细节。于是我就提了一个建议
    5. 今天累得够呛!
    6. 要实现这个功能还是要先读论文
    7. 在gemini的帮助下,我们认为llama.cpp本来就实现了这个SwiGLU,它就是所谓的ffn_down,所以,这个不是cpp的问题,纯粹就是python的convert问题,而只需要加一个映射就行了。
      
      diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
      index c605e4d0..aef18b25 100755
      --- a/convert_hf_to_gguf.py
      +++ b/convert_hf_to_gguf.py
      @@ -4407,7 +4407,7 @@ class DeepseekV2Model(Model):
                   else:
                       return []
       
      -        return [(self.map_tensor_name(name), data_torch)]
      +        return [(self.map_tensor_name(name, (".weight_scale_inv", ".weight", ".bias")), data_torch)]
       
           def prepare_tensors(self):
               super().prepare_tensors()
      diff --git a/gguf-py/gguf/tensor_mapping.py b/gguf-py/gguf/tensor_mapping.py
      index 8d4a2b03..72c67db7 100644
      --- a/gguf-py/gguf/tensor_mapping.py
      +++ b/gguf-py/gguf/tensor_mapping.py
      @@ -388,6 +388,7 @@ class TensorNameMap:
                   "encoder.layer.{bid}.mlp.down_layer",                     # jina-bert-v2
                   "encoder.layers.{bid}.mlp.dense_4h_to_h",                 # chatglm
                   "model.layers.h.{bid}.mlp.c_proj",                        # exaone
      +            "model.layers.{bid}.mlp.down_proj.weight_scale_inv",      # deepseek-v3
               ),
       
               MODEL_TENSOR.FFN_DOWN_EXP: (
      
      这个问题似乎是解决了,但是我又遇到了其他的问题。
      
        File "/home/nick/Downloads/llama.cpp/./convert_hf_to_gguf.py", line 4396, in modify_tensors
          datas.append(self._experts[bid][ename])
      KeyError: 'model.layers.3.mlp.experts.99.down_proj.weight'
      
      晚上太冷了。
    8. 大侠指出这个实际上是DeepSeek提供的fp8的quantization的解决办法。那么我的修改还对吗?我等待大侠的评价

    三月三十一日 等待变化等待机会

    1. 这位大神的说法让我信服了。所以,要么是自己改,这个是DeepSeek-V3的推理工具。或者就是从HuggingFace下载现成的BF16版本。
    2. 怎么知道你需要重启内核呢?因为从/proc/cmdline里看到的内核版本是
      
      $ cat /proc/cmdline
      BOOT_IMAGE=/vmlinuz-5.15.0-134-generic root=/dev/mapper/ubuntu--vg-lv--0 ro default_hugepagesz=1G hugepagesz=1G hugepages=800
      
      可是在grub.cfg里看到的是
      
      				echo    'Loading Linux 5.15.0-135-generic ...'
                      linux   /vmlinuz-5.15.0-135-generic root=/dev/mapper/ubuntu--vg-lv--0 ro  default_hugepagesz=1G hugepagesz=1G hugepages=800
                      echo    'Loading initial ramdisk ...'
                      initrd  /initrd.img-5.15.0-135-generic
      
      证明我已经安装了新版内核,需要重启。服务器重启实在是漫长的过程!
    3. 所以,接着昨天的故事,再次尝试DeepSeek-V3的推理工具,因为昨天我的NVidia驱动没有安装完毕,所以脚本拒绝。我需要安装本地的python库torch:
      
      pip install -r ./requirements.txt
      
      python是非常令人讨厌的东西,因为它对版本非常敏感。所有跨平台的东西都是如此,因为背后和不跨平台的、老老实实的动态库是一回事,只不过后者直接就崩溃,不让你有任何幻想,这个其实更好。
    4. 此外,我把从ollama网站下载的deepseek-r1:671b用llama.cpp来运行,发现可能是程序有错误,因为输出完全不对。
    5. 对于huggingface,实在是没有什么好办法,它对中国IP是直接抵制的,比如curl -I https://huggingface.co根本不响应。最后只能浪费我的流量来用vpn了。前提是我需要NFS,因为本机磁盘是不够的。
      
      $ cat /etc/exports
      /data/models	*(rw,async,rw,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
      
      这个async能够保证客户端在网络中断时不至于失去响应,当然我还要在客户端使用soft选项:
      
      sudo mount -t nfs 192.168.1.7:/data/models -o soft ./dellserver
      
      然后就可以使用git lfs pull下载huggingface大模型了。前提是要彻底删除之前的git repo,要重新运行clone。
    6. 可以使用-ngl(--n-gpu-layers,即卸载到GPU的层数)参数来控制分配给GPU的比例,调小它就能降低GPU的负担。我要实验一下。
    7. 证实nvidia驱动安装不成功,下载官方驱动之前又要先删除旧的:sudo apt-get purge nvidia-*
    8. 最后我决定还是让ubuntu来选择:
      
      ubuntu-drivers devices
      
      == /sys/devices/pci0000:40/0000:40:02.0/0000:42:00.0/0000:43:10.0/0000:45:00.0 ==
      modalias : pci:v000010DEd0000102Dsv000010DEsd0000106Cbc03sc02i00
      vendor   : NVIDIA Corporation
      model    : GK210GL [Tesla K80]
      driver   : nvidia-driver-418-server - distro non-free
      driver   : nvidia-driver-470-server - distro non-free
      driver   : nvidia-driver-470 - distro non-free recommended
      driver   : nvidia-driver-450-server - distro non-free
      driver   : nvidia-driver-390 - distro non-free
      driver   : xserver-xorg-video-nouveau - distro free builtin
      
      因此,我选择nvidia-driver-470 ,但是这个也可能就是llama.cpp无法使用的一个原因吧?
    9. 意外操作重启后整理一下思路。目前我看到的这个
      
      $ git remote -vv
      origin	https://cnb.cool/ai-models/deepseek-ai/DeepSeek-R1.git (fetch)
      origin	https://cnb.cool/ai-models/deepseek-ai/DeepSeek-R1.git (push)
      
      查看config.json有这么一段很重要:
      
        "quantization_config": {
          "activation_scheme": "dynamic",
          "fmt": "e4m3",
          "quant_method": "fp8",
          "weight_block_size": [
            128,
            128
          ]
        },
      
      这个和大侠指出的DeepSeek-V3的说明是相符的:
      
      ## FP8 Weight Documentation
      
      DeepSeek-V3 natively supports FP8 weight format with 128x128 block scaling.
      
      ### FP8 Configuration
      
      The FP8 weight file introduces a `quantization_config` field to describe the quantization method. Below is an example configuration:
      
      ```json
      "quantization_config": {
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "quant_method": "fp8",
        "weight_block_size": [128, 128]
      }
      ```
      - **Quantization Format**:
        - Format type: `fp8` and `e4m3` (corresponding to `torch.float8_e4m3fn`).
        - Weight block size: `128x128`.
      - **Activation Quantization Scheme**:
        - Utilizes dynamic activation quantization (`dynamic`).
      
      而这里要再理解所谓的Dequantization
      
      ### Dequantization Method
      
      The FP8 weight file includes a `weight_scale_inv` field, which stores the dequantization scale for each weight block.
      
      - **Storage Format**: `float32 Tensor`, stored alongside the weight data.
      - **Dequantization Formula**:
        - If the weight block is not aligned to 128, it is zero-padded to 128 before calculating the scale. After quantization, the padded portion is removed.
        - The dequantization process is performed as: `(128x128 weight block) * weight_scale_inv`.
      
      Through dequantization of the FP8 weights, runtime operations enable online quantization at a granularity of `per-token-per-128-channel`.
      
      ---
      
      能不能理解为:真正的weight都已经quantize成fp8了,而所谓的weight_scale_inv作为一个fp32存储,就是为了做dequantization用的?最后一句还是看不太懂,我猜它的意思是:权重反量化之后,运行时再对激活值做在线量化,粒度是per-token、每128个channel一组。
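      下面是我按这个文档画的一个反量化草图(假想的C++函数,不是DeepSeek或llama.cpp的实现,fp8的e4m3位解码也略过了),只是把"整块乘以weight_scale_inv"这件事写具体:
      
      #include <vector>
      #include <cstddef>
      
      // 假设:一个128x128的权重块已经从fp8(e4m3)解码成float,存在w_block里;
      // weight_scale_inv是该块对应的fp32反量化系数
      static std::vector<float> dequantize_block(const std::vector<float> &w_block,
                                                 float weight_scale_inv) {
          std::vector<float> out(w_block.size());
          for (size_t i = 0; i < w_block.size(); ++i) {
              out[i] = w_block[i] * weight_scale_inv;   // (128x128 weight block) * weight_scale_inv
          }
          return out;
      }
      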

    四月一日 等待变化等待机会

    1. 遇到一个棘手的问题,就是git lfs pull立即就返回了,而之前是下载成功过的。但是git status仍然报告文件有改动,那么是不是本地文件和目标文件内容有差别,或者说下载的文件被破坏了呢?我问了gemini,最后还是自己查帮助尝试git lfs fsck,似乎gemini并不知道有这个命令,它认为是verify,不知道是杜撰还是版本不同。
    2. 之前下载的模型文件需要从Q8的量化转为bf16,需要使用DeepSeek提供的这个脚本,可是它竟然要求很大的GPU显存,6G都不够,很可能是脚本一次载入整块文件导致的,gemini耐心帮助我试图减少显存使用量。结果huggingface.co宕机了!这个简直是令人叹息啊!
    3. 原来你可以使用git lfs ls-files -l来查看每一个大文件的sha256sum,所以你可以手动来检验一下;尽管git status总是报告文件已经修改,依然可以放心。
    4. 提交了一个PR修改一个小例子的参数。
    5. 点击lfs的大文件内容可以看到类似的oid:
      
      version https://git-lfs.github.com/spec/v1
      
      oid sha256:c2388e6b127ce6664e35c5e2529c3ce4bfc99f4f7fb6fa48e92b29ed5e4922af
      
      size 5234139343
      
      然后我在本机看看这个文件的oid:
      
      $ git lfs ls-files -l | grep model-00001-of-000163.safetensors
      c2388e6b127ce6664e35c5e2529c3ce4bfc99f4f7fb6fa48e92b29ed5e4922af * model-00001-of-000163.safetensors
      
      $ if [[ "$(sha256sum model-00001-of-000163.safetensors | cut -d' ' -f1)" == "c2388e6b127ce6664e35c5e2529c3ce4bfc99f4f7fb6fa48e92b29ed5e4922af" ]]; then echo "Match"; else echo "No Match"; fi
      Match
      
      所以,lfs没有问题,应该是git status的问题,所以,可以大胆的使用git reset --hard

    四月二日 等待变化等待机会

    1. 看到一段代码想用avx512去优化,后来才明白四个int只需要128bit就行(也就是所谓的SSE),而这里实际上是四个int64_t,共256bit,那么用AVX2的256bit寄存器就可以了吗?我尝试着这么做:
      
      #include <immintrin.h> // Include for AVX-512 intrinsics
          // int ne[GGML_MAX_DIMS];
          // int nb[GGML_MAX_DIMS];
      
          // ne[axis0] = a->ne[0];
          // ne[axis1] = a->ne[1];
          // ne[axis2] = a->ne[2];
          // ne[axis3] = a->ne[3];
          __m256i ne_vec = _mm256_loadu_si256((__m256i*)a->ne);
          // nb[axis0] = a->nb[0];
          // nb[axis1] = a->nb[1];
          // nb[axis2] = a->nb[2];
          // nb[axis3] = a->nb[3];
          __m256i nb_vec = _mm256_loadu_si256((__m256i*)a->nb);
          // result->ne[0] = ne[0];
          // result->ne[1] = ne[1];
          // result->ne[2] = ne[2];
          // result->ne[3] = ne[3];
          _mm256_storeu_si256((__m256i*)result->ne, ne_vec);
          // result->nb[0] = nb[0];
          // result->nb[1] = nb[1];
          // result->nb[2] = nb[2];
          // result->nb[3] = nb[3];
          _mm256_storeu_si256((__m256i*)result->nb, nb_vec);
      
      可是这样子真的可以吗?尽管我使用了诸如-march=native -march=alderlake -mavx2 -mavx之类的选项,依旧得到编译错误:
      
      /usr/lib/gcc/x86_64-linux-gnu/12/include/avxintrin.h:927:1: error: inlining failed in call to ‘always_inline’ ‘_mm256_loadu_si256’: target specific option mismatch
        927 | _mm256_loadu_si256 (__m256i_u const *__P)
            | ^~~~~~~~~~~~~~~~~~
      
      这个实验是失败的。优化是很难的。
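      留个备忘:这类"inlining failed ... target specific option mismatch"错误一般意味着真正用到intrinsics的那个函数没有按AVX2来编译(比如构建系统并没有把-mavx2传给那个文件)。下面是一个假想的小草图(不是ggml的代码),用GCC/Clang的target属性把AVX2代码圈起来,并在运行期检测CPU支持,算是我想再试的一个思路:
      
      #include <immintrin.h>
      #include <cstdint>
      #include <cstdio>
      
      // 只有这个函数按AVX2编译,即使整个文件没有加 -mavx2
      __attribute__((target("avx2")))
      static void copy4_i64_avx2(int64_t *dst, const int64_t *src) {
          const __m256i v = _mm256_loadu_si256((const __m256i *) src);  // 一次装载4个int64_t(256bit)
          _mm256_storeu_si256((__m256i *) dst, v);                      // 一次写回
      }
      
      static void copy4_i64(int64_t *dst, const int64_t *src) {
          if (__builtin_cpu_supports("avx2")) {
              copy4_i64_avx2(dst, src);
          } else {
              for (int i = 0; i < 4; ++i) dst[i] = src[i];              // 标量回退
          }
      }
      
      int main() {
          int64_t ne[4] = {1, 2, 3, 4}, out[4] = {0, 0, 0, 0};
          copy4_i64(out, ne);
          std::printf("%lld %lld %lld %lld\n", (long long) out[0], (long long) out[1],
                      (long long) out[2], (long long) out[3]);
          return 0;
      }
      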
    2. safetensors的文件也许是必须要有一个config.json,因为它并不是一个self-explainable的格式,一些重要的信息都储存在这里比如:
      
      $ cat config.json 
      {
        "architectures": [
          "DeepseekV3ForCausalLM"
        ],
        "attention_bias": false,
        "attention_dropout": 0.0,
        "auto_map": {
          "AutoConfig": "configuration_deepseek.DeepseekV3Config",
          "AutoModel": "modeling_deepseek.DeepseekV3Model",
          "AutoModelForCausalLM": "modeling_deepseek.DeepseekV3ForCausalLM"
        },
        "aux_loss_alpha": 0.001,
        "bos_token_id": 0,
        "eos_token_id": 1,
        "ep_size": 1,
        "first_k_dense_replace": 3,
        "hidden_act": "silu",
        "hidden_size": 7168,
        "initializer_range": 0.02,
        "intermediate_size": 18432,
        "kv_lora_rank": 512,
        "max_position_embeddings": 163840,
        "model_type": "deepseek_v3",
        "moe_intermediate_size": 2048,
        "moe_layer_freq": 1,
        "n_group": 8,
        "n_routed_experts": 256,
        "n_shared_experts": 1,
        "norm_topk_prob": true,
        "num_attention_heads": 128,
        "num_experts_per_tok": 8,
        "num_hidden_layers": 61,
        "num_key_value_heads": 128,
        "num_nextn_predict_layers": 1,
        "pretraining_tp": 1,
        "q_lora_rank": 1536,
        "qk_nope_head_dim": 128,
        "qk_rope_head_dim": 64,
        "quantization_config": {
          "activation_scheme": "dynamic",
          "fmt": "e4m3",
          "quant_method": "fp8",
          "weight_block_size": [
            128,
            128
          ]
        },
        "rms_norm_eps": 1e-06,
        "rope_scaling": {
          "beta_fast": 32,
          "beta_slow": 1,
          "factor": 40,
          "mscale": 1.0,
          "mscale_all_dim": 1.0,
          "original_max_position_embeddings": 4096,
          "type": "yarn"
        },
        "rope_theta": 10000,
        "routed_scaling_factor": 2.5,
        "scoring_func": "sigmoid",
        "seq_aux": true,
        "tie_word_embeddings": false,
        "topk_group": 4,
        "topk_method": "noaux_tc",
        "torch_dtype": "bfloat16",
        "transformers_version": "4.46.3",
        "use_cache": true,
        "v_head_dim": 128,
        "vocab_size": 129280
      }
      
      
      我猜想DeepSeek上下文长度的最大限制的一个原因就是位置编码的限制"max_position_embeddings": 163840。而另一个我才意识到的是tokenizer.json是包含了vocabulary的单独的文件的:
      
         "normalizer": {
              "type": "Sequence",
              "normalizers": []
          },
          "pre_tokenizer": {
              "type": "Sequence",
              "pretokenizers": [
                  {
                      "type": "Split",
                      "pattern": {
                          "Regex": "\\p{N}{1,3}"
                      },
                      "behavior": "Isolated",
                      "invert": false
                  },
                  {
                      "type": "Split",
                      "pattern": {
                          "Regex": "[一-龥぀-ゟ゠-ヿ]+"
                      },
                      "behavior": "Isolated",
                      "invert": false
                  },
                  {
                      "type": "Split",
                      "pattern": {
                          "Regex": "[一-龥぀-ゟ゠-ヿ]+"
                      },
                      "behavior": "Isolated",
                      "invert": false
                  },
                  {
                      "type": "Split",
                      "pattern": {
                          "Regex": "[!\"#$%&'()*+,\\-./:;<=>?@\\[\\\\\\]^_`{|}~][A-Za-z]+|[^\r\n\\p{L}\\p{P}\\p{S}]?[\\p{L}\\p{M}]+| ?[\\p{P}\\p{S}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
                      },
                      "behavior": "Isolated",
                      "invert": false
                  },
                  {
                      "type": "ByteLevel",
                      "add_prefix_space": false,
                     "trim_offsets": true,
                      "use_regex": false
                  }
              ]
          },
          "post_processor": {
              "type": "ByteLevel",
              "add_prefix_space": true,
              "trim_offsets": false,
              "use_regex": true
          },
          "decoder": {
              "type": "ByteLevel",
              "add_prefix_space": true,
              "trim_offsets": true,
              "use_regex": true
          },
          "model": {
              "type": "BPE",
              "dropout": null,
              "unk_token": null,
              "continuing_subword_prefix": null,
              "end_of_word_suffix": null,
              "fuse_unk": false,
              "byte_fallback": false,
              "vocab": {
                  "<|begin▁of▁sentence|>": 0,
                  "<|end▁of▁sentence|>": 1,
                  "<|▁pad▁|>": 2,
                  "!": 3,
                  "\"": 4,
                  "#": 5,
                  "$": 6,
                  "%": 7,
                  "&": 8,
                  "'": 9,
                  "(": 10,
                  ")": 11,
                  "*": 12,
                  "+": 13,
                  ",": 14,
      ...
      
      这个是gemini的解说:
      1. "normalizer": This section specifies how the input text is preprocessed to ensure consistency before tokenization. The "type": "Sequence" indicates that a series of normalization steps could be applied in order. However, the "normalizers": [] being empty means that in this particular tokenizer configuration, no normalization is being performed. Normalization steps could include things like lowercasing, removing accents, or handling Unicode variations.
      2. "pre_tokenizer": This section defines how the input text is initially split into smaller units before the main tokenization model is applied. Here, it's a "type": "Sequence" with a list of several pre-tokenizers:
        1. The first three "Split" pre-tokenizers use regular expressions ("Regex": "\\p{N}{1,3}" and "Regex": "[一-龥぀-ゟ゠-ヿ]+" appearing twice) to isolate sequences of 1 to 3 digits and blocks of Chinese, Japanese, or Korean characters. "behavior": "Isolated" suggests these matches will be treated as separate units.
        2. The fourth "Split" pre-tokenizer uses a more complex regular expression to handle various scenarios, including splitting based on punctuation, symbols, and whitespace.
        3. The final "ByteLevel" pre-tokenizer is a common technique that ensures every byte of the input is represented in the vocabulary, which can be helpful for handling unknown characters and various encodings. "add_prefix_space": false means it won't automatically add a space before the first word.
      3. "post_processor": This section defines how the tokens are processed after the main tokenization but before being input to the model or when converting back to text. Here, it's using "ByteLevel" again, but with "add_prefix_space": true, indicating a space will be added before the first token when processing.
      4. "decoder": This section specifies how the numerical tokens are converted back into human-readable text. It uses "ByteLevel" with "add_prefix_space": true and "trim_offsets": true, indicating that when decoding, a prefix space will be added, and offsets might be trimmed. This mirrors the post-processing to ensure the original text can be reconstructed.
      5. "model": This crucial section defines the actual tokenization algorithm used by the model. Here, "type": "BPE" indicates that the model uses Byte-Pair Encoding. BPE is a subword tokenization algorithm that learns merges of frequent byte pairs.
        1. "dropout": null means no dropout is applied during tokenization.
        2. "unk_token": null suggests there's no specific unknown token defined here (it might be handled at a lower level by the ByteLevel).
        3. "continuing_subword_prefix" and "end_of_word_suffix" are often used in BPE to indicate if a token is part of a larger word.
        4. "byte_fallback": false means it won't fall back to byte-level encoding in certain situations.
        5. The "vocab" subsection contains the actual vocabulary of tokens that the model can process, along with their corresponding integer IDs. You can see examples like "<|begin of sentence|>": 0, "!": 3, "\"": 4, and many others.
      一句话就是vocabulary是一个前提,没有这个你还根本够不着transformer呢。
    3. huggingface设定如下:
      git remote set-url origin https://YOUR_USERNAME:YOUR_ACCESS_TOKEN@huggingface.co/OWNER/REPOSITORY_NAME.git

    四月三日 等待变化等待机会

    1. 抓取vocab:
      
      $ grep -b "\"vocab\":" tokenizer.json 
      214112:        "vocab": {
      nick@dellserver:/data/models/DeepSeek-R1-bf16$ dd if=tokenizer.json of=vocab.dd skip=214112 bs=1
      7633490+0 records in
      7633490+0 records out
      7633490 bytes (7.6 MB, 7.3 MiB) copied, 16.5162 s, 462 kB/s
      
    2. 下载了一篇雄文,非常的宏大,我不敢幻想我能看懂,但是先保存一下。

    四月四日 等待变化等待机会

    1. 有一个遥远的传说,是关于std::string的:人们嫌弃它的效率不如原始的char buf[],于是投机取巧地把它小字符串优化(SSO)的15字节内部buffer当作原始字符数组来用。结果我今天终于看到了。
      
      std::string common_token_to_piece(const struct llama_vocab * vocab, llama_token token, bool special) {
          std::string piece;
          piece.resize(piece.capacity());  // using string internal cache, 15 bytes + '\n'
          const int n_chars = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
      ...
      
      这里面的这段代码有多少玄机呢?
      1. 首先,什么是std::string的operator[]?
        
              // Element access:
              /**
               *  @brief  Subscript access to the data contained in the %string.
               *  @param  __pos  The index of the character to access.
               *  @return  Read-only (constant) reference to the character.
               *
               *  This operator allows for easy, array-style, data access.
               *  Note that data access with this operator is unchecked and
               *  out_of_range lookups are not defined. (For checked lookups
               *  see at().)
               */
              _GLIBCXX20_CONSTEXPR
              const_reference
              operator[] (size_type __pos) const _GLIBCXX_NOEXCEPT
              {
        	__glibcxx_assert(__pos <= size());
        	return _M_data()[__pos];
              }
        
        这个是唯一正确的使用那15个_CharT的本地buffer(enum { _S_local_capacity = 15 / sizeof(_CharT) };)的做法,因为你不容易跳过string结构在buffer之前定义的各种其他成员,那是错误的冒险。
      2. 其次,在你使用operator[](size_type __pos)函数的时候,传入的参数__pos是会被libstdc++的__glibcxx_assert动态检查的(需要开启断言才生效),所以要保证不至于buffer overflow。那么要达到使用静态本地buffer的目的,也就是15个字节快速分配的好处,就要保证string的size()足够。那么在你还没有拷贝任何数据到string的时候,你要怎么保证你的string::size()是够用的呢?最简单的就是它的capacity(),默认就是15,是个好的选择。
        
        size_type
              capacity() const _GLIBCXX_NOEXCEPT
              {
        	return _M_is_local() ? size_type(_S_local_capacity)
        	                     : _M_allocated_capacity;
              }
        
        而这个至关重要的效率就在于我们要保证不使用动态分配的buffer,也就是_S_local_capacity,这里的
        
              bool
              _M_is_local() const
              {
        	if (_M_data() == _M_local_data())
        	  {
        	    if (_M_string_length > _S_local_capacity)
        	      __builtin_unreachable();
        	    return true;
        	  }
        	return false;
              }
        
        __builtin_unreachable();是一个强大的双刃剑因为它可以阻止编译器的自作聪明也可以暴露你的愚蠢。
        __builtin_unreachable is a powerful tool for giving the compiler hints about code that should never be executed. This allows for better optimization and can help suppress certain warnings. However, it's crucial to use it correctly, as incorrect usage leads to undefined behavior.
        目前我看到gcc/clang都支持。
      3. 最后是operator[]返回的类型是一个引用,那么正确的使用方法是怎么样的呢?首先要resize然后再拷贝,因为如果先拷贝再resize,resize新增的那部分会被填上'\0',把刚拷进去的内容覆盖掉。
        
            string s;
            const string src = "no more than 15";
            if (src.size() > 15) {
                cout << "src size more than 15!" << endl;
                return -1;
            }
            s.resize(src.size());
            memcpy(&s[0], src.c_str() , src.size());   
        
        但是这个做法究竟比string的赋值拷贝有何优势呢?这里肯定不是重点,重点是我们想要借用string的内部buffer来快速的使用而已。
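        我写了一个小实验(在libstdc++上验证,结论未必对所有实现成立):default构造的string在resize(capacity())之后data()指针不变,说明确实还在那15字节的本地缓冲里:
        
        #include <string>
        #include <cstdio>
        
        int main() {
            std::string piece;
            const char *before = piece.data();
            piece.resize(piece.capacity());        // libstdc++下默认capacity()是15(SSO本地缓冲)
            const char *after = piece.data();
            std::printf("capacity=%zu same_buffer=%d\n", piece.capacity(), (int) (before == after));
            return 0;
        }
        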
    2. 在这一点上DeepSeek是超过gemini的,我喜欢它的总结:

      Advantages of std::string_view over const std::string&:

      1. Flexibility with String Types:

        • std::string_view can be constructed from any contiguous character sequence (std::string, C-strings, literals, char arrays) without copies or allocations.

        • const std::string& requires implicit conversion to std::string for non-std::string inputs (e.g., "hello" creates a temporary std::string), which may incur allocation overhead (especially for non-SSO cases).

      2. Performance for Substrings:

        • Creating a substring with std::string_view is O(1) (adjust pointer/size), while std::string::substr is O(n) (copies data).

      3. Avoiding Temporary Allocations:

        • Passing non-std::string arguments (e.g., literals) to a const std::string& parameter forces a temporary std::string allocation. std::string_view avoids this entirely.

      4. Generic Interfaces:

        • std::string_view works with any string-like data (including custom types), promoting code reusability. const std::string& ties the interface to std::string.

      5. No Null Termination Requirement:

        • std::string_view can represent non-null-terminated substrings (useful for parsing), whereas std::string always expects null-terminated data.

      Lifetime Considerations:

      • Both std::string_view and const std::string& are non-owning, but const std::string& is bound to an existing std::string object, which may extend the lifetime of temporaries in some cases (e.g., func("literal") keeps the temporary alive for the function call).

      • std::string_view has no such guarantees—you must ensure the underlying data outlives the string_view.

      When to Use:

      • Prefer std::string_view for read-only, non-owning string parameters (especially in performance-critical code or when accepting diverse string sources).

      • Use const std::string& if you need to interface with existing code relying on std::string or require null termination.

      Summary: std::string_view offers greater flexibility, efficiency, and generality for non-owning string access, but requires careful lifetime management.
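      配合这个总结随手写个小例子(和llama.cpp无关):给const std::string&传字符串字面量会构造一个临时std::string,而std::string_view只记录指针和长度,取子串也是O(1):

      #include <string>
      #include <string_view>
      #include <cstdio>

      static size_t len_by_ref(const std::string &s) { return s.size(); }   // 传字面量时会生成临时string
      static size_t len_by_view(std::string_view sv) { return sv.size(); }  // 不拷贝、不分配

      int main() {
          const char *raw = "this literal is longer than any SSO buffer";
          std::printf("%zu %zu\n", len_by_ref(raw), len_by_view(raw));

          std::string s = "hello world, substring demo";
          std::string_view sub = std::string_view(s).substr(6, 5);          // O(1),只调整指针/长度
          std::printf("%.*s\n", (int) sub.size(), sub.data());
          return 0;
      }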

    3. 我昨天总结了一下那篇论文,这个是和豆包的讨论,主要是我在问豆包。结果就是我认为作者的贡献在于探索并不一定开拓了一个新的实现路径,因为我觉得按照奥卡姆剃刀原则没有必要去搞一个更加复杂的模型,如果线性模型够用的话。

    四月六日 等待变化等待机会

    1. 突然对于豆包或者gemini里为什么语音输入不工作想要探寻一下,但是首先还是要先确保麦克风是工作的,于是问豆包找到一个基本的测试网站,然后无耻的把它的代码拷贝下来给豆包做了一个最最简单的麦克风摄像头的测试网页。这个可以方便在使用视频会议中做基本测试用吧?
    2. 对于huggingface的封锁我开始感到气愤了,但是也许只是流量的控制,这个是无可奈何的,另一个思路是自己实现,我尝试问豆包看看,可是它还抓不住重点,仅仅停留在json文件的解析阶段,我不确定是否这么简单?
    3. 终于安装了豆包的chromium的浏览器插件,而且中国现在也在抵制美国的IP用户,然后语音输入就可以了。尽管不如手机的效果,也很好了。只是桌面版只有windows,我在wine里安装失败,我想是不是我要安装wine32的环境呢?
      
      $ sudo apt install wine32 wine64 libc6:i386 libstdc++6:i386
      $ WINEARCH=win32 WINEPREFIX=~/.wine32 winecfg
      
      作为验证:$ WINEPREFIX=~/.wine32 wine winecfg,但是显然的豆包的安装程序是64位的,问题是你不能指望一个复杂的windows程序能够在ubuntu上运行。
    4. 相比于豆包,编程方面我更加信任gemini,这个算法我还是要静下心来研究看看,否则我不自己去实现的话,下载huggingface的大模型真的把我折磨死了。不知道要多少天才能下载完啊。我又补充了DeepSeek-V3的关于量化的文档,gemini有了一些小小的调整。这个看起来还是费了不少的精神,代码不麻烦,但是细节很要命,因为这么大的文件,运行下去不知道结果如何,运行起来也不知道如何。这两个python脚本是关键fp8_cast_bf16.pykernel.py
    5. 似乎huggingface使用这个形式能够极大的提高下载速度:
      
      $ git remote -vv
      origin	https://USER NAME:ACCESS TOKEN@huggingface.co/arcee-ai/DeepSeek-R1-bf16.git (fetch)
      

    四月七日 等待变化等待机会

    1. 这篇DeepSeekdeV3的技术报告我早就该读一读了。
    2. 下载这个bf16的DeepSeek-R1大模型花了不少钱和时间,结果我等不及了要求gemini给我写一个CPU处理的小工具

    四月八日 等待变化等待机会

    1. github的key折腾的够呛,可能是我的key-agent设置,也可能是需要升级加密算法,或者就是必须使用https而不是ssh,总之过程是这样子的:
      1. 首先,还是要保证这个包的安装吧?因为ubuntu没有直接的安装包,必须这么做
        
        sudo apt update
        sudo apt install libsecret-1-0 libsecret-1-dev libglib2.0-dev build-essential
        cd /usr/share/doc/git/contrib/credential/libsecret
        sudo make
        git config --global credential.helper /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret
        
        原因:Unlike some older Ubuntu versions, Ubuntu 22.04 doesn't have a readily installable git-credential-libsecret package via apt
      2. 最好是按照github的要求使用高级算法:
        
        $ ssh-keygen -t ed25519 -C "your_email@example.com"
        $ eval "$(ssh-agent -s)"
        ssh-add ~/.ssh/id_ed25519
        
        然后在github的ssh-key页面把~/.ssh/id_ed25519.pub的内容拷贝到github上。
      3. 注意这里repo的remote url不能是ssh格式,我不知道为什么,只能是https:
        
        $ git remote set-url origin https://github.com/{USERNAME}/{REPONAME}.git
        $ git push origin
        
      似乎隔一阵子都有一些安全的机制在改变。
    2. 我耗费了一百多美金的流量下载的bf16的格式似乎不完全,但是git lfs pull直接就返回,这个真的是头疼!
    3. 所以,凡事都不能在一棵树上吊死,我的planB就是自己写一个小工具来转换,这个实际上是gemini写的,我改的,当然设计改进还是我的。 但是这个还需要使用llama.cpp的convert_hf_to_gguf.py来再转换为gguf格式,它需要至少
      1. config.json
      2. model.safetensors.index.json
      3. tokenizer.json
      等等配置才能工作。这个代码以后肯定是要改进的,不过现在先使用一下看看行不行。
    4. free -h显示有很多的内存被used,但是又是available,所以,应该是在cached,可以使用
      
      $ echo 1 | sudo tee /proc/sys/vm/drop_caches
      
      之所以要做这个是因为我使用sysctl vm.nr_hugepages=1350无法分配更多的pages,因为只看到/proc/meminfo里只有70个,这个是hugepagesz=1G的情况。
    5. 为什么sysctl立刻就返回了,但是hugepages实际上并没有分配呢?
      
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages 
      282
      $ echo 400 |sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages 
      400
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
      282
      
      所以,大致上这个就是和内核通讯的方式:往nr_hugepages里写的只是一个请求值,内核只在能找到足够连续内存的范围内尽量分配,再读一次才是实际保留的数量。

    四月九日 等待变化等待机会

    1. 我自己转的模型也许有错误,也许是内存不够,因为bf16的DeepSeek-R1:671B文件/内存需要1.3T,这个可能是超过了我的上限了,总之输出不正常。这个是另一个思路,在aws/s3上存储,首先使用s3fs来创建一个mount point。
      
      $ sudo s3fs www.staroceans.org:/DeepSeek /mnt/s3-repo/ -o allow_other -o use_path_request_style  -o default_acl=public-read -o use_cache=/cache -o parallel_count=10 -o ensure_diskfree=2000 -o multipart_size=50
      
      我发现我的aws/ec2的磁盘不能把cache放在/tmp上因为它是一个tmpfs只有几十k。只有在根目录还有几个G的空间。我为了防止磁盘被耗尽强制预留2000MB的磁盘空间。 这里必须把密码文件放在${HOME}/.aws/credentials只不过这里的${HOME}是root
    2. 我决定放弃满血DeepSeek,转而使用蒸馏版的小模型。那么笔记本上运行要快的多,只不过我又要安装驱动。
      
      $ ubuntu-drivers devices
      == /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
      modalias : pci:v000010DEd000028A1sv00001558sd0000A650bc03sc00i00
      vendor   : NVIDIA Corporation
      driver   : nvidia-driver-535-server-open - distro non-free
      driver   : nvidia-driver-535-server - distro non-free
      driver   : nvidia-driver-560-open - third-party non-free
      driver   : nvidia-driver-560 - third-party non-free
      driver   : nvidia-driver-565-open - third-party non-free
      driver   : nvidia-driver-565 - third-party non-free
      driver   : nvidia-driver-570 - third-party non-free recommended
      driver   : nvidia-driver-535-open - third-party non-free
      driver   : nvidia-driver-545-open - distro non-free
      driver   : nvidia-driver-535 - third-party non-free
      driver   : nvidia-driver-570-server - distro non-free
      driver   : nvidia-driver-525 - third-party non-free
      driver   : nvidia-driver-550 - third-party non-free
      driver   : nvidia-driver-550-open - third-party non-free
      driver   : nvidia-driver-570-server-open - distro non-free
      driver   : nvidia-driver-555 - third-party non-free
      driver   : nvidia-driver-570-open - third-party non-free
      driver   : nvidia-driver-545 - third-party non-free
      driver   : nvidia-driver-555-open - third-party non-free
      driver   : xserver-xorg-video-nouveau - distro free builtin
      
      gemini建议还是安装nvidia-driver-570 ,但是这个一定会安装dkms,然后内核又要被修改,我发现它是把我的内核降级了,这个实在是讨厌,很多问题就出在这个内核版本改变的问题上。而且我还是最好重启再安装cuda。 我问DeepSeek是否必须要安装cuda,它说:
      1. Option 1: Install Minimal CUDA Libraries (Recommended)

        1. install
          
          sudo apt install nvidia-cuda-toolkit
          
        2. verify
          
          dpkg -L nvidia-cuda-toolkit | grep libcublas
          
        3. build
          
          cd llama.cpp
          make clean
          make LLAMA_CUBLAS=1
          
      2. Option 2: Use OpenCL/CLBlast (No CUDA)

        1. Install OpenCL libraries:
          
          sudo apt install nvidia-opencl-dev ocl-icd-opencl-dev clinfo
          
        2. Verify OpenCL detection:
          
          clinfo  # Check if your GPU appears
          
        3. Build llama.cpp with OpenCL:
          
          cd llama.cpp
          make clean
          make LLAMA_CLBLAST=1
          
        4. Run with OpenCL:
          
          ./main -m <model> -n 128 -ngl 50 --gpu-layers 50
          

      Which Should You Choose?

      Approach     | Pros                           | Cons
      Minimal CUDA | Faster performance, easy setup | Still requires CUDA libraries
      OpenCL       | Avoids CUDA entirely           | Slower, less optimized for NVIDIA GPUs

    四月十日 等待变化等待机会

    1. 看了llama.cpp的模型解析机制才明白它的底层就是所谓的ggml库在工作。这里我想来比较一下1.5b大模型和7b大模型的区别。 那么差别是什么呢? 去除无关的部分列表如下:
      item                          | 1.5b                                                | 7b
      qwen2.embedding_length        | 1536                                                | 3584
      qwen2.feed_forward_length     | 8960                                                | 18944
      qwen2.attention.head_count    | 12                                                  | 28
      qwen2.attention.head_count_kv | 2                                                   | 4
      tokenizer.ggml.tokens         | arr[str,151936]                                     | arr[str,152064]
      tokenizer.ggml.token_type     | arr[i32,151936]                                     | arr[i32,152064]
      file size                     | 1.04 GiB (5.00 BPW)                                 | 4.36 GiB (4.91 BPW)
      n_embd                        | 1536                                                | 3584
      n_head                        | 12                                                  | 28
      n_head_kv                     | 2                                                   | 4
      n_gqa                         | 6                                                   | 7
      n_embd_k_gqa                  | 256                                                 | 512
      n_embd_v_gqa                  | 256                                                 | 512
      n_ff                          | 8960                                                | 18944
      model params                  | 1.78 B                                              | 7.62 B
      n_vocab                       | 151936                                              | 152064
      CPU_Mapped model buffer size  | 1059.89 MiB                                         | 4460.45 MiB
      CPU KV buffer size            | 112.00 MiB                                          | 224.00 MiB
      KV self size                  | 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB  | 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
      CPU compute buffer size       | 299.75 MiB                                          | 304.00 MiB
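      顺手验算一下表里的几个派生量(这是我自己的推算,不是llama.cpp输出的一部分):head_dim = n_embd / n_head = 1536/12 = 3584/28 = 128;n_gqa = n_head / n_head_kv,即12/2 = 6和28/4 = 7;n_embd_k_gqa = n_head_kv × head_dim,即2×128 = 256和4×128 = 512,都和上表一致。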

    四月十三日 等待变化等待机会

    1. 简单地看看DeepSeek-R1的权重,发现几乎没有一个byte是0的,这让人吃惊,可是也是在意料之中吧?早先那些认为GPT类大模型里有大量为0的权重的想法肯定都过时了。说不定DeepSeek的权重本来就是某种紧密(dense)的形式,因为所有权重里找不到一个0,这不是很特殊吗?
    2. gguf如何存储各种各样的数据的? 这里就是这么几种情况
      1. 简单类型放在data里,反正是一个万能的容器。
      2. 某种类型的数组,也是放在data里,那么如何取出来呢?这里有一个关键就是如何存放类型了,这个是经典的C++的解决办法:传统的C程序员只能存放类型的长度,无法对类型做动态的存储,于是就有混淆,比如同样是4个byte的数字,到底是浮点数还是整数呢?这个是经典的C++模板的威力:首先定义一个万能模板,就是所谓的general的类型,然后我们对不同类型做特例化(见下面的示意)。不过美中不足的是作者使用一个map来存储长度,我本来想着在存储的时候自动把类型的长度写下来存储起来,但是各有利弊。
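      下面按我的理解画一个示意(假想代码,并不是gguf-py或ggml里的原文):通用模板不给定义,每个支持的类型做特例化,把C++类型映射到一个类型枚举,这样同样是4个字节的uint32_t和float就不会混淆了:
      
      #include <cstdint>
      #include <cstdio>
      
      enum class gguf_value_type { UINT32, FLOAT32, INT64 };    // 假设的类型枚举
      
      template <typename T> struct gguf_type_of;                // 通用模板不定义:用到未特例化的类型会编译失败
      template <> struct gguf_type_of<uint32_t> { static constexpr gguf_value_type value = gguf_value_type::UINT32; };
      template <> struct gguf_type_of<float>    { static constexpr gguf_value_type value = gguf_value_type::FLOAT32; };
      template <> struct gguf_type_of<int64_t>  { static constexpr gguf_value_type value = gguf_value_type::INT64; };
      
      int main() {
          // 同样是4个字节,uint32_t和float会得到不同的类型标签
          std::printf("%d %d\n", (int) gguf_type_of<uint32_t>::value, (int) gguf_type_of<float>::value);
          return 0;
      }
      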
    3. llama的vocabulary也就是llama_vocab,它要看这些属性:LLM_KV_TOKENIZER_MODELLLM_KV_TOKENIZER_PRE以及它的数量LLM_KV_TOKENIZER_TOKEN_TYPE_COUNT 而DeepSeek-R1的模型类型是gpt2,之前的类型是qwen2,这个模型需要查看LLM_KV_TOKENIZER_MERGES,而这里的所谓的merge是存储的一对用空格分开的所谓的分词或者称之为subword。于是我们的词汇表实际上是存储为一对分词。怎么使用呢?
      1. 针对用户输入,首先要用regex来做切分,也就是unicode_regex_split,而每一个大模型的tokenizer都会附带一个regex来做这个分词前的预切分。
      2. llm_tokenizer_bpe是llama内部存储的regex,但是照理说这个应该是从safetensors一起配套的tokenizer.json里读到的,但是似乎llama存了一个固定的,而且有好几个版本,这个是PRE_TYPE_DEEPSEEK_LLM 这个是PRE_TYPE_DEEPSEEK3_LLM 这个是PRE_TYPE_DEEPSEEK_CODER 我们来看看在DeepSeek-R1的tokenizer.json里是怎么写的吧? 换言之,DeepSeek-R1是从DeepSeek-V3继承来的,这个当然是废话。也就是说照理说应该是使用convert_hf_to_gguf.py直接从tokenizer.json里面拷贝来的,而llama.cpp是直接写死在代码里的,这个应该是大模型核心紧密的部分,没法改变,因为token是核心,每个模型稍微变一点点就鸡同鸭讲了,这个当然也是废话。
      但是我们第一步还到不了这里因为要先把UTF-8变为所谓的unicode。这个就是unicode_cpts_from_utf8,这个代码是比较经典的,因为没有什么更好的做法,都是一个最多四个byte的转换,当然要去判断第一个byte的长度bit,我在很多地方看到过,也没有听说有什么更高效的办法,也许将来芯片可以去做吧?
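      这个"看首字节前缀决定长度、后续字节各贡献6个bit"的套路,用一个最简的草图写出来大概是这样(不是unicode_cpts_from_utf8的原文,也没有处理过长编码、代理区等非法码点的细节):
      
      #include <cstdint>
      #include <string>
      #include <vector>
      #include <stdexcept>
      
      static std::vector<uint32_t> cpts_from_utf8(const std::string &s) {
          std::vector<uint32_t> out;
          for (size_t i = 0; i < s.size(); ) {
              uint8_t b = (uint8_t) s[i];
              uint32_t cp; int n;
              if      (b < 0x80)           { cp = b;         n = 1; }   // 0xxxxxxx
              else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F;  n = 2; }   // 110xxxxx
              else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F;  n = 3; }   // 1110xxxx
              else if ((b & 0xF8) == 0xF0) { cp = b & 0x07;  n = 4; }   // 11110xxx
              else throw std::runtime_error("invalid utf-8 lead byte");
              if (i + n > s.size()) throw std::runtime_error("truncated utf-8");
              for (int k = 1; k < n; ++k) {
                  uint8_t c = (uint8_t) s[i + k];
                  if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
                  cp = (cp << 6) | (c & 0x3F);                          // 每个后续字节贡献低6个bit
              }
              out.push_back(cp);
              i += n;
          }
          return out;
      }
      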
    4. 针对regex里有所谓的collapsed的做法,代码里是看这个表达式是否在regex里出现: 很显然的DeepSeek-R1有这个"\\p{N}{1,3}",所以是collapsed。这里是关于它的讨论,说老实话我一眼看过去是看不明白的,但是我知道这里的水很深,非常的麻烦,尤其我至今也搞不明白表达式本身是否应该使用unicode,比如中文你要怎么表达呢?而且还有跨平台,因为似乎C++的实现也有支持的问题,以前听说std的实现不如boost的,有重大的效率问题等等。总而言之是很复杂的问题。
      
      The original unicode text is converted to a single-byte extended ASCII, using the following transformation:
      I.e., all single-byte codepoints less than 128 remain the same, while the rest of the codepoints are converted (collapsed) to one of the following values: 0xD0, 0xD1, 0xD2, 0xD3 depending on their codepoint category:
      Next, the category patterns in the original regex are replaced to match the set of the collapsed byte + the sub-128 codepoints from that category:
      For example, the regex \p{N} becomes [\xD1\x30-\x39]
      We now apply the "collapsed" regex onto the "collapsed" string using std::regex to get the splitting offsets
      This algorithm only works if there are no non-ASCII characters in the original regex. For example, we cannot apply this to: "[一-龥ࠀ-一가-퟿]+|\\p{N}+", but luckily there seem to be no such patterns. Regex use either unicode characters or unicode categories but never both
      If a regex contains non-ASCII characters and does not contain unicode categories, we apply std::wregex
      
      这里大侠使用工具脚本从unicode官方网站下载了这个。我的感觉是所有的unicode的codepoints,然后脚本把合法的分段,因为不连续吧?更加有用的是把unicode进行了分类,到底是数字还是字符还是标点等等。这个是非常繁琐的工作,但是你也没有什么好办法。
    5. 所以,一言以蔽之,说起来只是一个unicode_cpts_from_utf8的转换,但是里面的工作很多,这里还有模型自身的要求。这里unicode_regex_split_custom_gpt2是模型自身要求的regex,难道说大侠把regex实现了一遍?应该说是针对早期的特殊简单regex的特例实现了一下,后来各个大模型千奇百怪的regex怎么可能把轮子都造一遍呢?但是很关键的是如何解决所谓的std::regex / std::wregex,我曾经尝试过一次,非常的混乱,似乎有大侠说不要使用后者,可是你能避免吗?
    6. 其实最终的目的是什么?是根据模型自带的regex来做所谓的分词,我的理解就是每个模型训练的时候是针对特定语言的,而且是特别的分词来训练的,所以,这个来不得半点差错,这个当然是废话。所以,我们在使用模型自带的regex得到这个函数unicode_regex_split_stl的返回的一个所谓的bpe_offset就是一个分词的偏移向量或者说数组,我自然的想那就是要做merge的了。
    7. 我现在开始怀疑DeepSeek的regex是不是正确,因为各个online的regex网站都认为它不对。 这个就是很简单的任何一个除了下划线_以外的标点符号跟着无限长的字符。再次牢记在character classes或者说方括号里只有这几个特殊字符[]\-^,而且不允许有[]递归。就是说[]是顺序的。如果你在代码里不使用Raw的话肯定是很难受的。
    8. 折腾了好久,分别拷问gemini和deepseek,总算有些开窍了:这个\p{N}{1,3}指的是unicode里被划分为数字(Number)的字符,而这并不包含一二三四五六七八九十,它们属于所谓的Lo(Letter, other)类别。

    四月十四日 等待变化等待机会

    1. 昨天是头脑极度混乱的状态下,因为每走一步都遇到很多问题导致不断的偏离轨道去解决旁枝末节的基础。任何搞科研和工程实践中如果基础不够雄厚都是如此,因为你会在所有的细节上暴露自己的无知与愚蠢,然后你的能力体现在以多快的速度从这些被牵扯的问题里拉回来。虽然不是所有的细枝末节都应该要解决的,可是有很多暴露出的问题是核心,一旦被搁置也许导致的后果是不可预测的。
    2. 我现在回想起来应该继续看tokenizer的merge如何被使用,这个原理我在llama.c那一页代码实现transformer里就看到过,只不过当时不敢相信。现在我们来证实一下原理吧。
    3. 但是我依然需要完成regex里的重要的unicode分类的问题,这个是gemini的解答:

      Common Unicode Categories

      
      L: Letter
          Matches any kind of letter from any script.
          Examples: a, A, あ, 漢字
      P: Punctuation
          Matches any kind of punctuation character.
          Examples: ., ,, !, ?, ", '
      S: Symbol
          Matches any kind of symbol.
          Examples: $, %, +, ©, ★
      M: Mark
          Matches a character that combines with or modifies another character.
          Examples: Accents, diacritics (e.g., in á, è, or combining characters in some Asian scripts).
      

      Full List of General Categories

      
      L: Letter
          Lu: Uppercase Letter
          Ll: Lowercase Letter
          Lt: Titlecase Letter
          Lm: Modifier Letter
          Lo: Other Letter
      M: Mark
          Mn: Nonspacing Mark
          Mc: Spacing Combining Mark
          Me: Enclosing Mark
      N: Number
          Nd: Decimal Digit
          Nl: Letter Number
          No: Other Number
      P: Punctuation
          Pc: Connector Punctuation
          Pd: Dash Punctuation
          Ps: Open Punctuation
          Pe: Close Punctuation
          Pi: Initial Punctuation
          Pf: Final Punctuation
          Po: Other Punctuation
      S: Symbol
          Sm: Math Symbol
          Sc: Currency Symbol
          Sk: Modifier Symbol
          So: Other Symbol
      Z: Separator
          Zs: Space Separator
          Zl: Line Separator
          Zp: Paragraph Separator
      C: Other
          Cc: Control
          Cf: Format
          Co: Private Use
          Cs: Surrogate
      
    4. 在明白了category之后再来审视DeepSeek的regex,就感觉不是很能理解,我去除了多余的escape之后,再把三组合在一起:
      1. 这个太复杂,我们再拆分一下,根据regex101.com的提示:
        1. 1st Alternative 这里揭示出我的理解错误在character class或者说[]之间
          Negation (^): As you mentioned, inside a character class, ^ has a special meaning only when it's the very first character after the opening square bracket [. In that position, it negates the character class (meaning "match any character not in this set"). If ^ appears anywhere else within the character class, it's treated as a literal ^ character.
          所以,^_没有特殊含义。 这些特殊字符就是本身再加上至少一个英文字母。
        2. 2nd Alternative 这个现在也比较清楚了,只多一个开头的非换行符,非字母,非标点,非符号,然后跟随至少一个字母或者控制符(mark)
        3. 3rd Alternative 注意第一个空格字符
          matches the character with index 32 (decimal), i.e. 0x20 (hex) or 040 (octal), literally (case sensitive)
          gemini替DeepSeek辩护
          
          Why Not Just \s?
          
          You'd think that \s would be the most straightforward way to represent a space in a regex, and often it is. However, there can be subtle reasons or contexts where a literal space is used, or where there might be a need to be very specific:
          
              Clarity in a Specific Context: In some cases, a regex might be written to be extremely explicit, especially if it's dealing with a situation where other whitespace characters (like tabs or newlines) could cause problems. Using a literal space ensures that only a space character is matched.
          
              Potential for \s Redefinition: In some very specific or unusual regex environments, the definition of \s might be altered or extended to include other characters beyond the standard whitespace set. While rare, this could lead to unexpected behavior if you rely on \s when you truly only want a space.
          
              Historical Reasons/Legacy: Some older regex patterns or systems might have been written before Unicode support and the modern understanding of whitespace were fully standardized.
          
          反正我是不信。就是至少一个空格带着若干标点或者字母以及若干回车符
        4. 4th Alternative 这个应该是无争议的吧?就是一堆无意义的空白字符
        5. 5th Alternative 空白字符后面跟着一个"不能是非空白字符"的断言,那是什么?
          \S matches any non-whitespace character (equivalent to [^\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
          gemini说
          (?!\S): This is a negative lookahead assertion. It asserts that what immediately follows the matched whitespace is not a non-whitespace character. In other words, it asserts that the whitespace is followed by either the end of the string or another whitespace character.
          关于?!我很久以前用过又要补课了

          Okay, I'll explain what ?! means in regex, referring to the information in the Canvas document, "Regex with \s for Space Character".

          ?! in regex is used for a negative lookahead assertion.

          Here's a more detailed explanation:

          • Lookahead Assertion: A lookahead assertion is a type of zero-width assertion. This means it doesn't consume any characters in the string being searched; instead, it asserts a condition about what comes after the current position.
          • Negative Lookahead (?!): The negative lookahead assertion (?!...) asserts that the expression inside the parentheses cannot be matched after the current position.

          In simpler terms, (?!...) checks if a certain pattern does not follow the current position in the string. If the pattern does follow, the match fails. If the pattern does not follow, the match succeeds.

      2. 东亚文字(只有中日没有韩国,韩文不算东亚文字,纯粹是自创的脱离中华文化圈的蛮夷文化)(以上解读不代表我的观点纯粹是DeepSeek-R1的regex内涵)
      3. 一到三个数字
    5. 总而言之,DeepSeek的regex没有意义,我从llama.cpp里抠出来的似乎才对

    四月十五日 等待变化等待机会

    1. 终于明白大神为什么费劲实现这个Unicode Category了,因为C++的regex不支持啊!只能自己做!这个是DeepSeek的解释:
      The C++ Standard Library's std::regex engine (as defined in <regex>) does not natively support Unicode property escapes like \p{N} (for numeric characters) or other Unicode categories (e.g., \p{L} for letters). This is a known limitation of the default regex syntax in C++, which is based on ECMAScript syntax (the default grammar) and lacks Unicode property support.
      对于这个\p{N}失败的原因是
      Why \p{N} Fails
      • The \p{...} syntax is part of Unicode Technical Standard #18 (UTS18) and is supported by regex engines like PCRE, ICU, or Perl, but not by the C++ Standard Library's std::regex.

      • In ECMAScript-compatible mode (the default for std::regex), \p is interpreted as a literal p, and {N} is parsed as an invalid range quantifier (e.g., {min,max}), leading to the "invalid range" error.

      DeepSeek给出了根本上语法就没有实现。所以,要么自己实现要么依赖第三方库。我这才明白为什么自己去创建那个UnicodeData.txt的表,这个是没有办法啊。
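      这一点可以用一个小实验来验证(结果可能因标准库实现而异,这只是我自己的检验手段):
      
      #include <regex>
      #include <cstdio>
      
      int main() {
          try {
              std::regex re("\\p{N}{1,3}");   // 默认ECMAScript语法
              std::printf("compiled without error (unexpected on libstdc++)\n");
          } catch (const std::regex_error &e) {
              std::printf("regex_error: %s\n", e.what());   // libstdc++在这里报的就是brace/range相关的错误
          }
          return 0;
      }
      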
    2. 关于DeepSeek自带的regex和llama.cpp修改的版本差异对比是这样子的 一模一样!连那个容易被忽略的空格符都是一样的。那么我为什么怀疑它的正确性呢?原因当然是其中的意义在哪里?当然首先我问了DeepSeek和Gemini如果同样的这个对于python和C++的regex引擎表现是否不同,gemini认为差别主要就是UnicodeCategory的不支持上。
    3. 类似的,在www.regex101.com上不能选择ECMAScript,这样结果似乎是对的,但是regex101.com也许实现有瑕疵吧?就是它的python的flavor对于\\-在普通python字符串里依然不认为它是被转义的。这个已经远远偏离了我的目的。就是说,如果我使用不受代码语言escape干扰的regex,在PCRE2下能够正确覆盖所有的匹配,我就应该可以信服DeepSeek的regex没有问题了;至于说我在本地的python脚本里测试不能达到所有的预期,那是gemini说的。我不妨问问DeepSeek交叉验证一下。结果DeepSeek太牛了,它不但会生成更加详细的测试例,而且有agent功能,就是直接运行python代码把结果都分析了,而且还提出改正的结果以及改正后的测试。我已经彻底被替代了,因为之前我还是一个工具人,执行拷贝粘贴代码及其运行结果的功能,现在彻底被代替了。
    4. 对于非python程序员来说你写一个例子命名为regex.py就会导致什么问题呢?无限递归的import啊!不过很多错误到底是python的还是表达式的,我也不知道。这个重要吗?

    四月十九日 等待变化等待机会

    1. 不论一个算法写的有多么清晰与详细,对于没有编程经验的人来说依然是天书,类似的,哪怕是略有经验的,一知半解的也依然百思不得其解。首先是问题的理解,如果你压根不明白问题在哪里,那你压根儿不应该看这个代码。对于实践者来说,为了部署与平台一致性不愿意使用boost的regex是可以理解的。毕竟你不希望任何一个编程小白为了能够使用ollama而安装boost的regex的动态库吧?哪怕是部署静态库也是不可接受的,也许还有效率的考虑。总而言之,因为std::regex对于unicode的支持不充分,就是unicode category没有实现,导致大神费劲周折想出了这么一个相当的复杂的算法。我看了差不多一个星期才大概理解了吧?
    2. 首先是所谓的collapsed英文我始终没有真正的理解,对于一个Unicode的category比如\p{N}要怎么理解?这个是只有在Unicode下才能来起作用的category,那么照理说你应该交给regex来处理,可是regex没有实现,怎么办?我能否简单的用[0-9]替代呢?这个应该是我的直觉,看起来是诱人的,因为如果我把regex的pattern也从UTF-8的编码或者ASCII的单字节转为Unicode,也就是说我不使用regex而是使用wregex来处理已经从UTF-8编码转为Unicode的输入,这样子不是就可以了吗?这里也许还有很多的不确定性,但是大神走的是另一条路。不是膨胀而是收缩,就是把unicode缩减为单字节。比如费心费力的把Unicode所有的字符都制作了一个,当然是把所有的code都分类成{L,P,S,M,N}这么几大类,这个实际上就是实现了原始的分类。可是关键的是细节,我大概都能猜出一些路径,但是依旧卡壳了。因为归根结底你这么做并不能解决复杂的regex的pattern究竟想要干什么,你到底是要这些还是不要这些在表达式里是千奇百怪的,而且作为一个character class的表达式是不允许嵌套的,这就是大神费劲的自己去解析原版的regex来查看这个\p{L}之类的是否是在character class内,就是在[]内,而最最关键的一点,我却没有理解到,因为大神使用了看似多余的做法,替换regex pattern的\p{N}的动作是双重的,既使用等效表达式[0-9],又使用自己定义的category的符号标志,而这个就是迷惑我的地方,直到我使用gdb看运行期数据才恍然大悟,其实,你把unicode缩减为五大类的标志位,那么对于class category它们就是literal了,而这里大神为了不知名的优化刻意把ASCII的unicode对应部分免除了这个流程导致了一个现象,就是ASCII和自定义的这五大类的标志并行。 如此复杂和未充分优化的算法让我要吐血了。因为我自己也不会明白我写的是什么。因为当最后把Unicode的category进行替换之后,你会看到还有之前过滤掉的ASCII混合在一起,这就是为什么最后又要反复调用unicode转为UTF-8的原因。这里我的疑惑就是难道大模型的regex语法是只针对unicode吗?我现在累死了想不清楚了。

    四月二十一日 等待变化等待机会

    1. 有些最最基本的我却缺乏概念!什么是wchar_t,这个简直是让人汗颜啊!我自己存的都是吃灰的从来没有用过啊!
      wchar_t — type for wide character representation (see wide strings). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type. In practice, it is 32 bits and holds UTF-32 on Linux and many other non-Windows systems, but 16 bits and holds UTF-16 code units on Windows. The standard used to require wchar_t to be large enough to represent any supported character code point. However, such requirement cannot be fulfilled on Windows, and thus it is considered as a defect and removed.
      我终于明白我为什么记不住这个wchar_t因为在Windows下它是有缺陷的所以弃而不用,这个算是给我找了一个记不住的理由!
    2. 这个ASCII码表本来应该是熟记于心的但是总是有几个控制字符记不住:
      ASCII值(十进制) | 十六进制    | 字符                          | 返回非零的分类函数(对应的宽字符版iswxxx同理)
      0–8              | \x0–\x8    | control codes (NUL, etc.)     | iscntrl
      9                | \x9        | tab (\t)                      | iscntrl, isspace, isblank
      10–13            | \xA–\xD    | whitespaces (\n, \v, \f, \r)  | iscntrl, isspace
      14–31            | \xE–\x1F   | control codes                 | iscntrl
      32               | \x20       | space                         | isprint, isspace, isblank
      33–47            | \x21–\x2F  | !"#$%&'()*+,-./               | isprint, isgraph, ispunct
      48–57            | \x30–\x39  | 0123456789                    | isprint, isgraph, isalnum, isdigit, isxdigit
      58–64            | \x3A–\x40  | :;<=>?@                       | isprint, isgraph, ispunct
      65–70            | \x41–\x46  | ABCDEF                        | isprint, isgraph, isalnum, isalpha, isupper, isxdigit
      71–90            | \x47–\x5A  | GHIJKLMNOPQRSTUVWXYZ          | isprint, isgraph, isalnum, isalpha, isupper
      91–96            | \x5B–\x60  | [\]^_`                        | isprint, isgraph, ispunct
      97–102           | \x61–\x66  | abcdef                        | isprint, isgraph, isalnum, isalpha, islower, isxdigit
      103–122          | \x67–\x7A  | ghijklmnopqrstuvwxyz          | isprint, isgraph, isalnum, isalpha, islower
      123–126          | \x7B–\x7E  | {|}~                          | isprint, isgraph, ispunct
      127              | \x7F       | backspace character (DEL)     | iscntrl
    3. 在古老的C程序员习惯里,声明一个数组并初始化往往就是最简单的工作,但是不定长度要怎么做呢?就把方括号留空吧,可是这是一个好习惯吗?它毕竟是一个运行期的东西,为什么不使用std::initializer_list?
      
      An object of type std::initializer_list is a lightweight proxy object that provides access to an array of objects of type const T (that may be allocated in read-only memory). 
      
      当然我似乎找不出一定要的理由,比如
      
      const int arr[] = {1,2,3,4,5};
      static_assert(sizeof(arr) == sizeof(int)*5, "sizeof array");
      
      C程序员同样可以做到;这里使用std::begin(arr),当然constexpr并不是必须的,因为const就足够了,只不过我要用static_assert来证明它在编译期就能通过而已。
      
              constexpr int arr[] = {1,2,3,4,5};
              static_assert(sizeof(arr) == sizeof(int)*5, "sizeof array");
              static_assert(*std::begin(arr) == 1, "arr[0] == 1");
      
    4. 大侠的注释里说wregex的\s匹配不到非ASCII的whitespace,只好用0x0B来顶替:
      
      // std::wregex \s does not mach non-ASCII whitespaces, using 0x0B as fallback
      std::wstring wtext(cpts.begin(), cpts.end());
      for (size_t i = 0; i < wtext.size(); ++i) {
      	if (wtext[i] > 0x7F && unicode_cpt_flags_from_cpt(wtext[i]).is_whitespace) {
      		wtext[i] = 0x0B;
      	}
      }
      
      代码里这些所谓的whitespace是这么定义的:
      
      const std::unordered_set<uint32_t> unicode_set_whitespace = {
      0x000009,
      0x00000A,
      0x00000B,
      0x00000C,
      0x00000D,
      0x000020,
      0x000085,
      0x0000A0,
      0x001680,
      0x002000,
      0x002001,
      0x002002,
      0x002003,
      0x002004,
      0x002005,
      0x002006,
      0x002007,
      0x002008,
      0x002009,
      0x00200A,
      0x002028,
      0x002029,
      0x00202F,
      0x00205F,
      0x003000,
      };
      
      可是根据Unicode的定义 但是看大侠的脚本的注释才找到这个文档 这个我将信将疑。这里的关键是在调用std::wcout之前和之后都要调用fwide(stdout, -1); 才能保证随后的std::cout能够使用窄字符了。 结果呢?从regex的角度来看,这个是无厘头啊!有很多人认为\s应该包含这些所谓的控制符。 注意\u0085这个whitespace不被承认,这个就是大侠所做的,而我花了一个早上才搞明白!还有很多其他的不被承认的whitespace字符
      
      $ ./test7
      about to process:space ,nextline
      ,linetab
              ,linefeed
      ,tab    !NEL\u0085�!rriage
      with regex:\s
      [0x20]
      [0xa]
      [0xb]
      [0xc]
      [0xd]
      [0x9]
      about to process:space ,nextline
      ,linetab
              ,linefeed
      ,tab    !NEL\u0085�!rriage
      with regex:\s+
      [0x20]
      [0xa]
      [0xb]
      [0xc]
      [0xd]
      [0x9]
      about to process:space ,nextline
      ,linetab
              ,linefeed
      ,tab    !NEL0x85�!carriage
      with regex:\s
      [0x20]
      [0xa]
      [0xb]
      [0xc]
      [0xd]
      [0x9]
      about to process:space ,nextline
      ,linetab
              ,linefeed
      ,tab    !NEL0x85�!carriage
      with regex:\s+
      [0x20]
      [0xa]
      [0xb]
      [0xc]
      [0xd]
      [0x9]
      nick@nick-sager:/tmp$ 
      
    5. 一个早上都在为一个个拦路虎所困扰,有些是白痴问题,比如这个gemini解释的很好
      • The form feed (\f) likely moved the cursor to the beginning of a new line.
      • The carriage return (\r) moved the cursor back to the beginning of the current line (the one where the comma was printed after the form feed).
      • The subsequent characters "tab" overwrote the beginning of that line.
      • The tab (\t) then moved the cursor further to the right before printing the "!".
      • The characters "carriage" were printed before the carriage return moved the cursor back.
    6. 但是这个问题呢? 这个问题要怎么解释呢?
    7. 提了一个GCC的bug。问题就在于这个明显是库的问题,因为MSVC是正确的
    8. 提完了bug,我都快累死了。一个早上就在这个泥潭里打转转,我都彻底忘记要干什么了。哦,我大概明白了大神所说的wregex的\s仅仅支持ASCII码的whitespace的意思了,就是说没有问题只要是在127以内的这些所谓的whitespace是可以抓到的,但是超过部分就不行了!
    9. 偶然看到以前的一个文档,这个应该对于我现在的服务器上的RAID卡是有帮助的。

    四月二十三日 等待变化等待机会

    1. 一个貌似很简单的问题,但是就是解决不了,让我怀疑OpenCV究竟能做什么?
    2. 奋战了一个晚上,终于gemini完成了我的期待,我把代码存放在这里。这个中间修正了多少bug呢?简直是难以计数,我中间都几乎放弃了好几次。太累了。

    四月二十六日 等待变化等待机会

    1. 这个编程的强度还是很大的,尽管我自己几乎没有怎么写代码,可是我的工作强度还是很大的,因为我也需要运行debug,有时候还必须看看gemini的代码,自从它上一次胡说八道SGF之后我还是要对它小心一些,否则最后被带到沟里去也不知道,所幸所有的问题都在测试中呈现。这就是计算机世界里很难造假的原因,太容易发现了,因为能够重现的假象一定很难造;如果真的造成了,那很可能就是真相。总而言之,关于图像处理的稳定工作我想放一下,等到真的围棋棋盘到了再说,实际的测试环境也许更容易,因为一开始的棋盘是简单的,只有中盘才是复杂的。
    2. 针对ubuntu下存取webcam的照片要访问/dev/videoX和使用ioctl的权限问题,我实在是不愿意使用sudo,因为太危险,而且debug环境也有问题,一个简单的解决办法就是把我自己加入到video这个group里:
      
      $ sudo usermod -a -G video $(id -nu)
      
      当然这样做你要明白两件事:一是加入group之后要重新登录(或者用newgrp)才会生效;二是调用webcam的v4l2接口时它应该用的是所谓的effective user id吧。总而言之,你是以video这个group的身份在操作,那么本地的文件读写又成了问题,我本来以为video也是一个用户,才发现它只是一个group,那么我只能把文件输出到/tmp或者预先设定了other写权限的地方。

    四月二十七日 等待变化等待机会

    1. 老爸的《卜算子》
      《卜算子》春夏之交
      
      云雾谷雨天,
      芳菲春无力。
      杜鹃啼叫布谷声,
      催播田种地。
      *****
      旧叶换新芽,
      花谢结果子。
      原野翠绿雨乍晴,
      夏来南风起。
      
      我修改了一下:
      《卜算子--春夏之交》
      
      云雾谷雨天,
      春去芳菲寂。
      杜鹃啼血布谷寒,
      催人耕田地。
      *****
      旧叶换新芽,
      花谢结青子。
      原野翠绿雨乍晴,
      风暖夏将至。
      
    2. 这个是标准的做法
      
      #include <stdexcept>
      #include <string>
      #include <sstream>
      
      class GEMError : public std::runtime_error {
      public:
        GEMError(const std::string &message, const char* file, int line, const char* function)
            : std::runtime_error(constructMessage(message, file, line, function)) {}
      
      private:
        static std::string constructMessage(const std::string &message, const char* file, int line, const char* function) {
          std::stringstream ss;
          ss << message << " in " << function << " at " << file << ":" << line;
          return ss.str();
        }
      };
      
      // Macro for throwing GEMError exceptions with source code information
      #define THROWGEMERROR(message) \
      throw GEMError(message, __FILE__, __LINE__, __PRETTY_FUNCTION__);
      
      class SGFError : public std::runtime_error {
      public:
        SGFError(const std::string &message) : std::runtime_error(message) {}
      };
      
      感觉gemini的代码生成过程有重大问题,就是说它的code diff总是出错。
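      补一个用法示意(load_board是假想的函数,只是演示宏展开后异常信息里会自动带上文件、行号和函数名;假设上面的GEMError和THROWGEMERROR已在同一个编译单元里定义):
      
      #include <iostream>
      
      static void load_board(int size) {
          if (size != 19) {
              THROWGEMERROR("unsupported board size: " + std::to_string(size));
          }
      }
      
      int main() {
          try {
              load_board(13);
          } catch (const GEMError &e) {
              std::cerr << e.what() << std::endl;   // 类似 "unsupported board size: 13 in void load_board(int) at main.cpp:NN"
          }
          return 0;
      }
      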

    Smiley face