In the standard container networking model, traffic leaving a Pod first passes through the network stack of the Pod's network namespace (including its iptables rules). It then crosses a veth pair into the host network namespace, where the host's iptables rules handle routing and address translation, and finally leaves through the host NIC. In Cilium's native-routing eBPF Host Routing mode, two BPF helpers shortcut this path:

- On ingress, bpf_redirect_peer() redirects a packet from the TC ingress hook of the physical NIC directly to the ingress of the Pod-side veth device.
- On egress, bpf_redirect_neigh() sends a Pod's packet straight to the physical NIC and lets the kernel's neighbor subsystem fill in the next-hop L2 addresses.

Packets therefore skip the receive processing of the host-side veth device and the host's iptables, which significantly improves network performance.

References:
- eBPF-based host routing
- Optimizing virtual Ethernet devices with eBPF
- Distinguishing the three types of eBPF redirect

Deployment process

Replacing kube-proxy with Cilium

Skip the kube-proxy installation when deploying the k8s cluster, and let Cilium replace kube-proxy. Cilium implements socket-based load balancing (also called host-reachable services) by attaching BPF cgroup programs. The BPF cgroup attach types that socket LB depends on (connect4, sendmsg4, and so on) are cgroup v2 features. cgroup v1 does not support them, so kube-proxy replacement with socket LB requires cgroup v2.

Use Kind to quickly create a cluster and deploy Cilium in native routing mode, with kubeProxyReplacement=true replacing kube-proxy:

    #!/bin/bash
    set -v
    # 1. Prepare NoCNI kubernetes environment
    cat <<EOF | HTTP_PROXY= HTTPS_PROXY= http_proxy= https_proxy= kind create cluster --name=cilium-kpr-ebpf --image=kindest/node:v1.27.3 --config=-
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    networking:
      ## Disable the cluster's default CNI
      disableDefaultCNI: true
      ## Do not deploy kube-proxy
      kubeProxyMode: none
    nodes:
      - role: control-plane
      - role: worker
      - role: worker
    EOF

    # 2. Remove kubernetes node taints
    controller_node_ip=$(kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}')
    kubectl taint nodes $(kubectl get nodes -o name | grep control-plane) node-role.kubernetes.io/control-plane:NoSchedule-

    # 3. Install CNI [Cilium v1.17.15]
    cilium_version=v1.17.15
    docker pull cilium/cilium:$cilium_version
    docker pull cilium/operator-generic:$cilium_version
    kind load docker-image cilium/cilium:$cilium_version cilium/operator-generic:$cilium_version --name cilium-kpr-ebpf

    helm repo add cilium https://helm.cilium.io ; helm repo update
    helm install cilium cilium/cilium \
      --set k8sServiceHost=$controller_node_ip \
      --set k8sServicePort=6443 \
      --version 1.17.15 \
      --namespace kube-system \
      --set image.pullPolicy=IfNotPresent \
      --set debug.enabled=true \
      --set debug.verbose="datapath flow kvstore envoy policy" \
      --set bpf.monitorAggregation=none \
      --set monitor.enabled=true \
      --set ipam.mode=cluster-pool \
      --set cluster.name=cilium-kpr-ebpf \
      --set kubeProxyReplacement=true \
      --set routingMode=native \
      --set autoDirectNodeRoutes=true \
      --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
      --set bpf.masquerade=true

    # 4. Verify separate cgroup namespaces and cgroup v2 [https://github.com/cilium/cilium/pull/16259 https://docs.cilium.io/en/stable/installation/kind/#install-cilium]
    #for container in $(docker ps -a --format "table {{.Names}}" | grep cilium-kpr-ebpf);do docker exec $container ls -al /proc/self/ns/cgroup;done
    #mount -l | grep cgroup
    docker info | grep "Cgroup Version" | awk '$1=$1'

Create the test Pods. They essentially run Nginx and are only used to capture packets while the Service is being accessed:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app: nginx
      name: pod
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - image: burlyluo/nettool:latest
            name: nettoolbox
            env:
              - name: NETTOOL_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
            securityContext:
              privileged: true
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: kubernetes.io/hostname
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: pod
    spec:
      type: NodePort
      selector:
        app: nginx
      ports:
      - name: http
        port: 80
        targetPort: 80
        nodePort: 32000

Check the deployment result:

    root@network-demo:~# kubectl get pods -A -o wide
    NAMESPACE    NAME                                     READY   STATUS    RESTARTS   AGE     IP           NODE
    default      pod-0                                    1/1     Running   0          111s    10.0.1.231   cilium-kpr-ebpf-worker2
    default      pod-1                                    1/1     Running   0          106s    10.0.0.153   cilium-kpr-ebpf-worker
    default      pod-2                                    1/1     Running   0          100s    10.0.2.19    cilium-kpr-ebpf-control-plane
    kube-system  cilium-7jjcl                             2/2     Running   0          4m39s   172.18.0.3   cilium-kpr-ebpf-control-plane
    kube-system  cilium-envoy-9jmzg                       1/1     Running   0          4m39s   172.18.0.3   cilium-kpr-ebpf-control-plane
    kube-system  cilium-envoy-pthdk                       1/1     Running   0          4m39s   172.18.0.4   cilium-kpr-ebpf-worker
    kube-system  cilium-envoy-s7j2t                       1/1     Running   0          4m39s   172.18.0.2   cilium-kpr-ebpf-worker2
    kube-system  cilium-mds5q                             2/2     Running   0          4m39s   172.18.0.2   cilium-kpr-ebpf-worker2
    kube-system  cilium-operator-7bfd9d69f4-fvvnp         1/1     Running   0          4m39s   172.18.0.2   cilium-kpr-ebpf-worker2
    kube-system  cilium-operator-7bfd9d69f4-pb9w6         1/1     Running   0          4m39s   172.18.0.4   cilium-kpr-ebpf-worker
    kube-system  cilium-psx5j                             2/2     Running   0          4m39s   172.18.0.4   cilium-kpr-ebpf-worker
    kube-system  coredns-5d78c9869d-2r5k2                 1/1     Running   0          10m     10.0.0.57    cilium-kpr-ebpf-worker
    kube-system  coredns-5d78c9869d-hzvn9                 1/1     Running   0          10m     10.0.0.67    cilium-kpr-ebpf-worker
    kube-system  etcd-cilium-kpr-ebpf                     1/1     Running   0          10m     172.18.0.3   cilium-kpr-ebpf-control-plane
    kube-system  kube-apiserver-cilium-kpr-ebpf           1/1     Running   0          10m     172.18.0.3   cilium-kpr-ebpf-control-plane
    kube-system  kube-controller-manager-cilium-kpr-ebpf  1/1     Running   0          10m     172.18.0.3   cilium-kpr-ebpf-control-plane
    kube-system  kube-scheduler-cilium-kpr-ebpf           1/1     Running   0          10m     172.18.0.3   cilium-kpr-ebpf-control-plane

Verification

Inspecting Cilium

1. Check Cilium's detailed runtime status

    ## Use cilium status --verbose for the most detailed status
    root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium status
    KVStore:                 Disabled
    Kubernetes:              Ok   1.27 (v1.27.3) [linux/amd64]
    Kubernetes APIs:         ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
    ## kube-proxy has been replaced by Cilium
    KubeProxyReplacement:    True   [eth0   172.18.0.3 172:18:0:1::3 fe80::a43a:82ff:fe6d:b272 (Direct Routing)]
    Host firewall:           Disabled
    SRv6:                    Disabled
    CNI Chaining:            none
    CNI Config file:         successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
    Cilium:                  Ok   1.17.15 (v1.17.15-4206eaa5)
    NodeMonitor:             Listening for events on 8 CPUs with 64x4096 of shared memory
    Cilium health daemon:    Ok
    IPAM:                    IPv4: 3/254 allocated from 10.0.2.0/24,
    IPv4 BIG TCP:            Disabled
    IPv6 BIG TCP:            Disabled
    BandwidthManager:        Disabled
    ## Native routing mode with eBPF host routing
    Routing:                 Network: Native   Host: BPF
    Attach Mode:             TCX
    Device Mode:             veth
    Masquerading:            BPF   [eth0]   10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
    Controller Status:       26/26 healthy
    Proxy Status:            OK, ip 10.0.2.84, 0 redirects active on ports 10000-20000, Envoy: external
    Global Identity Range:   min 256, max 65535
    Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 108.07   Metrics: Disabled
    Encryption:              Disabled
    Cluster health:          3/3 reachable   (2026-05-02T06:26:46Z)
      Name   IP   Node   Endpoints
    Modules Health:          Stopped(0) Degraded(0) OK(55)

2. Query Cilium Endpoint information

What "Endpoint" means in Cilium: Cilium assigns IP addresses to containers. A Pod can contain several containers, and they all share the same Pod IP. All containers that share one address are grouped together, and Cilium calls such a group an Endpoint.

The Cilium agent on each node manages only the Endpoints on that node, so cilium endpoint list returns different output on different nodes. This example uses the agent on the control-plane node:

    root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium endpoint list
    ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                  IPv4         STATUS
               ENFORCEMENT        ENFORCEMENT
    10         Disabled           Disabled          25489      k8s:app=nginx                                                10.0.2.19    ready
                                                               k8s:io.cilium.k8s.namespace.labels/metadata.name=default
                                                               k8s:io.cilium.k8s.policy.cluster=cilium-kpr-ebpf
                                                               k8s:io.cilium.k8s.policy.serviceaccount=default
                                                               k8s:io.kubernetes.pod.namespace=default
    786        Disabled           Disabled          1          k8s:node-role.kubernetes.io/control-plane                                 ready
                                                               k8s:node.kubernetes.io/exclude-from-external-load-balancers
                                                               reserved:host
    2559       Disabled           Disabled          4          reserved:health                                              10.0.2.110   ready

3. Query Cilium Service information

What "Service" means in Cilium: the actual forwarding state of a k8s Service in Cilium's eBPF maps. Because Cilium has replaced kube-proxy, it implements the k8s Service load-balancing table with eBPF maps instead of iptables rules:

    root@network-demo:~# kubectl exec -n kube-system cilium-7jjcl -- cilium service list
    ID   Frontend               Service Type   Backend
    1    10.96.0.1:443/TCP      ClusterIP      1 => 172.18.0.3:6443/TCP (active)
    2    10.96.58.81:443/TCP    ClusterIP      1 => 172.18.0.3:4244/TCP (active)
    3    10.96.0.10:53/UDP      ClusterIP      1 => 10.0.0.57:53/UDP (active)
                                               2 => 10.0.0.67:53/UDP (active)
    4    10.96.0.10:53/TCP      ClusterIP      1 => 10.0.0.57:53/TCP (active)
                                               2 => 10.0.0.67:53/TCP (active)
    5    10.96.0.10:9153/TCP    ClusterIP      1 => 10.0.0.57:9153/TCP (active)
                                               2 => 10.0.0.67:9153/TCP (active)
    6    10.96.231.166:80/TCP   ClusterIP      1 => 10.0.1.231:80/TCP (active)
                                               2 => 10.0.0.153:80/TCP (active)
                                               3 => 10.0.2.19:80/TCP (active)
    7    172.18.0.3:32000/TCP   NodePort       1 => 10.0.1.231:80/TCP (active)
                                               2 => 10.0.0.153:80/TCP (active)
                                               3 => 10.0.2.19:80/TCP (active)
    8    0.0.0.0:32000/TCP      NodePort       1 => 10.0.1.231:80/TCP (active)
                                               2 => 10.0.0.153:80/TCP (active)
                                               3 => 10.0.2.19:80/TCP (active)

Inspecting iptables rules

1. iptables rules in a cluster that uses kube-proxy

The environment queried here is a k8s cluster running Calico in VXLAN mode.

    ## Only the NodePort 3112 part is of interest here
    root@calico-vxlan:~# kubectl get svc -n deepflow-otel-spring-demo web-shop
    NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    web-shop   NodePort   172.96.248.195   <none>        18090:3112/TCP   176d

    ## Search the iptables rules for entries related to Service NodePort 3112
    root@ce-demo-1:~# iptables-save | grep 3112 -w
    ## Rule 1
    ## Additionally matches traffic destined for localhost, e.g. curl 127.0.0.1:3112
    ## nfacct is a packet counter provided by a kernel module. Here it only attaches the localhost_nps_accepted_pkts counter, purely for accounting,
    ## so that traffic sources can be told apart for monitoring
    -A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "deepflow-otel-spring-demo/web-shop:http-shop" -m tcp --dport 3112 -m nfacct --nfacct-name localhost_nps_accepted_pkts -j KUBE-EXT-YBYZULGZYV76JYRW
    ## Rule 2
    ## When external traffic accesses NodeIP:3112, it is steered into KUBE-NODEPORTS at the PREROUTING stage
    ## All TCP packets with destination port 3112 jump to KUBE-EXT-YBYZULGZYV