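In systems work, the hardest problems are usually about demarcation: deciding which side of a component boundary is at fault. Drawing that line between the virtio-net front end and back end is especially difficult. When a network packet travels from the kernel to the virtio-net back end, or from the virtio-net back end to the kernel, the path is hard to observe. Some complex network jitter problems are likely caused by a NIC queue that is not working properly. To address this class of problems, we used eBPF to extend the observability of NIC queues, so that fault demarcation between the virtio NIC front end and back end is no longer a headache.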
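Overview of the virtio-net front-end and back-end drivers
virtio-net (referred to below as the virtio NIC) usually consists of two components: the virtio driver (also called the virtio front end) and the virtio device (also called the virtio back end). The virtio front end runs in the guest kernel, while the virtio back end can be provided by the host kernel. A virtio NIC usually supports multiple queues, including transmit queues and receive queues. Each queue is implemented with three rings: the avail ring, the used ring, and the desc ring. The following sections walk through how the virtio NIC front end transmits and receives packets, to make the overall workflow easier to understand.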
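As a rough illustration of the three rings, the sketch below mirrors the split-virtqueue layout described in the virtio specification. The sketch_* names are made up for this example and are not the kernel's own definitions. The slot helper assumes a power-of-two queue length.

```c
#include <assert.h>
#include <stdint.h>

/* One entry of the desc ring: where a buffer lives and how long it is. */
struct sketch_vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* e.g. "chained to next" or "device writes here" */
    uint16_t next;   /* index of the next descriptor in a chain */
};

/* Header of the avail ring: written by the front end (driver). */
struct sketch_vring_avail_hdr {
    uint16_t flags;
    uint16_t idx;    /* free-running count of buffers offered so far */
};

/* One entry of the used ring: written by the back end (device). */
struct sketch_vring_used_elem {
    uint32_t id;     /* head descriptor index of the completed chain */
    uint32_t len;    /* number of bytes the device wrote */
};

/* Header of the used ring: written by the back end (device). */
struct sketch_vring_used_hdr {
    uint16_t flags;
    uint16_t idx;    /* free-running count of buffers completed so far */
};

/* Map a free-running 16-bit index to a slot in a ring of queue_len entries.
 * Assumption of this sketch: queue_len is a power of two. */
static inline uint16_t sketch_ring_slot(uint16_t idx, uint16_t queue_len)
{
    assert(queue_len != 0 && (queue_len & (queue_len - 1)) == 0);
    return (uint16_t)(idx & (queue_len - 1));
}
```

Because avail->idx and used->idx are free-running counters rather than slot numbers, their difference directly gives the number of buffers in flight, which is what the metrics later in this article rely on.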
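Packet transmission in the virtio NIC front end
The main steps for transmitting a packet in the virtio NIC front end are:
a. start_xmit: the transmit entry point of the virtio NIC driver. It first cleans up packets that have already been sent by calling free_old_xmit_skbs, which releases the packets held by completed descriptors until no completed buffers remain, that is, until last_used_idx catches up with used->idx;
b. xmit_skb: adds the vnet_hdr header to the packet and represents the skb in scatter-gather form, recording the address and length of each piece of packet data;
c. virtqueue_add_outbuf: performs the DMA mapping, adds the addresses and lengths recorded in the scatter-gather list to the desc ring, and increments avail->idx;
d. virtqueue_notify: notifies the back end when the transmit queue holds data.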
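To make the index bookkeeping of the transmit path concrete, here is a toy model in C. It is only a sketch, not the driver's code: the toy_* names are hypothetical, each packet is assumed to occupy exactly one descriptor, and the reclaim loop is assumed to stop when the front end's private last_used_idx cursor reaches used_idx.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one transmit queue; all names are hypothetical. */
struct toy_txq {
    uint16_t avail_idx;      /* advanced by the front end per queued packet */
    uint16_t used_idx;       /* advanced by the back end per sent packet */
    uint16_t last_used_idx;  /* front end's cursor into the used ring */
    uint16_t queue_len;      /* number of descriptors in the ring */
};

/* Plays the role of free_old_xmit_skbs: reclaim every packet the back end
 * has finished with. Returns the number of packets reclaimed. */
static inline unsigned toy_free_old_xmit(struct toy_txq *q)
{
    unsigned freed = 0;
    while (q->last_used_idx != q->used_idx) {
        q->last_used_idx++;   /* a real driver would free the skb here */
        freed++;
    }
    return freed;
}

/* Plays the role of start_xmit: clean up, then queue one packet.
 * Returns 0 on success, -1 if the ring has no free descriptor. */
static inline int toy_start_xmit(struct toy_txq *q)
{
    assert(q->queue_len != 0);
    toy_free_old_xmit(q);
    if ((uint16_t)(q->avail_idx - q->last_used_idx) >= q->queue_len)
        return -1;            /* ring full */
    q->avail_idx++;           /* step c: publish the new buffer */
    return 0;                 /* step d would notify the back end here */
}

/* Plays the role of the back end: send up to n queued packets. */
static inline unsigned toy_backend_send(struct toy_txq *q, unsigned n)
{
    unsigned done = 0;
    while (done < n && q->used_idx != q->avail_idx) {
        q->used_idx++;
        done++;
    }
    return done;
}
```

In this model a back end that stops calling toy_backend_send leaves used_idx frozen while avail_idx keeps growing, which is the kind of symptom the observability work in this article is designed to expose.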

Packet reception in the virtio NIC front end
The main steps for receiving a packet in the virtio NIC front end are:
a. NIC hard interrupt: the hard interrupt handler adds the napi instance to the CPU's poll list, enables interrupt suppression, and raises the softirq;
b. net_rx_action: the entry point of the network receive softirq;
c. virtnet_poll: the NAPI poll callback of the virtio NIC. It first cleans up the paired transmit queue by calling virtnet_poll_cleantx, and then receives packets from the receive queue;
d. virtnet_receive: reads packet data from the descriptor ring according to used->idx and updates last_used_idx. The kernel allocates an skb for the packet data and passes it into the GRO path, where packets are merged;
e. try_fill_recv: adds empty buffers to the desc ring and increments avail->idx, so that the receive queue always has memory available;
f. virtqueue_napi_complete: when the number of packets received is smaller than the budget (usually 64), there is no more data to receive. virtqueue_napi_complete is then called to mark this NAPI round as finished, and virtqueue_enable_cb_prepare turns interrupt suppression off again.
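The budget logic in the receive steps above can be sketched as a small toy model. This is an illustration only: the toy_* names are hypothetical, GRO and buffer refill are left out, and one used entry is assumed to equal one packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one receive queue; all names are hypothetical. */
struct toy_rxq {
    uint16_t used_idx;        /* advanced by the back end per delivered packet */
    uint16_t last_used_idx;   /* front end's receive cursor */
    bool     irq_suppressed;  /* true while NAPI polling owns the queue */
};

/* Step a: the hard interrupt suppresses further interrupts and hands the
 * queue over to NAPI polling. */
static inline void toy_rx_interrupt(struct toy_rxq *q)
{
    q->irq_suppressed = true;
}

/* Steps c to f: receive at most `budget` packets. If fewer than `budget`
 * were available, the round is complete and interrupts are re-enabled. */
static inline int toy_virtnet_poll(struct toy_rxq *q, int budget)
{
    int received = 0;

    assert(budget > 0);
    while (received < budget && q->last_used_idx != q->used_idx) {
        q->last_used_idx++;   /* a real driver builds an skb here */
        received++;
    }
    if (received < budget)
        q->irq_suppressed = false;
    return received;
}
```

With 100 packets pending and a budget of 64, the first poll in this model returns 64 and stays in polling mode, and the second returns 36 and re-enables interrupts.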

NIC queue observability
The analysis above identifies several important parameters of a virtio NIC queue: avail->idx, used->idx, and last_used_idx. With these parameters we can tell how many packets a NIC queue currently holds, and derive the following observability metrics:
a. Transmit queue packet count: the number of packets not yet sent by the virtio NIC back end, computed as avail->idx - used->idx;
b. Receive queue packet count: the number of packets not yet received by the virtio NIC front end, computed as used->idx - last_used_idx;
c. last_used_idx of the NIC queue: indicates how far the virtio NIC back end has progressed in processing packets;
d. Queue saturation: the current usage of the NIC queue, computed as queue packet count / queue length.
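The metric formulas above can be written down as a short C sketch. The struct and function names are hypothetical. One detail is worth making explicit: the ring indices are 16-bit counters that wrap around, so the subtraction has to be done in 16-bit unsigned arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* One snapshot of a queue's indices; field names are hypothetical. */
struct queue_sample {
    uint16_t avail_idx;
    uint16_t used_idx;
    uint16_t last_used_idx;
    uint16_t queue_len;
};

/* Packets handed to the back end but not yet sent. The cast keeps the
 * result correct when avail_idx has wrapped past 65535 and used_idx has not. */
static inline uint16_t tx_pending(const struct queue_sample *s)
{
    return (uint16_t)(s->avail_idx - s->used_idx);
}

/* Packets delivered by the back end but not yet picked up by the front end. */
static inline uint16_t rx_pending(const struct queue_sample *s)
{
    return (uint16_t)(s->used_idx - s->last_used_idx);
}

/* Queue saturation as a percentage of the queue length. */
static inline double saturation_pct(uint16_t pending, uint16_t queue_len)
{
    assert(queue_len != 0);
    return 100.0 * pending / queue_len;
}
```

For example, with avail_idx = 2 and used_idx = 65534, a plain integer subtraction would give a large negative number, while the 16-bit result is the correct 4 pending packets.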
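We integrated the observability code into rtrace, a network diagnosis and analysis tool in SysAK, the system toolset released by the OpenAnolis community. How rtrace itself works will be covered in a later article. The eBPF code is available at: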
https://gitee.com/anolis/sysak/blob/opensource_branch_sync/source/tools/detect/net/rtrace/src/bpf/virtio.bpf.c
The main steps for collecting virtio NIC queue metrics are:
a. rtrace attaches the eBPF collection programs to the kernel functions dev_id_show and dev_port_show;
b. rtrace periodically reads the two files /sys/class/net/[interface]/dev_id and /sys/class/net/[interface]/dev_port. Reading dev_id requests transmit queue information, and reading dev_port requests receive queue information;
c. Reading these files causes the kernel to execute dev_id_show and dev_port_show. Because the eBPF collection programs are attached, the kernel runs them first;
d. The eBPF collection program parses the arguments of dev_id_show and dev_port_show to obtain the struct net_device, locates the NIC queue's vring from it, and then extracts avail idx, used idx, the queue length, and last_used_idx from the vring;
e. The data is sent to rtrace for further processing.
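The user-space half of this flow, step b, amounts to reading a sysfs file whose content is thrown away. A minimal sketch follows. The sketch_* function names are hypothetical and this is not rtrace's actual implementation. The read only has an observable effect when an eBPF program is attached to dev_id_show or dev_port_show.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/net/<ifname>/<attr>" into buf.
 * Returns 0 on success, -1 on a bad interface name or a too-small buffer. */
static inline int sketch_sysfs_path(char *buf, size_t size,
                                    const char *ifname, const char *attr)
{
    assert(buf != NULL && ifname != NULL && attr != NULL);
    if (ifname[0] == '\0' || strchr(ifname, '/') != NULL)
        return -1;
    int n = snprintf(buf, size, "/sys/class/net/%s/%s", ifname, attr);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Read the attribute once so that the kernel runs its show function.
 * attr is "dev_id" (transmit queues) or "dev_port" (receive queues).
 * The file content itself is discarded. Returns 0 on success, -1 on error. */
static inline int sketch_trigger(const char *ifname, const char *attr)
{
    char path[256];
    char line[64];

    if (sketch_sysfs_path(path, sizeof path, ifname, attr) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fgets(line, sizeof line, f) != NULL;
    fclose(f);
    return ok ? 0 : -1;
}
```

Using an existing sysfs attribute as the trigger means the collector needs no custom kernel interface: a periodic loop calling sketch_trigger("eth0", "dev_id") and sketch_trigger("eth0", "dev_port") would be enough to drive sampling, where eth0 is a placeholder interface name.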

Fault detection
The NIC queue information collected by rtrace is listed at the end of this section.
At 0926, the saturation and last_used_idx of transmit queue 1 are 0.05%/3593. At 0928 they are 0.07%/3593. The saturation of the transmit queue is rising while last_used_idx stays unchanged across several collection intervals, so we can conclude that transmit queue 1 has failed.
After we fixed the fault on transmit queue 1, the 0906 sample shows its saturation and last_used_idx as 0.00%/3599: no packets linger in the queue any more, and it is back to normal.
Each entry is saturation/last_used_idx, one entry per queue, starting with queue 1.

0924
SendQueue 0.05%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/410 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2443 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/148 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/88 0.00%/85 0.00%/52
RecvQueue 0.00%/2805 0.00%/13297 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/200 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8796 0.00%/117 0.00%/301 0.00%/275
0925
SendQueue 0.05%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/410 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2444 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/148 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/89 0.00%/85 0.00%/52
RecvQueue 0.00%/2805 0.00%/13297 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/200 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8796 0.00%/117 0.00%/303 0.00%/275
0926
SendQueue 0.05%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/410 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2444 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/148 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/91 0.00%/85 0.00%/52
RecvQueue 0.00%/2805 0.00%/13297 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/200 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8796 0.00%/117 0.00%/305 0.00%/275
0927
SendQueue 0.07%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/410 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2444 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/148 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/93 0.00%/85 0.00%/52
RecvQueue 0.00%/2805 0.00%/13298 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/200 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8796 0.00%/117 0.00%/307 0.00%/275
0928
SendQueue 0.07%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/414 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2445 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/149 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/96 0.00%/87 0.00%/52
RecvQueue 0.00%/2805 0.00%/13298 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/205 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8797 0.00%/118 0.00%/309 0.00%/275
0929
SendQueue 0.07%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/414 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2445 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/149 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/98 0.00%/87 0.00%/52
RecvQueue 0.00%/2805 0.00%/13298 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/205 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8797 0.00%/118 0.00%/311 0.00%/275
0930
SendQueue 0.07%/3593 0.00%/852 0.00%/4506 0.00%/1600 0.00%/457 0.00%/509 0.00%/3140 0.00%/1352 0.00%/386 0.00%/414 0.00%/1714 0.00%/1758 0.00%/1619 0.00%/446 0.00%/3577 0.00%/2445 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/146 0.00%/149 0.00%/226 0.00%/64 0.00%/109 0.00%/84 0.00%/78 0.00%/56 0.00%/87 0.00%/100 0.00%/87 0.00%/52
RecvQueue 0.00%/2805 0.00%/13298 0.00%/475 0.00%/367 0.00%/12378 0.00%/130 0.00%/222 0.00%/11120 0.00%/355 0.00%/3016 0.00%/133 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/650 0.00%/151 0.00%/505 0.00%/5180 0.00%/205 0.00%/26670 0.00%/169 0.00%/1042 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/229 0.00%/1402 0.00%/8797 0.00%/118 0.00%/313 0.00%/275
// ... omitted
0906
SendQueue 0.00%/3599 0.00%/856 0.00%/4511 0.00%/1602 0.00%/465 0.00%/510 0.00%/3140 0.00%/1352 0.00%/386 0.00%/420 0.00%/1716 0.00%/1766 0.00%/1619 0.00%/448 0.00%/3578 0.00%/2451 0.00%/46 0.00%/94 0.00%/212 0.00%/231 0.00%/148 0.00%/149 0.00%/226 0.00%/64 0.00%/109 0.00%/85 0.00%/87 0.00%/56 0.00%/87 0.00%/101 0.00%/103 0.00%/52
RecvQueue 0.00%/2807 0.00%/13299 0.00%/477 0.00%/369 0.00%/12378 0.00%/140 0.00%/223 0.00%/11120 0.00%/355 0.00%/3032 0.00%/142 0.00%/180 0.00%/12980 0.00%/10363 0.00%/2825 0.00%/652 0.00%/151 0.00%/505 0.00%/5180 0.00%/205 0.00%/26670 0.00%/170 0.00%/1057 0.00%/9820 0.00%/9586 0.00%/3374 0.00%/230 0.00%/1414 0.00%/8800 0.00%/118 0.00%/327 0.00%/275
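The manual diagnosis in this section follows a simple rule: a transmit queue that keeps holding packets while its last_used_idx does not move across several samples is suspect. A minimal sketch of that rule in C follows. The names are hypothetical, and the choice of how many samples to require is an assumption left to the caller. It is not part of rtrace as far as this article states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One collected sample for a single transmit queue. */
struct tx_sample {
    double   saturation_pct;  /* queue packet count / queue length, in % */
    uint16_t last_used_idx;
};

/* Returns true when, over all n samples, the queue always held packets and
 * last_used_idx never changed. Fewer than two samples cannot show a stall. */
static inline bool tx_queue_stalled(const struct tx_sample *s, size_t n)
{
    assert(s != NULL);
    if (n < 2)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (s[i].saturation_pct <= 0.0)
            return false;     /* queue drained: the back end is working */
        if (s[i].last_used_idx != s[0].last_used_idx)
            return false;     /* progress was made between samples */
    }
    return true;
}
```

Fed with the queue 1 values from the output above (0.05%/3593, 0.07%/3593, 0.07%/3593), the function reports a stall. Once the 0.00%/3599 sample is included, it no longer does.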
Summary
In a virtio NIC, the front end and back end communicate through shared NIC queues. By observing metrics such as avail idx, used idx, and last_used_idx, we can see the state of each NIC queue and evaluate and optimize the performance of the virtio NIC. These metrics also give a deeper understanding of queue state, which helps with troubleshooting and performance tuning.
Original title: eBPF in Practice: virtio-net NIC Queue Observability
Source: WeChat official account Linux阅码场 (WeChat ID: LinuxDev). Please credit the source when reposting.