Troubleshooting kubernetes - 10s delay
2022-04-22 02:43:00 【East wind whistling】
Symptom
The first time a user goes from the single sign-on system (hereafter CAS) to the requirements system, it takes 10s before the requirements system page loads.
Background
1. The single sign-on system runs in the company's traditional environment;
2. The requirements system runs on the company's K8S container platform (OpenShift);
3. The K8S container platform runs in the company's private cloud environment (the network is an SDN);
4. ...
Analysis
The problem is complicated, mainly because of the complexity of the network architecture. As described above, the interaction between the two systems involves:
1. The traditional network architecture
2. The private cloud SDN network architecture
3. The OpenShift OVS (Open vSwitch) SDN network running on top of the private cloud
A simple example: if container A accesses an external system, the traffic should flow like this:
eth0 (container A's NIC) → vethA → br0 → tun0 → (NAT) → (private cloud network) → (traditional network F5 → the relevant virtual machine on the traditional network)
Notes:
br0: the OVS bridge device that pod containers are attached to. OpenShift SDN also configures a set of non-subnet-specific flow rules on this bridge.
tun0: an OVS internal port (port 2 on br0). It is assigned the cluster subnet gateway address and is used for external network access. OpenShift SDN configures netfilter and routing rules so that the cluster subnet can reach external networks through NAT.
NAT: network address translation.
I am not familiar with the private cloud network and the traditional network behind it, so they are not described in detail here; in practice that part of the path contains many network nodes.
The first thing to do was to narrow the scope of suspicion as much as possible, which makes further pinpointing easier.
The specific steps were as follows:
Stage 1: Packet capture and analysis
A preliminary analysis led to the following conclusions:
1. The user clicks the requirements system link on the CAS page and is redirected to the requirements system with a Ticket:
"GET http://itweb.cloud.example.com.cn/login.jsp?ticket=XX-1144737-F6gZZyxhe0IfKxBJS4zjuf9Csz4-cas2 HTTP/1.1"
2. We first looked at the request in Chrome's developer tools (F12). The request above took 10s, and all 10s were spent in Waiting (TTFB) (screenshot in the original post).
3. The request above is processed inside the requirements system, and we had learned that the problem did not exist before the application was moved into containers, so an application bug in the requirements system could basically be ruled out. The next step was to capture network packets in the requirements system's APP container (hereafter pod A) to determine whether there was a network problem.
4. We used the tcpdump command to capture all traffic on pod A's network interface and reproduced the problem while the capture was running.
5. We opened the capture in Wireshark. Since we knew that the request above was the problematic one, we followed the TCP stream for that request (screenshot in the original post).
6. The timestamps highlighted in red in that screenshot show it clearly: after receiving the request above, pod A (IP: 10.131.0.244) returned HTTP code 302 and then performed a TCP three-way handshake. Then something went wrong: the PSH+ACK was sent only 10s later!
7. At this point, based on the source IP (pod A) and the destination IP (which can be understood as the IP of tun0 on the K8S Ingress), the preliminary judgment was that there was a 10s delay between the container and the Ingress.
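Spotting a stall like this by eye works for one TCP stream, but the same check can be scripted once the packet timestamps have been exported from the capture. The following is a minimal Python sketch rather than the tooling used in this investigation: the function name `largest_gap` and the sample timestamps are made up for illustration and are not taken from the real capture.

```python
from typing import List, Tuple


def largest_gap(timestamps: List[float]) -> Tuple[int, float]:
    """Return (index, gap) for the largest pause between consecutive packets.

    `index` is the position of the packet that precedes the pause, and
    `gap` is the time in seconds until the next packet.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two packet timestamps")
    # Pair each packet with its successor and measure the pause between them.
    gaps = [
        (later - earlier, i)
        for i, (earlier, later) in enumerate(zip(timestamps, timestamps[1:]))
    ]
    gap, index = max(gaps)
    return index, gap


# Hypothetical relative timestamps (seconds), NOT from the actual capture.
sample = [0.000, 0.002, 0.003, 10.010, 10.012]
index, gap = largest_gap(sample)
print(f"largest gap: {gap:.3f}s after packet {index}")
```

With real data, the list would come from the time column of the capture; a gap close to a round number such as 5s or 10s is a hint that a timeout, rather than slow processing, is involved.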
Note: I will write a separate article on capturing packets on a K8S network: "3 Ways to Capture Packets on a K8S Network". Coming soon.
Stage 2: Refining the analysis and ruling things out
The previous stage led to the preliminary judgment that there was a 10s delay between the container and the Ingress.
The plan was then to analyze the network between the participants in more depth. We drew up 2 plans for this:
1. Elimination. Pod A and the Ingress are not on the same virtual machine, and traffic between the two virtual machines crosses both the OpenShift OVS network and the private cloud SDN network. So we hoped to schedule pod A onto the Ingress host and observe the result in order to rule things out.
a. If the problem persists when both are on the same host, the private cloud SDN can be ruled out;
b. If the problem disappears after rescheduling, the cause could be the OpenShift OVS network or the private cloud SDN network.
2. Detailed analysis. So far we had only captured packets on pod A. The follow-up plan was to capture on every network node involved, including:
a. pod A
b. pod A's host
c. The network devices of the private cloud SDN
d. The Ingress host
e. The Ingress pod
At this stage, however, we ran into various difficulties, and neither of the 2 plans was carried out in the end.
So we went back to the earlier capture, hoping to extract more detail from it, and approached it from 2 directions:
1. We asked colleagues in the network team to look at the capture. Their feedback: the problem is basically in pod A, and there is no need to capture anywhere else.
2. We contacted colleagues in the requirements system and CAS project teams to learn more about the business flow, in the hope of getting more detail.
Both directions paid off.
Stage 3: Mapping out the business flow
Note: the focus is on the flow for the first login to the requirements system. Some users visit CAS first, log in to CAS, and then jump to the requirements system from CAS. Some visit CAS first without logging in, jump to the requirements system from CAS, and are then sent back to log in before entering the requirements system. Some access the requirements system directly, are redirected to single sign-on, and enter the requirements system after logging in and being authenticated. These three cases are essentially the same: the requirements system has to interact with CAS. The most common flow is described below.
1. The user visits and logs in to the single sign-on system;
a. At this point the user obtains a Ticket, in a format like the following:
XX-1144737-F6gZZyxhe0IfKxBJS4zjuf9Csz4-<instancename>
b. During this step the user does not access the requirements system
2. The user clicks the requirements system link on the CAS page and is redirected to the requirements system with the Ticket:
"GET http://itweb.cloud.example.com.cn/login.jsp?ticket=XX-1144737-F6gZZyxhe0IfKxBJS4zjuf9Csz4-cas2 HTTP/1.1"
3. On receiving the request, the requirements system calls back to CAS to validate the Ticket:
"GET http://10.1.XX.XX:XXXX/casserver/serviceValidate?hostnameVerifier=org.jasig.cas.client.ssl.AnyHostnameVerifier&ticket=XX-1144737-F6gZZyxhe0IfKxBJS4zjuf9Csz4-cas2&encoding=UTF-8&service=http%3A%2F%2Fitweb.cloud.example.com.cn%2F"
4. CAS validates the Ticket and returns the result to the requirements system:
a. Validation passes: login proceeds normally and the user enters the requirements system main page;
b. Validation fails: the user is told that they do not have access to the system.
The project team colleagues stressed that, according to their logs, it is step 3 that is slow by 10s.
Stage 4: Pinpointing the cause
Talking to the network team and the project team was very helpful:
- Network team: the problem is basically in pod A
- Project team: it is step 3 (the requirements system calling back to CAS for validation) that is slow by 10s.
We decided to go through the previously captured packets again and look closely at what happened on the network after the request from CAS arrived. This time we finally found a clue!
Instead of focusing on a single TCP stream, we looked at everything that happened after the request was received.
1. After receiving the request, pod A needs to access CAS at 10.1.XX.XX. Note that it does not access the CAS IP directly; it first performs a reverse DNS lookup!
(Screenshot in the original post.)
2. For the first reverse DNS lookup, the DNS server returned nothing and the query timed out after 5s. (The flow is fairly long, so there is no screenshot; in any case, no reply from the DNS server appeared in the following 5s.) Then the second reverse DNS lookup started (screenshot in the original post).
3. After two reverse DNS lookups failed, there was no third lookup; pod A accessed the IP directly and got the result quickly. That is why there was a 10s wait every time before entering the system (screenshot in the original post).
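To confirm from inside a pod that reverse lookups are what stalls, the lookup can be timed on its own. The following is a minimal sketch, assuming a Python interpreter is available in the container. `time_reverse_lookup` is a hypothetical helper, and its `resolver` parameter exists only so the function can be exercised without a network. It goes through the system resolver via `socket.gethostbyaddr`, which is not necessarily the same code path the application takes.

```python
import socket
import time
from typing import Callable, Optional, Tuple


def time_reverse_lookup(
    ip: str,
    resolver: Callable = socket.gethostbyaddr,
) -> Tuple[Optional[str], float]:
    """Run one reverse (PTR) lookup and return (hostname or None, seconds taken)."""
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed
```

Calling `time_reverse_lookup("10.1.2.3")` (a placeholder address, since the real CAS IP is masked in this article) returns the resolved name, or `None` on failure, together with the elapsed time. An elapsed time of several seconds with a `None` result would be consistent with the timeouts seen in the capture.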
Root cause
The requirements system's request to the single sign-on system at 10.1.XX.XX:XXXX took 10s because a reverse DNS lookup is performed for that request. The lookup was attempted 2 times and neither attempt succeeded, and each DNS query times out after 5s, so 2 × 5s = 10s. After that no further lookup is attempted and the IP is accessed directly. That is why there was a 10s wait every time.
Note: strictly speaking, it is not the requirements system itself that performs the reverse DNS lookup. The system is deployed on the WebLogic middleware, and it is WebLogic that performs the reverse DNS lookup. I know this because a life insurance company once had a DNS incident in production, and during that investigation I found a massive number of reverse DNS lookups coming from WebLogic. If I have time, I may write another article about it.
Solution
1. We tried adding a parameter that disables reverse lookups to the requirements system's startup options. In testing it did not take effect.
2. Every OpenShift node runs a dnsmasq process that handles DNS inside the cluster. We added the following configuration to dnsmasq on the worker nodes and restarted it, so that the reverse DNS lookup succeeds. With that, the problem was solved.
ptr-record=XX.XX.1.10.in-addr.arpa, 10.1.XX.XX
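The name on the left of that line is the IP address with its octets reversed under `in-addr.arpa`, which is easy to get wrong by hand. The following is a small Python sketch that derives it with the standard library's `ipaddress` module. `ptr_record_line` is a hypothetical helper, the address `10.1.2.3` is a placeholder (the real CAS IP is masked in this article), and the output simply mirrors the form of the line above, with the IP as the target.

```python
import ipaddress


def ptr_record_line(ip: str) -> str:
    """Build a dnsmasq ptr-record line in the same form as the one above."""
    # reverse_pointer yields e.g. "3.2.1.10.in-addr.arpa" for 10.1.2.3.
    name = ipaddress.ip_address(ip).reverse_pointer
    return f"ptr-record={name},{ip}"


# Placeholder address, not the real CAS server.
print(ptr_record_line("10.1.2.3"))
```

After restarting dnsmasq, a reverse lookup for the address from inside a pod should return promptly instead of timing out.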
Summary
With that, we had fully worked through the 10s delay on the first login to the requirements system. The analysis involved a large amount of information and many components. In the end, though, the root cause had nothing to do with new technologies such as the K8S network or the private cloud network. It was the same old kind of problem.
- Forcing traditional software into containers is strongly discouraged; there is more than one pitfall;
- After adopting K8S, network complexity increases significantly, and network analysis skills are essential for troubleshooting K8S problems in production;
- When analyzing a problem, remember that there is always something to learn from others: talk to colleagues in other teams, think broadly, and avoid walking into a dead end.
Copyright notice
This article was written by [East wind whistling]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204211408404333.html