
Invisible OOM in Kubernetes

2022-08-09 10:45:00 charlieroro

I recently read the article Tracking Down “Invisible” OOM Kills in Kubernetes. It describes a case where a process inside a Pod was killed due to insufficient memory, yet the Pod did not restart, and there were no logs or Kubernetes events, only an "Exit Code: 137", which made further troubleshooting difficult. The author finally found the following entries in the node's system log:

kernel: Memory cgroup out of memory: Killed process 18661 (helm) total-vm:748664kB, anon-rss:41748kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:244kB oom_score_adj:992
kernel: oom_reaper: reaped process 18661 (helm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The author of that article summed up the behavior as follows:

When the Linux OOM Killer activated, it selected a process within the container to be killed. Apparently only when OOM selects the container's init process PID 1 will the container itself be killed and have a status of OOMKilled. If that container was marked as restartable, then Kubernetes would restart the container and then you would see an increase in restart count.

As I've seen, when PID 1 is not selected then some other process inside the container is killed. This makes it “invisible” to Kubernetes. If you are not watching the tty console device or scanning kernel logs, you may not know that part of your containers are being killed. Something to consider when you enable container memory limits.

In short, the OOMKilled status appears only when PID 1 of the container is killed by the OOM killer; in that case the container is restarted and the OOM information is clearly visible.

In the problem case, however, the killed process is not PID 1, so neither the container runtime nor Kubernetes records any relevant information and the container is not restarted. The only way to find out what happened is to check the node's system log.

The article also suggests a mitigation for this problem: VPA (Vertical Pod Autoscaler).
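For reference, here is a minimal sketch of what that could look like, assuming the VPA components are installed in the cluster; the names my-app-vpa and my-app are placeholders:

  apiVersion: autoscaling.k8s.io/v1
  kind: VerticalPodAutoscaler
  metadata:
    name: my-app-vpa
  spec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-app          # the workload whose resources VPA should manage
    updatePolicy:
      updateMode: "Auto"    # VPA may evict Pods to apply its recommendations

With updateMode set to Auto, VPA keeps the containers' memory requests in line with observed usage, which reduces the chance of hitting the memory limit and triggering the OOM killer in the first place.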

PS

I have encountered a similar problem before. The only clue was an "Exit Code: 137" message; the Pod appeared to be running normally and there were no error logs or events, but in fact a process inside the Pod had been killed and could no longer perform its work.

The reason for the "hidden OOM" may be that multiple independent processes are started separately in the Pod (there is no parent-child relationship between the processes). In my scenario, a script process is started alone. When the memory is insufficientwill cause the kill script process.Therefore, another solution is that if you want to start multiple independent processes, you can also use it as a sidecar to avoid this problem.


Copyright notice
This article was written by [charlieroro]. If you repost it, please include a link to the original. Thanks.
https://yzsam.com/2022/221/202208091043257780.html