Just Filled YARN's Virtual Memory OOM Pit, Only to Find Another Pit in the Way

I really must be heaven's favorite: on my watch, YARN's virtual memory actually blew up. Yes, it blew up. This is only a test cluster, the data volume isn't big, and the memory requested per run isn't big either, yet it still blew up, the virtual memory blew up. Let's walk through the case.

Case Replay

Here is how it happened:

Due to project needs, YARN's original scheduling mode, the Capacity Scheduler, was not a good fit for the current project, so it had to be swapped for another one: the Fair Scheduler. The configured result is shown in the figure below:

[Figure: Yarn Fair Scheduler]

Which shows my configuration is fine.
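For reference, the switch itself comes down to a couple of entries in yarn-site.xml, roughly like the following. This is only a sketch, not my exact file, and the allocation-file path is an assumption based on where this cluster lives:

<code><!-- Switch the ResourceManager from the Capacity Scheduler to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- Optional: point to a fair-scheduler allocation file (path is hypothetical) -->
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/fair-scheduler.xml</value>
</property></code>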

Now it's time to bring up my Flink cluster on YARN:

<code># -n: number of TaskManager containers, -s: task slots per TaskManager,
# -jm / -tm: JobManager / TaskManager memory in MB, -nm: application name, -d: detached mode
./yarn-session.sh \
    -n 3 \
    -s 6 \
    -jm 256 \
    -tm 1024 \
    -nm "flink on yarn" \
    -d</code>

And then the problem appeared:

[Figure: ERROR]

What this means is that the Flink cluster deployment took longer than 60 seconds, and it asks us to check whether the resources we requested are actually available in the YARN cluster. In other words: your YARN cluster is down, go figure out why yourself. And figuring out why means going to the log files. So we locate the logs and search them for the relevant information:

<code>325.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:37,410 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 209.6 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,419 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.3 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,427 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.0 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,450 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,481 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.1 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,526 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,586 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,607 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.6 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs
2020-04-01 15:25:53,040 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs
</code>

I found the following passage; for easier reading, here is a screenshot:


[Figure: Yarn OOM]

What this says is: one of our containers, container_1585725830038_0003_02_000001, used 336.4 MB of its 1 GB of physical memory, and 2.3 GB of its 2.1 GB of virtual memory (the notation here is actual usage / total).

It's obvious at a glance that the virtual memory number is wrong: I only have 2.1 GB, so where did 2.3 GB come from? Isn't that exactly an OOM?
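Where does that 2.1 GB ceiling come from? It isn't something I configured directly: YARN takes the container's physical allocation, 1 GB here, and multiplies it by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1, so 1 GB × 2.1 = 2.1 GB of virtual memory is allowed. The process tree is using 2.3 GB, which exceeds that, so the NodeManager's virtual memory check (see the parameters below) flags and kills the container.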

YARN's Virtual Memory

Regarding YARN's virtual memory, the official defaults include these configuration parameters:

<code><property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
  <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.</description>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
  <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.</description>
</property>

<property>
  <name>yarn.nodemanager.elastic-memory-control.enabled</name>
  <value>false</value>
  <description>Enable elastic memory control. This is a Linux only feature. When enabled, the node manager adds a listener to receive an event, if all the containers exceeded a limit. The limit is specified by yarn.nodemanager.resource.memory-mb. If this is not set, the limit is set based on the capabilities. See yarn.nodemanager.resource.detect-hardware-capabilities for details. The limit applies to the physical or virtual (rss+swap) memory depending on whether yarn.nodemanager.pmem-check-enabled or yarn.nodemanager.vmem-check-enabled is set.</description>
</property></code>

What Is Virtual Memory?

Loosely speaking, virtual memory is disk space that gets conscripted to serve as memory. What YARN actually measures here, though, is the virtual address space of the container's whole process tree, which is why a JVM can blow past the limit even when physical memory looks perfectly healthy.

<code># Check a process's virtual memory usage
pmap -x <pid></code>
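If pmap isn't at hand, the same numbers can be read straight from /proc. A quick sketch, where <pid> is just a placeholder for the container's process id:

<code># VmSize = virtual memory, VmRSS = resident (physical) memory, VmSwap = swapped-out memory
grep -E 'VmSize|VmRSS|VmSwap' /proc/<pid>/status</code>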

Solution

The memory is too small, so we simply increase it. Whether that works, who knows; only by trying will the real cause show itself. But do you think the problem is solved at this point? I'm telling you: not a chance. The main problem isn't here at all; this is just a small pit inside a much bigger one. The OOM is fixed, which only means I've just filled one pit and discovered that the road ahead is nothing but pits...
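"Increase it" can mean either giving the containers more physical memory (so the derived virtual limit grows with it) or adjusting the YARN parameters listed above. A minimal yarn-site.xml sketch of the latter, where the ratio value of 4 is only an example, and alternatively the virtual memory check can be switched off entirely:

<code><!-- Option 1: allow more virtual memory per unit of physical memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>

<!-- Option 2: stop enforcing the virtual memory limit altogether -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property></code>

Either change requires restarting the NodeManagers before it takes effect.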

