https://www.elastic.co/guide/en/elasticsearch/guide/2.x/heap-sizing.html
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_jvm_section
1.
"jvm": { "timestamp": 1408556438203, "uptime_in_millis": 14457, "mem": { "heap_used_in_bytes": 457252160, "heap_used_percent": 44, "heap_committed_in_bytes": 1038876672, "heap_max_in_bytes": 1038876672, "non_heap_used_in_bytes": 38680680, "non_heap_committed_in_bytes": 38993920,
The jvm section first lists some general stats about heap memory usage. You can see how much of the heap is being used, how much is committed (actually allocated to the process), and the maximum size the heap is allowed to grow to. Ideally, heap_committed_in_bytes should be identical to heap_max_in_bytes. If the committed size is smaller, the JVM will eventually have to resize the heap, and that is a very expensive process. If your numbers are not identical, see the heap sizing guide (first link above) for how to configure it correctly.
The heap_used_percent metric is a useful number to keep an eye on. Elasticsearch is configured to initiate GCs when the heap reaches 75% full. If your node is consistently >= 75%, it is experiencing memory pressure; this is a warning sign that slow GCs may be in your near future. If heap usage is consistently >= 85%, you are in trouble. Heaps over 90–95% are at risk of horrible performance, with long 10–30s GCs at best and out-of-memory (OOM) exceptions at worst.
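To make this concrete, the same numbers can be pulled from the node stats API and checked programmatically. A minimal sketch in Python (standard library only), assuming a node reachable at http://localhost:9200:

```python
import json
import urllib.request

# Assumed local node; adjust host/port for your cluster.
URL = "http://localhost:9200/_nodes/stats/jvm"

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    used_pct = mem["heap_used_percent"]
    committed = mem["heap_committed_in_bytes"]
    max_bytes = mem["heap_max_in_bytes"]

    # Committed should equal max; otherwise the JVM may resize the heap later.
    if committed != max_bytes:
        print(f"{node_id}: heap_committed != heap_max ({committed} vs {max_bytes})")

    # 75% is the level at which Elasticsearch starts triggering GCs;
    # sustained usage at or above it signals memory pressure.
    if used_pct >= 75:
        print(f"{node_id}: heap_used_percent is {used_pct}% (memory pressure)")
```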
2.
"gc": { "collectors": { "young": { "collection_count": 13, "collection_time_in_millis": 923 }, "old": { "collection_count": 0, "collection_time_in_millis": 0 } }}
The old generation collection count, in contrast to the normally frequent young-generation collections, should remain small and have a small collection_time_in_millis. These counts are cumulative, accumulated from the time the node's JVM started, so it is hard to give an exact number for when you should start worrying (for example, a node with a one-year uptime will have a large count even if it is healthy). This is one of the reasons that tools such as Marvel are so helpful: GC counts over time are the important consideration.
Time spent GC’ing is also important. For example, a certain amount of garbage is generated while indexing documents. This is normal and causes a GC every now and then. These GCs are almost always fast and have little effect on the node: young generation takes a millisecond or two, and old generation takes a few hundred milliseconds. This is much different from 10-second GCs.
Our best advice is to collect collection counts and durations periodically (or use Marvel) and keep an eye out for frequent GCs. You can also enable slow-GC logging in Elasticsearch's logging configuration.
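Because the counters above are cumulative, what matters is how much they grow between samples, not their absolute value. A rough polling sketch, assuming the same local node stats endpoint as above:

```python
import json
import time
import urllib.request

URL = "http://localhost:9200/_nodes/stats/jvm"  # assumed local node
INTERVAL = 60  # seconds between samples

def old_gen_totals():
    """Sum cumulative old-gen GC count and time across all nodes."""
    with urllib.request.urlopen(URL) as resp:
        stats = json.load(resp)
    count = time_ms = 0
    for node in stats["nodes"].values():
        old = node["jvm"]["gc"]["collectors"]["old"]
        count += old["collection_count"]
        time_ms += old["collection_time_in_millis"]
    return count, time_ms

prev_count, prev_time = old_gen_totals()
while True:
    time.sleep(INTERVAL)
    count, time_ms = old_gen_totals()
    # The deltas over the sampling interval are what matter,
    # not the absolute cumulative values.
    print(f"old-gen GCs in last {INTERVAL}s: {count - prev_count}, "
          f"time spent: {time_ms - prev_time} ms")
    prev_count, prev_time = count, time_ms
```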
3.
It is much better to handle queuing in your application by gracefully responding to the back pressure from a full queue. When you receive bulk rejections, you should take these steps (a retry sketch follows the list below):
- Pause the import thread for 3–5 seconds.
- Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
- Send a new bulk request with just the rejected actions.
- Repeat from step 1 if rejections are encountered again.
Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off.
Rejections are not errors: they just mean you should try again later.
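A minimal sketch of that retry loop against the raw _bulk endpoint (standard library only; the index name and documents are hypothetical, and it assumes queue rejections show up as HTTP status 429 in the per-item results):

```python
import json
import time
import urllib.request

BULK_URL = "http://localhost:9200/_bulk"  # assumed local cluster

def send_bulk(actions):
    """actions: list of (metadata, source) pairs for index operations."""
    body = ""
    for meta, source in actions:
        body += json.dumps(meta) + "\n" + json.dumps(source) + "\n"
    req = urllib.request.Request(
        BULK_URL, data=body.encode("utf-8"), method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def bulk_with_backoff(actions, pause=5):
    while actions:
        response = send_bulk(actions)
        retry = []
        # Items come back in the same order as the request.
        for (meta, source), item in zip(actions, response["items"]):
            result = next(iter(item.values()))  # e.g. the "index" sub-object
            # 429 is assumed to mark a full-queue rejection; other failures
            # (e.g. mapping errors) should not be retried blindly.
            if result.get("status") == 429:
                retry.append((meta, source))
        if retry:
            # Pause, then resend only the rejected actions (steps 1-4 above).
            time.sleep(pause)
        actions = retry

# Example usage with two hypothetical documents:
docs = [
    ({"index": {"_index": "myindex", "_type": "doc", "_id": str(i)}}, {"value": i})
    for i in range(2)
]
bulk_with_backoff(docs)
```

Real import code would also cap the number of retries and log any actions that keep failing, rather than looping forever.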
There are a dozen threadpools. Most you can safely ignore, but a few are good to keep an eye on:
- indexing: Threadpool for normal indexing requests
- bulk: Bulk requests, which are distinct from the nonbulk indexing requests
- get: Get-by-ID operations
- search: All search and query requests
- merging: Threadpool dedicated to managing Lucene merges
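One way to keep an eye on these pools is the thread_pool section of the node stats API, whose per-pool rejected counter shows when a queue has filled up. A small sketch, assuming the same local node (pool key names can differ slightly by version, e.g. the indexing pool appears as "index"):

```python
import json
import urllib.request

URL = "http://localhost:9200/_nodes/stats/thread_pool"  # assumed local node
WATCHED = ("index", "bulk", "get", "search")  # pool names as reported by node stats

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    for pool, data in node["thread_pool"].items():
        if pool in WATCHED and data.get("rejected", 0) > 0:
            # A growing "rejected" counter means the pool's queue filled up
            # and requests were turned away.
            print(f"{node_id}: {pool} pool rejected {data['rejected']} requests "
                  f"(queue={data['queue']}, active={data['active']})")
```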