https://www.elastic.co/guide/en/elasticsearch/guide/2.x/heap-sizing.html
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_jvm_section
1.
"jvm": { "timestamp": 1408556438203, "uptime_in_millis": 14457, "mem": { "heap_used_in_bytes": 457252160, "heap_used_percent": 44, "heap_committed_in_bytes": 1038876672, "heap_max_in_bytes": 1038876672, "non_heap_used_in_bytes": 38680680, "non_heap_committed_in_bytes": 38993920,
The jvm section first lists some general stats about heap memory usage. You can see how much of the heap is being used, how much is committed (actually allocated to the process), and the maximum size the heap is allowed to grow to. Ideally, heap_committed_in_bytes should be identical to heap_max_in_bytes. If the committed size is smaller, the JVM will eventually have to resize the heap, and that is a very expensive process. If your numbers are not identical, see the heap sizing guide (first link above) for how to configure it correctly.
The heap_used_percent metric is a useful number to keep an eye on. Elasticsearch is configured to initiate GCs when the heap reaches 75% full. If your node is consistently >= 75%, it is experiencing memory pressure; this is a warning sign that slow GCs may be in your near future. If heap usage is consistently >= 85%, you are in trouble. Heaps over 90–95% are at risk of horrible performance, with long 10–30s GCs at best and out-of-memory (OOM) exceptions at worst.
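To make this concrete, the same numbers can be pulled from the node stats API and checked programmatically. A minimal sketch in Python (standard library only), assuming a node reachable at http://localhost:9200:

```python
import json
import urllib.request

# Assumed local node; adjust host/port for your cluster.
URL = "http://localhost:9200/_nodes/stats/jvm"

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    used_pct = mem["heap_used_percent"]
    committed = mem["heap_committed_in_bytes"]
    max_bytes = mem["heap_max_in_bytes"]

    # Committed should equal max; otherwise the JVM may resize the heap later.
    if committed != max_bytes:
        print(f"{node_id}: heap_committed != heap_max ({committed} vs {max_bytes})")

    # 75% is the level at which Elasticsearch starts triggering GCs;
    # sustained usage at or above it signals memory pressure.
    if used_pct >= 75:
        print(f"{node_id}: heap_used_percent is {used_pct}% (memory pressure)")
```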
2.
"gc": { "collectors": { "young": { "collection_count": 13, "collection_time_in_millis": 923 }, "old": { "collection_count": 0, "collection_time_in_millis": 0 } }}
The old generation collection count, in contrast to the normally frequent young-generation collections, should remain small and have a small collection_time_in_millis. These counts are cumulative, accumulated from the time the node's JVM started, so it is hard to give an exact number for when you should start worrying (for example, a node with a one-year uptime will have a large count even if it is healthy). This is one of the reasons that tools such as Marvel are so helpful: GC counts over time are the important consideration.
Time spent GC’ing is also important. For example, a certain amount of garbage is generated while indexing documents. This is normal and causes a GC every now and then. These GCs are almost always fast and have little effect on the node: young generation takes a millisecond or two, and old generation takes a few hundred milliseconds. This is much different from 10-second GCs.
Our best advice is to collect collection counts and durations periodically (or use Marvel) and keep an eye out for frequent GCs. You can also enable slow-GC logging in Elasticsearch's logging configuration.
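Because the counters above are cumulative, what matters is how much they grow between samples, not their absolute value. A rough polling sketch, assuming the same local node stats endpoint as above:

```python
import json
import time
import urllib.request

URL = "http://localhost:9200/_nodes/stats/jvm"  # assumed local node
INTERVAL = 60  # seconds between samples

def old_gen_totals():
    """Sum cumulative old-gen GC count and time across all nodes."""
    with urllib.request.urlopen(URL) as resp:
        stats = json.load(resp)
    count = time_ms = 0
    for node in stats["nodes"].values():
        old = node["jvm"]["gc"]["collectors"]["old"]
        count += old["collection_count"]
        time_ms += old["collection_time_in_millis"]
    return count, time_ms

prev_count, prev_time = old_gen_totals()
while True:
    time.sleep(INTERVAL)
    count, time_ms = old_gen_totals()
    # The deltas over the sampling interval are what matter,
    # not the absolute cumulative values.
    print(f"old-gen GCs in last {INTERVAL}s: {count - prev_count}, "
          f"time spent: {time_ms - prev_time} ms")
    prev_count, prev_time = count, time_ms
```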
3.
It is much better to handle queuing in your application by gracefully responding to the back pressure from a full queue. When you receive bulk rejections, you should take these steps (a retry sketch follows the list below):
- Pause the import thread for 3–5 seconds.
- Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
- Send a new bulk request with just the rejected actions.
- Repeat from step 1 if rejections are encountered again.
Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off.
Rejections are not errors: they just mean you should try again later.
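A minimal sketch of that retry loop against the raw _bulk endpoint (standard library only; the index name and documents are hypothetical, and it assumes queue rejections show up as HTTP status 429 in the per-item results):

```python
import json
import time
import urllib.request

BULK_URL = "http://localhost:9200/_bulk"  # assumed local cluster

def send_bulk(actions):
    """actions: list of (metadata, source) pairs for index operations."""
    body = ""
    for meta, source in actions:
        body += json.dumps(meta) + "\n" + json.dumps(source) + "\n"
    req = urllib.request.Request(
        BULK_URL, data=body.encode("utf-8"), method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def bulk_with_backoff(actions, pause=5):
    while actions:
        response = send_bulk(actions)
        retry = []
        # Items come back in the same order as the request.
        for (meta, source), item in zip(actions, response["items"]):
            result = next(iter(item.values()))  # e.g. the "index" sub-object
            # 429 is assumed to mark a full-queue rejection; other failures
            # (e.g. mapping errors) should not be retried blindly.
            if result.get("status") == 429:
                retry.append((meta, source))
        if retry:
            # Pause, then resend only the rejected actions (steps 1-4 above).
            time.sleep(pause)
        actions = retry

# Example usage with two hypothetical documents:
docs = [
    ({"index": {"_index": "myindex", "_type": "doc", "_id": str(i)}}, {"value": i})
    for i in range(2)
]
bulk_with_backoff(docs)
```

Real import code would also cap the number of retries and log any actions that keep failing, rather than looping forever.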
There are a dozen threadpools. Most you can safely ignore, but a few are good to keep an eye on:
- indexing: Threadpool for normal indexing requests
- bulk: Bulk requests, which are distinct from the nonbulk indexing requests
- get: Get-by-ID operations
- search: All search and query requests
- merging: Threadpool dedicated to managing Lucene merges
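One way to keep an eye on these pools is the thread_pool section of the node stats API, whose per-pool rejected counter shows when a queue has filled up. A small sketch, assuming the same local node (pool key names can differ slightly by version, e.g. the indexing pool appears as "index"):

```python
import json
import urllib.request

URL = "http://localhost:9200/_nodes/stats/thread_pool"  # assumed local node
WATCHED = ("index", "bulk", "get", "search")  # pool names as reported by node stats

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    for pool, data in node["thread_pool"].items():
        if pool in WATCHED and data.get("rejected", 0) > 0:
            # A growing "rejected" counter means the pool's queue filled up
            # and requests were turned away.
            print(f"{node_id}: {pool} pool rejected {data['rejected']} requests "
                  f"(queue={data['queue']}, active={data['active']})")
```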