6 changes: 3 additions & 3 deletions deployment/docker/README.md
@@ -169,11 +169,11 @@ See below a list of recommended alerting rules for VictoriaLogs components for r
Some alerting rules thresholds are just recommendations and could require an adjustment.
The list of alerting rules is the following:
* [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml):
alerting rules related to all VictoriaMetrics components for tracking their "health" state;
shared alerting rules for tracking the health of VictoriaLogs components;
* [alerts-vlogs.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml):
alerting rules related to [VictoriaLogs](https://docs.victoriametrics.com/victorialogs/);
[VictoriaLogs](https://docs.victoriametrics.com/victorialogs/)-specific alerting rules for VictoriaLogs installations. Load this together with `alerts-health.yml`;
* [alerts-vlagent.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml):
alerting rules related to [vlagent](https://docs.victoriametrics.com/victorialogs/vlagent/);
alerting rules related to [vlagent](https://docs.victoriametrics.com/victorialogs/vlagent/). Load this together with `alerts-health.yml`.

Please, also see [how to monitor VictoriaLogs installations](https://docs.victoriametrics.com/victorialogs/#monitoring).
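As a minimal sketch of loading the rule files together, the Docker Compose fragment below passes both `alerts-health.yml` and `alerts-vlogs.yml` to vmalert through repeated `-rule` flags. The service name, image tag, mount paths, and the `victoriametrics:8428` and `alertmanager:9093` addresses are assumptions made for illustration; it also assumes VictoriaLogs metrics are already scraped into a Prometheus-compatible datasource. For vlagent, `alerts-vlagent.yml` would take the place of `alerts-vlogs.yml`.

```yaml
services:
  vmalert:
    # Image tag is a placeholder; pin the version used in your setup.
    image: victoriametrics/vmalert:latest
    volumes:
      # Host paths are assumptions; adjust to where the rule files live.
      - ./rules/alerts-health.yml:/etc/alerts/alerts-health.yml:ro
      - ./rules/alerts-vlogs.yml:/etc/alerts/alerts-vlogs.yml:ro
    command:
      # Hypothetical address of the datasource that stores VictoriaLogs metrics.
      - -datasource.url=http://victoriametrics:8428
      # Hypothetical Alertmanager address.
      - -notifier.url=http://alertmanager:9093
      # The shared health rules are loaded next to the component-specific rules.
      - -rule=/etc/alerts/alerts-health.yml
      - -rule=/etc/alerts/alerts-vlogs.yml
```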

40 changes: 16 additions & 24 deletions deployment/docker/rules/alerts-health.yml
@@ -1,5 +1,5 @@
# File contains default list of alerts for various VM components.
# The following alerts are recommended for use for any VM installation.
# File contains shared health alerts for VictoriaLogs components.
# The following alerts are recommended for use for VictoriaLogs installations.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to every specific setup.
groups:
@@ -81,19 +81,15 @@ groups:
Logging rate for job \"{{ $labels.job }}\" ({{ $labels.instance }}) is {{ $value }} for last 15m.
Worth to check logs for specific error messages.

- alert: ConcurrentInsertsHitTheLimit
expr: avg_over_time(vm_concurrent_insert_current{job=~".*(victorialogs|vlstorage|vlselect|vlinsert|vlagent).*"}[1m]) >= vm_concurrent_insert_capacity{job=~".*(victorialogs|vlstorage|vlselect|vlinsert|vlagent).*"}
- alert: RequestErrorsToAPI
expr: increase(vl_http_errors_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent inserts limit"
description: |
The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.
Usually, when component constantly hits the limit it is likely the component is overloaded and requires more CPU.
In some cases for components like vmagent or vminsert the alert might trigger if there are too many clients
making write attempts. If vmagent's or vminsert's CPU usage and network saturation are at normal level, then
it might be worth adjusting `-maxConcurrentInserts` cmd-line flag.
summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors.
Please verify if clients are sending correct requests."

- alert: RowsRejectedOnIngestion
expr: rate(vl_rows_dropped_total[5m]) > 0
@@ -102,23 +98,19 @@ groups:
severity: warning
annotations:
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
description: "Ingested rows on instance \"{{ $labels.instance }}\" are rejected due to the
description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
following reason: \"{{ $labels.reason }}\""

- alert: TooHighQueryLoad
expr: increase(vl_concurrent_select_limit_timeout_total[5m]) > 0
- alert: ConcurrentInsertsHitTheLimit
expr: avg_over_time(vm_concurrent_insert_current{job=~".*(victorialogs|vlstorage|vlinsert|vlagent).*"}[1m]) >= vm_concurrent_insert_capacity{job=~".*(victorialogs|vlstorage|vlinsert|vlagent).*"}
for: 15m
labels:
severity: warning
annotations:
summary: "Read queries fail with timeout for {{ $labels.job }} on instance {{ $labels.instance }}"
summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent inserts limit"
description: |
Instance {{ $labels.instance }} ({{ $labels.job }}) is failing to serve read queries during last 15m.
Concurrency limit `-search.maxConcurrentRequests` was reached on this instance and extra queries were
put into the queue for `-search.maxQueueDuration` interval. But even after waiting in the queue these queries weren't served.
This happens if instance is overloaded with the current workload, or datasource is too slow to respond.
Possible solutions are the following:
* reduce the query load;
* increase compute resources or number of replicas;
* adjust limits `-search.maxConcurrentRequests` and `-search.maxQueueDuration`.
See more at https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-queries.
The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.
Usually, when component constantly hits the limit it is likely the component is overloaded and requires more CPU.
In some cases the alert might trigger if there are too many clients making write attempts to the component.
If the component's CPU usage and network saturation are at normal level, then
it might be worth adjusting `-maxConcurrentInserts` cmd-line flag.
20 changes: 0 additions & 20 deletions deployment/docker/rules/alerts-vlagent.yml
@@ -100,23 +100,3 @@ groups:
summary: "Persistent Queue (url {{ $labels.url }}) of {{ $labels.instance }} (job:{{ $labels.job }}) will run out of space in 4 hours."
description: "RemoteWrite destination ({{ $labels.url }}) is unavailable or unable to receive data in a timely manner, so the persistent queue size is growing.
Once the available space is exhausted, some samples will be discarded and cause incident. Please check the health of remoteWrite destination ({{ $labels.url }})."

- alert: RequestErrorsToAPI
expr: increase(vl_http_errors_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors.
Please verify if clients are sending correct requests."

- alert: RowsRejectedOnIngestion
expr: rate(vl_rows_dropped_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
following reason: \"{{ $labels.reason }}\""
35 changes: 24 additions & 11 deletions deployment/docker/rules/alerts-vlogs.yml
@@ -1,4 +1,4 @@
# File contains default list of alerts for VictoriaLogs single server.
# File contains VictoriaLogs-specific alerts for VictoriaLogs installations.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to every specific setup.
groups:
@@ -20,22 +20,35 @@ groups:
Having less than 20% of free disk space could cripple merge processes and overall performance.
Consider to limit the ingestion rate, decrease retention or scale the disk space if possible."

- alert: RequestErrorsToAPI
expr: increase(vl_http_errors_total[5m]) > 0

- alert: ConcurrentQueriesHitTheLimit
expr: avg_over_time(vl_concurrent_select_current[1m]) >= vl_concurrent_select_capacity
for: 15m
labels:
severity: warning
annotations:
summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors.
Please verify if clients are sending correct requests."
summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent query limit"
description: |
The limit of concurrent queries on instance {{ $labels.instance }} is controlled by `-search.maxConcurrentRequests`.
Usually, when the component constantly hits the limit it is likely overloaded and requires more CPU or more replicas.
In some cases the alert might trigger if clients send too many expensive queries to the component.
If the component's CPU usage and storage latency are at normal level, then
it might be worth adjusting `-search.maxConcurrentRequests`.

- alert: RowsRejectedOnIngestion
expr: rate(vl_rows_dropped_total[5m]) > 0
- alert: TooHighQueryLoad
expr: increase(vl_concurrent_select_limit_timeout_total[5m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
following reason: \"{{ $labels.reason }}\""
summary: "Read queries fail with timeout for {{ $labels.job }} on instance {{ $labels.instance }}"
description: |
Instance {{ $labels.instance }} ({{ $labels.job }}) is failing to serve read queries during last 15m.
Concurrency limit `-search.maxConcurrentRequests` was reached on this instance and extra queries were
put into the queue for `-search.maxQueueDuration` interval. But even after waiting in the queue these queries weren't served.
This happens if instance is overloaded with the current workload, or datasource is too slow to respond.
Possible solutions are the following:
* reduce the query load;
* increase compute resources or number of replicas;
* adjust limits `-search.maxConcurrentRequests` and `-search.maxQueueDuration`.
See more at https://docs.victoriametrics.com/victorialogs/logsql/#troubleshooting.
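
The two query alerts above point at the `-search.maxConcurrentRequests` and `-search.maxQueueDuration` flags. A rough sketch of where those flags could be set in a Docker Compose service follows; the service name, image tag, and both values are illustrative assumptions rather than recommended settings.

```yaml
services:
  victorialogs:
    # Image tag is a placeholder; pin the version used in your setup.
    image: victoriametrics/victoria-logs:<tag>
    command:
      # Illustrative values only; calibrate against available CPU and query load.
      - -search.maxConcurrentRequests=16
      - -search.maxQueueDuration=20s
```

Raising these limits only helps when CPU usage and storage latency are at normal levels, as the alert description notes; otherwise reducing the query load or adding resources is the more likely fix.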
3 changes: 2 additions & 1 deletion docs/victorialogs/README.md
@@ -91,7 +91,8 @@ See [metrics reference](https://docs.victoriametrics.com/victorialogs/metrics/)

We recommend installing Grafana dashboard for [VictoriaLogs single-node](https://grafana.com/grafana/dashboards/22084) or [cluster](https://grafana.com/grafana/dashboards/23274).

We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml)
We recommend setting up [alerts-vlogs.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml)
and [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml)
via [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) or via Prometheus.
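
For the Prometheus option, a minimal `prometheus.yml` fragment might look like the sketch below. The file paths and the `alertmanager:9093` target are assumptions for illustration; `rule_files` and `alerting` are standard Prometheus configuration keys.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Paths are assumptions; point them at your copies of the rule files.
  - /etc/prometheus/rules/alerts-health.yml
  - /etc/prometheus/rules/alerts-vlogs.yml

alerting:
  alertmanagers:
    - static_configs:
        # Hypothetical Alertmanager address.
        - targets: ["alertmanager:9093"]
```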

VictoriaLogs emits its own logs to stdout. It is recommended to investigate these logs during troubleshooting.
3 changes: 2 additions & 1 deletion docs/victorialogs/vlagent.md
@@ -874,7 +874,8 @@ Use [the official Grafana dashboard for `vlagent` state overview](https://grafan
Graphs on this dashboard contain useful hints - hover the `i` icon at the top left corner of each graph in order to read them.
If you have suggestions for improvements or have found a bug, please open an issue on GitHub or add a review to the dashboard.

We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml)
We recommend setting up [alerts-vlagent.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml)
and [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml)
via [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) or via Prometheus.

## Multitenancy