diff --git a/deployment/docker/README.md b/deployment/docker/README.md
index 668c173b04..ca0a77c9c1 100644
--- a/deployment/docker/README.md
+++ b/deployment/docker/README.md
@@ -169,11 +169,11 @@ See below a list of recommended alerting rules for VictoriaLogs components for r
 Some alerting rules thresholds are just recommendations and could require an adjustment.
 The list of alerting rules is the following:
 * [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml):
-  alerting rules related to all VictoriaMetrics components for tracking their "health" state;
+  shared alerting rules for tracking the health of VictoriaLogs components;
 * [alerts-vlogs.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml):
-  alerting rules related to [VictoriaLogs](https://docs.victoriametrics.com/victorialogs/);
+  [VictoriaLogs](https://docs.victoriametrics.com/victorialogs/)-specific alerting rules for VictoriaLogs installations. Load this together with `alerts-health.yml`;
 * [alerts-vlagent.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml):
-  alerting rules related to [vlagent](https://docs.victoriametrics.com/victorialogs/vlagent/);
+  alerting rules related to [vlagent](https://docs.victoriametrics.com/victorialogs/vlagent/). Load this together with `alerts-health.yml`.
 
 Please, also see [how to monitor VictoriaLogs installations](https://docs.victoriametrics.com/victorialogs/#monitoring).
 
diff --git a/deployment/docker/rules/alerts-health.yml b/deployment/docker/rules/alerts-health.yml
index a048b57df0..7e3b6ecabc 100644
--- a/deployment/docker/rules/alerts-health.yml
+++ b/deployment/docker/rules/alerts-health.yml
@@ -1,5 +1,5 @@
-# File contains default list of alerts for various VM components.
-# The following alerts are recommended for use for any VM installation.
+# File contains shared health alerts for VictoriaLogs components.
+# The following alerts are recommended for use for VictoriaLogs installations.
 # The alerts below are just recommendations and may require some updates
 # and threshold calibration according to every specific setup.
 groups:
@@ -81,19 +81,15 @@ groups:
             Logging rate for job \"{{ $labels.job }}\" ({{ $labels.instance }}) is {{ $value }} for last 15m.
             Worth to check logs for specific error messages.
 
-      - alert: ConcurrentInsertsHitTheLimit
-        expr: avg_over_time(vm_concurrent_insert_current{job=~".*(victorialogs|vlstorage|vlselect|vlinsert|vlagent).*"}[1m]) >= vm_concurrent_insert_capacity{job=~".*(victorialogs|vlstorage|vlselect|vlinsert|vlagent).*"}
+      - alert: RequestErrorsToAPI
+        expr: increase(vl_http_errors_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
         annotations:
-          summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent inserts limit"
-          description: |
-            The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.
-            Usually, when component constantly hits the limit it is likely the component is overloaded and requires more CPU.
-            In some cases for components like vmagent or vminsert the alert might trigger if there are too many clients
-            making write attempts. If vmagent's or vminsert's CPU usage and network saturation are at normal level, then
-            it might be worth adjusting `-maxConcurrentInserts` cmd-line flag.
+          summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
+          description: "Requests to path {{ $labels.path }} are receiving errors.
+            Please verify if clients are sending correct requests."
 
       - alert: RowsRejectedOnIngestion
         expr: rate(vl_rows_dropped_total[5m]) > 0
@@ -102,23 +98,19 @@ groups:
           severity: warning
         annotations:
           summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
-          description: "Ingested rows on instance \"{{ $labels.instance }}\" are rejected due to the
+          description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
             following reason: \"{{ $labels.reason }}\""
 
-      - alert: TooHighQueryLoad
-        expr: increase(vl_concurrent_select_limit_timeout_total[5m]) > 0
+      - alert: ConcurrentInsertsHitTheLimit
+        expr: avg_over_time(vm_concurrent_insert_current{job=~".*(victorialogs|vlstorage|vlinsert|vlagent).*"}[1m]) >= vm_concurrent_insert_capacity{job=~".*(victorialogs|vlstorage|vlinsert|vlagent).*"}
         for: 15m
         labels:
           severity: warning
         annotations:
-          summary: "Read queries fail with timeout for {{ $labels.job }} on instance {{ $labels.instance }}"
+          summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent inserts limit"
           description: |
-            Instance {{ $labels.instance }} ({{ $labels.job }}) is failing to serve read queries during last 15m.
-            Concurrency limit `-search.maxConcurrentRequests` was reached on this instance and extra queries were
-            put into the queue for `-search.maxQueueDuration` interval. But even after waiting in the queue these queries weren't served.
-            This happens if instance is overloaded with the current workload, or datasource is too slow to respond.
-            Possible solutions are the following:
-            * reduce the query load;
-            * increase compute resources or number of replicas;
-            * adjust limits `-search.maxConcurrentRequests` and `-search.maxQueueDuration`.
-            See more at https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-queries.
+            The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.
+            Usually, when component constantly hits the limit it is likely the component is overloaded and requires more CPU.
+            In some cases the alert might trigger if there are too many clients making write attempts to the component.
+            If the component's CPU usage and network saturation are at normal level, then
+            it might be worth adjusting `-maxConcurrentInserts` cmd-line flag.
diff --git a/deployment/docker/rules/alerts-vlagent.yml b/deployment/docker/rules/alerts-vlagent.yml
index e45f94352e..9187b3f3ba 100644
--- a/deployment/docker/rules/alerts-vlagent.yml
+++ b/deployment/docker/rules/alerts-vlagent.yml
@@ -100,23 +100,3 @@ groups:
           summary: "Persistent Queue (url {{ $labels.url }}) of {{ $labels.instance }} (job:{{ $labels.job }}) will run out of space in 4 hours."
           description: "RemoteWrite destination ({{ $labels.url }}) is unavailable or unable to receive data in a timely manner, so the persistent queue size is growing. Once the available space is exhausted, some samples will be discarded and cause incident. Please check the health of remoteWrite destination ({{ $labels.url }})."
-
-      - alert: RequestErrorsToAPI
-        expr: increase(vl_http_errors_total[5m]) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
-          description: "Requests to path {{ $labels.path }} are receiving errors.
-            Please verify if clients are sending correct requests."
-
-      - alert: RowsRejectedOnIngestion
-        expr: rate(vl_rows_dropped_total[5m]) > 0
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
-          description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
-            following reason: \"{{ $labels.reason }}\""
diff --git a/deployment/docker/rules/alerts-vlogs.yml b/deployment/docker/rules/alerts-vlogs.yml
index 6c8a47b551..723953e290 100644
--- a/deployment/docker/rules/alerts-vlogs.yml
+++ b/deployment/docker/rules/alerts-vlogs.yml
@@ -1,4 +1,4 @@
-# File contains default list of alerts for VictoriaLogs single server.
+# File contains VictoriaLogs-specific alerts for VictoriaLogs installations.
 # The alerts below are just recommendations and may require some updates
 # and threshold calibration according to every specific setup.
 groups:
@@ -20,22 +20,35 @@ groups:
             Having less than 20% of free disk space could cripple merge processes and overall performance.
             Consider to limit the ingestion rate, decrease retention or scale the disk space if possible."
 
-      - alert: RequestErrorsToAPI
-        expr: increase(vl_http_errors_total[5m]) > 0
+
+      - alert: ConcurrentQueriesHitTheLimit
+        expr: avg_over_time(vl_concurrent_select_current[1m]) >= vl_concurrent_select_capacity
         for: 15m
         labels:
           severity: warning
         annotations:
-          summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
-          description: "Requests to path {{ $labels.path }} are receiving errors.
-            Please verify if clients are sending correct requests."
+          summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting concurrent query limit"
+          description: |
+            The limit of concurrent queries on instance {{ $labels.instance }} is controlled by `-search.maxConcurrentRequests`.
+            Usually, when the component constantly hits the limit it is likely overloaded and requires more CPU or more replicas.
+            In some cases the alert might trigger if clients send too many expensive queries to the component.
+            If the component's CPU usage and storage latency are at normal level, then
+            it might be worth adjusting `-search.maxConcurrentRequests`.
 
-      - alert: RowsRejectedOnIngestion
-        expr: rate(vl_rows_dropped_total[5m]) > 0
+      - alert: TooHighQueryLoad
+        expr: increase(vl_concurrent_select_limit_timeout_total[5m]) > 0
         for: 15m
         labels:
           severity: warning
         annotations:
-          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
-          description: "VictoriaLogs is rejecting to ingest rows on \"{{ $labels.instance }}\" due to the
-            following reason: \"{{ $labels.reason }}\""
+          summary: "Read queries fail with timeout for {{ $labels.job }} on instance {{ $labels.instance }}"
+          description: |
+            Instance {{ $labels.instance }} ({{ $labels.job }}) is failing to serve read queries during last 15m.
+            Concurrency limit `-search.maxConcurrentRequests` was reached on this instance and extra queries were
+            put into the queue for `-search.maxQueueDuration` interval. But even after waiting in the queue these queries weren't served.
+            This happens if instance is overloaded with the current workload, or datasource is too slow to respond.
+            Possible solutions are the following:
+            * reduce the query load;
+            * increase compute resources or number of replicas;
+            * adjust limits `-search.maxConcurrentRequests` and `-search.maxQueueDuration`.
+            See more at https://docs.victoriametrics.com/victorialogs/logsql/#troubleshooting.
diff --git a/docs/victorialogs/README.md b/docs/victorialogs/README.md
index 5bc9eba521..db45866b57 100644
--- a/docs/victorialogs/README.md
+++ b/docs/victorialogs/README.md
@@ -91,7 +91,8 @@ See [metrics reference](https://docs.victoriametrics.com/victorialogs/metrics/)
 
 We recommend installing Grafana dashboard for [VictoriaLogs single-node](https://grafana.com/grafana/dashboards/22084) or [cluster](https://grafana.com/grafana/dashboards/23274).
 
-We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml)
+We recommend setting up [alerts-vlogs.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlogs.yml)
+and [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml)
 via [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) or via Prometheus.
 
 VictoriaLogs emits its own logs to stdout. It is recommended to investigate these logs during troubleshooting.
diff --git a/docs/victorialogs/vlagent.md b/docs/victorialogs/vlagent.md
index 72fb1576ac..b1ecd00cfa 100644
--- a/docs/victorialogs/vlagent.md
+++ b/docs/victorialogs/vlagent.md
@@ -874,7 +874,8 @@ Use [the official Grafana dashboard for `vlagent` state overview](https://grafan
 Graphs on this dashboard contain useful hints - hover the `i` icon at the top left corner of each graph in order to read them.
 If you have suggestions for improvements or have found a bug, please open an issue on GitHub or add a review to the dashboard.
 
-We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml)
+We recommend setting up [alerts-vlagent.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-vlagent.yml)
+and [alerts-health.yml](https://github.com/VictoriaMetrics/VictoriaLogs/blob/master/deployment/docker/rules/alerts-health.yml)
 via [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) or via Prometheus.
 
 ## Multitenancy