BZ 1846711 - pcp-pmda-openmetrics produces warnings querying grafana in its default configuration
0b2ef2d79 pmdaopenmetrics: add control.status metrics, de-verbosify the log, QA updates
63605e3db qa/1102: tweak openmetrics QA to be more deterministic
649a0c3a2 qa: improve _filter_pmda_remove() in common.filter
commit 0b2ef2d79686d1e44901263093edeb9e1b9b5f77
Author: Mark Goodwin <mgoodwin@redhat.com>
Date: Fri Jun 19 12:18:47 2020 +1000
pmdaopenmetrics: add control.status metrics, de-verbosify the log, QA updates
Resolves: RHBZ#1846711
Add openmetrics.control.status (string status per configured URL
of the last fetch) and openmetrics.control.status_code, which
is the integer response code (e.g. 200 is success) with discrete
semantics.
In addition, we now only spam the PMDA log and systemd journal
when a URL fetch fails if openmetrics.control.debug is non-zero.
Users can instead rely on the new status metrics, which can also
be used for service availability monitoring. These metrics
complement the openmetrics.control.parse_time, fetch_time and
calls counters.
Includes QA updates and pmdaopenmetrics(1) doc updates.
diff --git a/qa/1321.out b/qa/1321.out
index cee072cd2..4533bccd8 100644
--- a/qa/1321.out
+++ b/qa/1321.out
@@ -13,6 +13,8 @@ openmetrics.control.calls
openmetrics.control.debug
openmetrics.control.fetch_time
openmetrics.control.parse_time
+openmetrics.control.status
+openmetrics.control.status_code
openmetrics.source1.metric1
== Created URL file /var/lib/pcp/pmdas/openmetrics/config.d/source2.url
@@ -22,6 +24,8 @@ openmetrics.control.calls
openmetrics.control.debug
openmetrics.control.fetch_time
openmetrics.control.parse_time
+openmetrics.control.status
+openmetrics.control.status_code
openmetrics.source1.metric1
openmetrics.source2.metric1
openmetrics.source2.metric2
@@ -33,6 +37,8 @@ openmetrics.control.calls
openmetrics.control.debug
openmetrics.control.fetch_time
openmetrics.control.parse_time
+openmetrics.control.status
+openmetrics.control.status_code
openmetrics.source1.metric1
openmetrics.source2.metric1
openmetrics.source2.metric2
@@ -47,6 +53,8 @@ openmetrics.control.calls
openmetrics.control.debug
openmetrics.control.fetch_time
openmetrics.control.parse_time
+openmetrics.control.status
+openmetrics.control.status_code
openmetrics.source1.metric1
openmetrics.source2.metric1
openmetrics.source2.metric2
@@ -63,6 +71,8 @@ openmetrics.control.calls
openmetrics.control.debug
openmetrics.control.fetch_time
openmetrics.control.parse_time
+openmetrics.control.status
+openmetrics.control.status_code
openmetrics.source1.metric1
openmetrics.source2.metric1
openmetrics.source2.metric2
diff --git a/src/pmdas/openmetrics/pmdaopenmetrics.1 b/src/pmdas/openmetrics/pmdaopenmetrics.1
index d3c7aa85f..0c92e2a11 100644
--- a/src/pmdas/openmetrics/pmdaopenmetrics.1
+++ b/src/pmdas/openmetrics/pmdaopenmetrics.1
@@ -413,10 +413,37 @@ log mandatory on 2 second {
The PMDA maintains special control metrics, as described below.
Apart from
.BR openmetrics.control.debug ,
-each of these metrics is a counter and has one instance for each configured metric source.
-The instance domain is adjusted dynamically as new sources are discovered.
+each of these metrics has one instance for each configured metric source.
+All of these metrics have integer values with counter semantics, except
+.BR openmetrics.control.status ,
+which has a string value.
+It is important to note that fetching any of the
+.B openmetrics.control
+metrics will only update the counters and status values if the corresponding URL is actually fetched.
+If the source URL is not fetched, the control metric values do not trigger a refresh and the control
+values reported represent the most recent fetch of each corresponding source.
+.PP
+The instance domain for the
+.B openmetrics.control
+metrics is adjusted dynamically as new sources are discovered.
If there are no sources configured, the metric names are still defined
but the instance domain will be empty and a fetch will return no values.
+.IP \fBopenmetrics.control.status\fP
+A string representing the status of the last fetch of the corresponding source.
+This will generally be
+.B success
+for an http response code of 200.
+This metric can be used for service availability monitoring - provided, as stated above,
+the corresponding source URL is fetched too.
+.IP \fBopenmetrics.control.status_code\fP
+This metric is similar to
+.B openmetrics.control.status
+except that it is the integer response code of the last fetch.
+A value of
+.B 200
+usually signifies success and any other value failure.
+This metric can also be used for service availability monitoring, with the same caveats as
+.BR openmetrics.control.status .
.IP \fBopenmetrics.control.calls\fP
total number of times each configured metric source has been fetched (if it's a URL)
or executed (if it's a script), since the PMDA started.
diff --git a/src/pmdas/openmetrics/pmdaopenmetrics.python b/src/pmdas/openmetrics/pmdaopenmetrics.python
index a5ed22f13..1486ed676 100755
--- a/src/pmdas/openmetrics/pmdaopenmetrics.python
+++ b/src/pmdas/openmetrics/pmdaopenmetrics.python
@@ -1,6 +1,6 @@
#!/usr/bin/env pmpython
#
-# Copyright (c) 2017-2019 Red Hat.
+# Copyright (c) 2017-2020 Red Hat.
# Copyright (c) 2017 Ronak Jain.
#
# This program is free software; you can redistribute it and/or modify it
@@ -704,6 +704,7 @@ class Source(object):
return
# fetch the document
+ status_code = 0
try:
if self.is_scripted:
# Execute file, expecting openmetrics metric data on stdout.
@@ -715,6 +716,7 @@ class Source(object):
self.document = open(self.url[7:], 'r').read()
else:
r = self.requests.get(self.url, headers=self.headers, timeout=timeout)
+ status_code = r.status_code
r.raise_for_status() # non-200? ERROR
# NB: the requests package automatically enables http keep-alive and compression
self.document = r.text
@@ -723,9 +725,13 @@ class Source(object):
incr = int(1000 * (time.time() - fetch_time))
self.pmda.stats_fetch_time[self.cluster] += incr
self.pmda.stats_fetch_time[0] += incr # total for all sources
+ self.pmda.stats_status[self.cluster] = "success"
+ self.pmda.stats_status_code[self.cluster] = status_code
except Exception as e:
- self.pmda.err('Warning: cannot fetch URL or execute script %s: %s' % (self.path, e))
+ self.pmda.stats_status[self.cluster] = 'failed to fetch URL or execute script %s: %s' % (self.path, e)
+ self.pmda.stats_status_code[self.cluster] = status_code
+ self.pmda.debug('Warning: cannot fetch URL or execute script %s: %s' % (self.path, e)) if self.pmda.dbg else None
return
def refresh2(self, timeout):
@@ -844,6 +850,20 @@ class OpenMetricsPMDA(PMDA):
pmUnits(0, 0, 0, 0, 0, 0)),
'debug flag to enable verbose log messages, to enable: pmstore %s.control.debug 1' % self.pmda_name)
+ # response status string, per-source end-point
+ self.stats_status = {0:"none"} # status string, keyed by cluster number
+ self.add_metric('%s.control.status' % self.pmda_name, pmdaMetric(self.pmid(0, 5),
+ c_api.PM_TYPE_STRING, self.sources_indom, c_api.PM_SEM_INSTANT,
+ pmUnits(0, 0, 0, 0, 0, 0)), # no units
+ 'per-end-point source URL response status after the most recent fetch')
+
+ # response status code, per-source end-point
+ self.stats_status_code = {0:0} # status code, keyed by cluster number
+ self.add_metric('%s.control.status_code' % self.pmda_name, pmdaMetric(self.pmid(0, 6),
+ c_api.PM_TYPE_32, self.sources_indom, c_api.PM_SEM_DISCRETE,
+ pmUnits(0, 0, 0, 0, 0, 0)), # no units
+ 'per-end-point source URL response status code after the most recent fetch')
+
# schedule a refresh
self.set_need_refresh()
@@ -961,6 +981,8 @@ class OpenMetricsPMDA(PMDA):
self.stats_fetch_calls[cluster] = 0
self.stats_fetch_time[cluster] = 0
self.stats_parse_time[cluster] = 0
+ self.stats_status[cluster] = "unknown"
+ self.stats_status_code[cluster] = 0
save_cluster_table = True
self.log("Found source %s cluster %d" % (name, cluster))
@@ -996,6 +1018,10 @@ class OpenMetricsPMDA(PMDA):
return [self.stats_parse_time[inst], 1] if inst in self.stats_parse_time else [c_api.PM_ERR_INST, 0]
elif item == 4: # $(pmda_name).control.debug
return [self.dbg, 1]
+ elif item == 5: # per-source status string
+ return [self.stats_status[inst], 1] if inst in self.stats_status else [c_api.PM_ERR_INST, 0]
+ elif item == 6: # per-source status code
+ return [self.stats_status_code[inst], 1] if inst in self.stats_status_code else [c_api.PM_ERR_INST, 0]
return [c_api.PM_ERR_PMID, 0]
self.assert_source_invariants(cluster=cluster)
commit 63605e3db4b2821df2a6ffb21507af91d97f3a8b
Author: Mark Goodwin <mgoodwin@redhat.com>
Date: Fri Jun 19 10:02:04 2020 +1000
qa/1102: tweak openmetrics QA to be more deterministic
Now that pmdaopenmetrics is Installed by default with the localhost
grafana metrics URL configured, after _pmdaopenmetrics_save_config
we need to _pmdaopenmetrics_remove before _pmdaopenmetrics_install
to make qa/1102 deterministic.
diff --git a/qa/1102 b/qa/1102
index f573d14f4..98ff61f5e 100755
--- a/qa/1102
+++ b/qa/1102
@@ -46,6 +46,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
_stop_auto_restart pmcd
_pmdaopenmetrics_save_config
+_pmdaopenmetrics_remove
_pmdaopenmetrics_install
port=`_find_free_port 10000`
diff --git a/qa/1102.out b/qa/1102.out
index 5094e4a82..aa74abe44 100644
--- a/qa/1102.out
+++ b/qa/1102.out
@@ -1,5 +1,12 @@
QA output created by 1102
+=== remove openmetrics agent ===
+Culling the Performance Metrics Name Space ...
+openmetrics ... done
+Updating the PMCD control file, and notifying PMCD ...
+[...removing files...]
+Check openmetrics metrics have gone away ... OK
+
=== openmetrics agent installation ===
Fetch and desc openmetrics metrics: success
commit 649a0c3a2745f549b139ce1250e38a1e90308426
Author: Mark Goodwin <mgoodwin@redhat.com>
Date: Fri Jun 19 09:55:58 2020 +1000
qa: improve _filter_pmda_remove() in common.filter
Filter "Job for pmcd.service canceled" in _filter_pmda_remove.
Systemd sometimes (uncommonly) prints this if a PMDA is still
starting when a QA test ./Removes it.
diff --git a/qa/common.filter b/qa/common.filter
index a53d4a49d..b327abedc 100644
--- a/qa/common.filter
+++ b/qa/common.filter
@@ -760,6 +760,7 @@ _filter_pmda_remove()
_filter_pmda_install |
sed \
-e '/Removing files/d' \
+ -e '/Job for pmcd.service canceled/d' \
-e '/Updating the PMCD control file/c\
Updating the PMCD control file, and notifying PMCD ...\
[...removing files...]'