TL;DR - work is underway on a scalable, label-based and
fast querying language for time series data in PCP.  It
aims to provide functionality comparable to Borgmon[1,2],
Prometheus[3,4] and similar tools.  Prototype[5,6] code
is available to try out with example use cases[7,8,9,10]
and an extensive ToDo[14] list.  Contributions welcome!

 [1] https://landing.google.com/sre/book/chapters/practical-alerting.html
 [2] https://research.google.com/pubs/pub43438.html
 [3] https://prometheus.io
 [4] https://prometheus.io/docs/querying/basics/
 [5] https://github.com/performancecopilot/pcp/tree/master
 [6] https://github.com/performancecopilot/pcp/tree/master/src/pmseries
 [7] https://gist.github.com/natoscott/1bada43f35acf3cbfcb6167ceec6cd65
 [8] https://gist.github.com/natoscott/71769e4e08966b5887b3b48e24becaf5
 [9] https://gist.github.com/natoscott/b26089838b93b22b04ee8633d581d5a6
[10] https://gist.github.com/natoscott/e03efe89e40f29bc484df68de92829f7
[11] https://github.com/performancecopilot/pcp/tree/master/src/pmseries/ident.txt
[12] https://github.com/performancecopilot/pcp/tree/master/src/pmseries/query.txt
[13] https://github.com/performancecopilot/pcp/tree/master/src/pmseries/schema.svg
[14] https://github.com/performancecopilot/pcp/tree/master/src/pmseries/TODO
[15] https://github.com/redis/hiredis
[16] https://redis.io/topics/data-types-intro


== Rationale (what are the problems being tackled here?)

1. Analytic capabilities that scale to very large numbers
of sources (hosts/archives) and metrics.

Aim to introduce the ability to search for and transform
(report, calculate aggregated statistics, graph) metrics
from multiple sources, without individual connections to
sources.  Search may involve the need to query data from
every source.

The single metrics source (pmNewContext) model currently
used in PCP has worked well for small numbers of hosts,
but has severe limitations as data volumes increase.

For example, it is not realistically feasible to open and
scan even just 100's of archives (potentially terrabytes
of data), or socket connections to just hundreds of hosts,
searching for certain conditions using the current single-
source-single-context model.  In these situations, we want
to be able to find and retrieve performance data from one
source among millions, and in milliseconds not days/weeks.

2. Tackle the long-standing problem of transparently
transitioning between live and archived data access.


== Constraints

It is highly desirable to continue to maintain all of the
existing facilities PCP provides.  In particular, the PCP
archive format has been very successful in the uses for
which it was designed, and it would be ideal to continue
to support this for long-term PCP data storage.

Likewise, the PCP live collection system is battle proven
and something we should aim to leverage - extend it, not
replace it.

Realistically, we have a small core team of PCP developers
who care for and feed PCP - building this ourselves is not
a viable plan in the short term, so we should be selecting
existing technologies as evolutionary cranes to assist.


== Goals

Support "planet-scale" analysis and monitoring using PCP.

The first step involves indexing metadata (and optionally
data) for fast searching beyond what we currently have -
the individual PCP archive temporal index.  For these
volumes of data being considered, this phase focuses on
fast access to sources in relatively close geograhical
proximity.  This still requires support for significantly
larger data volumes than we can comfortably process today.

The second step involves federating the results of the
first stage, to scale-out to geographically dispersed
sources of metrics.

At both stages, the introduction of scaling concepts like
sharding, replication and redundancy of data and servers
is needed to ensure fast and reliabile access.


== Mechanisms

- build in-memory inverted indices on all search terms for
  querying (metric names, instance names, label names, and
  time).

- provide a new query language using the inverted indices
  to access both metadata and (optionally) data

- use labels to identify everything (source host names, job
  names, domain names, applications, containers, pods, CPUs,
  disks, data centres names, NUMA nodes, cgroups, ...) - as
  well as existing metadata (metric names, instance names,
  and metric descriptors).

- introduce the concept of a "time series identifier" - the
  SHA1 hash of PCP "identity metadata"[11].  Composed of:
  descriptor, instance IDs and the merged pmLabelSet.  Does
  not include: optional labels (aka "notes"), metric names.

- use these time series identifiers as the underlying data
  type of the inverted indices, internally, and as a handle
  to uniquely, consistently identify time series externally.

- provide optimal memory access to time series data.  IOW, all
  time series values (both times and values) for the series:
      "www1.acme.com:kernel.all.load[1 minute]"
  are stored in close proximity and as a sorted set (sort key
  is the timestamps) for optimal querying.

- provide support for fast, dynamic calculation of arbitrary
  statistics like N-th percentile, standard deviation, mean,
  median, and so on, on individual time series.

- optimal memory footprint in the presence of repeated strings
  across sources (label, metric & instance names, string-type
  metric values).

- scale out horizontally as additional hosts are added to the
  monitored set, using sharding.

- in particular, ensure time series data (values + timestamps)
  shard well and have fault isolation - in the prototype (see
  below) this is catered for using the SHA1 hash identifiers
  for individual time series and denormalisation of timestamps
  (stored with the individual time series values).

- allow multiple modes of operation, such as a metadata-only
  mode, in addition to the full metadata and data interface.
  (thanks to Kenj for this idea)
- this allows for fast search among the metadata to be hooked
  up with the (long-term storage) of data in PCP archives,
  which can be accessed via the traditional PMAPI mechanisms.
 

== Prototype

An early implementation of several of these concepts has been
committed to the current master branch of the PCP tree on github.
It is implemented using the Redis NoSQL in-memory data store,
internally.  The above concepts are generic so an alternative
could conceivably be "plugged in" but so far Redis is working
out extremely well.

Currently the prototype requires a local Redis instance that
is running on the default port (6379).  In future this'll be
configurable, and support both scale-up (sharding across any
number of Redis servers) and scale-down (via a private local
Redis server on a Unix domain socket, started and stopped by
the PCP client tool needing it).

The prototype is currently wholly contained within a single
binary - pmseries - which performs all loading and querying
functionality.  Obviously this will change over time and the
definition of a more formal API will need to follow.

The representation of the data stored in Redis is described
by a pseudo-schema[13,11] diagram.  It makes use of the set,
zset and hash key types and has been laid out with sharding
of the time series keys in mind.  It follows the key naming
conventions recommended in the Redis community.

Building the prototype requires a hiredis[15,16] development
package to be installed locally - configure.ac in the master
branch checks for this.


== Querying

The query language borrows liberally from existing similar
systems[1,2,3,4].  An abstract definition of the language
exists[12] along with some example queries[6,7,8,9].

This is focused on the core component of the language - how
to specify time series vectors, and extracting values for
time windows.  It does not yet include the definitions of
functions (for manipulating time series values) but it is
expected these will not present a large diversion from the
flavour of the language.

Note that the current code layout allows for experimentation
with alternate caching systems (i.e. other than Redis) and
alternate query languages (other than described here).

The default basic localhost setup, aims for a transparent
integration such that users do not need to know or configure
Redis (or other caching system backend) themselves.  However
more complex multi-host setups will of course require deeper
knowledge of configuration of backend caching system.  Maybe
Redis cloud services can relieve this such that provisioning
additional monitoring resources becomes simple.  We'll see.


Enjoy!
-- Nathan
