Research: PCP Integration

Intro

PCP, short for "Performance Co-Pilot", is a system performance and analysis framework. It collects performance-related statistics from multiple hosts and operating systems, including popular Linux distributions, UNIX variants, Windows, and macOS.

Through its plugin architecture, PCP not only records host data (disk, network, memory) but also collects stats from Apache, MySQL, the Java VM, KVM, etc.

PCP is used for both live and historical data.

Using PCP in Cockpit

View performance metrics of one host

Without PCP, Cockpit only displays statistics gathered while it is running. When PCP is installed, statistics can also be pulled from before the user signed into Cockpit.

This is already implemented to some extent.

However, there are a few issues currently:

It is not obvious whether the charts in Cockpit are backed by an installed and active PCP or are the standard charts
It is not obvious that installing PCP will improve the charts in Cockpit

Time

Graphs in Cockpit extend to the past 5 minutes (this works with and without PCP installed).

PCP would let us look for specific historical events as well. We might want to consider checking the past 24 hours for problematic activity.

Examples:

Low available memory
Active swap
High load average
High disk IO
Network-related issues (latency, outages, etc.)

The timeframe for these warnings should probably be somewhere between the past 24 hours and the past week.
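As a rough illustration of the kind of check described above, here is a minimal sketch that scans recorded samples for threshold crossings and groups them into events. It does not talk to PCP; the Sample structure, the find_events and recent helpers, and the threshold value are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Sample:
    """One reading of a metric at a point in time (hypothetical structure)."""
    time: datetime
    value: float


def find_events(samples, threshold, above=True):
    """Group consecutive threshold-crossing samples into (start, end, peak) events.

    With above=False the check is inverted (useful for "low available memory"),
    and "peak" is the lowest value seen during the event.
    """
    events = []
    current = None  # [start, end, peak] of the event being built
    for s in samples:
        crossing = s.value > threshold if above else s.value < threshold
        if crossing:
            if current is None:
                current = [s.time, s.time, s.value]
            else:
                current[1] = s.time
                pick = max if above else min
                current[2] = pick(current[2], s.value)
        elif current is not None:
            events.append(tuple(current))
            current = None
    if current is not None:
        events.append(tuple(current))
    return events


def recent(samples, now, window=timedelta(hours=24)):
    """Keep only the samples that fall inside the look-back window."""
    return [s for s in samples if now - window <= s.time <= now]


# Made-up data: load average sampled once a minute, with a short spike.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
load = [Sample(now - timedelta(minutes=m), 9.5 if 30 <= m <= 32 else 0.7)
        for m in range(120, -1, -1)]
high_load = find_events(recent(load, now), threshold=4.0)
```

Widening the window argument to timedelta(days=7) would cover the "past week" end of the range. Real thresholds would need to be chosen per metric and probably per machine.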

Detailed performance data

View simple PCP-based stats from the current machine. We're not going to go down the ultra-configurable route like Grafana. (If people want that, Grafana exists and can be used in parallel.)

This would probably replace the separate CPU/Memory/Storage views.

It could be filterable to certain processes, containers, network interfaces, etc.

Install PCP

Installation for usage within Cockpit would probably be done through the upcoming PackageKit library.
As PCP is useful not just for Cockpit, but for other tools, we could consider adding it to the "Applications" section.

Combined statistics from multiple hosts

Simply combining all the data from multiple machines gets noisy.

Show exceptional events from various servers here as well, similar to the host-specific view.
Indicate from a group of machines which one is (or has been) acting differently.
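One possible way to pick out the machine that is "acting differently" is to compare each host against the group median. The sketch below uses a modified z-score based on the median absolute deviation (MAD); this is just one candidate technique, and the unusual_hosts function, host names, and cutoff of 3.5 are assumptions for illustration.

```python
from statistics import median


def unusual_hosts(values_by_host, cutoff=3.5):
    """Return hosts whose value is far from the group median.

    Uses the modified z-score built on the median absolute deviation (MAD),
    which is less distorted by the outlier itself than mean and stddev.
    """
    values = list(values_by_host.values())
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        # At least half of the hosts report the same value; flag any that differ.
        return sorted(h for h, v in values_by_host.items() if v != mid)
    return sorted(
        h for h, v in values_by_host.items()
        if 0.6745 * abs(v - mid) / mad > cutoff
    )


# Made-up load averages for a group of hosts.
load = {"web1": 0.8, "web2": 1.1, "web3": 0.9, "web4": 7.6, "web5": 1.0}
odd = unusual_hosts(load)
```

Running the same check over historical samples (rather than only the latest value) would cover the "or has been" case.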

New features for Cockpit metrics

In addition to modifying our charts, we may want to consider the following more advanced features:

Auto-installation of PCP (manually triggered by a suggestion somewhere in the UI)
Review past 24 hours (week too?) in a sped-up playback
Show exceptional data (spikes and the times of spikes)
Filtering by process, container, network interface, etc.
Instead of customization, have different modes of charts in tabs? (Example: Flip between representations of CPUs.)

What can you do with PCP?

Performance Co-Pilot can do a lot. It's highly modular and configurable.

We'll want to carefully pick some of these (seemingly common) tasks without going overboard. The following lists cover tasks that can be accomplished by using PCP from the command line or within an existing interface.

Live metrics

Display enabled performance metrics
- with short descriptions
- give detailed information about each performance metric and current values
Monitor metrics on a host
- disk write operations per partition
- CPU load
- memory usage
- disk write operations
Show process creation rate and unavailable versus available memory
Monitor metrics from multiple hosts
Compare metrics to help understand what happens on a system at a given time
- example: swap happens, IO spikes, CPU load increases, network traffic goes down
Display running processes in a given time window

Historical metrics

Everything that live metrics has, plus:

View host over a given time period (with the correct timezone)
- Translate timezone to another timezone (ex: viewing a server in Europe in US/Eastern time)
Adjust "zoom" level
- graphs could indicate 10 minutes, 1 hour, 12 hours, 1 day, 1 week, 1 month
Replay history in a given time range
- sped up (real-time would probably be too slow in most cases)
- scrubbing (moving back and forth in a timeline like an audio editor)
Show average/mean, min, max values of performance (CPU load, memory usage, disk IO, etc.) over a given timeframe
Summarize differences between different timeframes
- example: easily determine "Monday mornings have heavier load than Sunday evenings"
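The summary and comparison items above could be sketched roughly as follows. This works on plain (time, value) pairs rather than real PCP archives; the summarize and compare helpers and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def summarize(samples, start, end):
    """Return min, mean, and max of (time, value) samples with start <= time < end."""
    values = [v for t, v in samples if start <= t < end]
    if not values:
        return None
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def compare(samples, window_a, window_b):
    """Difference in mean between two (start, end) windows, b minus a."""
    a = summarize(samples, *window_a)
    b = summarize(samples, *window_b)
    if a is None or b is None:
        return None
    return b["mean"] - a["mean"]


# Made-up hourly load samples covering Sunday 2024-01-07 and Monday 2024-01-08,
# with a busier Monday morning.
sunday = datetime(2024, 1, 7, tzinfo=timezone.utc)
samples = []
for hour in range(48):
    t = sunday + timedelta(hours=hour)
    busy = t.weekday() == 0 and 8 <= t.hour < 12  # weekday() == 0 is Monday
    samples.append((t, 3.0 if busy else 1.0))

sunday_evening = (sunday + timedelta(hours=18), sunday + timedelta(hours=22))
monday_morning = (sunday + timedelta(hours=32), sunday + timedelta(hours=36))
difference = compare(samples, sunday_evening, monday_morning)
```

A positive difference here means the second window (Monday morning) had the heavier average load.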

Filtered views (applies to live and historical)

Process-specific view filters
- process ID
- process name
Network interface(s) filters
Container-specific views
- cgroup accounting
  - disk IO: IOPs/bytes, service / wait time, aggregate / per device
  - CPU accounting: per-cgroup processor usage, aggregate CPU usage
  - memory: mapped anon pages, page cache, writeback, swap, active/inactive
- namespaces: show content & processes that differ inside versus outside of a container

Recommendations

Avoid using the term "PCP" in the interface; the abbreviation is problematic because it is also the common name for phencyclidine (aka "angel dust")
Rework host summary page to de-emphasize charts and highlight essential machine information first
- Use the rest of available space for overview charts
Make it obvious when PCP isn't available (by default) versus when it is available
- Also make it easy to install PCP when it isn't available
Main graphs should have a quick overview of recent data
Secondary page should have more in-depth information
- Possibly special views including filtering, containers, etc.
- Consider adding playback mode
- Add summary highlights