Feature: System Health

System health overlaps with PCP (Performance Co-Pilot) integration: it includes health metrics that are available only through PCP, as well as metrics that do not depend on PCP.

Goals

Display system status at a glance
Provide ways to easily fix issues
When an easy fix is not possible, provide information on how to resolve a problem

Examples

When no health issues are detected, the system should report a healthy state. The lists below are not exhaustive (more issues could be added later); they are a starter list of possible issues with a system.
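
As a rough illustration of the "healthy unless something is detected" behavior, the sketch below runs a set of checks and reports a healthy state only when none of them return an issue. The `Issue` class, the `collect_health` function, and the status strings are hypothetical names chosen for this example; they are not part of Cockpit.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """One detected health problem (hypothetical structure)."""
    summary: str      # short text shown at a glance
    resolution: str   # a fix action, or guidance when no easy fix exists


def collect_health(checks):
    """Run every check and return (status, issues).

    Each check is a callable that returns a list of Issue objects,
    or an empty list when it finds nothing wrong.
    """
    issues = []
    for check in checks:
        issues.extend(check())
    status = "healthy" if not issues else "issues detected"
    return status, issues
```

New checks can be added to the list passed to `collect_health` without changing how the overall status is computed, which fits the expectation that more issues will be added later.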

General health issues

Some of these issues have simple fixes that Cockpit can apply automatically.

Security-related software updates available
- Click to view the software updates page
Issues mounting filesystems (as specified in fstab, etc.)
- Display issues with the filesystem mounts, along with errors while mounting
Insufficient storage space on partitions
- Show partitions that are low on free space
SMART issues
- Display the issue, which may include:
  - Bad sectors on a disk
  - IO issues
Swap is currently active
- Display warning that swap is active (PCP needs to be installed for more details; see below)
Issues with bringing up network interfaces
- Show problematic network interfaces; click to switch to the network page
Enabled & running systemd service keeps restarting
- Click to display the service's page with its log visible
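
Two of the checks above (low storage space and active swap) can be sketched without PCP. The example below is a minimal sketch, not Cockpit's implementation: the function names and the 10% free-space threshold are assumptions made for illustration. It uses Python's standard `shutil.disk_usage` and parses text in the format of `/proc/meminfo` on Linux.

```python
import shutil


def low_space_mounts(mount_points, min_free_fraction=0.10,
                     usage=shutil.disk_usage):
    """Return (mount_point, free_fraction) pairs below the threshold.

    The 10% default is an assumed threshold. The usage parameter exists
    only so the lookup can be replaced in tests.
    """
    flagged = []
    for mount in mount_points:
        stats = usage(mount)
        if stats.total == 0:
            continue  # skip pseudo filesystems that report no capacity
        free_fraction = stats.free / stats.total
        if free_fraction < min_free_fraction:
            flagged.append((mount, free_fraction))
    return flagged


def swap_in_use_kib(meminfo_text):
    """Return swap currently in use, in KiB, from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
```

On a Linux host, `swap_in_use_kib(open("/proc/meminfo").read()) > 0` would correspond to the "Swap is currently active" warning, and `low_space_mounts(["/", "/home"])` to the storage check; the list of mount points would have to come from elsewhere.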

PCP-derived health issues

Several detectable issues require PCP to be installed for detection to be accurate or useful. Most of these will not have a simple one-click solution; instead, they require displaying information and investigating further.

CPU load is constantly too high
- Identify top offenders over a window of time and provide actions to stop/restart services and/or kill processes
Not enough memory is free
- Identify top offenders and provide actions (similar to CPU load), suggest upgrading RAM
Swap is often used (PCP-enhanced version of swap rule above)
- Related to the "Not enough memory is free" issue (above)
- Show top memory offenders while swap is active (this should help identify the offenders)
Network is constantly saturated
- Show processes transferring the most data
Excessive waiting for storage (disk is >85% busy)
Huge page fragmentation/defragmentation (memory is fragmented and system is spending a lot of time shuffling chunks of memory around to defragment)
Network errors exist
Packet receive (RX) queue is too small, causing many packets to be dropped
- Provide a means to specify a new queue length, with a suggested default
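
Several of the rules above depend on a condition holding over a window of time (for example, "constantly too high" or ">85% busy") and on ranking the top offenders in that window. The sketch below shows one possible way to express both, assuming the samples have already been collected, for example from PCP; the collection itself is not shown. The function names and the 90% `min_fraction` default are assumptions for illustration only.

```python
def is_sustained(samples, threshold, min_fraction=0.9):
    """Return True when at least min_fraction of samples exceed threshold.

    samples is a sequence of readings taken over the window, such as
    disk busy percentages. An empty window is never treated as sustained.
    """
    if not samples:
        return False
    over = sum(1 for value in samples if value > threshold)
    return over / len(samples) >= min_fraction


def top_offenders(per_process_samples, count=3):
    """Rank processes by their mean usage over the window.

    per_process_samples maps a process name to its list of readings.
    Returns up to count (name, mean) pairs, highest mean first.
    """
    means = {
        name: sum(values) / len(values)
        for name, values in per_process_samples.items()
        if values
    }
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]
```

With disk busy percentages as input, `is_sustained(samples, 85)` would correspond to the "Excessive waiting for storage" rule. Requiring a fraction of the window, rather than a single reading, avoids raising an issue on a brief spike.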