Feature: System Health

Garrett LeSage edited this page May 28, 2018 · 4 revisions

System health overlaps with PCP integration without being contained by it: health includes metrics that are only available through PCP, as well as metrics that do not depend on PCP.

Goals

  • Display system status at a glance
  • Provide ways to easily fix issues
  • When an easy fix is not possible, provide information on how to resolve a problem

Examples

When no health issues are detected, the system should report a healthy state. The lists below are not exhaustive (more issues could be added later); they are a starter set of possible problems with a system.

General health issues

Some of these issues have simple solutions that Cockpit can apply automatically.

  1. Security-related software updates available
    • Click to view the software updates page
  2. Issues mounting filesystems (as specified in fstab, etc.)
    • Display issues with the filesystem mounts, along with errors while mounting
  3. Insufficient storage space on partitions
    • Show partitions with small amounts of space
  4. SMART issues
    • Display issue, which may include:
      • Bad sectors on a disk
      • IO issues
  5. Swap is currently active
    • Display warning that swap is active (PCP needs to be installed for more details; see below)
  6. Issues with bringing up network interfaces
    • Show problematic network interfaces; click to switch to the network page
  7. Enabled & running systemd service keeps restarting
    • Click to display service's page with its log visible
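
Two of the checks above (insufficient storage space and active swap) can be sketched without PCP. The following is a minimal illustration only, not Cockpit's implementation: the function names, the 10% free-space threshold, and the injectable `usage` parameter are assumptions made for this example. It relies on Python's standard `shutil.disk_usage` and on the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` on Linux.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_in_use_kb(meminfo_text):
    """Return the amount of swap currently in use, in kB (0 if no swap is in use)."""
    info = parse_meminfo(meminfo_text)
    return max(info.get("SwapTotal", 0) - info.get("SwapFree", 0), 0)


def low_space_mounts(mount_points, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return (mount, free_fraction) pairs for mounts below the free-space threshold.

    The 10% default is an assumed value for illustration, not one taken from the design.
    The `usage` callable is injectable so the check can be exercised without real disks.
    """
    issues = []
    for mount in mount_points:
        total, _used, free = usage(mount)
        if total > 0 and free / total < min_free_fraction:
            issues.append((mount, free / total))
    return issues
```

On a Linux host, this could be called as `swap_in_use_kb(open("/proc/meminfo").read())` and `low_space_mounts(["/", "/home"])`; which mount points to inspect is left to the caller here.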

PCP-derived health issues

Several detectable issues require PCP to be installed in order to be reported accurately and/or usefully. Most of these will not have a simple one-click solution; instead, they will require displaying information and/or digging a bit further.

  1. CPU load is constantly too high
    • Identify top offenders over a window of time and provide actions to stop/restart services and/or kill processes
  2. Not enough memory is free
    • Identify top offenders and provide actions (similar to CPU load), suggest upgrading RAM
  3. Swap is often used (PCP-enhanced version of swap rule above)
    • Related to not enough available memory issue (above)
    • Show top memory offenders while swap is active (this should help identify the offenders)
  4. Network is constantly saturated
    • Show processes transferring the most data
  5. Excessive waiting for storage (disk is >85% busy)
  6. Huge page fragmentation/defragmentation (memory is fragmented and system is spending a lot of time shuffling chunks of memory around to defragment)
  7. Network errors exist
  8. Packet receive (RX) queue is too small, causing many packets to be dropped
    • Provide a means to specify a new queue length, with a suggested default
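
Several rules above ("constantly too high", "often used", "constantly saturated") depend on a metric staying over a limit across a window of time rather than on a single reading. One way to express that is sketched below, assuming the samples have already been collected (for example from PCP archives) as a plain list of numbers. The function name and the 90% `min_fraction` default are hypothetical; only the 85% disk-busy figure comes from the list above.

```python
def is_sustained(samples, threshold, min_fraction=0.9):
    """Return True if at least `min_fraction` of the samples exceed `threshold`.

    `samples` is a sequence of readings taken over the window of interest.
    An empty window is treated as "no issue" rather than as an error.
    """
    if not samples:
        return False
    over = sum(1 for value in samples if value > threshold)
    return over / len(samples) >= min_fraction


# Hypothetical disk-busy percentages sampled over one window.
disk_busy = [90, 92, 88, 95, 91, 60, 93, 94, 96, 97]
storage_wait_issue = is_sustained(disk_busy, threshold=85)
```

Requiring a fraction of the window, rather than every sample, keeps a single quiet reading from hiding an otherwise persistent problem, and keeps a single spike from raising one.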

Red Hat Insights integration

Another source of system information would be Red Hat Insights, which could be integrated for registered RHEL machines that have access to the service.

(This would be in addition to the non-PCP and PCP health information above.)

Mockups

(Mockups are rough sketches and are not intended to be finalized or "pixel-perfect".)
