Run Cluster Diagnostics in Linux
Use this guide to diagnose and troubleshoot issues in a Redpanda cluster running in Linux.
Collect all debugging data
For a comprehensive diagnostic snapshot, generate a debug bundle that collects detailed data for cluster, broker, or node analysis.
See Generate a Debug Bundle with rpk in Linux for details on generating a debug bundle.
Self-test benchmarks
When anomalous behavior arises in a cluster, you can determine whether it’s caused by hardware issues, such as problems with disk drives or network interfaces (NICs), by running rpk cluster self-test to assess their performance and compare it to vendor specifications.
The rpk cluster self-test command runs a set of benchmarks to gauge the maximum performance of a machine’s disks and network connections:
- Disk tests: Measure throughput and latency by performing concurrent sequential operations.
- Network tests: Select unique pairs of Redpanda brokers as client/server pairs and run throughput tests between them.
Each benchmark runs for a configurable duration and returns IOPS, throughput, and latency metrics. This helps you determine if hardware performance aligns with expected vendor specifications.
Cloud storage tests
You can also use the self-test command to confirm your cloud storage is configured correctly for Tiered Storage.
Self-test performs the following checks to validate cloud storage configuration:
- Upload an object (a random buffer of 1024 bytes) to the cloud storage bucket/container.
- List objects in the bucket/container.
- Download the uploaded object from the bucket/container.
- Download the uploaded object’s metadata from the bucket/container.
- Delete the uploaded object from the bucket/container.
- Upload and then delete multiple objects (random buffers) at once from the bucket/container.
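The order of these checks can be sketched in Python against a stand-in bucket. This is an illustration only, not how Redpanda implements the tests: InMemoryBucket and run_cloud_checks are hypothetical names, an in-memory dictionary replaces a real object store, and the object keys and the number of objects in the bulk step are assumptions.

```python
import os


class InMemoryBucket:
    """Hypothetical stand-in for a cloud storage bucket/container."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def list(self):
        return sorted(self._objects)

    def get(self, key):
        return self._objects[key]

    def head(self, key):
        # Metadata only; in this sketch, just the object size.
        return {"size": len(self._objects[key])}

    def delete(self, key):
        del self._objects[key]

    def delete_many(self, keys):
        for key in keys:
            self._objects.pop(key, None)


def run_cloud_checks(bucket, bulk_count=3):
    """Run the six checks in order and return the names of those that passed."""
    passed = []
    key = "self-test-object"    # assumed key name
    payload = os.urandom(1024)  # random buffer of 1024 bytes

    bucket.put(key, payload)
    passed.append("Put")
    if key in bucket.list():
        passed.append("List")
    if bucket.get(key) == payload:
        passed.append("Get")
    if bucket.head(key)["size"] == len(payload):
        passed.append("Head")
    bucket.delete(key)
    if key not in bucket.list():
        passed.append("Delete")

    # Upload several objects, then remove them with one bulk call.
    bulk_keys = [f"self-test-bulk-{i}" for i in range(bulk_count)]
    for bulk_key in bulk_keys:
        bucket.put(bulk_key, os.urandom(16))
    bucket.delete_many(bulk_keys)
    if all(bulk_key not in bucket.list() for bulk_key in bulk_keys):
        passed.append("Plural Delete")
    return passed
```

A failure at any step in a sequence like this points at a specific permission or configuration gap, for example a bucket that allows uploads but not listing.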
For more information on cloud storage test details, see the rpk cluster self-test start reference.
Start self-test
To start using self-test, run the self-test start command. Only initiate self-test start when system resources are available, as this operation can be resource-intensive.
rpk cluster self-test start
For command help, run rpk cluster self-test start -h. For additional command flags, see the rpk cluster self-test start reference.
Before self-test start begins, it requests your confirmation to run its potentially large workload.
Example start output:
? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
rpk cluster self-test status
The self-test start command returns immediately, and self-test runs its benchmarks asynchronously.
Check self-test status
To check the status of self-test, run the self-test status command.
rpk cluster self-test status
For command help, run rpk cluster self-test status -h. For additional command flags, see the rpk cluster self-test status reference.
If benchmarks are currently running, self-test status returns a test-in-progress message.
Example status output:
$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs
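Because self-test runs asynchronously, a script may need to wait for the in-progress message to go away before reading results. The following is a minimal sketch of such a polling loop. It assumes rpk is on the PATH and already configured for the cluster, and it detects the in-progress state by matching the text "still running jobs" from the example output above. The function names, polling interval, and poll limit are illustrative choices, not part of rpk.

```python
import subprocess
import time


def run_status():
    """Run `rpk cluster self-test status` and return its standard output."""
    result = subprocess.run(
        ["rpk", "cluster", "self-test", "status"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_for_results(get_status=run_status, interval_s=5.0, max_polls=120):
    """Poll until the in-progress message disappears, then return the output.

    Assumption: the in-progress message contains "still running jobs",
    as in the example output. Adjust the match if your version differs.
    """
    for _ in range(max_polls):
        output = get_status()
        if "still running jobs" not in output:
            return output
        time.sleep(interval_s)
    raise TimeoutError("self-test did not finish within the polling budget")
```

Passing the status function as a parameter keeps the loop testable without a live cluster.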
If benchmarks have completed, self-test status returns their results.
Test results are grouped by broker ID. Each test returns the following:
- Name: Description of the test.
- Info: Details about the test run attached by Redpanda.
- Type: Either disk, network, or cloud test.
- Test Id: Unique identifier given to jobs of a run. All IDs in a test run should match. If they don’t, results from a newer or older run have been included erroneously.
- Timeouts: Number of timeouts incurred during the test.
- Start time: Time that the test started, in UTC.
- End time: Time that the test ended, in UTC.
- Avg Duration: Duration of the test.
- IOPS: Number of operations per second. For disk, the operations are seastar::dma_read and seastar::dma_write. For network, the operation is rpc.send().
- Throughput: For disk, the throughput rate is in bytes per second. For network, the throughput rate is in bits per second. The difference in notation in the output is intentional: GiB denotes gibibytes and Gib denotes gibibits.
- Latency: Percentiles of operation latency, reported in microseconds (μs) as P50, P90, P99, and P999 (the 50th, 90th, 99th, and 99.9th percentiles), along with the maximum (MAX).
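As a rough consistency check on these fields, throughput should be close to IOPS multiplied by the request size. The sketch below assumes that the size in a test name (such as 512KB or 8K) is the request size and that it uses 1024-based units. Neither assumption is stated by self-test itself, but both agree with the example results in this guide. The function names are illustrative.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def disk_throughput_bytes_per_sec(iops, request_size_bytes):
    """Disk throughput is reported in bytes per second."""
    return iops * request_size_bytes


def network_throughput_bits_per_sec(iops, request_size_bytes):
    """Network throughput is reported in bits per second (8 bits per byte)."""
    return iops * request_size_bytes * 8


# Disk: a 512KB write run reported at 1182 req/sec.
disk_mib = disk_throughput_bytes_per_sec(1182, 512 * KIB) / MIB
print(f"{disk_mib:.1f} MiB/sec")  # prints "591.0 MiB/sec"

# Network: an 8K run reported at 61612 req/sec.
net_gib = network_throughput_bits_per_sec(61612, 8 * KIB) / GIB
print(f"{net_gib:.2f} Gib/sec")  # prints "3.76 Gib/sec"
```

The disk figure lands slightly below the reported 591.4MiB/sec, presumably because the displayed IOPS value is rounded. A large gap between the two numbers would suggest that results from different runs have been mixed.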
If Tiered Storage is not enabled, then cloud storage tests do not run, and a warning displays: "Cloud storage is not enabled." All results are shown as 0.
Example status output: test results
$ rpk cluster self-test status
NODE ID: 0 | STATUS: IDLE
=========================
NAME 512KB sequential r/w
INFO write run (iodepth: 4, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:02:45 UTC 2024
END TIME Fri Jul 19 15:03:15 UTC 2024
AVG DURATION 30002ms
IOPS 1182 req/sec
THROUGHPUT 591.4MiB/sec
LATENCY P50 P90 P99 P999 MAX
3199us 3839us 9727us 12799us 21503us
NAME 512KB sequential r/w
INFO read run
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:03:15 UTC 2024
END TIME Fri Jul 19 15:03:45 UTC 2024
AVG DURATION 30000ms
IOPS 6652 req/sec
THROUGHPUT 3.248GiB/sec
LATENCY P50 P90 P99 P999 MAX
607us 671us 831us 991us 2431us
NAME 4KB sequential r/w, low io depth
INFO write run (iodepth: 1, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:03:45 UTC 2024
END TIME Fri Jul 19 15:04:15 UTC 2024
AVG DURATION 30000ms
IOPS 406 req/sec
THROUGHPUT 1.59MiB/sec
LATENCY P50 P90 P99 P999 MAX
2431us 2559us 2815us 5887us 9215us
NAME 4KB sequential r/w, low io depth
INFO read run
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:04:15 UTC 2024
END TIME Fri Jul 19 15:04:45 UTC 2024
AVG DURATION 30000ms
IOPS 430131 req/sec
THROUGHPUT 1.641GiB/sec
LATENCY P50 P90 P99 P999 MAX
1us 2us 12us 28us 511us
NAME 4KB sequential write, medium io depth
INFO write run (iodepth: 8, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:04:45 UTC 2024
END TIME Fri Jul 19 15:05:15 UTC 2024
AVG DURATION 30013ms
IOPS 513 req/sec
THROUGHPUT 2.004MiB/sec
LATENCY P50 P90 P99 P999 MAX
15871us 16383us 21503us 32767us 40959us
NAME 4KB sequential write, high io depth
INFO write run (iodepth: 64, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:05:15 UTC 2024
END TIME Fri Jul 19 15:05:45 UTC 2024
AVG DURATION 30114ms
IOPS 550 req/sec
THROUGHPUT 2.151MiB/sec
LATENCY P50 P90 P99 P999 MAX
118783us 118783us 147455us 180223us 180223us
NAME 4KB sequential write, very high io depth
INFO write run (iodepth: 256, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:05:45 UTC 2024
END TIME Fri Jul 19 15:06:16 UTC 2024
AVG DURATION 30460ms
IOPS 558 req/sec
THROUGHPUT 2.183MiB/sec
LATENCY P50 P90 P99 P999 MAX
475135us 491519us 507903us 524287us 524287us
NAME 4KB sequential write, no dsync
INFO write run (iodepth: 64, dsync: false)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:06:16 UTC 2024
END TIME Fri Jul 19 15:06:46 UTC 2024
AVG DURATION 30000ms
IOPS 424997 req/sec
THROUGHPUT 1.621GiB/sec
LATENCY P50 P90 P99 P999 MAX
135us 183us 303us 543us 9727us
NAME 16KB sequential r/w, high io depth
INFO write run (iodepth: 64, dsync: false)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:06:46 UTC 2024
END TIME Fri Jul 19 15:07:16 UTC 2024
AVG DURATION 30000ms
IOPS 103047 req/sec
THROUGHPUT 1.572GiB/sec
LATENCY P50 P90 P99 P999 MAX
511us 1087us 1343us 1471us 15871us
NAME 16KB sequential r/w, high io depth
INFO read run
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
START TIME Fri Jul 19 15:07:16 UTC 2024
END TIME Fri Jul 19 15:07:46 UTC 2024
AVG DURATION 30000ms
IOPS 193966 req/sec
THROUGHPUT 2.96GiB/sec
LATENCY P50 P90 P99 P999 MAX
319us 383us 735us 1023us 6399us
NAME 8K Network Throughput Test
INFO Test performed against node: 1
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 61612 req/sec
THROUGHPUT 3.76Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 207us 303us 431us 1151us
NAME 8K Network Throughput Test
INFO Test performed against node: 2
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 60306 req/sec
THROUGHPUT 3.68Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 215us 351us 495us 11263us
NAME Cloud Storage Test
INFO Put
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 8ms
NAME Cloud Storage Test
INFO List
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 1ms
NAME Cloud Storage Test
INFO Get
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 1ms
NAME Cloud Storage Test
INFO Head
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 0ms
NAME Cloud Storage Test
INFO Delete
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 1ms
NAME Cloud Storage Test
INFO Plural Delete
TYPE cloud
TEST ID a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS 0
START TIME Tue Jul 16 18:06:30 UTC 2024
END TIME Tue Jul 16 18:06:30 UTC 2024
AVG DURATION 47ms
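Text output in this layout can be turned into records for scripted comparison. The following sketch is a hypothetical parser, not part of rpk: the field list is taken from the example above, latency values are assumed to carry a "us" suffix, and the layout of real output may vary between versions. It requires Python 3.9 or later for str.removesuffix.

```python
# Field names as they appear in the example output. "AVG DURATION" is
# listed before "DURATION" only for readability; neither is a prefix
# of the other at the start of a line.
FIELDS = (
    "NAME", "INFO", "TYPE", "TEST ID", "TIMEOUTS", "START TIME",
    "END TIME", "AVG DURATION", "DURATION", "IOPS", "THROUGHPUT",
)
LATENCY_LABELS = ("P50", "P90", "P99", "P999", "MAX")


def parse_status(text):
    """Return a list of test records (dicts) parsed from status text."""
    records = []
    current = None
    expect_latency = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if expect_latency:
            # The line after the LATENCY header holds the five values.
            current["LATENCY_US"] = {
                label: int(value.removesuffix("us"))
                for label, value in zip(LATENCY_LABELS, line.split())
            }
            expect_latency = False
            continue
        if line.startswith("LATENCY"):
            expect_latency = current is not None
            continue
        for field in FIELDS:
            if line.startswith(field + " "):
                if field == "NAME":  # NAME opens a new test record
                    current = {}
                    records.append(current)
                if current is not None:
                    current[field] = line[len(field):].strip()
                break
    return records


sample = """\
NODE ID: 0 | STATUS: IDLE
=========================
NAME 512KB sequential r/w
INFO write run (iodepth: 4, dsync: true)
TYPE disk
TEST ID 21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS 0
AVG DURATION 30002ms
IOPS 1182 req/sec
THROUGHPUT 591.4MiB/sec
LATENCY P50 P90 P99 P999 MAX
3199us 3839us 9727us 12799us 21503us
"""
records = parse_status(sample)
print(records[0]["TYPE"], records[0]["LATENCY_US"]["P99"])  # prints "disk 9727"
```

Lines that match no known field, such as the node header and separator, are ignored rather than treated as errors.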
If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, and this represents the actual setting that Redpanda uses when it writes to disk.
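To see the effect of O_DSYNC in isolation, the following sketch times a series of sequential writes with and without the flag. It is not a reproduction of the self-test benchmark, which uses Seastar DMA operations. The function name, block size, and write count are illustrative, and it assumes a Linux system where os.O_DSYNC is available. On a memory-backed temporary directory the two timings may be nearly identical.

```python
import os
import tempfile
import time


def time_writes(use_dsync, block_size=4096, count=200):
    """Write `count` blocks sequentially and return the elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if use_dsync:
        # With O_DSYNC, each write returns only after the data has been
        # transferred to the underlying storage.
        flags |= os.O_DSYNC
    block = os.urandom(block_size)
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "dsync-demo.bin")
        fd = os.open(path, flags, 0o600)
        try:
            start = time.perf_counter()
            for _ in range(count):
                os.write(fd, block)
            elapsed = time.perf_counter() - start
        finally:
            os.close(fd)
    return elapsed


if __name__ == "__main__":
    print(f"O_DSYNC off: {time_writes(False):.4f}s")
    print(f"O_DSYNC on:  {time_writes(True):.4f}s")
```

On a disk-backed filesystem, the run with O_DSYNC is typically much slower, which mirrors the gap between the dsync: true write results and the read results.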
Stop self-test
To stop a running self-test, run the self-test stop command.
rpk cluster self-test stop
Example stop output:
$ rpk cluster self-test stop
All self-test jobs have been stopped
For command help, run rpk cluster self-test stop -h. For additional command flags, see the rpk cluster self-test stop reference.
For more details about self-test, including command flags, see rpk cluster self-test.
Next steps
Learn how to resolve common errors.