I might have stumbled on a possible source of the random errors in my drive tests. Yesterday’s test had a single drive failure, similar to previous failures: one of the low-hour drives reported several thousand write errors and millions of read errors. I decided to have a look at the kernel message log and started seeing regular instances of this:
[…] sd 7:0:2:0: Power-on or device reset occurred
[…] sd 7:0:4:0: Power-on or device reset occurred
[…] sd 7:0:1:0: Power-on or device reset occurred
[…] sd 7:0:0:0: Power-on or device reset occurred
These resets are happening on the devices connected to the new hard drive controller. I suspect either a bad cable (power or SATA) or something wrong with this brand-new controller.
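To get a feel for how often each device resets, the messages can be tallied per SCSI address. This is only a rough sketch: the function name `count_resets` and the `sample_log` list are my own illustration, and the sample lines are just the four messages quoted above with their elided timestamps left as-is. In practice the lines would come from the kernel log.

```python
import re
from collections import Counter

# Matches kernel messages such as "sd 7:0:2:0: Power-on or device reset occurred"
# and captures the SCSI address (host:channel:target:lun).
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def count_resets(log_lines):
    """Return a Counter mapping SCSI address to the number of reset messages."""
    counts = Counter()
    for line in log_lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical input: the four lines quoted in this post, timestamps elided.
sample_log = [
    "[…] sd 7:0:2:0: Power-on or device reset occurred",
    "[…] sd 7:0:4:0: Power-on or device reset occurred",
    "[…] sd 7:0:1:0: Power-on or device reset occurred",
    "[…] sd 7:0:0:0: Power-on or device reset occurred",
]

reset_counts = count_resets(sample_log)
for address, count in sorted(reset_counts.items()):
    print(address, count)
```

Run against a full log, a device that resets far more often than its neighbors would stand out, which might help separate a single bad cable from a controller-wide problem.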
Output of lsscsi:
[7:0:0:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sda
[7:0:1:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sdb
[7:0:2:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sdc
[7:0:3:0] disk ATA ST4000DM004-2CV1 0001 /dev/sdd
[7:0:4:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sde
[7:0:5:0] disk ATA V-32 0825 /dev/sdf
Problems are reported only for sda, sdb, sdc, and sde. No problems are reported on sdd or sdf.
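The same matching of reset addresses to device nodes can be sketched in code. This assumes the single-line lsscsi layout shown above (address in brackets, device type, a description, then the device node); the names `parse_lsscsi` and `reset_addresses` are hypothetical, and the address set is copied by hand from the reset messages.

```python
import re

# One lsscsi line: [H:C:T:L] type vendor/model/revision device-node
LSSCSI_RE = re.compile(
    r"^\[(?P<addr>[^\]]+)\]\s+(?P<type>\S+)\s+(?P<desc>.*?)\s+(?P<node>/dev/\S+)$"
)


def parse_lsscsi(lines):
    """Return {SCSI address: (description, device node)} for each parsable line."""
    devices = {}
    for line in lines:
        match = LSSCSI_RE.match(line.strip())
        if match:
            devices[match["addr"]] = (match["desc"], match["node"])
    return devices


lsscsi_output = [
    "[7:0:0:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sda",
    "[7:0:1:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sdb",
    "[7:0:2:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sdc",
    "[7:0:3:0] disk ATA ST4000DM004-2CV1 0001 /dev/sdd",
    "[7:0:4:0] disk ATA HGST HMS5C4040AL A3W0 /dev/sde",
    "[7:0:5:0] disk ATA V-32 0825 /dev/sdf",
]

# Addresses seen in the reset messages from the kernel log.
reset_addresses = {"7:0:0:0", "7:0:1:0", "7:0:2:0", "7:0:4:0"}

devices = parse_lsscsi(lsscsi_output)
affected = sorted(n for a, (_, n) in devices.items() if a in reset_addresses)
unaffected = sorted(n for a, (_, n) in devices.items() if a not in reset_addresses)
print("resetting:", affected)
print("quiet:", unaffected)
```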
To map each SCSI device to its controller, I used lsscsi -v, which produced:
[0:0:0:0] … /dev/sdg […02:00.1/ata1/host0/target0:0:0/0:0:0:0]
[1:0:0:0] … /dev/sdh […02:00.1/ata2/host1/target1:0:0/1:0:0:0]
[2:0:0:0] … /dev/sdi […02:00.1/ata3/host2/target2:0:0/2:0:0:0]
[3:0:0:0] … /dev/sdj […02:00.1/ata4/host3/target3:0:0/3:0:0:0]
[4:0:0:0] … /dev/sdk […02:00.1/ata5/host4/target4:0:0/4:0:0:0]
[5:0:0:0] … /dev/sdl […02:00.1/ata6/host5/target5:0:0/5:0:0:0]
[6:0:0:0] … /dev/sdm […02:00.1/ata7/host6/target6:0:0/6:0:0:0]
[7:0:0:0] … /dev/sda […08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0]
[7:0:1:0] … /dev/sdb […08:00.0/host7/port-7:1/end_device-7:1/target7:0:1/7:0:1:0]
[7:0:2:0] … /dev/sdc […08:00.0/host7/port-7:2/end_device-7:2/target7:0:2/7:0:2:0]
[7:0:3:0] … /dev/sdd […08:00.0/host7/port-7:3/end_device-7:3/target7:0:3/7:0:3:0]
[7:0:4:0] … /dev/sde […08:00.0/host7/port-7:4/end_device-7:4/target7:0:4/7:0:4:0]
[7:0:5:0] … /dev/sdf […08:00.0/host7/port-7:5/end_device-7:5/target7:0:5/7:0:5:0]
[8:0:0:0] … /dev/sdn […02:00.1/ata8/host8/target8:0:0/8:0:0:0]
[N:0:1:1] … /dev/nvme0n1 […01:00.0/nvme/nvme0/nvme0n1]
And to see which PCI controller this is, I used lspci:
01:00.0 Non-Volatile memory controller: Phison Electronics Corporation Device 5013 (rev 01)
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller (rev 02)
…
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
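Putting the two listings together by hand works, but the grouping can also be sketched in code. This assumes the condensed one-line form of the lsscsi -v output shown above, with the elided paths kept exactly as they appear. The function `group_by_controller` is hypothetical, and the `controllers` labels are my shortened versions of the lspci descriptions. The sketch takes the last PCI address in each path, on the assumption that any earlier ones in a full sysfs path would be bridges.

```python
import re
from collections import defaultdict

# PCI bus:device.function, e.g. "08:00.0"
PCI_RE = re.compile(r"[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]")
# Device node followed by the bracketed sysfs path.
LINE_RE = re.compile(r"(?P<node>/dev/\S+)\s+\[(?P<path>[^\]]*)\]")


def group_by_controller(lines):
    """Return {PCI address: [device nodes]} from condensed lsscsi -v lines."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        pci = PCI_RE.findall(match["path"])
        if pci:
            # Assumption: the last PCI address in the path is the controller.
            groups[pci[-1]].append(match["node"])
    return dict(groups)


lsscsi_v = [
    "[0:0:0:0] … /dev/sdg […02:00.1/ata1/host0/target0:0:0/0:0:0:0]",
    "[1:0:0:0] … /dev/sdh […02:00.1/ata2/host1/target1:0:0/1:0:0:0]",
    "[2:0:0:0] … /dev/sdi […02:00.1/ata3/host2/target2:0:0/2:0:0:0]",
    "[3:0:0:0] … /dev/sdj […02:00.1/ata4/host3/target3:0:0/3:0:0:0]",
    "[4:0:0:0] … /dev/sdk […02:00.1/ata5/host4/target4:0:0/4:0:0:0]",
    "[5:0:0:0] … /dev/sdl […02:00.1/ata6/host5/target5:0:0/5:0:0:0]",
    "[6:0:0:0] … /dev/sdm […02:00.1/ata7/host6/target6:0:0/6:0:0:0]",
    "[7:0:0:0] … /dev/sda […08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0]",
    "[7:0:1:0] … /dev/sdb […08:00.0/host7/port-7:1/end_device-7:1/target7:0:1/7:0:1:0]",
    "[7:0:2:0] … /dev/sdc […08:00.0/host7/port-7:2/end_device-7:2/target7:0:2/7:0:2:0]",
    "[7:0:3:0] … /dev/sdd […08:00.0/host7/port-7:3/end_device-7:3/target7:0:3/7:0:3:0]",
    "[7:0:4:0] … /dev/sde […08:00.0/host7/port-7:4/end_device-7:4/target7:0:4/7:0:4:0]",
    "[7:0:5:0] … /dev/sdf […08:00.0/host7/port-7:5/end_device-7:5/target7:0:5/7:0:5:0]",
    "[8:0:0:0] … /dev/sdn […02:00.1/ata8/host8/target8:0:0/8:0:0:0]",
    "[N:0:1:1] … /dev/nvme0n1 […01:00.0/nvme/nvme0/nvme0n1]",
]

# Shortened labels taken from the lspci output.
controllers = {
    "01:00.0": "Phison NVMe controller",
    "02:00.1": "AMD X370 chipset SATA controller",
    "08:00.0": "LSI SAS2308 SAS controller",
}

by_controller = group_by_controller(lsscsi_v)
for pci, nodes in sorted(by_controller.items()):
    print(pci, controllers.get(pci, "unknown"), nodes)
```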
So the drives reporting problems are all on the new controller card. Note that only the HGST devices have errors. These drives, 12 in total, are all HGST MegaScale 4000. The Seagate Barracuda 3.5 is newer, with a 6.0 Gb/s interface. It may be that the Seagate drive isn’t reporting errors because it was not doing writes when I discovered this issue; it usually finishes the test in half the time of the HGST drives. After the HGST drives finished the write portion of the test, the reset errors stopped appearing in the kernel log.
A reviewer on Amazon noted that he had I/O problems (no details) that were solved by loading an older version of the firmware. I will try this myself and see if it makes a difference.