Andrew Que
Noah

After not having any luck yesterday with my new hard drive controller, I’ve decided to try a different controller. The old Blue Dragon had a 4-port PCIe x1 card. This only gives me access to 12 of the 13 drives, but will allow me to run another test. I started that last night before going to bed.

The first thing I noticed is that the test runs noticeably slower. With the first controller I could test all 13 drives in less than 21 hours. This test finished in 22 hours, 57 minutes.

I did not see a single “Power-on or device reset occurred” error. What I did see were periodic “hard resetting link” errors. They didn’t happen often, but they did occur. I think this might be due to poor SATA cables.

Vulture and Something Dead

Drive test from yesterday came up clean with no drive having any errors. This morning I tried checking the Seagate drive by itself. I had not seen kernel log messages about failures of this drive, but the drive had also been reading and verifying rather than writing when I looked. After 30 minutes of writing, no errors were reported. Then I checked one of the HGST drives. Within a few minutes I started to see errors reported. So the new controller has problems writing to the HGST drives.

April 07, 2020

Data-Dragon Disk Kernel Error

Might have stumbled on a possible source of the random errors on my drive tests. Yesterday’s test had a single drive failure, and in a similar manner to previous failures. One of the low hour drives reported several thousand write errors, and millions of read errors. I decided to have a look at the kernel message log, and started seeing regular instances of this:

[…] sd 7:0:2:0: Power-on or device reset occurred
[…] sd 7:0:4:0: Power-on or device reset occurred
[…] sd 7:0:1:0: Power-on or device reset occurred
[…] sd 7:0:0:0: Power-on or device reset occurred

This reset is happening on the devices connected to the new hard drive controller. I suspect either a bad cable (power or SATA) or something wrong with this brand new controller.

Output of lsscsi:

[7:0:0:0]    disk    ATA      HGST HMS5C4040AL A3W0  /dev/sda 
[7:0:1:0]    disk    ATA      HGST HMS5C4040AL A3W0  /dev/sdb 
[7:0:2:0]    disk    ATA      HGST HMS5C4040AL A3W0  /dev/sdc 
[7:0:3:0]    disk    ATA      ST4000DM004-2CV1 0001  /dev/sdd 
[7:0:4:0]    disk    ATA      HGST HMS5C4040AL A3W0  /dev/sde 
[7:0:5:0]    disk    ATA      V-32             0825  /dev/sdf 

There are only reported problems for sda, sdb, sdc, and sde. No problems reported on sdd, sdf.
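Tallying those reset messages per SCSI address makes it easier to see which drives are affected. The following is a minimal sketch, not the method used above: the function name `count_resets` is made up for illustration, and it assumes the log lines look like the ones quoted earlier.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   [...] sd 7:0:2:0: Power-on or device reset occurred
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def count_resets(log_lines):
    """Return a Counter mapping SCSI address (H:C:T:L) to reset-message count."""
    counts = Counter()
    for line in log_lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines in the same shape as the kernel log excerpt above.
    sample = [
        "[...] sd 7:0:2:0: Power-on or device reset occurred",
        "[...] sd 7:0:4:0: Power-on or device reset occurred",
        "[...] sd 7:0:1:0: Power-on or device reset occurred",
        "[...] sd 7:0:0:0: Power-on or device reset occurred",
    ]
    for address, total in sorted(count_resets(sample).items()):
        print(address, total)
```

For a real run the lines could come from the kernel log on standard input instead of the hard-coded sample.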

To map which SCSI device goes to what controller I used lsscsi -v which produced:

[0:0:0:0] … /dev/sdg […02:00.1/ata1/host0/target0:0:0/0:0:0:0]
[1:0:0:0] … /dev/sdh […02:00.1/ata2/host1/target1:0:0/1:0:0:0]
[2:0:0:0] … /dev/sdi […02:00.1/ata3/host2/target2:0:0/2:0:0:0]
[3:0:0:0] … /dev/sdj […02:00.1/ata4/host3/target3:0:0/3:0:0:0]
[4:0:0:0] … /dev/sdk […02:00.1/ata5/host4/target4:0:0/4:0:0:0]
[5:0:0:0] … /dev/sdl […02:00.1/ata6/host5/target5:0:0/5:0:0:0]
[6:0:0:0] … /dev/sdm […02:00.1/ata7/host6/target6:0:0/6:0:0:0]
[7:0:0:0] … /dev/sda […08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0]
[7:0:1:0] … /dev/sdb […08:00.0/host7/port-7:1/end_device-7:1/target7:0:1/7:0:1:0]
[7:0:2:0] … /dev/sdc […08:00.0/host7/port-7:2/end_device-7:2/target7:0:2/7:0:2:0]
[7:0:3:0] … /dev/sdd […08:00.0/host7/port-7:3/end_device-7:3/target7:0:3/7:0:3:0]
[7:0:4:0] … /dev/sde […08:00.0/host7/port-7:4/end_device-7:4/target7:0:4/7:0:4:0]
[7:0:5:0] … /dev/sdf […08:00.0/host7/port-7:5/end_device-7:5/target7:0:5/7:0:5:0]
[8:0:0:0] … /dev/sdn […02:00.1/ata8/host8/target8:0:0/8:0:0:0]
[N:0:1:1] … /dev/nvme0n1 […01:00.0/nvme/nvme0/nvme0n1]

And to see which PCI controller this is I used lspci:

01:00.0 Non-Volatile memory controller: Phison Electronics Corporation Device 5013 (rev 01)
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller (rev 02)
… 
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

So the new controller card is the one with the drives reporting problems. Note that only the HGST devices have errors. These drives, 12 in total, are all HGST MegaScale 4000. The Seagate Barracuda 3.5 is newer with a 6.0 Gb/s interface. It may be that the Seagate drive wasn’t reporting errors because it was not doing writes when I discovered this issue. It usually finishes the test in half the time of the HGST drives. After the HGST drives finished the write portion of the test, the reset errors stopped appearing in the kernel log.
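The same mapping from device to controller can be done mechanically by pulling the PCI address out of each lsscsi -v path. This is only a sketch under assumptions: `pci_function` and `group_by_controller` are hypothetical helper names, and the sample paths are shortened the same way as in the listing above.

```python
import re

# A PCI function address looks like "08:00.0" (bus:device.function).
PCI_RE = re.compile(r"\b([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\b")


def pci_function(sysfs_path):
    """Return the last PCI address found in a sysfs path, or None."""
    found = PCI_RE.findall(sysfs_path)
    # The last address in the path is the device the disk hangs off.
    return found[-1] if found else None


def group_by_controller(entries):
    """Group (device, sysfs_path) pairs by the PCI address in the path."""
    groups = {}
    for device, path in entries:
        groups.setdefault(pci_function(path), []).append(device)
    return groups


if __name__ == "__main__":
    # Paths shortened as in the lsscsi -v listing above.
    sample = [
        ("/dev/sdg", "...02:00.1/ata1/host0/target0:0:0/0:0:0:0"),
        ("/dev/sda", "...08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0"),
        ("/dev/sdd", "...08:00.0/host7/port-7:3/end_device-7:3/target7:0:3/7:0:3:0"),
    ]
    for controller, devices in group_by_controller(sample).items():
        print(controller, devices)
```

The controller address printed for each group can then be looked up in the lspci output.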

A reviewer on Amazon noted that he had I/O problems (no details) that were solved by loading an older version of firmware. I will try this myself and see if it makes a difference.

In September of 2014 I installed a push button switch on the door of a wardrobe I use to hang my work shirts. Although I took pictures of it, I never did write about the switch install. At the time, I was doing so many projects that this one fell through the cracks. This switch setup worked until a couple of weeks ago, when the switch fell apart. It really wasn't designed for what I did with it, so I'm pretty happy it lasted as long as it did.

I got used to having a light in the wardrobe and wanted it back. After some searching around I found a proper door switch and ordered it. I needed to add a wood spacer to make the setup work, but otherwise the new switch went in quickly. Now my shirt wardrobe has light again.

Yesterday's drive test is again inconclusive. One drive that was reporting no errors in the previous test now registered 63 million, and the drive that had errors reported none. Again, the error was on a new drive with only 330 hours (about 14 days) of run time. The drive that did not appear previously also had no errors. When in doubt, test again. I'm also ordering new SATA cables, as wiggling the cables clearly helped and I don't trust the cables currently installed. Should have results sometime tomorrow. Might as well run the test over and over. Random inconsistencies might become less random with repetition.

On Monday I did my first bike ride in a very long time. Despite the quarantine there are no bans on solitary outdoor activities such as cycling (see section 11.c). Temperatures were in the 50s with a mild wind from the north-west. I did the Martinsville-Waunakee loop. This ride is usually about 2 hours, 10 minutes and 1,500 Calories. This time it took me 2 hours and 29 minutes and I burned 2,113 Calories. The increased calorie burn is likely the result of being out of shape from not riding for so long.

The other day my new hard drive controller arrived, allowing the Data Dragon to address all 14 hard drives. There are 13x 4 TB drives and a 32 GB SSD. I had suspected one drive I removed was no good because it failed to appear to the machine, but decided to give it a full test to find out. The test finished today, and the results are puzzling. I didn't notice, but only 12 of the 13x 4 TB drives actually showed up. One of the others, which had been functional, did not appear. In addition, one of the new drives I picked up from Pluvius registered several million failures. I don't trust the results of this test.

I decided to wiggle drive cables and start the test again. This time, all 14 drives registered as being online. The test takes around 20 hours, so we'll find out then what happens.

April 01, 2020

Bad traffic from Amazon

Noticed a huge amount of traffic dragging through my site, all coming from the same subnet. I’ve noticed this kind of thing in the past and it is usually from China, but this time it was from a subnet owned by Amazon. Someone is running a script that is downloading everything from my site. Unlike most websites, DrQue.net is not sitting in a data center with giant Internet pipes, and I need to share that bandwidth. So I temporarily blocked a large block of IP space: 54.174.52.0/22. Initially I just blocked a class C-sized block (a /24) starting at 54.174.55.0, but then I started seeing requests from 54.174.54.* and 54.174.53.*, so I blocked those too.

sudo iptables -A INPUT -s 54.174.52.0/22 -j DROP

Now I just have to remember to remove that rule sometime in the future.
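As a sanity check on the subnet math, Python's standard ipaddress module can show which /22 covers the three ranges I saw requests from (54.174.53.*, 54.174.54.* and 54.174.55.*). This is just a sketch: `covering_block` is a made-up helper name, and the .10 host addresses are arbitrary examples, not addresses from my logs.

```python
import ipaddress


def covering_block(address, prefix):
    """Return the network of the given prefix length that contains address."""
    # strict=False accepts an address with host bits set and masks them off.
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)


if __name__ == "__main__":
    block = covering_block("54.174.54.0", 22)
    print(block)                # 54.174.52.0/22
    print(block[0], block[-1])  # 54.174.52.0 54.174.55.255

    # Arbitrary example hosts, one from each range seen in the requests.
    for host in ("54.174.53.10", "54.174.54.10", "54.174.55.10"):
        print(host, ipaddress.ip_address(host) in block)
```

A /22 is aligned on a multiple of four in the third octet, so the block containing 53, 54 and 55 starts at 52.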

March 31, 2020

Custom Python Message Of the Day (MOTD)

Data-Dragon's MOTD

I typically replace the motd (message of the day) on all my main Linux computers. Getting exactly what you want from motd isn’t always straightforward. Initially, the message of the day was stored in a file called /etc/motd. On Debian there is a set of scripts to generate all kinds of extra crap located in /etc/update-motd.d/. Generally I empty this directory out less one file, and edit that file to be the message of the day. On Proxmox, the message of the day is just the kernel name followed by the Debian legal message. The legal message comes from /etc/motd.

On the Data-Dragon, I truncated /etc/motd (just an empty file now). Then I replaced /etc/update-motd.d/10-uname with the contents of a Python script that generates a custom color pattern.
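To give an idea of what such a script can look like, here is a minimal sketch. It is not the script running on the Data-Dragon: the pattern, the function names `color_bar` and `build_motd`, and the welcome text are all made up for illustration. It assumes a terminal that understands ANSI 256-color escape codes, and that whatever the script prints becomes the message of the day.

```python
#!/usr/bin/env python3
import socket

RESET = "\033[0m"  # ANSI code to restore default colors


def color_bar(width=32, start=16, step=6):
    """Build one line of colored cells using ANSI 256-color backgrounds."""
    cells = []
    for i in range(width):
        # Stay inside the 6x6x6 color cube, which spans codes 16..231.
        color = start + (i * step) % 216
        cells.append(f"\033[48;5;{color}m ")
    return "".join(cells) + RESET


def build_motd(hostname=None):
    """Return the full message: a color bar, a greeting, and another bar."""
    hostname = hostname or socket.gethostname()
    bar = color_bar()
    return "\n".join([bar, f" Welcome to {hostname}", bar])


if __name__ == "__main__":
    print(build_motd())
```

The file needs to be executable for it to be run at login.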