[PLUG] Netbooting device needs NFSv2

Ted Mittelstaedt tedm at portlandia-it.com
Wed Dec 25 18:13:55 UTC 2024


Kudos, Russell!  It nearly brought me to tears, reading that.  This is REAL old-school troubleshooting, the kind that was routinely done back in the day when Linux wasn't so full of itself that it could just tell people to discard gear (or send it to a museum) that didn't work with it.

Does this apply to the net4801 or any other more commonly available used Soekris models that use the same CPU?

One of the critical reasons this kind of research is so important is that when you do the Wrong Thing successfully, the code diverges from the documentation, and later on someone hits a bug and has no clue why it's there.

Ted

-----Original Message-----
From: PLUG <plug-bounces at lists.pdxlinux.org> On Behalf Of Russell Senior
Sent: Tuesday, December 24, 2024 4:59 PM
To: Portland Linux/Unix Group <plug at lists.pdxlinux.org>
Subject: Re: [PLUG] Netbooting device needs NFSv2

On Wed, Dec 18, 2024 at 5:48 AM Russell Senior <russell at personaltelco.net> wrote:
>
> The right solution is probably just to retire the one in the field and 
> put the whole lot of them into a "museum box", but hey, it's the 
> holidays. What better period to waste a bunch of time keeping creaking 
> hardware alive. And anyway, the museum curators will be more thrilled 
> to create an exhibit if they have working firmware.

I am happy to report that I was able to use the periodic builds I had made historically to narrow the introduction of the breakage down to a few months in 2019, between late February and late May. Then I used classic git bisection, in half a dozen or so iterations, to narrow the breakage to a single commit. To bisect a five-year-old project that is constantly changing, I had to set up a "period correct" build environment, because the project as it stood in 2019 did not (and could not) anticipate later changes in the build host environment (things like new compiler and toolchain versions, in particular gcc, g++ and python). That meant I had to find a "spare" machine that I could commit to an old OS version. I ended up with Ubuntu 18.04.6, which would have been extant in 2019. I tried a Debian version, but it didn't have the non-free firmware blobs needed to get the "spare" laptop connected to a network.
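
For readers unfamiliar with bisection, the search can be sketched roughly as follows. This is a minimal Python illustration of the idea, not what git itself runs; the function name, the integer "commits", and the `is_bad` callback (standing in for "build the firmware at this commit and see whether the device boots") are all hypothetical.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, number_of_tests).

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and the breakage never goes away
    once it has been introduced.
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                 # one build-and-boot cycle per iteration
        if is_bad(commits[mid]):
            hi = mid               # breakage is at mid or earlier
        else:
            lo = mid               # breakage is after mid
    return commits[hi], tests


# Hypothetical history: 64 commits, breakage introduced at commit 41.
history = list(range(64))
first_bad, tests = bisect_first_bad(history, lambda c: c >= 41)
```

Each iteration halves the suspect range, which is why a range of several dozen commits needs only about half a dozen build-and-test cycles.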

The single commit was a kernel bump from v4.14.112 to v4.14.113. So, I looked at the commits involved in that transition and spotted one that changed how the Cyrix chips were supported. I took v4.14.113, reverted that single change, and *boom*, my breakage was fixed. I reported that upstream to the Linux kernel people who were involved in that commit. While waiting for a response from them, an OpenWrt guy and I (mostly following his reasonable suggestions and intuition) narrowed the problem down even further. The root cause appears to be that the SC1100 chip does *NOT* want its SUSP# pin enabled. This pin allows an external device (part of the chipset) to stop and start the CPU. Apparently, during warm boots, that pin gets pulled low and the CPU dutifully stops. So, I have a patch that works for my specific context, although it probably breaks in some other contexts, so upstream will need to determine how to deal with that. My same local fix works in modern OpenWrt with a v6.6.67 kernel. So, my field-deployed Soekris net4826 *can* be updated to modern firmware.

  https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/goede_gx1_databook-rev5.pdf

In the Geode GX1 family, there is a set of CPU registers that are accessed by first writing a register index to port 0x22 and then reading or writing port 0x23. The "fix" that broke the SC1100 was to actually do that getting/setting correctly, in the right order. I *think* the reason it was breaking is that the Old Method was trying to set the SUSP# enable bit but actually failing, so the pin was not enabled and my warm boot succeeded. When the v4.14.113 changes fixed the getter/setter functions, the code did the Wrong Thing successfully. So, the right fix is just to not do the Wrong Thing at all.
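
The "fails silently, then succeeds at the Wrong Thing" behavior can be illustrated with a toy model of an index/data port pair. This is only a sketch under stated assumptions: it assumes a data-port access is honored only when it immediately follows an index write, and that the Old Method went wrong by interleaving a nested read between selecting the register and writing it. The class, the register index 0xC2 and the bit 0x80 are illustrative placeholders, not the kernel's actual code or confirmed SC1100 values.

```python
INDEX_PORT, DATA_PORT = 0x22, 0x23
REG, ENABLE_BIT = 0xC2, 0x80        # placeholder register index and bit


class IndexedRegs:
    """Toy chip: write an index to 0x22, then read or write 0x23."""

    def __init__(self):
        self.regs = {}
        self.selected = None        # index armed by the last 0x22 write

    def outb(self, value, port):
        if port == INDEX_PORT:
            self.selected = value
        elif port == DATA_PORT:
            if self.selected is not None:
                self.regs[self.selected] = value
            self.selected = None    # assumption: any data access disarms

    def inb(self, port):
        if port == DATA_PORT and self.selected is not None:
            value = self.regs.get(self.selected, 0)
            self.selected = None
            return value
        return 0xFF                 # unarmed read returns junk


def set_bit_interleaved(chip, reg, bit):
    """Assumed Old Method: the read is nested inside the write sequence."""
    chip.outb(reg, INDEX_PORT)      # select register for the write
    chip.outb(reg, INDEX_PORT)      # nested read selects it again...
    old = chip.inb(DATA_PORT)       # ...and the read disarms the index
    chip.outb(old | bit, DATA_PORT) # write is ignored: nothing is armed


def set_bit_ordered(chip, reg, bit):
    """Corrected order: every data access directly follows an index write."""
    chip.outb(reg, INDEX_PORT)
    old = chip.inb(DATA_PORT)
    chip.outb(reg, INDEX_PORT)
    chip.outb(old | bit, DATA_PORT)


old_chip, new_chip = IndexedRegs(), IndexedRegs()
set_bit_interleaved(old_chip, REG, ENABLE_BIT)  # bit never lands
set_bit_ordered(new_chip, REG, ENABLE_BIT)      # bit is really set
```

In this model the interleaved version leaves the register untouched, while the ordered version sets the bit, which matches the observation that correcting the accessors is what exposed the unwanted SUSP# enable.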

Merry Christmas,

--
Russell Senior
russell at personaltelco.net


