Hopefully, with a new kernel, a new driver for the NIC, and 90 minutes of downtime, the server we live on (that’s “we” as in the websites; I’m not living in a sleeping bag in a 40 RU rack) should no longer be up and down like a yoyo.
The problem has been that for several months (since March, I think) the ethernet link would mysteriously stop sending and receiving data. The link light stayed on, syslog and kern.log happily partied, but the server fell off the ‘net. Luckily, from a previous occurrence, I had a cron job which brings the link down and up (ifdown/ifup if you know your Debian) when the router is absent. This meant that downtime would generally only be 60 seconds long.
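For the curious, the watchdog idea looks roughly like the sketch below. This is not the actual cron job, just a minimal Python illustration of the same logic: ping the router, and bounce the interface only if nothing answers. The router address `192.0.2.1`, the interface name `eth0`, and the function names are all assumptions made up for the example.

```python
import subprocess

ROUTER = "192.0.2.1"  # hypothetical gateway address (documentation range)
IFACE = "eth0"        # hypothetical interface name


def router_reachable(router, run=subprocess.run):
    """Return True if the router answers at least one of three pings."""
    result = run(
        ["ping", "-c", "3", "-W", "2", router],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def bounce_if_router_absent(router=ROUTER, iface=IFACE, run=subprocess.run):
    """Bring the interface down and up again, but only if the router is unreachable.

    Returns True if the interface was bounced, False if it was left alone.
    The `run` parameter exists only so the command runner can be swapped out
    for a fake one when testing.
    """
    if router_reachable(router, run):
        return False
    run(["ifdown", iface])
    run(["ifup", iface])
    return True
```

Calling `bounce_if_router_absent()` from a script that cron runs once a minute would give about the 60-second worst case described above. It needs root, since ifdown/ifup do.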
This held until last week, when I realised the IPv6 stack (yes, we’re IPv6 enabled!) was suffering under this relentless onslaught of interface bouncing. Something was chewing up memory every time it happened (up to 20 times a day >_<), such that the all-important 6to4-to-2000::/3 route (i.e. the route for everything that’s not using 6to4) wasn’t being added. I only noticed this when I was setting up this blog and found there was a significant delay between me hitting the server access button and actually getting a webpage.
So I called my co-location provider again, given that last time this happened it was caused by them (or their predecessors) moving this server to a different switch. They disclaimed any such knowledge this time, and said they’d look into it. (Still waiting to hear back. @_@)
Rebooting fixed it temporarily, and this weekend I decided it was time to update my kernel from 2.4.21 w/Debian patches to 2.4.27. I did that on Sunday night, rebuilt the scyld rtl8139 driver I’d been using, and off it went. Clean reboot, no problems, but the link was still dying!
So this afternoon I took it upon myself to flick the server over to the kernel-included 8139too driver. Needless to say, rmmod barfed since the module appeared to be in use by more than one thing, and so I vainly made it reboot on the off chance that would clear things up. Only the reboot died, after shutting down the daemons, so I couldn’t ssh back in and fix it at all.
A call to my co-location providers, a 90-minute wait of terror, and we’re live again.
The new network driver seems to have fixed the problem.
The only thing that worries me? I only switched to the scyld driver because that fixed the problem the last time this happened.
I’ll leave the cron job running, just in case. ^_^