Something else went wrong this morning and we are investigating.
Another crash - this time somewhere completely different and equally impossible.
It looks like lines reconnected quickly over both LNSs, as you would expect after a crash. This is an excellent test, but obviously this sort of thing simply should not happen.
OK, we are moving people who connected to D after it crashed over to C. About 1/3 of customers are being moved. The controlled LNS switch is deliberately slow, taking one line over at a time and ensuring there is no load on BT or our RADIUS that could cause default accepts or other delays or problems.
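For anyone curious what "deliberately slow" means in practice, here is a rough sketch of the idea, not our actual scripts. The names `controlled_switch`, `move_line` and `backend_busy` are made up for illustration: `move_line` stands in for whatever moves a single line to the other LNS, and `backend_busy` stands in for some check on BT or RADIUS load.

```python
import time
from typing import Callable, Iterable, List


def controlled_switch(
    lines: Iterable[str],
    move_line: Callable[[str], None],      # hypothetical: moves one line to the other LNS
    backend_busy: Callable[[], bool],      # hypothetical: True while BT/RADIUS is loaded
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Move lines strictly one at a time, holding off while the backend is busy."""
    moved: List[str] = []
    for line in lines:
        # Wait until the backend reports no load before touching the next line.
        while backend_busy():
            sleep(pause_s)
        move_line(line)
        moved.append(line)
        # Fixed gap after every move so reconnections never arrive in a burst.
        sleep(pause_s)
    return moved
```

The point of the design is that only one line is ever in the middle of reconnecting, so the move takes a long time in total but should never produce the sort of burst that leads to default accepts.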
We are installing new code with additional debugging on D, and will move some lines back to D shortly.
This means we will lose graphs for this morning.
It looks like a few dozen lines got a PPP kill twice. We have updated the scripts and are now moving 1000 lines from C back to D. The plan is to leave this for this afternoon to confirm all is working and stable.
We are moving some more lines over now.
We are moving the last of the lines over now.
All lines are on the new code on the "D" LNS.
The problem is we have not got to the underlying cause yet.
We'll leave service like this for the rest of the day.
As the service has been stable for over 24 hours, I am closing this major issue, but we will continue to monitor carefully and try to find the underlying cause.