By Ben Li
A single glitch in computer hardware disabled Internet access for many University of Calgary students and staff on Mon., Jan. 10. Because of a misbehaving router, which aggregates and directs much of the computer traffic on campus, users could not access the Infonet, e-mail, and Internet web sites.
“This incident is unprecedented,” said Associate Director of Information Technologies Tom Seto. “The Academic router had been in use for over four years.”
According to Seto, the outage on the Academic Router lasted from 1:28 p.m. to 4:20 p.m. and generated more than double the normal call volume to the IT help desk. While traffic from viruses and worms has caused significant outages in the past, most notably in September 2003, hardware had never before been a cause of failure.
Network Analyst Dan Clark explained that the outage took almost three hours to remedy because of the complicated nature of the failure.
After technicians investigated a possible denial-of-service attack on DNS servers and replaced one of them, they rebooted the router gracefully and restored full service. Prior to Monday, the Academic Router had been operating flawlessly since its last software update in August 2004.
Students in departments whose computers rely on their own DNS servers were not affected.
“There was no obvious precipitating event to cause the failure here,” said Clark of the difficulty in isolating the exact cause of failure. “The router logs will show the state of registers, but recreating the error from that is just about impossible.”
The IT department is conducting a major incident review (MIR) this week to examine both the hardware failure and how the response to it served end users. Seto acknowledged that users and computer administrators could have been better informed of the outage and the restoration of service.
“The MIR will identify things we can do in the future to prevent outages,” said Seto. “More immediately, we’ll look at our phone-out and answering machine notifications, and work with helpdesks to better inform users. It doesn’t matter if it’s a three hour outage, people aren’t happy.”
Many students unable to access e-mail or the Infonet on the first school day of the semester were left helpless in some computer labs, where no information about the outage was available. Seto hopes that prioritizing notification will alleviate uncertainty among users during future incidents.
“Though fixing the problem is important, communication with users will be of primary importance,” he said.
Results from the MIR will be available next week.