
All Maidenhead Data Centre Major Service Outage Posts

Web and email outage - Closed 5 Apr 09:14:46
Details
4 Apr 15:47:25

This is ongoing. We're investigating.

Update
4 Apr 16:07:41

This should now be fixed. Please let support know if you see any problems, or have any questions.

Update
5 Apr 09:15:04

This was resolved yesterday afternoon.

Started 4 Apr 15:46:11
Closed 5 Apr 09:14:46

Connectivity problems in Maidenhead - Closed 6 Feb 12:49:46
Details
6 Feb 09:15:19

There's a connectivity problem in Maidenhead. We're investigating.

Update
6 Feb 09:34:51

Connectivity looks normal again now. We're still investigating the cause of the problem.

Closed
6 Feb 12:49:46

The problem was resolved by power cycling some hardware. Although we're still not sure exactly what caused this, we are suspicious of one of our switches. We already have planned work to replace the switch on Sunday.


Disk server issues - Closed 22 Aug 2012 21:18:59
Details
22 Aug 2012 21:18:04

Disk server playing up, affecting web pages and email.

Update
22 Aug 2012 21:18:49

Staff are working on this.

Started 22 Aug 2012 21:00:00
Closed
22 Aug 2012 21:18:59

All sorted


Web & email storage problem - Closed 04 May 2012 21:15:57
Details
04 May 2012 19:00:58

There's a problem with web and email disk storage in Maidenhead.

We're investigating.

Update
04 May 2012 19:16:08

Now back. We're investigating the cause of this.

Closed 04 May 2012 21:15:57

Problem in Maidenhead - affecting VoIP/Ethernet/email/etc - Closed 26 Jan 2012 13:10:00
Details
24 Jan 2012 21:16:14

We are seeing a major issue in Maidenhead - high levels of transit packet loss that will be affecting VoIP, email, and Ethernet customers (as well as our offices).

Update
24 Jan 2012 21:47:11

Looking like some sort of denial of service attack.

Update
24 Jan 2012 22:34:04

Sorry for the delay in posting more details. This affects our links directly, which makes it difficult. The problem appears to be a huge denial of service attack involving tens of thousands of sessions and filling gigabit links.

We have identified the target and disconnected it, black holed the target and even tried to divert traffic but to no avail as yet.

We are still working on this.

Update
24 Jan 2012 23:03:38

Just an update to say that this is still being worked on...

Update
24 Jan 2012 23:23:30

Still working on this

Update
24 Jan 2012 23:28:57

The problem is now only affecting Ethernet customers on the same block as the address being DDOS'd. We are still working on the issue. Most other services will be working fine now.

Update
24 Jan 2012 23:57:55

Some side effects on other services from Maidenhead, but we are still working on narrowing down the issue.

Update
25 Jan 2012 00:01:48

DOS attacks are, thankfully, rare. This has to be the biggest we have seen.

We will, of course, be talking to the customer who is being DOSed to find out what could have provoked such a major attack. There is usually a reason.

Update
25 Jan 2012 13:20:48

The blackhole for the target machine was removed at 1pm today; however, the traffic was still being sent and affected VoIP, email and Ethernet services.

The block is now in place again, and we'll continue to investigate.

Started 24 Jan 2012 21:00:00
Closed 26 Jan 2012 13:10:00

Maidenhead datacentre problems again - Closed 25 Jan 2012 13:14:06
Details
25 Jan 2012 13:12:18

Similar to last night, access to servers and services in Maidenhead has high packet loss.

This will affect email, VoIP and Ethernet services.

Update to follow shortly.

Update
25 Jan 2012 13:13:19

Datacentre staff are working to blackhole the IP address that is the target of this attack.

Update
25 Jan 2012 13:14:43

The target IP address has been blackhole'd and service has been restored.
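
As a rough illustration of what blackholing a target address means, the sketch below models a routing table where the most specific matching prefix wins, and a host route for the attacked address points at "discard". The prefixes, next-hop labels and function name are hypothetical (documentation-range addresses), not the actual configuration used during this incident.

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop. None means "discard" (blackhole).
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "customer-block",
    ipaddress.ip_network("192.0.2.55/32"): None,  # blackhole route for the target
}


def lookup(dest: str) -> str:
    """Return the action for a destination using longest-prefix match."""
    addr = ipaddress.ip_address(dest)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return "no-route"
    # The most specific (longest) prefix wins, so the /32 overrides the /24.
    best = max(matches, key=lambda net: net.prefixlen)
    next_hop = ROUTES[best]
    return "discard" if next_hop is None else next_hop


if __name__ == "__main__":
    for dest in ("192.0.2.55", "192.0.2.10", "198.51.100.1"):
        print(dest, "->", lookup(dest))
```

Note that a discard route only drops packets once they reach the router holding it; on its own it does not stop the traffic arriving over upstream links.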

Started 25 Jan 2012 13:02:00
Closed
25 Jan 2012 13:14:06

We'll update the initial post from yesterday with further updates to this. http://status.aa.net.uk/apost.cgi?incident=1364


Web Services - Closed 28 Nov 2011 13:46:02
Details
28 Nov 2011 13:04:06

We have problems with Web and Email services at the moment, please see this post:

http://status.aa.nu/apost.cgi?incident=1298

Closed 28 Nov 2011 13:46:02

Network glitch affecting voice and ethernet - Closed 02 Nov 2011 11:25:49
Details
02 Nov 2011 11:30:52

There appears to have been a severe network glitch affecting both diverse routes out of the Maidenhead data centre. Routing is recovering now, but this would have affected Ethernet customers and VoIP customers the most. Some authentication of DSL lines may have been delayed. Access to our email and web servers and other hosted services would also have been affected.

The incident appears to have lasted a few minutes. We are trying to get more details.

Update
02 Nov 2011 11:49:03

The carriers have confirmed they had an outage and should send an explanation shortly.

Update
02 Nov 2011 21:49:20

Carriers explain the fault as:

The cause of this incident was traced to events on the network which caused high CPU load on the transit routers. This then resulted in router protocol instability which affected transit services.

We have since stabilised the network and are developing solutions to be implemented which should reduce the impact of such events in the future.

Loss of connectivity was detected at 11:24 with service restored by 11:27.

Please accept our apology for any inconvenience caused.

Started 02 Nov 2011 11:24:02
Closed 02 Nov 2011 11:25:49

Incident in Maidenhead - Closed 18 Mar 2011 11:54:30
Details
17 Mar 2011 10:22:00

We have lost comms with Maidenhead and we have an engineer going to site now. We are not sure what the issue is, but it may be power related.

Email, VoIP and some other services will be affected.

This is also affecting Ethernet customers and hosted servers in Maidenhead.

There appears to have been a fire alarm that has gone off and the data centre has been evacuated. There is no evidence of a fire, but power is down.

Update
17 Mar 2011 10:25:11

Staff are just approaching the data centre now.

Update
17 Mar 2011 10:37:59

Power is being restored now

Update
17 Mar 2011 10:49:15

Our engineers are on site and power has been restored. Our servers are coming back online; further updates will be posted when we get them.

Update
17 Mar 2011 10:56:15

Not all power has been restored yet. Some services (control pages, VoIP, web) are still down. They should be restored shortly.

Update
17 Mar 2011 11:13:10

VoIP and control pages are back. Email and web should be back soon.

Update
17 Mar 2011 11:22:00

The A VoIP server is still down.

Update
17 Mar 2011 11:56:02

Email servers are mostly back, and web services are back. We've still got some VoIP problems and are working on them.

Update
17 Mar 2011 11:58:56

The A VoIP server has a database problem, and won't let customers register.

Update
17 Mar 2011 12:02:12

There is now a database problem on C SIP server too. Investigating.

Update
17 Mar 2011 12:08:07

Database fixed on C.

Update
17 Mar 2011 12:21:07

Database problems fixed on A and C servers.

Update
17 Mar 2011 15:44:43

Most services are back up now. We have had a number of hardware failures as part of the power outage incident.

Currently the main problem is our email ticketing server - this is affecting emails to support/sales/accounts etc - and so is causing a delay in email replies.

There are also problems with:

The online ordering system
ADSL usage reporting
ADSL line status on Clueless

Some other servers still have problems, which we are working through, but the remaining servers are managing the load (many services have multiple servers).

Update
17 Mar 2011 17:15:23

The odd effect with lines not showing as on-line properly on Clueless is fixed, and lines will clear properly overnight as a result. PPP restarts of lines are needed but this is done automatically in stages to minimise disruption.

Update
17 Mar 2011 17:15:37

On-line ordering restored a little while ago.

Update
17 Mar 2011 17:18:23

I would just like to say that I am very pleased with how my staff have handled this today - tackling the issues in a sensible priority and updating status pages. This is a major issue with not just a power outage, but issues with access to the building, and possibly even a power surge, as several pieces of equipment have failed totally. The backup arrangements for critical systems have worked as expected, as has the maintenance of broadband internet access, DNS, and RADIUS authentication. Well done everyone. We'll try and get a more detailed explanation from the data centre in due course. Staff are working on the last of the issues now.

Update
17 Mar 2011 18:38:16

thankless (ticketing) still down and being rebuilt now.

Update
18 Mar 2011 00:46:47

We have now got our email ticketing system back online - we do apologise for the time this has taken, and the delay this has caused to email to support, sales and accounts.

Update
18 Mar 2011 11:55:08

We'll close this incident for now - but will add the official response from BlueSquare when they have let us know.

Update
21 Mar 2011 11:50:27

This is the official report from BlueSquare (Our racks are in the building called BS2)

 

This is a Reason for Outage Report with details regarding the power supply in BS2/3 with BlueSquare Data Services Ltd.

 

At 10:06 on Thursday 17th March one of the six UPS modules located in BlueSquare 2/3 suffered a critical component failure which resulted in a dead short on the output side (critical load side) of the UPS. This failure also caused an amount of smoke to be released by the failed UPS system which resulted in the fire alarm activating and the fire service attending. Once the fire service was happy with the situation we were able to restore power to the site via the generators with the UPS system bypassed whilst we investigated the fault further.

 

Because the short circuit occurred on the output side of the UPS, the other UPSs immediately went into an overload condition, which then switched all modules into bypass mode, as per the design of the system. This overload then transferred to the raw mains and tripped the main incomer to the site. This caused the overload condition to cease and power was lost to the site. The UPS manufacturers then worked to check all the remaining UPS modules to ensure the same component was within specification, and to fully test each UPS system, replacing some components where necessary. No further faults were found on the remaining UPS modules, and load was then switched back to full UPS protection at approx 02:15 and building load was transferred back from the generators to utility mains at approx 02:25.

 

Due to the size of the failure we have commissioned an independent organisation to forensically examine the failed UPS module. This work is scheduled to be completed next week and we will provide further details once we receive their report. This was an extremely unusual type of failure and the manufacturers have not experienced such a problem before, despite over 3,000 similar UPS units being deployed. This suggests there isn’t an inherent design problem in the units but we will not reach any conclusions until the forensic examination is complete.

 

The failed UPS module will be replaced within the next 4 weeks and until that time we will remain on ‘N’ redundancy level at BlueSquare 2 & 3. Further updates will be provided before this replacement work takes place.

 

A number of customers have asked as to why this failure could occur when we operate an N+1 UPS architecture. The reason for this is that all of the six UPS modules in BlueSquare 2/3 are paralleled together as one large UPS system. BlueSquare 2/3 only requires 5 modules to hold the critical load to the site, however we have an additional unit which provides the redundancy in the event of a UPS module failure. However, as this failure was on the common critical load side of the UPS (the same output that feeds the distribution boards which then in turn feed the racks) and all the UPS systems are paralleled together, this had the effect of causing all UPS modules to go down.

 

As an example, in an N+N configuration, such as in our Tier IV Milton Keynes site, a failure of this nature would not be possible as two banks of independent UPS systems operate providing true A&B feeds to each rack.
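
The difference between the two architectures described in the report can be sketched with a toy model. This is a simplification under stated assumptions, not BlueSquare's actual design: the class and function names are hypothetical, an N+1 site is modelled as one paralleled bank of 6 modules needing 5, an N+N site as two independent banks, and every rack is assumed to be fed from every bank.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UpsBank:
    modules: int   # modules installed in this paralleled bank
    required: int  # modules needed to hold the critical load

    def carries_load(self, failed_modules: int, output_bus_fault: bool) -> bool:
        # A fault on the shared output (critical load) side takes down the
        # whole paralleled bank, however many spare modules it has.
        if output_bus_fault:
            return False
        return self.modules - failed_modules >= self.required


def site_has_power(banks: List[UpsBank], faults: List[Tuple[int, bool]]) -> bool:
    # Assumption: racks are fed from every bank, so the site stays up
    # if at least one independent bank still carries the load.
    return any(bank.carries_load(failed, bus_fault)
               for bank, (failed, bus_fault) in zip(banks, faults))


# N+1: one bank of six modules, five required (figures from the report).
n_plus_1 = [UpsBank(modules=6, required=5)]
# N+N: two independent banks (module counts here are illustrative only).
n_plus_n = [UpsBank(modules=5, required=5), UpsBank(modules=5, required=5)]

if __name__ == "__main__":
    print("N+1, one module lost cleanly:", site_has_power(n_plus_1, [(1, False)]))
    print("N+1, output-side short:", site_has_power(n_plus_1, [(1, True)]))
    print("N+N, output-side short on bank A:",
          site_has_power(n_plus_n, [(1, True), (0, False)]))
```

In this model the spare module in an N+1 bank only helps when a module drops out cleanly; a fault on the common output side is a shared failure, which is why the independent second bank in N+N matters.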

Started 17 Mar 2011 10:00:20
Previously expected 17 Mar 2011 11:20:20
Closed 18 Mar 2011 11:54:30

VoIP and Email Problems due to Datacentre Connectivity - Closed 29 Dec 2010 12:43:00
Details
29 Dec 2010 12:35:23

We currently have routing problems to our datacentre in Maidenhead. This will be affecting access to:

  • Email - incoming and outgoing
  • VoIP
  • Hosted server
  • Control Pages (Clueless)

We have engineers looking into this at the moment, and will post another update shortly.

Update
29 Dec 2010 12:47:47

This is now working. It seems to be some routing/peering problem outside of our and BlueSquare's networks. If we get any more details we'll post an update.

Started 29 Dec 2010 12:20:00
Closed 29 Dec 2010 12:43:00