We’re not having any trouble, are you?
Create a ticket at www.theclientarea.info if you are.
Our report from the incident on 05/04/2012 is as follows.
Issue
sms-sagat unresponsive
Underlying cause
Memory page fault caused a kernel panic
Symptoms
Complete loss of service on sms-sagat
Resolution
Continual memory tests are running on the system, but so far they have shown no errors. We assume it was a software fault (not hardware).
The RAID array is also degraded and being re-built, so performance is limited.
—
Follow Up
A SMART test was run on all drives and one drive reported bad sectors. As a result, this drive has been removed and replaced and the RAID array is rebuilding. An off-line snapshot has been taken of the system whilst the RAID array is degraded.
We use two forms of monitoring, Pingdom (an external service) and our own monitoring platform (also external).
Within the last 15 minutes, we have received several Pingdom notifications reporting connectivity dropping and immediately coming back up. However, this does not correspond with our own monitoring reports.
Both Pingdom’s monitoring service and Pingdom’s FPT are showing strange results; however, other 3rd party services are reporting no issues.
At the moment, we are investigating what is going on, but it looks to be an issue with Pingdom rather than our connectivity. Enquiries are under way.
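The reasoning above (one monitor reporting drops while the others see nothing) can be sketched as a simple majority check. This is only an illustration and not our actual monitoring code: the function name `classify_outage`, the monitor names and the majority rule are all assumptions made for the example.

```python
def classify_outage(reports):
    """Cross-check several monitors (hypothetical helper, for illustration).

    `reports` maps a monitor name to True if that monitor currently
    sees the service as up, or False if it sees it as down.
    """
    if not reports:
        raise ValueError("need at least one monitor report")

    # Monitors that currently report the service as down.
    down = sorted(name for name, up in reports.items() if not up)

    if not down:
        return "all monitors agree: up"
    if len(down) == len(reports):
        return "all monitors agree: down"
    # A strict minority reporting "down" points at the monitors themselves.
    if len(down) * 2 < len(reports):
        return "likely monitor fault: " + ", ".join(down)
    return "possible real outage: " + ", ".join(down)


if __name__ == "__main__":
    print(classify_outage({"pingdom": False, "own_platform": True, "third_party": True}))
```

A majority vote is a deliberately crude rule; it only helps when the monitors are independent of each other, which is why both of ours are external and separate.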
Our report from the incident on 27/02/2012 is as follows.
Issue
DDoS attack on our transit provider’s network
Underlying cause
External high volume attack from multiple sources targeting a customer subnet
Symptoms
Intermittent loss of service on multiple subnets
Resolution
From the information gathered so far, the evidence points to a single attack on one customer.
The team is still looking through logs and progressing the incident with the relevant authorities. Further measures are currently being invoked to reduce such attacks in future.
We have had a brief chat with the data centre team and the root cause of the downtime last night is believed to be a broad DDoS attack across a number of subnets, peripheral to our own network but substantial enough to saturate the 10Gb uplinks to our peers.
A formal investigation is under way at present. However, we have been assured our own connectivity should not be affected any more.
We would like to apologise for the outage last night, which spanned 11 minutes in total, but we hope our proactive response to the situation and clear information throughout were of some benefit to concerned customers.
We are currently discussing means to prevent this happening again. However, as the attack was not directed at subnets within our own network, it will still be hard to mitigate.
For reliability and performance, we hand off BGP to our upstream provider, who uses multiple peers and handles external (internet) routes on our behalf. However, this was our downfall: when another customer of theirs fell victim to a DDoS attack, it saturated the common transit uplinks, affecting the entire data centre.
We have no doubts about our current peers and transit providers. They have served us well, with 3 years of 100% network connectivity, and we have full faith in their ability to deal with future issues.
Connectivity was mostly restored after a few small windows of downtime, but routes are flapping at the moment.
Engineers are still working to identify the root issue and resolve it, but at present we are awaiting updates.
What we know
The issue is outside of Sonassi Hosting’s network; our transit provider is experiencing difficulties at the data centre which is something that we cannot remedy. They have engineers on site working on a fix.
We still have 100% power and 100% cooling, and our internal network (from edge-in) is 100% functional. However, outbound/inbound national routes are flapping.
First ever significant outage
This is our first ever significant outage in 3 years of operations and certainly not what our clients are accustomed to.
We would like to reassure all customers that we will remain available on here and Twitter (@sonassi @sonassihosting) if you want to talk to us directly.