This was the result of a critical system failure in the file system that serves user-uploaded files (e.g. JPEGs, MP3s). This was the first such failure of this system in our 8-year service history, and also the worst outage to affect the SiteMaker network during that time.

The events that led to the service going offline were complex, and partly driven by safety procedures that kicked in to ensure that no further data was lost or corrupted. The network is designed to cope with system failures with no loss of service, but in this case we were unable to prevent an outage.

We fully understand that service disruption of this kind and duration is unacceptable and for this we sincerely apologise. We are taking further action in the next few days to ensure that this doesn't happen again.

If you are still seeing any lost files or unusual site behaviour, please submit a support ticket and we will follow it up to resolve this issue for you.

Detailed explanation of what happened
The SiteMaker network is made up of a large number of servers and data storage devices which work together to provide the service. Each machine has layers of redundancy in it, e.g. multiple pieces of hardware providing the same function, and each service has multiple machines in a pool to draw from, meaning that one or more could go offline with no loss to the overall service provision. In addition to this, all data is backed up both locally and in an off-site facility.

This means that in most cases, the SiteMaker network is extremely resilient to failures, and customers will not notice any disruption while pieces of hardware are replaced or machines are added to or removed from the pool.

So what went wrong?
Well, the file system for user-uploaded files is a large array of disks holding many terabytes (1000s of gigabytes) of data. The disks are set up so that a number of them could fail simultaneously with no loss of data, and like the other services, all the data is backed up elsewhere.

The cause of the failure was a corruption of the file system data, which meant that the locations of, and references to, some of the files on the disks were lost. While the files were still there, their location had become unknown to the system, and we could no longer serve them.

This failure caused the disk system's safety features to kick in and prevent any new files being added. It also required a full disk check to be run before services could be restored, which forced us to bring the whole service down while the check ran. With many terabytes of data, this took 13 hours.

Once the disk system was happy that there were no disk errors, we brought the service back up and began the restore job for those files that had become 'lost'. For this we had to rely on our off-site backup, which had to sync old data back to the live system over the internet. Despite a 100 Mbit/s connection between the two facilities, it took a further 24 hours before all files were restored.

What are we doing about it?
Two things caused us trouble here. Firstly, despite having a lot of redundancy (multiple disks and backups) in the file system, we did not have a parallel file system capable of serving the files from backup while the primary file server was down. We will rectify this by investing in a parallel system which can take over in the event of the primary service going offline. This would allow us to avoid the first 13-hour downtime if a failure of this kind happened again (despite it having happened only once in 8 years, it's not good enough).
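
To illustrate the idea of a parallel system taking over, here is a minimal sketch of read failover. It is not a description of the system we will deploy: the `FileStore` class, the `StoreUnavailable` exception and the `read_with_failover` function are all hypothetical names invented for this example, and the stores are simple in-memory dictionaries.

```python
class StoreUnavailable(Exception):
    """Raised when a store is offline, for example during a full disk check."""


class FileStore:
    """Hypothetical stand-in for a file server; files live in a dict."""

    def __init__(self, name, files, online=True):
        self.name = name
        self.files = dict(files)
        self.online = online

    def read(self, path):
        if not self.online:
            raise StoreUnavailable(self.name)
        return self.files[path]


def read_with_failover(path, primary, standby):
    """Serve from the primary store, falling back to the standby if it is offline."""
    try:
        return primary.read(path)
    except StoreUnavailable:
        # The primary is down, so the parallel system takes over for this read.
        return standby.read(path)


# Example: the primary is offline for a disk check, the standby holds a copy.
files = {"images/logo.jpg": b"example bytes"}
primary = FileStore("primary", files, online=False)
standby = FileStore("standby", files)
served = read_with_failover("images/logo.jpg", primary, standby)
```

In this sketch a visitor's request for the file still succeeds while the primary is offline, which is the behaviour the parallel system is meant to provide.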

Secondly, the restoration of lost files took a long time to complete. This is because only a chunk of the live backup data is stored locally, with the majority held in an off-site facility. When this failure required us to check everything, we had no choice but to use the off-site copy. To rectify this we will invest in an additional full local backup, so that any future restorations can happen much more quickly. This would allow us to reduce the second 24-hour period to around 1 or 2 hours, and the service would remain online throughout.
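
As a rough back-of-the-envelope check on why the link speed matters, the sketch below estimates restore times. It assumes the connection between the two facilities runs at 100 megabits per second, that the link is fully used with no protocol overhead, and that a terabyte is 10^12 bytes. The function names and the 1000 Mbit/s local link are illustrative assumptions, not figures from our actual setup.

```python
def max_terabytes(hours, link_mbit_per_s):
    """Upper bound on data (in TB) that a link can move in the given time."""
    bits = hours * 3600 * link_mbit_per_s * 1e6
    return bits / 8 / 1e12


def transfer_hours(terabytes, link_mbit_per_s):
    """Hours needed to move the given amount of data over the link."""
    bits = terabytes * 1e12 * 8
    return bits / (link_mbit_per_s * 1e6) / 3600


# Most data a 100 Mbit/s link could move in a 24-hour restore window.
print(f"{max_terabytes(24, 100):.2f} TB")        # prints "1.08 TB"

# The same amount over a hypothetical 1000 Mbit/s local link.
print(f"{transfer_hours(1.08, 1000):.1f} hours")  # prints "2.4 hours"
```

Under these assumptions, a 100 Mbit/s link moves at most about 1.08 TB in 24 hours, while a link ten times faster would move the same amount in about 2.4 hours. Real transfers would be slower than this ideal figure.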

These changes will be in place within 3 weeks on the live system.

What can we take from this?
Systems are designed to minimise the risk of failure in any part. However, there are always risks to any system which can lead to outages. Indeed most of the biggest names on the internet do lose service from time to time.

In any system designed to be robust, when failure does happen, it's usually bad. So in this case, I can say that we are very proud of and grateful to the engineering team that was able to put the system back together over this 36-hour period, and extremely grateful for the help and support provided by customers in identifying problems and resolving them.

However, this 'unlikely event' explanation is never an excuse, and we fully accept that this has been a disruptive and painful experience for our customers. For this we sincerely apologise and commit to taking the action described above to prevent it happening again.

Thank you again for your support.

Joe


UPDATE 9.57 GMT 11 August 2007: The full restore finished around 9am GMT this morning, so all the images should be back. If you have specific pages where you are still missing images, then please submit these to the support queue, as we will have to restore these manually. This may be the result of pages being saved again while the images were missing, but such cases should be rare. Thanks for your patience during this serious issue. We'll follow with a full explanation and details of future plans.


UPDATE 8.41 GMT 10 August 2007: Site access has now been restored. File upload/delete has now been restored. There continue to be a minority of cases where files are missing. These are currently being restored from backup and will be available over the next few hours. A full and detailed explanation of the incident will be provided once all systems are fully restored. We appreciate your patience in this.


Following a change on the file system, SiteMaker users are currently unable to upload new files or delete old files. Existing files are being served in 'read-only' mode, so in most cases your site will be unaffected. There is a minority of cases where the 'read-only' files cannot be displayed. We are working to resolve this issue.