
Archived

This topic is now archived and is closed to further replies.

mugtang

Sorry guys


My host had an issue and their entire network (tens of thousands of sites) went offline.  It was caused by a single person screwing something up.  It looks like the board was restored to a Thursday night backup. Sorry for the lost activity and hopefully we’re good to go from here.  

 

-mug 

thelawlorfaithful, on 31 Dec 2012 - 04:01 AM, said: One of the rules I live by: never underestimate a man in a dandy looking sweater

 


1 minute ago, TheSanDiegan said:

Thank God... I thought I had stumbled upon a glitch in the Matrix.

Pretty stressful 9 hours for me....and there was nothing I could do about it. 


 


56 minutes ago, mugtang said:

Pretty stressful 9 hours for me....and there was nothing I could do about it. 

[Image: "the Russians did it... aw man, not again, them Ruskies" meme]


2 hours ago, mugtang said:

My host had an issue and their entire network (tens of thousands of sites) went offline.  It was caused by a single person screwing something up.  It looks like the board was restored to a Thursday night backup. Sorry for the lost activity and hopefully we’re good to go from here.  

 

-mug 

Next time, get Frank Cannon on the case. [Image: Frank Cannon]

 


If you were wondering why we were down for so long yesterday, here's the email I got from my hosting provider explaining what happened:

Quote

**What happened?**

First and foremost - this failure is not something that we planned on or expected.  A server administrator, the most experienced administrator we have, made a big mistake.  During some routine maintenance, where they were supposed to perform a _file system trim_, they mistakenly performed a _block discard_.

**What does this mean?**

The server administrator essentially told our storage platform to drop all data rather than simply dropping data that had been marked as _deleted_ by our servers.

**Why is restoration taking so long?**

Initially we believed that only the primary operating system partition of the servers was damaged - so we worked to bring new machines online to connect to our storage to bring accounts back online.  Had our initial belief been correct - we'd have been back online in a few hours at most.

As it turns out our local data was corrupted beyond repair - to the point that we could not even mount the file systems to attempt data recovery.

Normally we would rely on snapshots in our storage platform - simply mounting a snapshot from prior to the incident and booting servers back up.  It would have taken minutes, maybe an hour at most.  We are not sure yet, and will need to investigate, but snapshots were disabled.  I wish I could tell you why - and I wish I knew why - but we don't know yet and will have to look into it.

We are working to restore cPanel backups from our off-site backup server in Phoenix, Arizona.  While you would think the distance and connectivity were the issue - the real issue is the amount of I/O that backup server has available to it.  While it is a robust server with 24 drives - it can only read so much data so fast.  As these are high-capacity spinning drives - they have limits on speed.

Our disaster recovery server is our **last resort** to restore client data and, as it stands, is the _only_ copy we have remaining of all client data - except that which has already been restored which is back to being stored in triplicate.

**What will you do to prevent this in the future?**

We have, as we've been working on this and running into issues getting things back online quickly, been discussing what changes we need to make to ensure both that this doesn't happen again and that we can restore more quickly in the future should the need arise.  I will go into more detail about this once we are back online.

**We are sorry - we don't want you to be offline any more than you do.**

Personally I'm not going to be getting any sleep until every customer affected by this is back online.  I wish I could snap my fingers and have everybody back online or that I could go into the past and make a couple of _minor_ changes that would have prevented this.  I do wish, now that this has happened, that there was a quick and easy solution.

I understand you're upset / mad / angry / frustrated.  Believe me - I am sitting here listening to each and every one of you about how upset you are - I know you're upset and I am sorry.  We're human - and we make mistakes.  In this case **thankfully** we do have a last resort disaster recovery that we can pull data from.  There are _many_ providers that, having faced this many failures - a perfect storm so to speak - would have simply lost your data entirely.

This is the **first** major outage we've had in over a decade and while this is definitely major - our servers are online and we are actively working as quickly as possible to get all accounts restored and back online.  For clarity - the bottleneck here is not a staffing issue.  We evaluated numerous options to speed up the process and unfortunately, short of copying the data off to faster disks - which we did try - there's nothing we can do to speed this up.  The process of copying the data off to faster disks was going to take just as long, if not longer, than the restoration process is taking on its own.

Once everybody is back online - and there are accounts coming online every minute - we will be performing a complete post-mortem on this and will be writing a clear and transparent Reason For Outage [RFO] which we will be making available to all clients.
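
For anyone curious what the trim-vs-discard mix-up means in practice: on Linux, fstrim asks a mounted filesystem to discard only the blocks it already considers free, while blkdiscard discards every block on the device, free or not. The host hasn't shared their actual tooling, so the wrapper below is purely a hypothetical sketch of the kind of guard rail that could catch that slip; the two command names are real util-linux utilities, everything else here is assumed.

```python
#!/usr/bin/env python3
"""Hypothetical maintenance wrapper: allow fstrim, guard blkdiscard.

fstrim asks a mounted filesystem to discard only blocks it already
considers free; blkdiscard discards every block on a device. Both are
real util-linux commands, but this wrapper and its checks are assumed,
not the host's actual tooling.
"""
import argparse
import shlex
import subprocess
import sys

SAFE_COMMANDS = {"fstrim"}             # routine: trims only free space
DESTRUCTIVE_COMMANDS = {"blkdiscard"}  # wipes the whole device


def run_maintenance(argv, allow_destructive=False):
    """Run a maintenance command, refusing destructive ones by default."""
    if not argv:
        print("No command given.", file=sys.stderr)
        return 1
    cmd = argv[0]
    if cmd not in SAFE_COMMANDS | DESTRUCTIVE_COMMANDS:
        print(f"Unrecognised maintenance command: {cmd}", file=sys.stderr)
        return 1
    if cmd in DESTRUCTIVE_COMMANDS:
        if not allow_destructive:
            print(f"REFUSING destructive command: {shlex.join(argv)}", file=sys.stderr)
            print("Re-run with --allow-destructive to proceed.", file=sys.stderr)
            return 1
        # Force the operator to re-type the target before a whole-device discard.
        target = argv[-1]
        typed = input(f"Type '{target}' to confirm an irreversible discard: ")
        if typed != target:
            print("Confirmation did not match; aborting.", file=sys.stderr)
            return 1
    return subprocess.run(argv).returncode


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Guarded trim/discard maintenance wrapper.")
    parser.add_argument("--allow-destructive", action="store_true",
                        help="permit blkdiscard after typed confirmation")
    parser.add_argument("command", nargs=argparse.REMAINDER,
                        help="maintenance command, e.g. fstrim -v /srv")
    args = parser.parse_args()
    sys.exit(run_maintenance(args.command, args.allow_destructive))
```

Usage for the routine case would be something like python3 maintain.py fstrim -v /home; a bare blkdiscard against a device gets refused unless it's explicitly forced and the device path is re-typed. A guard like this doesn't make the destructive path impossible, it just inserts one deliberate step in front of it.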
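The "why is restoration taking so long" part also lends itself to a quick back-of-envelope check. The provider only disclosed the 24-drive count, so every other number below is a made-up placeholder; the point is just the shape of the estimate: restore time is roughly total data divided by aggregate read rate.

```python
# Back-of-envelope restore-time estimate for a backup server that is
# bottlenecked on spinning-disk read throughput. Only the 24-drive count
# comes from the provider's email; the other figures are hypothetical.

TOTAL_DATA_TB = 50             # hypothetical total client data to restore
DRIVES = 24                    # quoted by the provider
MB_PER_SEC_PER_DRIVE = 60      # hypothetical effective rate for many small
                               # cPanel files, well below a drive's sequential peak

aggregate_mb_per_sec = DRIVES * MB_PER_SEC_PER_DRIVE
total_mb = TOTAL_DATA_TB * 1024 * 1024
hours = total_mb / aggregate_mb_per_sec / 3600

print(f"Aggregate read rate: {aggregate_mb_per_sec} MB/s")
print(f"Estimated restore time: {hours:.1f} hours")   # ~10 hours with these placeholders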

 


 


That was 12 straight, no distractions, BRUTAL hours I had to spend with the family... Thanks a lot mug, I almost... :suicide:

"Make a mistake once and it becomes a lesson, make the same mistake twice and it becomes a choice."
 


Whew, I thought you missed a payment to El Jefe.

Seriously, you have a great site.  It appears that your provider is competent, understands the mistake, had a backup, is being transparent, has started a root cause analysis, and is taking ownership of the incident.  This board used to go down more frequently, so kudos to you and your provider for running a reliable site.


1 hour ago, LoboMan59 said:

That was 12 straight, no distractions, BRUTAL hours I had to spend with the family... Thanks a lot mug, I almost... :suicide:

Haha.  I feel you brother.  The ol' lady is going shopping with a friend today and I've been looking forward to some quality MWC Board time.


Heh, I'm like "It's Friday night before a big football weekend and THERE'S NO BOARD!!!!!!" <click, refresh, click, refresh>

:P

Thank you Mug, it wouldn't be so bad if MWCBoard weren't so good.



