Galera Cluster crash when changing database tables from MyISAM to InnoDB

As should have been expected, when trying to update 50 tables in the ds106.us WordPress multisite database from MyISAM to InnoDB using this method, the database crashed and I was unable to restart the cluster. When I tried to restart it, the following error was returned:

```
{"result":1,"actionkey":"1591453564812;restart;003ab41f6dfcc625eb2c4600c2a91640;sqldb;",
"error":"Failed to start \n ERROR! MySQL server PID file could not be found!\nParametr
key_buffer_size set to 256M\nParametr table_open_cache set to 256\nParametr
myisam_sort_buffer_size set to 341M\nParametr innodb_buffer_pool_size set to
512M\nStarting MySQL.................................. ERROR! The server quit without
updating PID file (/var/lib/mysql/mysqld.pid).\ncat /var/lib/mysql/mysqld.pid No such file
or directory",
"__info":{ ... client-side request metadata (request headers, user agent, and the
dashboard's JavaScript stack trace) omitted ... }}
```

Since restarting proved impossible, it seems the only options were to file a support ticket (a link was conveniently included inline with the error) or possibly restore from backup. Given this is still early days and there are no expectations, I'm wondering if this is something we need to find other solutions for, given we may very well have big WordPress Multisite installs on the Reclaim Cloud.
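
For reference, the conversion I was attempting was the standard per-table engine change; a minimal sketch of that kind of loop (the database name is a placeholder, not the exact commands I ran) would be:

```sh
# Convert every remaining MyISAM table in one database to InnoDB, one
# ALTER TABLE at a time. "wordpress" is a placeholder schema name.
DB=wordpress
for TBL in $(mysql -N -e "SELECT TABLE_NAME FROM information_schema.TABLES \
    WHERE TABLE_SCHEMA='$DB' AND ENGINE='MyISAM';"); do
  echo "Converting $TBL ..."
  mysql "$DB" -e "ALTER TABLE \`$TBL\` ENGINE=InnoDB;"
done
```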

I was able to get it back online by restoring a backup of the database containers. A bit of a hamfisted approach, as it's harder to know what happened. It's odd that the entire database server crashed from just changing the table types. The error "The server quit without updating PID file" is something I usually see when a database server stops suddenly without being shut down gracefully. I'm also reading at ERROR: MySQL server PID file could not be found on osX - flyweb production that it could be a permissions issue, but that seems less likely. It's hard to really know what happened there. It may make sense to set up a test environment and walk through the steps methodically to see if you can recreate the issue; then we could look at the test environment and attempt to work backwards from it. I did notice that after restoring the backup of the containers I also had to run database repairs on all the tables, and I wonder if perhaps they were not "clean" when you converted to InnoDB, complicating the matter.
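
For anyone retracing this later, the checks and repair pass described above look roughly like the following; the log and data paths are assumptions and vary by container image:

```sh
# Check the MySQL/MariaDB error log for the reason the server quit
# before it could write its PID file (log path varies by install).
tail -n 100 /var/log/mysql/error.log

# Rule out a permissions problem: the data directory should be owned
# by the mysql user.
ls -ld /var/lib/mysql

# After restoring from backup, check and repair every table. REPAIR
# only applies to MyISAM-family tables; InnoDB tables are checked but
# reported as not supporting repair.
mysqlcheck --all-databases --check
mysqlcheck --all-databases --repair
```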

It could be a repair issue for sure, but it was clear that when trying to convert those 50 tables the load on all three database nodes jumped. I'll create a test environment and give this a go; it will be a good opportunity to try out the clone feature. Will update here if I learn more, and hopefully it works.

As an update to this thread for others who run into similar issues with Galera clusters: we now have a guide at Restarting a Galera Cluster that covers how to safely restart and troubleshoot a failed cluster.
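
For quick reference, the general shape of that recovery (the guide is the authoritative version; paths and service commands here are assumptions that vary by install) is:

```sh
# On every node, inspect the Galera state file and find the node with
# the most advanced seqno and/or safe_to_bootstrap: 1.
cat /var/lib/mysql/grastate.dat

# Bootstrap the cluster from that node only (galera_new_cluster is the
# wrapper on systemd-based MariaDB installs; other setups use an
# init-script "bootstrap" argument instead).
galera_new_cluster

# Start the database normally on the remaining nodes so they rejoin
# and sync from the bootstrapped node.
systemctl start mariadb
```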
