UESPWiki:Administrator Noticeboard/Archives/Server-level Blocks

A UESPWiki – Your source for The Elder Scrolls since 1995
This is an archive of past UESPWiki:Administrator Noticeboard discussions. Do not edit the contents of this page, except for maintenance such as updating links.

Server-level Blocks

Daveh made a few changes yesterday that will make it possible for me to help out behind-the-scenes much more than previously. In particular, I now have the ability to do server-level blocks of individual IPs that are doing DoS-type attacks on the site (as discussed originally at Blocking Rogue IPs at the Server, and more recently in posts such as Forum bots). Basically, I can now do one-week blocks of HTTP requests from specific IPs to the site's individual servers (e.g., on content1, where most of the problems have been occurring).

Within the next week (as I catch up with my new capabilities) I'm planning on posting a bit more in the way of details and proposed guidelines for such blocks, both here and on the forums (which are the most affected by these attacks). But in the meantime, I wanted to let everyone know that this is now possible. So if anyone notices that the site (or in particular content1) has become non-responsive, post a message here. Just to be clear: this type of extreme block is not for vandalism or IPs causing problems with edits to the wiki; it's for IPs that are causing problems at the server_status level.
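The exact mechanism isn't spelled out above, but a server-level block of this kind can be as simple as a single firewall rule on the affected server. A sketch, assuming iptables is in use (the IP is illustrative):

```
# Hypothetical sketch -- drop all HTTP traffic from one IP on this server.
# "-I INPUT" puts the rule ahead of any accept rules; removing the block a
# week later is the same command with -D in place of -I.
iptables -I INPUT -s 203.0.113.45 -p tcp --dport 80 -j DROP
```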

And completely coincidentally, just as I'm finishing up this message, I get my first case. 198.24.6.168 has now been blocked on content1 after making some 30+ simultaneous forum requests, all of which were left to time out and make content1 stop responding. --NepheleTalk 17:14, 1 June 2009 (EDT)
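For spotting this kind of flood while it's happening, per-IP connection counts can be tallied straight from netstat output. A sketch, assuming standard `netstat -tn` formatting (the function name is made up):

```shell
# count_http_clients: read "netstat -tn"-style lines on stdin and print
# per-IP counts of established connections to local port 80, busiest first.
# Field 4 is the local address, field 5 the remote address (ip:port).
count_http_clients() {
  awk '$4 ~ /:80$/ && $6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' \
    | sort | uniq -c | sort -rn
}

# Typical use on the server:
#   netstat -tn | count_http_clients | head
```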

84.234.254.83 and 81.85.214.15 just finished with about 23 connections to content1 each - all for the forums. –RpehTCE 01:03, 3 June 2009 (EDT)
They don't quite fit the typical profile, though -- in particular, Google doesn't provide a long list of incriminating reports. Also, I've been trying to keep tabs on content1 over the last two hours, and haven't noticed either IP pop up again (so at least they're not creating dozens of timed-out connections at a time). For the moment I'm being a bit cautious about blocks (including unblocking the last IP, 198.24.6.168, after an hour). I'd like to work out a bit more of a notification system (including an announcement on the forums letting users know what's happening) -- just in case. Hopefully you won't have reason to curse my lack of action while I'm sleeping! --NepheleTalk 04:21, 3 June 2009 (EDT)
In case anyone else noticed that the site stopped responding for a couple of 10+ minute intervals just now... The culprit this time was not a rogue IP. Rather, the problem each time was that db1.uesp.net (our database server) stopped responding. Once I was able to even log in to db1, it was clear that the source of the problem was the database backup. One outage occurred while the backup was creating the uespwiki.sql dump; another 10+ minute outage occurred while the dhackforum database was being dumped. I can understand why backing up the uespwiki database would require a lock interfering with wiki-related database requests, but it's not so obvious why the dhackforum backup would mess up the wiki.
I remember rpeh pointing out regular site problems at this time of day at some point a while back, but I can't remember where and therefore I can't remember Daveh's response. Nevertheless, I'm wondering whether it might make sense to investigate ways to make the database backup a bit less intrusive. --NepheleTalk 04:47, 4 June 2009 (EDT)
I think we traced the slowdowns to the daily reboot of the web server, and I'd assumed that was still the cause of the problem, since it usually happens roughly as content1 does its cycle. I did notice that the whole server, including content2, went totally glacial earlier today. It seems odd that even a full backup would be locking everything: certainly MS-SQL can do a backup while still processing queries, and I'd be a bit surprised if MySQL can't do that these days. Could a combination of server restarts and backups be causing the problem? –RpehTCE 08:29, 4 June 2009 (EDT)
The db server was supposed to back up daily at 4am (not 4pm). I'll check its time and schedule and ensure things are set correctly. Currently content1 is set to restart the Apache server once a day. Note this is just a web server restart, which should be instant, and not an actual server reboot. This was set up a while ago to deal with the issue of lingering connections that wouldn't die. If restarting is causing an issue we can try disabling it.
The db backups might lock the entire database depending on how it's set up. Locking all databases makes it much easier to restore a slave from a backup. Unfortunately, there is little that can be done to speed up backups other than moving from a daily to a weekly backup schedule (which I've been thinking of doing anyway). MediaWiki seems to use a combination of MyISAM and InnoDB tables, so we can't use the faster mysqlhotcopy in place of the current mysqldump. -- Daveh 11:23, 4 June 2009 (EDT)
The backup starts at 4am, server time.
Thinking about it some more, I realized the problem isn't with any database locks. The problem is that during the backup some server resource is getting completely maxed out -- I'm guessing memory. When the problems were occurring, db1 was nearly completely unresponsive -- not just mysql, but all processes (it took a couple of minutes just to respond to a login request, and a basic command like 'ps -ef | grep dump' took a couple of minutes). Perhaps if the number of mysql requests being processed exceeds some number, then memory starts getting swapped, and everything just grinds to a halt? From the 'ps -ef' calls that I made during the processing, the number of mysqld processes dropped to about 30 while uespwiki.sql was being zipped, during which time the wiki was functional. But as it dumped dhackforum, the number of mysqld processes climbed to ~150, then ~230, then probably even higher (I switched to 'ps -ef | grep dump' at that point, so I don't know how many mysqld processes there were at the worst). I'm guessing the problem doesn't occur on every single dump (e.g., the six databases in between uespwiki and dhackforum seemed to be dumped without problems, and the previous night the site didn't lock up during the dumps). But perhaps some combination of the dump, a couple of time-consuming mysqld requests, plus the usual flood of quick requests can add up to be too much? In which case, it might also happen occasionally at other times, if there are enough difficult mysqld requests at the same time. --NepheleTalk 11:57, 4 June 2009 (EDT)
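A tiny watcher along the lines of the 'ps -ef' checks above could capture those process-count spikes with timestamps even when nobody can log in. A sketch -- the function name and log path are made up:

```shell
# count_mysqld: print a timestamped count of running mysqld processes.
# The [m] in the pattern keeps the grep process itself out of the count.
count_mysqld() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps -ef | grep -c '[m]ysqld')"
}

# e.g. leave running in a screen session on db1:
#   while true; do count_mysqld >> /var/log/mysqld-count.log; sleep 60; done
```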
I'm assuming that during the backup the dbs are write-locked, which means any db write request will be queued (or time out) until the db is unlocked. I'm also assuming that any read requests will still be honored. If enough write requests get in the queue it may cause a "traffic jam" effect which makes the problem worse. This is just an uneducated guess though.
I'm not sure why the dhackforum db would be causing issues. The uesp_net_wiki5 db is the largest at 4.2GB, so it's obviously going to take a while to back up, but dhackforum is barely 15MB (the uesp_net_phpbb db is almost 200MB by comparison).
One of my main objectives in the near future is to ensure the site's backup procedures are working correctly and efficiently, and this definitely comes under that. I still want to do backups off the master db, if not daily then at least once a week. All the important dbs are being mirrored live by backup1, but there is always a risk of a mirrored db becoming corrupted, so I don't want to rely solely on that. I'd like to get details on what Wikipedia (or another large MediaWiki site) does for backups, as this can't be an issue that only UESP is having (a 4GB db is tiny compared to what some people deal with). -- Daveh 15:00, 4 June 2009 (EDT)

(outdent) I don't know if our current version of MySQL supports them, but I see MySQL 5 supports incremental backups. It might be worth looking into that as a way of reducing backup times. Have a full backup once per week (say), then do incremental backups on the other days of the week. Each daily backup will only record the changes made since the previous day, so it should be substantially smaller and faster than a full backup. It would mean that in the event of a disaster recovery, things would take a little longer to get going, but since UESP isn't a critical system (heresy, I know!) that shouldn't be a big problem.

I'd definitely add my voice to Daveh's statement that mirroring isn't a backup system. There have been a couple of cases fairly recently of major websites being taken down by hackers because they hit one server and the destruction was mirrored over to their backups. Mirroring is a failover solution, not a backup solution. –RpehTCE 16:15, 4 June 2009 (EDT)
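The weekly-full-plus-incremental scheme could look something like the crontab fragment below. This is only a sketch: the paths, schedule, and binary-log base name are assumptions, and it relies on binary logging (log-bin) being enabled on the master.

```
# Sunday 04:00 -- full dump; --flush-logs closes the current binary log so
# the week's increments start from a known point.
0 4 * * 0   mysqldump --all-databases --flush-logs | gzip > /backup/full.sql.gz
# Mon-Sat 04:00 -- rotate the binary log and archive the completed segments;
# recovery = restore the full dump, then replay these with mysqlbinlog.
0 4 * * 1-6 mysqladmin flush-logs && cp /var/lib/mysql/db1-bin.0* /backup/binlogs/
```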

Various random observations:
  • At a first glance, various memory stats look OK (free -m, vmstat, ps aux [1]) -- at least on average, which is basically all I can see right now. If possible, I'll try to run them next time I notice a problem in progress.
  • There are a series of settings in our mysql.cnf for InnoDB tables, all of which are commented out. Should they be uncommented given that the wiki primarily uses InnoDB?
    • In particular, innodb_buffer_pool_size seems important [2]; with the default 8M value mysql might be forcing itself to do disk swapping even when there's no need.
  • There's a log file at /var/log/mysql/mysql.slow.log that logs queries which take longer than 1 sec to process -- but I can't read the contents ;) Might be worth checking to see some details on the slow queries and also to see when they occur.
  • One tip I've seen for mysql backups is to do daily backups from the mirror/slave: lock the mirror, then backup without affecting requests to the master database.
--NepheleTalk 17:48, 4 June 2009 (EDT)
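For reference, uncommenting those InnoDB settings might look like the my.cnf fragment below. The values are illustrative guesses for a 4GB machine, not tuned numbers -- and note that changing innodb_log_file_size requires shutting down mysqld and moving the old ib_logfile* files aside first.

```
[mysqld]
# Cache InnoDB data and indexes in RAM; the 8M default forces constant disk reads.
innodb_buffer_pool_size = 1024M
# Larger redo logs smooth out write bursts.
innodb_log_file_size = 64M
# Log anything slower than 1 second, matching the existing slow log.
log-slow-queries = /var/log/mysql/mysql.slow.log
long_query_time = 1
```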
  • We're currently running MySQL 4.0.27 and, looking at the incremental backup docs, it appears to support them. We're actually already doing the "log-bin" logging for mirroring purposes; I just never associated it with incremental backups. With a few changes we should be able to easily change the backup strategy on the master to take advantage of it.
  • Currently we're doing daily local backups of the databases on both db1 and backup1 (the db mirror). On backup1 we just stop the slave as Nephele suggested. The only issue with backing up on the mirror is the occasional mirror error which stops the mirroring until it is manually corrected (a couple of times a year so far). I've been meaning to change db1 to weekly backups for a while now anyway.
  • I did basic my.cnf tweaking when I set up db1, but I could have easily overlooked things like the innodb settings. Last I checked, db1 was doing fine with memory (4GB of RAM, most of it being used as cache), so tweaking these values shouldn't hurt things. Of course, I've also never actually done any explicit performance tests, so any increase/decrease in performance would have to be large to be noticed.
  • I've also been meaning to document the backup strategy so it can be critiqued and improved upon (there might be some existing documentation, but it's terribly out of date).
I'll look into things in more detail this weekend and make changes as needed. -- Daveh 20:59, 4 June 2009 (EDT)
Whatever is happening on db1 has been occurring multiple times per day, and not only when backups are taking place. Each time, both content1 and content2 stop responding for 5-10 minutes (can't even get server-status reports). It's possible that this issue has been going on for some time, and I'd just previously been assuming that any site slowdowns were specific to content1. However, I'm worried it might be a relatively new issue -- I hadn't previously noticed both content1 and content2 freezing up simultaneously.
Unfortunately, I'm having a very hard time getting more details on what's happening on db1 during one of these events, because while one is occurring it is always impossible for me to login to db1 -- the server won't even respond to a login request. When possible, I'll try to keep an open connection to db1 so that I'm not locked out next time it happens.
As for whether any changes recently might have prompted these problems, I can't see anything obvious. The single biggest recent performance change (fixing the file cache on content2) should have decreased db1 requests, not increased them. Similarly with the changes to the search engine -- and most of those took effect weeks ago; the most recent search updates didn't have any real performance implications. The #load and #save functions are the only changes that I can see that would increase the number of database requests, but they're being used in so few places that I can't imagine them having any widespread effect (in particular, #load is, as far as I know, only being used on two pretty obscure pages: Oblivion:Sandbox and Oblivion:Spell Effects -- and even if I purge one of those pages, I don't see any effect on db1).
In any case, I don't think we should be focussing solely on the backups as the cause/solution. I'm thinking they're just one factor that can contribute to or exacerbate the issue, but it's happening even when no backups are active. --NepheleTalk 17:43, 5 June 2009 (EDT)
I think it's related to images. Remember this discussion? I got around the problem by uploading direct to content1, which worked fine in most cases. Since then I've noticed several occasions where pages load but the images on them take significantly longer to appear. I even got significant slowdowns when deleting images. Something is definitely not right with images at the moment. –RpehTCE 02:15, 6 June 2009 (EDT)
I'm not so sure. I just had some ideas about images; I posted details below in a new section since the proposals there won't have any direct effect on db1 performance -- and definitely don't have anything to do with server-level blocks ;)
The $wgHashedUploadDirectory setting does not have any direct effect on db1: db1 doesn't even mount the problem image directory; it never tries to access any of the image files. So delays uploading/reading/altering image files should not cause db1 to lock up. And it's clear to me that in these recent situations, db1 is the source of the problem. content1 and content2 only stop responding to server-status requests because all of the http connection slots are filled -- by requests that are waiting for information from db1. Other than http, however, content1 and content2 are fully responsive -- I can login, command line response times are fast. On the other hand, db1 is practically dead during these events -- nothing is working on db1.
If there is a connection between db1 and images, it seems more likely to me that it's an indirect one. For example, past image problems triggering corruption in the database. Or perhaps a wiki process on content2 locks the images table in the database before making an image change, then goes to make the physical change to the file, gets hung up on that change, and therefore leaves a lock in place for much too long, as a result blocking other wiki requests or even deadlocking two requests. Of course, I'll cross my fingers and hope that there might be some such indirect connection; it's just that I'm not ready to abandon other lines of investigation under the assumption that images are the cause.
It would really help to know the contents of /var/log/mysql/mysql.slow.log. It's clear from the timestamp that the file is being updated every time one of these lockups occur. Without knowing its contents, we're really just making circumstantial guesses about which requests are causing problems.
In the meantime, I'm trying to see what status information I can get for mysql. The image tables are all OK -- but I did just find corruption in the logging table. I'll continue scanning table status, and figure out what to do about the logging table. --NepheleTalk 15:00, 6 June 2009 (EDT)
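For anyone repeating that scan later, the check-and-repair pass can be done in one step with mysqlcheck. A sketch only -- the database name is taken from the sizes discussed above, and REPAIR applies to MyISAM tables, not InnoDB:

```
# Check every table in the wiki database and auto-repair any MyISAM
# corruption found (e.g., a damaged logging table).
mysqlcheck --check --auto-repair uesp_net_wiki5
```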
My subjective impression so far is that since repairing the logging table, there haven't been any more freezes on db1, and the site seems to be working more smoothly again (even during our Saturday traffic peak). I also just repaired some errors in the forum's database tables; based upon some reports of strangeness on the forums, those errors may date back to May 13 or earlier, and therefore might have contributed to content1/forum slowdowns over the last month. So hopefully we've made some progress (* knock on wood *).
I'll still follow through on the image configuration/reorganization, and getting the wiki software upgraded. I also think it's still worth exploring whether some of the innodb parameters in my.cnf should be un-commented. Other than that, I think this might be going onto the back burner, at least as long as there aren't any more problems on db1. --NepheleTalk 01:52, 7 June 2009 (EDT)

(outdent) Just a quick note that I've changed the UESP wiki backups on db1 from daily to weekly. All other databases on db1 are backed up daily, as are all the slaved databases on backup1. I'll see if I can get the incremental backups working easily on content1 so it can still be an effective daily backup without having to lock the databases for 30 minutes. Or, similarly, only do a complete wiki database dump once a month with incrementals in between. -- Daveh 21:27, 8 June 2009 (EDT)