Killing Sidekiq on AWS with Ansible
We do a lot of batch data processing here, and much of that involves shoving crap into Sidekiq to be run in a threaded fashion. I'm not done with my own threading stuff yet by any means but, for now, Sidekiq is a superstar, so we're going to use it and bow west towards the Mike Perham altar of threaded awesomeness that is Sidekiq.
One issue we had recently was that we thought we had killed Sidekiq dead but, since it was running as a background service instead of a foreground task, it popped back up and kept happily eating data. This would have been fine except that we had a database dump and restore in progress, and the resurrected workers left data in an inconsistent state. That led to a second round of table dump / restore tangoing.
The first thing to understand here is that this was my mistake. I'm the moron who set it up as a background service, after all, and in the utter panic that accompanies disaster recovery, well, I forgot. I did my normal kill -9 dance and went on my merry way, ignoring the fact that Ubuntu would cackle gleefully as it re-launched the process. Sigh.
So there are at least two playbooks here:
- playbook_status_sidekiq.yml
- playbook_stop_sidekiq_with_prejudice.yml
The "_with_prejudice" refers to stopping Sidekiq with a -9 argument to kill. This tells Linux "really, truly, right DAMN NOW kill this process". Sometimes Sidekiq fails to stop; often because of the ruby code its executing and while that's fine there are lots of times that you just need it to go away. This is one of them.
Two additional related playbooks I could write (sketched just after this list) are:
- playbook_stop_sidekiq.yml
- playbook_service_stop_sidekiq.yml
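Neither of these exists yet, so treat the following as hypothetical sketches rather than working code. The graceful version would presumably be the same xargs pipeline you'll see below, minus the -9, and the service version would lean on Ansible's service module (assuming an init job actually named sidekiq):

---
# hypothetical: graceful stop; TERM lets Sidekiq wind down in-flight jobs
- name: stop_sidekiq
  shell: ps -ef | grep sidekiq | grep -v grep | awk '{print $2}' | xargs kill

# hypothetical: stop via the init system; assumes a service named "sidekiq"
- name: service_stop_sidekiq
  service: name=sidekiq state=stopped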
And here we go! The first thing we need is the ability to know what's going on in our cluster of boxes. This means the ability to know whether Sidekiq is running:
playbook:
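Modulo the role name, assume the same thin wrapper as the kill playbook further down:

#
# ansible-playbook -i inventories/production2 playbook_status_sidekiq.yml
#
---
- hosts: worker
  become: yes
  remote_user: ubuntu
  gather_facts: true
  roles:
    - { role: status_sidekiq, tags: sidekiq }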
role:
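You can read the role's two tasks straight out of the run below: a shell task that registers the ps output as out, and a debug task that prints out.stdout_lines. Something along these lines:

---
- name: display sidekiq's status
  shell: ps auwwx | grep sidekiq
  register: out

- name: view the output
  debug: var=out.stdout_lines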
output:
ansible-playbook -i inventories/production2 playbook_status_sidekiq.yml
PLAY [worker] ******************************************************************
TASK [setup] *******************************************************************
ok: [fiworker5]
ok: [fiworker6]
ok: [fiworker3]
ok: [fiworkerbig]
ok: [fiworker4]
ok: [fiworker8]
ok: [fiworker7]
ok: [fiworker9]
ok: [fiworker10]
ok: [fiworker11]
TASK [status_sidekiq : display sidekiq's status] *******************************
changed: [fiworker5]
changed: [fiworker3]
changed: [fiworkerbig]
changed: [fiworker4]
changed: [fiworker6]
changed: [fiworker7]
changed: [fiworker8]
changed: [fiworker10]
changed: [fiworker9]
changed: [fiworker11]
TASK [status_sidekiq : view the output] ****************************************
ok: [fiworker5] => {
"out.stdout_lines": [
"root 11103 0.0 0.0 4440 636 pts/5 S+ 12:43 0:00 /bin/sh -c ps auwwx | grep sidekiq",
"root 11105 0.0 0.0 10460 912 pts/5 S+ 12:43 0:00 grep sidekiq",
"ubuntu 17371 0.0 0.0 27920 5328 ? Ss Oct27 13:57 tmux new -s sidekiq",
"ubuntu 25815 0.6 6.5 1933952 1022068 pts/1 Sl+ Dec01 7:13 sidekiq 4.2.3 banks [0 of 25 busy] stopping "
]
}
ok: [fiworker6] => {
"out.stdout_lines": [
"ubuntu 17308 5.7 6.1 1948728 957972 pts/1 Sl+ Dec01 63:18 sidekiq 4.2.3 banks [25 of 25 busy] ",
"root 18126 0.0 0.0 4440 636 pts/5 S+ 12:43 0:00 /bin/sh -c ps auwwx | grep sidekiq",
"root 18128 0.0 0.0 10460 912 pts/5 S+ 12:43 0:00 grep sidekiq",
"ubuntu 23040 0.0 0.0 31808 9200 ? Ss Oct27 15:12 tmux new -s sidekiq"
]
}
As you can see in the first bit of output, fiworker5, I missed one when I manually shut stuff down yesterday. Oops. And this brings us to our next playbook:
playbook:
#
# MONDAY ansible-playbook -i ec2.py playbook_stop_sidekiq_with_prejudice.yml
# ansible-playbook -i inventories/production2 playbook_stop_sidekiq_with_prejudice.yml
#
---
- hosts: worker
  become: yes
  remote_user: ubuntu
  gather_facts: true
  roles:
    - { role: kill_sidekiq_with_prejudice, tags: sidekiq }
role:
---
# grab every PID whose command line mentions sidekiq and hit it with SIGKILL;
# note this also matches (and kills) the "tmux new -s sidekiq" session
- name: kill_sidekiq_with_prejudice
  shell: ps -ef | grep sidekiq | grep -v grep | awk '{print $2}' | xargs kill -9
output:
(fiworker5 shut down on its own before this ran; sigh)
ansible-playbook -i inventories/production2 playbook_stop_sidekiq_with_prejudice.yml
PLAY [worker] ******************************************************************
TASK [setup] *******************************************************************
ok: [fiworker3]
ok: [fiworkerbig]
ok: [fiworker5]
ok: [fiworker4]
ok: [fiworker6]
ok: [fiworker7]
ok: [fiworker9]
ok: [fiworker10]
ok: [fiworker8]
ok: [fiworker11]
TASK [kill_sidekiq_with_prejudice : kill_sidekiq_with_prejudice] ***************
changed: [fiworker5]
changed: [fiworker6]
changed: [fiworker3]
changed: [fiworker4]
changed: [fiworkerbig]
changed: [fiworker7]
changed: [fiworker9]
changed: [fiworker8]
changed: [fiworker10]
changed: [fiworker11]
PLAY RECAP *********************************************************************
fiworker10 : ok=2 changed=1 unreachable=0 failed=0
fiworker11 : ok=2 changed=1 unreachable=0 failed=0
fiworker3 : ok=2 changed=1 unreachable=0 failed=0
fiworker4 : ok=2 changed=1 unreachable=0 failed=0
fiworker5 : ok=2 changed=1 unreachable=0 failed=0
fiworker6 : ok=2 changed=1 unreachable=0 failed=0
fiworker7 : ok=2 changed=1 unreachable=0 failed=0
fiworker8 : ok=2 changed=1 unreachable=0 failed=0
fiworker9 : ok=2 changed=1 unreachable=0 failed=0
fiworkerbig : ok=2 changed=1 unreachable=0 failed=0
Given my previous praise of pkill, readers may be wondering why I used the old xargs trick. Simply put, I couldn't make pkill work. There are posts on Google about the topic but I didn't have time to dig into it; I knew the xargs approach had to work, so I went with it. Honestly, I don't understand why Ansible doesn't have a process module; it just seems so absolutely needed.
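My best guess, unverified, at what went wrong: by default pkill matches only the short process name, and since Sidekiq retitles its command line the underlying executable is still just ruby, so a bare pkill sidekiq finds nothing. Matching the full command line should do the trick:

pkill -9 -f sidekiq   # -f matches the full command line, which is where the
                      # retitled "sidekiq 4.2.3 banks [...]" string lives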
As with my previous example, if there is interest I'll publish the dynamic inventory version of this on Monday.