Rundeck との闘争

ウチの社では、Rundeck をジョブ管理サーバーとして使用している。

各バッチサーバー上のバッチプログラムを、定周期で起動させるのが仕事なのだが

何故かRundeckを運用するサーバーの負荷が右肩上がりで、しかも定期的にCPUが高止まりする、という困った状態に。

これが何とか解消できた、という話。

そもそもRundeckとは

rundeck.org

ウチらが取ってる構成(現状)

以下のような感じ。

f:id:ndx:20160212212525p:plain

  • バックエンドDBをデフォルトのH2 DatabaseからMySQLに変更
    • これやらないと、WebのGUIがとても表示が遅く使い物にならなかった・・・
  • 待機系によるバックアップ
    • Rundeck待機機に定義ファイル等を定期的にrsync
    • MySQLもスレーブ機投入

そしてMySQLが過負荷に

ただジョブを定期的に起動させるだけなのだが・・・典型的にヤバい感じに。

f:id:ndx:20160212214430p:plain

グラフを見ればわかるのだが、負荷が落ち着く期間もあるものの、傾向的には右肩上がり。

ジョブは総数70もないので、今後が思いやられていた。

slow.log を見てみると

こんな感じのが大量に出ていた。(これに早く気付いていれば。)

select commands0_.workflow_commands_id as workflow1_15_0_, commands0_.workflow_step_id as workflow2_17_0_, commands0_.commands_idx as commands3_0_, workflowst1_.id as id1_16_1_, workflowst1_.version as version2_16_1_, workflowst1_.description as descript3_16_1_, workflowst1_.error_handler_id as error_ha4_16_1_, workflowst1_.keepgoing_on_success as keepgoin5_16_1_, workflowst1_.adhoc_execution as adhoc_ex7_16_1_, workflowst1_.adhoc_filepath as adhoc_fi8_16_1_, workflowst1_.adhoc_local_string as adhoc_lo9_16_1_, workflowst1_.adhoc_remote_string as adhoc_r10_16_1_, workflowst1_.arg_string as arg_str11_16_1_, workflowst1_.file_extension as file_ex12_16_1_, workflowst1_.interpreter_args_quoted as interpr13_16_1_, workflowst1_.script_interpreter as script_14_16_1_, workflowst1_.json_data as json_da15_16_1_, workflowst1_.node_step as node_st16_16_1_, workflowst1_.type as type17_16_1_, workflowst1_.job_group as job_gro18_16_1_, workflowst1_.job_name as job_nam19_16_1_, workflowst1_.node_filter as node_fi20_16_1_, workflowst1_.node_keepgoing as node_ke21_16_1_, workflowst1_.node_rank_attribute as node_ra22_16_1_, workflowst1_.node_rank_order_ascending as node_ra23_16_1_, workflowst1_.node_threadcount as node_th24_16_1_, workflowst1_.class as class6_16_1_ from workflow_workflow_step commands0_ inner join workflow_step workflowst1_ on commands0_.workflow_step_id=workflowst1_.id where commands0_.workflow_commands_id=2061677;

workflow_workflow_step なるテーブルの定義を見てみると

| workflow_workflow_step | CREATE TABLE `workflow_workflow_step` (
  `workflow_commands_id` bigint(20) DEFAULT NULL,
  `workflow_step_id` bigint(20) DEFAULT NULL,
  `commands_idx` int(11) DEFAULT NULL,
  KEY `FK_9pkey6k5fdo6worgquakkh7d1` (`workflow_step_id`),
  CONSTRAINT `FK_9pkey6k5fdo6worgquakkh7d1` FOREIGN KEY (`workflow_step_id`) REFERENCES `workflow_step` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=sjis |

indexらしきものがない!(PKさえない!)

explain とってみると、案の定。

> explain select commands0_.workflow_commands_id as workflow1_15_0_, commands0_.workflow_step_id as workflow2_17_0_, commands0_.commands_idx as commands3_0_, workflowst1_.id as id1_16_1_, workflowst1_.version as version2_16_1_, workflowst1_.description as descript3_16_1_, workflowst1_.error_handler_id as error_ha4_16_1_, workflowst1_.keepgoing_on_success as keepgoin5_16_1_, workflowst1_.adhoc_execution as adhoc_ex7_16_1_, workflowst1_.adhoc_filepath as adhoc_fi8_16_1_, workflowst1_.adhoc_local_string as adhoc_lo9_16_1_, workflowst1_.adhoc_remote_string as adhoc_r10_16_1_, workflowst1_.arg_string as arg_str11_16_1_, workflowst1_.file_extension as file_ex12_16_1_, workflowst1_.interpreter_args_quoted as interpr13_16_1_, workflowst1_.script_interpreter as script_14_16_1_, workflowst1_.json_data as json_da15_16_1_, workflowst1_.node_step as node_st16_16_1_, workflowst1_.type as type17_16_1_, workflowst1_.job_group as job_gro18_16_1_, workflowst1_.job_name as job_nam19_16_1_, workflowst1_.node_filter as node_fi20_16_1_, workflowst1_.node_keepgoing as node_ke21_16_1_, workflowst1_.node_rank_attribute as node_ra22_16_1_, workflowst1_.node_rank_order_ascending as node_ra23_16_1_, workflowst1_.node_threadcount as node_th24_16_1_, workflowst1_.class as class6_16_1_ from workflow_workflow_step commands0_ inner join workflow_step workflowst1_ on commands0_.workflow_step_id=workflowst1_.id where commands0_.workflow_commands_id=2061677;
+----+-------------+--------------+------+------------------------------+------------------------------+---------+-------------------------+---------+-------------+
| id | select_type | table        | type | possible_keys                | key                          | key_len | ref                     | rows    | Extra       |
+----+-------------+--------------+------+------------------------------+------------------------------+---------+-------------------------+---------+-------------+
|  1 | SIMPLE      | workflowst1_ | ALL  | PRIMARY                      | NULL                         | NULL    | NULL                    | 2008680 |             |
|  1 | SIMPLE      | commands0_   | ref  | FK_9pkey6k5fdo6worgquakkh7d1 | FK_9pkey6k5fdo6worgquakkh7d1 | 9       | rundeck.workflowst1_.id |       1 | Using where |
+----+-------------+--------------+------+------------------------------+------------------------------+---------+-------------------------+---------+-------------+

index を張ってみる

> ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id(workflow_commands_id);

張った直後のリソース

f:id:ndx:20160212225524p:plain

f:id:ndx:20160212225542p:plain

劇的に収まったようだ。。。ε-(´。`;)ホッ

後は、ジョブが集中する月初、0時0分0秒の状態を見て問題なければこれで良さそう。