Description of Issue/Question
See also: https://groups.google.com/forum/#!topic/salt-users/tAm-c-Cs5ss
We are using the sqs_events engine but we are losing messages.
When Salt is starting, sqs_events engine is ready and starts to send events before the reactor engines is properly started.
I also test to sleep a bit at the beginning of the start function of the sqs_events engine.
With this, messages are not lost at startup anymore but we are still losing messages when restarting so it seems that when Salt is stopping, events may be sent when reactors are already stopped.
Test setup
For testing purpose, I've just created a reactor which creates empty files with filenames retrieved for the SQS queue and I'm sending messages in the queue manually.
Master configuration (relevant part):
log_level: debug
engines:
- sqs_events:
queue: salt-queue
sqs:
message_format: json
sqs.region: eu-central-1
reactor:
- salt/engine/sqs:
- salt://test_sqs_events.sls
states/test_sqs_events.sls
#!py
# vim:ft=python
import json
import os
def run():
if not os.path.isdir("/tmp/test_salt"):
os.mkdir("/tmp/test_salt")
message = json.loads(data['message'])
open("/tmp/test_salt/{}".format(message["filename"]), 'a').close()
return {}
Steps to Reproduce Issue
The first test is to stop salt-master, then to run
for i in $(seq 10); do aws sqs send-message --queue-url https://eu-central-1.queue.amazonaws.com/ACCOUNT_ID/salt-queue --message-body "{\"filename\": $i}"; done
Then to start salt-master
No messages will be processed by reactors:
$ ls /tmp/test_salt
ls: cannot access '/tmp/test_salt': No such file or directory
Logs (relevant part)
[...]
2018-08-29 15:55:03,584 [salt.utils.lazy :100 ][DEBUG ][2694] LazyLoaded sqs_events.start
2018-08-29 15:55:03,584 [salt.engines :53 ][INFO ][2694] Starting Engine salt.engines.Engine(salt.loaded.int.engines.sqs_events)
2018-08-29 15:55:03,586 [salt.utils.process:418 ][DEBUG ][2694] Started 'salt.engines.Engine(salt.loaded.int.engines.sqs_events)' with pid 2709
2018-08-29 15:55:03,589 [salt.utils.lazy :100 ][DEBUG ][2694] LazyLoaded reactor.start
2018-08-29 15:55:03,590 [salt.engines :53 ][INFO ][2694] Starting Engine salt.engines.Engine(salt.loaded.int.engines.reactor)
2018-08-29 15:55:03,592 [salt.utils.process:418 ][DEBUG ][2694] Started 'salt.engines.Engine(salt.loaded.int.engines.reactor)' with pid 2710
2018-08-29 15:55:03,598 [salt.utils.lazy :100 ][DEBUG ][2709] LazyLoaded sqs_events.start
2018-08-29 15:55:03,600 [salt.utils.event :321 ][DEBUG ][2709] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:03,600 [salt.utils.event :324 ][DEBUG ][2709] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:03,631 [boto :383 ][DEBUG ][2709] Retrieving credentials from metadata server.
2018-08-29 15:55:03,621 [salt.utils.lazy :100 ][DEBUG ][2710] LazyLoaded reactor.start
2018-08-29 15:55:03,638 [salt.config :2117][DEBUG ][2710] Reading configuration from /etc/salt/master
2018-08-29 15:55:03,646 [boto :404 ][DEBUG ][2709] Retrieved credentials will expire in 5:06:27.353955 at: 2018-08-29T21:01:31Z
2018-08-29 15:55:03,724 [boto :897 ][DEBUG ][2709] Method: GET
2018-08-29 15:55:03,724 [boto :898 ][DEBUG ][2709] Path: /ACCOUNT_ID/salt-queue
2018-08-29 15:55:03,725 [boto :899 ][DEBUG ][2709] Data:
2018-08-29 15:55:03,726 [boto :900 ][DEBUG ][2709] Headers: {}
2018-08-29 15:55:03,726 [boto :901 ][DEBUG ][2709] Host: eu-central-1.queue.amazonaws.com
2018-08-29 15:55:03,727 [boto :902 ][DEBUG ][2709] Port: 443
2018-08-29 15:55:03,750 [boto :1172][DEBUG ][2709] <?xml version="1.0"?><ReceiveMessageResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><ReceiveMessageResult><Message><MessageId>9b3789ec-cba0-4e5e-a3e3-6ca0d4ceda13</MessageId><ReceiptHandle>...</ReceiptHandle><MD5OfBody>aea163445e8c67e6d22ded8aa8dab526</MD5OfBody><Body>{"filename": 6}</Body></Message></ReceiveMessageResult><ResponseMetadata><RequestId>7b63582f-dbc4-5af7-be4d-0c2f1fd39f06</RequestId></ResponseMetadata></ReceiveMessageResponse>
2018-08-29 15:55:03,751 [salt.transport.ipc:254 ][DEBUG ][2709] Initializing new IPCClient for path: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:03,763 [salt.utils.event :739 ][DEBUG ][2709] Sending event: tag = salt/engine/sqs; data = {u'message': u'{"filename": 6}', u'_stamp': '2018-08-29T15:55:03.763372'}
[Idem with all messages]
2018-08-29 15:55:03,961 [salt.config :2117][DEBUG ][2710] Reading configuration from /etc/salt/master
[Reading all configuration files]
2018-08-29 15:55:06,083 [salt.loaded.int.grains.core:598 ][DEBUG ][2710] Please install 'virt-what' to improve results of the 'virtual' grain.
2018-08-29 15:55:06,281 [salt.loaded.int.grains.extra:70 ][DEBUG ][2710] Loading static grains from /etc/salt/grains
2018-08-29 15:55:06,497 [salt.utils.lazy :100 ][DEBUG ][2710] LazyLoaded jinja.render
2018-08-29 15:55:06,502 [salt.utils.lazy :100 ][DEBUG ][2710] LazyLoaded yaml.render
2018-08-29 15:55:06,504 [salt.utils.event :321 ][DEBUG ][2710] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:06,506 [salt.utils.event :324 ][DEBUG ][2710] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2018-08-29 15:55:06,506 [salt.transport.ipc:254 ][DEBUG ][2710] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
2018-08-29 15:55:12,693 [salt.utils.reactor:109 ][DEBUG ][2710] Gathering reactors for tag salt/auth
2018-08-29 15:55:22,707 [salt.utils.reactor:109 ][DEBUG ][2710] Gathering reactors for tag salt/auth
[...]
Restart
If we test to restart salt-master during message sending:
for i in $(seq 20); do aws sqs send-message --queue-url https://eu-central-1.queue.amazonaws.com/ACCOUNT_ID/salt-queue --message-body "{\"filename\": $i}"; done
Then just after messages started to be sent:
systemctl restart salt-master
We can see that some messages have been lost:
$ ls -1 /tmp/test_salt/ | sort -n
1
2
3
15
16
17
18
19
20
Versions Report
(also tested on Salt 2016.11.X)
Salt Version:
Salt: 2018.3.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.5.3
docker-py: Not Installed
gitdb: 2.0.0
gitpython: 2.1.1
ioflo: Not Installed
Jinja2: 2.9.4
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: 1.3.7
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.13 (default, Nov 24 2017, 17:33:09)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: 2.0.1
timelib: Not Installed
Tornado: 4.4.3
ZMQ: 4.2.1
System Versions:
dist: debian 9.5
locale: UTF-8
machine: x86_64
release: 4.9.0-8-amd64
system: Linux
version: debian 9.5
Description of Issue/Question
See also: https://groups.google.com/forum/#!topic/salt-users/tAm-c-Cs5ss
We are using the sqs_events engine but we are losing messages.
When Salt is starting, sqs_events engine is ready and starts to send events before the reactor engines is properly started.
I also test to sleep a bit at the beginning of the
startfunction of the sqs_events engine.With this, messages are not lost at startup anymore but we are still losing messages when restarting so it seems that when Salt is stopping, events may be sent when reactors are already stopped.
Test setup
For testing purpose, I've just created a reactor which creates empty files with filenames retrieved for the SQS queue and I'm sending messages in the queue manually.
Master configuration (relevant part):
states/test_sqs_events.sls
Steps to Reproduce Issue
The first test is to stop salt-master, then to run
Then to start salt-master
No messages will be processed by reactors:
Logs (relevant part)
Restart
If we test to restart salt-master during message sending:
Then just after messages started to be sent:
We can see that some messages have been lost:
Versions Report
(also tested on Salt 2016.11.X)