Deadlock! #3

@zh217

Hi,

I have found a situation in which zmq-async deadlocks. The following code reproduces it:

(ns debug.async.core
  (:import [org.zeromq ZMQ$Socket])
  (:require [com.keminglabs.zmq-async.core :refer [register-socket!]]
            [clojure.core.async :refer [chan close! go >! <!]]))

(defn deadlock!
  []
  (let [write-in (chan)
        read-in (chan)
        read-out (chan)]
    (register-socket! {:in write-in :socket-type :push
                       :configurator (fn [^ZMQ$Socket s]
                                       (.bind s "tcp://127.0.0.1:19999"))})
    (register-socket! {:in read-in :out read-out :socket-type :pull
                       :configurator (fn [^ZMQ$Socket s]
                                       (.connect s "tcp://127.0.0.1:19999"))})

    ;; This go block and the next one repeatedly send messages from one socket to the other
    (go
      (loop [c 0]
        (>! write-in (str "send " c))
        (recur (inc c))))
    (go
      (loop []
        (println (String. (<! read-out)))
        (recur)))

    ;; This loop opens and closes sockets repeatedly
    (loop [c 0]
      (println "open-close " c)
      (let [in (chan)]
        (register-socket! {:in in :socket-type :pub
                           ;; No configuration needed: the socket is closed right away
                           :configurator (fn [^ZMQ$Socket s])})
        (close! in)
        (recur (inc c))))))

I also have a rough idea of why the deadlock happens. In the code above there are a lot of incoming messages from the ZMQ sockets, so the zmq-loop is expected to spend a lot of time on this line:

(>!! async-control-chan [incoming-sock-id msg])

At the same time, there are also lots of requests to open and close sockets, so the async-loop will spend a lot of time putting commands for the ZMQ sockets onto the queue:

  (.put queue msg)

Once the queue is full, the put onto the queue blocks. If, at the same time, the zmq-loop wants to put into async-control-chan as above, that put blocks too, because the async-loop is still stuck trying to put its message onto the queue and so cannot take from the channel. Each loop is waiting for the other: we have a deadlock.
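The circular wait described above can be sketched in isolation. This is a minimal illustration, not zmq-async's actual code: the two `ArrayBlockingQueue`s are hypothetical stand-ins for the command queue and for async-control-chan, and timed `.offer` calls replace the blocking puts so that the example terminates instead of hanging.

```clojure
(ns deadlock-sketch
  (:import [java.util.concurrent ArrayBlockingQueue TimeUnit]))

(defn stuck-producers?
  "Fills two bounded queues, then lets each side try to put into the
  other side's queue without anyone taking. Returns true when neither
  put succeeds within timeout-ms, i.e. both sides are stuck."
  [timeout-ms]
  (let [queue   (ArrayBlockingQueue. 1) ; stand-in for the command queue
        control (ArrayBlockingQueue. 1) ; stand-in for async-control-chan
        _       (.put queue :cmd-0)     ; both are already full
        _       (.put control :msg-0)
        ;; async-loop stand-in: must enqueue a command before it would
        ;; get back to taking from `control`
        async-side (future (.offer queue :cmd-1
                                   (long timeout-ms) TimeUnit/MILLISECONDS))
        ;; zmq-loop stand-in: must hand over a message before it would
        ;; get back to taking from `queue`
        zmq-side   (future (.offer control :msg-1
                                   (long timeout-ms) TimeUnit/MILLISECONDS))]
    (and (not @async-side) (not @zmq-side))))
```

Calling `(stuck-producers? 100)` reports that both puts timed out. With untimed puts, as in the situation described above, both threads would wait forever.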

I have also found that neither of the following helps: increasing the queue size, or giving async-control-chan a buffer (as long as it is not a buffer that drops messages). Doing both only delays the deadlock by a tiny bit.
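For reference, a buffer that drops messages behaves as in the sketch below. It only illustrates core.async's `dropping-buffer`; the function name is hypothetical, and this is not a proposed fix, since dropped messages are lost.

```clojure
(ns dropping-buffer-sketch
  (:require [clojure.core.async :as async
             :refer [chan close! dropping-buffer >!! <!!]]))

(defn put-all-without-blocking
  "Puts every message onto a channel backed by a dropping buffer of
  size n-buffer, with no consumer running, then returns whatever the
  buffer retained."
  [n-buffer msgs]
  (let [ch (chan (dropping-buffer n-buffer))]
    ;; None of these puts block, even though nothing is taking:
    ;; once the buffer is full, new values are silently discarded.
    (doseq [m msgs]
      (>!! ch m))
    (close! ch)
    ;; Drain the values that survived into a vector.
    (<!! (async/into [] ch))))
```

For example, `(put-all-without-blocking 2 (range 5))` keeps only the first two values; the other three are dropped, which is why such a buffer avoids blocking at the cost of losing messages.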

Changing poll and alts!! to be deterministic, in the sense that they always take the control channel/socket first when it is available, together with a sufficient buffer, may avoid this deadlock. I will investigate further.
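On the core.async side, deterministic selection is available through the `:priority true` option of `alts!!`, which tries the operations in the order given rather than picking at random. A minimal sketch follows; the function and channel names are hypothetical and this is not the library's code.

```clojure
(ns priority-alts-sketch
  (:require [clojure.core.async :refer [chan >!! alts!!]]))

(defn take-control-first
  "Takes from whichever channel is ready, always preferring `control`
  over `data` when both have a value available."
  [control data]
  (alts!! [control data] :priority true))

(defn demo
  "Makes both channels ready, then returns the value chosen first."
  []
  (let [control (chan 1)
        data    (chan 1)]
    (>!! control :close-socket)
    (>!! data :payload)
    (let [[v port] (take-control-first control data)]
      {:value v :from-control? (= port control)})))
```

Here `(demo)` picks the control message even though `data` is ready too. Whether this ordering, combined with enough buffering, is sufficient to avoid the deadlock is exactly what remains to be investigated.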

I guess people have not run into this deadlock because it rarely happens in real situations: we don't usually create and close sockets at this speed.
