Multithreaded, concurrent and intensive use often cause stale containers to remain up forever #808

mad-p · 2018-12-26T06:30:00Z

Zalenium Image Version(s): 3.14.0g ( also reproducible with 3.14.0c )
Docker Version: 18.09.0, build 4d60db4
If using docker-compose, version: 1.23.1, build b02f1306
OS: OSX High Sierra ( also reproducible on CentOS 7.2.1511 and latest Arch Linux )
Docker Command to start Zalenium: Executing through docker-compose.yml

Expected Behavior

Stale containers will get removed after idle timeout.
Thread AutoStartProxyPoolPoller works as expected.

Actual Behavior

Stale containers won't get removed and remain up forever even after idle timeout.
Thread AutoStartProxyPoolPoller hangs forever.
Note that those containers can still be reused as normal.

Minimal code to reproduce the problem

docker-compose.yml

--desiredContainers 0 lets us tell whether the idle timeout is working: the number of node containers stays above zero when the problem occurs.
--debugEnabled true lets us tell whether the 'Checking containers...' debug log is still being printed to standard output, i.e. whether thread AutoStartProxyPoolPoller is still running.

version: "2"

services:
  zalenium:
    image: dosel/zalenium:3.14.0g
    container_name: zalenium
    privileged: true
    tty: true
    ports:
      - "4444:4444"
    volumes:
      - /tmp/videos:/home/seluser/videos/
      - /var/run/docker.sock:/var/run/docker.sock
    command: >
      start
        --seleniumImageName elgalu/selenium:3.14.0-p16
        --desiredContainers 0
        --maxDockerSeleniumContainers 10
        --debugEnabled true
    environment:
      - PULL_SELENIUM_IMAGE=true
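
The second signal can also be checked without watching the console, by looking at the timestamp of the most recent 'Checking containers...' line. The sketch below is only an illustration and not part of Zalenium. It assumes the log was captured with docker logs --timestamps, so that every line starts with an RFC 3339 timestamp. The helper names and the 60-second max_gap threshold are hypothetical choices.

```ruby
require 'time'

# Debug message printed by the poller, as quoted in this issue.
POLL_MARKER = 'Checking containers...'

# Returns the Time of the newest poller line, or nil if none is found
# or its timestamp cannot be parsed.
# Assumption: each line begins with the timestamp added by
# `docker logs --timestamps`.
def last_poll_time(log_text)
  line = log_text.each_line.select { |l| l.include?(POLL_MARKER) }.last
  return nil unless line

  Time.iso8601(line.split(' ', 2).first)
rescue ArgumentError
  nil
end

# True when no poller line was seen, or the newest one is older than
# max_gap seconds (hypothetical threshold; tune it to your setup).
def poller_stalled?(log_text, now: Time.now, max_gap: 60)
  last = last_poll_time(log_text)
  last.nil? || (now - last) > max_gap
end

# Possible usage, assuming the container is named "zalenium" as in the
# compose file above:
#   logs = `docker logs --timestamps --since 5m zalenium 2>&1`
#   puts(poller_stalled?(logs) ? 'poller looks stalled' : 'poller is alive')
```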

Ruby script

Run this Ruby script in the background in a few threads concurrently for several minutes.
The desired number of threads depends on the number of CPU cores.
Running with two threads and four CPU cores is a good example.
A shorter idleTimeout, i.e. a higher frequency of stale containers being removed, seems to reproduce the problem in less time.
In a "real" UI testing environment where idleTimeout defaults to 90 seconds, each UI test takes about a dozen seconds, and about four tests run concurrently, it usually takes a couple of hours for the problem to occur.

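require 'selenium-webdriver' # version 3.14.0 ( also reproducible with version 2.53.4 )

# Opens one remote Chrome session, holds it for 10 seconds, then quits.
# Named run_session because a top-level "exec" would shadow Kernel#exec.
def run_session
  caps = Selenium::WebDriver::Remote::Capabilities.chrome
  caps[:idleTimeout] = 10 # seconds; a short idle timeout makes the problem show up sooner
  # Replace ${YOUR_ZALENIUM_HOST} with the host name of your Zalenium hub.
  driver = Selenium::WebDriver.for :remote, url: 'http://${YOUR_ZALENIUM_HOST}:4444/wd/hub', desired_capabilities: caps
  sleep 10
ensure
  driver&.quit
end

# Keep creating sessions forever; print any error and carry on.
loop do
  begin
    run_session
  rescue => e
    puts e
  end
end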
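
Instead of starting the script several times, the "few threads" setup can be approximated inside one Ruby process. The sketch below is an assumption-laden illustration, not the exact harness used for this report: run_workers and its parameters are hypothetical names, and the block passed in stands for one WebDriver session like the one in the script above.

```ruby
# Runs the given block repeatedly in thread_count threads until
# duration seconds have passed. Errors raised by the block are
# collected and returned instead of stopping the worker.
def run_workers(thread_count:, duration:, &session)
  deadline = Time.now + duration
  errors = Queue.new
  threads = Array.new(thread_count) do |index|
    Thread.new do
      while Time.now < deadline
        begin
          session.call(index)
        rescue StandardError => e
          errors << e
        end
      end
    end
  end
  threads.each(&:join)
  Array.new(errors.size) { errors.pop }
end

# Possible usage: two workers for five minutes, each iteration being
# one create/sleep/quit cycle against the hub.
#   errors = run_workers(thread_count: 2, duration: 300) { |_i| ... }
#   errors.each { |e| puts e }
```

Collecting the errors rather than printing them from each thread keeps the output readable when several workers fail at once.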

Java thread dump taken after the problem occurs.

https://gist.github.com/mad-p/6082c9ee556ad84d1304be1c9f91b562

The Java thread dump was taken as follows.

docker exec -it ${YOUR_ZALENIUM_CONTAINER_NAME} bash
seluser@zalenium:~$ ps aux | grep java
seluser@zalenium:~$ sudo kill -3 ${YOUR_PROCESS_ID_OF_JAVA}
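
Once the dump text has been saved, the entry for the poller thread can be pulled out with a few lines of Ruby. This is a rough sketch that assumes the usual HotSpot dump layout (a quoted thread name on the header line, a java.lang.Thread.State line, stack frames starting with "at", and blank lines between threads) and that the thread name starts with AutoStartProxyPoolPoller. The helper names are hypothetical.

```ruby
# Returns the dump entry whose header starts with the quoted thread name,
# or nil. Entries are assumed to be separated by blank lines.
def thread_block(dump_text, thread_name)
  dump_text.split(/\r?\n\s*\r?\n/).find do |block|
    block.lstrip.start_with?("\"#{thread_name}")
  end
end

# Extracts the state (e.g. RUNNABLE, WAITING) from one dump entry.
def thread_state(block)
  block && block[/java\.lang\.Thread\.State: (\w+)/, 1]
end

# Extracts the topmost stack frame from one dump entry.
def top_frame(block)
  block && block[/^\s*at (.+)$/, 1]
end

# Possible usage, with dump.txt being a hypothetical file name:
#   block = thread_block(File.read('dump.txt'), 'AutoStartProxyPoolPoller')
#   puts thread_state(block), top_frame(block)
```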

Root cause

The root cause seems to be the issue below.

Properly close the Apache response so that connections can be reused
eclipse-ee4j/jersey#3861
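
To illustrate why an unclosed response matters, here is a toy model of a fixed-size connection pool. It is not Jersey or Apache HttpClient code. TinyPool and run_requests are hypothetical names, and the model only shows the general effect: a connection that is never handed back cannot be reused, so the pool eventually has nothing left to lease.

```ruby
# A tiny fixed-size pool. lease returns a connection token, or nil
# when the pool is empty (a real client might block here instead).
class TinyPool
  def initialize(size)
    @free = Queue.new
    size.times { |i| @free << "conn-#{i}" }
  end

  def lease
    @free.pop(true) # non-blocking pop raises ThreadError when empty
  rescue ThreadError
    nil
  end

  def release(conn)
    @free << conn
  end
end

# Issues count requests and returns how many obtained a connection.
# close_response: false models a response that is never closed, so
# its connection is never returned to the pool.
def run_requests(pool, count, close_response:)
  count.times.count do
    conn = pool.lease
    next false unless conn

    pool.release(conn) if close_response
    true
  end
end

puts run_requests(TinyPool.new(2), 5, close_response: true)  # prints 5
puts run_requests(TinyPool.new(2), 5, close_response: false) # prints 2
```

In the leaking case only the first two requests succeed. A caller that waits for a free connection instead of failing would hang at the third request, which resembles what the thread dump shows for AutoStartProxyPoolPoller.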

Tentative workaround

Use a patched version of jersey-apache-connector.

git clone https://github.com/zalando/zalenium/
cd zalenium
git checkout 3.14.0g
mkdir -p src/main/java/org/glassfish/jersey/apache/connector
curl -o src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java https://raw.githubusercontent.com/jersey/jersey/2.22.2/connectors/apache-connector/src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java
# Here, manually apply the patch below to ApacheConnector.java.
# https://github.com/eclipse-ee4j/jersey/pull/3861/files
mvn clean package && (cd target && docker build -t ${YOUR_REPOSITORY}/zalenium:3.14.0g . )

Permanent workaround

Please consider upgrading com.spotify/docker-client from 8.11.7 to a newer version (not released yet as of Nov 2018) in which docker-client uses jersey-apache-connector:2.29 (scheduled for release in spring 2019).

Zalenium:3.14.0g uses docker-client:8.11.7.
https://github.com/zalando/zalenium/blob/3.14.0g/pom.xml#L62

docker-client:8.11.7 uses jersey-apache-connector:2.22.2.
https://github.com/spotify/docker-client/blob/v8.11.7/pom.xml#L109-L113
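
Whether a given connector version already contains the fix can be checked with a plain version comparison. The sketch below takes 2.29 as the first fixed version, as stated in this issue. connector_fixed? is a hypothetical helper, and Gem::Version is used only as a convenient version comparator.

```ruby
# First jersey-apache-connector version expected to carry the fix,
# according to this issue.
FIXED_IN = Gem::Version.new('2.29')

# True when the given version string is at or above the fixed version.
def connector_fixed?(version)
  Gem::Version.new(version) >= FIXED_IN
end

puts connector_fixed?('2.22.2') # prints false (version pulled in by docker-client 8.11.7)
puts connector_fixed?('2.29')   # prints true
```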

The problem has been fixed in version 2.29 of jersey-apache-connector, but the latest version of com.spotify/docker-client still uses version 2.22.
So this issue is still relevant. We encountered it multiple times while implementing the solution for Docker Swarm, and it happens especially frequently when a large number of containers are running in parallel.

mad-p mentioned this issue Dec 26, 2018

Intensive and repeated use cause DockerTimeoutException with 3.141.59d #807

Closed

diemol added the waiting-retest label Dec 26, 2018

diemol closed this as completed Jan 2, 2019

tstern added a commit to yosserO/zalenium that referenced this issue Apr 5, 2019

Fix error with simultaneous docker operations by synchronizing those …

69ae6ab

…operations see also: - eclipse-ee4j/jersey#3772 - zalando#808

tstern mentioned this issue Apr 5, 2019

Docker Swarm Implementation #907

Merged

9 tasks

diemol pushed a commit that referenced this issue Jun 2, 2019

Fix error with simultaneous docker operations by synchronizing those …

cc99649

…operations see also: - eclipse-ee4j/jersey#3772 - #808

diemol pushed a commit that referenced this issue Jun 10, 2019

Fix error with simultaneous docker operations by synchronizing those …

aedb122

…operations see also: - eclipse-ee4j/jersey#3772 - #808


diemol commented Dec 26, 2018

mad-p commented Dec 27, 2018

diemol commented Jan 2, 2019

mad-p commented Jan 4, 2019

diemol commented Jan 4, 2019

mad-p commented Jan 7, 2019

diemol commented Jan 9, 2019

yosserO commented Apr 3, 2019
