This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Multithreaded, concurrent and intensive use often cause stale containers to remain up forever #808

Closed
mad-p opened this issue Dec 26, 2018 · 8 comments

Comments

@mad-p

mad-p commented Dec 26, 2018

Zalenium Image Version(s): 3.14.0g ( also reproducible with 3.14.0c )
Docker Version: 18.09.0, build 4d60db4
If using docker-compose, version: 1.23.1, build b02f1306
OS: OSX High Sierra ( also reproducible on CentOS 7.2.1511 and latest Arch Linux )
Docker Command to start Zalenium: Executing through docker-compose.yml

Expected Behavior

Stale containers will get removed after idle timeout.
Thread AutoStartProxyPoolPoller works as expected.

Actual Behavior

Stale containers won't get removed and remain up forever even after idle timeout.
Thread AutoStartProxyPoolPoller hangs forever.
Note that those containers can still be reused as normal.

Minimal code to reproduce the problem

docker-compose.yml

  • --desiredContainers 0 makes it easy to tell whether the idle timeout is working: when the problem occurs, the number of node containers stays above zero.
  • --debugEnabled true makes it easy to tell whether the AutoStartProxyPoolPoller thread is still running: while it is, the debug message 'Checking containers...' is printed to standard output continuously.
version: "2"

services:
  zalenium:
    image: dosel/zalenium:3.14.0g
    container_name: zalenium
    privileged: true
    tty: true
    ports:
      - "4444:4444"
    volumes:
      - /tmp/videos:/home/seluser/videos/
      - /var/run/docker.sock:/var/run/docker.sock
    command: >
      start
        --seleniumImageName elgalu/selenium:3.14.0-p16
        --desiredContainers 0
        --maxDockerSeleniumContainers 10
        --debugEnabled true
    environment:
      - PULL_SELENIUM_IMAGE=true
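
With --debugEnabled true, one way to notice the poller going silent is to track when 'Checking containers...' last appeared in the container's output. The following is a minimal Ruby sketch and not part of Zalenium: the class name `PollerWatchdog` and the 60-second threshold are assumptions chosen for illustration; only the marker text comes from the description above.

```ruby
# Hypothetical helper (not part of Zalenium): remembers when the poller's
# debug message was last seen and reports whether it has gone quiet.
class PollerWatchdog
  MARKER = 'Checking containers...' # debug message named in this issue

  # max_gap: seconds of silence tolerated before the poller is considered
  # stalled (60 is an arbitrary illustrative default).
  def initialize(max_gap: 60, now: Time.now)
    @max_gap = max_gap
    @last_seen = now
  end

  # Feed every log line here; only lines containing the marker reset the clock.
  def observe(line, now: Time.now)
    @last_seen = now if line.include?(MARKER)
  end

  # True once more than max_gap seconds have passed since the last marker.
  def stalled?(now: Time.now)
    now - @last_seen > @max_gap
  end
end
```

It could be fed with something like `IO.popen(%w[docker logs -f zalenium]) { |io| io.each_line { |l| watchdog.observe(l) } }`, with `stalled?` checked periodically from another thread.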

Ruby script

  • Run this Ruby script in the background in a few concurrent threads for several minutes.
  • The appropriate number of threads depends on the number of CPU cores.
  • For example, two threads on a four-core machine work well.
  • A shorter idleTimeout, i.e. more frequent removal of stale containers, seems to reproduce the problem sooner.
    In a "real" UI testing environment where idleTimeout defaults to 90 seconds, each UI test takes around a dozen seconds, and about four tests run concurrently, it usually takes a couple of hours for the problem to occur.
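require 'selenium-webdriver' # version 3.14.0 ( also reproducible with version 2.53.4 )

# Opens one remote Chrome session, holds it for 10 seconds, then quits.
# (Named run_session rather than exec to avoid shadowing Kernel#exec.)
def run_session
  caps = Selenium::WebDriver::Remote::Capabilities.chrome
  caps[:idleTimeout] = 10 # short idle timeout (seconds) so stale containers are reclaimed often
  # Replace ${YOUR_ZALENIUM_HOST} with the host running Zalenium.
  driver = Selenium::WebDriver.for :remote, url: 'http://${YOUR_ZALENIUM_HOST}:4444/wd/hub', desired_capabilities: caps
  sleep 10
  driver.quit
end

# Keep creating sessions forever; log errors and carry on.
loop do
  begin
    run_session
  rescue => e
    puts e
  end
end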
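
The instructions above call for running the script in a few threads for several minutes. A minimal sketch of a launcher that does this inside a single Ruby process is shown below; the helper name `run_concurrently` and its keyword arguments are hypothetical and not part of the original reproduction, which simply starts the script several times in the background.

```ruby
# Hypothetical helper: calls the given block over and over in `thread_count`
# threads until `duration` seconds have elapsed, then returns how many
# iterations each thread completed.
def run_concurrently(thread_count:, duration:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + duration
  threads = Array.new(thread_count) do |i|
    Thread.new do
      iterations = 0
      # The deadline is only checked between iterations, so a running
      # iteration is always allowed to finish.
      while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
        begin
          yield i
        rescue => e
          # Log and carry on, as the original loop does.
          warn "thread #{i}: #{e.class}: #{e.message}"
        end
        iterations += 1
      end
      iterations
    end
  end
  threads.map(&:value) # waits for every thread to finish
end
```

For the setup described above, this might be called with `thread_count: 2` and `duration: 300`, passing the session-creating method from the script as the block.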

Java thread dump taken after the problem occurs.

https://gist.github.com/mad-p/6082c9ee556ad84d1304be1c9f91b562

The Java thread dump was taken as follows.

docker exec -it ${YOUR_ZALENIUM_CONTAINER_NAME} bash
seluser@zalenium:~$ ps aux | grep java
seluser@zalenium:~$ sudo kill -3 ${YOUR_PROCESS_ID_OF_JAVA}
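
Signal 3 (SIGQUIT) makes the JVM print the dump to its own standard output, so it ends up in the container's output rather than in the shell above. To pull out just the thread of interest from a saved dump, a small Ruby filter can help. The sketch below assumes the usual HotSpot layout, in which each thread's stanza starts with a line beginning with the quoted thread name and stanzas are separated by blank lines; the helper name `thread_stanzas` is hypothetical.

```ruby
# Hypothetical helper: returns the stanzas of a HotSpot-style thread dump
# whose header line (the line starting with a double quote) mentions the
# given thread name fragment.
def thread_stanzas(dump_text, name_fragment)
  dump_text.split(/\n[ \t]*\n/).select do |stanza|
    header = stanza.lstrip.lines.first.to_s
    header.start_with?('"') && header.include?(name_fragment)
  end
end
```

For example, `puts thread_stanzas(File.read('dump.txt'), 'AutoStartProxyPoolPoller')` would print that thread's state and stack, where `dump.txt` is a placeholder for wherever the dump was saved.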

Root cause

The root cause seems to be the issue below.

Properly close the Apache response so that connections can be reused
eclipse-ee4j/jersey#3861
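
The linked Jersey change is about responses not being closed, which keeps their connections from being returned for reuse. One plausible way this leads to a hang, offered here as an assumption rather than something this issue confirms, is that a bounded connection pool runs dry and later requests wait for a connection that never comes back. The toy Ruby model below illustrates only that general mechanism; `TinyPool` and `served_requests` are invented for illustration and are not Jersey, Apache HttpClient, or docker-client code.

```ruby
# Toy model of a bounded connection pool (illustrative only).
class TinyPool
  def initialize(size)
    @slots = Queue.new
    size.times { |i| @slots << "conn-#{i}" }
  end

  # Non-blocking checkout: returns nil when the pool is empty. A blocking
  # pool would make the caller wait here instead.
  def checkout
    @slots.pop(true)
  rescue ThreadError
    nil
  end

  # Returns a connection to the pool so it can be reused.
  def release(conn)
    @slots << conn
  end
end

# Issues `requests` requests against the pool and returns how many of them
# obtained a connection. With release: false, connections are never returned.
def served_requests(pool, requests, release:)
  requests.times.count do
    conn = pool.checkout
    pool.release(conn) if conn && release
    !conn.nil?
  end
end
```

With a pool of two connections, a client that releases after each request can be served indefinitely, while one that never releases is served only twice; in a blocking pool the third caller would simply wait.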

Tentative workaround

Use a patched version of jersey-apache-connector.

git clone https://github.com/zalando/zalenium/
cd zalenium
git checkout 3.14.0g
mkdir -p src/main/java/org/glassfish/jersey/apache/connector
curl -o src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java https://raw.githubusercontent.com/jersey/jersey/2.22.2/connectors/apache-connector/src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java
# Here, manually apply the patch below to ApacheConnector.java.
# https://github.com/eclipse-ee4j/jersey/pull/3861/files
mvn clean package && (cd target && docker build -t ${YOUR_REPOSITORY}/zalenium:3.14.0g . )

Permanent workaround

Please consider upgrading com.spotify/docker-client:8.11.7 to a newer version (not yet released as of Nov 2018) in which docker-client uses jersey-apache-connector:2.29 (scheduled for release in spring 2019).

Zalenium:3.14.0g uses docker-client:8.11.7.
https://github.com/zalando/zalenium/blob/3.14.0g/pom.xml#L62

docker-client:8.11.7 uses jersey-apache-connector:2.22.2.
https://github.com/spotify/docker-client/blob/v8.11.7/pom.xml#L109-L113
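
To check which jersey-apache-connector version a given build actually resolves, the output of `mvn dependency:tree` can be scanned for the artifact's coordinates, which Maven prints as group:artifact:type:version:scope. The Ruby sketch below does that and compares the version against 2.29, the release this issue names as carrying the fix. The helper names and the `FIXED_IN` constant are hypothetical, and the threshold is taken from this issue rather than verified independently.

```ruby
require 'rubygems' # provides Gem::Version for dotted-version comparison

# Assumption from this issue: the fix ships in jersey-apache-connector 2.29.
FIXED_IN = Gem::Version.new('2.29')

# Matches e.g. "org.glassfish.jersey.connectors:jersey-apache-connector:jar:2.22.2:compile"
# and captures the version field that follows the packaging type.
CONNECTOR_COORD =
  /org\.glassfish\.jersey\.connectors:jersey-apache-connector:[^:\s]+:([^:\s]+)/

# Returns the distinct connector versions found in dependency:tree output.
def connector_versions(tree_output)
  tree_output.scan(CONNECTOR_COORD).flatten.uniq
end

# True when the given version is at or above the assumed fixed release.
def includes_fix?(version)
  Gem::Version.new(version) >= FIXED_IN
end
```

Running `connector_versions` over the saved output and passing each result to `includes_fix?` shows whether the build still pulls in an affected connector.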

See also

com.spotify/docker-client

https://github.com/spotify/docker-client
https://mvnrepository.com/artifact/com.spotify/docker-client

jersey-apache-connector

https://github.com/jersey/jersey/ (old repo)
https://github.com/eclipse-ee4j/jersey/
https://mvnrepository.com/artifact/org.glassfish.jersey.connectors/jersey-apache-connector

Issues at com.spotify/docker-client

spotify/docker-client#727
spotify/docker-client#727 (comment)
spotify/docker-client#727 (comment)

Issues at jersey-apache-connector

https://github.com/jersey/jersey/issues/3772 (old repo)
eclipse-ee4j/jersey#3772
eclipse-ee4j/jersey#3772 (comment)

Jersey release schedule and roadmap

https://projects.eclipse.org/projects/ee4j.jersey
https://github.com/eclipse-ee4j/jersey/wiki/Road-Map

@diemol
Contributor

diemol commented Dec 26, 2018

Hi @mad-p,

Could you please try with the latest release? https://github.com/zalando/zalenium/releases/tag/3.141.59e

These issues should have been solved in that release.

@mad-p
Author

mad-p commented Dec 27, 2018

OK, I'll try.
BTW, sorry for the wrong URL for the thread-dump gist. Here's the correct one:
https://gist.github.com/mad-p/6082c9ee556ad84d1304be1c9f91b562

@diemol
Contributor

diemol commented Jan 2, 2019

No worries,

Closing due to lack of activity, please let us know if things didn't work out.

@diemol diemol closed this as completed Jan 2, 2019
@mad-p
Author

mad-p commented Jan 4, 2019

I tried it twice. Unfortunately, in one of the two tries, some containers were still not reclaimed. I'll try some more.

Upgrading Jersey eliminated the problem completely on our side. Could you take a look at it?

@diemol
Contributor

diemol commented Jan 4, 2019

How did you upgrade the Jersey version?

@mad-p
Author

mad-p commented Jan 7, 2019

Please look at the "Tentative workaround" above.

@diemol
Contributor

diemol commented Jan 9, 2019

In the permanent workaround you mention:

Please consider upgrading com.spotify/docker-client:8.11.7 to a newer version (not yet released as of Nov 2018) in which docker-client uses jersey-apache-connector:2.29 (scheduled for release in spring 2019).

But the current version we are using is 8.14.3, so it seems that you are not using the latest Zalenium release?

@yosserO
Contributor

yosserO commented Apr 3, 2019

The problem has been fixed in version 2.29 of jersey-apache-connector, but the latest version of com.spotify/docker-client still uses version 2.22.
So this issue is still relevant. We encountered it multiple times while implementing the solution for Docker Swarm, and it happens especially frequently when a large number of containers are running in parallel.
