
spark-tensorflow-distributor: RAM overflow when running ResNet152 #189

Open
wobfan opened this issue Jun 23, 2021 · 1 comment

wobfan commented Jun 23, 2021

Hello!

I am using the spark-tensorflow-distributor package to run TensorFlow jobs on our 3-node Spark-on-YARN cluster. We also run a second cluster with the exact same specs, but using TensorFlow's native distribution instead of Spark-on-YARN. Both clusters feature 64-core CPUs, 188 GB of usable RAM, and 12 GPUs with 10 GB of memory each.

Both clusters run Python 3.7.3 with tensorflow==2.4.1. The Spark cluster additionally has spark-tensorflow-distributor==0.1.0 installed.

To get some insight into the performance differences, we ran the ResNet152 network on the CIFAR-10 dataset on both of them, as both are included out of the box in the TF packages. I'll attach the code below.

Although we are using the exact same code on both clusters, with the same dataset and the same network, the run on Spark consumes far more RAM than the one distributed by TF itself: the Spark run quickly climbs to about 137 GB of RAM and stays there most of the time (with peaks of 148 GB), while the TF-distributed run starts at around 17 GB and peaks at only 28 GB.

Everything else we compared (GPU memory usage, CPU usage, network I/O, etc.) seems roughly comparable, but the RAM usage differs drastically. With a bigger dataset, the Spark run even overflows the RAM at some point during training, causing an EOF exception, while the natively distributed run uses only about 50 GB of RAM and finishes smoothly.
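
In case it helps to narrow this down, here is a minimal sketch of how the resident memory of each worker process could be sampled from inside the training function; psutil and the log_host_memory helper are just assumptions for illustration, not part of our actual setup:

import psutil  # assumed to be available on the workers; used only for this sketch

def log_host_memory(tag):
  # Print the resident set size (RSS) of the current worker process in GB.
  rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
  print(f'[{tag}] worker RSS: {rss_gb:.1f} GB')

Calling something like log_host_memory('after dataset build') and log_host_memory('after fit') inside train() would show where the memory is actually held on each worker.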

This is the code I am using:

from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
  import tensorflow as tf

  # ResNet152 with ImageNet weights; CIFAR-10 images are resized to the expected 224x224 input.
  model = tf.keras.applications.ResNet152(
    include_top=True, weights='imagenet', input_tensor=None,
    input_shape=None, pooling=None, classes=1000)

  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
  train_images = tf.image.resize(train_images, (224, 224))

  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)

  model.fit(dataset, epochs=3)

MirroredStrategyRunner(num_slots=12, use_gpu=True).run(train)
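
For comparison, here is a sketch of an alternative input pipeline that resizes per element inside tf.data instead of materializing all resized images up front; the eager tf.image.resize above turns the 50,000 CIFAR-10 training images into a roughly 30 GB float32 tensor in host memory before the Dataset is even built. This is only an idea for isolating the difference, not something we have verified:

import tensorflow as tf

def make_lazy_dataset():
  # Keep CIFAR-10 as uint8 in host memory and resize each image lazily in the pipeline.
  (train_images, train_labels), _ = tf.keras.datasets.cifar10.load_data()
  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.map(
    lambda image, label: (tf.image.resize(image, (224, 224)), label),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)
  return dataset.prefetch(tf.data.experimental.AUTOTUNE)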

Any clue about this strange behaviour, or what might be causing it?

Many thanks in advance! :-)

wobfan changed the title from "RAM overflow when running ResNet152" to "spark-tensorflow-distributor: RAM overflow when running ResNet152" on Jun 23, 2021

cometta commented Apr 29, 2023

Any update on this?
