
spark-tensorflow-distributor: RAM overflow when running ResNet152 #189

Open
wobfan opened this issue Jun 23, 2021 · 1 comment

wobfan commented Jun 23, 2021

Hello!

I am using the spark-tensorflow-distributor package to run TensorFlow jobs on our 3-node Spark-on-YARN cluster. We also run a second cluster with the exact same specs, but using TensorFlow's native distribution instead of Spark-on-YARN. Both clusters feature 64-core CPUs, 188 GB of usable RAM, and 12 GPUs with 10 GB of memory each.

Both clusters run Python 3.7.3 with tensorflow==2.4.1. The Spark cluster additionally has spark-tensorflow-distributor==0.1.0 installed.

To get some insight into the performance differences, we ran the ResNet152 network on the CIFAR-10 dataset on both of them, as both are included out of the box in the TF packages. I'll attach the code below.

Although we are using the exact same code on both clusters, with the same dataset and the same network, the run on Spark consumes far more RAM than the one distributed by TF itself: the Spark run quickly climbs to about 137 GB of RAM and stays there most of the time (with peaks of 148 GB), while the TF-distributed run starts at around 17 GB and peaks at only 28 GB.

Everything else we compared (GPU memory usage, CPU usage, network I/O, etc.) seems roughly comparable, but the RAM usage differs drastically. With a bigger dataset, the Spark run even overflows the RAM at some point during training, causing an EOF exception, while the natively distributed run uses only about 50 GB of RAM and finishes smoothly.
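
In case it helps to narrow this down, here is a minimal sketch of how the resident memory of each worker process could be sampled from inside the training function; psutil and the log_host_memory helper are just assumptions for illustration, not part of our actual setup:

import psutil  # assumed to be available on the workers; used only for this sketch

def log_host_memory(tag):
  # Print the resident set size (RSS) of the current worker process in GB.
  rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
  print(f'[{tag}] worker RSS: {rss_gb:.1f} GB')

Calling something like log_host_memory('after dataset build') and log_host_memory('after fit') inside train() would show where the memory is actually held on each worker.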

This is the code I am using:

from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
  import tensorflow as tf

  # ResNet152 with ImageNet weights; CIFAR-10 images are resized to the expected 224x224 input.
  model = tf.keras.applications.ResNet152(
    include_top=True, weights='imagenet', input_tensor=None,
    input_shape=None, pooling=None, classes=1000)

  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
  train_images = tf.image.resize(train_images, (224, 224))

  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)

  model.fit(dataset, epochs=3)

MirroredStrategyRunner(num_slots=12, use_gpu=True).run(train)
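
For comparison, here is a sketch of an alternative input pipeline that resizes per element inside tf.data instead of materializing all resized images up front; the eager tf.image.resize above turns the 50,000 CIFAR-10 training images into a roughly 30 GB float32 tensor in host memory before the Dataset is even built. This is only an idea for isolating the difference, not something we have verified:

import tensorflow as tf

def make_lazy_dataset():
  # Keep CIFAR-10 as uint8 in host memory and resize each image lazily in the pipeline.
  (train_images, train_labels), _ = tf.keras.datasets.cifar10.load_data()
  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.map(
    lambda image, label: (tf.image.resize(image, (224, 224)), label),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)
  return dataset.prefetch(tf.data.experimental.AUTOTUNE)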

Any clue about this strange behaviour, or what might be causing it?

Many thanks in advance! :-)

wobfan changed the title from "RAM overflow when running ResNet152" to "spark-tensorflow-distributor: RAM overflow when running ResNet152" on Jun 23, 2021

cometta commented Apr 29, 2023

Any update on this?
