
Pass bytes object from remote to local #954

@jacgoldsm

Description


Currently, the way to transfer a dataframe from remote to local is the %%spark -o df command. However, this is not an optimal solution: it is less efficient than a Spark dataframe's normal .toPandas() method, and it does not preserve data types the same way.

The easiest workaround is to write the dataframe to a file system that both the remote and local environments can access. For example, you could do:

# remote (Spark) session
df = spark.table("data")
df.toPandas().to_parquet("/file/system/data.parquet")

%%local
import pandas as pd
df = pd.read_parquet("/file/system/data.parquet")

If you could transfer the serialized Parquet bytes directly to local, you could avoid the external file system entirely:

# remote (Spark) session
df = spark.table("data")
buf = df.toPandas().to_parquet(path=None)  # path=None returns the Parquet file as a bytes object

%%send_bytes buf

%%local
import io
import pandas as pd
df = pd.read_parquet(io.BytesIO(buf))
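
For what it's worth, the serialization half of this already round-trips cleanly: below is a small local-only check (assuming a Parquet engine such as pyarrow is installed, and using a made-up example dataframe) showing that to_parquet(path=None) yields a bytes object that read_parquet restores with values and dtypes intact. The only missing piece is getting buf across the session boundary.

import io
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "label": ["a", "b", "c"],
})

buf = df.to_parquet(path=None)               # bytes holding the serialized Parquet file
restored = pd.read_parquet(io.BytesIO(buf))  # read back from memory, no shared file system

pd.testing.assert_frame_equal(df, restored)  # values and dtypes survive the round trip
print(restored.dtypes)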

How difficult would it be to add the %%send_bytes magic? From looking through the code, it seems as though it should be doable, but I may be missing something. I am happy to help with the implementation as best I can, but I am not very familiar with the codebase.
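
To make that a little more concrete, here is a rough sketch of what the local half of such a magic could look like, written against plain IPython magic machinery rather than sparkmagic's actual classes. The send_bytes name, the line-magic form, the base64 transport, and the _fetch_base64_from_remote helper are all my own assumptions; a real implementation would presumably reuse sparkmagic's existing session plumbing where the placeholder raises.

import base64

from IPython.core.magic import Magics, line_magic, magics_class

@magics_class
class SendBytesMagics(Magics):
    """Illustrative only: the local half of a hypothetical send_bytes magic."""

    @line_magic
    def send_bytes(self, line):
        # `line` is the name of a bytes variable on the remote side,
        # e.g. `%send_bytes buf`.
        var_name = line.strip()
        # Placeholder for the real work: run something like
        # base64.b64encode(<var_name>).decode() in the remote Livy session
        # and capture the resulting string output.
        encoded = self._fetch_base64_from_remote(var_name)
        # Bind the decoded bytes under the same name in the local kernel.
        self.shell.user_ns[var_name] = base64.b64decode(encoded)

    def _fetch_base64_from_remote(self, var_name):
        raise NotImplementedError("would execute code in the remote session here")

def load_ipython_extension(ipython):
    ipython.register_magics(SendBytesMagics)

The interesting part is entirely inside the placeholder: executing a small base64-encoding statement in the remote session and capturing its string output, which the existing magics presumably already know how to do.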
