Description
Currently, the way to transfer a dataframe from the remote Spark session to the local kernel is the %%spark -o df magic. However, this is not an optimal solution: it is less efficient and does not preserve data types the way the normal .toPandas() method of a Spark dataframe does.
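For reference, the current approach looks roughly like this (the -o flag pulls the named dataframe down to the local kernel as a pandas DataFrame; the table name "data" is just a placeholder):

%%spark -o df
df = spark.table("data")

%%local
df.dtypes  # the types here may not match what df.toPandas() would give on the remote side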
The easiest workaround is to write the dataframe to a file system that both the remote and local environments have access to. For example, you could do:
# remote cell: write the dataframe to shared storage
df = spark.table("data")
df.toPandas().to_parquet("/file/system/data.parquet")

%%local
# local cell: read it back with pandas
import pandas as pd
df = pd.read_parquet("/file/system/data.parquet")
If you could transfer the serialized parquet bytes directly to the local kernel, you could avoid the external file system entirely:
# remote cell: serialize to parquet bytes in memory
df = spark.table("data")
buf = df.toPandas().to_parquet(path=None)  # path=None returns the parquet bytes

%%send_bytes buf

%%local
import io
import pandas as pd
df = pd.read_parquet(io.BytesIO(buf))
How difficult would it be to add the %%send_bytes magic? From looking through the code, it seems as though it should be doable, but I may be missing something. I am happy to help as best I can with the implementation, but I am not very familiar with the codebase.
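To make the idea concrete, here is a minimal sketch of the round trip that %%send_bytes would need to perform, assuming the payload can be shuttled through a text-based statement output channel; the base64 step and all variable names below are my own illustration, not anything that exists in sparkmagic today:

import base64
import io
import pandas as pd

# "Remote" side: serialize to parquet bytes, then base64-encode so the
# payload survives a text-only transport channel.
df_remote = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
payload = base64.b64encode(df_remote.to_parquet(path=None)).decode("ascii")

# "Local" side: decode the text payload and rebuild the dataframe,
# keeping the dtypes that the parquet format carries.
df_local = pd.read_parquet(io.BytesIO(base64.b64decode(payload)))
assert df_local.equals(df_remote)

If the existing statement-output plumbing can carry a string like payload above, the magic itself might only need to wrap this encode/decode pair, but I may well be underestimating the work involved.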