Any reason not to use DataFrame.from_csv? #1

Open
steffenix opened this issue Sep 9, 2023 · 1 comment

I am looking at your example while having issues with transformers, and I am wondering whether my data loading could be the cause. I can see you wrote your own data loader, def get_data(tokenizer, filename). Is there any reason for not using Explorer.DataFrame.from_csv/2? https://hexdocs.pm/explorer/Explorer.DataFrame.html#from_csv/2

steffenix changed the title from "Any reason not to use DataFrame.from_csv" to "Any reason not to use DataFrame.from_csv?" on Sep 9, 2023
toranb (Owner) commented Sep 9, 2023

Great question! The talk was aimed at complete beginners, so I kept it simple for those who know Elixir but not the ML libraries. I did use DataFrame in my original work, as you see in the Bumblebee guides :)

defmodule Example.Data do
  def get_data(path, tokenizer, opts \\ []) do
    path
    |> Explorer.DataFrame.from_csv!(header: false)
    |> Explorer.DataFrame.rename(["label", "text"])
    |> stream()
    |> tokenize_and_batch(tokenizer, opts[:batch_size], opts[:sequence_length])
  end

  def stream(df) do
    xs = df["text"]
    ys = df["label"]

    xs
    |> Explorer.Series.to_enum()
    |> Stream.zip(Explorer.Series.to_enum(ys))
  end

  def tokenize_and_batch(stream, tokenizer, batch_size, sequence_length) do
    stream
    |> Stream.chunk_every(batch_size)
    |> Stream.map(fn batch ->
      {text, labels} = Enum.unzip(batch)
      tokenized = Bumblebee.apply_tokenizer(tokenizer, text, length: sequence_length)
      {tokenized, Nx.stack(labels)}
    end)
  end
end
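
For readers who want to try the batching step in isolation, here is a minimal sketch of the zip, chunk, and unzip pattern used in stream/1 and tokenize_and_batch/4 above. It uses only the Elixir standard library; the module name BatchSketch, the function batches/3, and the sample rows are made up for illustration and are not part of the example repo. Tokenization and Nx.stack/1 are left out on purpose.

```elixir
defmodule BatchSketch do
  # Hypothetical helper, not from the example repo: pairs each text with its
  # label, groups the pairs into chunks of batch_size, then splits every chunk
  # back into a {texts, labels} tuple.
  def batches(texts, labels, batch_size) do
    texts
    |> Stream.zip(labels)
    |> Stream.chunk_every(batch_size)
    |> Stream.map(&Enum.unzip/1)
  end
end

# Made-up rows standing in for the "text" and "label" columns of the CSV.
texts = ["good movie", "bad movie", "fine movie"]
labels = [1, 0, 1]

# The stream is lazy; nothing is computed until it is enumerated here.
BatchSketch.batches(texts, labels, 2)
|> Enum.each(&IO.inspect/1)
```

With three rows and a batch size of 2, the last batch holds a single row, so anything downstream has to cope with a shorter final batch. Note also that get_data/3 above reads opts[:batch_size], which is nil when the option is not passed, so callers would need to supply it explicitly.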
