Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement mv method in GDriveFileSystem #217

Merged
merged 18 commits into from
Sep 18, 2022

Conversation

simone-viozzi
Copy link
Contributor

Hi, i'm addind a method to rename a GoogleDriveFile

the google api to rename a file is: https://developers.google.com/drive/api/v2/reference/files/patch

I saw you have a wrapper for files().patch: _FilesPatch
do you prefer if I use the existing wrapper or the api directly?

@simone-viozzi simone-viozzi changed the title add rename method in GoogleDriveFile Add rename method in GoogleDriveFile Jul 24, 2022
@shcheklein
Copy link
Member

@simone-viozzi this is from the docs:

Manipulate file metadata and contents from [GoogleDriveFile](https://docs.iterative.ai/PyDrive2/pydrive2/#pydrive2.files.GoogleDriveFile) object and call [Upload()](https://docs.iterative.ai/PyDrive2/pydrive2/#pydrive2.files.GoogleDriveFile.Upload)

Does it make sense to use the Upload for this?

E.g. like this https://docs.iterative.ai/PyDrive2/filemanagement/?highlight=metadata#update-file-metadata

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to review the workflow around Upload()

@simone-viozzi
Copy link
Contributor Author

Right now is possible to rename a file using existing methods:

file.FetchMetadata(fields="title")
file["title"] = "new_file1"
file.Upload()

and looking at the code for upload:

PyDrive2/pydrive2/files.py

Lines 499 to 512 in 64a50a1

def Upload(self, param=None):
"""Upload/update file by choosing the most efficient method.
:param param: additional parameter to upload file.
:type param: dict.
:raises: ApiRequestError
"""
if self.uploaded or self.get("id") is not None:
if self.dirty["content"]:
self._FilesUpdate(param=param)
else:
self._FilesPatch(param=param)
else:
self._FilesInsert(param=param)

it uses _FilesPatch internally.

i discovered this way to rename a file with #91

but it would be much more convenient and user friendly to just have a method that does that

@shcheklein
Copy link
Member

@simone-viozzi could you check if fsspec interface had move / rename please? May be we should implement it there (keep this one as-is - low level).

one more question:

file.FetchMetadata(fields="title")
file["title"] = "new_file1"
file.Upload()

is it possible to do this w/o a call to Fetch? (if I know the file id). I wonder also if it's possible to do this if I know only a file name w/o any extra calls. That would be a good benefit of this I think.

@simone-viozzi
Copy link
Contributor Author

simone-viozzi commented Jul 26, 2022

is it possible to do this w/o a call to Fetch? (if I know the file id). I wonder also if it's possible to do this if I know only a file name w/o any extra calls. That would be a good benefit of this I think.

No, without fetch it doesn't do nothing.

Consider the minimal example:

file = drive.CreateFile({'id': "13nYjSBX7FmSUFnBcTuT6hKsFswXhqcpzpioei5r2eRQ"})
file["title"] = "new_file1"
file.Upload()

The stack-trace of Upload in this case is:

  • Upload
  • _FilesPatch
  • GetChanges -> this return a blank dict

def GetChanges(self):
"""Returns changed metadata elements to update api resources efficiently.
:returns: dict -- changed metadata elements.
"""
dirty = {}
for key in self:
if self.metadata.get(key) is None:
dirty[key] = self[key]
elif self.metadata[key] != self[key]:
dirty[key] = self[key]
return dirty

i think because the title was None and it doesn't count as a change, but i might be wrong on this one.
This is also the same issue as #91

With the rename method i implemented:

https://github.com/simone-viozzi/PyDrive2/blob/c55271427ace18d59019f03392e73cf99d46f9eb/pydrive2/files.py#L574-L587

you don't need to fetch it before because i specified it in the request: param["fields"] = "title"

@simone-viozzi could you check if fsspec interface had move / rename please? May be we should implement it there (keep this one as-is - low level).

there should be a move method (#119) but i can't find it on the current version. Also there is a reference to it in one of the tests:

def test_move(fs, remote_dir):
fs.touch(remote_dir + "/a.txt")
initial_info = fs.info(remote_dir + "/a.txt")
fs.move(remote_dir + "/a.txt", remote_dir + "/b.txt")
secondary_info = fs.info(remote_dir + "/b.txt")
assert not fs.exists(remote_dir + "/a.txt")
assert fs.exists(remote_dir + "/b.txt")
initial_info.pop("name")
secondary_info.pop("name")
assert initial_info == secondary_info

i suppose it's an old test and the method was dropped somewhere.

Anyway, there is no rename / move on the current version of fsspec

@simone-viozzi simone-viozzi requested a review from shcheklein July 27, 2022 20:08
@shcheklein
Copy link
Member

i discovered this way to rename a file with #91

yep, thanks. I remember it now. I was trying to fix it, but it was not that trivial. Ideally, it should be fixed though.

there should be a move method (#119) but i can't find it on the current version. Also there is a reference to it in one of the tests:

https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.rename

That's the default implementation of move:

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L917-L920

That's the rename:

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1251

It's not terrible (if we optimize copy). But both can improved to use a single API call I think.


So, ideally, we can take a look and see if we can fix the #91 + optimize fsspec to provide a better interface for this. Then we can see if we need a separate wrapper on top of _Patch

Let me know wdyt.

@simone-viozzi
Copy link
Contributor Author

oh, i get it now, thank you!

i think we can definitively optimize them, maybe for the rename we could use self.auth.service.files().patch(**param).execute() directly, so we don't need the fetch before the call.
But this way we would need a function on the GoogleDriveFile level and another on the GDriveFileSystem level.

otherwise we can do just a function on the GDriveFileSystem level like this:

def rename(path, new_title)
    id = _get_item_id(path)
    file = drive.CreateFile({'id': id})
    file.FetchMetadata(fields="title")
    file["title"] = new_title
    file.Upload()

But the current implementation of rename is: def rename(self, path1, path2, **kwargs): and i don't think we can change the signature.

So maybe the best is to change the implementation of move in a way that:

  • if only the last segment of the path is changed it will rename the file
  • if other segments are changed too it will move the file

This way we can leave the default implementation of rename.

The move method will become something like:

def move(self, path1, path2):
    path1_left = path1.split('/')[:-1]
    path1_right = path1.split('/')[-1]

    path2_left = path1.split('/')[:-1]
    path2_right = path1.split('/')[-1]

   if path1_left == path2_left:
        rename file from path1_right to path2_right
   else:
        move file from path1 to path2

what do you think?

This way even if someone use move to rename a file it would call the right API

@shcheklein
Copy link
Member

Re moving files, check this https://developers.google.com/drive/api/guides/folder#move_files_between_folders

I feel it can be generalized to a single Patch call that updates parents + title if needed, wdyt? The only thing we'll need is to get ids for those parents, for the file, etc. It should be done carefully. The most annoying thing on Google Drive is that one path can resolve into multiple IDs.

@simone-viozzi
Copy link
Contributor Author

The most annoying thing on Google Drive is that one path can resolve into multiple IDs.

check out: https://rclone.org/commands/rclone_dedupe/

i think for the move / rename method we can just throw an error if a path resolve in multiple IDs.

Re moving files, check this https://developers.google.com/drive/api/guides/folder#move_files_between_folders

Good!
so we need patch to rename the file and update to move it.
we can use GoogleDriveFile.Upload that does both, i will try it and get back

@simone-viozzi
Copy link
Contributor Author

mh, funny.
The rename part does work:

fs.mv('root/tmp/a.pdf', 'root/tmp/b.pdf')

but the move part does not.

Even if i change the parent array, the Upload method will call _FilesPatch instead of _FilesUpdate.

PyDrive2/pydrive2/files.py

Lines 499 to 512 in 64a50a1

def Upload(self, param=None):
"""Upload/update file by choosing the most efficient method.
:param param: additional parameter to upload file.
:type param: dict.
:raises: ApiRequestError
"""
if self.uploaded or self.get("id") is not None:
if self.dirty["content"]:
self._FilesUpdate(param=param)
else:
self._FilesPatch(param=param)
else:
self._FilesInsert(param=param)

I also tried on manually calling the privates methods:

if file1_name != file2_name:
    file1._FilesPatch()

if file1_parent != file2_parent:
    file1._FilesUpdate()

But param["body"] = self.GetChanges() in _FilesUpdate does not recognize that the parent array has changed and nothing get updated.

@shcheklein do you have any idea why?

if i did the fetch before it should get that the array has changed

@simone-viozzi
Copy link
Contributor Author

Ok, now it does work.

both rename / move and rename and move at the same time.

i changed:

file1["parents"].append({"id": file2_parent_id})

to:

file1["parents"] = [{"id": file2_parent_id}]

This way when _FilesUpdate get the changes in the dict, the list is different ( different object ) and it gets updated.

i don't know if it's ok to just reset the parents array, on a normal drive it's not allowed to have a folder with multiple parents but in some other drives maybe is possible.

@simone-viozzi
Copy link
Contributor Author

simone-viozzi commented Jul 31, 2022

I also discovered that we don't need to use _FilesPatch, but _FilesUpdate can also rename the file.

Please take a look at the code, for a single file it should be ready for a review.

Than i will see how to implement recursive and maxdepth.

@simone-viozzi simone-viozzi marked this pull request as ready for review July 31, 2022 13:51
@simone-viozzi simone-viozzi changed the title Add rename method in GoogleDriveFile Implement mv method in GDriveFileSystem [renamed] Jul 31, 2022
@shcheklein
Copy link
Member

@simone-viozzi thanks! it's a good iteration, I put some comments to improve it.

i don't know if it's ok to just reset the parents array, on a normal drive it's not allowed to have a folder with multiple parents but in some other drives maybe is possible.

can you give more context please?

@simone-viozzi
Copy link
Contributor Author

@shcheklein how you want to handle recursive and maxdepth?
Right now is recursive by default, I think we can add a check so if the source is a folder and recursive=False it raises an exception. What do you think?

Regarding maxdepth I think is ok to leave it not implemented.

@simone-viozzi simone-viozzi marked this pull request as ready for review August 31, 2022 07:45
@simone-viozzi
Copy link
Contributor Author

Hey @shcheklein could you please review this PR?
I think it's ready to be merged

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, some stylistic changes. Good stuff overall. I'm not sure how recursive is handled, also we need to add test, would be great to count the number of api calls in those tests also

if maxdepth:
raise NotImplementedError("Max depth move is not supported")

if self.isdir(path1) and not recursive:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: how do we handle recursive after all? Why do we need this extra check. Is it only as a safeguard to check the intention? I'm asking since even the isdir check can be expensive (extra call).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move a folder, the content is moved with it. So is kinda always recursive.

The check is there to avoid that a user wanted to move a file (recursive=False) but by accident the source is a directory.

We could also remove recursive and always assume that the user want to move the entire folder. This way, there is no need to check if the source is a directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say yes (to avoid that extra API call). Also data does not disappear anywhere, it's not that dangerous.

how does the local filesystem fsspec implementation approach this?

unix mv doesn't require the recursive flag afair

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that to not modify the default method signature of ffspec:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L917-L920
In this case, the flags are passed down to copy.

But yes, if it's allowed to change the signature, we can also remove recursive and the isDir check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just ignore it. We are not allowed to change the default.

We can also check still how they implement the local fs. Is it unix way or it check the recursive flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also check still how they implement the local fs. Is it unix way or it check the recursive flag.

Where?

@simone-viozzi
Copy link
Contributor Author

@shcheklein there is already a test:

def test_move(fs, remote_dir):

@shcheklein
Copy link
Member

@shcheklein there is already a test:

def test_move(fs, remote_dir):

Q: How did it work before if mv was not implemented?

@simone-viozzi
Copy link
Contributor Author

@shcheklein regarding the recursive behavior, there is no actual need of recursion.
If we change the parent of a folder, all the subfolders will still point to the same id. So they will move along with their parent.

So it is possible to move an entire folder structure with just one call.

@simone-viozzi
Copy link
Contributor Author

@shcheklein there is already a test:

def test_move(fs, remote_dir):

Q: How did it work before if mv was not implemented?

Before mv was implemented with copy + rm:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L917-L920

@shcheklein
Copy link
Member

@shcheklein regarding the recursive behavior, there is no actual need of recursion.

okay, let's remove / ignore it for now

@simone-viozzi simone-viozzi temporarily deployed to external September 18, 2022 15:32 Inactive
@simone-viozzi
Copy link
Contributor Author

I removed the recursive flag. If a user use it, it will go into the kwargs and it will be ignored.

@shcheklein shcheklein changed the title Implement mv method in GDriveFileSystem [renamed] Implement mv method in GDriveFileSystem Sep 18, 2022
@shcheklein shcheklein merged commit 3172dea into iterative:main Sep 18, 2022
@shcheklein
Copy link
Member

Thanks @simone-viozzi !

@simone-viozzi simone-viozzi deleted the add-rename-method branch September 18, 2022 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants