This repository has been archived by the owner on Jul 3, 2024. It is now read-only.

perf: Asynchronously dispatch requests in groups #10

Open
wants to merge 8 commits into
base: main

alexandreteles
Contributor

@alexandreteles alexandreteles commented Mar 8, 2024

This small rewrite uses asyncio to dispatch requests in groups of five, with a short random delay of sleep: float = random.uniform(1, 3) seconds on each dispatch. This should run faster than dispatching requests synchronously, while introducing some entropy so as not to scare YouTube too much.
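
For readers following along, the grouped dispatch described above could look roughly like the sketch below. It is not the code from this PR: `dispatch_in_groups`, `send`, and the `delay_range` parameter are hypothetical names chosen for illustration, and only the group size of five and the `random.uniform(1, 3)` delay come from the description.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable, Iterable


async def dispatch_in_groups(
    urls: Iterable[str],
    send: Callable[[str], Awaitable[bool]],  # hypothetical request coroutine
    group_size: int = 5,
    delay_range: tuple[float, float] = (1.0, 3.0),
) -> list[bool]:
    """Send requests in groups, sleeping a random interval before each group."""
    pending = list(urls)
    results: list[bool] = []
    for start in range(0, len(pending), group_size):
        group = pending[start : start + group_size]
        # Random pause so the traffic pattern is less regular.
        sleep: float = random.uniform(*delay_range)
        await asyncio.sleep(sleep)
        # Run the whole group concurrently and wait for all of it to finish.
        results.extend(await asyncio.gather(*(send(url) for url in group)))
    return results
```

Note that in this shape a slow request holds up the rest of its group, because the next group only starts once `asyncio.gather` returns.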

I cannot test it myself, so I would be glad if you could check it out @oSumAtrIX.

Thank you!

EDIT: It also introduces a retry option that attempts the mark_watched operation up to three times before giving up on that specific video. I did not introduce a global failure count, but adding one should be trivial if the current code works.
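
A minimal sketch of that kind of per-video retry, under stated assumptions: the real `mark_watched` signature is not shown in this thread, so it is passed in here as a coroutine that returns `True` on success, and the `backoff` pause between attempts is an invented parameter.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def mark_with_retries(
    url: str,
    mark_watched: Callable[[str], Awaitable[bool]],  # stand-in for the real call
    attempts: int = 3,
    backoff: float = 1.0,  # assumed pause between attempts, in seconds
) -> bool:
    """Try to mark one video as watched, giving up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            if await mark_watched(url):
                return True
        except Exception:
            # Any error counts as a failed attempt; real code would log it.
            pass
        if attempt < attempts:
            await asyncio.sleep(backoff)
    return False
```

A global failure count could then be built by summing the `False` results across all videos.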

@alexandreteles alexandreteles requested a review from oSumAtrIX March 8, 2024 02:00
@alexandreteles alexandreteles self-assigned this Mar 8, 2024
@indrastorms
Contributor

indrastorms commented Mar 8, 2024

File "/data/data/com.termux/files/home/restore-missing-youtube-watch-history/main.py", line 106, in main
    kept: list[dict[str, Any]] = await filter_video_events(data, RESUME_TIMESTAMP)
                                 ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: object async_generator can't be used in 'await' expression
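
This error is raised when a function that is declared with `async def` and contains `yield` is awaited: calling it returns an async generator, which has to be iterated with `async for`. The sketch below reproduces the shape of the call and one common way to consume it. The body of `filter_video_events` here is a made-up stand-in, and the timestamps are assumed to be comparable strings; the actual function in the PR may differ.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any


async def filter_video_events(
    events: list[dict[str, Any]], resume_timestamp: str
) -> AsyncIterator[dict[str, Any]]:
    """Hypothetical async generator: yields events at or after the timestamp."""
    for event in events:
        if event["time"] >= resume_timestamp:
            yield event


async def main() -> list[dict[str, Any]]:
    data = [{"time": "2024-01-01"}, {"time": "2024-03-01"}]
    # `await filter_video_events(...)` would raise the TypeError above;
    # an async generator has to be iterated with `async for` instead.
    kept = [event async for event in filter_video_events(data, "2024-02-01")]
    return kept
```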

break

await asyncio.sleep(random.uniform(1, 3))
logger.info(f"Processed URL: {url}.")
Contributor

Add counter

Contributor Author

If you can, contribute that change to the PR. Thank you!

Contributor

I don't know how to do it. Maybe a task_done callback function would do. Would it also count the errors?
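
On the question above: in asyncio the relevant hook is `Task.add_done_callback`, and the callback runs when a task finishes for any reason, so it can count failures as well as successes. A rough sketch follows, with a hypothetical `process` worker standing in for the real per-video coroutine and `run_with_counter` as an invented helper name.

```python
import asyncio
from collections import Counter


async def process(url: str) -> None:
    """Hypothetical worker; fails for URLs containing 'bad'."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise RuntimeError(f"could not process {url}")


async def run_with_counter(urls: list[str]) -> Counter:
    counts: Counter = Counter()

    def on_done(task: asyncio.Task) -> None:
        # Called once per task, whether it succeeded, failed or was cancelled.
        if task.cancelled():
            counts["cancelled"] += 1
        elif task.exception() is not None:
            counts["failed"] += 1
        else:
            counts["ok"] += 1

    tasks = [asyncio.create_task(process(url)) for url in urls]
    for task in tasks:
        task.add_done_callback(on_done)
    # return_exceptions=True keeps one failure from aborting the others.
    await asyncio.gather(*tasks, return_exceptions=True)
    return counts
```

Because everything runs on one event loop thread, the callback can update the shared `Counter` without a lock.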

@alexandreteles
Contributor Author

alexandreteles commented Mar 8, 2024

File "/data/data/com.termux/files/home/restore-missing-youtube-watch-history/main.py", line 106, in main
    kept: list[dict[str, Any]] = await filter_video_events(data, RESUME_TIMESTAMP)
                                 ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: object async_generator can't be used in 'await' expression

Fixed the issue; that is what I get for writing code without testing. Anyway, against my better judgment, I have tested the script using my own account. The new execution logic should also pull in new videos to process as soon as a slot frees up in the semaphore, instead of waiting for the whole batch to finish. Every video will still have a random asyncio.sleep() to introduce some entropy. Default concurrency is still five requests at the same time, but that can be controlled with --concurrency.

I've also added a check to not process the same video multiple times by checking the video URL against a log file.
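
One way such a check could work, assuming the log contains lines in the `Processed URL: {url}.` format seen in the review snippet earlier in this thread. The log path, the regular expression, and both function names are illustrative guesses rather than the PR's implementation.

```python
import re
from pathlib import Path

# Matches lines such as "... Processed URL: https://example.com/watch?v=abc."
PROCESSED_LINE = re.compile(r"Processed URL: (?P<url>\S+)\.$")


def load_processed(log_path: Path) -> set[str]:
    """Collect URLs that an earlier run already logged as processed."""
    if not log_path.exists():
        return set()
    processed: set[str] = set()
    for line in log_path.read_text(encoding="utf-8").splitlines():
        match = PROCESSED_LINE.search(line.strip())
        if match:
            processed.add(match.group("url"))
    return processed


def pending_urls(urls: list[str], log_path: Path) -> list[str]:
    """Drop URLs already in the log, keeping the original order."""
    done = load_processed(log_path)
    return [url for url in urls if url not in done]
```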

Would you be kind enough to test it again?

@indrastorms
Contributor

It's working fine. Thanks to your async contribution, it's super fast now.

@alexandreteles
Contributor Author

It's working fine. Thanks to your async contribution, it's super fast now.

@oSumAtrIX Can you PR a fix to the readme that includes these changes? I will be a bit busy today so I'm not sure I'll be able to write it.

@oSumAtrIX oSumAtrIX changed the title feat: async runner perf: Use async to dispatch requests in groups Mar 9, 2024
@oSumAtrIX oSumAtrIX changed the title perf: Use async to dispatch requests in groups perf: Asynchronously dispatch requests in groups Mar 9, 2024
@oSumAtrIX
Member

@alexandreteles What changes to the readme are necessary?

@alexandreteles
Contributor Author

Some of the command-line arguments are gone, and there is a new one, --concurrency, that sets how many connections the app makes at the same time. That's about it.

"time": time,
} if header != "YouTube" or time < RESUME_TIMESTAMP:
return False
case {"details": [{"name": "From Google Ads"}]}:
Member

Apparently this is necessary according to #13 (comment)

Contributor Author

I can't currently verify that. Would you be able to do so?


@jmorgannz jmorgannz Mar 10, 2024

Which part is necessary according to my issue entry?
Just remember that the example I included in #13 is from "My Activity.JSON" with only "Ads" selected for export.
Those entries do not show up in watched_history.JSON at all, so referencing #13 should be redundant.

I don't know if other entries in watched_history.JSON need those checks though.
But either way, the comment you referenced is saying the entries with "From Google Ads" are actually LEGITIMATE watch history that was scrubbed due to the changes, not ones to be omitted.

@Mr-HaleYa

Tqdm needs to be installed to run. Should this be in the requirements file?

@alexandreteles
Contributor Author

Tqdm needs to be installed to run. Should this be in the requirements file?

That's in a different PR 😅

5 participants