Skip to content

Remove residual [duplicate] YouTube Shorts urls from database#155

Merged
holta merged 3 commits into
iiab:masterfrom
deldesir:deldesir-patch-27
May 31, 2024
Merged

Remove residual [duplicate] YouTube Shorts urls from database#155
holta merged 3 commits into
iiab:masterfrom
deldesir:deldesir-patch-27

Conversation

@deldesir

@deldesir deldesir commented Apr 16, 2024

Copy link
Copy Markdown
Member

This PR enables downloading of channels that contains Shorts. It does 2 things:
1- Remove shorts duplicates from database
2- Ignore short videos when downloading playlists and channels

Fixes #154

Tested with https://www.youtube.com/@AutoApps. MAX_VIDEOS_PER_DOWNLOAD was set to 530 to account for the maximum videos downloads since the channels contains 523 videos. 415/415 were successfully processed for metadata, ignoring all shorts.

1- Remove shorts duplicates from database
2- Ignore short videos when downloading playlists and channels
@deldesir deldesir self-assigned this Apr 16, 2024
@deldesir deldesir requested a review from holta April 16, 2024 18:20
@holta

holta commented Apr 16, 2024

Copy link
Copy Markdown
Member

2- Ignore short videos when downloading playlists and channels

Quick questions:

  • YouTube "Shorts" will not be downloaded?

  • Explain why this is advisable or necessary?

  • Is this a temporary measure?

(Thanks for making the context/motivation more clear!)

Comment thread cps/tasks/metadata_extract.py Outdated
Comment on lines +106 to +109

def _ignore_shorts(self, requested_urls):
requested_urls = {url: requested_urls[url] for url in requested_urls.keys() if requested_urls[url]["duration"] > 60}

@deldesir deldesir Apr 16, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • YouTube "Shorts" will not be downloaded?

Exactly. The function _ignore_shorts defaults playlists and channels downloads to ignoring Shorts

  • Explain why this is advisable or necessary?

It's there to support it as an optional feature. Like constants MAX_VIDEOS_PER_DOWNLOAD and MAX_GB_PER_DOWNLOAD, we can make it configurable for the user to toggle it on/off.

  • Is this a temporary measure?

It doesnt affect the downloading per se. However the other function introduced (_remove_shorts_from_db) is a temporary measure.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like constants MAX_VIDEOS_PER_DOWNLOAD

Should the option to ignore "Shorts" be part of this PR, if it's straightforward to implement safely?

@deldesir deldesir Apr 16, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it into another PR might make it more feature-relevant than portraying it as a fix. Zero confusion is better.

@holta holta May 4, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it into another PR might make it more feature-relevant than portraying it as a fix. Zero confusion is better.

Please link to any PR or ticket explaining what should happen next:

How should people who want "YouTube Shorts" included when downloading channels/playlists/etc... set that option at the command-line... prior to clicking "Download to IIAB" ?

Comment on lines 58 to +62

def _remove_shorts_from_db(self, conn):
conn.execute("DELETE FROM media WHERE path LIKE '%shorts%'")
conn.commit()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a temporary solution that deletes all entries with paths that contain "shorts". It fixes #154

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a temporary solution that deletes all entries with paths that contain "shorts". It fixes #154

"temporary solution" means what?

(When should this PR be reverted-or-improved upon? Outline concrete next steps?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means this should be fixed in xklb. I need to reproduce the issue about short URLs duplicates first. Meanwhile it's safe to implement this temporary fix because it removes those URLs.

@holta holta May 2, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR remove duplicate "YouTube Shorts" from xklb-metadata.db ?

(And if so, a question: does the removing happen in a quite safe or completely safe way — such that everything will continue to work in future, e.g. if xklb itself later removes duplicates?)

@deldesir deldesir May 2, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's safe enough. It will continue to work regardless of xklb implementing it or not. However, it's worth noting that if there are no "shorts" URLs in the database, the function will still execute a query to delete them, which might be unnecessary overhead. I think adding a condition to check for their existence first might enhance it.

Suggested change
def _remove_shorts_from_db(self, conn):
conn.execute("DELETE FROM media WHERE path LIKE '%shorts%'")
conn.commit()
def _remove_shorts_from_db(self, conn):
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM media WHERE path LIKE '%shorts%'")
count = cursor.fetchone()[0]
if count > 0:
conn.execute("DELETE FROM media WHERE path LIKE '%shorts%'")
conn.commit()

What do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's tested very safe, and ideally simplest?

(Make a recommendation!)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is the simplest and very safe. I recommend it.

@deldesir deldesir changed the title Handle YouTube Shorts Remove residual Shorts urls fro database Apr 16, 2024
@deldesir deldesir changed the title Remove residual Shorts urls fro database Remove residual Shorts urls from database Apr 16, 2024
@holta

holta commented Apr 16, 2024

Copy link
Copy Markdown
Member

Should @EMG70 test this PR ?

(And if so how!)

@holta holta mentioned this pull request Apr 16, 2024
@deldesir

Copy link
Copy Markdown
Member Author

Should @EMG70 test this PR ?

Yes. By downloading a whole channel.

@holta holta added the bug Something isn't working label May 2, 2024
@holta

holta commented May 2, 2024

Copy link
Copy Markdown
Member

Should @EMG70 test this PR ?

Yes. By downloading a whole channel.

@deldesir please test this and let us know that everything's solid — if @EMG70 (contributing in many other ways!) has no time here? ✅

@deldesir

deldesir commented May 2, 2024

Copy link
Copy Markdown
Member Author

Okay. I will give some additional feedback on it.

@holta holta changed the title Remove residual Shorts urls from database Remove residual [duplicate] YouTube Shorts urls from database May 30, 2024
@holta

holta commented May 31, 2024

Copy link
Copy Markdown
Member

This PR was smoke-tested with #151 and #157 on t4 as explained here:

#157 (comment)

(Further functionality testing needed to fully confirm...)

@holta holta merged commit 478fd22 into iiab:master May 31, 2024
@deldesir deldesir deleted the deldesir-patch-27 branch July 1, 2024 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Channel downloads fail due to residual shorts duplicates

2 participants