Suppose you have a set of YouTube videos and you want an easy way to listen to just the audio. For me, this is The Archers omnibus, which I am really behind on (yeah, I know, but I’ve been listening since I was 14 and can’t quit now).

There are some services that can do this for you, but I’ve found them kind of annoying. The one I was using only accepted audio, so I had to rip the YouTube audio myself, put it somewhere public like a Dropbox public folder, and add the URL to the podcast site. I figured I could do it pretty quickly myself with some Python.

So here’s what this script does:

  1. Download the YouTube video and rip the audio from it
  2. Upload the audio and podcast cover image to an S3 bucket
  3. Generate a podcast RSS feed with the new file
  4. Upload the RSS feed to S3 too, and that’s it!

A few caveats before we start: this is a public RSS feed. You can obfuscate the URL easily enough by picking random names for the S3 bucket and the RSS file, but everything still has to be publicly readable. We could set up a podcast feed with basic auth and so on, but not on S3 without some server in between. That’s fine for me: I don’t mind if people find out I’m working my way through The Archers, but keep it in mind.
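
If you do want the feed URL to be harder to guess, one quick way is to bake a random token into the file name. Here’s a minimal sketch using Python’s secrets module (the variable name is just for illustration):

import secrets

# A hard-to-guess feed path, e.g. 'podcast-Xp3x9qZkVYhG2ZLJ0fX1bw.rss'.
# Generate this once and use it in place of the plain 'podcast.rss' key.
obfuscated_feed_key = f'podcast-{secrets.token_urlsafe(16)}.rss'
print(obfuscated_feed_key)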

Let’s go. Before anything else, make an S3 bucket with public access turned on, and configure your AWS credentials file. Then we can walk through the script step by step.
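
If you’d rather script the bucket setup too, here’s a rough sketch with boto3. The bucket name and the 'personal' profile match what the script uses below; the ACL settings are an assumption about your account defaults (newer buckets have ACLs disabled out of the box), so adjust as needed:

import boto3

session = boto3.Session(profile_name='personal')
s3 = session.client('s3')

# Create the bucket. ObjectWriter ownership re-enables ACLs, which the
# 'public-read' uploads later on rely on; outside us-east-1 you also need
# a CreateBucketConfiguration with a LocationConstraint.
s3.create_bucket(Bucket='podcast', ObjectOwnership='ObjectWriter')

# Lift the public access blocks so the 'public-read' ACLs take effect
s3.put_public_access_block(
    Bucket='podcast',
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': False,
        'IgnorePublicAcls': False,
        'BlockPublicPolicy': False,
        'RestrictPublicBuckets': False,
    },
)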

First, let’s import the libraries we’ll need and set the S3 bucket name as a constant.
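
Everything we import apart from the standard library is a third-party package. Assuming you use pip, the PyPI names (which differ slightly from some of the import names) are:

pip install yt-dlp boto3 feedgen python-slugify requests feedparser tqdm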

import os
import time

import boto3
import feedparser
import requests
import yt_dlp
from boto3.s3.transfer import S3Transfer, TransferConfig
from botocore.exceptions import ClientError
from feedgen.feed import FeedGenerator
from slugify import slugify
from tqdm import tqdm

S3_BUCKET = 'podcast'  # the bucket you created above

session = boto3.Session(profile_name='personal')
s3_client = session.client('s3')

Great. Now, let’s create two essential functions: one for downloading the YouTube video’s thumbnail image and another for the audio itself. With the yt-dlp library, this task is quite simple.

The first function, download_thumbnail(), takes the thumbnail URL and output path as arguments. We use the requests library to download the image, streaming the content and saving it in chunks for efficient memory usage.

The second function, download_audio(), accepts the video URL as input. Using the yt-dlp library’s YoutubeDL class, we retrieve metadata about the video, like its title and thumbnail URL. We then configure the downloading process, specifying the desired audio format and quality, and download the audio file accordingly.

These two functions work together to obtain the necessary thumbnail and audio files from any YouTube video for our podcast:

def download_thumbnail(url, output_path):
    response = requests.get(url, stream=True)
    response.raise_for_status()

    with open(output_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

def download_audio(url):
    with yt_dlp.YoutubeDL() as ydl:
        info = ydl.extract_info(url, download=False)
        video_title = info['title']
        thumbnail_url = info['thumbnail']
        slug_title = slugify(video_title)

    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'quiet': True,
        'outtmpl': f'{slug_title}.%(ext)s',
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

    return video_title, thumbnail_url

In this section, we will implement several functions to handle file transfers with Amazon S3 and build the podcast RSS feed. We’ll use the feedgen package to generate the feed and feedparser to parse the existing one. These functions will let us upload and download files, such as the audio and thumbnail files, to and from S3, and manage the RSS feed with ease.

We will write an update_rss_feed_s3 function, which is responsible for updating the podcast RSS feed hosted on S3. It takes remote_audio_url, public_thumbnail_url, video_title, and audio_length as arguments. It first checks whether the RSS file already exists on S3; if it does, it downloads and parses the existing feed so the previous entries are preserved, and if not, it creates a new feed with the necessary information. It then adds an entry containing the provided information about the latest episode, and finally uploads the updated feed back to S3.

def download_rss_from_s3(local_rss_file_path, remote_rss_file_path):
    try:
        with open(local_rss_file_path, 'wb') as f:
            s3_client.download_fileobj(S3_BUCKET, remote_rss_file_path, f)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise e

def upload_to_s3(file_path, remote_path):
    file_size = os.path.getsize(file_path)

    with open(file_path, 'rb') as f:
        progress = tqdm(
            total=file_size, unit='B', unit_scale=True, desc='Uploading', ncols=80
        )

        def callback(bytes_transferred):
            progress.update(bytes_transferred)

        config = TransferConfig(use_threads=False)
        transfer = S3Transfer(s3_client, config)
        transfer.upload_file(
            file_path,
            S3_BUCKET,
            remote_path,
            callback=callback,
            extra_args={'ACL': 'public-read'},
        )

        progress.close()

    public_url = f'https://{S3_BUCKET}.s3.amazonaws.com/{remote_path}'
    return public_url

def update_rss_feed_s3(remote_audio_url, public_thumbnail_url, video_title, audio_length):
    remote_rss_file_path = 'podcast.rss'
    local_rss_file_path = 'podcast.rss'

    # Check if the RSS file exists on S3
    try:
        s3_client.head_object(Bucket=S3_BUCKET, Key=remote_rss_file_path)
        rss_file_found = True
    except ClientError:
        rss_file_found = False

    rss_feed = FeedGenerator()

    if rss_file_found:
        # If the RSS file exists on S3, download and parse it
        with open(local_rss_file_path, 'wb') as f:
            s3_client.download_fileobj(S3_BUCKET, remote_rss_file_path, f)

        with open(local_rss_file_path, 'r') as f:
            existing_feed = feedparser.parse(f.read())

        # Set feed-level information from the existing feed
        rss_feed.title(existing_feed.feed.title)
        # rss_feed.author(
        #     {'name': existing_feed.feed.author_detail.name, 'email': existing_feed.feed.author_detail.email})
        rss_feed.link(href=existing_feed.feed.link, rel='alternate')
        rss_feed.logo(existing_feed.feed.image.href)
        rss_feed.subtitle(existing_feed.feed.subtitle)
        rss_feed.language(existing_feed.feed.language)

        # Add existing entries to the feed
        for entry in existing_feed.entries:
            e = rss_feed.add_entry()
            e.id(entry.id)
            e.title(entry.title)
            e.link(href=entry.links[0].href, rel='enclosure', type=entry.links[0].type)
            e.enclosure(entry.enclosures[0].href, entry.enclosures[0].length, entry.enclosures[0].type)
    else:
        # Initialize a new feed if the file does not exist
        rss_feed.title('Sam\'s Podcast')
        # rss_feed.author({'name': 'Sam', 'email': 'your_email@example.com'})
        rss_feed.link(href='your_podcast_link', rel='alternate')
        rss_feed.logo(public_thumbnail_url)
        rss_feed.subtitle('Just my stuff')
        rss_feed.language('en')

    # Add the new entry to the feed
    entry = rss_feed.add_entry()
    entry.id(remote_audio_url)
    entry.title(video_title)
    entry.link(href=public_thumbnail_url, rel='enclosure', type='image/jpeg')
    entry.enclosure(url=remote_audio_url, length=str(audio_length), type='audio/mpeg')

    # Save the feed to a local file
    rss_content = rss_feed.rss_str(pretty=True)
    with open(local_rss_file_path, 'wb') as f:
        f.write(rss_content)

    # Upload the feed to S3
    config = TransferConfig(use_threads=False)
    transfer = S3Transfer(s3_client, config)
    extra_args = {'ACL': 'public-read'}
    transfer.upload_file(
        local_rss_file_path,
        S3_BUCKET,
        remote_rss_file_path,
        extra_args=extra_args,
    )

Finally, we have the main function, which serves as the entry point for the script. It orchestrates all the other functions we have implemented so far: it prompts the user for a YouTube video URL, then proceeds with the following steps:

  1. Downloads the audio and thumbnail for the YouTube video using the download_audio and download_thumbnail functions.
  2. Uploads the downloaded audio and thumbnail files to the specified Amazon S3 bucket using the upload_to_s3 function.
  3. Checks whether the podcast RSS file already exists on S3 using the download_rss_from_s3 function.
  4. Updates the feed with the new episode information using the update_rss_feed_s3 function, which downloads the existing feed from S3 or initializes a new one.
  5. Uploads the updated RSS feed back to the S3 bucket (this happens inside update_rss_feed_s3).

Once all these steps are complete, the function prints how long the whole process took. The main function ties all the individual pieces together into a single script for converting YouTube videos into podcast episodes and updating the associated RSS feed.

def main():
    start_time = time.time()
    video_url = input("Enter the YouTube video URL: ")

    video_title, thumbnail_url = download_audio(video_url)
    print(f"Audio downloaded: {video_title}")

    slug_title = slugify(video_title)
    local_audio_file_path = f'{slug_title}.mp3'
    local_thumbnail_file_path = f'{slug_title}.jpg'
    audio_length = os.path.getsize(local_audio_file_path)

    download_thumbnail(thumbnail_url, local_thumbnail_file_path)
    print("Thumbnail downloaded.")

    print("Uploading to S3.")
    remote_audio_file_path = f'{slug_title}.mp3'
    remote_thumbnail_file_path = f'{slug_title}.jpg'
    public_audio_url = upload_to_s3(local_audio_file_path, remote_audio_file_path)
    public_thumbnail_url = upload_to_s3(local_thumbnail_file_path, remote_thumbnail_file_path)
    print("Audio and thumbnail uploaded to S3.")
    print(f"Public audio URL: {public_audio_url}")
    print(f"Public thumbnail URL: {public_thumbnail_url}")

    local_rss_file_path = 'podcast.rss'
    remote_rss_file_path = 'podcast.rss'

    # update_rss_feed_s3 builds the feed from scratch if it isn't on S3 yet,
    # so this check is just to tell the user what's about to happen
    rss_file_found = download_rss_from_s3(local_rss_file_path, remote_rss_file_path)
    if not rss_file_found:
        print('RSS feed not found on S3. Creating a new one.')

    update_rss_feed_s3(public_audio_url, public_thumbnail_url, video_title, audio_length)
    print("RSS feed updated.")

    end_time = time.time()
    time_taken = end_time - start_time
    minutes = int(time_taken // 60)
    seconds = int(time_taken % 60)
    print("Time taken: {} mins {} secs".format(minutes, seconds))

if __name__ == '__main__':
    main()

That’s it! Run the script and paste in the YouTube URL you want to add to your podcast when prompted:

python podcast.py
Enter the YouTube video URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ

And you have a podcast! The feed lives at https://<your-bucket>.s3.amazonaws.com/podcast.rss (with the bucket name above, https://podcast.s3.amazonaws.com/podcast.rss). Just add that URL to your podcast app.

You can see the full code on GitHub here.