
finding the best data science podcasts

September 9th, 2020


For the people who read the TL;DR first: you people disgust me, you savages.

| title | author | avg_rtg | rtg_ct | episodes |
| --- | --- | --- | --- | --- |
| Lex Fridman Podcast | Lex Fridman | 4.9 | 2400 | 126 |
| Machine Learning Guide | OCDevel | 4.9 | 626 | 30 |
| Data Skeptic | Kyle Polich | 4.4 | 431 | 300 |
| Data Stories | Enrico Bertini and Moritz Stefaner | 4.5 | 405 | 162 |
| The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) | Sam Charrington | 4.7 | 300 | 300 |
| DataFramed | DataCamp | 4.9 | 188 | 59 |
| The AI Podcast | NVIDIA | 4.5 | 162 | 125 |
| SuperDataScience | Kirill Eremenko | 4.6 | 161 | 300 |
| Partially Derivative | Partially Derivative | 4.8 | 141 | 101 |
| Machine Learning | Stanford | 3.9 | 138 | 20 |
| Talking Machines | Tote Bag Productions | 4.6 | 133 | 106 |
| AI in Business | Daniel Faggella | 4.4 | 102 | 100 |
| Learning Machines 101 | Richard M. Golden | 4.4 | 87 | 82 |
| storytelling with data podcast | Cole Nussbaumer Knaflic | 4.9 | 80 | 33 |
| Data Crunch | Data Crunch Corporation | 4.9 | 70 | 64 |
| Data Viz Today | Alli Torban | 5 | 64 | 62 |
| Artificial Intelligence | MIT | 4.1 | 61 | 31 |
| O'Reilly Data Show Podcast | O'Reilly Media | 4.2 | 59 | 60 |
| Machine Learning – Software Engineering Daily | Machine Learning – Software Engineering Daily | 4.5 | 59 | 115 |
| Data Science at Home | Francesco Gadaleta | 4.2 | 58 | 100 |
| Data Engineering Podcast | Tobias Macey | 4.7 | 58 | 150 |
| Big Data | Ryan Estrada | 4.6 | 58 | 13 |
| Follow the Data Podcast | Bloomberg Philanthropies | 4.3 | 57 | 82 |
| Making Data Simple | IBM | 4.3 | 56 | 104 |
| Analytics on Fire | Mico Yuk | 4.4 | 51 | 48 |
| Learn to Code in One Month | Learn to Code | 4.9 | 50 | 26 |
| Becoming A Data Scientist Podcast | Renee Teate | 4.5 | 49 | 21 |
| Practical AI: Machine Learning & Data Science | Changelog Media | 4.5 | 48 | 105 |
| The Present Beyond Measure Show: Data Visualization, Storytelling & Presentation for Digital Marketers | Lea Pica | 4.9 | 44 | 58 |
| The Data Chief | Mission | 4.9 | 43 | 16 |
| AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion | Cognilytica | 4.2 | 42 | 161 |
| Data Driven | Data Driven | 4.9 | 41 | 257 |
| HumAIn Podcast - Artificial Intelligence, Data Science, and Developer Education | David Yakobovitch | 4.8 | 39 | 78 |
| Data Gurus | Sima Vasa | 5 | 39 | 106 |
| Masters of Data Podcast | Sumo Logic hosted by Ben Newton | 5 | 38 | 74 |
| The PolicyViz Podcast | The PolicyViz Podcast | 4.7 | 36 | 180 |
| The Radical AI Podcast | Radical AI | 4.9 | 34 | 35 |
| Women in Data Science | Professor Margot Gerritsen | 4.9 | 28 | 24 |
| Towards Data Science | The TDS team | 4.6 | 26 | 50 |
| Data in Depth | Mountain Point | 5 | 22 | 24 |
| Data Science Imposters Podcast | Antonio Borges and Jordy Estevez | 4.4 | 22 | 88 |
| The Artists of Data Science | Harpreet Sahota | 4.9 | 19 | 41 |
| #DataFemme | Dikayo Data | 5 | 17 | 30 |
| The Banana Data Podcast | Dataiku | 4.9 | 15 | 33 |
| Experiencing Data with Brian T. O'Neill | Brian T. O'Neill | 4.9 | 14 | 13 |
| Secrets of Data Analytics Leaders | Eckerson Group | 4.8 | 13 | 82 |
| Data Journeys | AJ Goldstein | 5 | 13 | 26 |
| Data Driven Discussions | Outlier.ai | 5 | 12 | 8 |
| Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science | Felipe Flores | 4.4 | 11 | 135 |
| Artificially Intelligent | Christian Hubbs and Stephen Donnelly | 4.9 | 11 | 100 |

A few additional podcasts were recommended by the community on Reddit; those are listed below.

| title | author | avg_rtg | rtg_ct | episodes |
| --- | --- | --- | --- | --- |
| Linear Digressions | Udacity | 4.8 | 325 | 291 |
| Not So Standard Deviations | Roger Peng and Hilary Parker | 4.2 | 164 | 100 |
| The Local Maximum | Max Sklar | 4.9 | 42 | 136 |
| Chai Time Data Science | Sanyam Bhutani | 4.8 | 9 | 109 |

why i want to find data science podcasts

This would normally be at the top of an article on finding data science podcasts. Well, it would be at the top of any article. But realistically, most people are finding this from Google, and they're just looking for the answer that's at the top of the page. If you type in 'the most popular data science podcasts', you really don't want to have to scroll down endlessly to find the answer you're looking for. So to make their experience better, we're just leaving the answer up there. And giving them sass. Lots of sass.

Anyways, I really like listening to things. While newsletters are great for keeping up with current events and blogs are great for learning specific things, podcasts have a special place in my heart for allowing me to aimlessly learn something new every day. The format really lends itself to delivering information efficiently, but in a way where you can multitask. Pre-COVID, my morning commute was typically full of podcasts. While COVID has rendered my commute a nonexistent affair, I still try to listen to at least a podcast a day if I can manage it. My view is that 30 minutes of learning a day will really add up in the long run, and I feel that podcasts are a great way to get there.

Now that we've been through my love affair with podcasts, you can imagine my surprise when I started looking for a few data science ones to subscribe to and I didn't find a tutorial on how to use web scraping to find the most popular data science podcasts to listen to. I know, crazy. There's a web scraping tutorial on everything under the sun except for - seemingly - podcasts. I mean there's probably not one on newsletters either, but we'll leave that alone for now...

So if no one else is crazy enough to write about finding data science podcasts with web scraping, then...

gameplanning the process

By now we're almost certainly rid of those savages who are only here for the answer (gasp, how could they), so we'll go into the little process I went through to gather the data. It's not particularly long, and took me probably an hour to put it together, so it should be a good length for an article.

I'm using python here with an installation of Anaconda (which is a common package management / deployment system for python). I'll be running this in a Jupyter notebook, since it's a one-off task that I don't need to use ever again... hopefully.

In terms of what I'm going to do, I'll run a few Google keyword searches which are limited to the 'https://podcasts.apple.com/us/podcast/' domain and scrape the results from the first few pages. From there I'll just be scraping each Apple Podcasts page to get the total number of ratings and the average rating. Yea, the data will be biased, but it's a quick and dirty way to get the answer I'm looking for.
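
For example, the first of those searches boils down to a query along the lines of:

site:https://podcasts.apple.com/us/podcast/ data podcast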

code to find top data science podcasts - version 1

# import default python packages
import urllib
import requests
import time

The above packages are included with python; the ones below aren't always. If you don't have them installed, you'll have to download them, either with pip or with conda.
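
If you need to install them, something like this from a terminal (or an Anaconda prompt) should do it:

pip install beautifulsoup4 pandas
# or, with conda
conda install beautifulsoup4 pandas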

# import non-standard python packages
# if you don't have these installed, install them with pip or conda
from bs4 import BeautifulSoup
import pandas as pd

Now that the packages have been imported, you should define your user agent. First off, because it's polite if you're scraping anything. Secondly, Google gives different results for mobile and desktop searches. This isn't actually my user-agent, by the way; I took it from another tutorial since I'm a bit lazy. I actually use linux...

# define your desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

Alright, now we're going to define the queries we want to run, and then create a function that spits out the Google URL we want to scrape. I'm putting the queries in a kwargs format since I want to pass them through a function; that way I can just loop through the list of kwargs and collect whatever the function returns.

# Queries
list_kwargs = [
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
]

def string_to_podcast_query(string, pg=None):
    # restrict the search to the apple podcasts domain and url-encode the search string
    query = urllib.parse.quote_plus(f'site:https://podcasts.apple.com/us/podcast/ {string}')
    # if a page number was passed, tack on google's start parameter (10 results per page)
    if pg is not None:
        query = query + "&start=" + str(10*(pg-1))
    # return the full search url, along with the original search term for labelling later
    return f"https://google.com/search?hl=en&lr=en&q={query}", string
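
As a quick sanity check, calling the function with one of the kwargs from the list should hand back the search URL plus the term, something like this:

# e.g. the second page of 'data podcast'
string_to_podcast_query(**{"string": 'data podcast', "pg": 2})

('https://google.com/search?hl=en&lr=en&q=site%3Ahttps%3A%2F%2Fpodcasts.apple.com%2Fus%2Fpodcast%2F+data+podcast&start=10', 'data podcast')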

# define the headers we will add to all of our requests
headers = {"user-agent" : USER_AGENT}

# set up an empty list to push results to
results = []

# cycle through the list of queries 
for x in list_kwargs:
    # return the query url and the search term that was used to create it (for classification later)
    url, search_term = string_to_podcast_query(**x)

    # make a get request to the url, include the headers with our user-agent
    resp = requests.get(url, headers=headers)

    # only proceed if you get a 200 code that the request was processed correctly
    if resp.status_code == 200:
        # feed the request into beautiful soup
        soup = BeautifulSoup(resp.content, "html.parser")

        # find all divs (an html element that wraps page areas) within the google results
        for g in soup.find_all('div', class_='r'):
            # within the results, find all the links
            anchors = g.find_all('a')
            if anchors:
                # get the link and title, add them to an object, and append that to the results array
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)

    # sleep for 2.5s between requests.  we don't want to annoy google and deal with recaptchas
    time.sleep(2.5)

Alright, now we have the Google results back - nice. From here, let's put them in a pandas dataframe and filter them a bit.

google_results_df = pd.DataFrame(results)

# create a filter for anything that is an episode.  They should contain a ' | '.
# drop any duplicate results as well.
google_results_df['is_episode'] = google_results_df['title'].str.contains(' | ',regex=False)
google_results_df = google_results_df.drop_duplicates(subset='title')

google_results_podcasts = google_results_df.copy().loc[google_results_df['is_episode']==False]

Ok cool, we have a list of podcasts. Let's define our Apple Podcasts scraper.

def podcast_scrape(link):
    # get the link, use the same headers as had previously been defined.
    resp = requests.get(link, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")

    # find the figcaption element on the page
    rtg_soup = soup.find("figcaption", {"class": "we-rating-count star-rating__count"})
    # the text will return an avg rating and a number of reviews, split by a •
    # we'll split that out, so '4.3 • 57 Ratings' becomes '4.3', '57 Ratings'
    avg_rtg, rtg_ct = rtg_soup.get_text().split(' • ')
    # then we'll take numbers from the rtg_ct variable by splitting it on the space
    rtg_ct = rtg_ct.split(' ')[0]
    
    # find the title in the document, get the text and strip out whitespace
    title_soup = soup.find('span', {"class":"product-header__title"})
    title = title_soup.get_text().strip()
    # find the author in the document, get the text and strip out whitespace
    author_soup = soup.find('span', {"class":"product-header__identity podcast-header__identity"})
    author = author_soup.get_text().strip()

    # find the episode count div, then the paragraph under that, then just extract the # of episodes
    episode_soup = soup.find('div', {"class":"product-artwork__caption small-hide medium-show"})
    episode_soup_p = episode_soup.find('p')
    episode_ct = episode_soup_p.get_text().strip().split(' ')[0]
    
    # format the response as a dict, return that response as the result of the function
    response = {
        "title": title,
        "author": author,
        "link": link,
        "avg_rtg": avg_rtg,
        "rtg_ct": rtg_ct,
        "episodes": episode_ct
    }
    return response

Cool, we now have a podcast scraper. You can try it with the below code.

podcast_scrape('https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750')


{'title': 'Follow the Data Podcast',
 'author': 'Bloomberg Philanthropies',
 'link': 'https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750',
 'avg_rtg': '4.3',
 'rtg_ct': '57',
 'episodes': '82'}

Back to the code. Let's now loop through all the podcast links we have.

# define the result array we'll fill during the loop
podcast_summ = []
for link in google_results_podcasts['link']:
    # use a try/except, since there are a few episodes still in the list that will cause errors.
    # this way, if there is an error we just won't add anything to the array.
    try:
        # get the response from our scraper and append it to our results
        pod_resp = podcast_scrape(link)
        podcast_summ.append(pod_resp)
    except:
        pass
    # wait for 5 seconds to be nice to apple
    time.sleep(5)

Now to put everything into a dataframe and do a little bit of sorting and filtering.

pod_df = pd.DataFrame(podcast_summ)

# Remove non-English podcasts (their links carry a locale parameter), sorry guys...
pod_df = pod_df.loc[~pod_df['link'].str.contains('l=')]
pod_df.drop_duplicates(subset='link', inplace=True)

# merge with the original dataframe (in case you want to see which queries were responsible for which podcasts)
merge_df = google_results_podcasts.merge(pod_df,on='link',suffixes=('_g',''))
merge_df.drop_duplicates(subset='title', inplace=True)

# change the average rating and rating count columns from strings to numbers
merge_df['avg_rtg'] = merge_df['avg_rtg'].astype('float64')
merge_df['rtg_ct'] = merge_df['rtg_ct'].astype('int64')

# sort by total ratings and then send them to a csv
merge_df.sort_values('rtg_ct',ascending=False).to_csv('podcasts.csv')

A previous version of the article had me export this as a csv... I've since learned I can export things as markdown straight from pandas... goes to show I don't write many blog posts. If you are exporting to markdown, you'll need to install a package called tabulate with pip or conda.
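
For reference, the markdown export is pretty much a one-liner; a minimal sketch, assuming tabulate is installed:

# print the sorted table as a markdown table instead of writing a csv
print(merge_df.sort_values('rtg_ct', ascending=False).to_markdown())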

Anyways, that was the full process for creating the above list of data science podcasts. You now have the top podcasts, sorted by total ratings. I considered also scraping Castbox as a source (since they have an approximation of subscribers / downloads), but I couldn't find any good way to search for generally popular podcasts, or for podcasts that contained a certain word.

The first version of this article stopped here and showed the results from this code.

code to find top data science podcasts - version 2

Well, that was fine, but I think it's actually lacking a bit. A few podcasts that I've stumbled across are missing, ones I was hoping this would capture. So we're going to switch some stuff up. First, I'm going to use a mobile user agent to tell Google I'm searching from my phone.

Why? Well, Google shows different results for desktop searches vs mobile searches, so if we're looking to find the best podcasts, we want to be where most of the searches are actually happening. And since you basically always listen to podcasts on your phone, it probably makes sense to search from your phone... The code for that is below; the main changes are in which elements are scraped from the page.
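
One thing to note: the snippet below assumes a MOBILE_USER_AGENT constant has already been defined. Any mobile user-agent string should do; here's an example iPhone one (also not actually mine):

# define a mobile user-agent (example iPhone string, swap in whatever mobile UA you like)
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1"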

# Mobile Search Version
headers = {"user-agent" : MOBILE_USER_AGENT}
            
results = []
for x in list_kwargs:
    url, search_term = string_to_podcast_query(**x)
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")

        for g in soup.find_all('div', class_='mnr-c'): # updated target class
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = anchors[0].find_all('div')[1].get_text().strip() # updated title crawler
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)

    time.sleep(2.5)

What else did I switch up? I switched the Google queries up a bit and added a few more. I figure if I'm actually trying to find the best podcasts, it makes sense to search for them. That way, you get the ones that typically show up on these types of blog lists.

# Queries
list_kwargs = [
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
]

And that's it - all of the changes I made for the second version. The results are updated up top, and they give a more complete picture.

code to find top data science podcasts - version 3

And I'm an idiot. 'Fixing' my queries to only find the 'best data science podcasts' ended up making me miss a few of the good ones I found earlier. So I'm going to do as any good data scientist does and just combine the results of both sets of queries...

# Queries - the original set and the 'best ...' set combined
list_kwargs = [
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
]
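
And if you've already run versions 1 and 2 separately and don't feel like re-running everything, you could just stack the two sets of Google results before the scraping step. A rough sketch, where results_v1 and results_v2 are hypothetical names for the two `results` lists from the earlier runs:

import pandas as pd

# combine the google results from both runs and drop any duplicate links
combined_df = pd.concat([pd.DataFrame(results_v1), pd.DataFrame(results_v2)], ignore_index=True)
combined_df = combined_df.drop_duplicates(subset='link')
# from here, the same filtering, scraping, and merging steps as before apply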