
finding the best data science twitter accounts

September 14th, 2020


This section is for everyone who just wants the answer immediately...

| name | handle | mention_count | followers | tweets | following |
|------|--------|---------------|-----------|--------|-----------|
| Kirk Borne | kirkdborne | 18 | 268,081 | 124,514 | 10,437 |
| Hilary Mason | hmason | 16 | 125,086 | 19,199 | 1,811 |
| Andrew Ng | andrewyng | 15 | 529,754 | 1,293 | 472 |
| KDnuggets | kdnuggets | 14 | 179,062 | 65,786 | 433 |
| peteskomoroch | peteskomoroch | 9 | 48,846 | 46,865 | 3,625 |
| dj patil | dpatil | 7 | 76,302 | 19,877 | 1,976 |
| Ryan R. Rosario | datajunkie | 6 | 20,614 | 15,697 | 932 |
| Bernard Marr | bernardmarr | 6 | 130,006 | 33,629 | 25,480 |
| Fei-Fei Li | drfeifei | 5 | 364,452 | 1,333 | 354 |
| Olivier Grisel | ogrisel | 5 | 28,726 | 14,208 | 2,805 |
| Ben Lorica 罗瑞卡 | bigdata | 5 | 44,603 | 9,521 | 340 |
| Ronald van Loon | ronald_vanloon | 5 | 232,107 | 89,286 | 183,633 |
| David Smith | revodavid | 5 | 31,553 | 15,053 | 2,280 |
| Carla Gentry | data_nerd | 5 | 55,925 | 330,619 | 13,153 |
| Hadley Wickham | hadleywickham | 5 | 111,757 | 39,726 | 272 |
| Monica Rogati | mrogati | 5 | 51,674 | 2,439 | 595 |
| Jeff Hammerbacher | hackingdata | 5 | 35,476 | 3 | 15,781 |
| Cole Knaflic | storywithdata | 4 | 22,480 | 5,802 | 776 |
| Jake Porway | jakeporway | 4 | 14,325 | 6,893 | 680 |
| John Myles White | johnmyleswhite | 4 | 29,428 | 22,868 | 17 |
| Randy Olson | randal_olson | 4 | 123,392 | 20,079 | 77,558 |
| Drew Conway | drewconway | 4 | 24,677 | 27,601 | 297 |
| Ian Goodfellow | goodfellow_ian | 3 | 199,394 | 2,709 | 1,079 |
| Lisa Charlotte Rost | lisacrost | 3 | 19,394 | 4,430 | 748 |
| Mara Averick | dataandme | 3 | 45,260 | 43,356 | 2,892 |
| Evan Sinar, PhD | evansinar | 3 | 51,527 | 9,761 | 23,455 |
| Nathan Yau | flowingdata | 3 | 82,732 | 190 | 352 |
| Mona Chalabi | monachalabi | 3 | 88,474 | 6,864 | 1,906 |
| r/DataIsBeautiful | dataisbeautiful | 3 | 48,967 | 13,002 | 1,369 |
| Josh Wills | josh_wills | 3 | 14,167 | 13,568 | 1,332 |
| Sebastian Raschka | rasbt | 3 | 47,492 | 1,021 | 398 |
| Cindi Howson | biscorecard | 3 | 25,578 | 12,195 | 4,061 |
| Ilya Sutskever | ilyasut | 3 | 57,947 | 540 | 1,849 |
| Data Science Central | analyticbridge | 3 | 222,337 | 139,735 | 4,232 |
| Andy Kirk | visualisingdata | 3 | 49,624 | 38,765 | 505 |

twitter time!

I've never been an avid Twitter user - historically I've gravitated to Reddit and been pretty loyal in that regard. But recently I've found myself pulled towards Twitter and have become much more engaged... Why? Not a clue. Maybe their data scientists have unlocked the secret to addictive social media, who knows...

Nonetheless, since I'm now using Twitter much more - and now that I have a Twitter account that I actually post on (shameless plug for my profile) - I figured I should find the top data science Twitter accounts to follow.

Back to the task at hand: just like my issue with finding podcasts and newsletters, no one had used web scraping to find good data science Twitter accounts. So I took it upon myself...

I'm sensing a pattern here...

gameplanning the process

You've now suffered through the section where I give you a personal view to relate to, so next we're on to the actual process. I'd probably have stopped reading by now if I were you. But you, you've struggled onwards. And for that, I commend you.

The overall process is an amended version of the code I wrote for finding good newsletters. I still have to go back and fix that code (and rewrite the article), but new content is more fun. Shiny Object Syndrome, if you will.

Here's the high level overview:

  • Scrape Google results for queries like 'best data science twitter accounts' to collect the web pages Google links to
  • Get all external links on each page
  • Count the frequency of each link to find popular accounts
  • Scrape the Twitter accounts to get followers / following / tweets statistics

And that's it. Super simple. Or at least conceptually super simple.

Without further ado, I present to you

code to find top data science twitter accounts

Import your packages. Again, I'll refer you to pip and conda installation guides.

# import default python packages
import urllib
import requests
import time

# import non-standard python packages
# if you don't have these installed, install them with pip or conda
from bs4 import BeautifulSoup
from readability import Document # from the readability-lxml package - it produces a stripped-down 'reader mode' version of a web page, which I'm using to get rid of a lot of the noise on websites
import pandas as pd

Define your user agents. We'll be scraping with both a mobile user agent (for Google's mobile results) and a desktop user agent (for each Google link / Twitter account), so let's define both.

# define the desktop user-agent
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"

Create a function that takes a query string (and an optional results page number) and returns the Google search URL.

def string_to_twitter_query(string, pg=None):
    # url-encode the search term
    query = urllib.parse.quote_plus(f'{string}')
    # google paginates with the 'start' parameter, in steps of 10
    if pg is not None:
        query = query + "&start=" + str(10*(pg-1))
    return f"https://google.com/search?hl=en&lr=en&q={query}", string

The list of Google queries feeding into this 'algorithm':

# Queries
list_kwargs = [
    {"string": 'best data twitter accounts'},
    {"string": 'best data twitter accounts', "pg": 2},
    {"string": 'best data twitter accounts', "pg": 3},
    {"string": 'best data twitter accounts', "pg": 4},
    {"string": 'best data science twitter accounts'},
    {"string": 'best data science twitter accounts', "pg": 2},
    {"string": 'best data science twitter accounts', "pg": 3},
    {"string": 'best artificial intelligence twitter accounts'},
    {"string": 'best machine learning twitter accounts'},
    {"string": 'best data engineering twitter accounts'},
    {"string": 'best data visualization twitter accounts'},
    
    {"string": 'data twitter accounts'},
    {"string": 'data twitter accounts', "pg": 2},
    {"string": 'data twitter accounts', "pg": 3},
    {"string": 'data science twitter accounts'},
    {"string": 'data science twitter accounts', "pg":2},
    {"string": 'artificial intelligence twitter accounts'},
    {"string": 'machine learning twitter accounts'},
    {"string": 'data engineering twitter accounts'},
    {"string": 'data visualization twitter accounts'},
]

Now let's crawl Google as a mobile user. I ran into a problem with featured snippets, so I just skip them - mostly because I'm being lazy and want to repeat this exercise for a lot of other types of media (YouTube, courses, books). This is like... terrible in practice, so... don't copy me here.

# Crawling Google as a mobile user
headers = {"user-agent" : MOBILE_USER_AGENT}
            
results = []
for x in list_kwargs:
    url, search_term = string_to_twitter_query(**x)
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        # skip pages that didn't come back cleanly, but still wait before the next request
        time.sleep(2.5)
        continue
    soup = BeautifulSoup(resp.content, "html.parser")

    # google's mobile result cards use the 'mnr-c' class
    for g in soup.find_all('div', class_='mnr-c'):
        anchors = g.find_all('a')
        if len(anchors) > 0:
            try:
                # this lookup fails on featured snippets, which we skip
                link = anchors[0]['href']
            except KeyError:
                continue
                
            try:
                title = anchors[0].find_all('div')[1].get_text().strip()
            except:
                title = anchors[0].get_text().strip()
            
            item = {
                "title": title,
                "link": link,
                "search_term": search_term
            }
            results.append(item)

    # Wait 2.5s between each page crawl
    time.sleep(2.5)

Put the returned results into a dataframe and drop any duplicate links.

# put the results into a dataframe and drop duplicate links
twitter_df = pd.DataFrame(results)
twitter_df.drop_duplicates(subset='link',inplace=True)
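
If you're curious how much raw material the queries turned up, a quick count does the trick:

# how many unique pages did the google queries surface?
print(f"{len(twitter_df)} unique links to crawl")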

Now it's time to scrape the results we got from Google. We'll switch to a desktop user agent (and stay with it for the rest of the code). Below we also define the article crawler.

# switch the user agent to desktop - articles shouldn't differ on desktop vs mobile and will likely have fewer issues on desktop
headers = {"user-agent" : USER_AGENT}

#define the crawler for each article
def article_link_crawl(link):
    """
    Returns links and either a 1 or 0 if it was a success / failure.
    Only Crawls articles, so there should be a few failures
    """
    try:
        domain = link.split('://')[1].split('/')[0] # defines the site domain
        article_links = []
        resp = requests.get(link, headers=headers) # get request for the link
        
        # pass the article through readibility to get the article content rather than the full webpage
        rd_doc = Document(resp.text)
        if resp.status_code == 200:
            soup = BeautifulSoup(rd_doc.content(), "html.parser")
        link_soup = soup.find_all('a') # find all links
        for link in link_soup:
            # if the link has a href, create an item to add to the aggregate article_links list
            if link.has_attr('href'):
                item = {
                    "text": link.get_text().strip(),
                    "link": link['href']
                    }
                # skip blank links, same-domain links, and relative links starting with '/' or '#' (also internal)
                if item['text'] != '' and item['link'] != '' and item['link'].find(domain) == -1 and not item['link'].startswith(('/', '#')):
                    article_links.append(item)
        return article_links, 1
    except:
        return None, 0
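
If you want to sanity check the crawler before running the full loop, you can point it at a single page first (the URL below is just a placeholder - swap in one of the links from twitter_df):

# quick smoke test on one article before the full loop (placeholder URL)
test_links, ok = article_link_crawl('https://example.com/best-data-science-twitter-accounts')
print(ok, len(test_links) if test_links else 0)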

With the 'crawler' defined, we can now loop through all of the links. I'm also tracking a few data quality metrics as we crawl - during development, I'd check these after each test cycle to see which URLs succeeded or failed.

agg_links = []
# define a few data quality checks for the loop
total_success = 0
total_fail = 0
fail_urls = []

# cycle through all the links and keep track of data quality
for link in twitter_df['link']:
    res_links, is_success = article_link_crawl(link)
    if is_success == 1:
        total_success += 1
    else:
        total_fail += 1
        fail_urls.append(link)

    if res_links is not None:
        for lnk in res_links:
            agg_links.append(lnk)
    time.sleep(2.5)
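
Once the loop finishes, a quick look at those counters tells you how the crawl went:

# quick look at how the crawl went
print(f"crawled {total_success} pages successfully, {total_fail} failed")
print(fail_urls)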

Now that we have the results, we can define the function to count link occurrences.

# function to count occurrences of specific links
def list_freq(tgt_list): 
    freq = {} 
    for item in tgt_list: 
        if (item in freq): 
            freq[item] += 1
        else: 
            freq[item] = 1
    
    result = []
    for key, value in freq.items(): 
        result.append({
            "link": key,
            "count": value
        })
    return result

And then run said function on all of our links:

# normalize the links so duplicates collapse together: force https and lowercase everything
clean_link_list = [x['link'].replace('http://','https://').lower() for x in agg_links]
link_freq = list_freq(clean_link_list)
link_freq_df = pd.DataFrame(link_freq)
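
As an aside, the standard library's collections.Counter does the same counting in a couple of lines, if you'd rather skip the helper function:

# equivalent counting with the standard library (same output shape as list_freq)
from collections import Counter
link_freq_alt = [{"link": k, "count": v} for k, v in Counter(clean_link_list).items()]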

Now we've finally arrived at the list of the most popular Twitter accounts. We'll clean up the results to sort by occurrence, keeping only accounts with at least three mentions.

# sort by most mentioned
link_freq_df = link_freq_df.sort_values('count',ascending=False)
# only include twitter links
link_freq_df = link_freq_df.loc[link_freq_df['link'].str.contains('https://twitter.com')]
# and only keep accounts with at least three mentions
link_freq_df = link_freq_df.loc[link_freq_df['count']>=3]

A few more things to clean up now - we'll remove links to hashtags and Twitter lists.

# remove hashtags
link_freq_df = link_freq_df.loc[~link_freq_df['link'].str.contains('hashtag',regex=False)]
# remove links to twitter lists
link_freq_df = link_freq_df.loc[~link_freq_df['link'].str.contains('list',regex=False)]

So now we have our list of accounts that are mentioned the most frequently, but it leaves a bit to be desired...

link_freq_df.head()

# note: if you run that command yourself, pandas will render a nicer-looking table than the markdown version below

| link                               | count | 
|------------------------------------|-------| 
| https://twitter.com/hmason         | 15    | 
| https://twitter.com/kdnuggets      | 14    | 
| https://twitter.com/AndrewYNg      | 13    | 
| https://twitter.com/KirkDBorne     | 11    | 
| https://twitter.com/peteskomoroch  | 9     | 

What's missing? Stats. We need each account's number of tweets, following, and followers!

So we'll now define a Twitter scraper.

def twitter_scrape(link):
    resp = requests.get(link, headers=headers)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.content, "html.parser")

    # the page served to this old user agent exposes the stats in a 'profile-stats' table
    prof_stats_soup = soup.find('table', {"class":"profile-stats"}).find_all('div',{"class":"statnum"})
    tweets = prof_stats_soup[0].get_text().strip()
    following = prof_stats_soup[1].get_text().strip()
    followers = prof_stats_soup[2].get_text().strip()
    
    acct_name = soup.find('div', {"class":"fullname"}).get_text().strip()

    response = {
        "name": acct_name,
        "tweets": tweets,
        "following": following,
        "followers": followers,
        "link": link,
    }
    return response
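
Before looping over everything, it's worth trying it on a single account to make sure the selectors still line up - if Twitter has changed its markup since I wrote this, the soup.find calls will throw:

# try one account before looping over everything
print(twitter_scrape('https://twitter.com/hmason'))
# expect a dict with name, tweets, following, followers, and the link (or None if the request failed)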

And run all of those links through it...

twitter_summ = []
for link in link_freq_df['link']:
    try:
        twitter_resp = twitter_scrape(link)
        if twitter_resp is not None:
            twitter_summ.append(twitter_resp)
    except:
        # some accounts will fail to parse - just skip them
        pass
    time.sleep(2)

twitter_summ_df = pd.DataFrame(twitter_summ)

A few last minute tweaks before I upload the table to the top of this article...

# merge with the previous dataframe that included mention counts
twitter_summ_df = twitter_summ_df.merge(link_freq_df, on='link')

# rename count to mention_count
twitter_summ_df = twitter_summ_df.rename(columns={"count": "mention_count"})

# pull the handle out of each url and format it as a markdown link for the table up top
twitter_summ_df['handle_text'] = twitter_summ_df['link'].str.split('/').str[-1]
twitter_summ_df['handle'] = '[' + twitter_summ_df['handle_text'] + '](' + twitter_summ_df['link'] + ')'
twitter_summ_df[['name','handle','mention_count','followers','tweets','following']].to_csv('twitter.csv',index=False)
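
One optional extra: the scraped stats come back as strings (e.g. '268,081'), so if you want to sort by anything other than mention count you'd want to convert them to integers first. A quick sketch, assuming the comma-separated format shown in the table at the top:

# optional: convert the comma-separated stat strings to integers so they sort numerically
for col in ['followers', 'tweets', 'following']:
    twitter_summ_df[col] = twitter_summ_df[col].str.replace(',', '', regex=False).astype(int)
twitter_summ_df = twitter_summ_df.sort_values('mention_count', ascending=False)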

And voilà, now we have a list of all the Twitter accounts that the rest of the internet recommends. It's much better than whatever I'd recommend, since we can all agree that I don't understand anything about Twitter...

If you enjoyed this article and found it useful, I'd love it if you followed me on Twitter at @greg_on_data. Happy twittererer-ing!