I have been preparing a couple of talks that I have to give in the next couple of weeks, and I needed some pictures of the people working at Signal to show some nice images of the team and the company in general. Although we have some of them stored online, I realised that our Twitter account had some of the best pictures, especially from the early days of the company. Almost at the same time, I was reading a blog post about mining Twitter data with Python, written by my good friend and ex-colleague (at Queen Mary), Dr. Marco Bonzanini. These two events together seemed like a good excuse to build a little tool in Python to download the pictures that a Twitter account has published, and this is the main focus of this post. I hope you find it useful; I definitely have…
Marco’s post explains very well how to register a Twitter app, a necessary step to be able to use the Twitter API, and how to set up tweepy to return JSON. For the sake of completeness, the code used for this purpose is shown below, but I encourage you to visit the original post for a detailed explanation.
import tweepy
from tweepy import OAuthHandler
import json

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

@classmethod
def parse(cls, api, raw):
    status = cls.first_parse(api, raw)
    setattr(status, 'json', json.dumps(raw))
    return status

# Status() is the data model for a tweet
tweepy.models.Status.first_parse = tweepy.models.Status.parse
tweepy.models.Status.parse = parse
# User() is the data model for a user profile
tweepy.models.User.first_parse = tweepy.models.User.parse
tweepy.models.User.parse = parse
# You need to do it for all the models you need

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
After this point, we can access the Twitter API in a pythonic way through the variable api, which greatly simplifies the coding process and produces more readable and elegant code. Just to reiterate our goal: we want to get all the pictures that have been published by a specific Twitter user. This involves the following steps:
- Get all the tweets from a user
- Filter those with images and get their full path
- Download the images
1. Getting the tweets from a user
Listing all the tweets from a given user can be done with the method user_timeline, which lets us specify the screen_name (i.e., the Twitter handle) and the number of tweets we want to get (up to a maximum of 200). It also allows more fine-grained filtering, such as including or excluding retweets and replies. In our case, we want 200 tweets created directly by the user (i.e., no retweets or replies):
tweets = api.user_timeline(screen_name='miguelmalvarez', count=200, include_rts=False, exclude_replies=True)
This very simple code gives us the latest tweets from an account (mine in this case). However, it doesn’t allow us to get more than 200 of them. This type of problem is usually solved with pagination; however, the real-time nature of Twitter makes that approach unusable, because new tweets keep shifting the pages. For this reason, the Twitter API uses cursoring, where we specify with max_id the id of the most recent tweet we want to receive. As a result, we receive up to 200 tweets that are no newer than the one we specified; since max_id is inclusive, we pass the id of the oldest tweet we already have minus one. This is explained in detail in the documentation about Working with Timelines, and it is represented by the following code:
username = 'miguelmalvarez'
tweets = api.user_timeline(screen_name=username, count=200, include_rts=False, exclude_replies=True)
last_id = tweets[-1].id

while True:
    # max_id is inclusive, so subtract 1 to skip the tweet we already have
    more_tweets = api.user_timeline(screen_name=username, count=200, include_rts=False, exclude_replies=True, max_id=last_id - 1)
    # There are no more tweets
    if len(more_tweets) == 0:
        break
    last_id = more_tweets[-1].id
    tweets = tweets + more_tweets
This code stores all the tweets by a specific user in the variable tweets. Now, we are ready to filter those with images.
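The same max_id loop can be tried without touching the network. The following is only a minimal sketch: fetch_all, fake_fetch_page and FAKE_TIMELINE are hypothetical names invented for this illustration, and the fake function stands in for a small wrapper around api.user_timeline.

```python
from types import SimpleNamespace


def fetch_all(fetch_page, max_tweets=3200):
    """Collect tweets page by page, walking backwards with max_id.

    fetch_page is a hypothetical callable: fetch_page(max_id) returns a list
    of objects with an integer .id, newest first (max_id=None means "latest").
    """
    tweets = []
    max_id = None
    while len(tweets) < max_tweets:
        page = fetch_page(max_id)
        if not page:
            break
        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen
        max_id = page[-1].id - 1
    return tweets[:max_tweets]


# Offline stand-in for the real timeline: 450 fake tweets with ids 450..1
FAKE_TIMELINE = [SimpleNamespace(id=i) for i in range(450, 0, -1)]


def fake_fetch_page(max_id, count=200):
    # Mimics the API: at most `count` tweets whose id is <= max_id
    older = [t for t in FAKE_TIMELINE if max_id is None or t.id <= max_id]
    return older[:count]


all_tweets = fetch_all(fake_fetch_page)
print(len(all_tweets), all_tweets[0].id, all_tweets[-1].id)  # 450 450 1
```

Passing the page-fetching function as a parameter keeps the loop independent of tweepy, so the stopping condition and the inclusive max_id handling can be checked before spending any API calls.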
2. Obtaining the full path for the images
We have all the tweets (actually the maximum the API supports is 3,200) by a given user and we want to keep those which contain a media file. In order to do this, we need to understand what the user_timeline call returns and the way the API deals with entities. We should explore the field media to see any multimedia content within a tweet. After this, we can access the URL of each media attachment with media_url. This is probably easier to understand in code:
media_files = set()
for status in tweets:
    media = status.entities.get('media', [])
    if len(media) > 0:
        media_files.add(media[0]['media_url'])
This implementation assumes that either each tweet has only one media attachment or we only care about the first one. Also, we do not check its type, so we can end up with the URL of any multimedia content, such as images or videos. All these assumptions are acceptable for my purposes and for this blog post. At this stage, we have the URLs of all the multimedia content stored in the variable media_files.
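If those two assumptions become a problem, one possible refinement is sketched below. It is not part of the original tool: photo_urls is a hypothetical helper, and it assumes that a tweet with several attachments exposes them in an extended_entities field and that each media entry carries a type such as 'photo'. The tweets here are hand-made stand-ins so the example runs offline.

```python
from types import SimpleNamespace


def photo_urls(status):
    """Return the URLs of every photo attached to one tweet."""
    # Assumption: when present, extended_entities lists all attachments,
    # while entities may only list the first one.
    extended = getattr(status, 'extended_entities', None) or {}
    media = extended.get('media') or status.entities.get('media', [])
    return [m['media_url'] for m in media if m.get('type') == 'photo']


# Hand-made stand-ins for tweepy Status objects
tweets = [
    SimpleNamespace(entities={}),
    SimpleNamespace(entities={'media': [
        {'type': 'photo', 'media_url': 'http://example.com/a.jpg'},
    ]}),
    SimpleNamespace(
        entities={'media': [
            {'type': 'photo', 'media_url': 'http://example.com/b.jpg'},
        ]},
        extended_entities={'media': [
            {'type': 'photo', 'media_url': 'http://example.com/b.jpg'},
            {'type': 'photo', 'media_url': 'http://example.com/c.jpg'},
            {'type': 'video', 'media_url': 'http://example.com/d.jpg'},
        ]},
    ),
]

media_files = set()
for status in tweets:
    media_files.update(photo_urls(status))

print(sorted(media_files))
```

Using a set still removes repeated URLs, exactly as in the simpler version.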
3. Download the images
Downloading files can be easily achieved in Python using the wget library:
import wget
...
for media_file in media_files:
    wget.download(media_file)
This will download all the images (or any other multimedia content) into the current folder. More advanced solutions could create a new folder and store the files there, as well as filter them by their specific type (image, video, audio,…).
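As a rough sketch of the "new folder" idea, the function below creates the target folder and skips any file that is already there. The names download_all and fake_download are hypothetical, and the default downloader is urlretrieve from the standard library rather than wget, purely so the example has no third-party dependency.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_all(urls, folder, download=urlretrieve):
    """Download each URL into folder, skipping files that already exist."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in sorted(urls):
        # Use the last part of the URL path as the local file name
        file_name = os.path.basename(urlparse(url).path)
        target = os.path.join(folder, file_name)
        if os.path.exists(target):
            continue
        download(url, target)
        saved.append(target)
    return saved


def fake_download(url, target):
    # Stand-in for a real download so the example runs offline
    with open(target, 'w') as handle:
        handle.write(url)


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'pictures')
    urls = {'http://example.com/a.jpg', 'http://example.com/b.png'}
    first_run = download_all(urls, folder, download=fake_download)
    second_run = download_all(urls, folder, download=fake_download)
    print(len(first_run), len(second_run))  # 2 0
```

To keep using wget, the download argument could be something like lambda url, target: wget.download(url, out=target).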
Summary
I think this blog post shows a very simple, yet quite powerful, way to download pictures from a Twitter account. In addition to allowing me to get some pictures for my future talks, it shows how to use some of the functionality of the Twitter API through the tweepy library.
I have suggested multiple improvements throughout the post that I will probably implement at some point in the future. Nonetheless, I invite anyone who wants to extend this little tool to create a pull request on GitHub, where all the code is available.
Does it work if the Twitter user is private?
Hi alfred, thanks for commenting. What do you mean exactly when you say “private” users?
They mean it’s a profile that’s locked to only be accessible to the followers approved by the account owner. They have a little lock near their name. So, does it?
I think the API doesn’t allow you to get access to the private accounts, but I will test this as soon as I can. Thanks!
def parse_arguments():
    parser = argparse.ArgumentParser(description='Download pictures from a Twitter feed.')
    parser.add_argument('username', type=str, help='The twitter screen name from the account we want to retrieve all the pictures')
    parser.add_argument('--num', type=int, default=100, help='Maximum number of tweets to be returned.')
    parser.add_argument('--retweets', default=False, action='store_true', help='Include retweets')
    parser.add_argument('--replies', default=False, action='store_true', help='Include replies')
    parser.add_argument('--output', default='../pictures/', type=str, help='folder where the pictures will be stored')
this code is giving error
usage: run.py [-h] [--num NUM] [--retweets] [--replies] [--output OUTPUT]
              username
run.py: error: too few arguments
how to solve this problem?
Can you please share a bit more about how you are calling your script in order to run?
Hi Miguel,
I ran this as:
python3 run.py @Dan1ell
or
python3 run.py Dan1ell
But it gives:
Traceback (most recent call last):
  File "run.py", line 77, in <module>
    main()
  File "run.py", line 74, in main
    download_images(api, username, retweets, replies, num_tweets, output_folder)
  File "run.py", line 46, in download_images
    tweets = api.user_timeline(screen_name=username, count=200, include_rts=retweets, exclude_replies=replies)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tweepy/binder.py", line 245, in _call
    return method.execute()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tweepy/binder.py", line 229, in execute
    raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]
What’s wrong with it?
Hi Danielle,
The “Bad Authentication data” error message suggests a problem with the auth. Have you registered your app with Twitter? This might be the reason for it not working. If this is the case, you can follow the steps in this blogpost:
https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/
Regards,
I ran this, i still get this error even though i’ve registered my app
python run.py purefreeman,40,False,False,testFolder
Traceback (most recent call last):
  File "run.py", line 77, in <module>
    main()
  File "run.py", line 74, in main
    download_images(api, username, retweets, replies, num_tweets, output_folder)
  File "run.py", line 46, in download_images
    tweets = api.user_timeline(screen_name=username, count=200, include_rts=retweets, exclude_replies=replies)
  File "C:\Users\freeman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\tweepy\binder.py", line 245, in _call
    return method.execute()
  File "C:\Users\freeman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\tweepy\binder.py", line 229, in execute
    raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]
First of all, sorry for the delay. I didn’t realise that the comment had been in the “approval pending” state for a long time. This could be because your key and secret are surrounded by quotes.
PS: There are some changes in the code in case you want to pull the latest version.
Hello, I’d like to know how to avoid getting duplicated images when downloading. By the way, thanks for your help, you rock!
Hi,
First of all, sorry for the delay and thanks for the feedback. I didn’t realise that the comment had been in the “approval pending” state for a long time. There are 2 different ways I can interpret your question:
1. How to not get duplicates from an account that has published the same picture multiple times: I haven’t thought about this case at all, but the system could check whether the original media URL has been seen before downloading the picture.
2. How to not get duplicates from an account after you have run the script before: I have seen this case. When I ran the code for the second time with my own account, many of the images were duplicates. This could be fixed by not downloading any pictures that are already in the same folder. I might actually add this as a new feature, as I believe it is quite useful.
Regards,
I decided to solve the problem straightaway. The script now doesn’t download any pictures that are already in the folder. This applies to the second case I described. I hope this is what you wanted!
If you share the same picture in different tweets, I am not 100% sure whether Twitter will treat them as different media files (with different names), in which case the solution will not work.
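For that remaining case, one option would be to compare file contents rather than names. The sketch below is not part of the script: remove_duplicate_files is a hypothetical helper, and it only catches files that are byte-for-byte identical, so it would miss a picture that was re-encoded between uploads.

```python
import hashlib
import os
import tempfile


def remove_duplicate_files(folder):
    """Delete files whose bytes match a file already seen; return removed names."""
    seen = set()
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        if digest in seen:
            # Same content as an earlier file, so drop this copy
            os.remove(path)
            removed.append(name)
        else:
            seen.add(digest)
    return removed


with tempfile.TemporaryDirectory() as folder:
    # Three small fake "pictures"; a.jpg and b.jpg have identical content
    for name, content in [('a.jpg', b'cat'), ('b.jpg', b'cat'), ('c.jpg', b'dog')]:
        with open(os.path.join(folder, name), 'wb') as handle:
            handle.write(content)
    removed = remove_duplicate_files(folder)
    remaining = sorted(os.listdir(folder))
    print(removed, remaining)  # ['b.jpg'] ['a.jpg', 'c.jpg']
```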
Hi, I am not able to get rid of RecursionError: maximum recursion depth exceeded, it’s occurring at step1 i.e Getting tweets from a user
I will need way more information to be able to help here. What is the code you are trying to run exactly and how are you running it?
Adding to my previous point, I tried copying code from your git and got following error, please guide me through
(--username USERNAME | --hashtag HASHTAG)
[--num NUM] [--retweets] [--replies]
[--output OUTPUT]
ipykernel_launcher.py: error: one of the arguments --username --hashtag is required
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2
You are running the code without specifying either a hashtag or a specific user to download the pictures from. One of these must be specified when you run the code in the console. As I said before, it would be much easier if you could share exactly how you are running the code, and what code you are using.
Thanks for commenting.
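To illustrate what the error message means, here is a small reconstruction based only on the usage text quoted above; build_parser is a hypothetical name and this is not the code from the repository. It also shows one way to supply the arguments when there is no console, for example inside a notebook.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Download pictures from Twitter.')
    # Exactly one of --username / --hashtag must be given
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--username', type=str, help='Screen name of the account')
    group.add_argument('--hashtag', type=str, help='Hashtag to search for')
    parser.add_argument('--num', type=int, default=100, help='Maximum number of tweets')
    return parser


# From a console the arguments come from the command line, e.g.
#   python run.py --username NASA --num 10
# In a notebook there is no command line, so pass the list explicitly.
args = build_parser().parse_args(['--username', 'NASA', '--num', '10'])
print(args.username, args.hashtag, args.num)  # NASA None 10
```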
Hi Miguel
I’m receiving the following error:
Traceback (most recent call last):
  File "run.py", line 105, in <module>
    main()
  File "run.py", line 102, in main
    download_images_by_user(api, username, retweets, replies, num_tweets, output_folder)
  File "run.py", line 79, in download_images_by_user
    download_images(status, num_tweets, output_folder)
  File "run.py", line 74, in download_images
    wget.download(media_url +":orig", out=output_folder+'/'+file_name)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\site-packages\wget.py", line 526, in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\XXX\XXX\XXX\XXX\XXX\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
I’m using the following:
wget 3.2
python 3.8.0
tweepy 3.8.0
Thanks for your support.
Hi,
Can you please let me know how you are running the code for this error to appear?
Regards,
Hi Miguel,
I’m getting the same error as the user above.
PS M:\OneDrive\Documents\Python Scripts\downloadTwitterPictures-master> python run.py --username NASA --num 10
Traceback (most recent call last):
  File "run.py", line 105, in <module>
    main()
  File "run.py", line 102, in main
    download_images_by_user(api, username, retweets, replies, num_tweets, output_folder)
  File "run.py", line 79, in download_images_by_user
    download_images(status, num_tweets, output_folder)
  File "run.py", line 74, in download_images
    wget.download(media_url +":orig", out=output_folder+'/'+file_name)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\site-packages\wget.py", line 526, in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\Aicain\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
As an artist I’m excited about the potential of this script, but I’m not really that familiar with coding, so when you asked user above, “how are you running the code,” what exactly did you mean?
Thanks again!
I have recently made a change to the original code (on GitHub) that I believe fixes this problem. Please check the latest version of the code on GitHub and let me know if you still have problems.
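Independently of that change, a single missing file does not have to stop the whole run. The following is only a defensive sketch and not necessarily what the repository does: download_skipping_errors and fake_download are hypothetical names, and the fake downloader imitates a server that answers 404 for one URL.

```python
from urllib.error import HTTPError


def download_skipping_errors(urls, download):
    """Try every URL; collect the ones that fail instead of stopping."""
    failed = []
    for url in urls:
        try:
            download(url)
        except HTTPError as error:
            # Remember the failure and carry on with the next picture
            failed.append((url, error.code))
    return failed


def fake_download(url):
    # Stand-in that behaves like a server returning 404 for one file
    if url.endswith('missing.jpg'):
        raise HTTPError(url, 404, 'Not Found', None, None)


failed = download_skipping_errors(
    ['http://example.com/ok.jpg', 'http://example.com/missing.jpg'],
    fake_download,
)
print(failed)  # [('http://example.com/missing.jpg', 404)]
```

The list of failed URLs can then be printed at the end of the run, which makes it easier to report exactly which picture caused the 404.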