Scrape Reddit Posts Using PMAW & Python
Python
In this snippet the PMAW library is used to scrape up to 60,000 posts from the technology subreddit submitted between 1st October 2019 and 1st October 2021. The results are loaded into a pandas DataFrame, which is then written to a CSV file.
import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt
import os.path as path

api = PushshiftAPI()

# Date range as Unix timestamps (naive datetimes are read as local time)
before = int(dt.datetime(2021,10,1,0,0).timestamp())
after = int(dt.datetime(2019,10,1,0,0).timestamp())

subreddit="technology"
limit=60000  # maximum number of posts to retrieve
posts = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
posts_df = pd.DataFrame(posts)
# Resolve the output path relative to the directory above this script's folder
filepath = path.abspath(path.join(__file__ ,'../..','data/raw/technology_posts.csv'))
posts_df.to_csv(filepath,index=False)
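The before and after values in the snippet are built from naive datetimes, which Python interprets in the local time zone of the machine running the script, so the exact window can shift by a few hours between machines. If a fixed UTC window is wanted, a minimal sketch could look like the following. The helper name utc_epoch is hypothetical and is not part of PMAW or the original snippet.

```python
import datetime as dt


def utc_epoch(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp for midnight UTC on the given date.

    Hypothetical helper for illustration; not part of PMAW.
    """
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


# Same date range as the snippet, pinned to UTC instead of local time
after = utc_epoch(2019, 10, 1)
before = utc_epoch(2021, 10, 1)
```

These values can be passed to search_submissions in place of the original before and after variables.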