Using Reddit for research
Ethical and technical challenges
Reddit, once seen as a user-driven community space, has recently evolved into a more monetised platform; it has, for instance, entered into an "expanded partnership" with Google. This poses challenges for users and researchers alike. One content-related problem when working with Reddit data is that ads are now integrated into user feeds, which can make it difficult to draw the line between community engagement and profit generation.

A more technical problem is that the Reddit API has created new barriers for developers and academics, as outlined in the API and Developer Terms. When using the API, you must authenticate with a registered OAuth token. After collecting data, be aware that Reddit requires you to delete content from your own storage if that content (e.g. a post, comment, or entire user profile) is deleted on their site, which means that keeping collected Reddit data for later reuse or even data sharing is not an option. It's recommended to delete stored Reddit data every 48 hours to stay compliant. Moreover, the Reddit API limits the number of requests you can send per minute; as of November 2024, the rate is 100 requests per minute per OAuth client ID.
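The constraints above (OAuth authentication, the 48-hour retention recommendation, and the 100-requests-per-minute limit) can be sketched as small helpers. This is a minimal illustration, not an official client: the function and parameter names are our own, the token request uses the client-credentials grant (script-type apps may need the password grant instead), and `fetch_app_token` is only defined here, not called.

```python
import base64
import datetime as dt
import json
import urllib.parse
import urllib.request


def fetch_app_token(client_id: str, client_secret: str, user_agent: str) -> str:
    """Request an application-only OAuth token from Reddit (defined, not called here).

    Credentials come from a registered app; Reddit requires a descriptive
    User-Agent header on all requests.
    """
    auth = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    req = urllib.request.Request(
        "https://www.reddit.com/api/v1/access_token",
        data=urllib.parse.urlencode({"grant_type": "client_credentials"}).encode(),
        headers={"Authorization": f"Basic {auth}", "User-Agent": user_agent},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


def request_interval(rate_per_min: int = 100) -> float:
    """Minimum seconds to wait between requests to stay under the rate limit
    (100 requests per minute per OAuth client ID as of November 2024)."""
    return 60.0 / rate_per_min


def is_past_retention(fetched_at: dt.datetime, now: dt.datetime, hours: int = 48) -> bool:
    """True if a stored record is older than the recommended 48-hour window
    and should be deleted from local storage."""
    return (now - fetched_at) > dt.timedelta(hours=hours)
```

With the current limit, `request_interval()` suggests pausing 0.6 seconds between requests; a periodic job could use `is_past_retention` to find records due for deletion.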
Pushshift Reddit documentation
One way to access slightly older Reddit data is the PullPush API, described in the PullPush Reddit API Documentation.
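As a sketch of how such a query could be assembled: the snippet below builds a PullPush submission-search URL with the standard library. The endpoint path and parameter names follow the PullPush documentation at the time of writing and should be checked against the current docs; the helper only constructs the URL and does not send the request.

```python
import urllib.parse

# Submission-search endpoint per the PullPush docs at the time of writing.
PULLPUSH_SUBMISSIONS = "https://api.pullpush.io/reddit/search/submission/"


def build_search_url(query: str, subreddit: str = "", size: int = 100) -> str:
    """Assemble a PullPush submission-search URL (no request is sent here)."""
    params = {"q": query, "size": size}
    if subreddit:
        params["subreddit"] = subreddit
    return PULLPUSH_SUBMISSIONS + "?" + urllib.parse.urlencode(params)
```

The resulting URL can then be fetched with any HTTP client and parsed as JSON.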
🛠️ More detailed information to follow! 🛠️
R and Python packages
For users comfortable with R or Python, the following packages allow data collection from Reddit in compliance with its API policies:
- PRAW (Python Reddit API Wrapper) gathers Reddit posts, comments, and metadata. However, it can only retrieve up to about 1,000 posts or comments in total (according to user reports).
- RedditExtractoR performs Reddit data collection in R.
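A minimal PRAW sketch, assuming you have registered an app and filled in the placeholder credentials below. The collection function is only defined here, not called, and the roughly 1,000-item listing cap reported by users is handled by clamping the requested limit.

```python
# Approximate per-listing maximum reported by PRAW users.
LISTING_CAP = 1000


def clamp_limit(requested: int, cap: int = LISTING_CAP) -> int:
    """Clamp a requested item count to the reported listing cap."""
    return min(requested, cap)


def collect_titles(subreddit_name: str, limit: int) -> list:
    """Collect recent post titles with PRAW (defined, not called here).

    The credentials below are placeholders for a registered Reddit app.
    """
    import praw  # imported lazily so clamp_limit works without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="research-script by u/YOUR_USERNAME",
    )
    subreddit = reddit.subreddit(subreddit_name)
    return [post.title for post in subreddit.new(limit=clamp_limit(limit))]
```

Requesting more than the cap simply returns the capped amount, so clamping up front makes the limitation explicit in your own code.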
🛠️ Test the script we used in 2023 and see if it is still functional!