Twitter-Sentiment-Analysis-about-ChatGPT

A quantitative study of more than 1.25 million tweets about ChatGPT, employing data scraping, data cleaning, exploratory data analysis (EDA), topic modeling, and sentiment analysis.



BACKGROUND

ChatGPT is an artificial intelligence chatbot developed by OpenAI and launched in November 2022. It is built on top of OpenAI’s GPT-3 family of large language models and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques. Given its advantages over traditional chatbots, ChatGPT attracted more than 1 million users within 5 days of launch and 100 million users within 2 months, outpacing the early adoption of other popular online platforms such as Netflix, Facebook, and Instagram. Some early adopters believe that it will eventually make several content-creation professions obsolete. ChatGPT has been shown to produce high-quality responses to a variety of tasks, including solving coding challenges and answering exam questions accurately.


OBJECTIVE

Using a mixed-method approach, we analyze tweets posted from December 2022 to January 2023 that mention ChatGPT and express diverse, unstructured opinions. We identify the main topics and sentiments of these conversations and examine how early users perceive ChatGPT. We expect this to help us understand and assess ChatGPT’s capability, its effectiveness, and the challenges it faces.

Research Questions


METHODOLOGY

TOOLS

Task | Technique Description | Tools/Packages Used
--- | --- | ---
Data Collection | Scraping tweets from Twitter | snscrape
Data Preprocessing | Duplicate removal, lowercasing, noise removal (punctuation, stopwords, URLs, @users), lemmatization | re, NLTK, pandas, numpy
Feature Engineering | Retrieving geographical info from a user's profile location; retrieving datetime info from tweet timestamps | geopy, datetime
Topic Modeling | Identifying topics using Latent Dirichlet Allocation (LDA) modeling | pyLDAvis, gensim
Sentiment Analysis | Quantitative sentiment analysis of each topic via a rule-based model and a deep-learning-based model | VADER, RoBERTa, scipy, torch
Data Visualization | Multi-attribute plots | matplotlib, seaborn, wordcloud, Power BI
Environments & Platforms | | Google Colab, Databricks, PySpark, Jupyter Notebook, Twitter


DATA-COLLECTION

Method | Notes
--- | ---
Tweepy | Limited to 3,200 tweets; no historical data
GetOldTweets3 | Twitter has removed the endpoint that GetOldTweets3 uses
Twint | Twitter imposes a stricter device and IP ban after a certain number of queries
snscrape | Scraped 1.25M tweets, 832,924 of them in English

Data Collection: Identifying ChatGPT Content

  • Package used: snscrape
  • Language: English
  • Keywords: ChatGPT
  • Timeframe: December 1, 2022 to January 31, 2023
  • Features: User ID, User Name, User Verification, User Location, User Followers, Tweet Text, Posted Timestamp, and Posted Language
  • Number of tweets collected = 1,255,518
  • December - 474,572 tweets | January - 780,946 tweets
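
The collection settings above can be sketched roughly as follows. This is a minimal illustration, not the study's actual script: the helper names (`build_query`, `scrape`), the output field names, and the `limit` parameter are hypothetical, and the sketch assumes that the `until:` search operator excludes the given date, which is why February 1 is used to include January 31. snscrape's Twitter scraper depends on the site's current behavior and may no longer work as it did during the collection period.

```python
from datetime import date
from typing import Iterator, Optional


def build_query(keyword: str, since: date, until_exclusive: date,
                lang: Optional[str] = None) -> str:
    """Compose a Twitter search query string (hypothetical helper)."""
    parts = [keyword]
    if lang:
        parts.append(f"lang:{lang}")
    parts.append(f"since:{since.isoformat()}")
    # Assumption: `until:` is exclusive, so pass the day after the last day wanted.
    parts.append(f"until:{until_exclusive.isoformat()}")
    return " ".join(parts)


def scrape(query: str, limit: int) -> Iterator[dict]:
    """Yield one dict per tweet with the features listed above."""
    # Imported lazily so build_query works without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper(query)
    for i, tweet in enumerate(scraper.get_items()):
        if i >= limit:
            break
        # The text attribute is `rawContent` in newer snscrape releases
        # and `content` in older ones.
        text = getattr(tweet, "rawContent", None)
        if text is None:
            text = tweet.content
        yield {
            "user_id": tweet.user.id,
            "user_name": tweet.user.username,
            "user_verified": tweet.user.verified,
            "user_location": tweet.user.location,
            "user_followers": tweet.user.followersCount,
            "text": text,
            "posted_at": tweet.date,
            "lang": tweet.lang,
        }


# Keyword and timeframe from the list above; no language filter at collection time.
QUERY = build_query("ChatGPT", date(2022, 12, 1), date(2023, 2, 1))
```

Leaving `lang` unset collects tweets in every language, which matches the reported totals (1,255,518 tweets overall, 832,924 of them in English); the English subset can then be selected on the language feature.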

DATA-PREPROCESSING

Data Cleaning

Feature Engineering
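
As an illustration of the two feature groups named in the tools table (datetime info from tweet timestamps, geographical info from the profile location), here is a small sketch. The function names, the returned field names, and the `user_agent` string are hypothetical. The geopy lookup uses the Nominatim geocoder, which is an assumption about the study's setup, and it needs network access.

```python
from datetime import datetime
from typing import Optional


def datetime_features(timestamp: str) -> dict:
    """Split an ISO 8601 timestamp into calendar features."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "date": ts.date().isoformat(),
        "month": ts.strftime("%B"),
        "weekday": ts.strftime("%A"),
        "hour": ts.hour,
    }


def country_from_location(location: str) -> Optional[str]:
    """Look up the country for a free-text profile location (needs network)."""
    # Imported lazily: geopy is only needed for the geographical feature.
    from geopy.geocoders import Nominatim

    geocoder = Nominatim(user_agent="chatgpt-tweets-demo")  # hypothetical app name
    match = geocoder.geocode(location, addressdetails=True, language="en")
    if match is None:
        return None  # empty, joke, or otherwise unresolvable location
    return match.raw.get("address", {}).get("country")


features = datetime_features("2022-12-05T14:30:00+00:00")
```

Profile locations are free text, so many of them will not resolve to a country; returning `None` keeps those rows distinguishable from successful lookups.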

English Tweet Text Preprocessing
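
The preprocessing steps listed in the tools table (duplicate removal, lowercasing, removal of URLs, @users, punctuation and stopwords, then lemmatization) could look like the sketch below. It is an approximation under stated assumptions: the regular expressions, the tiny `DEMO_STOPWORDS` set, and the function names are illustrative, and the lemmatizer is passed in as a callable so the example runs without downloading NLTK data.

```python
import re
import string
from typing import Callable, FrozenSet, Iterable, List, Optional

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Tiny illustrative list; NLTK's English stopword list is much larger.
DEMO_STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "for", "my", "me"}
)


def drop_duplicates(texts: Iterable[str]) -> List[str]:
    """Remove exact duplicate tweets while keeping first-seen order."""
    return list(dict.fromkeys(texts))


def clean_tweet(text: str,
                stopwords: FrozenSet[str] = DEMO_STOPWORDS,
                lemmatize: Optional[Callable[[str], str]] = None) -> List[str]:
    """Return the cleaned tokens of one tweet."""
    text = text.lower()
    # URLs and mentions go first, while their punctuation is still intact.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)  # strips ASCII punctuation, including '#'
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return tokens


tokens = clean_tweet("@OpenAI ChatGPT is writing my essays for me! https://t.co/abc123 #AI")
```

With NLTK installed and its `stopwords` and `wordnet` corpora downloaded, `set(stopwords.words("english"))` and `WordNetLemmatizer().lemmatize` can be passed in place of the demo stopword set and the `lemmatize` callable.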

DATA-MODELING

Unsupervised LDA

The unsupervised Latent Dirichlet Allocation (LDA) modelling technique was applied to extract a set of key ChatGPT topics from the collected tweets.
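
A minimal sketch of fitting LDA with gensim on already-tokenized tweets is shown below. The function names, the toy `DOCS`, and the fixed `seed` are illustrative. Mapping the study's iteration count to gensim's `iterations` parameter (rather than `passes`) is an assumption.

```python
from typing import List, Sequence

# Toy tokenized documents, for illustration only.
DOCS = [
    ["chatgpt", "write", "code", "python"],
    ["chatgpt", "essay", "school", "exam"],
    ["code", "bug", "python", "debug"],
    ["student", "exam", "essay", "teacher"],
]


def fit_lda(token_lists: Sequence[Sequence[str]], num_topics: int,
            iterations: int, seed: int = 0):
    """Fit an LDA model and return (model, dictionary, corpus)."""
    # Imported lazily so the module loads without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)  # token -> integer id
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]  # bag of words
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                     iterations=iterations, random_state=seed)
    return model, dictionary, corpus


def top_keywords(model, topic_id: int, topn: int = 20) -> List[str]:
    """Return the highest-weighted words of one topic."""
    return [word for word, _weight in model.show_topic(topic_id, topn=topn)]
```

For example, `model, dictionary, corpus = fit_lda(DOCS, num_topics=2, iterations=60)` followed by `top_keywords(model, 0, topn=3)` lists three keywords for the first topic. On a toy corpus this small the topics are not meaningful.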

Sentiment Analysis

Sentiment analysis is an approach to identifying the emotional tone behind textual data. Various algorithms (models) are available for sentiment analysis tasks, and each has its pros and cons.

In this study, we used both VADER (a rule-based model) from the NLTK library and Twitter-RoBERTa (a deep-learning-based model) from the transformers package to examine early users’ attitudes towards ChatGPT.
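
Both models ultimately produce scores that are mapped to a sentiment label, which the following sketch illustrates. Several details are assumptions, not taken from the study: the ±0.05 cutoff on VADER's compound score is the commonly used convention, and the negative/neutral/positive label order follows the `cardiffnlp/twitter-roberta-base-sentiment` checkpoint, which may not be the exact Twitter-RoBERTa variant used here. The function names are hypothetical.

```python
import math
from typing import List, Sequence

# Assumed label order (index 0, 1, 2) of the Twitter-RoBERTa sentiment head.
ROBERTA_LABELS = ("negative", "neutral", "positive")


def vader_compound(text: str) -> float:
    """Return VADER's compound score in [-1, 1] for one text."""
    # Imported lazily; requires nltk.download("vader_lexicon") beforehand.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


def vader_label(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a label using a symmetric cutoff."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw model logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def roberta_label(logits: Sequence[float]) -> str:
    """Pick the most probable label from the model's three output logits."""
    probs = softmax(logits)
    return ROBERTA_LABELS[probs.index(max(probs))]
```

In practice the logits would come from the transformers model's output for a tokenized tweet; they are taken as plain numbers here so that the label mapping can be read on its own.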


RESULTS

EDA

Tweets about ChatGPT over time

Users’ Features


TOPIC-MODELING

Optimal Number of Topics and Iterations

After evaluating coherence score, comprehensibility of top keywords, and computational cost, the study determined the optimal number of topics (10) and iterations (60) for LDA analysis.
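
The coherence part of this selection can be sketched as a sweep over candidate topic counts. This covers only one of the three criteria; keyword comprehensibility and computational cost were judged separately. The function names are hypothetical, and the `c_v` coherence measure is an assumption, since the measure used is not stated here.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple


def sweep_num_topics(candidates: Iterable[int],
                     score_fn: Callable[[int], float]) -> Tuple[int, Dict[int, float]]:
    """Score every candidate topic count and return (best_k, all_scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)  # highest coherence wins
    return best_k, scores


def gensim_coherence(token_lists: Sequence[Sequence[str]], num_topics: int,
                     iterations: int = 60, seed: int = 0) -> float:
    """Fit LDA with `num_topics` topics and return its coherence score."""
    # Imported lazily so the sweep helper works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                     iterations=iterations, random_state=seed)
    coherence = CoherenceModel(model=model, texts=token_lists,
                               dictionary=dictionary, coherence="c_v")
    return coherence.get_coherence()
```

A call such as `sweep_num_topics(range(5, 21, 5), lambda k: gensim_coherence(tokenized_tweets, k))` would score 5, 10, 15, and 20 topics, where `tokenized_tweets` stands for the preprocessed corpus. Keeping the scoring function separate from the sweep makes it easy to swap in another measure.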

Topics

Through a meticulous analysis of the top 20 keywords and hundreds of highly correlated tweets for each topic within the LDA topic modeling results, the study determined a descriptive and meaningful name for each topic. The top 3 most discussed topics are:


SENTIMENT-ANALYSIS

Overall, VADER is the faster option, but it may not capture the nuances of natural language as well as RoBERTa. The choice between VADER and RoBERTa depends on the specific task requirements and the available computational resources.


CONCLUSION

This study examined discussions about ChatGPT on Twitter, applying machine learning and text analytics to a dataset of 1.25 million tweets. Exploratory data analysis was first conducted on the collected dataset to understand the characteristics of early ChatGPT users. Topic modeling was then performed to identify the main topics, followed by quantitative sentiment analysis of each topic. The study provides valuable insights into the sentiments of early ChatGPT users and emphasizes the importance of continued research and conversation to develop best practices for the responsible use of large language models.


REFERENCES