Prediction of Personality Traits on Tweets Text with Feature Analysis

Yucen Sun
7 min read · Apr 29, 2020

Introduction

Nowadays we express ourselves a lot on social media like Twitter and Facebook, and we also get to know others through their posts on social media. Now that we can generally get a sense of someone’s personality from their tweets, isn’t it interesting to mine the patterns of how the tweets of users relate to their personality traits? This is the motivation of the natural language processing project behind this post.

To understand the problem more deeply, we want to go beyond the words in the tweets themselves and incorporate additional linguistic features and meta-features. For example, how does the usage of personal pronouns help identify whether a person is sensing or intuitive? How does the average length of tweets indicate whether a person leans toward thinking or feeling? In this project, we approached the question with a comprehensive feature analysis and incorporated the features into a logistic regression classification model to predict Myers-Briggs Type Indicator (MBTI) personality traits.

Data

In this project, we use the Myers-Briggs Type Indicator (MBTI) as our personality traits standard. It indicates psychological preferences in how people perceive the world and make decisions. MBTI has four categories — Introversion/Extraversion, Sensing/Intuition, Thinking/Feeling, and Judging/Perception. Each person has one preferred quality from each category, producing 16 unique types such as “ENTJ” and “ISTP”.

The table shows the personality type distribution in the U.S. general population.

Estimated frequencies of the types in the United States population (data from the Myers-Briggs Company and the Stanford Research Institute (SRI))

I used the data set described and provided in Plank and Hovy’s work (https://www.aclweb.org/anthology/W15-2913/). The data set contains a total of 1.2 million tweets from 1,500 Twitter users, each labeled with an MBTI personality type. Every user has between 100 and 2,000 recent tweets. The data is described in the following tables.

In the project, we view the prediction as four independent binary classification problems. Therefore, we also compute the data distribution over each of the four dimensions.

Methods

  • Data preprocessing

I applied the following preprocessing steps. First, I converted all text to lowercase and removed stopwords using an English stopword list. Twitter-specific preprocessing is also important: all @username mentions are replaced with the unified token ”@USER”, hashtags with ”@HASHTAG”, URLs with ”@URL”, and retweet marks with ”rt”. The tweets are then tokenized by splitting on whitespace, using both the string split function and SpaceTokenizer from the nltk.tokenize module. I also tried TweetTokenizer, which is tailored to tweet text: it splits punctuation such as parentheses and recognizes simple emoticons such as ”<3” and ”:-p”. Additionally, since the data is unbalanced, it needs to be balanced by re-sampling; instead of resampling by hand, I used the built-in class balancing of LogisticRegression in the sklearn module, which is simple and works well.
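
Below is a minimal sketch of this preprocessing, assuming a hypothetical preprocess_tweet helper; the regular expressions are illustrative rather than the project’s exact code.

import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Requires the NLTK stopword list: nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))
tokenizer = TweetTokenizer()

def preprocess_tweet(text):
    text = text.lower()                            # lowercase everything ("RT" becomes "rt")
    text = re.sub(r"https?://\S+", "@URL", text)   # unify links
    text = re.sub(r"@\w+", "@USER", text)          # unify mentions
    text = re.sub(r"#\w+", "@HASHTAG", text)       # unify hashtags
    tokens = tokenizer.tokenize(text)              # keeps emoticons like "<3" and ":-p" intact
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess_tweet("RT @friend: loving this <3 https://t.co/xyz #cool"))
# e.g. ['rt', '@USER', ':', 'loving', '<3', '@URL', '@HASHTAG']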

  • Baseline Model

There is always a baseline method that “solves” the classification problem in a naive way. Mine is to randomly assign one of the two classes according to their probabilities in the training set.

We use the F1 score, which considers both recall and precision, as the evaluation metric for model performance. The baseline model achieves an average F1 score of 0.4663.
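
One easy way to reproduce this kind of baseline is sklearn’s DummyClassifier with the ”stratified” strategy, which draws predictions according to the training-set class frequencies; the toy labels below are made up just to show the shape of the experiment.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy labels standing in for one binary MBTI dimension (e.g. I vs. E).
rng = np.random.default_rng(0)
y_train = rng.choice(["I", "E"], size=1000, p=[0.75, 0.25])
y_test = rng.choice(["I", "E"], size=300, p=[0.75, 0.25])

# "stratified" samples each prediction at random using the training-set class probabilities.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(np.zeros((len(y_train), 1)), y_train)   # features are ignored by the dummy model
y_pred = baseline.predict(np.zeros((len(y_test), 1)))

print(f1_score(y_test, y_pred, average="macro"))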

  • Feature Construction

We then list all the features we analyzed, along with how each was constructed.

1. Word unigrams and bag of word n-grams (n = 1, 2, 3)
n-gram tokenization with nltk
Examples:
uni-gram: ('hardest',): 1, ('habit',): 1, ('shake',): 1, ('crave',): 1, ('sweet',): 1
bi-gram: ('of', 'the'): 1, ('the', 'slideshow'): 1, ('slideshow', '@USER'): 1, ('much', '!'): 1
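
A minimal sketch of that n-gram tokenization with nltk; the token list below is invented for illustration.

from collections import Counter
from nltk.util import ngrams

# Toy token list standing in for one user's preprocessed tweets.
tokens = "the hardest habit to shake is the one you crave".split()

# Bag of n-grams for n = 1, 2, 3, in the same (tuple -> count) form as the examples above.
bag = Counter()
for n in (1, 2, 3):
    bag.update(ngrams(tokens, n))

print(bag[("hardest",)])         # 1
print(bag[("the", "hardest")])   # 1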

2. Word embeddings, averaged over the word vectors of all of the user’s tweets

import pre-trained word2vec model
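
A sketch of the averaging step, using a small pre-trained GloVe Twitter model from gensim’s downloader in place of the word2vec model mentioned above (any set of pre-trained KeyedVectors works the same way); the user_embedding helper is hypothetical.

import numpy as np
import gensim.downloader as api

# A small, tweet-trained embedding just to keep the sketch lightweight.
wv = api.load("glove-twitter-25")

def user_embedding(tokens, wv):
    """Average the vectors of all in-vocabulary tokens across a user's tweets."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(user_embedding("the hardest habit to shake".split(), wv).shape)   # (25,)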

3. Metadata of the user, including gender, number of followers, listed count (number of lists the user appears in), statuses count (total tweets and retweets), and number of favorites, provided in the data set (Plank and Hovy, 2015b)

4. Twitter-specific features (average number of hashtags, @-mentions, retweets, and URLs per tweet)
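
These counts are straightforward once the placeholder tokens from preprocessing are in place; the helper below is hypothetical but shows the idea.

def tweet_count_features(tweets):
    """Average number of hashtags, mentions, URLs and retweet marks per tweet,
    counted on the preprocessed placeholder tokens."""
    counts = {"@HASHTAG": 0, "@USER": 0, "@URL": 0, "rt": 0}
    for tokens in tweets:
        for key in counts:
            counts[key] += tokens.count(key)
    return {f"avg_{k}": v / len(tweets) for k, v in counts.items()}

tweets = [["rt", "@USER", "@URL"], ["@HASHTAG", "@HASHTAG", "love", "it"]]
print(tweet_count_features(tweets))
# {'avg_@HASHTAG': 1.0, 'avg_@USER': 0.5, 'avg_@URL': 0.5, 'avg_rt': 0.5}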

5. Empath features, from lexical category analysis; the scores reveal sentiment and topics

demo on empath module
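
A minimal demo of the empath module on a made-up sentence; the real features are computed over each user’s full tweet history.

from empath import Empath

lexicon = Empath()

# Normalized category scores: each score is the fraction of tokens that fall
# into that lexical category, which hints at both topics and sentiment.
scores = lexicon.analyze(
    "I love hiking and reading science fiction with friends", normalize=True
)

# Show only the categories that actually fired.
print({cat: s for cat, s in scores.items() if s > 0})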

We first experiment with three basic logistic regression models: one based on word embeddings, one on unigrams, and one on the bag of n-grams (n = 1, 2, 3). After evaluating the performance of the three candidate basic models, as well as each basic model combined with some feature groups, we decided that the bag-of-n-grams logistic regression model is the most stable basic model and the most suitable one to build the feature groups on.
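
A sketch of what the bag-of-n-grams basic model could look like with sklearn; the four-document corpus and labels are made up purely to show the pipeline shape.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One "document" per user (all preprocessed tweets concatenated) and one binary label.
docs = [
    "@USER love the slideshow so much",
    "rt @URL the hardest habit to shake",
    "@HASHTAG crave sweet things",
    "thinking about the weekend plans",
]
labels = ["I", "E", "I", "E"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),                          # bag of 1-, 2- and 3-grams
    LogisticRegression(class_weight="balanced", max_iter=1000),   # built-in class balancing
)

model.fit(docs, labels)
print(model.predict(["rt @USER the sweetest habit to crave"]))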

Then, we add the grouped features, namely the TWEET count features, the EMPATH features, and the METADATA features, as well as the word embeddings, to the basic model one at a time and observe the change in performance. We also ran ablation tests, where we leave out one group of features at a time to see how much the left-out group contributes to the final performance.
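
The ablation loop itself is simple; the sketch below uses random stand-in matrices for the feature groups, where the real project would plug in the n-gram, tweet count, Empath, and metadata matrices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users = 200
y = rng.choice([0, 1], size=n_users)

# Stand-in matrices for the feature groups; in the real project these come from
# the n-gram vectorizer, the tweet count features, the Empath scores and the metadata.
groups = {
    "ngrams":       rng.random((n_users, 50)),
    "tweet_counts": rng.random((n_users, 4)),
    "empath":       rng.random((n_users, 20)),
    "metadata":     rng.random((n_users, 5)),
}

clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Ablation: leave one group out at a time and score the remaining combination.
for left_out in groups:
    X = np.hstack([m for name, m in groups.items() if name != left_out])
    score = cross_val_score(clf, X, y, scoring="f1_macro", cv=5).mean()
    print(f"without {left_out:>12}: f1 = {score:.3f}")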

Finally, after completing the feature analysis and model experiments, we selected the best-performing model and tuned it to achieve the best performance we could.

Results and Discussions

The results are presented in the two tables below.

We can easily see from the first table that adding each of the three feature groups can improve the prediction performance.

  • Basic Model Using Bag of N-grams

The basic logistic regression model, which uses the bag of n-grams (n = 1, 2, 3), performs much better than the baseline model. The average F1 score of the basic model is 0.6325 (the line in italics), compared to the baseline’s 0.4663.

  • Influence of the Feature Groups

The metadata features, including gender, count of tweets, number of followers, etc., improve performance the least (+0.025). The small improvement can be attributed to the relatively small size of the data set: with only 1,500 users, the metadata may not help as much as it could on large-scale data. Though the improvement is small, the metadata features still benefit (or at least do not hurt) each of the four categories.

The tweet features are count features of the tweets: the average number of hashtags, @-mentions, retweets, and URLs per tweet by a user. This feature group improved the average F1 score by 0.0185. Note, however, that it hurt performance on the E/I and J/P classifications.

The Empath features are the outcome scores of the lexical category analysis. They improve performance the most, by 0.0379. One of the main reasons should be that the Empath features reveal the distribution of topics, and also the semantics, in the tweet text. I believe that if future work closely analyzes the Empath categories and identifies the strongest subsets of categories, they have the potential to further improve classification performance.

  • Ablation Tests and the Best Model with Feature Sets:

The ablation test results are presented in Table 6. Since we only have three major feature groups, we actually looped through all combinations of feature groups in the two tables. The first thing I noticed was that the model incorporating all feature groups does not perform best, even after tuning the logistic regression model. The best performance is an average F1 score of 0.6704 (the line in bold) using only n-grams and Empath features. In the ablation test, we observe that leaving out the Empath features decreased the F1 score by 0.016, while leaving out the metadata features or the tweet count features made performance even worse than the basic model. Combined with the information in Table 5, this indicates that there can be dependencies and correlations between features, and that building a model that maximizes the benefit of linguistic features will require closer analysis and sharper feature selection in future work.

What’s Next

Since the results of this study suggest there can be dependencies and correlations between features, future work should include closer analysis and sharper feature selection in order to build a model that maximizes the benefit of linguistic features.
