An AI solution to text data processing & analysis


At Quordata, we have developed an automated system that collects millions of tweets and combines them with other news sources to derive valuable sentiment insights that help you make smarter trading and investment decisions.

Social media has become a key source for gauging public opinion on a wide range of topics. However, with millions of posts made across the internet every day, it is impossible for humans to process this volume of information effectively, let alone derive any meaningful insight from it.

Quordata introduces a novel method for textual data analysis that identifies unique trends and generates insights to support informed decision-making.

Data Collection Methodology

For our closed beta, we chose 10 high-interest S&P 500 companies to demonstrate Quordata’s analytic capabilities. We then used the Twitter API to collect over 50,000 tweets per month related to these companies and stored them on our servers.

Data Pre-Processing

At Quordata, we pre-process all data before running it through our models. Pre-processing includes removing punctuation and URLs, lemmatizing text, removing stopwords, generating word embeddings, and vectorizing data with predefined tokenizers. This ensures that our models can perform at the highest level possible without losing relevant textual information.
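A simplified sketch of the cleaning steps above (the stopword list and lemma map here are tiny illustrative stand-ins for the full linguistic resources a production pipeline would use):

```python
import re

# Hypothetical stand-ins for real resources:
STOPWORDS = {"the", "a", "is", "to", "and", "of"}  # a full stopword list in practice
LEMMAS = {"shares": "share", "rising": "rise"}     # a proper lemmatizer in practice

def preprocess(tweet: str) -> list[str]:
    """Strip URLs and punctuation, lowercase, lemmatize, and drop stopwords."""
    text = re.sub(r"https?://\S+", "", tweet)          # remove URLs
    text = re.sub(r"[^\w\s]", "", text).lower()        # remove punctuation
    tokens = [LEMMAS.get(t, t) for t in text.split()]  # naive lemmatization
    return [t for t in tokens if t not in STOPWORDS]   # drop stopwords

print(preprocess("AAPL shares rising! https://example.com the news is good"))
# → ['aapl', 'share', 'rise', 'news', 'good']
```

The cleaned token list would then feed the embedding and tokenizer stages.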

Spam Model

Filtering spam is critical for obtaining accurate data from social media. We developed a custom Twitter spam-filtering model using TensorFlow and HuggingFace transformers. The model combines text and metadata (followers, retweets, likes) to estimate the likelihood that any given tweet is spam. We define spam as a tweet that does not contain a human user’s genuine input: tweets made by bots, automated messages, exclusively promotional tweets, and effortless tweets (such as news-story reposts) are all considered spam. Using these criteria, we manually labeled 15,000 data points and trained our model, achieving an overall accuracy of 90%. In other words, in a dataset of 1,000 tweets the model would correctly label approximately 900 and mislabel 100, a performance comparable to other industry-leading spam-detection models. Each new tweet run through the model receives a label (0 for clean, 1 for spam) and a confidence score (0-100) indicating how probable that label is.
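The text-plus-metadata fusion and the label-plus-confidence output can be sketched as follows (the function names, the log-scaling of metadata, and the 0.5 decision threshold are illustrative assumptions, not the exact production implementation):

```python
import numpy as np

def fuse_features(text_embedding: np.ndarray,
                  followers: int, retweets: int, likes: int) -> np.ndarray:
    """Concatenate the text embedding with log-scaled metadata counts."""
    metadata = np.log1p([followers, retweets, likes])  # log-scale heavy-tailed counts
    return np.concatenate([text_embedding, metadata])

def to_label(p_spam: float) -> tuple[int, int]:
    """Map the model's spam probability to (label, confidence 0-100)."""
    label = int(p_spam >= 0.5)                         # 0 = clean, 1 = spam
    confidence = round(100 * max(p_spam, 1 - p_spam))  # confidence in the chosen label
    return label, confidence

print(to_label(0.92))  # → (1, 92): confidently spam
print(to_label(0.10))  # → (0, 90): confidently clean
```

A probability near 0.5 yields a confidence near 50, flagging tweets the model is unsure about.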

Sentiment Model

In addition to the spam-filtering model, we developed a sentiment analysis model built on a pre-trained SiEBERT transformer. We manually labeled 30,000 tweets to train the model, which then predicts the sentiment of new data points, using the softmax probability to choose the classification. Every tweet is given a label (0 for positive, 1 for neutral, 2 for negative) and a confidence score, just as in the spam model. For example, a tweet labeled 0 with a confidence score of 95 is very likely positive; on the other hand, a tweet labeled 2 with a confidence score of 51 is only potentially negative.
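The softmax step works roughly like this (a minimal sketch; in the real pipeline the logits come from the transformer rather than being hand-supplied):

```python
import numpy as np

LABELS = {0: "positive", 1: "neutral", 2: "negative"}

def classify(logits: np.ndarray) -> tuple[int, int]:
    """Pick the class with the highest softmax probability; scale it to 0-100 as confidence."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()              # softmax
    label = int(probs.argmax())
    return label, round(100 * float(probs.max()))

label, conf = classify(np.array([3.2, 0.1, -1.0]))
print(LABELS[label], conf)  # → positive 94
```

A sharply peaked softmax gives a high confidence score; near-uniform probabilities give a score close to the ambiguous 51-style cases described above.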

Biterm Topic Model

Having generated thousands of spam-filtered, sentiment-labeled tweets, our next goal was to discover which subtopics were relevant within each query. We define a subtopic as a category of closely related tweets within an existing query. For example, among tweets that are all about Apple there will be subtopics of tweets about the iPhone, Apple Music, Apple’s competitor Samsung, and so on. The goal of the Biterm Topic Model (BTM) is to identify these subtopics automatically. Running tweets through the BTM yields a list of the top 8 subtopics and a corresponding subtopic label for each tweet. This data is presented as the closely related topics web on the dashboard.
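The "biterms" the BTM operates on are simply the unordered word pairs that co-occur within a single short text; the model then infers topics from the corpus-wide pattern of these pairs. A minimal sketch of the extraction step:

```python
from itertools import combinations

def biterms(tokens: list[str]) -> set[tuple[str, str]]:
    """All unordered word pairs co-occurring in one short text (the 'biterms' BTM models)."""
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}

print(biterms(["iphone", "apple", "launch"]))
# → {('apple', 'iphone'), ('apple', 'launch'), ('iphone', 'launch')}
```

Modeling word pairs rather than whole documents is what makes BTM well suited to very short texts like tweets, where per-document word co-occurrence is too sparse for classic topic models.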