Rumours Detection using Neural Network based Model

Rajveer Beerda
Mar 15, 2020

Rumour detection is a task from the SemEval 2019 challenge hosted on CodaLab. The aim is to determine the veracity of rumours and to analyse how other social media users react to a rumour when they reply to the post that states it. The problem is divided into two subtasks. In Subtask A, the goal is to classify each tweet in the conversation thread as supporting, querying, denying or commenting (SQDC) on the rumour initiated by the source tweet. In Subtask B, the goal is to classify that rumour as true, false or unverified.

About the Dataset

The dataset is provided by the organizers. It contains posts from Twitter and Reddit, which provide diversity in the types of users, more focused discussions and longer posts in tree-structured conversations. Each conversation is defined by a post that initiates it and a set of nested replies that form a conversational thread. The dataset also contains the number of likes and retweets for each post.
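To make the tree structure concrete, here is a minimal sketch of how one conversation thread could be held in memory and flattened. The `Post` class, its field names and the example texts are hypothetical and only illustrate the idea; they are not the organizers' file format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Post:
    """One tweet or Reddit post in a thread (hypothetical schema)."""
    post_id: str
    text: str
    likes: int = 0
    retweets: int = 0
    replies: List["Post"] = field(default_factory=list)


def walk(post: Post, depth: int = 0) -> Iterator[Tuple[int, Post]]:
    """Yield (depth, post) pairs depth-first, starting with the source post."""
    yield depth, post
    for reply in post.replies:
        yield from walk(reply, depth + 1)


# A made-up thread: one source post, two direct replies, one nested reply.
source = Post("1", "Breaking: bridge closed after incident", likes=12, retweets=30)
reply_a = Post("2", "Is there a source for this?", likes=3)
reply_b = Post("3", "Confirmed by the local police account", likes=8, retweets=2)
reply_a.replies.append(Post("4", "Nothing official yet", likes=1))
source.replies.extend([reply_a, reply_b])

flattened = [(depth, p.post_id) for depth, p in walk(source)]
```

Flattening the thread this way gives one row per post while keeping its depth, which is roughly the shape a per-post classifier for Subtask A needs.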

Number of Labelled instances for Subtask A
Number of Labelled instances for Subtask B

Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is an artificial Recurrent Neural Network (RNN) architecture used in the field of deep learning. It can process not only single data points but also entire sequences of data. LSTM networks are well suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.

LSTM Architecture

LSTM Unit Cell
LSTM Cell Internals

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. Some variations of the LSTM unit lack one or more of these gates, or have other gates. For example, gated recurrent units (GRUs) do not have an output gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The activation function of the LSTM gates is often the logistic sigmoid function.
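The gate description above can be sketched as a single LSTM time step in NumPy. This is an illustrative sketch of the standard formulation, not the Keras implementation; the weight layout (gates stacked in the order input, forget, candidate, output) and the names `W`, `U` and `b` are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * hidden, input_dim), U: (4 * hidden, hidden), b: (4 * hidden,)
    Rows are stacked in the order: input gate, forget gate, candidate, output gate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:hidden])                # input gate: how much new value enters the cell
    f = sigmoid(z[hidden:2 * hidden])       # forget gate: how much of the old cell state remains
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate value proposed for the cell
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: how much of the cell is exposed
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


# Run five random time steps through a tiny cell (3 inputs, 4 hidden units).
rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the gates are sigmoids, each one acts as a soft switch between 0 and 1, which is what lets the cell state carry information across many time steps.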

Approach

The main features for this task are the number of likes, retweets and replies to the source post. For Subtask B, if the majority of replies (weighted by their like and retweet counts) are in favour of the source post, the post is classified as a true rumour. If the majority of replies are against the source post, it is classified as a false rumour. Posts that have no replies are classified as unverified.
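The Subtask B rule described above can be sketched as a weighted vote. The weighting used here (each reply counts once, plus once per like and per retweet), the handling of ties as unverified, and the decision to ignore querying and commenting replies are all assumptions made for this illustration; the post does not spell them out.

```python
from typing import Iterable, Tuple

# (stance, likes, retweets); stance is one of "support", "deny", "query", "comment"
Reply = Tuple[str, int, int]


def veracity_from_replies(replies: Iterable[Reply]) -> str:
    """Return "true", "false" or "unverified" from the stances of the replies."""
    support = 0
    deny = 0
    for stance, likes, retweets in replies:
        weight = 1 + likes + retweets  # assumed weighting, see the note above
        if stance == "support":
            support += weight
        elif stance == "deny":
            deny += weight
        # querying and commenting replies do not vote in this sketch
    if support > deny:
        return "true"
    if deny > support:
        return "false"
    return "unverified"  # no replies, or a tie


no_replies = veracity_from_replies([])
mostly_denied = veracity_from_replies([("support", 0, 0), ("deny", 5, 1)])
mostly_supported = veracity_from_replies(
    [("support", 2, 0), ("deny", 0, 0), ("comment", 9, 9)]
)
```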

Importing Libraries

# Import paths follow the Keras 2.x API used in this post.
import numpy as np
import pandas as pd
from collections import defaultdict
import seaborn as sns
from matplotlib import pyplot as plt

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Embedding
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.layers import BatchNormalization
from keras.layers import LSTM, GRU
from keras.models import Model
from keras.models import Sequential

from sklearn.model_selection import train_test_split

Data Preprocessing

In the preprocessing step, we remove all hashtags, user names and web links using the Python library preprocessor. We then use a tokenizer to split sentences into word tokens, pad the resulting sequences so that they all have the same length, and convert the labels to one-hot vectors.

import re

import preprocessor as p


def clean_str(string):
    # Strip backslashes and quote characters, then trim whitespace and lowercase.
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

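# Assumed values: the post uses these two constants without showing their definitions.
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 100

data_train = pd.read_csv('../twitter_training_dataset_a.csv')

list_labels = list(set(data_train.labels))
texts = []
labels = []

# Clean every post and collect it with its label.
for i in range(data_train.text.shape[0]):
    text = p.clean(str(data_train.text[i]))
    texts.append(text)
    labels.append(data_train.labels[i])

# Map each word to an integer index and each text to a sequence of indices.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index

# Pad to a fixed length; to_categorical expects integer class ids.
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels), num_classes=len(list_labels))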
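To show what tokenizing, padding and one-hot encoding do to the data, here is a dependency-free sketch of the same three steps on a toy corpus. The helper names `build_word_index`, `pad_left` and `one_hot` are made up for this example; the sketch mirrors the Keras defaults (indices start at 1 for the most frequent word, 0 is the padding value, and padding and truncation happen at the front) but is not the Keras code itself.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: rank + 1 for rank, (word, _) in enumerate(ordered)}


def pad_left(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros (maxlen must be > 0)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


toy_texts = ["is this true", "this is false", "true"]
toy_index = build_word_index(toy_texts)
toy_sequences = [[toy_index[w] for w in t.split()] for t in toy_texts]
toy_padded = [pad_left(s, 4) for s in toy_sequences]
toy_label = one_hot(2, 4)
```

Here `toy_padded` is `[[0, 1, 2, 3], [0, 2, 1, 4], [0, 0, 0, 3]]`: every row has length 4, and shorter texts are filled with zeros at the front.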

Splitting data into training and testing set

We shuffle the data and split it into a training set and a test set, with 20% of the data held out for testing.

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train, x_test, y_train, y_test = train_test_split( data, labels, test_size=0.20, random_state=42)

Using pre-trained word embeddings

For word embeddings, we use pre-trained GloVe vectors. The GloVe file used here maps around 400,000 words to a 100-dimensional vector each.

# Load the GloVe vectors into a word -> vector dictionary.
embeddings_index = {}
with open('../../glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Start from random rows, then overwrite the rows of words that GloVe knows.
embedding_matrix = np.random.random((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH)
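It can be useful to check how much of the vocabulary actually receives a pre-trained vector. The sketch below builds the matrix the same way on a toy vocabulary and also records the words that were not found. The function name `build_embedding_matrix`, the three-dimensional toy vectors and the out-of-vocabulary token are hypothetical and stand in for the real GloVe file.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim, seed=0):
    """Return (matrix, missing_words); row 0 is left for the padding index."""
    rng = np.random.default_rng(seed)
    matrix = rng.random((len(word_index) + 1, dim))  # random start, as in the post
    missing = []
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # this row keeps its random values
        else:
            matrix[i] = vec
    return matrix, missing


# Toy stand-ins for the GloVe dictionary and the tokenizer's word index.
toy_vectors = {
    "rumour": np.array([0.1, 0.2, 0.3]),
    "true": np.array([0.4, 0.5, 0.6]),
}
toy_word_index = {"rumour": 1, "true": 2, "afwaah": 3}

matrix, missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
coverage = 1 - len(missing) / len(toy_word_index)
```

A low `coverage` on the real vocabulary would suggest that many tokens, for example non-English words, start from random rows rather than pre-trained ones.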

Building the model

The model consists of the pre-trained embedding layer, followed by a Conv1D layer with 32 filters and a kernel size of 5, a max-pooling layer, an LSTM layer with 100 units and 20% dropout on both its inputs and its recurrent connections, a batch normalization layer, three dense layers with 128, 64 and 32 nodes respectively, and the output layer (4 nodes for Subtask A, or 3 nodes for Subtask B). The model was trained on the training set for 50 epochs with a batch size of 128.

model = Sequential()
model.add(embedding_layer)
# Convolution over word windows of size 5, then halve the sequence length.
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(list_labels), activation='softmax'))
# One-hot labels with a softmax output pair with categorical cross-entropy.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=50, batch_size=128)

Training Accuracy and Loss
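As a rough check on the size of this model relative to the dataset, the parameter count of each layer after the embedding can be worked out by hand. The sketch below uses the standard formulas for each layer type and the sizes given above. The embedding layer is left out because its size depends on the vocabulary, and the batch normalization figure includes its non-trainable moving statistics.

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gate blocks, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def batchnorm_params(features):
    # gamma, beta, moving mean and moving variance
    return 4 * features


def dense_params(inputs, outputs):
    return inputs * outputs + outputs


num_classes = 4  # Subtask A; use 3 for Subtask B

counts = {
    "conv1d": conv1d_params(5, 100, 32),
    "lstm": lstm_params(32, 100),
    "batch_norm": batchnorm_params(100),
    "dense_128": dense_params(100, 128),
    "dense_64": dense_params(128, 64),
    "dense_32": dense_params(64, 32),
    "output": dense_params(32, num_classes),
}
total = sum(counts.values())
```

For Subtask A this gives a total of 93,028 parameters after the embedding layer, most of them in the LSTM (53,200).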

Testing the model

On the test set, the model reached an accuracy of 47.57% for Subtask A and 42.86% for Subtask B, with losses of 1.204 and 0.453 respectively.
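Accuracy alone can hide what happens on the rarer classes, so it may help to look at a per-class measure as well. The sketch below computes accuracy and macro-averaged F1 from integer class ids; the toy labels are invented for illustration and are not results from this model. With Keras, the predicted ids could be obtained with something like `np.argmax(model.predict(x_test), axis=1)`.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy case: a model that always predicts the majority class 0.
toy_true = [0, 0, 0, 0, 1, 2]
toy_pred = [0, 0, 0, 0, 0, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred, num_classes=3)
```

In this toy case the accuracy is about 0.67 while the macro F1 is only about 0.27, because two of the three classes are never predicted.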

Conclusion

The main factor behind the model's low performance is the very small dataset, which did not give the model enough examples to train properly. Another factor may be posts in languages other than English, since words from other languages have no vector in the GloVe embeddings. The embedding rows for these words therefore start from random values instead of a pre-trained vector.

