Rumours Detection using Neural Network based Model
Rumour detection was a task in the SemEval 2019 challenge, hosted on CodaLab. The aim is twofold: to determine the veracity of rumours, and to determine how other social media users react to a rumour when they reply to the post that states it. The problem is divided into two subtasks, Subtask A and Subtask B. In Subtask A, the goal is to classify each tweet in a conversation thread as supporting, querying, denying or commenting (SQDC) on the rumour initiated by the source tweet. In Subtask B, the goal is to classify that rumour as true, false or unverified.
About the Dataset
The dataset is provided by the organizers. It contains posts from Twitter and Reddit, which provide diversity in the types of users, more focused discussions and longer posts in tree-structured conversations. Each conversation is defined by a tweet that initiates it and a set of nested replies that form a conversational thread. The dataset also contains the number of likes and retweets for each post.
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. It can process not only single data points but also entire sequences of data. LSTM networks are well suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.
LSTM Architecture
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. Some variations of the LSTM unit lack one or more of these gates or have other gates. For example, gated recurrent units (GRUs) do not have an output gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The activation function of the LSTM gates is often the logistic sigmoid function.
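To make the gate description concrete, here is a minimal NumPy sketch of one step of a standard LSTM cell. It is an illustration only, not the Keras implementation used later in this post; the function names, the weight layout and the toy sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, the usual gate activation."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Rows are stacked in the order input, forget, candidate, output
    (an assumed layout for this sketch).
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new value enters
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old value stays
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the cell
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much of the cell is exposed
    c_t = f * c_prev + i * g             # updated cell state
    h_t = o * np.tanh(c_t)               # output activation of the unit
    return h_t, c_t

# Toy sizes and random weights, purely for demonstration
rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(5, input_dim))  # 5 time steps
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The cell state `c` is carried from step to step, which is what lets the unit remember values across gaps of unknown length.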
Approach
The main features for this task are the numbers of likes, retweets and replies to the source post. For Subtask B, if the majority of replies (taking their like and retweet counts into account) are in favour of the source post, the post is classified as a true rumour; if the majority of replies are against the source post, it is classified as a false rumour. Posts that have no replies are classified as unverified.
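The majority rule described above can be sketched in a few lines of Python. This is one possible reading, not the exact rule used in the experiments: the weighting `1 + likes + retweets`, the stance names, the tie handling and the function name `classify_veracity` are all assumptions made for illustration.

```python
def classify_veracity(replies):
    """Label a source post from the stances of its replies.

    replies: list of dicts with keys 'stance', 'likes' and 'retweets'.
    Only 'support' and 'deny' replies vote; others are ignored (assumption).
    """
    if not replies:
        return "unverified"  # no replies at all

    support = 0
    deny = 0
    for reply in replies:
        # Each reply counts once, plus once per like and retweet (assumed weighting)
        weight = 1 + reply.get("likes", 0) + reply.get("retweets", 0)
        if reply["stance"] == "support":
            support += weight
        elif reply["stance"] == "deny":
            deny += weight

    if support > deny:
        return "true"
    if deny > support:
        return "false"
    return "unverified"  # tie: treated as unverified in this sketch

thread = [
    {"stance": "support", "likes": 3, "retweets": 1},
    {"stance": "deny", "likes": 0, "retweets": 0},
    {"stance": "comment", "likes": 10, "retweets": 2},
]
label = classify_veracity(thread)
```

In this toy thread the supporting reply outweighs the denying one, so the source post is labelled as a true rumour.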
Importing Libraries
import numpy as np
import pandas as pd
from collections import defaultdict
import seaborn as sns
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Embedding
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.models import Model
from keras.models import Sequential
from keras.layers import BatchNormalization
from matplotlib import pyplot as plt
from keras.layers import LSTM, GRU
from sklearn.model_selection import train_test_split
Data Preprocessing
In the preprocessing step, we remove all hashtags, user names and web links using the Python library preprocessor. We then use a tokenizer to split sentences into word tokens, pad the resulting sequences so that they all have the same length, and convert the labels to one-hot vectors.
import re
import preprocessor as p

def clean_str(string):
    # Strip backslashes and quote characters, then trim and lowercase
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()
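data_train = pd.read_csv('../twitter_training_dataset_a.csv')
list_labels = list(set(data_train.labels))

# Clean every post and collect its label
texts = []
labels = []
for i in range(data_train.text.shape[0]):
    text = p.clean(str(data_train.text[i]))
    texts.append(text)
    labels.append(data_train.labels[i])

# Placeholder values (assumption): the original constants are not stated
MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 100

# Map words to integer ids and pad every sequence to the same length
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# One-hot encode the labels (expects integer class ids)
labels = to_categorical(np.asarray(labels), num_classes=len(list_labels))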
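To show what these preprocessing steps do to the data, here is a small dependency-light sketch that mimics them with regular expressions and NumPy. It is a stand-in for illustration, not the behaviour of the preprocessor library or the Keras utilities in every detail; all function names and the sample posts are hypothetical.

```python
import re
from collections import Counter

import numpy as np

def strip_tweet_noise(text):
    """Remove web links, user names, hashtags and punctuation, then lowercase."""
    text = re.sub(r"https?://\S+", " ", text)  # web links
    text = re.sub(r"[@#]\w+", " ", text)       # user names and hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # remaining punctuation
    return " ".join(text.lower().split())

def build_word_index(texts):
    """Give the most frequent word id 1; id 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.split())
    return {w: rank + 1 for rank, (w, _) in enumerate(counts.most_common())}

def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to id sequences, left-padded with zeros to maxlen."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.split() if w in word_index]
        seq = seq[-maxlen:]  # keep the last maxlen tokens if too long
        rows.append([0] * (maxlen - len(seq)) + seq)
    return np.array(rows, dtype="int32")

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    out = np.zeros((len(labels), num_classes), dtype="float32")
    out[np.arange(len(labels)), labels] = 1.0
    return out

raw_posts = ["@bob this is FALSE http://t.co/x #hoax", "Is this true?"]
cleaned = [strip_tweet_noise(t) for t in raw_posts]
word_index = build_word_index(cleaned)
padded = texts_to_padded(cleaned, word_index, maxlen=5)
targets = one_hot([1, 2], num_classes=4)
```

After these steps every post is a fixed-length row of integers and every label is a one-hot row, which is the input format the network expects.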
Splitting data into training and testing set
We shuffle the data and split it into training and test sets, holding out 20% of the examples for testing.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.20, random_state=42)
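For readers who want to see what the shuffle and split amount to, here is a NumPy-only sketch of the same idea. The helper name `shuffled_split` and the toy arrays are hypothetical; it is an approximation of the behaviour, not a drop-in replacement for scikit-learn's function.

```python
import numpy as np

def shuffled_split(data, labels, test_size=0.2, seed=42):
    """Shuffle rows with one shared permutation, then hold out a test fraction."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_test = int(np.ceil(test_size * len(data)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

# Toy data where label = 10 * feature, so alignment is easy to check
toy_data = np.arange(10).reshape(10, 1)
toy_labels = np.arange(10) * 10
tr_x, te_x, tr_y, te_y = shuffled_split(toy_data, toy_labels)
```

Using the same index permutation for both arrays is what keeps each post paired with its own label after shuffling.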
Using pre-trained word embeddings
For word embeddings, we used pre-trained GloVe vectors. The GloVe file used here maps around 400,000 words to a 100-dimensional vector each.
# Parse the GloVe file into a word -> vector dictionary
embeddings_index = {}
with open('../../glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Rows start out random; words found in GloVe are overwritten with their vector
embedding_matrix = np.random.random((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH)
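It can be useful to check how much of the vocabulary actually has a pre-trained vector. The sketch below builds an embedding matrix from a tiny in-memory stand-in for a GloVe file and reports the coverage. The fake vectors, the three-word vocabulary and the helper names are made up for the example, and missing words are left as zero rows here (an assumption of this sketch, unlike the random initialisation in the code above).

```python
import io

import numpy as np

def load_glove(handle):
    """Read 'word v1 v2 ...' lines into a word -> vector dictionary."""
    vectors = {}
    for line in handle:
        values = line.split()
        vectors[values[0]] = np.asarray(values[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Return the embedding matrix and the fraction of words found."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    found = 0
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
            found += 1
    coverage = found / max(len(word_index), 1)
    return matrix, coverage

# Tiny fake GloVe file with 3-dimensional vectors (illustrative data only)
fake_glove = io.StringIO("rumour 0.1 0.2 0.3\nfalse 0.4 0.5 0.6\n")
glove_vectors = load_glove(fake_glove)

toy_word_index = {"rumour": 1, "false": 2, "falso": 3}  # 'falso' is not in the file
toy_matrix, coverage = build_embedding_matrix(toy_word_index, glove_vectors, dim=3)
```

A low coverage value would indicate that many tokens, for example words from other languages, enter the network without any pre-trained information.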
Building the model
The model consists of the pre-trained embedding layer, followed by a Conv1D layer with a kernel size of 5, a max-pooling layer, an LSTM layer with 100 units and 20% dropout, batch normalization, three dense layers with 128, 64 and 32 nodes respectively, and the output layer (4 nodes for Subtask A, 3 nodes for Subtask B). The model was trained on the training set for 50 epochs with a batch size of 128.
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(list_labels), activation='softmax'))
# A softmax output with one-hot labels calls for categorical cross-entropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50, batch_size=128)
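To get a feel for the size of this network, the following sketch works out the tensor shapes and parameter counts of each layer by hand, using the standard formulas for convolutional, LSTM and dense layers. The sequence length of 100 is a hypothetical value, and the embedding layer is left out because its size depends on the vocabulary.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights per filter plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def lstm_params(input_dim, units):
    """Four blocks (input, forget, candidate, output), each with
    input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(in_dim, out_dim):
    """Weight matrix plus bias vector."""
    return in_dim * out_dim + out_dim

seq_len = 100     # hypothetical MAX_SEQUENCE_LENGTH
embed_dim = 100   # GloVe vector size used in this post
num_classes = 4   # Subtask A

# Shapes per example (batch dimension omitted)
shape_after_embedding = (seq_len, embed_dim)
shape_after_conv = (seq_len, 32)        # padding='same' keeps the length
shape_after_pool = (seq_len // 2, 32)   # pool_size=2 halves the length
shape_after_lstm = (100,)               # only the last LSTM output is kept

layer_params = {
    "conv1d": conv1d_params(5, embed_dim, 32),
    "lstm": lstm_params(32, 100),
    "batch_norm": 4 * 100,  # scale, shift, moving mean, moving variance
    "dense_128": dense_params(100, 128),
    "dense_64": dense_params(128, 64),
    "dense_32": dense_params(64, 32),
    "output": dense_params(32, num_classes),
}
total_params = sum(layer_params.values())
```

Even without the embedding layer, the model has tens of thousands of parameters, which is worth keeping in mind when the training set is small.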
Testing the model
When tested, the model reached an accuracy of 47.57% with a loss of 1.204 on Subtask A, and an accuracy of 42.86% with a loss of 0.453 on Subtask B.
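In Keras, figures like these are typically obtained with `model.evaluate(x_test, y_test)`. As a rough illustration of what the two numbers mean, the sketch below computes accuracy and categorical cross-entropy from one-hot labels and predicted probabilities, and compares the accuracy with a majority-class baseline. The arrays are made-up toy values, not the results above, and the function names are hypothetical.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of examples whose most probable class is the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    counts = y_true.sum(axis=0)
    return float(counts.max() / counts.sum())

# Toy example: 4 posts, 3 classes
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype="float32")
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.6, 0.3],
                     [0.5, 0.3, 0.2],
                     [0.5, 0.3, 0.2]], dtype="float32")

toy_acc = accuracy(toy_true, toy_prob)
toy_loss = categorical_crossentropy(toy_true, toy_prob)
toy_baseline = majority_baseline_accuracy(toy_true)
```

Comparing the model's accuracy with the majority-class baseline on the same test set gives a quick sense of whether the network has learned anything beyond the class distribution.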
Conclusion
The main reason for the model's low performance is the very small dataset: with so few examples, the model could not be trained properly. Another factor may be posts written in languages other than English, since words from other languages have no vector in the GloVe embeddings. The embedding rows for these words therefore carry no pre-trained information and keep their randomly initialised values.