Handling Fake News Using Python and ML

Introduction

Is all news real? Should we trust everything that is presented to us as news?
Sadly, fake news exists, and it tends to go viral, reducing the impact of real news.

Fake news is one of the most significant new disturbing trends that must be resolved; otherwise, the internet cannot truly serve and benefit humanity.
~Tim Berners-Lee

What is Fake News?

Fake news covers anything from intentionally fabricated reporting on a controversial topic to misleading information presented as news. It is generally spread through social media or other online media, and it often aims to damage the reputation of a person or an entity by pushing certain ideas with harmful intent.

Types of fake news

Satire or parody
False connection
Misleading content
False content
Impostor content
Manipulated content
Fabricated content

Such news may contain false claims and may end up being spread virally by algorithms. The sections below introduce the tools we will use to detect it:

TfidfVectorizer

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

Term Frequency (TF):

TF is the number of times a word appears in a document. A higher value means the term appears more often than others, so the document is presumably a good match when the term is part of the search.

Inverse Document Frequency (IDF):

IDF is a measure of how significant a term is in the entire corpus. Words that occur in many documents carry little information and receive a low IDF, while words that occur in only a few documents receive a high one.
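To make the two definitions concrete, here is a minimal sketch that computes TF and IDF by hand on a made-up three-document corpus. The corpus and the helper names (term_frequency, inverse_document_frequency, tf_idf) are hypothetical, and the IDF used is the plain textbook form log(N / df). scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, made up for illustration.
corpus = [
    "the president signed the bill",
    "the bill was fake",
    "aliens signed a treaty",
]
docs = [doc.split() for doc in corpus]

def term_frequency(term, doc):
    # Raw count of the term in one tokenized document.
    return Counter(doc)[term]

def inverse_document_frequency(term, docs):
    # Textbook IDF: log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# "the" appears twice in the first document but is common across the corpus,
# so it scores lower than the rarer word "president".
for term in ("the", "president"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

The common word ends up with the lower score even though it occurs more often in the document, which is the effect TF-IDF is designed to produce.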

Passive Aggressive Classifier

Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when a classification is correct and turns aggressive when it makes a mistake, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make the update that corrects the loss while causing very little change in the norm of the weight vector.
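The passive and aggressive behaviour can be sketched in a few lines. This is the basic textbook update for a binary problem, not scikit-learn's implementation: the function names are hypothetical, labels are assumed to be encoded as -1 and +1, and the feature vector is assumed to be non-zero. As far as I know, scikit-learn's classifier additionally uses a regularization parameter C that limits the step size.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pa_update(weights, features, label):
    """One basic Passive-Aggressive step for a label in {-1, +1}."""
    # Hinge loss: zero when the example is classified with margin >= 1.
    loss = max(0.0, 1.0 - label * dot(weights, features))
    if loss == 0.0:
        # Passive: the example is already handled well, keep the weights.
        return weights
    # Aggressive: take the smallest step that brings the loss to zero.
    tau = loss / dot(features, features)
    return [w + tau * label * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
example = [1.0, 2.0]

weights = pa_update(weights, example, +1)   # mistake, so the weights move
print(weights, dot(weights, example))

unchanged = pa_update(weights, example, +1)  # now correct, so nothing changes
print(unchanged == weights)
```

The step size tau is what keeps the change in the weight vector small: it is just large enough to fix the current example and no larger.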

Detecting Fake News with Python: The Project

This Python project for telling fake news from real news makes use of sklearn. We will build a TfidfVectorizer on our dataset, then initialize a PassiveAggressiveClassifier and fit the model so that it classifies a piece of news as either real or fake.

Project Prerequisites:

You will need the following:
1. Fake News Dataset
The dataset is a .csv file, which we will call news.csv. Its columns identify each news item and hold its title, its text, and labels denoting whether it is fake or real.
2. The Libraries
To perform this classification, you will need the basic data science stack: sklearn, numpy, and pandas. Other libraries such as transformers and PyCaret can also be used for this kind of task, but we will not need them here.
We'll install the following libraries with pip (sklearn is published under the package name scikit-learn):

pip install numpy pandas scikit-learn

3. Jupyter Lab
You will need to install Jupyter Lab to run your code.

STEPS: DETECTING FAKE NEWS WITH PYTHON

1. Import the Libraries:

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

2. Load the data:

Let's read the data into a DataFrame and get its shape and first rows. Next, we get the labels from the DataFrame, and finally we split the dataset into training and testing sets.

#Read the data
df=pd.read_csv('news.csv')

#Get the shape and head
df.shape 
df.head()

#Get the labels
labels=df.labels
labels.head()

#Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
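To see what the split does, here is a small sketch on a made-up 10-row DataFrame standing in for news.csv. The toy data is hypothetical; the column names text and labels follow the ones used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for news.csv.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "labels": ["FAKE", "REAL"] * 5,
})

x_train, x_test, y_train, y_test = train_test_split(
    toy["text"], toy["labels"], test_size=0.2, random_state=7
)
print(len(x_train), len(x_test))  # 8 2

# The same random_state reproduces the same split on every run.
again = train_test_split(toy["text"], toy["labels"], test_size=0.2, random_state=7)
print(x_test.index.equals(again[1].index))  # True
```

With test_size=0.2, 20% of the rows are held out for testing, and fixing random_state keeps the results comparable between runs.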

3. Initialize a TfidfVectorizer:

We'll initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7, so terms that appear in more than 70% of the documents are discarded.
Stop words are the most common words in a language, which carry little meaning and are filtered out before processing natural language data.
TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features. We have to fit and transform the vectorizer on the train set, and only transform on the test set.

#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
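The difference between fit_transform and transform is easier to see on a tiny example. The sketch below uses made-up sentences (hypothetical, not from the dataset) and leaves out max_df because the corpus is too small for it to be meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, made up for illustration.
train_docs = ["senate passes budget", "celebrity clone spotted on mars"]
test_docs = ["senate spotted unicorn"]

vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses that vocabulary

# Both matrices have one column per word learned from the training set.
print(train_matrix.shape, test_matrix.shape)

# A word seen only in the test set has no column and is silently ignored.
print("senate" in vectorizer.vocabulary_)
print("unicorn" in vectorizer.vocabulary_)
```

Fitting on the training set only keeps information about the test articles out of the model, which is why the test set is transformed but never fitted.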

4. Initialize a PassiveAggressiveClassifier:

We will initialize the PassiveAggressiveClassifier and then fit it on tfidf_train and y_train.

#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

We will predict on the test set, using the features from the TfidfVectorizer, and calculate the accuracy of the model.

#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
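Accuracy is simply the share of predictions that match the true labels. The sketch below checks this on five made-up labels (hypothetical values, not results from the dataset) by comparing a manual count with accuracy_score.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, made up for illustration.
y_true = ["FAKE", "REAL", "REAL", "FAKE", "REAL"]
y_hat = ["FAKE", "REAL", "FAKE", "FAKE", "REAL"]

# Manual accuracy: matching predictions divided by the total.
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

score = accuracy_score(y_true, y_hat)
print(manual == score)
print(f"Accuracy: {round(score * 100, 2)}%")  # Accuracy: 80.0%
```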

Finally, we can print out the confusion matrix to gain insight into the number of false and true negatives and positives, after which we have our results.

confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
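Reading the matrix takes a little practice: rows are the true labels and columns are the predicted labels, both in the order given by the labels argument. Here is a minimal sketch on six made-up predictions (hypothetical values, not results from the dataset); the variable names are only for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, made up for illustration.
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_hat = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]

matrix = confusion_matrix(y_true, y_hat, labels=["FAKE", "REAL"])
print(matrix)

# Row 0 holds the truly FAKE articles, row 1 the truly REAL ones.
(fake_as_fake, fake_as_real), (real_as_fake, real_as_real) = matrix
print("fake articles caught:", fake_as_fake)
print("fake articles missed:", fake_as_real)
print("real articles flagged as fake:", real_as_fake)
print("real articles recognized:", real_as_real)
```

The diagonal counts the correct predictions, so dividing its sum by the total number of articles gives the accuracy again.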

Conclusion

At this point, we have learnt how to detect fake news with Python.
We have learnt how to load a fake news dataset, initialize a TfidfVectorizer, and fit a PassiveAggressiveClassifier on its output to build our model.

Thanks for the read!

Bibliography:
Dataflair team
Fake News according to Wikipedia.

Reagan's Typo