Using GCNs to Predict the Political Divide from Blog Posts
Motivation & Explanation of Data and Problem Statement
Political Landscape and Node Prediction Task
This post is tailored mostly towards social scientists who want to get more familiar with the graph deep learning methods that have enabled exciting advances in the analysis of complex graph-structured systems. With over 5 billion internet users in 2023, more than 60% of the globe’s population, the scale at which we can observe social and behavioral phenomena calls for methods capable of handling that sheer volume of data and information. For this reason, we won’t dive too deeply into the technical details, and will instead focus on the basic components and the analysis they make possible. Ideally, the only prerequisite is a basic familiarity with Python!
Over the last several years, we’ve seen a sharp increase in political polarization. This has been, in part, fueled by technologically constructed echo chambers, where individuals find themselves increasingly surrounded by like-minded individuals, and decreasingly exposed to ideas that challenge their current worldview in productive ways. Measuring and identifying echo chambers has been an ongoing field of research, but a critical component of assessing the information diversity of a group is identifying characteristics of the media that group consumes.
For those interested in practicing Computational Social Science, the goal is to better analyze the underlying characteristics of the large groups of individuals whose behavior we hope to study and understand. In this tutorial, we will demonstrate how to use some of the great open source Python libraries, such as DeepSNAP, PyTorch, and PyTorch Geometric (PyG), to conduct machine learning prediction tasks that help us better characterize the social phenomena ubiquitous in our politically complex technological environment.
In the hope of making this as accessible as possible, let’s cut through the jargon!
We’ll specifically be conducting tasks over Graphical Representations of networks. What does this mean? Simply put, graphical representations are a way of representing complex systems with some sets of nodes and edges. If you were lucky enough to be a 90’s kid and got to play around with K’nex, you can think of the nodes as the connector hubs and the edges as the rods. This generalizes out to broader systems, as well! Subway systems can be represented where stops are nodes and routes are edges; complex molecules can be represented with atoms being nodes and chemical bonds being edges; and in our case, the ecosystem of political blogs can be represented by individual blogs being nodes and the edges being links between the blogs.
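To make the node-and-edge idea concrete, here is a minimal sketch of a tiny blog network. The blog names and links are made up for illustration; the two-row `edge_index` array mirrors the layout PyG uses to store edges (one column per directed link, source on top, target below).

```python
import numpy as np

# Hypothetical blogs (nodes); the names are placeholders, not from the dataset
blogs = ["blog_a", "blog_b", "blog_c", "blog_d"]
node_id = {name: i for i, name in enumerate(blogs)}

# Hypothetical directed links (edges): (source blog, blog it links to)
links = [("blog_a", "blog_b"), ("blog_a", "blog_c"),
         ("blog_b", "blog_a"), ("blog_d", "blog_c")]

# Stack the edges into a 2 x num_edges integer array
edge_index = np.array([[node_id[s] for s, _ in links],
                       [node_id[t] for _, t in links]])

print(edge_index.shape)  # (2, 4): four directed links
print(edge_index)
```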
Now that we’ve got some familiarity with the terminology, let’s get a little bit more acquainted with our task.
As previously mentioned, we’re interested in the tasks involved in identifying political echo chambers in communities. This is a very challenging problem, so we need to be smart about how we decompose it into more manageable subcomponents. One such subcomponent is characterizing the sources of information being shared in the network. Naturally, if all of the information is coming from one side of the political spectrum, that’s a strong indicator that we may be observing an echo chamber. However, once you start looking at real-world datasets, you may find that things are a bit messier than you initially thought.
While there are great online resources that provide fairly objective reporting on the bias or slant of certain news outlets, we may observe thousands of relatively unknown microblogs or independent websites that aren’t covered by these resources. In certain subcommunities, these smaller, less “mainstream” media sources may make up the bulk of the shared information. So, it’s on us to learn how to classify these microblogs! This task is important in its own right, but ideally it will also let us use the political slant of the microblogs shared in a community as a meaningful feature (that is, a source of information for our broader classification task) as we look to tackle the larger problem.
Dataset Choice
Great! We’ve gotten familiar with our terminology and now understand why our task is important as a subcomponent to answering a larger question. Let’s talk about data.
We could write an entire other blog post on the importance of data and the countless ways of improving it for use in a machine learning pipeline, but for the purpose of this tutorial, we will be making use of a nice little dataset that is built into the PyTorch Geometric library: PolBlogs.
This dataset was derived from the paper “The Political Blogosphere and the 2004 U.S. Election: Divided They Blog”. It contains 1,490 nodes, which represent political blogs, and 19,025 edges, which represent links between the blogs. At this point, you can probably see why we’ve chosen this particular dataset to demonstrate this task. It’s important to note that different architectures and modeling approaches have different requirements for datasets. Again, that could be the topic of an entirely different blog post, but the critical takeaway here is that the feature or characteristic you’re trying to predict (for these kinds of architectures) generally has to be represented in the dataset that you’re using for model training.
Splitting Dataset
Now that we’ve identified our dataset, there are three steps we need to carry out with it: training, tuning, and inference. It’s generally considered good “hygiene” for each of these steps to have its own unique subset of the data.
A quick primer on these three different steps and why it’s important for us to delineate these dataset “splits”:
For training, we normally allocate the largest proportion of the dataset to this split. The purpose of this split is to show our algorithm examples of our data, as well as the ground-truth label associated with each example.
For the development/validation split, we normally use half of the remaining data. We use this split for tuning hyperparameters and making other adjustments to our architecture to maximize performance.
Lastly, for inference/prediction, we keep a hold-out set of the remaining data points that we use to determine the end performance of our model. It’s important that these are *not* the same data points that we trained on. We want our model to be forced to make predictions on data it hasn’t seen before to see if it’s actually learning!
With that out of the way, let’s take a look at how we actually do this with Python:
Data Cleaning
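To get started, we partition our dataset into train, validation and test groups using an 80–10–10 split.
import numpy as np

# Shuffle the node indices so each split is a random sample of the blogs
inds = np.arange(0, len(data['y']))
np.random.shuffle(inds)
# First 80% of the shuffled indices for training
train_set = inds[:int(len(inds) * .8)]
# Remaining 20% is divided evenly between validation and test
not_train = inds[int(len(inds) * .8):]
val_set = not_train[:len(not_train) // 2]
test_set = not_train[len(not_train) // 2:]
split_idx = {
    'train': train_set,
    'valid': val_set,
    'test': test_set
}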
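As a quick sanity check on the split above, the sketch below reproduces the same slicing on a stand-in index array of 1,490 nodes and confirms that the three groups are disjoint and together cover every node. The seeded `np.random.default_rng` generator and the `num_nodes` and `n_train` variables are our additions, used so the shuffle is repeatable; the original code uses the global `np.random.shuffle`.

```python
import numpy as np

num_nodes = 1490  # number of blogs in the dataset
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
inds = rng.permutation(num_nodes)

n_train = int(num_nodes * 0.8)
train_set = inds[:n_train]
not_train = inds[n_train:]
val_set = not_train[:len(not_train) // 2]
test_set = not_train[len(not_train) // 2:]

print(len(train_set), len(val_set), len(test_set))  # 1192 149 149

# No node appears in more than one split, and none is left out
all_idx = np.concatenate([train_set, val_set, test_set])
assert len(np.unique(all_idx)) == num_nodes
```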
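This dataset does not come with node features, so we give every node the same constant feature. With uniform features, no node starts out with an advantage, and the model has to rely on the link structure alone to tell the blogs apart.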
# One constant feature per node; float32 matches the default dtype of PyTorch layers
data.x = torch.ones((data.num_nodes, 1), dtype=torch.float)
Dataset Exploration
Sometimes it’s tough to get an understanding of what’s happening under the hood with your datasets from strictly looking at a DataFrame or tabular representation. For this reason, we’ll start exploring some of the other libraries that help us characterize our data, especially when it’s represented as a graph.
One such library is NetworkX. It can be a powerful tool in your arsenal, and allows you to explore your data in greater depth and with greater ease! Let’s take a look at how we can use NetworkX in conjunction with PyG to see what’s happening with our data.
# NetworkX Visualization
import networkx as nx
from torch_geometric.utils import to_networkx
# Check out Datatype
print(type(data))
# Let's take advantage of the built-in function to convert straight to a networkx DiGraph!
G = to_networkx(data)
print(type(G))
# Confirm that our dataset does indeed have the right number of nodes and edges (blogs and links)
print(f'This dataset has {G.order()} nodes and {G.size()} edges.')
When we run these cells, we confirm:
<class 'torch_geometric.data.data.Data'>
<class 'networkx.classes.digraph.DiGraph'>
This dataset has 1490 nodes and 19025 edges.
There’s a lot we can explore in NetworkX. Thanks to the functionality built into the library, we can answer some initial questions we might have about our network of political blogs. One such question might be about the general characteristics of link sharing in this ecosystem. Let’s take a quick peek at how to get the average number of outgoing links per blog.
# Each entry of out_degree is a (node, number of outgoing links) pair
print(list(G.out_degree())[0:5])
total_out = 0
for value in G.out_degree:
    total_out += value[1]
print(f'Average out degree: {total_out/len(G.out_degree)}')
>>> [(0, 15), (1, 43), (2, 0), (3, 0), (4, 3)]
>>> Average out degree: 12.768456375838927
Here, we’ve taken advantage of some of the nice functionality provided by the networkx DiGraph class: by iterating over out_degree, we can go through every node in the graph and calculate the average number of outgoing links per blog. We can see that the dataset mean is approximately 13 links. Let’s see if we can dig in a little bit more and get some more insights.
# Let's break down the average out degree by class.
# First, let's look at how our dataset is distributed by class label.
print(data.y)
>>> tensor([0, 0, 0, ..., 1, 1, 1], device='cuda:0')
# It appears that the dataset is sorted by label.
# Here we use np.argmax to give us the index of the class switch
first_index = np.argmax(data.y.cpu() > 0)
print(f'Index of first example of class 1: {first_index}')
total_0 = 0
total_1 = 0
for value in G.out_degree:
    # Nodes before first_index belong to class 0, the rest to class 1
    if value[0] < first_index: total_0 += value[1]
    else: total_1 += value[1]
print(f'{total_0/first_index, total_1/(len(G.out_degree) - first_index)}')
>>> Index of first example of class 1: 758
>>> (tensor(12.0950), tensor(13.4658))
Fortunately, and unfortunately, it appears that there isn’t an obvious delineation between the classes in terms of out-degree. It’s still worth checking! Not every problem needs complex deep learning methods, and if there were a simple heuristic we could rely on, a simpler approach would be more sensible. Here, however, it appears that we’re going to need something a bit more robust.
In the next section, we’ll explain our approach, which combines modern graph methods with deep learning. Again, we may not need to throw the entire kitchen sink at this task, so we’re going to start with some of the basic, tried-and-true methods.
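The per-class comparison above relies on the labels being sorted, so that a single index separates the two classes. A minimal sketch of an alternative that does not depend on node ordering is shown below, using a small made-up graph: `np.bincount` counts how often each node appears as a link source, and boolean masks on the labels select each class. The arrays `toy_edge_index` and `toy_labels` are illustrative stand-ins for `data.edge_index` and `data.y`.

```python
import numpy as np

# Made-up directed edges: row 0 holds source nodes, row 1 holds targets
toy_edge_index = np.array([[0, 0, 1, 3, 4, 4, 4],
                           [1, 2, 0, 2, 0, 1, 3]])
# Made-up class labels, deliberately not sorted
toy_labels = np.array([0, 1, 0, 1, 1])

# Out-degree = number of times each node appears as a source
out_degree = np.bincount(toy_edge_index[0], minlength=len(toy_labels))
print(out_degree)  # [2 1 0 1 3]

# Average out-degree per class via boolean masks
mean_0 = out_degree[toy_labels == 0].mean()
mean_1 = out_degree[toy_labels == 1].mean()
print(mean_0, mean_1)

# The overall average equals number of edges / number of nodes
overall = toy_edge_index.shape[1] / len(toy_labels)
assert np.isclose(out_degree.mean(), overall)
```

The same identity explains the figure reported earlier: 19,025 edges divided by 1,490 nodes gives the average out-degree of roughly 12.77.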
Explanation of Models
Our Choice of Graph ML model: GCN
Our task is to use PyG for node property prediction (node classification): predicting the political inclination of each node, where each node represents a blog. Specifically, we will build our graph neural network on Graph Convolutional Networks (GCNs), which were introduced by Kipf and Welling in 2017.
GCN Model
We use a GCN with three GCNConv layers with a hidden dimension of 256, batch norm, ReLU, dropout of 0.5, and a log softmax final layer. We train for 100 epochs with a learning rate of 0.01.
If you are familiar with traditional Convolutional Neural Networks (CNNs) that are applied to images, graph machine learning models use similar convolutional layers to propagate information along the edges of an input graph. Graph convolutional layers take as input the features of nodes connected together by edges and propagate, transform, and aggregate those features similar to an image convolutional layer. The difference is that graph convolutional layers have their computational graphs at each step defined by the unique structure of the graph and its edge connections, whereas image convolutional layers have a fixed size kernel to capture the pixels of the image.
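To make the "propagate, transform, and aggregate" description concrete, here is a minimal NumPy sketch of a single graph convolution in the form given by Kipf and Welling: add self-loops to the adjacency matrix, normalize it symmetrically by node degree, then multiply by the node features and a weight matrix. The toy graph, the all-ones features (mirroring the constant features used in this tutorial) and the weight values are made up for illustration. PyG's `GCNConv` works on sparse edge lists and learns its weights, so treat this as a conceptual sketch rather than its implementation.

```python
import numpy as np

# Toy undirected graph with 4 nodes arranged in a path: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = A + np.eye(4)                     # add self-loops
deg = A_hat.sum(axis=1)                   # node degrees, counting the self-loop
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization

X = np.ones((4, 1))                       # one constant feature per node
W = np.array([[0.5, -1.0]])               # made-up 1 x 2 weight matrix

H = np.maximum(A_norm @ X @ W, 0)         # aggregate, transform, then ReLU
print(H)
```

Even though every node starts with the same feature, the end nodes and the middle nodes come out with different values because their neighborhoods differ. With constant input features, that structural difference is the only signal the network has to work with.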
We then apply batch normalization, which takes the outputs of a hidden layer and normalizes them before they are passed as input to the next hidden layer. The mean and standard deviation are calculated per dimension over the mini-batch (here, all nodes in the graph, since we pass the whole graph through the model at once), and γ and β are learnable parameter vectors. During training, this layer also keeps running estimates of the computed mean and variance, which are then used for normalization during evaluation.
Additionally, we use a ReLU activation function, which helps mitigate the vanishing gradient problem, and a dropout layer that randomly zeroes some of the elements of the input tensor during training to reduce overfitting on the training data. Now that we’ve defined the GCN model architecture, we can move on to our process for training it.
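The normalization step can be sketched in a few lines of NumPy. This illustrates the training-time computation only, with made-up activations; `gamma`, `beta` and `eps` stand for the learnable scale, the learnable shift and the small constant added for numerical stability, and the running statistics used at evaluation time are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up hidden activations: 6 nodes, 3 hidden dimensions
h = rng.normal(loc=5.0, scale=2.0, size=(6, 3))

gamma = np.ones(3)   # learnable scale, initialized to 1
beta = np.zeros(3)   # learnable shift, initialized to 0
eps = 1e-5           # small constant for numerical stability

mean = h.mean(axis=0)  # per-dimension mean over the batch
var = h.var(axis=0)    # per-dimension (biased) variance over the batch
h_norm = (h - mean) / np.sqrt(var + eps)
out = gamma * h_norm + beta

print(out.mean(axis=0))  # approximately 0 in every dimension
print(out.std(axis=0))   # approximately 1 in every dimension
```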
Model Training
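def train(model, data, train_idx, optimizer, loss_fn):
    model.train()  # training mode: dropout active, batch norm uses batch statistics
    loss = 0
    x, edge_index = data.x, data.edge_index
    optimizer.zero_grad()  # clear gradients left over from the previous step
    pred = model(x, edge_index)  # forward pass over the whole graph
    # Compute the loss on the training nodes only
    pred = pred[train_idx]
    label = data.y[train_idx]
    loss = loss_fn(pred, label.squeeze())
    loss.backward()  # backpropagate
    optimizer.step()  # update the parameters
    return loss.item()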
PyTorch accumulates gradients across backward passes, so we need to explicitly set the gradients to zero before each backpropagation step. Here, that means calling optimizer.zero_grad() at the start of every training step. Otherwise, the gradient used for the parameter update would be the sum of the newly computed gradient and the old gradients, which were already used to update the parameters.
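The accumulation behavior is easy to see on a toy example. In the sketch below (a made-up two-parameter tensor `w`, not part of the model), calling `backward()` twice without clearing the gradient doubles the stored value, while zeroing it first leaves only the gradient of the current loss.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(w^2) is 2w
(w ** 2).sum().backward()
first = w.grad.clone()
print(first)        # tensor([2., 4.])

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
accumulated = w.grad.clone()
print(accumulated)  # tensor([4., 8.])

# Zero the gradient, then run backward again: only the fresh gradient remains
w.grad.zero_()
(w ** 2).sum().backward()
fresh = w.grad.clone()
print(fresh)        # tensor([2., 4.])
```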
Results
Ultimately, after 100 epochs of training, the best accuracy was 64.93% on the train set, 63.76% on the validation set, and 65.10% on the test set for predicting the correct political label of each blog node. Since this is essentially a binary classification task between our two political affiliations, there is definitely room for improvement here. This is a good opportunity to talk a little more about the options we have for improving performance.
On the lower-lift side, tuning our hyperparameters can be a simple but effective way of homing in on the right conditions for our prediction task. Tools such as Ray Tune [https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html] can be helpful here.
There are also other model architectures out there; Graph Convolutional Networks are one of the simpler ones. We can absolutely leverage architectures like GraphSAGE or Graph Attention Networks, which use more mathematically complex mechanisms and can yield higher performance in applications that take full advantage of those mechanisms. However, we do need to be cognizant of what these architectures are good for before we simply throw them at any use case.
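To put these accuracies in context, it helps to compare them with a trivial baseline. The exploration above suggests that the first 758 of the 1,490 blogs belong to class 0, so always guessing that class would be right about 51% of the time. The sketch below computes that baseline and shows one common way to turn per-class scores into an accuracy; the `accuracy` helper and the toy arrays are illustrative and not taken from the Colab.

```python
import numpy as np

# Majority-class baseline from the class counts found during exploration
num_nodes, num_class_0 = 1490, 758
baseline = max(num_class_0, num_nodes - num_class_0) / num_nodes
print(f"{baseline:.2%}")  # 50.87%


def accuracy(scores, labels):
    """Fraction of nodes whose highest-scoring class matches the label."""
    return float((scores.argmax(axis=1) == labels).mean())


# Made-up log-probabilities for 4 nodes and their true labels
toy_scores = np.log(np.array([[0.9, 0.1],
                              [0.4, 0.6],
                              [0.7, 0.3],
                              [0.2, 0.8]]))
toy_labels = np.array([0, 1, 1, 1])
print(accuracy(toy_scores, toy_labels))  # 0.75
```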
You can read more about GraphSAGE here: https://snap.stanford.edu/graphsage/
and Graph Attention Networks here: https://petar-v.com/GAT/
Iteration and gradual improvement are a large part of building sensible technical solutions for large computational social science applications. Hopefully this tutorial has shed some light on the basic components of graph learning methods, how we can use them in this space, and how we might improve our results to better characterize the phenomena we seek to understand.
Link to Colab: https://colab.research.google.com/drive/1W47b3ouD7s-oBRIyu6Urg57raMaTqY2a?usp=sharing
References
[1] Adamic, L. A., & Glance, N. (2005, August). The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery (pp. 36–43).
[2] Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in neural information processing systems, 30.
[3] Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., & Stoica, I. (2018). Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
[4] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.