Large-scale graph representation learning
with very deep GNNs and self-supervision
Abstract
Effectively and efficiently deploying graph neural networks (GNNs) at scale remains one of the most challenging aspects of graph representation learning. Many powerful solutions have only ever been validated on comparatively small datasets, often with counterintuitive outcomes—a barrier which has been broken by the Open Graph Benchmark Large-Scale Challenge (OGB-LSC). We entered the OGB-LSC with two large-scale GNNs: a deep transductive node classifier powered by bootstrapping, and a very deep (up to 50-layer) inductive graph regressor regularised by denoising objectives. Our models achieved award-level (top-3) performance on both the MAG240M and PCQM4M benchmarks. In doing so, we demonstrate evidence of scalable self-supervised graph representation learning, and utility of very deep GNNs—both very important open issues. Our code is publicly available at: https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc.
Keywords: OGB-LSC · MPNNs · Graph Networks · BGRL · Noisy Nodes
1 Introduction
Effective high-dimensional representation learning necessitates properly exploiting the geometry of data (Bronstein et al., 2021)—otherwise, it is a cursed estimation problem. Indeed, early success stories of deep learning relied on imposing strong geometric assumptions, primarily that the data lives on a grid domain, either spatial or temporal. In these two respective settings, convolutional neural networks (CNNs) (LeCun et al., 1998) and recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997) have traditionally dominated.
While both CNNs and RNNs are demonstrably powerful models, with many applications of high interest, it can be recognised that most data coming from nature cannot be natively represented on a grid. Recent years have been marked by a gradual shift of attention towards models that admit a more generic class of geometric structures (Masci et al., 2015; Veličković et al., 2017; Cohen et al., 2018; Battaglia et al., 2018; de Haan et al., 2020; Satorras et al., 2021).
In many ways, the most generic and versatile of these models are graph neural networks (GNNs). This is due to the fact that most discrete-domain inputs can be observed as instances of a graph structure. The corresponding area of graph representation learning (Hamilton, 2020) has already seen immense success across industrial and scientific disciplines. GNNs have successfully been applied to drug screening (Stokes et al., 2020), modelling the dynamics of glass (Bapst et al., 2020), web-scale social network recommendations (Ying et al., 2018) and chip design (Mirhoseini et al., 2020).
While the above results are certainly impressive, they likely only scratch the surface of what is possible with well-tuned GNN models. Many problems of real-world interest require graph representation learning at scale: either in terms of the amount of graphs to process, or their sizes (in terms of numbers of nodes and edges). Perhaps the clearest motivation for this comes from the Transformer family of models (Vaswani et al., 2017). Transformers operate a self-attention mechanism over a complete graph, and can hence be observed as a specific instance of GNNs (Joshi, 2020). At very large scales of natural language data, Transformers have demonstrated significant returns with the increase in capacity, as exemplified by models such as GPT-3 (Brown et al., 2020). Transformers enjoy favourable scalability properties at the expense of their functional complexity: each node's features are updated with weighted sums of neighbouring node features. In contrast, GNNs that rely on message passing (Gilmer et al., 2017)—passing vector signals across edges that are conditioned on both the sender and receiver nodes—are an empirically stronger class of models, especially on tasks requiring complex reasoning (Veličković et al., 2019) or simulations (Sanchez-Gonzalez et al., 2020; Pfaff et al., 2020).
One reason why generic message-passing GNNs have not been scaled up as widely as Transformers is the lack of appropriate datasets. Only recently has the field advanced from simple transductive benchmarks of only a few thousand nodes (Sen et al., 2008; Shchur et al., 2018; Morris et al., 2020) towards larger-scale real-world and synthetic benchmarks (Dwivedi et al., 2020; Hu et al., 2020), but important issues still remain. For example, on many of these tasks, randomly-initialised GNNs (Veličković et al., 2018), shallow GNNs (Wu et al., 2019) or simple label-propagation-inspired GNNs (Huang et al., 2020) can perform near the state-of-the-art level at only a fraction of the parameters. When most bleeding-edge expressive methods are unable to improve on the above, this can often lead to controversial discussion in the community. One common example is: do we even need deep, expressive GNNs?
Breakthroughs in deep learning research have typically been spearheaded by impactful large-scale competitions. For image recognition, the most famous example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015). In fact, the very "deep learning revolution" has partly been kick-started by the success of the AlexNet CNN model of Krizhevsky et al. (2012) at the ILSVRC 2012, firmly establishing deep CNNs as the workhorse of image recognition for the forthcoming decade.
Accordingly, we have entered the recently proposed Open Graph Benchmark Large-Scale Challenge (OGB-LSC) (Hu et al., 2021). OGB-LSC provides graph representation learning tasks at a previously unprecedented scale—millions of nodes, billions of edges, and/or millions of graphs. Further, the tasks have been designed with immediate practical relevance in mind, and it has been verified that expressive GNNs are likely to be necessary for strong performance. Here we detail our two submitted models (for the MAG240M and PCQM4M tasks, respectively), and our empirical observations while developing them. Namely, we find that the datasets' immense scale provides a great platform for demonstrating clear outperformance of very deep GNNs (Godwin et al., 2021), as well as self-supervised GNN setups such as bootstrapping (Thakoor et al., 2021). In doing so, we have provided meaningful evidence towards a positive resolution to the above discussion: deep and expressive GNNs are, indeed, necessary at the right level of task scale and/or complexity. Our final models have achieved award-level (top-3) ranking on both MAG240M and PCQM4M.
2 Dataset description
The MAG240M-LSC dataset is a transductive node classification dataset, based on the Microsoft Academic Graph (MAG) (Wang et al., 2020a). It is a heterogeneous graph containing paper, author and institution nodes, with edges representing the relations between them: paper-cites-paper, author-writes-paper, author-affiliated-with-institution. All paper nodes are endowed with 768-dimensional input features, corresponding to the RoBERTa sentence embedding (Liu et al., 2019; Reimers and Gurevych, 2019) of their title and abstract. MAG240M is currently the largest-scale publicly available node classification dataset by a wide margin, at 240 million nodes and 1.8 billion edges. The aim is to classify the 1.4 million arXiv papers into their corresponding topics, according to a temporal split: papers published up to 2018 are used for training, with validation and test sets including papers from 2019 and 2020, respectively.
The PCQM4M-LSC dataset is an inductive graph regression dataset based on the PubChemQC project (Nakata and Shimazaki, 2017). It consists of 4 million small molecules (described by their SMILES strings). The aim is to accelerate quantum-chemical computations: especially, to predict the HOMO-LUMO gap of each molecule. The HOMO-LUMO gap is one of the most important quantum-chemical properties, since it is related to a molecule's reactivity, photoexcitation, and charge transport. The ground-truth labels for every molecule were obtained through expensive DFT (density functional theory) calculations, which may take several hours per molecule. It is believed that machine learning models, such as GNNs over the molecular graph, may obtain useful approximations to DFT at only a fraction of the computational cost, if provided with sufficient training data (Gilmer et al., 2017). The molecules are split with an 80:10:10 ratio into training, validation and test sets, based on their PubChem ID.
3 GNN Architectures
For both of the tasks above, we rely on a common encode-process-decode blueprint (Hamrick et al., 2018). This implies that our input features are encoded into a latent space using node-, edge- and graph-wise encoder functions, and latent features are decoded into node-, edge- and graph-level predictions using appropriate decoder functions. The bulk of the computational processing is powered by a processor network, which performs multiple graph neural network layers over the encoded latents.
To formalise this, assume that our input graph, $G$, has node features $x_u$, edge features $x_{uv}$ and graph-level features $x_g$, for nodes $u$ and edges $(u, v)$. Our encoder functions $f_n$, $f_e$ and $f_g$ then transform these inputs into the latent space:

$$h_u^{(0)} = f_n(x_u) \qquad h_{uv}^{(0)} = f_e(x_{uv}) \qquad h_g^{(0)} = f_g(x_g) \quad (1)$$

Our processor network then transforms these latents over several rounds of message passing:

$$H^{(t+1)} = P\left(H^{(t)}\right) \quad (2)$$

where $H^{(t)} = \left(\{h_u^{(t)}\}, \{h_{uv}^{(t)}\}, h_g^{(t)}\right)$ contains all of the latents at a particular processing step $t$.

The processor network is iterated for $T$ steps, recovering final latents $H^{(T)}$. These can then be decoded into node-, edge-, and graph-level predictions (as required), using analogous decoder functions $g_n$, $g_e$ and $g_g$:

$$y_u = g_n\left(h_u^{(T)}\right) \qquad y_{uv} = g_e\left(h_{uv}^{(T)}\right) \qquad y_g = g_g\left(h_g^{(T)}\right) \quad (3)$$

We will detail the specific design of $f$, $P$ and $g$ in the following sections. Generally, $f$ and $g$ are simple MLPs, whereas we use highly expressive GNNs for $P$ in order to maximise the advantage of the large-scale datasets. Specifically, we use message passing neural networks (MPNNs) (Gilmer et al., 2017) and graph networks (GNs) (Battaglia et al., 2018). All of our models have been implemented using the jraph library (Godwin et al., 2020).
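As a concrete illustration, the encode-process-decode blueprint above can be sketched in a few lines. This is a minimal NumPy sketch with tiny stand-in MLPs; the actual models are implemented in jraph/JAX, and all layer sizes and the residual processor step here are hypothetical:

```python
import numpy as np

def mlp(params, x):
    # Two-layer MLP with ReLU, standing in for the encoder/decoder MLPs.
    (w1, b1), (w2, b2) = params
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

def init_mlp(rng, d_in, d_hid, d_out):
    return [(rng.standard_normal((d_in, d_hid)) * 0.1, np.zeros(d_hid)),
            (rng.standard_normal((d_hid, d_out)) * 0.1, np.zeros(d_out))]

def encode_process_decode(node_x, enc, proc, dec, num_steps):
    # Encode inputs into the latent space, run T processor steps, decode.
    h = mlp(enc, node_x)           # Equation (1), node part
    for _ in range(num_steps):     # Equation (2): H <- P(H)
        h = h + mlp(proc, h)       # hypothetical residual processor step
    return mlp(dec, h)             # Equation (3), node part

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))    # 5 nodes, 8 input features
enc = init_mlp(rng, 8, 16, 16)
proc = init_mlp(rng, 16, 16, 16)
dec = init_mlp(rng, 16, 16, 3)     # e.g. 3 output classes
y = encode_process_decode(x, enc, proc, dec, num_steps=4)
```

The real processors are the MPNN and GN blocks described in the following sections, which also exchange information along edges rather than treating nodes independently.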
4 MAG240M-LSC
Subsampling
Running graph neural networks over datasets that are even a fraction of MAG240M's scale is already prone to multiple scalability issues, which necessitated either aggressive subsampling (Hamilton et al., 2017; Chen et al., 2018; Zeng et al., 2019; Zou et al., 2019), graph partitioning (Liao et al., 2018; Chiang et al., 2019) or less expressive GNN architectures (Rossi et al., 2020; Bojchevski et al., 2020; Yu et al., 2020).
As we would like to leverage expressive GNNs and be able to pass messages across any partitions, we opted for the subsampling approach. Accordingly, we subsample moderately-sized patches around the nodes we wish to compute latents for, execute our GNN model over them, and use the latents at the central nodes to train or evaluate the model.
We adapt the standard GraphSAGE subsampling algorithm of Hamilton et al. (2017), but make several modifications to optimise it for the specific features of MAG240M. Namely:

- We perform separate subsampling procedures across edge types. For example, an author node will separately sample a pre-specified number of papers written by that author and a pre-specified number of institutions that author is affiliated with.
- GraphSAGE mandates sampling an exact number of neighbours for every node, and uses sampling with replacement to achieve this even when the neighbourhood size is variable. We find this to be wasteful for smaller neighbourhoods, and hence use our pre-specified neighbour counts only as an upper bound. Denoting this upper bound as $K$, and node $u$'s original neighbourhood as $\mathcal{N}_u$ (both defined on a per-edge-type basis, as per the previous bullet point), we proceed as follows:
  - For nodes that have fewer neighbours of a particular type than the upper bound ($|\mathcal{N}_u| \leq K$), we simply take the entire neighbourhood, without any subsampling;
  - For nodes that have a moderate number of neighbours, we subsample $K$ neighbours without replacement, so that we do not wastefully duplicate nodes when the memory costs are reasonable;
  - For all other nodes, we resort to the usual GraphSAGE strategy and sample $K$ neighbours with replacement, which doesn't require an additional row-copy of the adjacency matrix.
- GraphSAGE directs the edges in the patch from the subsampled neighbours to the node which sampled them, and runs its GNN for exactly as many steps as the sampling depth. We instead modify the message passing update rule to scalably make the edges bidirectional, which naturally allows us to run deeper GNNs over the patch. The exact way in which we perform this is detailed in the model architecture.
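The three-case neighbour sampling strategy above can be sketched as follows. This is a minimal sketch; the `moderate_factor` cutoff deciding when sampling without replacement is still memory-affordable is a hypothetical stand-in, as the exact threshold is not specified here:

```python
import numpy as np

def sample_neighbours(rng, neighbourhood, budget, moderate_factor=4):
    """Budgeted neighbour sampling in the spirit of the modified GraphSAGE
    strategy: the budget is an upper bound rather than an exact count."""
    n = len(neighbourhood)
    if n <= budget:
        # Small neighbourhood: take everything, no subsampling.
        return list(neighbourhood)
    if n <= moderate_factor * budget:
        # Moderate neighbourhood: sample without replacement (no duplicates).
        return list(rng.choice(neighbourhood, size=budget, replace=False))
    # Large neighbourhood: usual GraphSAGE-style sampling with replacement.
    return list(rng.choice(neighbourhood, size=budget, replace=True))

rng = np.random.default_rng(0)
small = sample_neighbours(rng, [1, 2, 3], budget=5)          # kept whole
moderate = sample_neighbours(rng, list(range(12)), budget=5)  # no duplicates
```

In the full pipeline this routine would be applied once per edge type, with a separate budget for each.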
Taking all of the above into account, our model's subsampling strategy proceeds from paper nodes as central nodes, up to a depth of two (sufficient for institution nodes to become included). We did not observe significant benefits from sampling deeper patches. Instead, we sample significantly larger patches than the original GraphSAGE paper, to exploit the wide context available for many nodes:

- Depth 0 contains the chosen central paper node.
- At depth 1, we sample citing papers, cited papers, and authors of the central paper, each up to its pre-specified budget.
- At depth 2, we proceed as follows for all paper and author nodes sampled at depth 1:
  - For papers, we apply an identical strategy as at depth 1: cited papers, citing papers, and authors, up to the same budgets;
  - For authors, we sample written papers and affiliated institutions, each up to its pre-specified budget.

Overall, this inflates our maximal patch size to a node count comparable with traditional full-graph datasets (Sen et al., 2008; Shchur et al., 2018). Coupled with the fact that MAG240M has hundreds of millions of papers to sample these patches from, our setting enables transductive node classification at a previously unexplored scale. We have found that such large patches were indeed necessary for our model's performance.
One final important remark for MAG240M subsampling concerns the existence of duplicated paper nodes—i.e. nodes with exactly the same RoBERTa embeddings. This likely corresponds to identical papers submitted to different venues (e.g. conference, journal, arXiv). For the purposes of enriching our subsampled patches, we have combined the adjacency matrix rows and columns to “fuse” all versions of duplicated papers together.
Input preprocessing
As just described, we seek to support execution of expressive GNNs on large quantities of large-scale subsampled patches. This places further stress on the model from a computational and storage perspective. Accordingly, we found it very useful to further compress the input nodes' RoBERTa features. Our qualitative analysis demonstrated that their 129-dimensional PCA projections already account for 90% of their variance. Hence we leverage these PCA vectors as the actual input paper node features.
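A minimal sketch of this kind of compression, choosing the smallest number of principal components that reach a target explained-variance ratio (shown on toy low-rank data; the actual inputs are 768-dimensional RoBERTa embeddings, compressed to 129 dimensions):

```python
import numpy as np

def pca_components_for_variance(X, threshold=0.9):
    """Project X onto the smallest number of principal components whose
    cumulative explained variance reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)           # per-component variance ratio
    k = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
    return Xc @ Vt[:k].T, k

rng = np.random.default_rng(0)
# Toy stand-in for the RoBERTa features: low-rank signal plus small noise.
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 64)) \
    + 0.01 * rng.standard_normal((200, 64))
Z, k = pca_components_for_variance(X, threshold=0.9)
```

On the full dataset such a projection would be fitted once and stored, so that each patch only ever touches the compressed features.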
Further, only the paper nodes are actually provided with any features. We adopt the identical strategy from the baseline LSC scripts provided by Hu et al. (2021) to featurise the authors and institutions. Namely, for authors, we use the average PCA features across all papers they wrote. For institutions, we use the average features across all the authors affiliated with them. We found this to be a simple and effective strategy that performed empirically better than using structural features. This is contrary to the findings of Yu et al. (2020), probably because we use a more expressive GNN.
Besides the PCA-based features, our input node features also contain the one-hot representation of the node's type (paper/author/institution), the node's depth in the sampled patch (0/1/2), and a bitwise representation of the paper's publication year (zeroed out for other nodes). Lastly, and in line with an increasing body of research that argues for the utility of labels in transductive node classification tasks (Zhu and Ghahramani, 2002; Stretcu et al., 2019; Huang et al., 2020), we use the arXiv paper labels as features (Wang et al., 2021) (zeroed out for other nodes). We make sure that the validation labels are not observed at training time, and that the central node's own label is not provided. It is possible to sample the central node again at depth 2, and we make sure to mask out its label if this happens.
We also endow the patches' edges with a simple edge-type feature, $x_{uv}$. It is a 7-bit binary feature, where the first three bits indicate the one-hot type of the sampling node (paper, author or institution) and the next four bits indicate the one-hot type of the sampled neighbour (cited paper, citing paper, author or institution). We found running a standard GNN over these edge-type features more performant than running a heterogeneous GNN—once again contrary to existing baseline results (Hu et al., 2021), and likely because of the expressivity of our processor GNN.
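The 7-bit edge-type feature can be sketched directly. The type names follow the description above; the exact bit ordering within each one-hot group is an assumption:

```python
SAMPLER_TYPES = ["paper", "author", "institution"]      # first three bits
NEIGHBOUR_TYPES = ["cited_paper", "citing_paper",       # next four bits
                   "author", "institution"]

def edge_type_feature(sampler, neighbour):
    # One-hot of the sampling node's type, concatenated with the
    # one-hot of the sampled neighbour's type: 3 + 4 = 7 bits.
    bits = [0] * 7
    bits[SAMPLER_TYPES.index(sampler)] = 1
    bits[3 + NEIGHBOUR_TYPES.index(neighbour)] = 1
    return bits

feat = edge_type_feature("paper", "citing_paper")  # -> [1, 0, 0, 0, 1, 0, 0]
```

Exactly two bits are ever set, one per group, so a standard GNN can read off the heterogeneous relation type from the edge feature alone.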
Model architecture
For the GNN architecture we have used on MAG240M, our encoders and decoders are both two-layer MLPs, with a hidden size of 512 features. The node and edge encoders' output layers compute 256 features, and we retain this dimensionality for $h_u^{(t)}$ and $h_{uv}^{(t)}$ across all steps $t$.
Our processor network is a deep message-passing neural network (MPNN) (Gilmer et al., 2017). It computes message vectors, $m_{uv}^{(t)}$, to be sent across the edge $(u, v)$, and then aggregates them in the receiver nodes as follows:

$$m_{uv}^{(t)} = \psi_t\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(0)}\right) \quad (4)$$

$$h_u^{(t+1)} = \phi_t\left(h_u^{(t)}, \sum_{v \in \mathcal{N}_u} m_{vu}^{(t)}, \sum_{v \in \mathcal{N}_u} m_{uv}^{(t)}\right) \quad (5)$$

Taken together, Equations 4–5 fully specify the operations of the network in Equation 2. The message function $\psi_t$ and the update function $\phi_t$ are both two-layer MLPs, with identical hidden and output sizes to the encoder network. We note two specific aspects of the chosen MPNN:

- We did not find it useful to use global latents or to update edge latents (Equation 4 uses $h_{uv}^{(0)}$ at all times and does not include $h_g$). This is likely due to the fact that the prediction is strongly centred at the central node, and that the edge features and types do not encode additional information.
- Note the third input in Equation 5, which is not usually included in MPNN formulations. In addition to pooling all incoming messages, we also pool all outgoing messages a node sends, and concatenate that onto the input to the sender node's update function. This allowed us to simulate bidirectional edges without introducing additional scalability issues, allowing us to prototype MPNNs whose depth exceeded the subsampling depth.
The process is repeated for $T$ message passing layers, after which the latent $h_u^{(T)}$ for the central node is sent to the decoder network for predictions.
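One such bidirectional message passing step (in the spirit of Equations 4–5) might be sketched as follows, with simple linear maps standing in for the MLP message and update functions:

```python
import numpy as np

def mpnn_step(h, senders, receivers, psi, phi):
    """One message passing step: messages are conditioned on sender and
    receiver, then each node pools its incoming AND outgoing messages
    (the latter simulating bidirectional edges without duplicating the
    edge list). `psi`/`phi` stand in for the MLP message/update functions."""
    msgs = psi(np.concatenate([h[senders], h[receivers]], axis=-1))
    d = msgs.shape[-1]
    incoming = np.zeros((h.shape[0], d))
    outgoing = np.zeros((h.shape[0], d))
    np.add.at(incoming, receivers, msgs)   # sum of messages a node receives
    np.add.at(outgoing, senders, msgs)     # sum of messages a node sends
    return phi(np.concatenate([h, incoming, outgoing], axis=-1))

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))            # 4 nodes, 8 latent features
senders = np.array([0, 1, 2])              # directed edges 0->1, 1->2, 2->3
receivers = np.array([1, 2, 3])
W_psi = rng.standard_normal((16, 8)) * 0.1
W_phi = rng.standard_normal((24, 8)) * 0.1
h_new = mpnn_step(h, senders, receivers, lambda z: z @ W_psi, lambda z: z @ W_phi)
```

Note how node 0, which has no incoming edges in the directed patch, still receives signal through its pooled outgoing messages.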
Bootstrapping objective
The non-arXiv papers within MAG240M are unlabelled and hence, under a standard node classification training regime, would contribute only implicitly to the learning algorithm (as neighbours of labelled papers). Early work on self-supervised graph representation learning (Veličković et al., 2018) had already shown this could be a wasteful approach, even on small-scale transductive benchmarks. Appropriately using the unlabelled nodes can provide the model with a wealth of information about the feature and network structure, which cannot easily be recovered from supervision alone. On a dataset like MAG240M—which contains 120× more unlabelled papers than labelled ones—we have been able to observe significant gains from deploying such methods.
Specifically, we leverage bootstrapped graph latents (BGRL) (Thakoor et al., 2021), a recently-proposed scalable method for self-supervised learning on graphs. Rather than contrasting several node representations across multiple views, BGRL bootstraps the GNN to make a node's embeddings predictive of its embeddings from another view, under a target GNN. The target network's parameters are always set to an exponential moving average (EMA) of the GNN parameters. Formally, let $\tilde{f}_n$ and $\tilde{P}$ be the target versions of the encoder and processor networks (periodically updated to the EMA of $f_n$'s and $P$'s parameters), and let $(X_1, A_1)$ and $(X_2, A_2)$ be two views of an input patch (differing in features, adjacency structure, or both). Then, BGRL performs the following computations:

$$H^{(T)} = \mathrm{GNN}_{f_n, P}(X_1, A_1) \qquad \tilde{H}^{(T)} = \mathrm{GNN}_{\tilde{f}_n, \tilde{P}}(X_2, A_2) \quad (6)$$

where $\mathrm{GNN}$ is shorthand for applying Equation 1, followed by repeatedly applying Equations 4–5 for $T$ steps. The BGRL loss is then optimised to make the central node embedding $h_u^{(T)}$ predictive of its counterpart, $\tilde{h}_u^{(T)}$. This is done by projecting $h_u^{(T)}$ to another representation using a projector network, $p$, as follows:

$$z_u = p\left(h_u^{(T)}\right) \quad (7)$$

where $p$ is a two-layer MLP with identical hidden and output size as our encoder MLPs. We then optimise the cosine similarity between the projector output and $\tilde{h}_u^{(T)}$:

$$\mathcal{L}_u = \frac{z_u^\top \tilde{h}_u^{(T)}}{\left\|z_u\right\| \left\|\tilde{h}_u^{(T)}\right\|} \quad (8)$$

using stochastic gradient ascent. Once training is completed, the projector network is discarded.
This approach, inspired by BYOL (Grill et al., 2020), eliminates the need for crafting negative samples, reduces the storage requirements of the model, and its pointwise loss aligns nicely with our patchwise learning setting, as we can focus on performing the bootstrapping objective on each central node separately. All of this made BGRL a natural choice in our setting, and we have found that we can easily apply it at scale.
Previously, BGRL had been applied on moderately-sized graphs with less expressive GNNs, showing modest returns. Conversely, we find that the benefits of BGRL are truly demonstrated with stronger GNNs in the large-scale setting of MAG240M. Not only did BGRL monotonically improve as we increased the proportion of unlabelled-to-labelled nodes during training; it also consistently outperformed relevant self-supervised GNNs such as GRACE (Zhu et al., 2020).
Ultimately, our submitted model is trained with an auxiliary BGRL objective, with each batch containing a fixed ratio of unlabelled to labelled node patches. Just as in the BGRL paper, we obtain the two input patch views by applying dropout (Srivastava et al., 2014) on the input features and DropEdge (Rong et al., 2019), independently on each view. The target network ($\tilde{f}_n$, $\tilde{P}$) parameters are updated with a fixed EMA decay rate.
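A minimal sketch of the BGRL machinery (EMA target update, projector, and cosine-similarity objective), with single linear maps as hypothetical stand-ins for the encoder/processor and projector networks:

```python
import numpy as np

def ema_update(target_params, online_params, decay):
    # Target network parameters track an exponential moving average
    # of the online network's parameters.
    return [decay * t + (1 - decay) * o for t, o in zip(target_params, online_params)]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bgrl_objective(x_view1, x_view2, W_online, W_target, W_proj):
    # Online embedding of view 1 and target embedding of view 2 (cf. Eq. 6),
    # projector (cf. Eq. 7), and cosine similarity to maximise (cf. Eq. 8).
    h = x_view1 @ W_online
    h_target = x_view2 @ W_target   # no gradients flow here in practice
    z = h @ W_proj
    return cosine_similarity(z, h_target)

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 8))   # two augmented views of one node
W_online = rng.standard_normal((8, 4))
W_target = W_online.copy()
W_proj = rng.standard_normal((4, 4))
score = bgrl_objective(x1, x2, W_online, W_target, W_proj)
[W_target] = ema_update([W_target], [W_online], decay=0.99)
```

In the full model the views come from dropout and DropEdge, the embedding functions are the full encoder-plus-MPNN stacks, and only the central node's embedding enters the objective.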
Training and regularisation
We train our GNN to minimise the cross-entropy for predicting the correct topic over the labelled central nodes in the training patches, added together with the BGRL objective for the unlabelled central nodes. We use the AdamW SGD optimiser (Loshchilov and Hutter, 2017) with weight decay, and a cosine learning rate schedule with linear warmup. Optimisation is performed over dynamically-batched data: we fill up each training minibatch with sampled patches until any of the following limits is exceeded: a maximum number of nodes, edges, or patches.
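The dynamic batching scheme can be sketched as follows. This is one possible implementation (which stops just before a limit would be exceeded), and the limit values are hypothetical, as the actual budgets are not reproduced here:

```python
def dynamic_batches(patches, max_nodes, max_edges, max_patches):
    """Fill each minibatch with patches until adding the next one would
    exceed any of the node / edge / patch-count limits. Each patch is a
    (num_nodes, num_edges) pair."""
    batch, nodes, edges = [], 0, 0
    for patch in patches:
        n, e = patch
        if batch and (nodes + n > max_nodes or edges + e > max_edges
                      or len(batch) + 1 > max_patches):
            yield batch
            batch, nodes, edges = [], 0, 0
        batch.append(patch)
        nodes += n
        edges += e
    if batch:
        yield batch

batches = list(dynamic_batches([(30, 60), (40, 80), (50, 100), (10, 20)],
                               max_nodes=80, max_edges=200, max_patches=4))
```

Because patch sizes vary widely, this keeps device memory usage near its budget regardless of how many patches end up in each batch.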
To regularise our model, we perform early stopping on the accuracy over the validation dataset, and apply feature dropout and DropEdge (Rong et al., 2019) at every message passing layer of the GNN. We further apply layer normalisation (Ba et al., 2016) to the intermediate outputs of all of our MLP modules.
Evaluation
At evaluation time, we take advantage of the transductive and subsampled learning setup to enhance our predictions even further. First, we make sure that the model has access to all validation labels as inputs at test time, as this knowledge may be highly indicative. Further, we make sure that any "fused" copies of duplicated nodes also provide that same label as input. As our predictions are potentially conditioned on the specific topology of the subsampled patch, for each test node we average our predictions over 50 subsampled patches—an ensembling trick which consistently improved our validation performance. Lastly, given that we already use EMA as part of BGRL's target network, we use the EMA parameters for our evaluation predictions, as they are typically slightly more stable.
5 PCQM4M-LSC
Input preprocessing
For featurising our molecules within PCQM4M, we initially follow the baseline scripts provided by Hu et al. (2021) to convert SMILES strings into molecular graphs. Therein, every node is represented by a 9-dimensional feature vector, $x_u$, including properties such as atomic number and chirality. Further, every edge is endowed with 3-dimensional features, $x_{uv}$, including bond types and stereochemistry. Mirroring prior work with GNNs for quantum-chemical computations (Gilmer et al., 2017), we found it beneficial to maintain graph-level features (in the form of a "master node"), which we initialise at $\mathbf{0}$.
As will soon become apparent, our experiments on the PCQM4M benchmark leveraged GNNs that are substantially deeper than most previously studied GNNs, for quantum-chemical tasks or otherwise. While there is an implicit expectation that computing "cheap" chemical features from the SMILES string (such as molecular fingerprints, partial charges, etc.) should be useful, our experiments clearly demonstrated that most of them do not meaningfully impact the performance of our GNNs. This indicates that very deep GNNs are likely implicitly able to compute such features without additional guidance.
The exception to this has been conformer features, corresponding to approximate three-dimensional coordinates of every atom. These are very expensive to obtain accurately. However, using RDKit (Landrum, 2013), we have been able to obtain conformer estimates that allowed us to attain slightly improved performance with a (slightly) shallower GNN. Specifically, we use the experimental torsion knowledge distance geometry (ETKDGv3) algorithm (Wang et al., 2020b) to recover conformers that satisfy essential geometric constraints, without violating our time limits.
Once conformers are obtained, we do not use their raw coordinates as features—these have many equivalent formulations that depend on the algorithm's initialisation. Instead, we encode their displacements (a 3-dimensional vector recording distances along each axis) and their distances (the scalar norm of the displacement) as additional edge features concatenated with $x_{uv}$. Note that RDKit's algorithm is not powerful enough to extract conformers for every molecule within PCQM4M; for a small fraction of the dataset, the returned conformers will be NaN.
Lastly, we also attempted to use more computationally intensive forms of conformer generation—including energy optimisation using the universal force field (UFF) (Rappé et al., 1992) and the Merck molecular force field (MMFF) (Halgren, 1996). In both cases, we did not observe significant returns compared to using rudimentary conformers.
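The displacement and distance edge features derived from conformers (described above) can be sketched as follows, given per-atom 3D coordinates from any conformer generator:

```python
import numpy as np

def conformer_edge_features(coords, senders, receivers):
    """Given per-atom 3D conformer coordinates, compute for each bonded
    pair the displacement vector and its scalar distance, to be
    concatenated onto the existing edge features. Raw coordinates are
    never used directly, since they depend on the conformer algorithm's
    initialisation."""
    disp = coords[receivers] - coords[senders]            # (E, 3) displacements
    dist = np.linalg.norm(disp, axis=-1, keepdims=True)   # (E, 1) distances
    return np.concatenate([disp, dist], axis=-1)          # (E, 4) extra features

# Toy example: three atoms, two bonds (0-1 and 1-2).
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0]])
senders = np.array([0, 1])
receivers = np.array([1, 2])
feats = conformer_edge_features(coords, senders, receivers)
```

For molecules whose conformers come back as NaN, these extra features are unavailable, which is what motivates the separate non-conformer model below.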
Model architecture
For the GNN architecture we have used on PCQM4M, our encoders and decoders are both three-layer MLPs, computing 512 features in every hidden layer. The node, edge and graph-level encoders' output layers compute 512 features, and we retain this dimensionality for $h_u^{(t)}$, $h_{uv}^{(t)}$ and $h_g^{(t)}$ across all steps $t$.
For our processor network, we use a very deep Graph Network (GN) (Battaglia et al., 2018). Each GN block computes updated node, edge and graph latents, performing aggregations across them whenever appropriate. Fully expanded out, the computations of one GN block can be represented as follows:

$$h_{uv}^{(t+1)} = \psi_t\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(t)}, h_g^{(t)}\right) \quad (9)$$

$$h_u^{(t+1)} = \phi_t\left(h_u^{(t)}, \sum_{v \in \mathcal{N}_u} h_{vu}^{(t+1)}, h_g^{(t)}\right) \quad (10)$$

$$h_g^{(t+1)} = \rho_t\left(\sum_u h_u^{(t+1)}, \sum_{(u,v)} h_{uv}^{(t+1)}, h_g^{(t)}\right) \quad (11)$$

Taken together, Equations 9–11 fully specify the operations of the network in Equation 2. The edge update function $\psi_t$, node update function $\phi_t$ and graph update function $\rho_t$ are all three-layer MLPs, with identical hidden and output sizes to the encoder network.
The process is repeated for $T$ message passing layers, after which the computed latents $h_u^{(T)}$, $h_{uv}^{(T)}$ and $h_g^{(T)}$ are sent to the decoder network for relevant predictions. Specifically, the global latent vector $h_g^{(T)}$ is used to predict the molecule's HOMO-LUMO gap. Our work thus constitutes a successful application of very deep GNNs, providing evidence towards ascertaining the positive utility of such models. We note that, while most prior works on GNN modelling seldom use more than eight steps of message passing (Brockschmidt, 2020), we observe monotonic improvements from deeper GNNs on this task, all the way up to 32 layers, at which point the validation performance plateaus.
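One GN block (in the spirit of Equations 9–11) might be sketched as follows, with simple linear maps standing in for the three-layer MLP update functions:

```python
import numpy as np

def gn_block(h_nodes, h_edges, h_graph, senders, receivers, psi, phi, rho):
    """One Graph Network block: edges update from their two endpoint nodes,
    the edge latent and the graph latent; nodes update from pooled incoming
    edges and the graph latent; the graph latent updates from pooled nodes
    and edges. `psi`/`phi`/`rho` stand in for the three MLPs."""
    n = h_nodes.shape[0]
    g_e = np.broadcast_to(h_graph, (len(senders), h_graph.shape[-1]))
    new_edges = psi(np.concatenate(
        [h_nodes[senders], h_nodes[receivers], h_edges, g_e], axis=-1))
    pooled_in = np.zeros((n, new_edges.shape[-1]))
    np.add.at(pooled_in, receivers, new_edges)   # sum of incoming edge latents
    g_n = np.broadcast_to(h_graph, (n, h_graph.shape[-1]))
    new_nodes = phi(np.concatenate([h_nodes, pooled_in, g_n], axis=-1))
    new_graph = rho(np.concatenate(
        [new_nodes.sum(0), new_edges.sum(0), h_graph], axis=-1))
    return new_nodes, new_edges, new_graph

rng = np.random.default_rng(0)
nodes = rng.standard_normal((3, 4))     # 3 atoms, 4 latent features
edges = rng.standard_normal((2, 4))     # 2 bonds
graph = rng.standard_normal(4)          # "master node" latent
senders, receivers = np.array([0, 1]), np.array([1, 2])
W_e = rng.standard_normal((16, 4)) * 0.1
W_n = rng.standard_normal((12, 4)) * 0.1
W_g = rng.standard_normal((12, 4)) * 0.1
n2, e2, g2 = gn_block(nodes, edges, graph, senders, receivers,
                      lambda z: z @ W_e, lambda z: z @ W_n, lambda z: z @ W_g)
```

Stacking dozens of such blocks, with the final graph latent decoded into the HOMO-LUMO prediction, gives the overall shape of the regressor.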
Non-conformer model
Recalling our prior discussion about conformer features occasionally not being trivially computable, we also trained a GN which does not exploit conformer-based features. While we observe largely the same trends, we find that such models tend to allow for even deeper and wider GNNs before plateauing. Namely, our optimised non-conformer GNN computes 1,024-dimensional hidden features in every MLP, and iterates Equations 9–11 for 50 message passing steps. Such a model performed marginally worse than the conformer GNN overall, while significantly improving the mean absolute error (MAE) on the subset of validation molecules without conformers.
Denoising objective
Our very deep GNNs have, in the first instance, been enabled by careful regularisation. By far, the most impactful method for our GNN regressor on PCQM4M has been Noisy Nodes (Godwin et al., 2021), and our results largely echo the findings therein.
The main observation of Noisy Nodes is that very deep GNNs can be strongly regularised by appropriate denoising objectives. Noisy Nodes perturbs the input node or edge features in a prespecified way, then requires the decoder to reconstruct the unperturbed information from the GNN’s latent representations.
In the case of the flat input features, we have deployed a Noisy Nodes objective on both atom types and bond types: randomly replacing each atom and each bond type with a uniformly sampled one, with a fixed probability. The model then performs node/edge classification based on the final latents (e.g., $h_u^{(T)}$ and $h_{uv}^{(T)}$ for the conformer GNN), to reconstruct the initial types. Requiring the model to correctly infer and rectify such noise implicitly imbues it with knowledge of chemical constraints, such as valence, and is a strong empirical regulariser. Note that, in this discrete-feature setting, Noisy Nodes can be seen as a more general case of the BERT-like objectives from Hu et al. (2019). The main difference is that Noisy Nodes takes a more active role in requiring denoising—as opposed to unmasking, where it is known upfront which nodes have been noised, and the effects of noising are always predictable.
When conformers or displacements are available, a richer class of denoising objectives may be imposed on the GNN. Namely, it is possible to perturb the individual nodes’ coordinates slightly, and then require the network to reconstruct the original displacement and/or distances—this time using edge regression on the output latents of the processor GNN. The Noisy Nodes manuscript had shown that, under such perturbations, it is possible to achieve stateoftheart results on quantum chemical calculations without requiring an explicitly equivariant architecture—only a very deep traditional GNN. Our preliminary results indicate a similar trend on the PCQM4M dataset.
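The discrete Noisy Nodes corruption can be sketched as follows; the corruption probability used here is a hypothetical stand-in, as the actual value is not reproduced in the text:

```python
import numpy as np

def corrupt_types(rng, types, num_types, p):
    """Noisy Nodes-style discrete corruption: each atom (or bond) type is
    replaced by a uniformly sampled type with probability `p`. The model
    is then trained to classify the ORIGINAL types from its final latents,
    so it must detect and rectify the corruption, not merely unmask it."""
    mask = rng.random(len(types)) < p
    random_types = rng.integers(0, num_types, size=len(types))
    noisy = np.where(mask, random_types, types)
    return noisy, types  # (corrupted input, reconstruction targets)

rng = np.random.default_rng(0)
atom_types = np.array([6, 6, 8, 7, 6])   # e.g. C, C, O, N, C
noisy, targets = corrupt_types(rng, atom_types, num_types=119, p=0.3)
```

The coordinate-noising variant works analogously: jitter the conformer positions, then regress the clean displacements/distances at the output edges.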
Training and regularisation
We train our GNN to minimise the mean absolute error (MAE) for predicting the DFT-simulated HOMO-LUMO gap based on the decoded global latent vectors. This objective is combined with any auxiliary tasks imposed by Noisy Nodes (e.g. cross-entropy on reconstructing atom and bond types, MAE on regressing denoised displacements). We use the Adam SGD optimiser (Kingma and Ba, 2014) with a cosine learning rate schedule: linear warmup to a peak learning rate, followed by decay over the remaining training iterations. We optimise over dynamically-batched data: we fill each training minibatch until any of the following limits is exceeded: a maximum number of atoms, bonds, or molecules.
Evaluation
At evaluation time, we exploit several known facts about the HOMO-LUMO gap and our conformer generation procedure to achieve "free" reductions in MAE.
Firstly, it is known that the HOMO-LUMO gap cannot be negative, and it is possible for our model to make (very rare) vastly inflated predictions on validation data if it encounters an out-of-distribution molecule. We ameliorate both of these issues by clipping the network's predictions to a fixed non-negative range.
Secondly, as discussed, RDKit was unable to compute conformers for a very small fraction of molecules. We found it useful to fall back to the 50-layer non-conformer GNN in these cases, rather than assuming a default value. The observed reductions in MAE were significant across those specific validation molecules only.
Finally, we consistently track an exponential moving average (EMA) of our model’s parameters (with a fixed decay rate), and use the EMA parameters for evaluation. EMA parameters are generally known to be more stable than their online counterparts, an observation that held in our case as well.
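One step of the parameter EMA can be sketched as below; the decay rate used in our runs was elided in this copy, so `decay` is a placeholder:

```python
def ema_update(ema_params, online_params, decay):
    """One EMA step, applied leaf-wise over a dict of parameters:
    ema <- decay * ema + (1 - decay) * online."""
    return {name: decay * ema_params[name] + (1.0 - decay) * online_params[name]
            for name in ema_params}
```

The EMA copy is updated after every optimiser step but never trained directly; only the EMA parameters are used at evaluation time.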
6 Ensembling and training on validation
Once we established the top single-model architectures for both our MAG240M and PCQM4M entries, we found it very important to perform two post-processing steps: (a) retraining on the validation set, and (b) ensembling various models together.
Retraining on validation data offers a wealth of additional learning signal, if only through the sheer volume of data available in the OGB-LSC. Beyond this, the way in which the data was split offers further motivation. On MAG240M, for example, the temporal split implies that validation papers (from 2019) are the most relevant to classifying test papers (from 2020)—simply put, because both correspond to the latest trends in scholarship.
However, training on the full validation set comes with a potentially harmful drawback: no held-out dataset would remain to early-stop on. In a setting where overfitting can easily occur, we found this risk to vastly outweigh the rewards. Instead, we randomly partition the validation data into equally-sized folds and perform a cross-validation-style setup: we train one model per fold, each observing the training set plus all other validation folds as its training data, while validating and early-stopping on its held-out fold. Since each model holds out a different fold, combining their respective predictions yields an overall validation estimate over the entire dataset.
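The fold construction can be sketched as follows; ten folds is an assumption on our part, chosen to be consistent with the 20-model ensemble (two seeds per fold) described later:

```python
import numpy as np

def make_validation_folds(num_val, num_folds, seed=0):
    """Randomly partition validation indices into equally-sized folds.
    Model i trains on the training set plus every fold except fold i,
    and early-stops on held-out fold i."""
    perm = np.random.default_rng(seed).permutation(num_val)
    return np.array_split(perm, num_folds)

folds = make_validation_folds(num_val=100, num_folds=10)
held_out = folds[3]  # early-stopping fold for model 3
train_extra = np.concatenate([f for i, f in enumerate(folds) if i != 3])
```

Because every validation example is held out by exactly one model, stitching together the per-fold predictions covers the full validation set without any model scoring data it trained on.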
While this approach may not correspond to the intended dataset splits, we verified that the scores on individual held-out folds match the patterns observed for models that did not observe any validation data. This reassured us that no unintended strong overfitting had occurred as a result.
Another useful outcome of our fold-based approach is that it gives a very natural way to perform ensembling: simply aggregating all of the models’ predictions yields a mixture of experts, as each model was trained on a slightly different training set. Our final ensembled models employ exactly this strategy, with the inclusion of two seeds per fold. This brings our overall number of ensembled models to 20, and these ensembles correspond to our final submissions on both MAG240M and PCQM4M.
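The final aggregation step is a simple average over the per-model predictions; a minimal sketch, assuming each model emits an array of per-example predictions:

```python
import numpy as np

def ensemble_predictions(per_model_preds):
    """Aggregate the fold/seed models (20 overall, two seeds per fold)
    by averaging their predictions element-wise."""
    return np.mean(np.stack(per_model_preds, axis=0), axis=0)
```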
7 Experimental evaluation
In this section we provide experimental evidence to substantiate the various claims we have made about the key modifications in our models, hoping to inform future research on large-scale graph representation learning. To eliminate any possible confounding effects of ensembling, all results reported in this section are for a single model, evaluated on the provided validation data. We report average performance and standard deviation over three seeds.
MAG240M-LSC
We will follow the plots in Figures 1–2, which seek to uncover various contributing factors to our model’s ultimate performance. We proceed one claim at a time.
Making networks deeper than the patch diameter can help. We find that making the edges in every subsampled patch bidirectional allowed doubling the number of message passing steps (to four), with a significant validation accuracy improvement, despite the MPNN now being deeper than the patch diameter. See Figure 1 (left).
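The bidirectional-edge construction can be sketched as below (a NumPy illustration, not our actual pipeline), deduplicating any reciprocal pairs that already exist in the patch:

```python
import numpy as np

def make_bidirectional(senders, receivers):
    """Add the reverse of every edge so that messages can flow both ways,
    dropping any duplicate edges the reversal creates."""
    edges = np.stack([np.concatenate([senders, receivers]),
                      np.concatenate([receivers, senders])], axis=1)
    edges = np.unique(edges, axis=0)  # lexicographically sorted, deduplicated
    return edges[:, 0], edges[:, 1]
```

With edges in both directions, information can bounce back and forth within the patch, which is what lets useful message passing continue past the patch diameter.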
Ensembling over multiple subsamples helps. We find that averaging our network’s predictions over several randomly subsampled patches at evaluation time consistently improved performance. See Figure 1 (middle-left).
Using training labels as features helps. On transductive tasks, we confirm that using the training node label as an additional feature provides a substantial boost to validation performance, if done carefully. See Figure 1 (middle-right).
Larger patches help. Providing the model with a larger context (by subsampling more neighbours) proved significantly helpful to our downstream performance. See Figure 1 (right).
Self-supervised objectives help—especially BGRL. We first validate that combining a traditional cross-entropy loss with a self-supervised loss is beneficial to final performance. Further, we show that BGRL (Thakoor et al., 2021) can significantly outperform GRACE (Zhu et al., 2020) in the large-scale regime. See Figure 2 (left).
Self-supervised learning on unlabelled nodes helps. One of the major promises of self-supervised learning is access to a vast quantity of unlabelled nodes, which can now be used as targets. We recover consistent, monotonic gains from incorporating increasing amounts of unlabelled nodes within our training routine. See Figure 2 (middle).
Self-supervised learning allows for more robust models. Finally, the regularising effect of self-supervised learning means that we can train our models for longer without suffering overfitting effects. See Figure 2 (right).
PCQM4M-LSC
Using conformer-based features helps. Utilising features based on RDKit conformers, in the manner described before, proved beneficial to final performance. Note that the gains over our 50-layer non-conformer model are irrelevant, given that the non-conformer model is only applied to molecules for which conformers cannot be computed. See Figure 3 (top-left and bottom-left).
Deeper models help. We demonstrate consistent, monotonic gains for larger numbers of message passing steps, at least up to 32 layers—and, in the case of the non-conformer model, up to 50 layers. See Figure 3 (top-middle-left and bottom-middle-left).
Noisy Nodes helps. Lastly, we show that the regulariser proposed in Noisy Nodes (Godwin et al., 2021) proved very effective for this quantum-chemical task as well; it was key to the monotonic improvements of our models with depth. Note, for example, that removing Noisy Nodes from our best-performing model makes its performance comparable with models that are at least twice as shallow. See Figure 3 (top-middle-right and bottom-middle-right).
Wider message functions help. Towards the end of the contest, we noted that performance gains are possible when favouring wider message functions (in terms of the hidden size of their MLP layers) over the latent size of the GNN. We subsequently noticed that such a regime (256 latent dimensions, 1,024-dimensional hidden layers) consistently improved our non-conformer model as well. See Figure 3 (top-right and bottom-right).
8 Results and Discussion
Our final ensembled models achieved a validation accuracy of 77.10% on MAG240M, and a validation MAE of 0.110 on PCQM4M. Translated to the LSC test sets, we recover 75.19% test accuracy on MAG240M and 0.1205 test MAE on PCQM4M. We incur a minimal amount of distribution shift, a testament to our principled ensembling and post-processing strategies, in spite of using labels as inputs on MAG240M and training on validation data for both tasks.
Our entries were designated as awardees (ranked in the top-3) on both MAG240M and PCQM4M, solidifying the impact that very deep, expressive graph neural networks can have on large-scale datasets of industrial and scientific relevance. Further, we demonstrate how several recently proposed auxiliary objectives for GNN training, such as BGRL (Thakoor et al., 2021) and Noisy Nodes (Godwin et al., 2021), can be highly impactful at the right dataset scales. We hope that our work serves towards resolving several open disputes in the community, such as the utility of very deep GNNs and the influence of self-supervision in this setting.
In many ways, the OGB has been to graph representation learning what ImageNet has been to computer vision. We hope that OGB-LSC is only the first in a series of events designed to drive research on GNN architectures forward, and we sincerely thank the OGB team for all their hard work and effort in making a contest of this scale possible and accessible.
References

Layer normalization. arXiv preprint arXiv:1607.06450.
Unveiling the predictive power of static structure in glassy systems. Nature Physics 16 (4), pp. 448–454.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Scaling graph neural networks with approximate PageRank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2464–2473.
GNN-FiLM: graph neural networks with feature-wise linear modulation. In International Conference on Machine Learning, pp. 1144–1152.
Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
Cluster-GCN: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266.
Spherical CNNs. arXiv preprint arXiv:1801.10130.
Gauge equivariant mesh CNNs: anisotropic convolutions on geometric graphs. arXiv preprint arXiv:2003.05425.
Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982.
Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272.
Jraph: a library for graph neural networks in JAX.
Very deep graph neural networks via noise regularisation. arXiv preprint arXiv:2106.07971.
Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry 17 (5-6), pp. 490–519.
Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.
Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 14 (3), pp. 1–159.
Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203.
Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
OGB-LSC: a large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430.
Open Graph Benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.
Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265.
Combining label propagation and simple models outperforms graph neural networks. arXiv preprint arXiv:2010.13993.
Transformers are graph neural networks. The Gradient.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Academic Press.
Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
Graph partition neural networks for semi-supervised classification. arXiv preprint arXiv:1803.06272.
RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 37–45.
Chip placement with deep reinforcement learning. arXiv preprint arXiv:2004.10746.
TUDataset: a collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663.
PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. Journal of Chemical Information and Modeling 57 (6), pp. 1300–1308.
Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409.
UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American Chemical Society 114 (25), pp. 10024–10035.
Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
DropEdge: towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903.
SIGN: scalable inception graph neural networks. arXiv preprint arXiv:2004.11198.
ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pp. 8459–8468.
E(n) equivariant graph neural networks. arXiv preprint arXiv:2102.09844.
Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868.
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
A deep learning approach to antibiotic discovery. Cell 180 (4), pp. 688–702.
Graph agreement models for semi-supervised learning.
Bootstrapped representation learning on graphs. arXiv preprint arXiv:2102.06514.
Attention is all you need. arXiv preprint arXiv:1706.03762.
Graph attention networks. arXiv preprint arXiv:1710.10903.
Deep Graph Infomax. arXiv preprint arXiv:1809.10341.
Neural execution of graph algorithms. arXiv preprint arXiv:1910.10593.
Microsoft Academic Graph: when experts are not enough. Quantitative Science Studies 1 (1), pp. 396–413.
Improving conformer generation for small rings and macrocycles based on distance geometry and experimental torsional-angle preferences. Journal of Chemical Information and Modeling 60 (4), pp. 2044–2058.
Bag of tricks for node classification with graph neural networks. arXiv preprint arXiv:2103.13355.
Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871.
Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983.
Scalable graph neural networks for heterogeneous graphs. arXiv preprint arXiv:2011.09679.
GraphSAINT: graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931.
Learning from labeled and unlabeled data with label propagation.
Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131.
Layer-dependent importance sampling for training deep and large graph convolutional networks. arXiv preprint arXiv:1911.07323.