Run node2vec ultrafast with less memory using PecanPy

An easy to use implementation of a popular graph embedding method

Node2vec is a node embedding method that generates numerical representation (or embeddings) of nodes in a graph [1]. These embeddings are then used for various down stream tasks such as node classification and link prediction. If you’re unfamiliar with node2vec, feel free to check out my other post that explains the intuition of it.

TL;DR

Node2vec is a very popular embedding method for generating numerical representations of nodes in a graph (a.k.a. network), and yet the original implementations are slow and memory inefficient.
PecanPy is an ultra fast and memory efficient implementation of node2vec, which can be easily installed via pip and can be used either from command line or in your Python script.
As an example using STRING network, PecanPy reduces the runtime from 5 hours down to slightly over a minute and reduces the memory usage from 100GB down to <1GB.

Original Python implementation of node2vec is slow and memory inefficient

The original implementations (implemented in C++ and Python) are quite slow and memory inefficient. In this blog post, I’ll introduce PecanPy [2], a parallelized, memory efficient and accelerated node2vec in Python, and we’ll see that both the run time and memory usage are drastically improved compared to the original implementations.

Figure1. Runtime and memory usage of node2vec on two biological networks BioGRID and STRING.

Let’s take a look at two examples of embedding the biological networks BioGRID and STRING. Both networks have roughly 20k nodes, with STRING being slightly denser. It took more than 5 hours and 100GB of memory to fully embed the STRING using the original Python implementation, and used more than 1GB of memory for BioGRID, even though the file size of the full network is only 2.5MB. Spoil alert: it took only slightly over a minute with less than 1GB of memory to embed STRING using PecanPy.

Three main issues with the original implementations of node2vec

The following issues are specifically for the original Python implementation. Although the C++ implementation is more efficient in some way, we’ll see in the next section that it still suffers from memory issues and runs slower than PecanPy.

Parallelism issue: The transition probability preprocessing and random walk generation are not parallelized despite the embarrassingly parallel nature of both processes, i.e. each walk generation is an independent task that does not depend on any other walks.

Graph data structure issue: The Python original implementation uses networkx to store graph, which is quite inefficient for large scale computation. Motivated by this blog post by Matt Ranger, PecanPy uses compact sparse row (CSR) format for sparse graphs, which is more compactly stored in memory and most importantly, much faster to run as it can better utilize cache. Yet there’s one issue with CSR, if the graph is dense (say fully connected, which is quite common in biology, such as gene coexpression networks), then storing in CSR requires more memory than needed due to explicit storing of indices. To deal with dense networks, we’ve also added another option to store as 2D numpy array.

Memory issue: The fact that the memory usage far exceeded the file size of the network is attributed to the preprocessing and storing of 2nd order transition probability. To resolve this, we can simply remove the preprocessing step and instead compute the transition probability as needed without saving. However, for a small and sparse network where it is still possible to save all 2nd order transition probabilities, it is recommended to include the preprocessing step as it can generate random walk much faster without re-computing transition probability in each step.

Figure2. Overview of PecanPy modes of execution.

PecanPy resolved all the above issues and provided three modes of execution, including PreComp, SparseOTF, and DenseOTF, each of which is more optimized for graphs of different sizes and densities. In short, the two OTF (stands for On The Fly) modes addressed the memory issue by computing transition probability on the fly, while PreComp mode preserves the preprocessing step as in the original implementations on smaller networks for faster walk generation. Finally, Sparse and Dense indicate whether CSR or numpy array is used to store the graph.

Benchmarking results

Figure3. Runtime and memory usage for different implementations of node2ec on 8 different networks

Figure3 shows the runtime and memory usage comparison between the original implementations and PecanPy, using networks of various sizes and densities, including those tested in the node2vec paper [1]. First, notice that all three modes of PecanPy have achieved faster runtime compared to both original Python and C++ implementations. In particular, it took only slightly over a minute to embed the STRING network. Second, the OTF strategy significantly reduced memory usage. Looking at STRING again, SparseOTF is able to fully embed the network with less than 1GB of memory, while both original implementations require ~100GB of memory. Furthermore, the GIANT-TN network (26k nodes, fully connected) can not be embedded using either original implementations due to memory limitations (tested with 200GB resource configuration).

Finally, the quality of embedding generated by PecanPy is confirmed to be the same as those generated by the original Python implementation. Checkout [2] for more information.

Getting started with PecanPy

You can check out the most up-to-date (unreleased) PecanPy on GitHub. It can also be easily installed via pip: pip install pecanpy, which installs the latest released version of PecanPy. All benchmarks are reproducible using the workflow scripts provided by the benchmark repository (you can also use it to benchmark your own implementation of node2vec).

PecanPy can either be used from the command line or loaded as a module in Python script. As an example, the following command line script embeds the karate network with SparseOTF mode:

pecanpy --input karate.edg --output karate.emb --mode SparseOTF

Or equivalently, you can load the PecanPy as a module and generate random walks from the graph, which can then be used to generate embeddings:

from pecanpy import pecanpy

# load graph object using SparseOTF mode
g = pecanpy.SparseOTF(p=1, q=1, workers=1, verbose=False)
g.read_edg("karate.edg", weighted=False, directed=False)

# generate random walks
walks = g.simulate_walks(num_walks=10, walk_length=80)

# use random walks to train embeddings
w2v_model = Word2Vec(walks, vector_size=8, window=3, min_count=0, sg=1, workers=1, epochs=1)

Compare to existing implementations of ndoe2vec

Although there are many other existing implementations of node2vec, including nodevectors by Matt Ranger and Python3 node2vec by Elior Cohen, none of them has provided a systematic benchmark to show exactly how much speedup is made and few of them dealt with the memory issue. Furthermore, none of the existing implementations showed quality control of embedding despite that one could easily make mistakes when optimizing the code and introduce bugs, e.g. in random walk generation, and thus reduce the quality of the resulting embedding. PecanPy provides systematic performance benchmark and quality control of embedding, which are completely reproducible using the workflow set up in the benchmarking repository.

Reference

[1] Grover A., Leskovec J. (2016) node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 855–864. Association for Computing Machinery, New York, NY, USA.

[2] Renming Liu, Arjun Krishnan, PecanPy: a fast, efficient and parallelized Python implementation of node2vec, Bioinformatics, 2021;, btab202, https://doi.org/10.1093/bioinformatics/btab202