Journal of Universal Computer Science, vol. 28, no. 2 (2022), 160-180
submitted: 17/12/2020, accepted: 2/11/2021, appeared: 28/2/2022 CC BY-ND 4.0

A Spark Parallel Betweenness Centrality Computation and its
Application to Community Detection Problems

Daniel Gómez González
(Complutense University of Madrid, Spain

https://orcid.org/0000-0001-9548-5781 dagomez@estad.ucm.es)

Luis Llana Díaz
(Complutense University of Madrid, Spain

https://orcid.org/0000-0003-1962-1504 llana@ucm.es)

Cristóbal Pareja
(Complutense University of Madrid, Spain

https://orcid.org/0000-0001-7739-0236 cpareja@ucm.es)

Abstract: The Brandes algorithm has the lowest computational complexity for computing the

betweenness centrality measures of all nodes or edges in a given graph. Its numerous applications

make it one of the most used algorithms in social network analysis. In this work, we provide a par-

allel version of the algorithm implemented in Spark. The experimental results show that the parallel

algorithm scales as the number of cores increases. Finally, we provide a version of the well-known

community detection Girvan-Newman algorithm, based on the Spark version of Brandes algorithm.

Keywords: Spark, MapReduce, Social NetworkAnalysis, Centrality measure, BrandesAlgorithm,
distributed programming
Categories: G.2 G.2.2 J.4 C.1.4 C.2.4

DOI: 10.3897/jucs.80688

1 Introduction

Social network analysis is currently a popular discipline. It is extensively used in a range
of applications and real problems. Some common network analysis applications include
data aggregation and mining, network propagation modelling, network modelling and
sampling, user attribute and behavior analysis, community-maintained resource support,
location-based interaction analysis, social sharing and filtering, recommender systems
development, and link prediction and entity resolution. One of the major concerns in
network analysis is related to the concept of centrality. Centrality measures the importance
of a node’s position in a network. In social, biological, communication, and transportation
networks, among others, it is important to know the relative structural prominence of
nodes to identify the key elements in the network (see for example [Borgatti, 2005, Gómez
et al., 2003, Newman, 2003, Gómez et al., 2013, Bravetti et al., 2007, Freeman, 1978]).
Numerous studies have proposed and analyzed several centrality measures (for more
details, see [Borgatti, 2005, Wasserman and Faust, 1994]).

One significant drawback of some standard centrality measures is the processing time
needed. Centrality measures must be calculated for large and complex networks in the era
of Big Data. The Brandes algorithm [Brandes, 2001] can solve the betweenness centrality

https://orcid.org/0000-0001-9548-5781
https://orcid.org/0000-0001-9548-5781
https://orcid.org/0000-0003-1962-1504
https://orcid.org/0000-0003-1962-1504
https://orcid.org/0000-0001-7739-0236
https://orcid.org/0000-0001-7739-0236


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 161

defined in [Freeman, 1978]. But, its sequential implementations cannot successfully deal
with large graphs due to the time this would require. The authors in [Balkir et al., 2015]
propose a parallel algorithm to compute an approximation of centrality. This centrality
measure often needs to compute the shortest paths between all the pairs of nodes in the
network, making it highly complex. The exact computation of the betweenness centrality
involves solving the APSP problem. The authors in [Yang and Lonardi, 2007] provide a
parallel algorithm for computing betweenness centrality. Their solution requires high-end
shared symmetric memory multiprocessor architectures.

Another implementation of betweenness centrality can be found in [Madduri et al.,
2009, Prountzos and Pingali, 2013a, Wang et al., 2016], which was subsequently im-
proved in [Jamour et al., 2018]. These work use parallel versions of betweenness central-
ity based on an incremental betweenness centrality computation. They are variations of
the Brandes algorithm, not the algorithm itself. The authors in [AlGhamdi et al., 2017]
carried out the first major study to attempt to benchmark betweenness centrality calcula-
tions. They used a supercomputer to compute exact betweenness centralities for large
graphs. Finally, in [Matta et al., 2019] there is an approximated algorithm for computing
the betweenness centrality with very good results.

The major contribution of this paper is to provide a parallel version of the Brandes
betweenness centrality algorithm [Brandes, 2001] in the Spark framework (https://spark.
apache.org/). The Brandes algorithm is the fastest known way to compute the well-known
centrality measure for links or nodes in a network. We have studied the performance
of this algorithm with large-scale data, and have shown how the computational cost
decreases almost linearly as the number of cores grows.

We have applied our parallel version of the Brandes algorithm to the Girvan-Newman
algorithm [Girvan and Newman, 2002] (from now on GN algorithm). This algorithm
uses the computation of betweenness centrality for links. We have shown how the GN
algorithm benefits from the use of our parallel implementation of the Brandes algorithm.
Additionally, we have compared this version of the GN algorithm with a recent parallel
version of the same algorithm called SPB-MRA [Moon et al., 2016].

We have organized this paper as follows. First, we have dedicated the next Section to
some preliminary concepts: parallel computing, Spark programming paradigm, centrality
measures and community detection problems. Section 3 details the Brandes algorithm for
the parallel computing of the node and edge betweenness centrality. Next, in Section 4
we have shown the results achieved by our parallel implementation. Section 5 contains
the Girvan-Newman’s algorithm using the parallel version of the Brandes algorithm, and
finally there are some conclusions in Section 7.

2 Related work: Preliminaries

2.1 Distributed programming with Spark

Sequential algorithms frequently cannot successfully deal with the problems that involve
a large amount of data. To overcome this problem, there are new models of programming
that distribute the workload over several processors.

Currently, the most common conceptual model for implementing distributed com-
puting with large datasets is MapReduce. This computational model comes from the
functional programming paradigm and allows parallel program execution. In this way, it
is possible to process massive datasets in a distributed framework. Processing a large
dataset under this model means splitting it into multiple small pieces to be processed

https://spark.apache.org/
https://spark.apache.org/


162 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

(map) separately, and then combining the partial results (reduce) in a tree-like manner.
Both steps, processing single chunks of data and compiling them together, can be run in
a distributed manner. Both Hadoop [Haddop] and Spark [Spark] implement this model.
Thanks to them, a programmer does not need to deal with the low-level technical details
related to the system, such as partitioning datasets into chunks and the communication
between the processing units. Programmers thus only have to focus on how to process
single portions of data and how to combine the partial results.

Spark is a free, open-source project for large-scale data processing. Although the
implementation language is Scala, one can use Spark with several programming lan-
guages, including Python. The computational model of Spark is more abstract than
Hadoop’s model, exploiting the concept of Resilient Distributed Datasets (RDDs). RDDs
allow Spark to work on data stored in memory and not only on disk like Hadoop. Spark
performance is therefore usually much better than what can be achieved in Hadoop.

2.2 Social network analysis: Centrality measures

Centrality is a complex notion that has a vague definition. Usually, there is a consensus
about the idea that a node has high centrality if it can communicate directly with other
nodes or if it performs as an intermediary in a communication among other nodes. These
ideas are derived from the three first concepts of centrality:

– Degree centrality.We can define this measure as the number of links (edges) emerging
from a node. The simplicity of this measure makes it usable in very large and complex
networks. For example, in large social networks such as Facebook, the popularity of
a person is measured by the number of its friends. The degree of centrality focuses
on the communication activity level, that is, the number of direct links from a node
to its neighborhood. As the ability to increase the number of direct communications
grows, the centrality degree of a node increases [Wasserman and Faust, 1994].

– Closeness centrality. Closeness centrality measures the possibility of a given node
to communicate with many others using a minimum number of intermediaries.
The degree of centrality is a particular case of closeness centrality because only
communications without intermediaries are considered (see [Bavelas, 1948] for
more details).

– Betweenness centrality. Betweenness centrality measures the possibility of being
an intermediary in communication between other nodes. Betweenness is a type of
measure of communication control. This measure was first proposed and studied
in [Freeman, 1978].

In this work, we focus on the calculation of betweenness centrality. Although there
are different betweenness centrality measures based on how the information flows (see
[Borgatti, 2005]), the most used measure is most likely the one based on the shortest
paths defined by Freeman et al. [Freeman, 1978].

Definition 2.1 Given a graph (V,E)whereV = {1, . . . , n} represents the set of vertices
and E represents the set of direct links, the betweenness centrality of a node v ∈ V is
expressed as:

bet(v) =

n−1∑
i=1

j=n∑
j=i+1

πi,j(v)

πi,j


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 163

where πi,j(v) represents the number of shortest paths from i to j through node v, and
πi,j is the number of shortest paths from i to j.

Calculating the betweenness centralities of all vertices in a network requires obtaining
the shortest paths between all pairs of vertices, which takesΘ(|V |3) time with the Floyd-
Warshall algorithm, modified not only to find one path, but also to count all of the shortest
paths between two nodes.

On a sparse graph, Johnson’s algorithm may be more efficient. The time complexity
is O(|V |2 log |V |+ |V ||E|). On unweighted graphs, calculating betweenness centrality
takes O(|V ||E|) time using the Brandes algorithm [Brandes, 2001].

Let us observe that, on the calculation of the betweenness centrality of all vertices,
one usually assumes that the graph is undirected and connected. In the following section,
we describe the sequential version of the Brandes algorithm.

2.3 Community detection algorithms: Girvan and Newman

One of the most popular algorithms in community detection problems is the Girvan-
Newman (GN) algorithm [Girvan and Newman, 2002]. This algorithm is a hierarchical
clustering algorithm based on the betweenness centrality idea defined in [Freeman, 1978]
for edges instead of nodes. The betweenness measure of an edge represents the number of
shortest paths between pairs of nodes that pass through it. The idea behind the algorithm
is that edges with strong betweenness frequently connect communities. Thus, removing
these edges breaks the graph into communities. So, communities in the graph are obtained
by sequentially eliminating edges with the highest betweenness centrality repeatedly.

There are others more efficient, faster and sophisticated community detection algo-
rithms, such as Louvain CNM or community detection based on optimization algorithms
(see [Fortunato, 2010] for more details). Nevertheless, due to its simplicity and intuitive-
ness, GN algorithm is currently used in non-large networks. However, it is still an open
problem to design a scalable and accurate parallel GN algorithm to tackle large graphs
using a parallel machine with distributed memory [Moon et al., 2016, Nath and Roy,
2018]. In the next section, we briefly describe some attempts to extend the GN algorithm
for large graphs using a distributed architecture.

2.4 Parallel centrality measures and community detection algorithms

As mentioned above, the concept of betweenness in social networks is a crucial element
in the identification of communities because this centrality measure between two nodes
represents the communication stress suffered by these two nodes. When this stress is
very high, we can assume that these two nodes should belong to different communities,
being each of them intermediaries between these two communities. The idea of the
most common betweenness centrality measures is the computation of the minimum
paths between each pair of nodes. Some authors [Tan et al., 2009, Green and Bader,
2013, Prountzos and Pingali, 2013b] have parallelized this. However, most of these
parallel versions come from algorithms that are not very efficient in the sequential
version. For this reason, we propose in this paper a parallelization of the most efficient
sequential algorithm for computing betweenness centrality: the Brandes algorithm.

In community detection problems, there are many authors (for example [Guo et al.,
2015, Zhang et al., 2016]) who try to find parallel versions of some well-known commu-
nity detection algorithms. The authors in [Bahmani et al., 2012] propose an algorithm that


164 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

Brandes algorithm for computing betweenness centrality in unweighted graphs

1: procedure BetweennessCentrality(G = (V,A))
2: cB [v]← 0, v ∈ V
3: for s ∈ V do

4: S ← empty stack
5: P [w]← empty list, w ∈ V
6: σ[t]← 0, t ∈ V ; σ[s]← 1
7: d[t]← −1, t ∈ V ; d[s]← 0
8: Q← empty queue
9: enqueue s→ Q
10: while Q not empty do
11: dequeue v ← Q
12: push v → S
13: foreach neighbor w of v do
14: // w found for the first time?
15: if d[w] < 0 then
16: enqueue w → Q
17: d[w]← d[v] + 1
18: end if

19: // shortest path to w via v?
20: if d[w] = d[v] + 1 then
21: σ[w]← σ[w] + σ[v]
22: append v → P [w]
23: end if

24: end foreach

25: end while

26: δ[v]← 0, v ∈ V
27: // S returns vertices in order of non-increasing distance from s
28: while S not empty do
29: pop w ← S

30: for v ∈ P [w] do δ[v]← δ[v] + σ[v]
σ[w] · (1 + δ[w])

31: if w 6= s then CB [w]← CB [w] + δ[w]
32: end while

33: end for

34: return cB
35: end procedure

finds the densest subgraph and implements it using the MapReduce model. The authors
in [Li et al., 2011] propose a modified parallel version of the label propagation algorithm.
However, there are few works dealing with the problem of finding parallel versions of the
well-known GN algorithm. The authors in [Yang and Lonardi, 2007] present a parallel
version of the GN algorithm using the MPI library. The authors in [Moon et al., 2016]
partially solved some of these issues and developed two algorithms (the SPB-MRA and
SPB-VCA) that efficiently divide a graph into subgraphs and store them in distributed
file systems (in SPB-MRA) or in local file systems (in SPB-VCA) to support big data
graphs.

Nevertheless, these two last algorithms present two significant drawbacks, which


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 165

def node_be tweenne s s _p a r ame t e r i z e d (G, weight , e n d p o i n t s ) :

def f ( s ) :

i f weigh t i s None : # use BFS

S , P , s igma = _ s i n g l e _ s o u r c e _ s h o r t e s t _ p a t h _ b a s i c (G. va lue , s )

e l s e : # use D i j k s t r a ’ s a l g o r i t hm

S , P , s igma = \

_ s i n g l e _ s o u r c e _ d i j k s t r a _ p a t h _ b a s i c (G. va lue , s , we igh t )

# Ca l c u l a t e new c o n t r i b u t i o n :

i f e n d p o i n t s :

re turn \

_ a c c umu l a t e _ e n dpo i n t s _ _ c o n t r i b u t i o n _ f r om_nod e ( S ,

P , sigma , s )

e l s e :

re turn \

_ a c c umu l a t e _ b a s i c _ _ c o n t r i b u t i o n _ f r om_nod e ( S ,

P , sigma , s )

re turn f

def n o d e _ b e tw e e n n e s s _ c e n t r a l i t y (G, s p a r k _ c on t e x t ,

k=None , no rma l i z ed =True , we igh t =None ,

e n d p o i n t s =Fa l s e ,

s eed=None ) :

brG = s p a r k _ c o n t e x t . b r o a d c a s t (G)

n co r e s = i n t ( s p a r k _ c o n t e x t . ge tConf ( ) . g e t ( ’ t o t a l − exe cu t o r − c o r e s ’ ) )

b e tweenne s s _ con t r _due_ t o_a_node = \

nod e_be tweenne s s _p a r ame t e r i z e d ( brG , weight , e n d p o i n t s )

p a r a lNode s = s p a r k _ c o n t e x t . p a r a l l e l i z e (G. n o d e s _ i t e r ( ) , n c o r e s )

be tweenne s s = pa r a lNode s . f l a tMap ( b e tweenne s s _ con t r _due_ t o_a_node ) . \

reduceByKey ( o p e r a t o r . add ) . co l l ec tAsMap ( )

_ r e s c a l e ( be tweenness , l en (G) ,

no rma l i z ed =norma l i zed ,

d i r e c t e d =G. i s _ d i r e c t e d ( ) ,

k=k )

re turn be tweennes s

Figure 1: Python functions that compute the node betweeness

motivate our work. First, the betweenness centrality algorithm that the authors are using
to build their parallel version is only moderately efficient, (whilst the Brandes algorithm
is the most efficient known algorithm for this problem). Second, the parallel GN version
that they proposed is just an approximation method of the original (it does not calculate
the same results of the original GN algorithm). Consequently, the performance of the
parallel version is unacceptable in some situations.

3 A parallel version of the Brandes algorithm: PBN and PBE

The Brandes algorithm is currently the most efficient sequential algorithm for computing
node betweenness centrality [Fortunato, 2010, Green and Bader, 2013]. Several au-
thors [Prountzos and Pingali, 2013b] have studied the possibilities of parallelizing it. Our
approach consists in processing multiple source nodes in parallel. This coarse-grained
parallelization strategy is possible due to the independence of the information calculated
for each source node. Our starting point was the Python version of the Brandes algorithm
in the package Networkx [Networkx]. Networkx is a free open source piece of software


166 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

def edg e_b e twe enn e s s _p a r ame t e r i z e d (G, we igh t ) :

def f ( s ) :

i f weigh t i s None : # use BFS

S , P , s igma = _ s i n g l e _ s o u r c e _ s h o r t e s t _ p a t h _ b a s i c (G. va lue ,

s )

e l s e : # use D i j k s t r a ’ s a l g o r i t hm

S , P , s igma = _ s i n g l e _ s o u r c e _ d i j k s t r a _ p a t h _ b a s i c (G. va lue ,

s , we igh t )

# accumu la t i on

re turn _ a c c umu l a t e _ edg e s __ con t r i b u t i o n_ f r om_node (G. va lue ,

S , P , sigma , s )

re turn f

def e d g e _ b e tw e e n n e s s _ c e n t r a l i t y (G, s p a r k _ c on t e x t , n o rma l i z ed =True ,

we igh t =None ) :

brG = s p a r k _ c o n t e x t . b r o a d c a s t (G)

n co r e s = i n t ( s p a r k _ c o n t e x t . ge tConf ( ) . g e t ( ’ t o t a l − exe cu t o r − c o r e s ’ ) )

b e tweenne s s _ con t r _due_ t o_an_edge = \

e dg e_b e twe enn e s s _p a r ame t e r i z e d ( brG , we igh t )

p a r a lNode s = s p a r k _ c o n t e x t . p a r a l l e l i z e (G. n o d e s _ i t e r ( ) ,

numSl i ce s= n co r e s )

be tweenne s s = pa r a lNode s . f l a tMap ( b e tweenne s s _ con t r _due_ t o_an_edge ) \

. reduceByKey ( o p e r a t o r . add ) . co l l ec tAsMap ( )

_ r e s c a l e _ e ( be tweenness , l en (G) ,

no rma l i z ed =norma l i zed ,

d i r e c t e d =G. i s _ d i r e c t e d ( ) )

re turn be tweennes s

Figure 2: Python functions that compute the node betweeness

that is carefully documented and broadly used. In abstract, it is the ideal choice in this
context.

In order to synthesize the parallel algorithm, we have transformed the original fun-
cions node_betweenness_centrality and edge_betweenness_centrality , to adapt them
to the Spark MapReduce scheme. The functionality of the new version is the same as
the original. We defined three functions (_accumulate_basic, _accumulate_endpoints
and _accumulate_edges). Each performs their respective contributions to centrality, up-
dating the shared variable betweenness, in an accumulative way that is intrinsically
sequential. In the original Brandes algorithm, this variable plays the role of a shared,
unique variable, and is passed by reference to each of these functions. In fact, this
variable has to be updated in each call to these functions, which forces the computa-
tion to be sequential. The key is that this shared role is not necessary because each
of these functions can calculate its specific contribution. Because these contributions
are additive, commutative, and associative, the order in which single contributions are
added is irrelevant, so these contributions can be computed in parallel. This intermedi-
ate step allows us to transform these functions into three respective, similar functions,
that calculate the centrality without performing accumulation. This is a critical change
because it avoids a strong dependency with the variable betweenness. The calls to the
new functions betweenness_centrality and edge_betweenness_centrality also need to
be adapted to perform the respective accumulations (of nodes and edges to centrality)
after the calculations. Specifically, we have parameterized these functions with simple
partial evaluation, that is, they are applied to all parameters except the nodes. This basic


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 167

Gaph name
Number of

vertices

Number of

Edges
ca-GrQc 5242 14496

ca-HepTh 9877 25998

ca-AstroPh 18772 198110

cit-HepPh 34546 421578

soc-Slashdot0811 77360 905468

sx-superuser 194085 1443339

com-dblp 317080 1049866

Table 1: Input statistics for real world networks used in our experimental study.

Graph Name Time (secs.)

ca-GrQc 101.05

ca-HepTh 512.19

ca-AstroPh 4975.47

cit-HepPh 26395.99

Table 2: Time required for the sequential version of the Brandes Algorithm

transformation allows us to express the loop traversing the nodes using a MapReduce
scheme: the reduce operation can combine the partial results as mapping functions yield
them. These functions are shown in Figures 1 and 2. The full code can be found in
GitHub: https://github.com/LuisLlana/spark_brandes.

4 Results and discussion

4.1 Networks data set

The network data sets that we use in this paper can be found in standard repositories
of network datasets such as the Stanford Large Network Dataset Collection [Stanford]
or the UCI Network Data Repository [UCI]. Table 1 shows the datasets used in the
experimental setting as some characteristics of the network. We have included the time
required by the sequential implementation of the Brandes function in the Networkx
module in Table 2 on the smaller graphs. Due to the time required we have not included
the time for the bigger ones on.We have executed the BrandesAlgorithm in the Networkx
package on an Intel Core i5 CPU 3.10GHz.

In Figure 5 and Figure 6 we can see the strong relations between the running time
with 200 cores and the graph size.

4.2 Scalability

First, we can compare the obtained results with the sequential version of the algorithm.
The sequential algorithm takes 26396 seconds (7 hours and 20 minutes) to compute
the Brandes algorithm while the parallel version with 50 nodes takes 2438 seconds (39
minutes). That is an 11.24x speedup. Discussing this comparison is difficult because
we have different architectures and Spark introduces an overhead in the computations.
Nonetheless, this comparison is not as relevant as the scalability that we discuss next.

https://github.com/LuisLlana/spark_brandes


168 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

0 50 100 150 200

500

1000

1500

Number of cores

T
im
e
(s
ec
s.
)

0 50 100 150 200

2

4

6

·10−3

Number of cores

1
/T
im
e

Figure 3: Arxiv Astro Physics graph speed up (ca-AstroPh)

Gaph name Time (sec) Vertices Edges

ca-GrQc 34.70 5242 14496

ca-HepTh 27.67 9877 25998

ca-AstroPh 150.23 18772 198110

cit-HepPh 884.94 34546 421578

soc-Slashdot 905468 3703.87 77360 905468

sx-superuser 40292.89 194085 1443339

com-dblp 112854.35 317080 1049866

Table 3: Results for real world networks in our experimental study (200 cores).

To show the scalability of our algorithm, we have measured the elapsed time for
one iteration while varying the number of cores. Table 3 shows the time used by our
implementation for the graphs in Table 1 using 200 cores. The full study of scalability
has been carried out with the graphs ca-AstroPh and cit-HepPh. We have run our imple-
mentation on these graphs with the number of cores varying from 10 to 200 in the case
of the ca-AstroPh and from 50 to 200 in the case of cit-HepPh. The results are shown in
Figures 3 and 4.

In both figures, the left graph shows the time required to complete the programs
while the right one shows the inverse, which gives an idea of the speedup. In both cases,
we can observe an almost linear speedup. Figure 7 compares the execution time between


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 169

50 100 150 200

1000

1500

2000

2500

Number of cores

T
im
e
(s
ec
s.
)

50 100 150 200

0.4

0.6

0.8

1

1.2

·10−3

Number of cores

1
/T
im
e

Figure 4: Arxiv High Energy Physics paper citation network (cit-HepPh)

different sized graphs: ca-GrQc (14496 edges), ca-HepTh (25998 edges), ca-AstroPh
(198110 edges) and cit-HepPh (421578 edges).

At this point we want to quantify scalability of the parallelization and the ratio
between the execution time (Y ) and the number of cores (X). In order to achieve this,
we present four regression models of the form Y = f(X) for the considered examples.

– Y = α0 + α1X . Linear regression.

– Y = α0e
α1X . Exponential regression.

– Y = α0 + α1 log(X). Logarithmic regression.

– Y = α0 X
α1 . Power regression.

Table 4 shows the different regression results in terms of determination coefficient
R2. As we can see, the best adjustment for the two examples is the power regression
model (both of them with a α1 negative value). Figure 8 shows the potential regression
model and the error function for the ca-AstroPh example.


170 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

Figure 5: Relation (power regression) between running time with 200 cores and edges

sizes of the network.

Figure 6: Relation (power regression) running time with 200 cores and edges sizes of

the network

To illustrate the reduction in time depending on the number of cores, the next regres-
sion model shows us the relationship between the inverse of the execution time (Z = 1

Y )
and the number of cores X . As done previously, the same four regression models were
studied for the two examples analyzed, now for Z = f(X).

In Table 5, we show the regression results in terms of determination coefficient R2.
The best adjustment for the two examples is now the linear regression model (both with
a positive α1 value, but lower than 1).

Figure 9 shows the linear adjustment between the inverse of time and the number of
cores for the ca-AstroPh example. It clearly shows that the time used by the algorithm
decreases as the number of cores increases. Additionally, the asymptotic relationship
within the number of cores and the elapsed time used is just an inverse function. In other
words, if we double the number of cores, the time request is halved, asymptotically.
According to the experiments, the scalability of our parallel algorithm is almost optimal.

Moreover, Figure 8 shows that the error estimated, clearly and very quickly, tends to
zero, underlining the optimality of the parallelization.

4.3 Software and hardware

Our implementation was developed using Python 3.4, Spark 2.1 and the pyspark module
2.1. We ran our programs in an HP Prolian SL390 G7 with 40 nodes running Linux


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 171

1

2

3

4

·105

50
100

150
200

0

0.5

1

·104

Graph size (edges)
Number of Cores

T
im
e
(s
ec
s.
)

Figure 7: Time comparison between graphs

Regression Model

Example Linear Exponential Logarithmic Power

ca-AstroPh 0,64 0,88 0,91 0,99

cit-HepPh 0,7864 0,85 0,9069 0,9361

Table 4: Determination coefficient for different considered regression models between.

Redhat Enterprise 5.7. Each node had 6 cores; the total RAM of the system was 1200Gb.
We used a maximum of 200 executors, each one with limited memory of 2Gb.

5 Application of PB algorithm: a new parallel version of the GN
algorithm

In the previous sections, we have described a parallel version of the Brandes algorithm
that computes the betweenness centrality for all nodes and edges. Based on it, in this
section, we show a possible application to this efficient computational in the field of
community detection problems. One of the most well-known algorithms in community
detection problem is the Girvan-Newman algorithm. We modified the Jahanbakhsh
version [Jahanbakhsh].

Our version of the Girvan-Newman (Figure 10) algorithm stops when there are only
small communities. A community is a connected component of the graph. The size of


172 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

Figure 8: Power regression Y = 11000 x−0,807 with R2 = 0, 9916.

Regression Model

Example Linear Exponential Logarithmic Power

ca-AstroPh 0,9904 0,87 0,89 0,9901

cit-HepPh 0,9754 0,8338 0,8504 0,931

Table 5: Determination coefficient for different regression models between inverse time

(Z) and number of cores (X).

the desired largest community is controlled by the comm_size parameter. Similar to the
previous algorithms, the full code is in GitHub: https://github.com/LuisLlana/spark_
brandes.

5.1 Performance of GN-PBE results

We have conducted some experiments to evaluate the performance of the proposed
algorithm. This section describes the details of the experimental setting and its analysis.

We have carried out two studies to show the effectiveness of our parallel version
of the GN algorithm (GN-PBE). We have used two medium/large size examples: the
ca-GrQc and ca-HepTh networks. It is important to observe that trying to obtain this
amount of iterations with undistributed programming would have increased the time
considerably, so only the results for 200 cores were taken into account. The GN algorithm
has to compute the betweenness centrality iteratively for all edges of the network and
then choose the maximum value and eliminate it from the graph.

Figures 11 and 13 show the time of execution in seconds of the entire process during
the first 2500 and 9000 iterations of the algorithm for the ca-GrQc and ca-HepTh networks.

https://github.com/LuisLlana/spark_brandes
https://github.com/LuisLlana/spark_brandes


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 173

Figure 9: Linear regression Z = 0.0005 + 0.0005x with R2 = 0, 9916.

def g i rvan_newman_s tep (G, s p a r k _ c o n t e x t =None ) :

from b e t w e e n n e s s _ c e n t r a l i t y _ s p a r k \

import e d g e _ b e tw e e n n e s s _ c e n t r a l i t y

bw = e d g e _ b e tw e e n n e s s _ c e n t r a l i t y (G, s p a r k _ c o n t e x t = s p a r k _ c o n t e x t )

edges = l i s t ( bw . keys ( ) )

edges_bw = l i s t ( bw . v a l u e s ( ) )

bw_max = max ( edges_bw )

pos_max = edges_bw . i ndex ( bw_max )

u , v = edges [ pos_max ]

G. remove_edge ( u , v )

re turn u , v , bw_max

def girvan_newman (G, comm_size , s p a r k _ c o n t e x t =None , o u t f i l e = sy s . s t d o u t ) :

t o t a l _ e d g e s = G. number_of_edges ( )

i n i _ t im e = t ime . t ime ( )

components , num = f i nd_b i g_componen t s ( comm_size , G)

show_info ( o u t f i l e , i n i _ t ime , 0 , comm_size , components , num)

whi le l en ( components ) >0 :

removed = gi rvan_newman_s tep (G, s p a r k _ c o n t e x t )

t o t a l _ e d g e s −= 1

components , num = f i nd_b i g_componen t s ( comm_size , G)

show_info ( o u t f i l e , i n i _ t ime , removed , comm_size , components , num)

re turn G

Figure 10: Girvan-Newman algorithm using the parallel version of the Brandes

algorithm

In both graphs, we can see how time elapses more slowly from one iteration to another
as we increase the number of iterations. Amore in-depth analysis of this phenomenon
can be observed in the three graph in Figure 12 and Figure 14.

Figure 12 shows a notable time reduction around the 600th iteration. This is because
the graph has been disconnected (when the edge with most betweenness is removed)
into two large and homogeneous communities (that is, connected components). After
this iteration, the calculation of betweenness centrality is drastically reduced since the
centrality measure of a particular edge depends on the connected component to which it
belongs.

Another aspect that should be addressed is the observed variability that seems to be


174 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

−200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 2600

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

Iteration

T
im
e
(s
ec
s.
)

Figure 11: Girvan-Newman iterations for ca-grqc
Groups N Clusters

1 2 3 4 5 6 7 8 9 10 11 12 13 14

24 100 1.84

25 43 1.84

23 100 1.92 1.92

22 100 1.93 1.93

21 100 1.94 1.94

20 100 1.97 1.97

19 100 2.01 2.01

18 100 2.11 2.11

17 100 2.13 2.13

15 100 2.23

16 100 2.24

14 100 2.38

13 100 2.48

12 100 2.78

11 100 3.03

10 100 3.21

9 100 3.39

8 100 3.65

7 100 4.60

6 100 5.92

5 100 6.01

4 100 6.04

2 100 6.16

3 100 6.20

1 100 6.46

p-value .13 .44 .08 .07 .07 1.00 1.00 1.00 1.00 1.00 1.00 .05 .48 1.00

Table 6: Student-Newman Keuls contrast of GN-PBE algorithm on ca-grqc graph group

by 100 iterations

noise in the curve. This variability in the execution-time around an iteration is explained in
the box-plot chart. In general, for a set of close iterations, the computation of an iteration
is quite stable. However, there are certain outlier iterations in which the execution time is
increased (for example, during the first 100 iterations the average execution time is 6.5
seconds, but we can see other outlier iterations close to 8 seconds). This phenomenon
mainly occurs in dense graphs where the computation of all the minimum paths is
constant. Sometimes, the algorithm removes an edge that does not disconnect. In these
cases, the time increases because the minimum paths are longer.

Figure 14 shows the execution-time behavior of the GB-SBE algorithm for the first
9000 iterations. First, there is no step in which the time is drastically reduced as was
the case in the previous example. The tendency in this example is less pronounced than
in the previous one. We can identify the outlier process around a controlled group of
iterations.We observe that the execution time of an iteration in a group of iterations is very


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 175

−200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 2600

1

2

3

4

5

6

7

8

9

10

11

12

13

Iteration

T
im
e
(s
ec
s.
)

−5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105

5.6

5.8

6

6.2

6.4

6.6

6.8

7

7.2

7.4

7.6

7.8

8

8.2

8.4

Iteration

T
im
e
(s
ec
s.
)

Figure 12: Girvan-Newman delta times for ca-grqc

stable. There are also anomalous increases of time produced by the same phenomenon
mentioned above, which are related to the greater complexity of the calculation of the
minimum paths in less dense graphs than in dense ones. We can also observe a reduction
in execution time as the number of iterations increases.

From both experiments, we can guarantee that time reduction tends to drop as we
increase the number of iterations in both problems. To verify this, we performed a
Student-Newman-Keuls contrast between groups of 100 and 500 iterations for the two


176 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

−500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500 9000 9500

0

1

2

3

4

5

6

7

·104

Iteration

T
im
e
(s
ec
s.
)

Figure 13: Girvan-Newman iterations for ca-hepth
Group N Clusters

1 2 3 4 5 6 7 8 9 10 11 12 13 14

18 338 3.33

17 500 3.39

16 500 3.42

15 500 3.54

14 500 3.81

13 500 3.99

12 500 4.45

11 500 4.81

10 500 5.36

9 500 6.11

8 500 7.26

7 500 8.24

6 500 9.35

5 500 10.37

4 500 11.76

3 500 12.72

2 500 13.45

1 500 14.75

p-value .14 .07 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

Table 7: Student-Newman Keuls contrast of GN-PBE algorithm on ca-hepth graph

group by 500 iterations

examples to see if there is a significant difference between the average execution times
in each group. Tables 6 and 7 show the results and also the p-value of each group for
both problems. We can observe that the decrease in groups of 100 is significant. This
can also be seen in the second example when we consider groups of 500 iterations.

6 Threats to validity

There are several issues in the experiments we have carried out. First, the comparison of
the sequential and the parallel versions of the algorithm may be not accurate because
the architectures they have been executed are different. This is because our institution
policies did not allow us the execution of the sequential version in the cluster. Anyway,
the purpose of this research has been to show the scalability of the parallel version.

Another problem with this version is that we need a copy of the graph in each node
of the cluster. So, the size of the graph we can consider is limited by the amount of RAM
memory of the nodes in the cluster. Assuming that each node can have 8Gb of memory
for storing the graph, we could deal with graphs up to 40 million nodes and edges. For
bigger graphs we could use GraphX. This library has two problems with respect our


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 177

−500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500 9000 9500

0

10

20

30

40

50

60

70

Iteration

T
im
e
(s
ec
s.
)

−40 −20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540

12

14

16

18

20

22

24

26

28

30

Iteration

T
im
e
(s
ec
s.
)

Figure 14: Girvan-Newman delta times for ca-hepth

approach. Firstly, it is only implemented in Scala, and it does not have support for Python
yet. Secondly, and more important, it does not implement the Djisktra Algorithm, that is
in the base of the Brandes Algorithm.


178 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

7 Conclusions and Future Work

The core of this work is the development of a distributed algorithm for the well-known
Brandes algorithm, the most efficient algorithm for calculating the betweenness centrality
in terms of computational complexity [Fortunato, 2010, Green and Bader, 2013]. The
parallel version of this algorithm was implemented and the open source code is freely
available in Python. Brandes algorithm parallelization was possible because the contribu-
tions to centrality was additive, and these can then be calculated using the MapReduce
scheme. We synthesized the respective functions for each contribution by partial evalua-
tion and straightforward functional transformation techniques. The experimental results
showed that the performance of the algorithm increases with the number of cores. Specif-
ically, the time decreases proportionally to the number of cores, so we can state that the
parallelization achieved is nearly optimal.

In addition to the parallelization of the Brandes algorithm, we provided in this
paper a parallel version of one of the best-known community detection algorithms:
the Girvan-Newman (GN) algorithm. The GN algorithm is based on the calculation
of the betweenness centrality measure. As far as we know, few attempts have been
made to parallelize the GN algorithm. One of these attempts was provided in the SPB-
MRA algorithm developed some years ago. It is important to observe that although the
scalability properties of the SPB-MRA algorithm are excellent, it is not fair to compare
the GN-PBE algorithm provided here with the SPB-MRA for different reasons. The
SPB-MRA algorithm does not reproduce the results obtained by the GN algorithm, and
as the authors point out, only a high percentage of similar edges are removed from the
graph. Therefore, the results provided by the SPB-MRAalgorithm could be unacceptable
in terms of modularity or in detecting good communities. In this sense, the algorithm
developed here (GN-PBE) reproduces the same results as the original GN algorithm, but
drastically reduces the execution time. In this article, the first thousand iterations of the
GN algorithm were obtained for two classic, medium/large size examples. These results
were possible due to the GN-PBE version, since the execution time of the original GN
algorithm would have been unacceptable.

Acknowledgments

This work has been supported by State Research Agency (AEI) of the Spanish Ministry
of Science and Innovation under grant RTI2018-093608-B-C31 (FAME), the Comunidad
de Madrid under grant S2018/TCS-4314 (FORTE-CM) co-funded by EIE Funds of the
European Union, and the Madrid Government (Comunidad de Madrid-Spain) under
the Multiannual Agreement with the Complutense University as part of the Program
to Stimulate Research for Young Doctors in the context of the V PRICIT (Regional
Programme of Research and Technological Innovation) under grant PR65/19-22452.”

References

[AlGhamdi et al., 2017] AlGhamdi, Z., Jamour, F. T., Skiadopoulos, S., and Kalnis, P. (2017). A
benchmark for betweenness centrality approximation algorithms on large graphs. In Proceedings
of the 29th International Conference on Scientific and Statistical Database Management, Chicago,
IL, USA, June 27-29, 2017, pages 6:1–6:12. ACM.

[Bahmani et al., 2012] Bahmani, B., Kumar, R., and Vassilvitskii, S. (2012). Densest subgraph
in streaming and mapreduce. Proc. VLDB Endow., 5(5):454–465.


Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness… 179

[Balkir et al., 2015] Balkir, A. S., Oktay, H., and Foster, I. T. (2015). Estimating graph distance
and centrality on shared nothing architectures. Concurr. Comput. Pract. Exp., 27(14):3587–3613.

[Bavelas, 1948] Bavelas, A. (1948). Amathematical model for group structures. Human organi-
zation, 7(3):16–30.

[Borgatti, 2005] Borgatti, S. P. (2005). Centrality and network flow. Soc. Networks, 27(1):55–71.

[Brandes, 2001] Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of
mathematical sociology, 25(2):163–177.

[Bravetti et al., 2007] Bravetti, M., Gilmore, S., Guidi, C., and Tribastone, M. (2007). Replicating
web services for scalability. In Barthe, G. and Fournet, C., editors, Trustworthy Global Computing,
Third Symposium, TGC 2007, Sophia-Antipolis, France, November 5-6, 2007, Revised Selected
Papers, volume 4912 of Lecture Notes in Computer Science, pages 204–221. Springer.

[Fortunato, 2010] Fortunato, S. (2010). Community detection in graphs. Physics reports, 486(3-
5):75–174.

[Freeman, 1978] Freeman, L. C. (1978). Centrality in social networks conceptual clarification.
Social networks, 1(3):215–239.

[Girvan and Newman, 2002] Girvan, M. and Newman, M. E. (2002). Community structure in so-
cial and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826.

[Gómez et al., 2013] Gómez, D., Figueira, J. R., and Eusébio, A. (2013). Modeling centrality
measures in social network analysis using bi-criteria network flow optimization problems. Eur. J.
Oper. Res., 226(2):354–365.

[Gómez et al., 2003] Gómez, D., González-Arangüena, E., Manuel, C., Owen, G., del Pozo, M.,
and Tejada, J. (2003). Centrality and power in social networks: a game theoretic approach. Math.
Soc. Sci., 46(1):27–54.

[Green and Bader, 2013] Green, O. and Bader, D. A. (2013). Faster betweenness centrality based
on data structure experimentation. In Alexandrov, V. N., Lees, M., Krzhizhanovskaya, V. V.,
Dongarra, J. J., and Sloot, P. M. A., editors, Proceedings of the International Conference on
Computational Science, ICCS 2013, Barcelona, Spain, 5-7 June, 2013, volume 18 of Procedia
Computer Science, pages 399–408. Elsevier.

[Guo et al., 2015] Guo, K., Guo, W., Chen, Y., Qiu, Q., and Zhang, Q. (2015). Community
discovery by propagating local and global information based on the mapreduce model. Inf. Sci.,
323:73–93.

[Haddop] Haddop. Haddop. https://hadoop.apache.org/.

[Jahanbakhsh] Jahanbakhsh, K. Python implementation of the girvan newman algorithm. https:
//github.com/kjahan/community/blob/master/cmty.py.

[Jamour et al., 2018] Jamour, F. T., Skiadopoulos, S., and Kalnis, P. (2018). Parallel algorithm
for incremental betweenness centrality on large graphs. IEEE Trans. Parallel Distributed Syst.,
29(3):659–672.

[Li et al., 2011] Li, Q., Wang, Z., Wang, W., Liu, Y., Wang, P., and Yu, T. (2011). LI-MR: A
local iteration map/reduce model and its application to mine community structure in large-scale
networks. In Spiliopoulou, M., Wang, H., Cook, D. J., Pei, J., Wang, W., Zaïane, O. R., and Wu,
X., editors, Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on,
Vancouver, BC, Canada, December 11, 2011, pages 174–179. IEEE Computer Society.

[Madduri et al., 2009] Madduri, K., Ediger, D., Jiang, K., Bader, D. A., and Chavarría-Miranda,
D. G. (2009). Afaster parallel algorithm and efficient multithreaded implementations for evaluating
betweenness centrality on massive datasets. In 23rd IEEE International Symposium on Parallel
and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009, pages 1–8. IEEE.

[Matta et al., 2019] Matta, J., Ercal, G., and Sinha, K. (2019). Comparing the speed and accuracy
of approaches to betweenness centrality approximation. Computational Social Networks, 6(1):2.

https://hadoop.apache.org/
https://github.com/kjahan/community/blob/master/cmty.py
https://github.com/kjahan/community/blob/master/cmty.py


180 Gomez D., Llana L., Pareja C.: A Spark Parallel Betweenness…

[Moon et al., 2016] Moon, S., Lee, J., Kang,M., Choy,M., and Lee, J. (2016). Parallel community
detection on large graphs with mapreduce and graphchi. Data Knowl. Eng., 104:17–31.

[Nath and Roy, 2018] Nath, K. and Roy, S. (2018). A parallel approach to detect communities in
evolving networks. In Mondal, A., Gupta, H., Srivastava, J., Reddy, P. K., and Somayajulu, D.
V. L. N., editors, Big Data Analytics - 6th International Conference, BDA 2018, Warangal, India,
December 18-21, 2018, Proceedings, volume 11297 of Lecture Notes in Computer Science, pages
188–203. Springer.

[Networkx] Networkx. Networkx: Software for complex networks. https://networkx.github.io/.

[Newman, 2003] Newman, M. E. J. (2003). The structure and function of complex networks.
SIAM Rev., 45(2):167–256.

[Prountzos and Pingali, 2013a] Prountzos, D. and Pingali, K. (2013a). Betweenness centrality:
algorithms and implementations. In Nicolau, A., Shen, X., Amarasinghe, S. P., and Vuduc, R. W.,
editors, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
’13, Shenzhen, China, February 23-27, 2013, pages 35–46. ACM.

[Prountzos and Pingali, 2013b] Prountzos, D. and Pingali, K. (2013b). Betweenness centrality:
algorithms and implementations. In Nicolau, A., Shen, X., Amarasinghe, S. P., and Vuduc, R. W.,
editors, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
’13, Shenzhen, China, February 23-27, 2013, pages 35–46. ACM.

[Spark] Spark. https://spark.apache.org/.

[Stanford] Stanford large network dataset collection. https://snap.stanford.edu/data/.

[Tan et al., 2009] Tan, G., Tu, D., and Sun, N. (2009). A parallel algorithm for computing be-
tweenness centrality. In ICPP 2009, International Conference on Parallel Processing, Vienna,
Austria, 22-25 September 2009, pages 340–347. IEEE Computer Society.

[UCI] UCI. The uci network data repository. http://networkdata.ics.uci.edu/.

[Wang et al., 2016] Wang, Y., Davidson, A. A., Pan, Y., Wu, Y., Riffel, A., and Owens, J. D.
(2016). Gunrock: a high-performance graph processing library on the GPU. In Asenjo, R. and
Harris, T., editors, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 11:1–11:12.
ACM.

[Wasserman and Faust, 1994] Wasserman, S. and Faust, K. (1994). Social Network Analysis:
Methods and Applications. Cambridge University Press.

[Yang and Lonardi, 2007] Yang, Q. and Lonardi, S. (2007). A parallel edge-betweenness cluster-
ing tool for protein-protein interaction networks. Int. J. Data Min. Bioinform., 1(3):241–247.

[Zhang et al., 2016] Zhang, Q., Qiu, Q., Guo, W., Guo, K., and Xiong, N. (2016). A social
community detection algorithm based on parallel grey label propagation. Comput. Networks,
107:133–143.

https://networkx.github.io/
https://spark.apache.org/
https://snap.stanford.edu/data/
http://networkdata.ics.uci.edu/

	Introduction
	Related work: Preliminaries
	Distributed programming with Spark
	Social network analysis: Centrality measures
	Community detection algorithms: Girvan and Newman
	Parallel centrality measures and community detection algorithms

	A parallel version of the Brandes algorithm: PBN and PBE
	Results and discussion
	Networks data set
	Scalability
	Software and hardware

	Application of PB algorithm: a new parallel version of the GN algorithm
	Performance of GN-PBE results

	Threats to validity
	Conclusions and Future Work