Python & Graphs

This is an html version of the a Jupyter notebook I used for a presentation in December 2018 to the San Diego Python Users Group. I've included all the slides but greyed out the ones I didn't present (like this one).

This notebook relies on a number of python packages, d3 (a javascript visualization library), and, optionally, rise, which is a jupyter slideshow add-on.

One of the questions I was asked at the presentations was "Where can I get data to play with this on my own?" The Enron header file is included but here are some other ideas:

Any questions, please contact me at: GitHub or LinkedIn.

This Jupyter Notebook is available on GitHub.

Thanks,
Keith P Jolley

In [1]:
import pandas as pd
import numpy as np
import scipy
import networkx as nx
import matplotlib.pyplot as plt
import heapq
import colorsys
from sklearn.preprocessing import minmax_scale
import json
import sys
from IPython.display import display, Javascript, HTML

%matplotlib inline

color = lambda c,n : "#%02x%02x%02x" % tuple(int(255*c) for c in colorsys.hsv_to_rgb(c/(n+1),1,1))
updt = lambda d1,attr,d2 : [d1[k].update({attr:v}) for k,v in d2.items()]
min_s, max_s = 100,700 # moon units?
min_r, max_r = 6,20    # radius in pixels
plt.ioff()             # have to use 'plt.show()'

def plt2x2(G, beta):
    f, (ax1, ax2) = plt.subplots(1, 2, num=None, figsize=(14, 8), dpi=72, facecolor='w', edgecolor='k')
    Gplt = nx.draw_kamada_kawai(G, ax=ax1,
        node_color=list(nx.get_node_attributes(G, 'color').values()),
        node_size=list(nx.get_node_attributes(G, 'size').values()))
    agaig = plt.imread(beta)
    ax2.imshow(agaig)
    ax2.set_title("What if this is as good as it gets?")
    return f

def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size
In [2]:
%%html
<link rel="stylesheet" href="/notebooks/include/css/style.css" type="text/css">
In [3]:
display(Javascript("require.config({paths: {d3: '/notebooks/include/js/d3.v4.min'}});"))
display(Javascript(filename="include/js/bg.js"))
display(Javascript("""
        (function(element){
            require(['bg'], function(dobg) {          
                dobg();
            });
        })(element);
    """))

Python & Graphs

Keith P Jolley

What's This Talk About?

1. What is a graph?

2. What can graphs be used for?

3. A couple things to keep in mind when working with graphs.

4. How to create graph in a handful of different python packages.

5. A simple analysis using networkx.

6. Visualizing the graphs.

Keith P Jolley

Experience:

  • Jamul Heavy Industries - Owner
  • Technical Account Manager - NetApp
  • Senior Director, IT - Qualcomm

Education:

  • MS Analytics, 2020, Georgia Tech
  • MS Computational Science, 2013, SDSU
  • BS Aerospace Engineering, 1992, SDSU

What's a Graph or Network?

A visual representation of Nodes and Edges that show how things are connected.

Nodes, also called vertices, are the "things" in the network.

Edges, also called arcs, are the "connections" in the network.

Edges can have weight. An edge with weight shows the relative magnitude of the connectivity between two nodes.

Edges can also have direction.

Graph Examples

  • Social Network (LinkedIn: People, Companies, Jobs are nodes, edges are the connections)
  • Email Networks (Google Mail: People, Emails are nodes, sending/receiving emails are the edges)
  • Websites (Google: Pages are nodes, hyperlinks are the edges)
  • Roads (Waze: Intersections are vertices, roads are the edges, "traffic" is the weight)
  • Oracle of Bacon (Actors are the nodes, "credits" are the edges)

You Should Know

Graphs become large quickly

Worst case: $N^2$ where N is the number of nodes.

A social network of all 2.1M Walmart employees is big.

The graph of all ~2B Facebook users is really big.

Google's graph "every page google knows about" is huge.

If your graph has tens of millions or billions of nodes Python, and in particular numpy or networkx, may not be a good choice.

You Should Know...

Specialized tools

There are tools that are made for graphs that may be more appropriate as your graphs increase in size. For example:

Relational databases are a terrible choice for managing relationships!

You Should Know...

Visualization...

Visualizing large graphs is hard. There are no good network visualization tools in Python as of late 2018. From the authors of networkx:

Drawing

In the future, graph visualization functionality may be removed from NetworkX or only available as an add-on package.

Proper graph visualization is hard, and we highly recommend that people visualize their graphs with tools dedicated to that task.

You Should Know...

Visualization...

Hairball. Please don't do this. And I'm not blaming Python for this! terrible graph visualization that looks like a hairball

Let's Build A (directed, weighted) Network Already!

0. Alice sends an email

Edge notation: Alice -> (Bob, Carol, Dan)
Matrix notation:

$\left[ \begin{array}{cccc} & A & B & C & D & E & F \\ A & {\color{grey}0} & 1 & 1 & 1 & {\color{grey}0} & {\color{grey}0} \\ B & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ C & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ D & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ E & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ F & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ \end{array} \right] $
![step 1 graph](include/images/graphs.000.png)

numpy:

In [4]:
Gnp = np.array([[0,1,1,1,0,0],
                [0,0,0,0,0,0],
                [0,0,0,0,0,0],
                [0,0,0,0,0,0],
                [0,0,0,0,0,0],
                [0,0,0,0,0,0]])

1. Carol sends an email

Edge Notatation: Carol -> (Alice, Bob, Dan)
Matrix notation:

$\left[ \begin{array}{cccc} & A & B & C & D & E & F \\ A & {\color{grey}0} & 1 & 1 & 1 & {\color{grey}0} & {\color{grey}0} \\ B & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ C & {\color{blue} 1} & {\color{blue} 1} & {\color{grey}0} & {\color{blue} 1} & {\color{grey}0} & {\color{grey}0} \\ D & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ E & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ F & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ \end{array} \right] $
![step 1 graph](include/images/graphs.001.png)

numpy:

In [5]:
Gnp += np.array([[0,0,0,0,0,0],
                 [0,0,0,0,0,0],
                 [1,1,0,1,0,0],
                 [0,0,0,0,0,0],
                 [0,0,0,0,0,0],
                 [0,0,0,0,0,0]])

2. Carol sends another email

Edge Notation: Carol -> (Erin, Frank)
Matrix notation:

$\left[ \begin{array}{cccc} & A & B & C & D & E & F \\ A & {\color{grey}0} & 1 & 1 & 1 & {\color{grey}0} & {\color{grey}0} \\ B & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ C & {\color{blue}1} & {\color{blue}1} & {\color{grey}0} & {\color{blue}1} & {\color{green}1} & {\color{green}1} \\ D & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ E & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ F & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} & {\color{grey}0} \\ \end{array} \right] $
![step 3 graph](include/images/graphs.002.png)

numpy:

In [6]:
Gnp += np.array([[0,0,0,0,0,0], 
                 [0,0,0,0,0,0],
                 [0,0,0,0,1,1],
                 [0,0,0,0,0,0],
                 [0,0,0,0,0,0],
                 [0,0,0,0,0,0]])

With Sparse Matrix

0. Alice sends an email

Edge notation: Alice -> (Bob, Carol, Dan)

step 1 graph

In [7]:
i2n = ["Alice", "Bob", "Carol", "Dan", "Erin", "Frank"]
n2i = { n:i for i,n in enumerate(i2n) }
weight = [1,1,1]
frm = [n2i["Alice"], n2i["Alice"], n2i["Alice"]]
to_ = [n2i["Bob"],   n2i["Carol"], n2i["Dan"]]
Gsm = scipy.sparse.coo_matrix((weight, (frm, to_)), shape=(6,6))

1. Carol sends an email

Edge Notatation: Carol -> (Alice, Bob, Dan)

step 1 graph

In [8]:
Gsm += scipy.sparse.coo_matrix(([1,1,1], ([2,2,2], [0,1,3])), shape=(6,6))  # no broadcasting w/ sparse matrix

2. Carol sends another email

Edge Notation: Carol -> (Erin, Frank)

step 2 graph

In [9]:
Gsm += scipy.sparse.coo_matrix(([1,1], ([2,2], [4,5])), shape=(6,6))
display(Gsm.toarray())
array([[0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 1, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

With networkx

In [10]:
import networkx as nx
Gnx = nx.DiGraph()
# Alice -> Bob, Carol, Dan
Gnx.add_edge("Alice", "Bob",   weight=1)
Gnx.add_edge("Alice", "Carol", weight=1)
Gnx.add_edge("Alice", "Dan",   weight=1)
# Carol -> Alice, Bob, Dan
Gnx.add_edge("Carol", "Alice", weight=1)
Gnx.add_edge("Carol", "Bob",   weight=1)
Gnx.add_edge("Carol", "Dan",   weight=1)
# Carol -> Erin, Frank
Gnx.add_edge("Carol", "Erin",  weight=1)
Gnx.add_edge("Carol", "Frank", weight=1)

display(Gnx.edges(data=True))
OutEdgeDataView([('Alice', 'Bob', {'weight': 1}), ('Alice', 'Carol', {'weight': 1}), ('Alice', 'Dan', {'weight': 1}), ('Carol', 'Alice', {'weight': 1}), ('Carol', 'Bob', {'weight': 1}), ('Carol', 'Dan', {'weight': 1}), ('Carol', 'Erin', {'weight': 1}), ('Carol', 'Frank', {'weight': 1})])

Equivalence

In [11]:
print('Sparse Matrix:')
print(Gsm)
print(Gsm.toarray())
Sparse Matrix:
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (2, 0)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	1
[[0 1 1 1 0 0]
 [0 0 0 0 0 0]
 [1 1 0 1 1 1]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

Equivalence

In [12]:
print("Is the numpy graph equal to the sparse matrix graph: {}".format((Gnp == Gsm.toarray()).all()))
print("Is the numpy graph equal to the networkx graph: {}".format((Gnp == np.int64(nx.to_numpy_array(Gnx))).all()))
Is the numpy graph equal to the sparse matrix graph: True
Is the numpy graph equal to the networkx graph: True

The point is...

There are different ways of approaching graph problems. Choose what works best for your situation.

Show me something practical!

Hands-on with networkx

Enron Email Database

  • Available online.
  • Has complete text and headers of 3715057 emails.
  • I've parsed the database into a csv file.
  • Also dumped into elasticsearch but not used for this demo.
In [13]:
# NOTE! I gzipped this csv file to be nice w/ github. I recommend
# uncompressing it for better performance. (Or not using a csv at all!)
df = pd.read_csv("./data/enronmails.csv.gz", compression='gzip', sep="\t") # there are better ways of doing this!
print("total edges: ", df.shape[0])
print("nodes: ", len(pd.unique(df[['to', 'from']].values.ravel('F'))), "(This number squared is large!)")
print("sample data:")
display(df.head(n=5))
total edges:  3712732
nodes:  89993 (This number squared is large!)
sample data:
from to mbox subject size
0 rob_tom calxa@aol.com allen-p re: history of lime and cement 950801820
1 rob_tom strawbale@crest.org allen-p re: history of lime and cement 950801820
2 matt strawbale@crest.org allen-p re: concrete stain 947291340
3 billc strawbale@crest.org allen-p re: newsgroups 950175960
4 grensheltr mccormick@elkus-manfredi.com allen-p re: concrete stain 947107440

Create a directed, weighted Graph

  • to:/from: will be the nodes.
  • An edge represents an email was received.
  • ~40k self loops, e.g. (alice -> alice). These would show up on a matrix diagonal.
  • Create a 'weight' of 1 for each email received.
  • A handful of nodes do all the work.
In [14]:
# df_all is the complete network. Create edge weights in pandas.
df_all = df.groupby(['from','to'], as_index=False).size().reset_index().rename(columns={0:'weight'})
G = nx.from_pandas_edgelist(df_all, 'to', 'from', ['weight'])
# multiple ways of finding the most 'connected' nodes
# use 'pagerank'...
pr = sorted(nx.pagerank_scipy(G, weight='weight').values(), key=lambda v: -v)
plt.plot(pr)
plt.title("Enron")
plt.ylabel("pagerank")
del df_all

The point is: typically only a few nodes do all the heavy lifting and the rest take up space.

Find the graph of something interesting

  • Search for "California" in the 'subject:' line.

    "For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results."
    Sergey Brin and Lawrence Page, Stanford University

  • Warning - lots of code about to come at you.
  • Remember - much of the heavy lifting is done by only a few nodes
  • Interesting things happen when you combine graphs/searches.
In [15]:
import community
# Create a df of just the 'california' rows (elasticsearch is your friend!)
df_cal = df[df.subject.str.contains(r'california', case=False)]
df_cal = df_cal.groupby(['from','to'], as_index=False).size().reset_index().rename(columns={0:'weight'})
G = nx.from_pandas_edgelist(df_cal, 'to', 'from', ['weight'])

# Create a 'pagerank' dict and delete all but the 5% most relevant nodes.
pagerank = nx.pagerank_scipy(G, weight='weight')
eicentrality = nx.eigenvector_centrality_numpy(G)
smalls = heapq.nsmallest(round(len(pagerank)*0.95), pagerank, key=pagerank.get)
for k in smalls:
    pagerank.pop(k, None)  # Trim the pagerank/eigenvector lists
    eicentrality.pop(k, None)
    G.remove_node(k)        # remove nodes (and edges) from graph
# This can lead to nodes no longer connected to the main graph. Keep only largest graph.
G = list(nx.connected_component_subgraphs(G))[0]

# Louvain algorithm^H heuristic for community detection
communities = community.best_partition(G)

# dict of dicts: attributes for each node. Node size ~ 'pagerank', color on 'community'
attrs = {k:{'pr':v} for k,v in pagerank.items()}
updt(attrs, 'community', communities)
updt(attrs, 'ei', eicentrality)
updt(attrs, 'color', {k:color(v, max(communities.values())) for k,v in communities.items()})
updt(attrs, 'size',   dict(zip(pagerank.keys(), minmax_scale(list(pagerank.values()),feature_range=(min_s,max_s)))))
updt(attrs, 'radius', dict(zip(pagerank.keys(), minmax_scale(list(pagerank.values()),feature_range=(min_r,max_r)))))
nx.set_node_attributes(G, attrs)
In [16]:
del attrs, pagerank, communities, smalls, df_cal, eicentrality

Using matplotlib

In [17]:
_=plt2x2(G, 'include/images/agaig.jpg' )

Using plotly

In [18]:
import plotly.plotly as py
import plotly.graph_objs as go
In [19]:
pos = nx.kamada_kawai_layout(G)
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=0.5,color='#888'),
    hoverinfo='none',
    mode='lines')

for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
In [20]:
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
         color=[],
         size=[]
    )
)

def htext(node):
    ret  = '<b>' + node + '</b><br>'
    ret += '  pagerank: ' + str(G.nodes()[node]['pr']) + '<br>'
    ret += 'eigen_cent: ' + str(G.nodes()[node]['ei']) + '<br>'
    ret += ' community: ' + str(G.nodes()[node]['community'])
    return ret

for node in G.nodes():
    #print(node)
    x, y = pos[node]
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    node_trace['marker']['color'] += tuple([G.nodes()[node]['color']])
    node_trace['marker']['size']  += tuple([2*G.nodes()[node]['radius']])
    node_trace['text'] += tuple([htext(node)])
In [21]:
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Network graph made with Python/plotly',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0.005, y=-0.002 ) ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
In [22]:
py.iplot(fig, filename='networkx')
Out[22]:

Using d3

This is an interactive graph...

In [23]:
HTML("<div id='d3target'></div>")
Out[23]:

Take-aways

  • Graph operations show up in quite a few areas.
  • Python has a number of tools to help you out.
  • As your graph grows you need to be more careful in your tool selection.
  • Visualizing large graphs is difficult. Judiciously edit.
  • Graphs are a dessert topping and a floor wax.

Fin

In [24]:
# this is needed to draw the d3 graph previously.
display(Javascript(filename="include/js/d3draw.js"))
display(Javascript("""(function(element){
    require(['d3draw'], function(drawgraph) {          
      drawgraph("#d3target", %s, %d, %d);
    });
  })(element);
  """ % (json.dumps(nx.node_link_data(G)), 1000, 600))
)
In [25]:
# to read/write graph data:
#with open('data/G.json', 'w') as outfile:
#    json.dump(nx.node_link_data(G), outfile)
#with open('data/G.json') as data_file:
#    G = nx.node_link_graph(json.load(data_file))