Showing posts with label graph. Show all posts

Saturday, October 05, 2024

Using Knowledge Graphs to enhance Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based) over Lexical (BM25/TF-IDF based) search, often combining both Lexical and Vector searches into Hybrid search pipelines.

In the past, I have worked on Knowledge Graph (KG) backed entity search platforms, and observed that for certain types of queries, they produce results that are superior / more relevant compared to those produced by a standard lexical search platform. The GraphRAG framework from Microsoft Research describes a comprehensive technique to leverage KG for RAG. GraphRAG helps produce better quality answers in the following two situations.

  • the answer requires synthesizing insights from disparate pieces of information through their shared attributes
  • the answer requires understanding summarized semantic concepts over part of or the entire corpus

The full GraphRAG approach consists of building a KG out of the corpus, and then querying the resulting KG to augment the context in Retrieval Augmented Generation. In my case, I already had access to a medical KG, so I focused on building out the inference side. This post describes what I had to do to get that to work. It is based in large part on the ideas described in this Knowledge Graph RAG Query Engine page from the LlamaIndex documentation.

At a high level, the idea is to extract entities from the question, and then query a KG with these entities to find and extract relationship paths, single or multi-hop, between them. These relationship paths are used, in conjunction with context extracted from the search index, to augment the query for RAG. The relationship paths are the shortest paths between pairs of entities in the KG, and we only consider paths up to 2 hops in length (since longer paths are likely to be less interesting).

Our medical KG is stored in an Ontotext RDF store. I am sure we can compute shortest paths in SPARQL (the standard query language for RDF) but Cypher seems simpler for this use case, so I decided to dump out the nodes and relationships from the RDF store into flat files that look like the following, and then upload them to a Neo4j graph database using neo4j-admin database import full.

# nodes.csv
cid:ID,cfname,stygrp,:LABEL
C8918738,Acholeplasma parvum,organism,Ent
...

# relationships.csv
:START_ID,:END_ID,:TYPE,relname,rank
C2792057,C8429338,Rel,HAS_DRUG,7
...

The first line in each CSV file is the header that informs Neo4j about the schema. Here our nodes are of type Ent and relationships are of type Rel, cid is an ID attribute that is used to connect nodes, and the other elements are (scalar) attributes of each node or relationship. Entities were extracted from the question using our Dictionary-based Named Entity Recognizer (NER) based on the Aho-Corasick algorithm, and shortest paths are computed between each pair of extracted entities (indicated by placeholders _LHS_ and _RHS_) using the following Cypher query.

MATCH p = allShortestPaths((a:Ent {cid:'_LHS_'})-[*..]-(b:Ent {cid:'_RHS_'}))
RETURN p, length(p)

Shortest paths returned by the Cypher query that are more than 2 hops long are discarded, since these don't indicate strong / useful relationships between the entity pairs. The resulting list of relationship paths is passed into the LLM along with the search result context to produce the answer.
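To make the flow above concrete, here is a minimal Python sketch of the inference side. It is not the production code: the toy triples, the function names, and the context layout are all assumptions made for illustration, and the Neo4j query is replaced by an in-memory breadth-first search that returns one shortest path per entity pair (the Cypher query returns all of them).

```python
from collections import deque
from itertools import combinations

# Hypothetical toy KG as (source cid, relation name, target cid) triples.
TOY_TRIPLES = [
    ("C1", "HAS_DRUG", "C2"),
    ("C2", "TREATS", "C3"),
    ("C3", "HAS_SYMPTOM", "C4"),
    ("C4", "SEEN_IN", "C5"),
]

def build_adjacency(triples):
    """Index triples in both directions, mirroring the undirected Cypher match."""
    adjacency = {}
    for src, rel, dst in triples:
        adjacency.setdefault(src, []).append((rel, dst))
        adjacency.setdefault(dst, []).append((rel, src))
    return adjacency

def shortest_path(adjacency, start, goal, max_hops=2):
    """Breadth-first search; returns [node, rel, node, ...] or None when the
    nearest route from start to goal needs more than max_hops hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) // 2 == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [rel, nxt])
    return None

def relation_paths(adjacency, entity_ids, max_hops=2):
    """One short path per entity pair, rendered as a readable string."""
    rendered = []
    for lhs, rhs in combinations(entity_ids, 2):
        path = shortest_path(adjacency, lhs, rhs, max_hops)
        if path is not None:
            # Nodes sit at even positions, relation names at odd positions.
            parts = [p if i % 2 == 0 else f"-[{p}]-" for i, p in enumerate(path)]
            rendered.append(" ".join(parts))
    return rendered

def build_context(search_snippets, paths):
    """Combine search results and KG paths into one context string for the LLM."""
    lines = ["Search results:"] + list(search_snippets)
    lines += ["Knowledge graph relationships:"] + list(paths)
    return "\n".join(lines)

adjacency = build_adjacency(TOY_TRIPLES)
paths = relation_paths(adjacency, ["C1", "C3", "C5"])
print(build_context(["(search snippet placeholder)"], paths))
```

In this toy graph the pair (C1, C5) is four hops apart, so it contributes nothing to the context, while the two 2-hop pairs do.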

We evaluated this implementation against the baseline RAG pipeline (our pipeline minus the relation paths) using the RAGAS metrics Answer Correctness and Answer Similarity. Answer Correctness measures the factual similarity between the ground truth answer and the generated answer, and Answer Similarity measures the semantic similarity between these two elements. Our evaluation set was a set of 50 queries where the ground truth was assigned by human domain experts. The LLM used to generate the answer was Claude-v2 from Anthropic while the one used for evaluation was Claude-v3 (Sonnet). The table below shows the averaged Answer Correctness and Similarity over all 50 queries, for the Baseline and my GraphRAG pipeline respectively.

Pipeline              | Answer Correctness | Answer Similarity
Baseline              | 0.417              | 0.403
GraphRAG (inference)  | 0.737              | 0.758

As you can see, the performance gain from using the KG to augment the query for RAG seems to be quite impressive. Since we already have the KG and the NER available from previous projects, it is a very low effort addition to make to our pipeline. Of course, we would need to verify these results using further human evaluations.
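For a sense of scale, the relative gains implied by the reported averages can be worked out directly. This is just arithmetic on the numbers in the table; the helper name is my own.

```python
# Averaged scores as reported for the 50-query evaluation set.
scores = {
    "Baseline": {"correctness": 0.417, "similarity": 0.403},
    "GraphRAG (inference)": {"correctness": 0.737, "similarity": 0.758},
}

def relative_gain(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

for metric in ("correctness", "similarity"):
    gain = relative_gain(scores["Baseline"][metric],
                         scores["GraphRAG (inference)"][metric])
    print(f"{metric}: {gain:.1%} relative improvement")
```

That works out to roughly a 77% relative improvement in Answer Correctness and 88% in Answer Similarity, on a small evaluation set.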

I recently came across the paper Knowledge Graph based Thought: A Knowledge Graph enhanced LLM Framework for pan-cancer Question Answering (Feng et al., 2024). In it, the authors identify four broad classes of triplet patterns that their questions (i.e., in their domain) can be decomposed to, and addressed using reasoning approaches backed by Knowledge Graphs -- One hop, Multi-hop, Intersection and Attribute problems. The idea is to use an LLM prompt to identify the entities and relationships in the question, then use an LLM to determine which of these templates should be used to address the question and produce an answer. Depending on the path chosen, an LLM is used to generate a Cypher query (an industry standard query language for graph databases originally introduced by Neo4j) to extract the missing entities and relationships in the template and answer the question. An interesting future direction for my GraphRAG implementation would be to incorporate some of the ideas from this paper.

Saturday, March 14, 2020

Music Recommendations using DeepWalk on Spark


The idea behind Distributional Semantics in Natural Language Processing (NLP) can be succinctly summed up by the quote from the famous linguist John Firth -- You shall know a word by the company it keeps. In other words, the semantic meaning of a word can be derived by analyzing the meaning of words it is commonly found with in sentences. This intuition is the basis for neural NLP models such as Word2Vec, a group of models that exploit word co-occurrences in large, publicly available text corpora, to produce word embeddings, which are dense, (relatively) low-dimensional vector representations that encode the meanings of words in these corpora. The principle has been extended to domains other than NLP as well. In the case of Word2Vec, the "company" words keep (or the context of the word) is determined by looking at a large number of word sub-sequences found in sentences in natural text, and training the model to predict the neighbors given a word (Skip Gram), or to predict the word given its neighbors (CBOW). For graph structures, node sequences constructed by doing random walks on the graph can be thought of as being analogous to sentences, and may be used to train Word2Vec like models for the domain represented by the graph. This is the idea behind graph embeddings such as DeepWalk and node2vec. In this post, I will describe a Music Recommender built using DeepWalk embeddings using Apache Spark on Databricks.
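As a small illustration of the sentence analogy, here is a Python sketch of how Skip Gram training pairs fall out of a random walk. The function name and the toy walk are made up for this example, and a real Word2Vec implementation adds sampling and other refinements on top of this.

```python
def skipgram_pairs(sequence, window):
    """Return (center, context) pairs for every element of the sequence,
    pairing it with up to `window` neighbors on each side."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

# A hypothetical random walk over node IDs, treated as a "sentence".
walk = ["n1", "n2", "n3", "n4"]
for center, context in skipgram_pairs(walk, window=1):
    print(center, "->", context)
```

Nodes that keep showing up in each other's windows across many walks end up with similar embeddings, which is what the recommender relies on.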

The data used for this recommender comes from the Amazon product co-purchasing network (March 02 2003) and its associated product metadata. The data was released as part of the paper The Dynamics of Viral Marketing (Leskovec, J., Adamic, L., and Huberman, B., 2007) and is available from the Stanford Network Analysis Project. The Amazon co-purchasing network contains approximately 260 thousand product nodes and 1.2 million co-purchasing edges. From these, I extracted just the nodes categorized as Music, and restricted edges only to those that connected a pair of Music nodes. This resulted in a much smaller graph of about 35 thousand nodes (103 thousand music products from catalog) and 46 thousand co-purchasing edges. I did the filtering because I felt that restricting to a single domain would result in more meaningful recommendations. The other major category in the dataset was Books, with nearly 400 thousand entries in the catalog, but I felt that book co-purchasing might not be as tightly linked to consumer taste as music. The format of the raw files was as follows, tab separated.

  • nodes (id: String, prod_id: String, description: String, category: String)
  • edges (src_id: String, dst_id: String)

The following Spark snippet converts the pair of files into what I call the node neighborhood format, with the immediate neighbor nodes for each node grouped together as a list. The first two blocks are just for reading the TSV file into named Spark DataFrames.

import org.apache.spark.sql.functions.collect_list

val nodeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(nodeFile)
  .withColumnRenamed("_c0", "id")
  .withColumnRenamed("_c1", "prod_id")
  .withColumnRenamed("_c2", "description")
  .withColumnRenamed("_c3", "category")

val edgeDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(edgeFile)
  .withColumnRenamed("_c0", "src_id")
  .withColumnRenamed("_c1", "dst_id")

val nodeNeighborsDF = edgeDF.groupBy("src_id")
  .agg(collect_list("dst_id")
  .alias("neighbor_ids"))

nodeNeighborsDF.write.parquet(nodeNeighborsOutputFile)

The mean length of the neighbor_ids list is about 1.5, with minimum length 1 and maximum length 5. The output format looks like this:

  • node_neighbors (src_id: String, neighbor_ids: List[String])

The next step is to generate random walks using the node_neighbors format. Our co-purchasing network is undirected because a co-purchase edge between nodes A and B is semantically the same as one between nodes B and A. Also, since each co-purchase between a pair of music products is treated as a single edge, the edges are unweighted. The DeepWalk algorithm generates multiple random walks of some specified maximum length starting from each node in the graph. At each node on its random path, the algorithm randomly chooses the next node to go to from the neighbor_ids list.

A sequential implementation would require O(m*N*d*k) computations, where N is the number of nodes in the graph, m is the number of walks to start from each node, d is the average degree of the network, and k is the path length. However, the process of selecting the next node to add to a random walk is dependent only on (the neighbors of) the current node, so we can speed this up by parallelizing it on a platform such as Spark.

The idea is to build up the random walk path Dataset iteratively. Before starting the iteration, the path Dataset is initialized with the src_id column from the node_neighbors Dataset, repeated m times to get the required number of paths per start node. At each iteration, an additional random node is added to all the random walks in the path Dataset. Instead of looking up the neighbors at each row, we leverage Spark's join capability to join the path Dataset with the node_neighbors Dataset using the src_id and the id of the last element in the path, and then randomly choose the next node from the neighbor_ids list, which is another time saving due to Spark. The iterations continue for the maximum specified path length. There may be nodes in the graph for which there are no neighbors, so not all generated random paths will have the same length.

The code below generates the random walks. The case classes specify the formats required for input and output. Output is written out as a Parquet file for the next step.
import scala.util.Random

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{broadcast, size}

case class NeighborRec(src_id: String, neighbor_ids: Array[String])
case class PathRec(tail_src_id: String, path: List[String])

def getRandomElement(xs: Array[String]): String = {
  val random = new Random()
  xs(random.nextInt(xs.length))
}


def generateRandomWalks(nodeNeighborsDS: Dataset[NeighborRec], 
                        numWalksPerStartNode: Int, 
                        pathLen: Int): Dataset[PathRec] = {
  
  var pathDS = nodeNeighborsDS.flatMap(rec => {
    (0 until numWalksPerStartNode).toList.map(j => {
      PathRec(rec.src_id, List(rec.src_id))
    })
  })
  for (i <- 1 until pathLen) {
    val newPathDS = pathDS.joinWith(broadcast(nodeNeighborsDS), 
        pathDS("tail_src_id") === nodeNeighborsDS("src_id"),
        "left_outer")
      .map(rec => {
        val path = rec._1.path
        if (rec._2 != null) {
          val nextNode = getRandomElement(rec._2.neighbor_ids)
          val newPath = path ++ List(nextNode)
          PathRec(nextNode, newPath)
        } else {
          PathRec(rec._1.tail_src_id, rec._1.path)
        }
      })
    pathDS = newPathDS
  }
  pathDS
}


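// Load the node neighborhood data written in the previous step as a typed Dataset
// (assumes spark.implicits._ is in scope, as it is in Databricks notebooks)
val nodeNeighborsDS = spark.read.parquet(nodeNeighborsOutputFile).as[NeighborRec]

val randomWalksDS = generateRandomWalks(nodeNeighborsDS, 20, 10)
randomWalksDS.write.parquet(randomWalksFile)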
The output of this step has the following format. We generated around 630,000 paths with average length 7.7, minimum 2 and maximum 10.
  • random_walks (tail_src_id: String, path: List[String])
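For readers who find the join-based version hard to follow, the same iteration can be mirrored sequentially in plain Python. This is only an illustrative sketch with a made-up toy graph and my own function name; it has none of the parallelism that motivated the Spark version.

```python
import random

def generate_random_walks(neighbors, walks_per_node, path_len, seed=None):
    """Start `walks_per_node` walks at every node that has a neighbor list,
    then extend all walks by one random neighbor per iteration."""
    rng = random.Random(seed)
    walks = [[src] for src in neighbors for _ in range(walks_per_node)]
    for _ in range(path_len - 1):
        for walk in walks:
            candidates = neighbors.get(walk[-1])
            if candidates:  # walks that reach a dead end stay as they are
                walk.append(rng.choice(candidates))
    return walks

# Hypothetical neighborhood data: node C has no outgoing neighbor list.
toy_neighbors = {"A": ["B"], "B": ["C"]}
for walk in generate_random_walks(toy_neighbors, walks_per_node=2, path_len=5, seed=42):
    print(walk)
```

Walks that hit a node without a neighbor list stop growing, which is why the generated paths vary in length.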
Once the random walks are generated, we can treat the node sequences in the path column as sentences to be input into the Word2Vec model. The Spark ML library contains a Word2Vec Estimator that can be trained using these sentences. The only change we make to the default implementation is to use a window size of 3 (3 nodes to the left and 3 nodes to the right of the current node) instead of the default 5 (5 words to the left, 5 words to the right), which seems more suitable to natural language. Here is the code to train the model.
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

// Treat each random walk (the "path" column) as a sentence of node IDs
val word2vec = new Word2Vec()
     .setInputCol("path")
     .setOutputCol("features")
     .setVectorSize(100)
     .setMinCount(0)      // keep every node, however rarely it is visited
     .setMaxIter(100)
     .setWindowSize(3)    // 3 nodes on either side of the current node

val model = word2vec.fit(randomWalksDS)

model.write.overwrite().save(modelFile)
Finally, we can use the trained Word2Vec model to recommend music similar to a given music product, by computing the synonyms of the original music. Embeddings are created as a side effect of the Word2Vec training. As the model trains, it gets better and better at predicting either a word given its context, or its context given the word, based on the type of model being trained. However, what really changes under the hood are the weights of the network for each word in its vocabulary. These weights can be thought of as vectors in a space where semantically similar words clump together and semantically dissimilar words get pushed further apart.

Applying the same analogy to the embeddings from our trained model, we can now find music similar to some given music product by looking in the neighborhood of the given product in the space created by the embeddings. The findSynonyms() call provided by the Spark ML Word2Vec model returns a DataFrame of neighboring words (music products in our case), and the similarity between the source word and the neighbor. The function below wraps the findSynonyms() call, and pulls out the neighbor metadata from the nodes Dataset we saw earlier. As before, the case classes enforce the input and output formats the function will need.
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._

case class SynonymRec(word: String, similarity: Double)
case class ProductRec(id: String, prod_id: String, description: String, category: String)
case class NeighborRec(id: String, prod_id: String, description: String, similarity: Double)

def similarMusic(model: Word2VecModel,
                 nodeDS: Dataset[ProductRec],
                 srcId: String, 
                 numSimilar: Int): Dataset[NeighborRec] = {
  
  val synonymsDS = model.findSynonyms(srcId, numSimilar).as[SynonymRec]
  val similarMusicDS = synonymsDS.joinWith(nodeDS, synonymsDS("word") === nodeDS("id"), "inner")
    .map(rec => NeighborRec(rec._2.id, rec._2.prod_id, rec._2.description, rec._1.similarity))
    .orderBy(desc("similarity"))
  similarMusicDS
}
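Conceptually, the lookup wrapped by the function above is a nearest neighbor search in embedding space. The following Python sketch shows that idea with cosine similarity over a handful of made-up 2-dimensional vectors; the function names and data are hypothetical, and I am assuming cosine similarity as the measure rather than reproducing Spark's internals.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(embeddings, query_id, k):
    """Return the k items closest to query_id, most similar first."""
    query = embeddings[query_id]
    scored = [(other_id, cosine_similarity(query, vector))
              for other_id, vector in embeddings.items()
              if other_id != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical product embeddings; real ones would have 100 dimensions.
toy_embeddings = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
}
print(most_similar(toy_embeddings, "p1", k=2))
```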
It is now simple to generate recommendations for some given music. Here are some examples. As you can see, the recommendations are in the same or similar genres, which the model learned from walking the co-purchase graph.
scala> similarMusic(model, nodeDS, "25551", 10)
     |   .show(10, false) // The Very Best of Motorhead
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|34447 |B000002C1I|All That Matters                     |0.8850399255752563|
|37049 |B00004S95N|Elevation, Vol. 3                    |0.8403890132904053|
|45169 |B000056CDA|Collection                           |0.6613308787345886|
|17489 |B00002SWRF|Penetration                          |0.6495149731636047|
|222717|B00000GAOV|Rita Coolidge                        |0.6456888914108276|
|132023|B00000JN9G|F#¢k Me...I Thought He Was Dead!!!   |0.628462553024292 |
|88642 |B000003A2X|What Goes Around                     |0.6210222244262695|
|132024|B00000JN9E|American Jet Set                     |0.6044375896453857|
|143078|B0000025D7|Don't Let Go                         |0.6024927496910095|
|208504|B0000023U0|South Texas Swing                    |0.6008718013763428|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "25598", 10)
     |   .show(10, false) // Mieczyslaw Horszowski Plays Mozart, Chopin, Debussy, Beethoven 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|23844 |B000008QVX|Sacred Spirit Drums                  |0.9538416266441345|
|50937 |B000006RBJ|Enemigos Intimos                     |0.8765220046043396|
|258208|B000068FUQ|Anthology                            |0.8210484981536865|
|258207|B000068FUU|Sound of Lies                        |0.8157663941383362|
|134531|B00004WFKM|Atmospheres: Celtic Voices           |0.6351345181465149|
|151097|B00004WJEB|Christmas Time Again                 |0.632773756980896 |
|31231 |B000000919|Golden Classics                      |0.603758692741394 |
|138347|B0000032P5|Faithful                             |0.5865736603736877|
|45704 |B0000057OR|Second Sight                         |0.5757307410240173|
|122203|B00008BX5C|Alma                                 |0.5749264359474182|
+------+----------+-------------------------------------+------------------+

scala> similarMusic(model, nodeDS, "1501", 10)
     |   .show(10, false) // Mississippi Hill Country Blues 
+------+----------+-------------------------------------+------------------+
|id    |prod_id   |description                          |similarity        |
+------+----------+-------------------------------------+------------------+
|1502  |B00005IAF6|Time Is the Distance                 |0.8823902606964111|
|174640|B00005NC3Q|Second Chants                        |0.8467361330986023|
|155669|B000068QZR|Gonna Take a Miracle [Expanded]      |0.640330970287323 |
|177533|B0000549WA|A La Hora Que Me Llamen Voy          |0.6273027658462524|
|49286 |B000003AFR|In tha Beginning...There Was Rap     |0.6219795346260071|
|32838 |B00000JC6L|Real Life                            |0.6073424816131592|
|147053|B00004Y9J7|Silent Joy                           |0.6009130477905273|
|50583 |B000003ZTL|Greatest Freestyle Hits: Vol. 4      |0.6003987193107605|
|20414 |B000001SQ1|Horn Quartet of Berlin Philharmonic  |0.5992087125778198|
|75424 |B000063WD9|Greetings from Asbury Park, N.J.     |0.5959932804107666|
+------+----------+-------------------------------------+------------------+
As you can see, the results don't look too bad, and it was not a whole lot of work to get here. Neither Word2Vec nor DeepWalk is a novel concept, but generating random walks for any reasonably sized graph is usually quite a computation-intensive process, so I decided to see if I could use Spark to do this more efficiently. That was the bulk of the work involved in building the recommender. Hopefully you found it interesting, and I hope it helps you build similar recommenders with your own datasets.

Friday, February 14, 2020

Entity Co-occurrence graphs as Mind Maps


Some time ago, as part of a discussion I don't remember much about anymore, I was referred to this somewhat old (Jan/Feb 2018) set of articles about Deutsche Bank and its involvement in money laundering activities.


Now I know as much about money laundering as the average person on the street, which is to say not much, so it was a fascinating and tedious read at the same time. Fascinating because of the scale of operations and the many big names involved, and tedious because there were so many players that I had a hard time keeping track of them as I read through the articles. In any case, I had just finished some work on my fork of the open source NERDS toolkit for training Named Entity Recognition models, and it occurred to me that identifying the entities in this set of articles and connecting them up into a graph might help to make better sense of it all. Sort of like how people draw mind-maps when trying to understand complex information. Except our process is going to be (mostly) automated, and our mind-map will have entities instead of concepts.

Skipping to the end, here is the entity graph I ended up building; it's a screenshot from the Neo4j web console. Red nodes represent persons, green nodes represent organizations, and yellow nodes represent geo-political entities. The edges are shown as directed, but of course co-occurrence relationships are bidirectional (or equivalently undirected).


The basic idea is to find Named Entities in the text using off the shelf Named Entity Recognizers (NERs), and connect a pair of entities if they co-occur in the same sentence. The transformation from unstructured text to entity graph is mostly automated, except for one step in the middle where we manually refine the entities and their synonyms. The graph data was ingested into a Neo4j graph database, and I used Cypher and Neo4j graph algorithms to generate insights from the graph. In this post I describe the steps to convert from unstructured article text to entity graph. The code is provided on GitHub, and so is the data for this example, so you can use them to glean other interesting insights from this data, as well as rerun the pipeline to create entity graphs for your own text.
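The co-occurrence rule itself is tiny. Here is a minimal Python sketch of it, assuming the entities have already been found and grouped per sentence; the function name and the sample sentences are made up for illustration and are not taken from the repository.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentence_entities):
    """Count how often each unordered entity pair shares a sentence."""
    counts = Counter()
    for entities in sentence_entities:
        # De-duplicate within a sentence and sort so (a, b) == (b, a).
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical per-sentence entity lists.
sentences = [
    ["Deutsche Bank", "Donald J. Trump"],
    ["Donald J. Trump", "Deutsche Bank", "US Treasury Department"],
    ["US Treasury Department"],
]
for (a, b), weight in sorted(cooccurrence_edges(sentences).items()):
    print(a, "--", b, "weight:", weight)
```

Sentences with a single entity contribute no edges, and the pair counts can serve as edge weights if you want them.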

I structured the code as a sequence of Python scripts and Jupyter notebooks that are applied to the data. Each script or notebook reads the data files already available and writes new data files for the next stage. Scripts are numbered to indicate the sequence in which they should be run. I describe these steps below.

As mentioned earlier, the input is the text from the three articles listed above. I screen scraped the text into a local text file (selected the article text, copied it, pasted it into a local text editor, and finally saved it into the file db-article.txt). The text is organized into paragraphs, with an empty line delimiting each paragraph. The first article also provided a set of acronyms and their expansions, which I captured similarly into the file db-acronyms.txt.

  • 01-preprocess-data.py -- this script reads the paragraphs and converts them to a list of sentences. For each sentence, it checks to see if any token is an acronym, and if so, it replaces the token with the expansion. The script uses the SpaCy sentence segmentation model to segment the paragraph text into sentences, and the English tokenizer to tokenize sentences into tokens. Output of this step is a list of 585 sentences in the sentences.txt file.
  • 02-find-entities.py -- this script uses the SpaCy pre-trained NER to find instances of Person (PER), Organization (ORG), GeoPolitical (GPE), Nationalities (NORP), and other types of entities. Output is written to the entities.tsv file, one entity per line.
  • 03-cluster-entity-mentions.ipynb -- in this Jupyter notebook, we do simple rule-based entity disambiguation, so that similar entity spans found in the last step are clustered under the same entity -- for example, "Donald Trump", "Trump", and "Donald J. Trump", are all clustered under the same PER entity for "Donald J. Trump". The disambiguation finds similar spans of text (Jaccard token similarity) and considers those above a certain threshold to refer to the same entity. The most frequent entity types found are ORG, PERSON, GPE, DATE, and NORP. This step writes out each cluster as a key-value pair, with the key being the longest span in the cluster, and the value as a pipe-separated list of the other spans. Output from this stage are the files person_syns.csv, org_syns.csv, and gpe_syns.csv.
  • 04-generate-entity-sets.py -- This is part of the manual step mentioned above. The *_syns.csv files contain clusters that are mostly correct, but because the clusters are based solely on lexical similarity, they still need some manual editing. For example, I found the "US Justice Department" and "US Treasury Department" in the same cluster, but "Treasury" in a different cluster. Similarly, "Donald J. Trump" and "Donald Trump, Jr." appeared in the same cluster. This script re-adjusts the clusters, removing duplicate synonyms for clusters, and assigning the longest span as the main entity name. It is designed to be run with arguments so you can version the *_syns.csv files. The repository contains my final manually updated files as gpe_syns-updated.csv, org_syns-updated.csv, and person_syns-updated.csv.
  • 05-find-corefs.py -- As is typical in most writing, people and places are introduced in the article, and are henceforth referred to as "he/she/it", at least while the context is available. This script uses the SpaCy neuralcoref to resolve pronoun coreferences. We restrict the coreference context to the paragraph in which the pronoun occurs. Input is the original text file db-articles.txt and the output is a file of coreference mentions corefs.tsv. Note that we don't yet attempt to update the sentences in place like we did with the acronyms because the resulting sentences are too weird for the SpaCy sentence segmenter to segment accurately.
  • 06-find-matches.py -- In this script, we use the *_syns.csv files to construct an Aho-Corasick Automaton object (from the PyAhoCorasick module), basically a Trie structure against which the sentences can be streamed. Once the Automaton is created, we stream the sentences against it, allowing it to identify spans of text that match entries in its dictionary. Because we want to match any pronouns as well, we first replace any coreferences found in the sentence with the appropriate entity, then run the updated sentence against the Automaton. Output at this stage is the matched_entities.tsv, a structured file of 998 entities containing the paragraph ID, sentence ID, entity ID, entity display name, entity span start and end positions.
  • 07-create-graphs.py -- We use the keys of the Aho-Corasick Automaton dictionary that we created in the previous step to write out a CSV file of graph nodes, and the matched_entities.tsv to construct entity pairs within the same sentence to write out a CSV file of graph edges. The CSV files are in the format required by the neo4j-admin command, which is used to import the graph into a Neo4j 3.5 community edition database.
  • 08-explore-graph.ipynb -- We have three kinds of nodes in the graph, PERson, ORGanization, and LOCation nodes. In this notebook, we compute PageRank on each type of node to find the top people, organizations, and locations we should look at. From there, we select a few top people and find their neighbors. One other feature we built was a search-like functionality, where once two nodes are selected, we show a list of sentences where these two entities co-occur. And finally, we compute the shortest path between a pair of nodes. The notebook shows the different queries, the associated Cypher queries (including calls to Neo4j Graph algorithms), as well as the outputs of these queries. It's probably easier for you to click through and take a look yourself than for me to describe it.
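To give a flavor of the rule-based clustering in step 03, here is a minimal Python sketch of grouping entity mentions by Jaccard token similarity. The greedy strategy, the 0.5 threshold, and the function names are my assumptions for illustration, not necessarily what the notebook does.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercased token sets of two spans."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def cluster_mentions(mentions, threshold):
    """Greedy clustering: visit spans longest first and attach each one to the
    first cluster whose key (its longest span) is similar enough."""
    clusters = []
    for mention in sorted(mentions, key=len, reverse=True):
        for cluster in clusters:
            if jaccard(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters

mentions = ["US Justice Department", "US Treasury Department", "Treasury",
            "Donald Trump", "Donald J. Trump"]
for cluster in cluster_mentions(mentions, threshold=0.5):
    # Key, then the other spans as a pipe-separated list.
    print(cluster[0], "=>", "|".join(cluster[1:]))
```

With these settings the sketch reproduces the kind of error described in step 04: the two departments land in one cluster while "Treasury" ends up on its own, which is why the manual cleanup pass is needed.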

There are obviously many other things that can be done with the graph, limited only by your imagination (and possibly by your domain expertise on the subject at hand). For me, the exercise was fun because I was able to use off the shelf NLP components (as opposed to having to train my own component for my domain) to solve a problem I was facing. Using the power of NERs and graphs allows us to gain insights that would normally not be possible solely from the text.


Sunday, October 20, 2019

Trip Report: Graphorum 2019


I just got back from Graphorum 2019, organized by the folks at Dataversity. The conference was held at The Westin Chicago River North, and colocated with the Data Architecture Summit (DAS), also organized by Dataversity. Attendees were not restricted to talks at one or the other conference; they were allowed, even encouraged, to attend talks at the other conference, perhaps in an effort to popularize Graph concepts among Data Architects, and Data Architecture best practices among Graph practitioners. Both conferences were very heavily multi-track -- at any point, you had around 3-5 choices if you restricted yourself to talks in either track. I attended only the Graphorum talks, so this trip report represents one of 322 million possible unique trip reports (and one of 142 billion possible unique reports if I had not restricted myself), with the naive assumption that all talks within each conference were independent, and at any point, I was equally likely to select any one of the talks offered.

The conference was four days long, starting on Monday last week (October 14 2019) and ending yesterday (October 17). As expected, many of the talks (at least among the ones I attended) were centered around Knowledge Graphs (KGs). Some of these focused on techniques and advice on how to build them from unstructured text, and some focused on making them more effective. Many of the presentations ended up covering the various Linked Data standards such as Resource Description Framework (RDF) for specifying semantic triples, and Web Ontology Language (OWL) for doing inference on them. More than one talk mentioned the new Shapes Constraint Language (SHACL) for validating such RDF graphs. On the other hand, it looks like there is strong industry support for Labeled Property Graphs (LPG) as well, both among database vendors and users. Regardless of the graph flavor, there was also quite a lot of interest in using Machine Learning (ML) and Natural Language Processing (NLP) to leverage the signal inherent in graph structure. In the rest of this post, I will cover the talks that I liked in my 1-in-322M path through the conference. I will also cover some interesting discussions I had with some vendors at the exhibition, and some overall feedback about the conference as a whole.

I arrived late in Chicago on Sunday night, so was unable to take advantage of the early bird registration. At breakfast next morning, I found myself at the end of a fairly long line while the others were relatively empty. In an attempt to understand why, and notwithstanding Murphy's Law, I noticed that the lines were by first character of last name, and unevenly sized (A-E, F-J, K-O, and P-Z). It is possible that the sizing is based on actual attendee last names; in that case I guess people with last names P-Z tend to be relatively last minute types.

Day 1


On the first day, I attended a presentation about Applications of Knowledge Graphs in Enterprise Data Management and AI by Juergen Jakobitsch from Semantic Web Company and Andreas Blumauer from PoolParty. It was very comprehensive, and included lots of background material on Knowledge Graphs. Among some standard applications were semantic search, info-box style results, product recommendations, virtual assistants, etc. One interesting application mentioned was text data augmentation, and (in the same vein) filtering results from chatbots for illogical answers. They also talked about the PoolParty pipeline for converting inputs (structured and unstructured) to Knowledge Graphs, which includes entity resolution and entity linking. This presentation, as well as others throughout the conference, also focused on the various W3C standards such as RDF, SPARQL, OWL2, and SHACL.

I also attended Graphs Transform Chemical Discovery in Pharmaceutical Research presented by Liana Kiff of Tom Sawyer Software. I had initially thought that it would talk about the actual process of Chemical Discovery using graphs, but it turned out to be (an admittedly very powerful) visualization of a graph of chemical entities from the ChEMBL database, including features that allow for manual chemical discovery.

The final presentation of the day was You Must Be This Tall: Machine Learning from Graphs, Tables, Trees, and Documents by Brian Sletten of Bosatsu Consulting, Inc. As with the previous talk, I came in hoping to learn about ML with graphs, but the talk turned out to be about what you need to do before you can do ML with your graphs. Nevertheless, the presenter was very knowledgeable, so the talk ended up being pretty interesting and educational, in spite of it covering a lot of the RDF/SPARQL/OWL ground covered by earlier talks. One good insight here was the distinction between data, knowledge, and wisdom as points on a Data Understanding vs Connected space, coupled with capturing relationships between entities, the context in which these relations exist, and the exploitation of this information using RDF technologies. Another was the need for data lakes to be available as a prerequisite to effective ML in the organization. Yet another thing I liked about his presentation was his example SPARQL queries against Wikidata. He also talked about JSON-LD and how it can be a good substitute for RDF.

Day 2


The first presentation of Tuesday was Ontology Engineering for Knowledge Graphs by Elisa Kendall and Deborah McGuinness. The talk focused on the role of the Ontology Engineer, who is the technical person who talks to the Knowledge Engineer or Domain Expert. The main component of this role is being effective at interviewing the Knowledge Engineer. The talk also covered various aspects of modeling the domain, with many examples drawn from the presenters' own experiences.

The next presentation I attended was Merging Company Data from Multiple Sources with GraphDB and Ontotext Platform by Atanas Kiryakov (their CEO). As expected, the presentation was more about how Ontotext does things. However, Ontotext is one of the first graph database companies I know of that pursued a strong NLP + Graph strategy, so there was lots of interesting content. Among the functionality covered were features around Entity Recognition from text and Entity Disambiguation (from structured data as well as text), the use of different embedding technology (1-hot, 2-hot, BERT, etc), and the use of different technologies such as Random Forests and SVM for entity and document classification, BiLSTM and CRF for Named Entity Recognition, and Inductive Logic Programming (ILP) for rules engines built around the discovered entities and relations.

The final presentation on Tuesday for me was Reveal Predictive Patterns with Neo4j Graph Algorithms (Hands-on) by Amy Hodler and William Lyon. I had learned about Graph Algorithms in Neo4j (and also used some of them for my own talk later in the conference) from the O'Reilly book Graph Algorithms: Practical Examples in Apache Spark and Neo4j (Amy Hodler, one of the presenters, is also one of the authors of this book), so some of it was old material for me. The talk started off with the need for graph algorithms as tools that exploit the additional information implicit in the graph structure, then covered the various classes of graph algorithms (Pathfinding, Centrality, Community Detection, Link Prediction and Similarity), with deep dives into specific algorithms and running them on their Neo4j Desktop product (proprietary product with 30 day free trial, but all the features covered are also available in the free community edition). I ended up learning a few new things, such as how to use virtual graphs (generated as a result of a Cypher query, sort of like views in the RDBMS world), and how to use the Strongly Connected Components algorithm as a debugging tool. They also showed off their NEuler product, which allows forms-based invocation of various algorithms, as well as some very good visualizations. Talking about visualization, William Lyon also mentioned the neo4j-contrib/neovis.js project, which seems interesting as well. Overall, lots of useful information about Neo4j and graphs.

I also learned about the Bridges of Chicago, based on a challenge from the presenters about using Cypher (the Neo4j query language) to find an Eulerian path similar to the Bridges of Königsberg problem. I guess I was the only one that responded, since the problem is much simpler than it appears to be at first glance.
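For readers who have not seen the Königsberg problem, here is a minimal sketch of the underlying check, in Python rather than Cypher. It is not the presenters' challenge or its solution; it only applies Euler's classical criterion that a connected undirected multigraph has an Eulerian path exactly when zero or two of its nodes have odd degree. The function name and the node labels (two river banks A and B, two islands C and D) are illustrative assumptions.

```python
from collections import defaultdict

def has_eulerian_path(edges):
    """Return True if the undirected multigraph given by `edges` has an Eulerian path."""
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].add(v)
        adjacency[v].add(u)
    if not degree:
        return True  # no edges at all: trivially traversable
    # Connectivity check over the nodes that touch at least one edge.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    if len(seen) != len(degree):
        return False
    # Euler's criterion: zero or two nodes of odd degree.
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# Classic Königsberg layout (illustrative labels): seven bridges, four land masses.
KONIGSBERG = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
```

Calling `has_eulerian_path(KONIGSBERG)` returns `False`, since all four land masses have odd degree. The same degree count is easy to express as an aggregation query in a graph database, which is why the problem is simpler than it first looks.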

Exhibitors started setting up their booths today, so I spent some of the coffee breaks and most of the evening talking to various exhibitors. Both Graph database vendors and consultants were well represented among the exhibitors (considering it was a graph + data architecture conference). Graph vendors I knew of included Neo4j, Ontotext, TigerGraph, DataStax, and Stardog. Among those who I learned about at this conference were PoolParty, Semantic Web Company, and Cambridge Semantics. Having attended the presentations from PoolParty, Semantic Web Company, and Ontotext, I spent a lot of time talking with them. I also met up with the folks at TigerGraph, and let them know how helpful their Graph Gurus webinar series has been to me. I also took the opportunity to meet up with the folks at Stardog, who I had met through a reference at another Graph conference a few years earlier. Since I was a speaker here, the conversation also drifted occasionally to the subject of my talk, and what graph database I was using (Neo4j).

Day 3


Wednesday was quite a heavy day in terms of presentations, comparatively speaking. It started with two keynote presentations. The first one was Knowledge Graphs and AI: The Future of Enterprise Data by David Newman from Wells Fargo. He spoke of the progression of looking at Strings to Things to Predicted Things to Vectors, which resonated with me as well, since we are progressing along a very similar path ourselves. He led us through multiple examples involving harmonizing an entity across multiple Knowledge Graphs in the enterprise, the need for classifying entities into a taxonomy, using Knowledge Graphs to predict new relationships, using graph relations for creating digital fingerprints for ML algorithms, etc. His examples referenced the Financial Industry Business Ontology (FIBO), which provides a standard schema for the financial services industry.

The second keynote was Graph Stories: How Four Metaphors can help you decide if Graphs are right for you by Dan McCreary of Optum. While David Newman's presentation was based on RDF style graphs, Dan McCreary is a big proponent of Labeled Property Graphs (LPG), although his choice had several very pragmatic reasons. The four metaphors he described are the Neighborhood Walk, the Knowledge Triangle, the Open World Assumption, and the Jenga Tower. Specifically, the first indicates the importance of relationship traversal in your applications, the second indicates where your application is (or wants to be) on the Data / Information / Knowledge Triangle, the third indicates the ease with which new information can be incorporated into your system, and the fourth indicates the resilience of your query system to small changes in your backend. The keynote also covered the importance of graph structure (Structure is the new gold in data mining), the inter-relationship of Graphs with Deep Learning techniques such as Graph Convolutional Networks (GCNN) and Structured Learning with Graphs.

The next presentation I attended was Knowledge Graphs and Model Driven Classification by Scott Henninger of SmartLogic, where he showed off the capabilities of the SmartLogic platform, which centered around Metadata tagging, document classification (based on the metadata tagging and external taxonomies), and Search Enhancement Services (SES). The bulk of the capabilities seem to be rule based, which can be good for explainability purposes. SmartLogic's KG backend is based on RDF Schema, OWL, and SHACL. An interesting functionality of SmartLogic is to allow the user to manually fine-tune the (term) weights from their classifier. I got quite excited at this, thinking that perhaps this functionality could be leveraged to produce explainable Deep Learning models by perturbing the inputs, but then realized that the intuition is similar to the idea behind LIME - Local Interpretable Model-Agnostic Explanations.

Next up was a talk on How Do You Read Millions of Documents for Meaning using Graph? by Ryan Chandler of Caterpillar, Inc. He described a system he built at Caterpillar that allowed customer support technicians to query a large collection of internal support tickets created by other technicians. The end result is a query-able knowledge base. The text in the support tickets is tokenized and segmented into sentences, tagged with cause, complaint, solution, note, and correction (classification). The document is decomposed into semantic frames, and the document and the associated semantic frames, along with its metadata, are stored in a Neo4j graph database. On the query side, the natural language (NL) query is converted into a graph using a dependency parse, and re-composed into a Cypher query against specific semantic frames (as indicated by the metadata). The Cypher query produces a ranked list of support tickets that best satisfy the NL query. I thought this was quite an interesting technique, although it may be somewhat dependent on the structure of the input data.

The next presentation I attended was Graph Analytics for Enterprise Applications by Melliyal Annamalai, Souripriya Das, and Matthew Perry from Oracle. I came in a few minutes late so I missed the first part, but from what I gathered, it covered Oracle's foray into graph databases -- it turns out that Oracle customers can now start working with SPARQL using SQL Developer, seamlessly against Oracle's new hybrid Property and RDF graph. The functionality is nice, but probably only useful for current and future Oracle customers.

My next presentation was Insufficient Facts always invite Danger: Combat them with a Logical Model by Michael Grove of Stardog, where he described how important it was to have a Logical Model to ensure completeness of your model, and how it can help you avoid problems later.

The evening before I had spent some time at the DataStax booth, mainly for nostalgic reasons since I worked with Cassandra (the Apache version, not the DataStax version) at my previous job, and I was curious about their graph product based on Cassandra (initially called Titan, then Janus). So I attended the presentation Graph Innovations for Distributed Data at Scale by Jonathan Lacefield. The presentation covered the evolution of their graph product, and also answered a nagging question I had about how they implemented the graph in a column-family database under the covers -- turns out that each row is basically the star graph around each node. Other interesting things in this presentation were their use of Gremlin and Spark support through their DataStax extension.
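To make the "each row is the star graph around a node" idea concrete, here is a small Python sketch. It is only an illustration of the general layout described above, not DataStax's actual storage format; the function name `to_star_rows`, the column-key shape, and the sample edges are all hypothetical.

```python
from collections import defaultdict

def to_star_rows(edges):
    """Group directed, labeled edges into one row per source node.

    Each row maps a (label, neighbor) column key to that edge's properties,
    so reading a single row returns the whole star graph around the node.
    """
    rows = defaultdict(dict)
    for src, label, dst, props in edges:
        rows[src][(label, dst)] = props
        rows.setdefault(dst, {})  # nodes without outgoing edges still get a row
    return dict(rows)

# Hypothetical sample data: (source, edge label, target, edge properties).
EDGES = [
    ("alice", "knows", "bob", {"since": 2015}),
    ("alice", "works_at", "acme", {}),
    ("bob", "knows", "alice", {"since": 2015}),
]
```

With this layout, a one-hop traversal from `alice` is a single row lookup (`to_star_rows(EDGES)["alice"]`), which is the kind of access pattern a column-family store handles well.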

The last presentation of the day was Knowledge Graphs and GraphQL in Action: A Practical Approach to using RDF and Semantic Models for Web Applications by Irene Polikoff and Ralph Hodgson of TopQuadrant. They described their Semantic GraphQL interface which provides the user with a GraphQL interface, and converts down to an RDF, OWL, and SHACL query against an RDF triple store.

Finally, the last event of the day was a session about Open Source Knowledge Graph Tooling, which really turned out to be a group of banking folks trying to collaborate around the FIBO Ontology, but it is likely that they might expand to other industries as well in the future. There was also talk about maintaining a current (non-deprecated) list of open source ontologies in various industries, and about applying unit tests to ontologies so they don't become stale and irrelevant, both of which were interesting to me.

The exhibitors were still around, and so I hung around for some more conversations with vendors and fellow attendees for a couple more hours after that. Among them were Cambridge Semantics, who have a fast analytics graph database called AnzoDB.

Day 4


The first presentation of the day was Unsupervised and Supervised ML on Big Graph: Case Studies by Victor Lee. He described various case studies using TigerGraph. The first one was finding influential healthcare providers in various local networks from a specialty network, and finding their influence networks. Another case study had to do with detecting spam phone calls in the China Mobile network, the training data for which consisted of 180+ graph features. The model was a Random Forest classifier. At prediction time, an incoming phone call would be placed in the call graph, the 180+ features computed and fed into the Random Forest model to predict (under 20 milliseconds) whether the call was spam or not spam. The third case study was for Bank Fraud, based on some synthetic data from a Kaggle competition, where TigerGraph engineers built some compound relationships based on edges discovered in the feature graph, which ended up giving good results, showing that the structure of the data provides useful signal. The talk ended with an introduction to Node2Vec, a graph embedding scheme.

The next presentation in my list was my own (Graph Techniques for Natural Language Processing). My presentation was about using Graph techniques (mostly a combination of common third party algorithms) to solve Natural Language Processing (NLP) problems. I covered four case studies that attempted to replicate academic papers (referenced from the Bibliography of Graph-Based Natural Language Processing and Information Retrieval) around document summarization, clustering using language model based vectors, word sense disambiguation, and topic finding. Graph techniques used included various graph centrality metrics (some built-in and some computed using Cypher and built-in algorithms), random walk techniques, Louvain Community Detection, Label Propagation, and Personalized PageRank. Compared to the other presentations, mine was probably a bit unusual, since it focused on NLP more than on graphs, so while I had only about 15-20 attendees, there seemed to be lots of interest, and some very good questions at the end. For those of you who weren't able to make it to the presentation but would like more details, you can find the link to my slides and code (in Jupyter notebooks, with a lot of verbose commentary) at my sujitpal/nlp-graph-examples repository.

I hung around a bit after my presentation answering questions, so I ended up being a bit late to the next presentation, even with the generous coffee break in between. This was When Time Meets Relationships: Understanding an Immutable Graph Database by Brian Platz of Fluree. He makes the case that a Knowledge Graph is a snapshot at a point in time. A time-aware Knowledge Graph can be thought of as an immutable linked list, where facts are added to an append-only log, and made tamper-proof with hashing techniques, much like a private blockchain. The strategy assumes the Knowledge Graph is a triple-store of (Subject, Predicate, Object). As time passes, facts are either retracted or added, so a time-aware tuple would be (Subject, Predicate, Object, Time, Add/Retract). In addition, labeled properties, such as a scheduled expiration date, can be accommodated with an additional Metadata attribute. He also covered some indexing strategies that can make it efficient to query such a time-aware tuple-store.
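The time-aware tuple idea can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not Fluree's API or data model: the class and method names are hypothetical, facts are assumed to be appended in non-decreasing time order, and the hash chain simply feeds each entry's SHA-256 digest into the next one.

```python
import hashlib
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    time: int
    added: bool  # True = assert the triple, False = retract it

def _digest(prev, fact):
    """Hash the previous digest together with the fact's contents."""
    return hashlib.sha256((prev + repr(tuple(fact))).encode("utf-8")).hexdigest()

class ImmutableLog:
    """Append-only fact log in which each entry's hash covers the previous hash."""

    def __init__(self):
        self._entries = []  # list of (Fact, hex digest)

    def append(self, fact):
        prev = self._entries[-1][1] if self._entries else ""
        digest = _digest(prev, fact)
        self._entries.append((fact, digest))
        return digest

    def verify(self):
        """Recompute the hash chain; False means some entry was altered."""
        prev = ""
        for fact, digest in self._entries:
            if _digest(prev, fact) != digest:
                return False
            prev = digest
        return True

    def as_of(self, time):
        """Replay the log up to `time` and return the triples true at that point."""
        triples = set()
        for fact, _ in self._entries:
            if fact.time > time:
                continue
            triple = (fact.subject, fact.predicate, fact.obj)
            if fact.added:
                triples.add(triple)
            else:
                triples.discard(triple)
        return triples
```

Appending an add at time 1 and a matching retract at time 5 means `as_of(3)` still contains the triple while `as_of(5)` does not, so the snapshot at any point in time is recoverable without ever mutating the log. A real system would use indexes rather than a full replay for each query.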

After this, there were two keynotes. The first one was News and Graphs by Peter Olson of NBC News Digital, which covered the graph structure of the NBC News Publishing pipeline, and how NBC leverages graphs to provide news with high velocity and scale. The second keynote was Knowledge Graph Pilot improves Data Quality While Providing a Customer 360 View by Bethany Swhon and Patricia Branum of Capital One, where they described how they improved the quality of their Knowledge Graph to provide a better Customer view across the enterprise.

The final presentation of the day and conference for me was Automated Encoding of Knowledge from Unstructured Text into a Graph Database by Chris Davis of Lymba. The presentation described the Lymba pipeline to convert text into a Knowledge Graph. It includes the usual preprocessing, tokenizing, POS tagging, and segmentation steps other presentations covered (and in some ways seem to be standard knowledge in the text to KG NLP sub-community), but this presentation went one step further and talked about the need for Word Sense Disambiguation, Concept extraction (using gazetteers and NER models), and Syntactic (constituent) and Semantic Parses (dependency) for relation extraction. It also includes Coreference Resolution, which is also quite important but usually omitted from pipelines because of its complexity. The Lymba product provides a turnkey solution plus consulting for various industries.

I had to catch a flight back, and having heard about the Chicago traffic and having faced the zero tolerance for lateness in large airports such as LAX, I didn't want to miss it. So I ended up skipping the last panel discussion on Graphs vs Tables. Turns out I didn't need to, but better safe than sorry.

Conclusions


As conferences go, this was quite luxurious -- attendees were treated to a sumptuous buffet breakfast every day, and a 3 course sit-down lunch for 3 of the 4 days (1 of the days was build-your-own sandwiches, but even that was quite nice). One observation is that sit-down lunches can foster really good and insightful conversations. In addition, there was coffee and snacks throughout the day, and (limited) free drinks for 2 of the 3 evenings. Swag included a Dataversity branded backpack to hold your conference materials, wool hats with the Dataversity logo, stickers, and a pen which contained a USB drive with all the presentation slides, as well as the swag vendors give out at their stalls (to potential clients).

Of course, the nicest thing about conferences (after the presentations) is the conversations with fellow attendees, and the chance to learn from their insights, and what they are doing with the tech under consideration (in this case graphs). I met people from aviation, health (insurance), finance, consulting, the government (from both the covert and the overt branches), as well as scientific publishing. In addition, it was a chance to interact with people from the vendor companies, and bounce ideas off them about specific things they do well. Two insights, both gained at lunch table conversations -- first, RDF has better inter-operability and tooling, but LPGs are easier to work with; second, certain Asian cultures believe that you can never define an object fully, which seems to warrant more of a triple-store structure than the more efficient but constrained graph structure.

Overall, it was good to see Graphs being treated with so much importance. The previous Graph conferences I have attended were much smaller affairs, rarely lasting more than a day. I suppose this might partly be because of the focus on explainable AI, advances in Knowledge Graphs, Graph CNNs and embeddings, as well as the realization that graph structure provides useful exploitable signal, all of which are causing graphs to become more and more important, and graph conferences to become more popular.

If I had to suggest an improvement, I would suggest streamlining the evaluation process. I don't know how many feedback forms were returned (I returned all four that were provided in my conference materials, but not the last global one). Each form takes approximately 5 minutes to complete, so it is tempting to skip it and go to the next session instead. And by the evening, it is harder to do, since you have to refer to your notes instead of relying on short term memory. On the other hand, someone at the door with an iPad who scans your badge and asks you to tap on a smiley versus a frowny icon provides much better coverage (although you would have to interpret the meaning of the feedback). I guess it's the tension between explicit versus implicit feedback; there are tradeoffs either way.


Friday, August 09, 2019

KDD 2019: Trip Report


I had the good fortune last week to attend KDD 2019, or more formally, the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, which was held in downtown Anchorage, AK from August 4-8, 2019. Approximately 3000 participants descended on the city (boosting the population by 1%, as Mayor Berkowitz pointed out in his keynote). This was my first KDD, and I found it somewhat different from other conferences I have attended in the past. First, there seems to be a more equitable representation from academia and industry. To some extent this is understandable, since data is a big part of KDD, and typically, it is industry that collects, and has access to, large and interesting datasets. A side effect is that there is as much emphasis on tools and frameworks as on methods, so by judiciously selecting the sessions you attend, you could load up on the combination that works best for you. Second, there is a surfeit of choice -- at any point in time during the conference, there were 5-8 parallel tracks spread across two conference buildings. While there were some difficult choices to make as to which track to attend and which to drop, I felt I got a lot of value out of the conference. Of course, this does mean that each attendee's experience of the conference is likely to be somewhat personalized by their background and their choices. In this post, I will write about my experience at KDD 2019.

Day 1 (Sunday, August 4) -- Lecture Style Tutorials


My choices here were slightly non-obvious, since I haven't worked directly with hospital data (unless you count the i2b2 and MIMIC datasets), and don't foresee doing so any time soon. However, I had just finished reading the book Deep Medicine by Dr. Eric Topol, and was quite excited about all the cool ideas in the book around using machine learning to assist medical practitioners in hospital settings. There were tutorials that were closer to my work interests, but I figured that it might be good to mix in some exploration with the exploitation, and I decided to attend the tutorials on Mining and model understanding on medical data, and Data mining methods for Drug Discovery and Development.

The first tutorial provided very broad coverage of Medical Data Mining, starting with sources and coding schemes (ICD-10, ATC, etc.), and moving on to various interesting strategies for extracting temporal features from Electronic Health Records (EHR), such as the use of Allen's temporal logic and Itemset disproportionality. Also covered were Learning from Cohorts and the use of Randomized Controlled Trials, and the application of the Dawid-Skene algorithm to Data Fusion, i.e., a process of creating clean features from multiple noisy features, which reminded me of the Snorkel generative model. I also learned that the Dawid-Skene algorithm is equivalent to a one-layer Restricted Boltzmann Machine (RBM) (example code in TensorFlow). Another interesting insight provided by one of the presenters is the timeline for ML techniques -- starting with rules, then simple matrix based techniques (logistic regression, decision trees, SVM, XGBoost, etc), then very briefly probabilistic statistical techniques, rapidly supplanted by Deep Learning techniques. There is of course a move to merge the last two techniques nowadays, so probabilistic techniques are coming back to the forefront. The tutorial was run by Prof. Myra Spiliopoulou and her team from the Otto-von-Guericke-Universität Magdeburg.

The second tutorial provided a broad overview of in-silico drug development. Primary tasks here are Molecular Representation Learning (for example, mol2vec) to allow molecules to be represented in a semantic vector space similar to word embeddings; Molecular Property Prediction that takes a drug molecule and outputs its property; Drug Repositioning that takes a (molecule, protein) pair and outputs an affinity score to indicate how the molecule will interact with the (disease) protein; Adverse Drug Interaction that takes a (molecule, molecule) pair and predicts their interaction; and finally De Novo drug development, which takes a chemical property and outputs a drug molecule. We also learned various schemes for encoding molecules as text, such as 1D, 2D, and 3D encoding, circular fingerprints (ECFPx), SMILES, and adjacency matrix (Bond Adjacency). The main takeaways for me were the ways in which molecules can be encoded and embedded, and the use of Variational Autoencoders (VAE) and grammar constraints to restrict generated drugs to valid ones. The tutorial was run by Cao Xiao and Prof. Jimeng Sun of IQVIA and Georgia Tech respectively.

Day 2 (Monday, August 5) -- Workshops


The workshop I attended was Mining and Learning with Graphs Workshop (MLG 2019). This was an all-day event, with 5 keynote speakers. The first keynote speaker was Prof. Lise Getoor of University of California, Santa Cruz - she gave a shorter version of her RecSys 2018 keynote and mentioned the Probabilistic Soft Logic (PSL) Framework, a framework for developing (specifying rules and optimizing) probabilistic models. Prof. Austin Benson from Cornell spoke of Higher Order Link Prediction (slides). Lada Adamic spoke about interesting Social Network findings based on crunching county-level US demographics data from Facebook. She works for Facebook now, but I remember her as the University of Michigan professor who re-introduced me to Graph Theory after college, through her (now no longer active) course on Social Network Analysis on Coursera. Prof. Vagelis Papalexakis, from the University of California, Riverside, talked about Tensor Decomposition for Multi-aspect Graph Analytics, a talk that even a fellow workshop attendee and PhD student who had a poster accepted thought was heavy. Finally, Prof. Huan Liu of Arizona State University, and the author of the (free to download) book Social Media Mining, spoke very entertainingly about the challenges in mining Big Social Media data, mostly related to feature sparsity and privacy, and possible solutions to these. He also pointed the audience to an open source feature selection library called scikit-feature.

There were quite a few papers (as well as posters) in the workshop that I found interesting. The paper Graph-Based Recommendation with Personalized Diffusions uses random walks to generate personalized diffusion features for an item-based recommender. The Sparse + Low Rank trick for Matrix Factorization-based Graph Algorithms, based on Halko's randomized algorithm, describes a simple way to make matrix factorization more scalable by decomposing the matrix into a sparse and a low-rank component. Graph Embeddings at Scale proposes a distributed infrastructure to build graph embeddings that avoids graph partitioning. The Temporal Link Prediction in Dynamic Networks (poster) uses a SiameseLSTM network to compare pairs of sequences of node embeddings over time. When to Remember where you came from: Node Representation Learning in Higher-Order Networks uses historical links to predict future links.

Finally, I also went round looking at posters from other workshops. Of these, I found Automatic Construction and Natural-Language Description of Nonparametric Regression Models interesting. It attempts to classify time series trends against a library of reference patterns, and then create a vector that can be used to generate a set of explanations for the trend.

This was followed by the KDD opening session, where after various speeches by committee members and invited dignitaries, awards for various activities were given out. Of note was the ACM SIGKDD Innovation Award awarded to Charu Aggarwal, the ACM SIGKDD Service Award for Balaji Krishnapuram, and the SIGKDD Test of Time Award to Christos Faloutsos, Natalie Glance, Carlos Guestrin, Andreas Krause, Jure Leskovec, and Jeanne VanBriesen.

There was another poster session that evening, where I had the chance to see quite a few more posters. Some of these that I found interesting are as follows. Revisiting kd-tree for Nearest Neighbor Search uses randomized partition trees and Fast Fourier Transforms (FFT) to more efficiently build kd-trees with the same level of query accuracy. It caught my interest because I saw something about randomized partition trees, and I ended up learning something interesting. Another one was Riker: Mining Rich Keyword Representations for Interpretable Product Question answering, which involves creating word vectors for questions and using attention maps to predict the importance of each of these words for a given product.

Day 3 (Tuesday, August 6) -- Oral Presentations


The day started with a keynote presentation titled The Unreasonable Effectiveness and Difficulty of Data in Healthcare by Dr Peter Lee of Microsoft Research. To a large extent, his points mirror those made by Dr. Eric Topol in Deep Medicine in terms of what is possible in medicine with the help of ML/AI, but he also looks at the challenges that must be overcome before this vision becomes reality.

Following that, I attended two sessions of Applied Data Science Oral Presentations, one on Auto-ML and Development Frameworks, and the other on Language Models and Text Mining, and then one session of Research Track Oral Presentation on Neural Networks.

I found the following papers interesting in the first Applied Data Science session. Auto-Keras: An Efficient Neural Architecture Search System uses Bayesian Optimization to find the most efficient Dense Keras network for your application. To the user, calling this is a one-liner. Currently this works on legacy Keras, but the authors are working with Google to have this ported to tf.keras as well. A more interesting framework, keras-tuner, currently works with tf.keras, and while invoking keras-tuner involves more lines of code, it does seem to be more flexible as well. TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank is another Learning to Rank (LTR) framework that is meant to be used instead of libraries like RankLib or LambdaMART. It provides pointwise, pairwise, and listwise ranking functions. FDML: A Collaborative Machine Learning Framework for Distributed Learning is meant to be used where learning needs to happen across platforms which are unable to share data either because of volume or privacy reasons. The idea is to learn local models with diverse local features, which will output local results, then combine local results to get the final prediction. In addition, there was a talk on Pythia: AI assisted code completion system that is used in the VSCode editor, and Shrinkage Estimators in Online Experiments, which mentions the PyTorch based Adaptive Experimentation Platform for Bayesian Parameter Optimization.

The second Applied Data Science session was on Language Models. The papers I found interesting in this session are as follows. Unsupervised Clinical Language Translation (paper) uses an unsupervised technique to induce a dictionary between clinical phrases and corresponding layman phrases, then uses a standard Machine Translation (MT) pipeline to translate one to the other. A reverse pipeline is also constructed, which can be used to generate more training data for the MT pipeline. GMail Smart Compose: Real-Time Assisted Writing underlies the phrase completion feature most GMail users are familiar with. It is achieved by interpolating predictions from a large global language model and a smaller per-user language model. As part of this work, they have open sourced Lingvo, a Tensorflow based framework for building sequence models. And finally, Naranjo Question Answering using End-to-End Multi-task Learning Model attempts to infer adverse drug reactions (ADR) from EHRs by answering the Naranjo questionnaire using automated question answering. There was also Automatic Dialog Summary Generation for Customer Service, which uses key point sequences to guide the summarization process, and uses a novel Leader-Writer network for the purpose.
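As a rough illustration of what interpolating a global and a per-user language model could look like, here is a minimal sketch that linearly mixes two next-word distributions. The linear form, the `user_weight` parameter, and the toy probabilities are all assumptions on my part, not details taken from the Smart Compose paper.

```python
def interpolate(global_probs, user_probs, user_weight=0.3):
    """Linearly mix two next-word distributions (dicts of word -> probability).

    user_weight is a hypothetical mixing coefficient in [0, 1]; words missing
    from one model are treated as having probability 0 under that model.
    """
    vocab = set(global_probs) | set(user_probs)
    return {
        word: (1.0 - user_weight) * global_probs.get(word, 0.0)
        + user_weight * user_probs.get(word, 0.0)
        for word in vocab
    }

# Toy next-word distributions after the prefix "Best" (made-up numbers).
global_model = {"regards": 0.6, "thanks": 0.3, "cheers": 0.1}
user_model = {"cheers": 0.9, "thanks": 0.1}

mixed = interpolate(global_model, user_model, user_weight=0.5)
suggestion = max(mixed, key=mixed.get)
```

With equal weighting, the user's strong preference for "cheers" overrides the global model's preference for "regards", while setting `user_weight=0.0` falls back to the global model alone.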

The final oral presentation session for the day was the Research Track on Neural Networks. Unfortunately, I did not find any of the papers useful, in terms of techniques I could borrow for my own work. I did get the impression that Graph based Neural Networks were the new thing, since almost every paper used some form of Graph network. Apart from graph embeddings that are derived from node properties or from conducting random walks on graphs, there is the graph convolution network (GCN), which uses graph local features instead of spatially local features. The paper GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization uses this kind of architecture to detect associations between diseases and genes. Similarly, Origin-destination Matrix prediction via Graph Convolution: A new perspective of Passenger Demand Modeling uses GCNs to predict demand for ride-hailing services.
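To show what "graph local" means in practice, here is a minimal pure-Python sketch of a single graph convolution layer, using the commonly cited symmetric normalization with self loops. It is not the architecture from either paper above; the function name, the toy graph, and the weights are assumptions for illustration.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 . H . W).

    adj is an n x n adjacency matrix, features is n x f, weights is f x o,
    all given as nested lists of floats.
    """
    n = len(adj)
    # Add self loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    n_feat, n_out = len(weights), len(weights[0])
    # Aggregate features from each node's immediate neighbourhood.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(n_feat)] for i in range(n)]
    # Shared linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_feat)))
             for o in range(n_out)] for i in range(n)]

# Toy path graph 0 - 1 - 2, where only node 0 carries a non-zero feature.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]

hidden = gcn_layer(adj, features, weights)
```

After one layer, node 0's signal has reached its direct neighbour (node 1) but not node 2, which is two hops away. Stacking layers widens this neighbourhood, much as stacking image convolutions widens the receptive field.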

The exhibition booths had also opened earlier that day, so I spent some time wandering the stalls, meeting a few people and asking questions about their products. There were a couple of publishers, MIT Press and Springer, selling Data Science books. There were some graph database companies, TigerGraph and Neo4j. Microsoft and Amazon were the two cloud providers with booths, but Google wasn't present (not my observation, it was pointed out to me by someone else). Indeed and LinkedIn were also there. NVIDIA was promoting its RAPIDS GPU-based acceleration framework, along with its GPUs. There were a bunch of smaller data science / analytics companies as well. I picked up a couple of stickers and some literature from the National Security Agency (NSA) and the NVIDIA booths.

I then wandered over to the poster area. I managed to talk to a few people and listen to a few presentations. Notable among them was the poster on Chainer: a Deep Learning Framework for Accelerating the Research Cycle. I haven't used Chainer, but looking at the code in the poster, it looked a bit like Pytorch (or more correctly perhaps, Pytorch looks a bit like Chainer). Another framework to pick up when time permits, hopefully.

Day 4 (Wednesday, August 7) -- Hands-on Tutorials


I came in bright and early, hoping to attend the day's keynote presentation, but ended up having a great conversation with a data scientist from Minnesota instead, as we watched the sun rise across the Alaska range from the third floor terrace of the conference building. In any case, the keynote I planned on attending ended up getting cancelled, so it was all good. For my activity that day, I had decided on attending two hands-on tutorials, one about Deep Learning for Natural Language Processing with Tensorflow, and the other about Deep Learning at Scale on Databricks.

The Deep Learning for NLP with Tensorflow tutorial was taught by a team from Google. It uses the Tensorflow 2.x style of eager execution and tf.keras. It covers basics, then rapidly moves on to sequence models (RNN, LSTM), embeddings, sequence to sequence models, attention, and transformers. As far as teachability goes, I have spent a fair bit of time trying to figure this stuff out myself, then trying to express it in the cleanest possible way to others, and I thought this was the most intuitive explanation of attention I have seen so far. The slide deck is here, and it contains links to various Colab notebooks. The Colab notebooks can also be found at this github link. The tutorial then covers the transformer architecture, and students (in an ideal world with enough internet bandwidth and time) are taught how to construct a transformer encoder-decoder architecture from scratch. They also teach you how to use the pre-trained BERT model from TF-Hub and optionally fine tune it. Because we were not in an ideal world, after the initial few Colab notebooks, it was basically a lecture, where we were encouraged to run the notebooks on our own time.

The Deep Learning at Scale on Databricks tutorial was taught by a team from Databricks, and was apparently Part-II in a two part session. But quite a few of us showed up based on the session being marked as a COPY of the morning session, so the instructor was kind enough to run through the material again for our benefit. The slide deck can be found here. Unfortunately, I can no longer locate the URL for the notebook file archive to be imported into Databricks, but I am guessing these notebooks will soon be available as a Databricks tutorial. We used the Databricks platform provided by Microsoft Azure. In any case, the class schedule was supposed to cover Keras basics, MLFlow, Horovod for distributed model training, and HyperOpt for simultaneously training models on workers with different hyperparameters. We ended up running through the Keras basics very fast, then spending some time on MLFlow, and finally running distributed training with Horovod on Spark. Most people had come to get some hands-on with Horovod anyway, so not being able to cover HyperOpt was not a big deal for most of us.

That evening was also the KDD dinner. I guess a lot of people (including me, based on past ACL conferences) had expected something more formal, but it turned out to be a standup with drinks and hors-d'oeuvres. To be fair, the stand-up model does give you more opportunities to network. However, it was also quite crowded, so after a fairly long time spent in lines with correspondingly little profit, I decided to hit the nearby Gumbo House where I enjoyed a bowl of gumbo and some excellent conversation with a couple of AWS engineers, also KDD attendees who decided to eat out rather than brave the lines. Talking of food, other good places to eat in downtown Anchorage are the Orso, Pangea, and Fletcher's (good pizza). I am sure there are others, but these are the ones I went to and can recommend.

Day 5 (Thursday, August 8) -- More Hands-on Tutorial


This was the last day of the conference. I had a slightly early flight (3 pm), which meant that I would be able to attend only sessions in the first half. In the morning keynote, Prof. Cynthia Rudin of Duke University spoke about her experience with smaller simpler models versus large complex ones, and made the point that it is harder to come up with a simple model because the additional constraints are harder to satisfy. She then showed that it is possible to empirically test for whether one or more simple models are available by looking at accuracies from multiple ML models. Overall, a very thought provoking and useful talk.

For the rest of the day, I chose another hands-on tutorial titled From Graph to Knowledge Graph: Mining Large-scale Heterogeneous Networks using Spark taught by a team from Microsoft. As with the previous hands-on, we used Databricks provided by Azure. The objective was to learn to operate on subsets of the Microsoft Academic Graph, using Databricks notebooks available on this github site. However, since we were all sharing a cluster, there wasn't enough capacity for the students to do any hands-on, so we ended up watching the instructor run through the notebooks on the projector. The initial notebooks (run before lunch) seemed fairly basic, with standard DataFrame operators being used. I am guessing the fun stuff happened in the afternoon after I left, but in any case, Microsoft also offers a longer course From Graph to Knowledge Graph - Algorithms and Applications on edX, which I plan on auditing.

Closing Thoughts


There were some logistical issues that, in hindsight, could perhaps have been avoided. While Anchorage is a beautiful city and I thoroughly enjoyed my time there, for some attendees it was perhaps not as great an experience. One particularly scary problem was that some people's hotel bookings got cancelled due to a mixup with their online travel agents, which meant that they had no place to sleep when they arrived here. Apparently some people had to sleep on park benches, at least until the University of Alaska opened up its dormitory to accommodate the attendees who had nowhere to go. I didn't get accommodation at the "KDD approved" hotels listed on their site either, but I did end up getting a place to stay that was only a 7 minute walk from the conference venue, so I count myself as one of the lucky ones. However, apart from this one major mishap, I think the conference went mostly smoothly.

At RecSys 2018, which I attended last year, one of the people in the group I found myself in said that he had finally "found his people". While my conference experience has been improving steadily over time with respect to the social aspect, and I did end up making a lot more friends at RecSys 2018 than I did here (partly due to the network effect of my colleague and his friends being die-hard extroverts), I do think I have finally found mine at KDD.