Salmon Run: machine-learning

Saturday, June 28, 2025

Book Review: Hands-On Artificial Intelligence for IoT

For those in similar professional circles as I am, i.e. looking forward into the Generative AI space, yet with one foot pragmatically and firmly stuck in the Machine Learning (ML) and Deep Learning (DL) techniques of the (recent, ok, not very distant) past, Dr Amita Kapoor's recent book Hands-On Artificial Intelligence for IoT: Expert Machine Learning and Deep Learning Techniques for developing smarter IoT systems, 2/ed, published by Packt, is a very useful resource on applying these techniques in the Internet of Things (IoT) domain. My own interest in IoT is driven primarily by previous personal (and failed) forays into Home Automation, but I do have some background in ML and DL techniques. So I approached this book from the perspective of a reader trying to understand the challenges and applications of these techniques in the IoT domain. This perspective shaped my reading of the book, and to some extent this review as well, as I looked for insights that would help me bridge my existing knowledge with the nuances of the IoT domain.

The book is organized into 4 parts. The first part introduces foundational techniques that are common to both the fields of AI (this term includes ML and DL) and IoT, while the second part covers advanced techniques. The third part focuses on specific IoT applications and AI techniques to handle them, while the fourth part covers IoT applications at different levels of granularity (personal/home, industrial, smart cities, etc.). The book is quite large (approximately 400 pages) and covers a lot of ground, some of which you may already be familiar with depending on your background. However, even in those cases, it may be worthwhile to skim the text to make sure you don't miss something you didn't know about, since things move quickly in this field. In any case, I present below my summary of each chapter, organized into a loose table of contents type structure. Hopefully they help you make the decision to read versus skim and optimize your reading experience.

Part I: Principles and Foundations of IoT and AI
- Principles and Foundations of IoT and AI -- covers the theoretical foundations of IoT (think ISO network stack), various applications, and the necessity of using Big Data techniques and ML. It concludes with a list of tools used in the text, which includes Keras 3.0 to support DL in IoT applications.
- Data Access and Distributed Processing for IoT -- this chapter covers processing data in various formats (text, CSV, Excel, JSON, HDFS, and various SQL and NoSQL databases) using Python. This is because IoT devices often present data in proprietary formats, and you need to be able to read it into your application.
- Machine Learning for IoT -- covers traditional ML algorithms such as Naive Bayes, Logistic Regression, Decision Trees, SVM, etc (remember my quip about having one foot firmly in the distant ML past? This is about as far back as you would go), and one example using a simple DL model. Even though these may not be on par with more recent models such as BERT or small LLMs, they are typically deployed for solving simpler problems that have lower latency requirements, and are often adequate for the problem at hand.
Part II: Advanced AI Techniques and their application in IoT
- Deep Learning for IoT -- introductory DL chapter, covers DL basics, CNN, RNN and AutoEncoders. It also provides a brief description of OpenVINO for IoT vision applications and TinyML for low-power on-device analytics, and using Keras Tuner for Hyperparameter Tuning.
- Techniques for IoT -- explores alternative optimization techniques to Gradient Descent (GD) such as Simulated Annealing and Swarm Optimization. Also covers the use of Evolutionary and Genetic Algorithms (EA and GA) using libraries such as PyGAD and DEAP. While not mentioned explicitly, I will guess that EA/GA are included here because they are less resource intensive compared to GD, and can often be more efficient depending on application.
- Reinforcement Learning for IoT -- this chapter covers the basics of Reinforcement Learning (RL), Q-Learning (DQN, DDQN), Policy Gradients, etc. As before, RL based training can be particularly suitable for IoT applications because they are physics based and reinforcement signals can be cheaper to obtain and more relevant compared to supervision signals.
- Generative Models for IoT -- this chapter covers Generative Adversarial Networks (GAN) and Variational AutoEncoders (VAE), which are probably not the Generative Models you had in mind if you are in the current "GenAI" space, but these are the OG models that generate images from noise (rather than the next token from a stream of tokens). Primarily their utility in the IoT space seems to be data generation and simulation (GAN) and anomaly detection (VAE).
Part III: Implementing Intelligent IoT Solutions in Diverse Domains
- Distributed Learning using Keras -- this chapter covers Distributed training using Keras 3 (using the JAX backend). This is useful information if you were just curious about Keras 3 distributed capabilities. The relevance of this to the IoT space is that training data may be aggregated from multiple edge devices, say for recommendations, or multiple resource constrained edge devices may be used to retrain on new data, such as maintenance models in industrial IoT systems.
- AI Cloud Platforms for IoT -- covers the need for Cloud based APIs in the context of IoT, and IoT adjacent services provided by popular providers such as AWS, Azure and Watson. Also covers these providers from the point of view of ML services, including Google VertexAI and AutoML, AWS SageMaker and Bedrock, and IoT specific services such as AWS IoT Core, Azure IoT Hub and GCP IoT Core.
- Deep Learning for Time Series Data from IoT -- covers working with time series data using traditional algorithms such as Prophet and Spark-ML, with recurrent neural networks (RNN), and using pre-trained Temporal Convolutional Networks (TCN) models such as Chronos. This is particularly relevant since IoT devices emit streams of data over time that can be analyzed and extrapolated to predict the future.
- Leveraging AI for Visual Data from IoT -- covers the processing of visual data from IoT systems, including image segmentation and object detection and classification. Architectures covered include CNN, TCN, and ViT (Vision Transformers).
- AI for Text, Audio and Speech Data from IoT -- IoT devices can listen for particular sounds or speech patterns in their input, so this chapter covers mechanisms for IoT devices to process speech and audio, as well as free-form text input from users.
Part IV: Applying AI and IoT in Real-World Scenarios
- AI for Personal and Home IoT -- mainly covers Personal and Home IoT applications, and considerations for creating them, along with a case study on a Smart Home implementation. It also includes pointers on getting started on your own IoT projects.
- AI for IIoT -- there are already many IoT applications in use in industrial environments, and this chapter describes instances of these in various industries. Application areas are not only in manufacturing support, but could also be for preventative maintenance and forecasting load.
- AI for Smart Cities IoT -- I felt initially that this may be a bit of an aspirational chapter, in the sense that the typical reader of this book is unlikely to be in a position to influence the use of AI for smart cities, but the examples proved me wrong. Many of these are examples of smart solutions to everyday problems that are well within the realm of influence of people working for cities or local government, directly or indirectly.

In summary, I found this book to be a comprehensive resource to understand the concepts behind IoT applications. Its breadth of coverage is truly impressive -- spanning essential principles of IoT and AI, traversing through machine learning, deep learning, and optimization techniques, and culminating in thorough discussions on real-world deployments across domains such as smart homes, industrial IoT, and smart cities. While the book’s extensive coverage of fundamentals in areas like machine learning and distributed processing may at times feel broader than strictly necessary for readers already well-versed in these fields, it ensures that the material remains accessible to a broader spectrum of readers.

The progression of chapters from core principles to practical case studies equips readers with a strong theoretical foundation as well as a practical understanding of how intelligent systems can be implemented in the IoT space. The inclusion of dedicated chapters on time series analysis, computer vision (CV), and Natural Language and Audio processing offers readers additional perspective in these areas. While I don't see an IoT application in my immediate future, it was an interesting read, and having read it, I feel more confident about being able to tackle one should it come about.

Tuesday, December 31, 2024

Packaging ML Pipelines from Experiment to Deployment

As ML Engineers, we are generally tasked with solving some business problem with technology. Typically it involves leveraging data assets that your organization already owns or can acquire. Generally, unless it is a very simple problem, there would be more than one ML model involved, maybe different types of models depending on the sub-task, maybe other supporting tools such as a Search Index or Bloom Filter or third-party API. In such cases, these different models and tools would be organized into an ML Pipeline, where they would cooperate to produce the desired solution.

My general (very high level, very hand-wavy) process is to first convince myself that my proposed solution will work, then convince my project owners / peers, and finally to deploy the pipeline as an API to convince the application team that the solution solves the business problem. Of course, generating the initial proposed solution is a task in itself, and may need to be composed of multiple sub-solutions, each of which needs to be tested individually as well. So very likely the initial "proposed solution" is a partial bare-bones pipeline to begin with, and improves through successive iterations of feedback from the project and application teams.

In the past, I have treated these phases as largely disjoint, with each phase built (mostly) from scratch with a lot of copy-pasting of code from the previous phase. That is, I would start with notebooks (on Visual Studio Code of course) for the "convince myself" phase, copy-paste a lot of the functionality into a Streamlit application for the "convince project owners / peers" phase, and finally do another round of copy-pasting to build the backend for a FastAPI application for the "convince application team" phase. While this works in general, folding iterative improvements into each phase gets to be messy, time-consuming, and potentially error-prone.

Inspired by some of my fellow ML Engineers who are more steeped in Software Engineering best practices than I am, I decided to optimize the process by making it DRY (Don't Repeat Yourself). My modified process is as follows:

Convince Yourself -- continue using a combination of notebooks and short code snippets to test out sub-task functionality and compose sub-tasks into candidate pipelines. Focus is on exploration of different options, in terms of pre-trained third party models and supporting tools, fine-tuning candidate models, understanding the behavior of the individual components and the pipeline on small subsets of data, etc. There is no change here; the process can be as organized or chaotic as you like. If it works for you, it works for you.

Convince Project Owners -- in this phase, your audience is a set of people that understand the domain very well, and are generally interested in how you are solving the problem, and how your solution will behave in weird edge cases (that they have seen in the past and that you may not have imagined). They could run your notebooks in a pinch but they would prefer an application-like interface with lots of debug information to show them how your pipeline is doing what it is doing.

Here the first step is to extract and parameterize functionality from my notebook(s) into functions. Functions would represent individual steps in a multi-step pipeline, and should be able to return additional debug information when given a debug parameter. There should also be a function representing the entire pipeline, composed of calls to the individual steps. This is also the function that would deal with optional / new functionality across multiple iterations through feature flags. These functions should live in a central model.py file that would be called from all subsequent clients. Functions should have associated unit tests (unittest or pytest).
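To make the shape of such a model.py concrete, here is a minimal sketch. The two steps (tokenize and filter_stopwords), the use_stopword_filter feature flag and the run_pipeline name are all made up for illustration and do not come from the actual project; the only things taken from the description above are the debug parameter on each step and the single pipeline function that composes the steps.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stopword list, purely for illustration.
STOPWORDS = {"the", "a", "of"}


def tokenize(text: str, debug: bool = False) -> Tuple[List[str], Dict[str, Any]]:
    """Step 1 (hypothetical): lowercase and split on whitespace."""
    tokens = text.lower().split()
    info = {"num_tokens": len(tokens)} if debug else {}
    return tokens, info


def filter_stopwords(tokens: List[str], debug: bool = False) -> Tuple[List[str], Dict[str, Any]]:
    """Step 2 (hypothetical): drop stopwords, optionally reporting what was dropped."""
    kept = [t for t in tokens if t not in STOPWORDS]
    info = {"removed": [t for t in tokens if t in STOPWORDS]} if debug else {}
    return kept, info


def run_pipeline(text: str, debug: bool = False,
                 use_stopword_filter: bool = True) -> Dict[str, Any]:
    """Whole pipeline: the only function the clients (app, batch job, API) call."""
    trace: Dict[str, Any] = {}
    tokens, trace["tokenize"] = tokenize(text, debug=debug)
    if use_stopword_filter:  # feature flag for optional / newer functionality
        tokens, trace["filter_stopwords"] = filter_stopwords(tokens, debug=debug)
    result: Dict[str, Any] = {"tokens": tokens}
    if debug:
        result["debug"] = trace  # per-step debug info, keyed by step name
    return result
```

With this layout, a client that wants the debug view calls run_pipeline(text, debug=True) and renders result["debug"], while the batch and API clients call the same function with debug left at False.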

The Streamlit application should call the function representing the entire pipeline with debug turned on. This ensures that as the pipeline evolves, no changes need to be made to the Streamlit client. Streamlit provides its own unit testing functionality in the form of the AppTest class, which can be used to run a few inputs through it. The focus is more on ensuring that the app does not fail when run non-interactively, so the tests can be run on a schedule (perhaps by a GitHub Action).

Convince Project Team -- while this is similar to the previous step, I think of it as having the pipeline evaluated by domain experts in the project team against a larger dataset than what was achievable on the Streamlit application. We don't need as much intermediate / debugging information to illustrate how the process works. The focus here is on establishing that the solution generalizes for a sufficiently large and diverse set of data. This should be able to leverage the functions in the model we built in the previous phase. The output expected for this stage is a batch report, where you call the function representing the pipeline (with debug set to False this time), and format the returned value(s) into a file.
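As a rough sketch of the batch report step: the run_pipeline function below is a hypothetical stand-in for the pipeline function in model.py (here it just labels inputs as "short" or "long"), and write_batch_report and the CSV layout are likewise assumptions, not the format used in the actual project.

```python
import csv
from typing import Any, Dict, Iterable, TextIO


def run_pipeline(text: str, debug: bool = False) -> Dict[str, Any]:
    """Hypothetical stand-in for the real pipeline function in model.py."""
    return {"label": "long" if len(text.split()) > 3 else "short"}


def write_batch_report(inputs: Iterable[str], out_file: TextIO) -> int:
    """Run the pipeline (debug off) over all inputs and write one CSV row each."""
    writer = csv.writer(out_file)
    writer.writerow(["input", "label"])
    count = 0
    for text in inputs:
        result = run_pipeline(text, debug=False)
        writer.writerow([text, result["label"]])
        count += 1
    return count
```

Something like `with open("report.csv", "w", newline="") as f: write_batch_report(texts, f)` would then produce a file that domain experts can review in a spreadsheet.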

Convince Application Team -- this would expose a self-describing API that the application team can call to integrate your work into the application solving the business problem. This is again just a wrapper for your function call to the pipeline with debug set to False. Having this up as early as possible allows the application team to start working, as well as provide you valuable feedback around inputs and outputs, and point out edge cases where your pipeline might produce incorrect or inconsistent results.

I also used the requests library to build unit tests for the API. The objective is just to be able to test from the command line that it doesn't fail.

There is likely to be a feedback loop back to the Convince Yourself phase from each of these phases as inconsistencies are spotted and edge cases are uncovered. These may result in additional components being added to or removed from the pipeline, or their functionality changed. These changes should ideally only affect the model.py file, unless we need to add additional inputs, in which case they would also affect the Streamlit app.py and the FastAPI api.py.

Finally, I orchestrated all these using Snakemake, which I learned about at the recent PyData Global conference I attended. This allows me to not have to remember all the commands associated with running the Streamlit and FastAPI clients, running the different kinds of unit tests, etc, if I have to come back to the application after a while.

I implemented this approach on a small project recently, and the process was not as clear cut as I described. There was a fair amount of refactoring as I moved from "Convince Project Owners" to "Convince Application Team". However, it feels less like a chore than it did when I had to fold in iterative improvements using the copy-paste approach. I think it is a step in the right direction, at least for me. What do you think?

Saturday, March 23, 2024

Book Report: Machine Learning for Drug Discovery

Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNN) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings of which I recycled into my Deep Learning with Graphs tutorial at ODSC 2021. Of course, drug discovery is much more than just GNNs, I mention this only because this happened to be my entry point into this fascinating world. However, I will hasten to add that despite having made an entrance, I am still parked pretty solidly close to the entrance (or exit, depending on your point of view).

But I am always looking to learn more about stuff I find interesting, so when I was offered a chance to review Dr Noah Flynn's Machine Learning for Drug Discovery published by Manning, I jumped on it. The book is currently in MEAP (Manning Early Access Program) so currently there are only 5 chapters available, but once the book is completed, there are going to be 15 chapters in all. The intended audience of the book, as the title suggests, is computational biochemists, i.e. the ones who attempt to solve Drug Discovery problems using Machine Learning. There are two main ways to become a computational biochemist -- either you are a biochemist and you learn the ML, or you are an ML person and you learn the biochemistry. The book is aimed at both categories of readers.

As someone in the latter category, I had to spend much more time on the biochemistry aspects. I suspect that most readers of this review would also fall into this category. For them, I would say that while the ML part is sophisticated enough to solve the problem at hand, it consists of methods and practices that should be familiar to most ML people already. The most useful things that I think you would get out of this book are as follows:

Framing the Drug Discovery problem as a ML problem
Preprocessing and Encoding inputs
Getting data to train your ML model

For the first one, you either need to have a biochemistry background yourself, or you need to pair with someone who does. I suppose you could get by with a life sciences or chemistry background as well, or acquire enough biochemistry knowledge over time in this field, and this book may even take you part of the way there, but be aware that the learning curve is steep.

For the second and the third items, I thought the book was super useful. Most chapters are built as case studies around a Drug Discovery problem, so as you go through the chapters, you will learn about the sites to acquire your datasets from, and the techniques to preprocess the data from these sites into a form suitable for consumption by your ML model. At least the first 5 chapters deal with fairly simple ML models, which may or may not be familiar to you depending on your industry, so you might also learn a few things about evaluating or tuning these models that you didn't know before (I did).

The first chapter introduces the reader to the domain and talks about the need for computational approaches to Drug Discovery. It introduces the terminology and the RDKit software library, an open-source cheminformatics toolkit that provides implementations of many common operations needed for computational Drug Discovery (sort of like a specialized supplement to Scikit-Learn for general ML). It also covers high level rules of thumb for detecting drug compounds, such as Lipinski's rule of 5. It then covers some use cases common in Drug Discovery, ranging from Virtual Screening to Generative and Synthetic Chemistry. It also covers some popular (and public) repositories for Chemistry data, such as ChEMBL, PubChem, Protein Data Bank (PDB), etc.

The second chapter demonstrates Ligand based Screening, where you already have a reference molecule with some of the desired properties, and you want to search the chemical space for molecules similar to that one, with the objective of finding more drugs like the one you started with. The case study here is to identify potential anti-malarial compounds. The dataset for this comes packaged with RDKit itself as Structure-Data Files (SDF), which describe each molecule using a SMILES (Simplified Molecular Input Line Entry System) string. The chapter walks us through converting the SMILES to MOL format, then using RDKit to extract specialized chemical features from the MOL and SMILES, and preprocessing to filter out uninteresting molecules based on rule based thresholds such as bio-availability, molecular weight, etc, structure based thresholds such as toxicity, and specific substructural patterns (similar to subgraph motifs). It then uses RDKit to generate Morgan fingerprints out of the remaining molecules (MOL). Morgan (and other) fingerprints are similar to embeddings in NLP, except that they encode structural information through a more deterministic process, and are hence more explainable than embeddings. Finally, these fingerprints are compared with that of the reference molecule using Tanimoto similarity, and the nearest neighbors are found.
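To give a feel for the last step, here is a small sketch of Tanimoto similarity and nearest neighbor lookup over fingerprints. It is not code from the book: fingerprints are represented as plain Python sets of "on" bit positions (in practice they would come from RDKit), and the molecule names and bit values are invented. For two bit sets A and B, the Tanimoto similarity is |A ∩ B| / |A ∪ B|.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def nearest_neighbors(reference: Set[int], candidates: Dict[str, Set[int]],
                      k: int = 2) -> List[Tuple[str, float]]:
    """Return the k candidates most similar to the reference, best first."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up fingerprints, purely for illustration.
reference = {1, 4, 7, 9}
candidates = {
    "mol_a": {1, 4, 7},
    "mol_b": {2, 3},
    "mol_c": {1, 4, 7, 9, 12},
}
print(nearest_neighbors(reference, candidates))
# [('mol_c', 0.8), ('mol_a', 0.75)]
```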

Chapter 3 continues with the problem of Ligand based screening, but tries to predict cardiotoxicity of the anti-malarial compounds found in the previous chapter using a linear model. This is done indirectly by predicting whether the compound blocks the hERG (potassium) channel: if it does, it is considered cardiotoxic, and vice versa. A linear model (Scikit-Learn SGDClassifier) is trained using the hERG dataset from the Therapeutics Data Commons (TDC). The chapter shows some Exploratory Data Analysis (EDA) on the data, using standard preprocessing as described in the previous chapter. An additional step here is to standardize (regularize) the data for classification. The author provides the biochemistry reasoning behind this step, but uses the implementation already provided by RDKit. Finally Morgan fingerprints are used to train the SGD Classifier. Because the elements of Morgan fingerprints have meaning, the weights of the resulting SGD model can be used to determine feature importances. There is also some discussion here of cross validation, L1/L2 regularization, removing collinearity, adding interaction terms and hyperparameter sweeps.

Chapter 4 explores building a linear regression model to predict solubility, i.e. how much of the drug would be absorbed by the system. The dataset used to train the regressor is the AqSolDB, also from TDC. This chapter introduces the idea of scaffold splitting, a technique common with biochemical datasets that preserves the structural / chemical similarity within each split. It also briefly describes outlier removal at the extremes, which requires chemistry knowledge. The RDKit library is used to extract features from the dataset, and the model trained to minimize the Mean Squared Error loss. The RANSAC (RANdom SAmple Consensus) technique is introduced that makes models more robust to outliers. On the ML side, there is some discussion on the bias-variance tradeoff and Learning / Validation curves.
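The idea behind scaffold splitting can be sketched as follows. This is my own illustration rather than the book's code: it assumes the scaffold for each molecule has already been computed (for example with RDKit) and is available as a string, and it uses a simple greedy assignment of whole scaffold groups, largest first. The scaffold names in the example are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split(scaffolds: Sequence[str],
                   train_frac: float = 0.8) -> Tuple[List[int], List[int]]:
    """Split molecule indices so that no scaffold appears in both train and test."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by scaffold name to keep this deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    train_cutoff = train_frac * len(scaffolds)
    train: List[int] = []
    test: List[int] = []
    for _, members in ordered:
        # A whole group goes to train if it still fits, otherwise to test.
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Placeholder scaffold labels, one per molecule.
scaffolds = ["benzene", "benzene", "pyridine", "benzene", "indole",
             "pyridine", "indole", "furan", "benzene", "pyridine"]
train_idx, test_idx = scaffold_split(scaffolds, train_frac=0.8)
```

Because whole groups are assigned together, structurally similar molecules stay on the same side of the split, so the test set measures how well the model generalizes to scaffolds it has not seen.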

The fifth and last chapter of the MEAP (at the time of writing this review) deals with predicting how well the body will metabolize the drug. Typically, drugs are broken down by enzymes in the liver, a large proportion of which are collectively known as the Cytochrome P450 superfamily. As before, metabolism is predicted indirectly by whether the drug inhibits Cytochrome P450 -- if it does, then it will not get metabolized easily, and vice versa. The dataset used to train the model is the CYP3A4 dataset, also from TDC. Data is prepared using the same (by now) standard pipeline, and the classifier is trained to make binary predictions of whether the input inhibits Cytochrome P450 or not. The chapter discusses the utility of Reliability Plots in Performance Evaluation and Platt scaling for calibrating probabilities. It also talks about how to deal with imbalanced datasets, covering Data Augmentation, Class Weights and other approaches to deal with class imbalance. Various models are trained and evaluated, and their important features identified and visualized with RDKit Similarity Map. The chapter ends with a short discussion on Multi-label classification.
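For readers who have not seen a Reliability Plot before, the numbers behind one can be computed as in the sketch below. This is a generic illustration with made-up labels and probabilities, not the book's code: predictions are bucketed into equal-width probability bins, and for each non-empty bin we compare the mean predicted probability with the observed fraction of positives. For a well calibrated classifier the two are close.

```python
from typing import List, Sequence, Tuple


def reliability_curve(y_true: Sequence[int], y_prob: Sequence[float],
                      n_bins: int = 5) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, fraction of positives) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[idx].append((label, prob))
    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_prob = sum(p for _, p in members) / len(members)
        frac_pos = sum(l for l, _ in members) / len(members)
        curve.append((mean_prob, frac_pos))
    return curve


# Made-up example: the low-probability bin is over-confident about negatives.
print(reliability_curve([0, 1, 1, 1], [0.1, 0.1, 0.9, 0.9], n_bins=2))
# [(0.1, 0.5), (0.9, 1.0)]
```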

The pandemic and the rapid discovery of the COVID vaccine gave a lot of us (at least those of us that were watching) a ringside view into the fascinating world of drug discovery. This book provides yet another peek into this world, with its carefully crafted case studies and examples. Overall, I think you will learn a lot about drug discovery if you go through this book, both on the biochemistry side and the ML side. There are exercises at the end of each chapter, and doing these would help you get more familiar with RDKit and hopefully more effective at computational drug discovery.

Saturday, October 07, 2023

A PySpark idiom for efficient Model Inference

I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the text to encoding is one to one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N splits, and encoding them using N parallel workers.

However, LMs (and Machine Learning (ML) models in general) usually take some time to initialize before they are ready for use. This initialization step loads the model's parameters (multi-dimensional tensors of weights learned during the training process) into memory. So it is not really feasible to do something like this:

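from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator

import numpy
from pyspark.sql import Row

@dataclass
class Document:
    content: str
    metadata: Dict[str, Any]
    embedding: numpy.ndarray

def encode_row(row: Row) -> Row:
    # anti-pattern: the model gets initialized once for every row
    model = initialize_model()
    row.embedding = model.encode(row.content)
    return row

data_rdd = data_rdd.map(lambda row: encode_row(row))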

This is because it would require the model to be initialized for each row in our RDD, which can be very time-consuming. We can address this by initializing it on the driver and broadcasting it to all the workers, something I have done in the past.

def encode_row(row: Row) -> Row:
    model = bc_model.value
    row.embedding = model.encode(row.content)
    return row

model = initialize_model()
bc_model = sc.broadcast(model)
data_rdd = data_rdd.map(lambda row: encode_row(row))

But Spark provides a higher-order function (HOF) specifically for this use case, called mapPartitions, which allows you to specify code to create some heavyweight object(s) per partition, and then apply some processing (using these heavyweight objects) to all rows in the partition. Using this idiom, our processing code would look like the following. You could also broadcast the model from the driver instead of initializing it in each partition, which would save you the repeated initialization time on the workers. Regardless, you can think of initialize_model as a wrapper for either approach.

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    model = initialize_model()
    for row in rows:
        row.embedding = model.encode(row.content)
        yield row

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

However, LMs (and ML models in general) are designed to process input in batches. Generally inference (at least for neural models) involves a lot of matrix multiplications, which the underlying tensor library does in parallel if you feed your model in batches (or larger sets) rather than one input record at a time. Assuming the model was trained with batch size B (usually indicated by the default value for the batch_size parameter in the encode method (or equivalent)), this would translate roughly into a B-fold performance improvement if you fed it batches of size >= B. The model will internally partition the input into multiple batches of B records each, and process the batches sequentially and records within each batch in parallel.

So to allow the model to consume the rows in batches, we could change our code as follows.

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    model = initialize_model()
    docs = [row for row in rows]
    texts = [doc.content for doc in docs]
    embeddings = model.encode(texts)
    for doc, embedding in zip(docs, embeddings):
        doc.embedding = embedding
        yield doc

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

Obviously, the approach above assumes that you have enough memory per partition to hold the text for all the documents in the partition. If the texts in a partition are too large, you will get an Out of Memory (OOM) error and the job will abort. So based on your data and your architecture, the simplest (and probably slightly brute force) approach is to repartition your RDD into a larger number of (smaller) partitions, where the texts will fit in memory. So maybe something like this...

k = calculate_optimum_partition_size()  # either dynamically or offline
data_rdd = data_rdd.repartition(k).mapPartitions(lambda p: encode_rows(p))

But this can lead to many small partitions, which may be an overhead for Spark since it now has to manage the additional coordination. Also, assuming you were initializing the model in the mapPartitions call, the job would spend more time doing this as well if there were many small partitions. Another way (and basically the idiom I am trying to build up to in this blog post) could be to leave the partitions intact and use itertools.islice to batch up rows within each partition using code instead of leveraging the side effect of the partition size. Something like this:

def encode_rows(rows: Iterable[Row]) -> Iterator[Row]:
    model = initialize_model()
    start = 0
    while True:
        end = start + batch_size
        batch = itertools.islice(rows, start, end)
        docs = [row for row in batch]
        if len(docs) == 0:
            break
        texts = [doc.content for doc in docs]
        embeddings = model.encode(texts)
        start = end
        for doc, embedding in zip(docs, embeddings):
            doc.embedding = embedding
            yield doc

data_rdd = data_rdd.mapPartitions(lambda p: encode_rows(p))

EDIT 2023-12-11 -- I found a problem with this approach that took me a while to solve, so sharing it here in case it is helpful to someone down the line. I noticed that when applying the mapPartitions in the previous code block, the number of output records would often be smaller than the number of input records, i.e., the process lost records. I found I could mitigate it if I re-partitioned the RDD so that each partition contained fewer records than my batch size, i.e. itertools.islice is called only once. It turns out that islice consumes the underlying iterator, so on the second and later calls the start offset is counted from wherever the previous call stopped rather than from the beginning of the partition, and the rows in between are skipped (I did test its behavior with integer elements, but perhaps it behaves differently with non-primitive elements). The fix is to add a `rows, rows_copy = itertools.tee(rows)` between lines 5 and 6 (just before the islice call, inside the while loop) and only operate on the `rows_copy` in the islice call on line 6.
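The behavior is easy to reproduce outside Spark. The sketch below is a standalone illustration, with plain integers standing in for rows and helper names (batches_with_offsets, batches) that are mine rather than from the job. The first function mimics the start / end offset pattern from the code block above; the second is an alternative to the tee fix that calls islice with only a count, so each call simply takes the next batch_size items from wherever the iterator currently is.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches_with_offsets(rows: Iterator[T], batch_size: int) -> Iterator[List[T]]:
    """Mimics the original pattern: absolute start/end offsets on a shared iterator."""
    start = 0
    while True:
        end = start + batch_size
        # On an iterator, `start` items are skipped from the CURRENT position.
        batch = list(itertools.islice(rows, start, end))
        if not batch:
            break
        yield batch
        start = end


def batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Alternative: always take the next batch_size items, no offsets needed."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        yield batch


print(list(batches_with_offsets(iter(range(10)), 3)))
# [[0, 1, 2], [6, 7, 8]]  -- records 3, 4, 5 and 9 are lost
print(list(batches(iter(range(10)), 3)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Note that the offset version works fine if you hand it a list instead of an iterator, which may be why a quick test does not show the problem, whereas mapPartitions hands the function an iterator. Inside encode_rows, the loop would become `for docs in batches(rows, batch_size): ...`, and only one batch of rows is held in memory at a time.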

I am curious what people think of this approach. Using Spark to run ML inference at scale cannot be a new problem, but I wasn't able to find any information or best practices about this on the Internet. I did consider the possibility that perhaps my Google-fu may not be as strong as I think, so I also tried Bard, and it didn't give me much to go on either. I am sure many Data Engineers before me have looked at this problem and have their own favorite solutions. Please share in the comments if you can!

Friday, June 09, 2023

Future of Data Centric AI -- Trip Report

I attended the Future of Data Centric AI 2023 this week, a free virtual conference organized by Snorkel AI. Snorkel AI is a company built around the open-source Snorkel framework for programmatic data labeling. The project originally started at Stanford University's Hazy Research group, and many (all?) of the company's founders and some engineers are from the original research team. Snorkel AI has been building and improving their flagship product, Snorkel Flow, an integrated tool for iterative data labeling and model building, so there were some presentations centered around that. In addition, it's 2023, the year of generative LLMs (or GoLLuMs or Foundation Models) so Snorkel's ability to interface with these Foundation Models (FMs) also featured prominently. Maybe it's a Stanford thing but presenters seem to prefer calling them FMs, so I will do the same, if only to distinguish them from the BERT / BART style large language models (LLMs).

If you are unfamiliar with what Snorkel does, I recommend checking out Snorkel and the Dawn of Weakly Supervised Machine Learning (Ratner et al, 2017) for a high-level understanding. For those familiar with the original open source Snorkel (and Snorkel METAL), Snorkel Flow is primarily a no-code web based tool to support the complete life-cycle of programmatic data labeling and model development. Because it is no-code, it is usable by domain experts who don't necessarily know how to program. While the suite of built-in no-code Label Function (LF) templates is quite extensive, it supports adding programmatic LFs as well if you need them. In addition, it provides various conveniences such as cold-start LF recommendations, error analysis, and recipes on how to address various classes of error, to support an iterative approach to model development that is almost like a programmer's edit-compile-run cycle. Over the last few months, they have added LLMs as another source of weak supervision and a possible source of LFs as well.

The last bit is important, because I think it points to the pragmatism of the Snorkel team. The FM applications ecosystem currently seems filled with pipelines that feature the FM front and center, i.e. use the FM for everything it can possibly do. Given the high infrastructure costs of running FMs and their high latencies, these pipelines don't seem very practical. Most of us were taught to cache (or pre-cache) as much as possible, so the customer does not pay the price during serving, or they will soon cease to be customers. Matthew Honnibal, creator of spaCy, makes a similar, though probably better argued, point in his Against LLM Maximalism blog post, where he advocates for smaller, more reliable, models for most tasks in the pipeline, and reserving the FM for tasks that truly need its capabilities. Snorkel Flow goes one step further by taking FMs out of the pipeline altogether -- instead using them to help generate good labels, thus benefiting from the FM's world knowledge while still retaining the flexibility, reliability and explainability in the generated models.

However, Snorkel.AI is addressing the needs of the FM market as well, through their soon to be announced new tools -- Foundry and GenFlow -- which Alex Ratner (CEO and co-founder of Snorkel.AI) mentioned in his keynote addresses. They classify the usage of FMs into four stages -- pre-training (either from scratch or from trained weights, where it becomes more of a domain adaptation exercise), instruction tuning for behavior, fine tuning for a particular task, and distillation of the model into a smaller, more easily deployable model. As the DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al, 2023) paper shows, the mix of data used to train or adapt the FM can make a significant impact upon its quality, and Foundry and GenFlow are aimed at improving data and prompt quality for the first and second stages respectively, by ensuring optimum sampling, filtering and ranking.

Over the course of the conference, presenters repeatedly talked about the importance of having high quality data to train models. Not surprising, since the conference has "Data-Centric AI" in its name, a term coined by Andrew Ng who was the first to emphasize this idea. However, the Snorkel team have really taken this idea to heart, and along with their customers, have developed some really cool applications, some of which they showcased in this conference. Apart from the keynotes and some panel discussions, presentations were in two parallel tracks, and I chose the ones that emphasized practice over theory, and I skipped a few, so the list below may be slightly biased. Videos of the talks will become available on the Snorkel YouTube channel in about a month, and I will update the links once that happens (if I remember).

Bridging the Last Mile: Applying Foundation Models with Data-Centric AI (Alex Ratner) -- basic idea is that FMs are analogous to generalists that (think they) know lots of things, but for specific tasks they need to be trained to do well. Alex envisions data scientists of the future that are less machine learning experts and more domain and product experts. Alex's talks contain many interesting observations, too numerous to list here, and they are just the right mixture of academic and practical for lay people such as myself.
Fireside Chat: building Bloomberg GPT (Gideon Mann and Alex Ratner) -- interesting insights into the rationale for Bloomberg GPT and the work that went into building it.
Fireside Chat: Stable Diffusion and Generative AI (Emad Mostaque and Alex Ratner) -- lot of cool technical insights about FMs from Emad Mostaque, CEO of Stability.AI (Stable Diffusion).
A Practical Guide to Data Centric AI -- A Conversational Use AI Use case (Daniel Lieb and Samira Shaikh) -- practical tips to building an intent classifier for conversational chatbots. Similarity function for clustering conversations was adapted from the paper Modeling Semantic Containment and Exclusion in Natural Language Inference (MacCartney and Manning, 2008).
The Future is Neurosymbolic (Yoav Shoham) -- somewhat philosophical discussion of why FMs can never do the kind of things humans can do, and why, from the founder of AI21 Labs.
Generating Synthetic Tabular Data that is Differentially Private (Lipika Ramaswamy) -- a somewhat technical discussion arguing for differential privacy to generate synthetic datasets that could be used to train FMs and thereby address the problem of them memorizing sensitive training data.
DataComp: Significance of Data for Multimodal AI (Ludwig Schmidt) -- discusses DATACOMP, a benchmark which aims to improve an image-text dataset used to train multi-modal models such as CLIP, by keeping the model fixed and improving the dataset. By applying a simple quality filter on the original dataset, they were able to train a model that was smaller in size, took 7x less time to train, and outperformed a larger model. More details in the DATACOMP: In search of the next generation of multimodal datasets (Gadre et al, 2023) paper.
New Introductions from Snorkel AI (Alex Ratner) -- second day keynote where Alex formally announced Snorkel Foundry and GenFlow, among other things, some of which were repeats from the previous day's keynote.
Transforming the Customer Experience with AI: Wayfair's Data Centric Way (Archana Sapkota and Vinny DeGenova) -- this was a really cool presentation, showing how they labeled their product images programmatically with Snorkel for design, pattern, shape and theme, and used that to fine tune a CLIP model, which they now use in their search pipeline. More info about this work in this blog post.
Tackling advanced classification with Snorkel Flow (Angela Fox and Vincent Chen) -- the two big use cases where people leverage Snorkel are document classification and sequence labeling. Here they discuss several strategies for multi-label and single-label document classification.
Accelerating information extraction with data-centric iteration (John Smardijan and Vincent Chen) -- this presentation has a demo of Snorkel Flow to label documents with keywords for a specific use case (for which off the shelf NERs do not exist). The demo shows how one can rapidly reach a good score (precision and coverage) by iterating through creating and applying an LF, then training and evaluating a model on the labels created by the LF, doing error analysis to correct the issues pointed out by creating another LF, etc, until the desired metrics are reached. They called this the Data-Model flywheel.
Applying Weak Supervision and Foundation Models for Computer Vision (Ravi Teja Mullapudi) -- talked about using Snorkel for image classification, including a really cool demo of Snorkel Periscope (an internal Labs tool) applied to satellite data to build classifiers that look for images of a particular type, using UMAP visualizations and cosine similarity distributions.
Leveraging Data-Centric AI for Document Intelligence and PDF Extraction (Ashwini Ramamoorthy) -- a talk about information extraction from PDF documents, similar to the one listed earlier, but as with that one, Ashwini shares a huge amount of practical information that I found very useful.
Leveraging Foundation Models and LLMs for Enterprise Grade NLP (Kristina Lipchin) -- slightly high level but very interesting take on FMs from a product manager viewpoint, echoes much of the same ideas about last mile handling covered in earlier talks, but identifies Domain Adaptation and Distillation as the primary use cases for most organizations.
Lessons from a year with Snorkel Data-Centric with SMEs and Georgetown (James Dunham) -- this is a hugely informative talk about Georgetown University's experience with using Snorkel Flow for a year. Not only did their domain experts adapt to it readily and love the experience, both data scientists and domain experts benefited from it. Some major benefits noted are the ability to ramp up labeling efforts faster and with less risk, since it is easier to iterate on labels (adding/removing/merging classes, etc) as your understanding of the data grows, the ability to fail fast and without too much sunk cost, and overall lowering of project risk. If you are contemplating purchasing a Snorkel Flow subscription, this talk provides lots of useful information.
Fireside chat: building RedPajamas (Ce Zheng and Braden Hancock) -- RedPajama is an open source initiative to produce a clean-room reimplementation of the popular LLaMA FM from Meta. The focus is on replicating their dataset recipe carefully, but using open source documents, and training base and instruction tuned versions of the LLaMA model on this data that does not block commercial adoption. Ce is the head of Together Computer, the company behind RedPajama, and Braden and Ce discuss the work that has been done so far in this project.
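The labeling function (LF) loop that comes up in several of the talks above can be illustrated in plain Python. This is a sketch of the general weak supervision idea only: it does not use the Snorkel or Snorkel Flow APIs, the LF names and sample texts are invented for illustration, and a simple majority vote stands in for Snorkel's label model.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # POSITIVE = "complaint" in this toy task

# Each labeling function votes on one example or abstains.
def lf_contains_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LFS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]

def majority_vote(text, lfs=LFS):
    """Combine LF votes; abstain when no LF fires or the vote is tied."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

def coverage(texts, lfs=LFS):
    """Fraction of examples that receive a non-abstain label."""
    return sum(1 for t in texts if majority_vote(t, lfs) != ABSTAIN) / len(texts)

texts = [
    "I want a refund now!!",
    "Thanks, that solved it",
    "Where is my order?",
    "Thank you, but I still want a refund",
]
labels = [majority_vote(t) for t in texts]
```

In the iterative loop, the examples that stay unlabeled or end up tied (the third and fourth here) are the ones error analysis would point at when deciding which LF to write next.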

In many cases, it is not the lack of data, but a lack of labeled data that is the major hurdle to Machine Learning adoption within a company. Snorkel's support for weak supervision provides a practical path to generate labels using a programmatic approach. As someone who came to Machine Learning from Search, where featurization is basically TF-IDF and more recently using a trained tokenizer to feed a neural model, I was initially not particularly skilled at detecting features from data. However, over time, as I started looking at data, initially for error analysis and later for feature extraction in cases where labels were not available a priori, the process has become easier, so hopefully my next experience with Snorkel will be smoother. Furthermore, Snorkel's focus on FMs also provides a path to harness this powerful new resource as an additional source of weak supervision.

Monday, December 09, 2019

PyData LA 2019: Trip Report

PyData LA 2019 was last week, and I had the opportunity to attend. I also presented about NERDS (Named Entity Recognition for Data Scientists), an open source toolkit built by some colleagues from our Amsterdam office. This is my trip report.

The conference was three days long, Tuesday to Thursday, and was spread across 3 parallel tracks. So my report is necessarily incomplete, and limited to the talks and tutorials I attended. The first day was tutorials, and the next two days were talks. In quite a few situations, it was tough to choose between simultaneous talks. Fortunately, however, the talks were videotaped, and the organizers have promised to put them up in the next couple of weeks, so looking forward to catching up on the presentations I missed. The full schedule is here. I am guessing attendees will be notified by email when videos are released, and I will also update this post when that happens.

Day 1: Tutorials

For the tutorials, I did all the tutorials in the first track. Of these, I came in a bit late for Computer Vision with PyTorch, since I miscalculated the volume (and delaying power) of LA Traffic. It was fairly comprehensive, although I was familiar with at least some of the material already, so in retrospect, I should probably have attended one of the other tutorials.

The second tutorial was about Kedro and MLflow and how to combine the two to build reproducible and versioned data pipelines. I didn't know that MLflow can be used standalone outside Spark, so definitely something to follow up there. Kedro looks like scaffolding software which allows users to hook into specific callback points in its lifecycle.

The third tutorial was a presentation on teaching a computer to play PacMan using Reinforcement Learning (RL). RL apps definitely have a wow factor, and I suppose it can be useful where the environment is deterministic enough (rules of a game, laws of physics, etc.), but I often wonder if we can use it to train agents that can operate in a more uncertain "business applications"-like environment. I am not an expert on RL though, so if you have ideas on how to use RL in these areas, I would appreciate learning about them.

The fourth and last tutorial of the day was Predicting Transcription Factor (TF) genes from genetic networks using Neural Networks. The data extraction process was pretty cool, it was predicated on the fact that TF genes typically occupy central positions in genetic networks, so graph based algorithms such as connectedness and Louvain modularity can be used to detect them in these networks. These form the positive samples, and standard negative sampling is done to extract negative samples. The positive records (TFs) are oversampled using SMOTE. Features for these genes come from an external dataset of 120 or so experiments, where each gene was subjected to these experiments and results recorded. I thought that the coolest part was using the graph techniques for building up the dataset.

Days 2 and 3: Talks

All the talks provided me with some new information in one form or the other. In some cases, it was a tough choice to make, since multiple simultaneous talks seemed equally interesting to me going in. Below I list the ones I attended and liked, in chronological order of appearance in the schedule.

Gradient Boosting for data with both numerical and text features -- the talk was about the CatBoost library from Yandex, and focused on how much better CatBoost is in terms of performance (especially on GPU) compared to other open source Gradient Boosting libraries (LightGBM and one other that I can't recall at the moment). CatBoost definitely looks attractive, and at some point I hope to give it a try.
Topological Techniques for Unsupervised Learning -- talked about how the topological technique called Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction can be used for generating very powerful embeddings and for clustering that is competitive with t-SNE. UMAP is more fully described in this paper on arXiv (the presenter was one of the co-authors of this paper). There was one other presentation on UMAP by one of the other co-authors which I was unable to attend.
Guide to Modern Hyperparameter Tuning Algorithms -- presented the open source Tune Hyperparameter Tuning Library from the Ray team. As with the previous presentation, there is a paper on arXiv that describes this library in more detail. The library provides functionality to do grid, random, Bayesian, and genetic search over the hyperparameter space. It seems to be quite powerful and easy to use, and I hope to try it out soon.
Dynamic Programming for Hidden Markov Models (HMM) -- one of the clearest descriptions of the implementation of the Viterbi algorithm (given the parameters for the model and the observed states, find the most likely sequence of hidden states) that I have ever seen. The objective is for the audience to understand HMM (specifically the Viterbi algorithm) well enough so they can apply it to new domains where it might be applicable.
RAPIDS: Open Source GPU Data Science -- I first learned about NVIDIA's RAPIDS library at KDD 2019 earlier this year. RAPIDS provides GPU optimized drop-in replacements for NumPy, Pandas, Scikit-Learn, and NetworkX (CuPy, cuDF, cuML, and cuGraph), which run an order of magnitude faster if you have a GPU. Unfortunately, I don't have a GPU on my laptop, but the presenter said that images with RAPIDS pre-installed are available on Google Cloud (GCP), Azure, and AWS.
Datasets and ML Model Versioning using Open Source Tools -- this is a presentation on the Data Version Control (DVC) toolkit, which gives you a set of Git-like commands to version control your metadata, and link them to a physical file on some storage area like S3. We had envisioned using it internally for putting our datasets and ML models under version control some time back, so I was familiar with some of the information provided. But I thought the bit about creating versioned ML pipelines (data + model(s)) was quite interesting.
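The Viterbi algorithm mentioned in the HMM talk above is compact enough to sketch in plain Python. This is a minimal illustration, not the presenter's code; the two-state "Healthy / Fever" model and all of its probabilities are toy values assumed for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state sequence, its joint probability)."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Pick the best final state, then follow the back pointers.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy model (assumed values, for illustration only).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

For longer sequences the products underflow, so a practical version would sum log probabilities instead of multiplying raw ones.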

And here are the talks I would like to watch once the videos are uploaded.

Serverless Demo on AWS, GCP, and Azure -- this was covered in the lightning round on the second day. I think this is worth learning, since it seems to be an easy way to set up demos that work on demand. Also learned about AWS Batch, a "serverless" way to serve batch jobs (or at least non-singleton requests).
Supremely Light Introduction to Quantum Computing -- because Quantum Computing is something I know nothing about.
Introducing AutoImpute: Python package for grappling with missing data -- No explanation needed, clearly, since real life data often comes with holes, and having something like this gives us access to a bunch of "standard" strategies fairly painlessly.
Tackling Homelessness with Open Data -- I would have attended this if I had not been presenting myself. Using Open Data for social good strikes me as something we, as software people, can do to improve our world and make it a better place, so always interested in seeing (and cheering on) others who do it.
What you got is What you got -- speaker is James Powell, a regular speaker I have heard at previous PyData conferences, who always manages to convey deep Python concepts in a most entertaining way.
GPU Python Libraries -- this was presented by another member of the RAPIDS team, and according to the previous presenter, focuses more on the Deep Learning aspect of RAPIDS.

And then of course there was my presentation. As I mentioned earlier, I spoke of NERDS, or more specifically my fork of NERDS where I made some improvements on the basic software. The improvements started as bug fixes, but currently there are quite a few significant changes, and I plan on making a few more. The slides for my talk are here. I cover why you might want to do Named Entity Recognition (NER), briefly describe various NER model types such as gazetteers, Conditional Random Fields (CRF), and various Neural model variations around the basic Bidirectional LSTM + CRF, cover the NER models available in NERDS, and finally describe how I used them to detect entities in a Biological Entity dataset from BioNLP 2004.

The reason I chose to talk about NERDS was twofold. First, I had begun to get interested in NERs in general in my own work, and "found" NERDS (although since it was an OSS project from my own company, not much discovery was involved :-)). I liked that NERDS does not provide "new" ML models, but rather a unified way to run many out of the box NER models against your data with minimum effort. In some ways, it is a software engineering solution that addresses a data science problem, and I thought the two disciplines coming together to solve a problem was an interesting thing in itself to talk about. Second, I feel that custom NER building is generally considered something of a black art, and something like NERDS has the potential to democratize the process.

Overall, based on some of the feedback I got on LinkedIn and in person, I thought the presentation was quite well received. There was some critical feedback saying that I should have focused more on the intuition behind the various NER modeling techniques than I did. While I agree that this might be desirable, I had limited time to deliver the talk, and I would not have been able to cover as much if I spent too much time on basics. Also, since the audience level was marked as Intermediate, I risked boring at least part of the audience if I did so. But I will keep this in mind for the future.

Finally, I would be remiss if I didn't mention all the wonderful people I met at this conference. I will not call you out by name, but you know who you are. Some people think of conferences as places where a small group of people get to showcase their work in front of a larger group of people, but it is also a place where you get to meet people in your discipline but in similar or different domains, and I find it immensely helpful and interesting to share ideas and approaches for solving different problems.

And that's all I have for today. I hope you enjoyed reading my trip report.

Sunday, October 20, 2019

Trip Report: Graphorum 2019

I just got back from Graphorum 2019, organized by the folks at Dataversity. The conference was held at The Westin Chicago River North, and colocated with the Data Architecture Summit (DAS), also organized by Dataversity. Attendees were not restricted to talks at one or the other conference; they were allowed, even encouraged, to attend talks at the other conference, perhaps in an effort to popularize Graph concepts among Data Architects, and Data Architecture best practices among Graph practitioners. Both conferences were very heavily multi-track -- at any point, you had around 3-5 choices if you restricted yourself to talks in either track. I attended only the Graphorum talks, so this trip report represents one of 322 million possible unique trip reports (and one of 142 billion possible unique reports if I had not restricted myself), with the naive assumption that all talks within each conference were independent, and at any point, I was equally likely to select any one of the talks offered.
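The counting argument behind figures like these is a simple product: if time slot i offers c_i simultaneous talks and the choices are independent, the number of distinct paths through the schedule is the product of the c_i. The sketch below uses made-up slot counts, since the actual schedule is not reproduced here, so it illustrates the arithmetic and does not recover the 322 million figure.

```python
import math

def count_paths(choices_per_slot):
    """Number of distinct talk sequences when slot choices are independent."""
    return math.prod(choices_per_slot)

# Hypothetical schedule: five time slots with 3-5 simultaneous talks each.
hypothetical_slots = [3, 4, 5, 3, 4]
n_paths = count_paths(hypothetical_slots)  # 3 * 4 * 5 * 3 * 4
p_this_path = 1 / n_paths  # chance of any one path under uniform choice
```

With 3-5 choices per slot the count grows geometrically, which is how a four-day schedule reaches hundreds of millions of possible paths.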

The conference was four days long, starting on Monday last week (October 14 2019) and ending yesterday (October 17). As expected, many of the talks (at least among the ones I attended) were centered around Knowledge Graphs (KGs). Some of these focused on techniques and advice on how to build them from unstructured text, and some focused on making them more effective. Many of the presentations ended up covering the various Linked Data standards such as Resource Description Framework (RDF) for specifying semantic triples, and Web Ontology Language (OWL) for doing inference on them. More than one talk mentioned the new Shapes Constraint Language (SHACL) for validating such RDF graphs. On the other hand, it looks like there is strong industry support for Labeled Property Graphs (LPG) as well, both among database vendors and users. Regardless of the graph flavor, there was also quite a lot of interest in using Machine Learning (ML) and Natural Language Processing (NLP) to leverage the signal inherent in graph structure. In the rest of this post, I will cover the talks that I liked in my 1-in-322M path through the conference. I will also cover some interesting discussions I had with some vendors at the exhibition, and some overall feedback about the conference as a whole.

I arrived late in Chicago on Sunday night, so was unable to take advantage of the early bird registration. At breakfast next morning, I found myself at the end of a fairly long line while the others were relatively empty. In an attempt to understand why, and notwithstanding Murphy's Law, I noticed that the lines were by first character of last name, and unevenly sized (A-E, F-J, K-O, and P-Z). It is possible that the sizing is based on actual attendee last names, in which case I guess people with last names P-Z tend to be relatively last minute types.

Day 1

On the first day, I attended a presentation about Applications of Knowledge Graphs in Enterprise Data Management and AI by Juergen Jakobitsch from Semantic Web Company and Andreas Blumauer from PoolParty. It was very comprehensive, and included lots of background material on Knowledge Graphs. Among some standard applications were semantic search, info-box style results, product recommendations, virtual assistants, etc. One interesting application mentioned was text data augmentation, and (in the same vein) filtering results from chatbots for illogical answers. They also talked about the PoolParty pipeline for converting inputs (structured and unstructured) to Knowledge Graphs, which includes entity resolution and entity linking. This presentation, as well as others throughout the conference, also focused on the various W3C standards such as RDF, SPARQL, OWL2, and SHACL.

I also attended Graphs Transform Chemical Discovery in Pharmaceutical Research presented by Liana Kiff of Tom Sawyer Software. I had initially thought that it would talk about the actual process of Chemical Discovery using graphs, but it turned out to be (an admittedly very powerful) visualization of a graph of chemical entities from the ChEMBL database, including features that allow for manual chemical discovery.

The final presentation of the day was You Must Be This Tall: Machine Learning from Graphs, Tables, Trees, and Documents by Brian Sletten of Bosatsu Consulting, Inc. As with the previous talk, I came in hoping to learn about ML with graphs, but the talk turned out to be about what you need to do before you can do ML with your graphs. Nevertheless, the presenter was very knowledgeable, so the talk ended up being pretty interesting and educational, in spite of it covering a lot of the RDF/SPARQL/OWL ground covered by earlier talks. One good insight here was the distinction between data, knowledge, and wisdom as points on a Data Understanding vs Connected space, coupled with capturing relationships between entities, the context in which these relations exist, and the exploitation of this information using RDF technologies. Another was the need for data lakes to be available as a pre-requisite to effective ML in the organization. Yet another thing I liked about his presentation was the set of example SPARQL queries against Wikidata. He also talked about JSON-LD and how it can be a good substitute for RDF.

Day 2

The first presentation of Tuesday was Ontology Engineering for Knowledge Graphs by Elisa Kendall and Deborah McGuinness. The talk focused on the role of the Ontology Engineer, who is the technical person who talks to the Knowledge Engineer or Domain Expert. The main component of this role is being effective at interviewing the Knowledge Engineer. The talk also covered various aspects of modeling the domain, with many examples drawn from the presenters' own experiences.

The next presentation I attended was Merging Company Data from Multiple Sources with GraphDB and Ontotext Platform by Atanas Kiryakov (their CEO). As expected, the presentation was more about how Ontotext does things. However, Ontotext is one of the first graph database companies I know of that pursued a strong NLP + Graph strategy, so there was lots of interesting content. Among the functionality covered were features around Entity Recognition from text and Entity Disambiguation (from structured data as well as text), the use of different embedding technology (1-hot, 2-hot, BERT, etc), and the use of different technologies such as Random Forests and SVM for entity and document classification, BiLSTM and CRF for Named Entity Recognition, and Inductive Logic Programming (ILP) for rules engines built around the discovered entities and relations.

The final presentation on Tuesday for me was Reveal Predictive Patterns with Neo4j Graph Algorithms (Hands-on) by Amy Hodler and William Lyon. I had learned about Graph Algorithms in Neo4j (and also used some of them for my own talk later in the conference) from the book Graph Algorithms: Practical Examples in Apache Spark and Neo4j from O'Reilly (Amy Hodler, one of the presenters, is also one of the authors of this book), so some of it was old material for me. The talk started off with the need for graph algorithms as tools that exploit the additional information implicit in the graph structure, then covered the various classes of graph algorithms (Pathfinding, Centrality, Community Detection, Link Prediction and Similarity), with deep dives into specific algorithms and running them on their Neo4j Desktop product (proprietary product with 30 day free trial, but all the features covered are also available in the free community edition). I ended up learning a few new things, such as how to use virtual graphs (generated as a result of a Cypher query, sort of like views in the RDBMS world), and how to use the Strongly Connected Components algorithm as a debugging tool. They also showed off their NEuler product, which allows forms-based invocation of various algorithms, as well as some very good visualizations. Talking about visualization, William Lyon also mentioned the neo4j-contrib/neovis.js project, which seems interesting as well. Overall, lots of useful information about Neo4j and graphs.
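To make the Strongly Connected Components idea concrete, here is a small sketch in plain Python using Kosaraju's two-pass algorithm. It is not Neo4j's implementation and does not use Cypher. The debugging interpretation is an assumption rather than something taken from the talk: in a directed graph that is supposed to be fully navigable, a node that lands in its own component hints at a missing or misdirected relationship.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm over a list of directed (src, dst) edges."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: depth-first search, recording nodes in order of finishing time.
    visited, order = set(), []
    def visit(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visit(nxt)
        order.append(node)
    for node in sorted(nodes):
        if node not in visited:
            visit(node)

    # Pass 2: search the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)
    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components

# Toy graph: a -> b -> c -> a form a cycle, while d is only reachable from c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
components = strongly_connected_components(edges)
singletons = [c for c in components if len(c) == 1]
```

The recursive search keeps the sketch short but would hit Python's recursion limit on large graphs, where an iterative version or a graph library is the safer choice.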

I also learned about the Bridges of Chicago, based on a challenge from the presenters about using Cypher (the Neo4j query language) to find an Eulerian path similar to the Bridges of Königsberg problem. I guess I was the only one that responded, since the problem is much simpler than it appears to be at first glance.
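The reason such problems are simpler than they look is Euler's degree condition: a connected undirected multigraph has an Eulerian path exactly when it has zero or two nodes of odd degree. The sketch below checks that condition in plain Python for the classic Königsberg layout (four land masses, seven bridges). It is not the Cypher solution from the session, the Chicago bridge data is not included, and the node labels are arbitrary.

```python
from collections import Counter

def odd_degree_nodes(edges):
    """Nodes touched by an odd number of edges (parallel edges count separately)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)

def has_eulerian_path(edges):
    """Euler's condition; assumes the multigraph is connected."""
    return len(odd_degree_nodes(edges)) in (0, 2)

# Königsberg: land masses A-D, each tuple is one bridge.
konigsberg = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]
konigsberg_ok = has_eulerian_path(konigsberg)

# A simple chain a - b - c has exactly two odd-degree nodes, so a path exists.
chain_ok = has_eulerian_path([("a", "b"), ("b", "c")])
```

All four Königsberg land masses have odd degree, so no walk crosses every bridge exactly once; in a graph database the same check reduces to counting relationships per node.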

Exhibitors started setting up their booths today, so I spent some of the coffee breaks and most of the evening talking to various exhibitors. Both Graph database vendors and consultants were well represented among the exhibitors (considering it was a graph + data architecture conference). Graph vendors I knew of included Neo4j, Ontotext, TigerGraph, DataStax, and Stardog. Among those who I learned about at this conference were PoolParty, Semantic Web Company, and Cambridge Semantics. Having attended the presentations from PoolParty, Semantic Web Company, and Ontotext, I spent a lot of time talking with them. I also met up with the folks at TigerGraph, and let them know how helpful their Graph Gurus webinar series has been to me. I also took the opportunity to meet up with the folks at Stardog, who I had met through a reference at another Graph conference a few years earlier. Since I was a speaker here, the conversation also drifted occasionally to the subject of my talk, and what graph database I was using (Neo4j).

Day 3

Wednesday was quite a heavy day in terms of presentations, comparatively speaking. It started with two keynote presentations. The first one was Knowledge Graphs and AI: The Future of Enterprise Data by David Newman from Wells Fargo. He spoke of the progression of looking at Strings to Things to Predicted Things to Vectors, which resonated with me as well, since we are progressing along a very similar path ourselves. He led us through multiple examples involving harmonizing an entity across multiple Knowledge Graphs in the enterprise, the need for classifying entities into a taxonomy, using Knowledge Graphs to predict new relationships, using graph relations for creating digital fingerprints for ML algorithms, etc. His examples referenced the Financial Industry Business Ontology (FIBO), which provides a standard schema for the financial services industry.

The second keynote was Graph Stories: How Four Metaphors can help you decide if Graphs are right for you by Dan McCreary of Optum. While David Newman's presentation was based on RDF style graphs, Dan McCreary is a big proponent of Labeled Property Graphs (LPG), although his choice had several very pragmatic reasons. The four metaphors he described are the Neighborhood Walk, the Knowledge Triangle, the Open World Assumption, and the Jenga Tower. Specifically, the first indicates the importance of relationship traversal in your applications, the second indicates where your application is (or wants to be) on the Data / Information / Knowledge Triangle, the third indicates the ease with which new information can be incorporated into your system, and the fourth indicates the resilience of your query system to small changes in your backend. The keynote also covered the importance of graph structure (Structure is the new gold in data mining), the inter-relationship of Graphs with Deep Learning techniques such as Graph Convolutional Networks (GCNN) and Structured Learning with Graphs.

The next presentation I attended was Knowledge Graphs and Model Driven Classification by Scott Henninger of SmartLogic, where he showed off the capabilities of the SmartLogic platform, which centered around Metadata tagging, document classification (based on the metadata tagging and external taxonomies), and Search Enhancement Services (SES). The bulk of the capabilities seem to be rule based, which can be good for explainability purposes. SmartLogic's KG backend is based on RDF Schema, OWL, and SHACL. An interesting functionality of SmartLogic is to allow the user to manually fine-tune the (term) weights from their classifier. I got quite excited at this, thinking that perhaps this functionality could be leveraged to produce explainable Deep Learning models by perturbing the inputs, but then realized that the intuition is similar to the idea behind LIME - Local Interpretable Model-Agnostic Explanations.

Next up was a talk on How Do You Read Millions of Documents for Meaning using Graph? by Ryan Chandler of Caterpillar, Inc. He described a system he built at Caterpillar that allowed customer support technicians to query a large collection of internal support tickets created by other technicians. The end result is a query-able knowledge base. The text in the support tickets is tokenized and segmented into sentences, which are tagged with cause, complaint, solution, note, and correction (classification). The document is decomposed into semantic frames, and the document and the associated semantic frames, along with its metadata, are stored in a Neo4j graph database. On the query side, the natural language (NL) query is converted into a graph using a dependency parse, and re-composed into a Cypher query against specific semantic frames (as indicated by the metadata). The Cypher query produces a ranked list of support tickets that best satisfy the NL query. I thought this was quite an interesting technique, although it may be somewhat dependent on the structure of the input data.

The next presentation I attended was Graph Analytics for Enterprise Applications by Melliyal Annamalai, Souripriya Das, and Matthew Perry from Oracle. I came in a few minutes late so I missed the first part, but from what I gathered, it covers Oracle's foray into graph databases -- it turns out that Oracle customers can now start working with SPARQL using SQL Developer, seamlessly against Oracle's new hybrid Property and RDF graph. The functionality is nice, but probably only useful for current and future Oracle customers.

My next presentation was Insufficient Facts always invite Danger: Combat them with a Logical Model by Michael Grove of Stardog, where he described how important it was to have a Logical Model to ensure completeness of your model, and how it can help you avoid problems later.

The evening before I had spent some time at the DataStax booth, mainly for nostalgic reasons since I worked with Cassandra (the Apache version, not the DataStax version) at my previous job, and I was curious about their graph product based on Cassandra (initially called Titan, then Janus). So I attended the presentation Graph Innovations for Distributed Data at Scale by Jonathan Lacefield. The presentation covered the evolution of their graph product, and also answered a nagging question I had about how they implemented the graph in a column-family database under the covers -- turns out that each row is basically the star graph around each node. Other interesting things in this presentation were their use of Gremlin and Spark support through their DataStax extension.
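To make the "each row is the star graph around a node" idea concrete, here is a minimal in-memory sketch. It is not DataStax's actual storage schema; the row layout, the `add_vertex`, `add_edge`, and `neighbors` helpers, and the sample data are all hypothetical and only illustrate why a one-hop traversal becomes a single row read.

```python
from collections import defaultdict

# Hypothetical layout: one "row" per vertex, keyed by vertex id.
# Each row holds the vertex's own properties plus one "column" per incident edge.
rows = defaultdict(lambda: {"props": {}, "out": {}, "in": {}})

def add_vertex(vid, **props):
    rows[vid]["props"].update(props)

def add_edge(src, label, dst, **props):
    # The edge is written into both endpoint rows, so each row stays
    # the complete star graph around its vertex.
    rows[src]["out"][(label, dst)] = props
    rows[dst]["in"][(label, src)] = props

def neighbors(vid, label=None):
    # A one-hop traversal only needs to read the row for `vid`.
    return sorted(dst for (lbl, dst) in rows[vid]["out"]
                  if label is None or lbl == label)

add_vertex("alice", kind="person")
add_vertex("bob", kind="person")
add_vertex("acme", kind="company")
add_edge("alice", "knows", "bob", since=2015)
add_edge("alice", "works_at", "acme")

print(neighbors("alice"))
print(neighbors("alice", "knows"))
```

The cost of this layout is that every edge is stored twice, once per endpoint, which is the usual trade for cheap neighborhood reads.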

The last presentation of the day was Knowledge Graphs and GraphQL in Action: A Practical Approach to using RDF and Semantic Models for Web Applications by Irene Polikoff and Ralph Hodgson of TopQuadrant. They described their Semantic GraphQL interface which provides the user with a GraphQL interface, and converts down to an RDF, OWL, and SHACL query against an RDF triple store.

Finally, the last event of the day was a session about Open Source Knowledge Graph Tooling, which really turned out to be a group of banking folks trying to collaborate around the FIBO Ontology, but it is likely that they might expand to other industries as well in the future. There was also talk about listing out a current (non-deprecated) list of open source ontologies in various industries, applying unit tests to ontologies so they don't become stale and irrelevant, both of which were interesting to me.

The exhibitors were still around, and so I hung around for some more conversations with vendors and fellow attendees for a couple more hours after that. Among them were Cambridge Semantics, who have a fast analytics graph database called AnzoDB.

Day 4

The first presentation of the day was Unsupervised and Supervised ML on Big Graph: Case Studies by Victor Lee. He described various case studies using TigerGraph. The first one was finding influential healthcare providers in various local networks from a specialty network, and finding their influence networks. Another case study had to do with detecting spam phone calls in the China Mobile network, the training data for which consisted of 180+ graph features. The model was a Random Forest classifier. At prediction time, an incoming phone call would be placed in the call graph, the 180+ features computed and fed into the Random Forest model to predict (under 20 milliseconds) whether the call was spam or not spam. The third case study was for Bank Fraud, based on some synthetic data from a Kaggle competition, where TigerGraph engineers built some compound relationships based on edges discovered in the feature graph, which ended up giving good results, showing that the structure of the data provides useful signal. The talk ended with an introduction to Node2Vec, a graph embedding scheme.

The next presentation in my list was my own (Graph Techniques for Natural Language Processing). My presentation was about using Graph techniques (mostly a combination of common third party algorithms) to solve Natural Language Processing (NLP) problems. I covered four case studies that attempted to replicate academic papers (referenced from the Bibliography of Graph-Based Natural Language Processing and Information Retrieval) around document summarization, clustering using language model based vectors, word sense disambiguation, and topic finding. Graph techniques used included various graph centrality metrics (some built-in and some computed using Cypher and built-in algorithms), random walk techniques, Louvain Community Detection, Label Propagation, and Personalized PageRank. Compared to the other presentations, mine was probably a bit unusual, since it focused on NLP more than on graphs, so while I had only about 15-20 attendees, there seemed to be lots of interest, and some very good questions at the end. For those of you who weren't able to make it to the presentation but would like more details, you can find the link to my slides and code (in Jupyter notebooks, with a lot of verbose commentary) at my sujitpal/nlp-graph-examples repository.
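For readers unfamiliar with Personalized PageRank, one of the techniques listed above, here is a minimal pure-Python sketch of the general algorithm using power iteration. It is not the code from the talk or the repository (which uses Neo4j's built-in algorithms); the function name, the toy graph, and the default parameters are assumptions for illustration.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iterations=50):
    """graph: dict mapping node -> list of out-neighbors (every neighbor must be a key).
    seeds: set of nodes that the random walker teleports back to."""
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # With probability (1 - alpha) the walker jumps back to a seed node.
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                # Otherwise it follows an outgoing edge chosen uniformly.
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * teleport[m]
        rank = nxt
    return rank

# Toy undirected path a - b - c - d, personalized on "a".
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ppr = personalized_pagerank(toy_graph, seeds={"a"})
for node, score in sorted(ppr.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Restricting the teleport step to the seed set is the only difference from ordinary PageRank, and it is what biases the scores toward the neighborhood of the seeds.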

I hung around a bit after my presentation answering questions, so I ended up being a bit late to the next presentation, even with the generous coffee break in between. This was When Time Meets Relationships: Understanding an Immutable Graph Database by Brian Platz of Fluree. He makes the case that a Knowledge Graph is a snapshot at a point in time. A time-aware Knowledge Graph can be thought of as an immutable linked list, where facts are added to an append-only log, and made tamper-proof with hashing techniques, much like a private blockchain. The strategy assumes the Knowledge Graph is a triple-store of (Subject, Predicate, Object). As time passes, facts are either retracted or added, so a time-aware tuple would be (Subject, Predicate, Object, Time, Add/Retract). In addition, labeled properties, such as a scheduled expiration date, can be accommodated with an additional Metadata attribute. He also covered some indexing strategies that can make it efficient to query such a time-aware tuple-store.
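A minimal sketch of the append-only, hash-chained fact log described above might look as follows. This is not Fluree's implementation or API; the `FactLog` class, its method names, and the choice of SHA-256 over a JSON encoding are assumptions made for illustration, and the sketch assumes facts are appended in time order.

```python
import hashlib
import json

class FactLog:
    """Hypothetical append-only log of (subject, predicate, object, time, added) facts."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, fact):
        # Chain each entry to the previous one so edits to history are detectable.
        return hashlib.sha256((prev_hash + json.dumps(fact)).encode("utf-8")).hexdigest()

    def _append(self, s, p, o, t, added):
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        fact = [s, p, o, t, added]
        self.entries.append({"fact": fact, "prev": prev_hash,
                             "hash": self._digest(prev_hash, fact)})

    def add(self, s, p, o, t):
        self._append(s, p, o, t, True)

    def retract(self, s, p, o, t):
        self._append(s, p, o, t, False)

    def as_of(self, t):
        # Replay the log up to time t to rebuild the snapshot at that moment.
        current = set()
        for entry in self.entries:
            s, p, o, ts, added = entry["fact"]
            if ts > t:
                continue
            if added:
                current.add((s, p, o))
            else:
                current.discard((s, p, o))
        return current

    def verify(self):
        prev_hash = ""
        for entry in self.entries:
            if entry["prev"] != prev_hash or entry["hash"] != self._digest(prev_hash, entry["fact"]):
                return False
            prev_hash = entry["hash"]
        return True

log = FactLog()
log.add("alice", "works_at", "acme", 1)
log.retract("alice", "works_at", "acme", 5)
log.add("alice", "works_at", "globex", 5)
print(log.as_of(3))
print(log.as_of(5))
print(log.verify())
```

Replaying the whole log for every query is linear in its length, which is presumably why the talk spent time on indexing strategies.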

After this, there were two keynotes. The first one was News and Graphs by Peter Olson of NBC News Digital, which covered the graph structure of the NBC News Publishing pipeline, and how NBC leverages graphs to provide news with high velocity and scale. The second keynote was Knowledge Graph Pilot improves Data Quality While Providing a Customer 360 View by Bethany Swhon and Patricia Branum of Capital One, where they described how they improved the quality of their Knowledge Graph to provide a better Customer view across the enterprise.

The final presentation of the day and conference for me was Automated Encoding of Knowledge from Unstructured Text into a Graph Database by Chris Davis of Lymba. The presentation describes the Lymba pipeline to convert text into Knowledge Graph. It includes the usual preprocessing, tokenizing, POS tagging, and segmentation steps other presentations covered (and in some ways seem to be standard knowledge in the text to KG NLP sub-community), but this presentation went one step further and talked about the need for Word Sense Disambiguation, Concept extraction (using gazetteers and NER models), and Syntactic (constituent) and Semantic Parses (dependency) for relation extraction. It also includes Coreference Resolution, which is also quite important but usually omitted from pipelines because of its complexity. The Lymba product provides a turnkey solution plus consulting for various industries.

I had to catch a flight back, and having heard about the Chicago traffic and having faced the zero tolerance for lateness in large airports such as LAX, I didn't want to miss it. So I ended up skipping the last panel discussion on Graphs vs Tables. Turns out I didn't need to, but better safe than sorry.

Conclusions

As conferences go, this was quite luxurious -- attendees were treated to a sumptuous buffet breakfast every day, and a 3 course sit-down lunch for 3 of the 4 days (1 of the days was build-your-own sandwiches, but even that was quite nice). One observation is that sit-down lunches can foster really good and insightful conversations. In addition, there was coffee and snacks throughout the day, and (limited) free drinks for 2 of the 3 evenings. Swag included a Dataversity branded backpack to hold your conference materials, wool hats with the Dataversity logo, stickers, and a pen which contained a USB drive with all the presentation slides, as well as the swag vendors give out at their stalls (to potential clients).

Of course, the nicest thing about conferences (after the presentations) is the conversations with fellow attendees, and the chance to learn from their insights, and what they are doing with the tech under consideration (in this case graphs). I met people from aviation, health (insurance), finance, consulting, the government (from both the covert and the overt branches), as well as scientific publishing. In addition, it was a chance to interact with people from the vendor companies, and bounce ideas off them about specific things they do well. Two insights, both gained at lunch table conversations -- first, RDF has better inter-operability and tooling, but LPGs are easier to work with; second, certain Asian cultures believe that you can never define an object fully, which seems to warrant more of a triple-store structure than the more efficient but constrained graph structure.

Overall, it was good to see Graphs being treated with so much importance. The previous Graph conferences I have attended were much smaller affairs, rarely lasting more than a day. I suppose this might partly be because of the focus on explainable AI, advances in Knowledge Graphs, Graph CNNs and embeddings, as well as the realization that graph structure provides useful exploitable signal, all of which are causing graphs to become more and more important, and graph conferences to become more popular.

If I had to suggest an improvement, I would suggest streamlining the evaluation process. I don't know how many feedback forms were returned (I returned all four that were provided in my conference materials, but not the last global one). Each form takes approximately 5 minutes to complete, so it is tempting to skip it and go to the next session instead. And by the evening, it is harder to do, since you have to refer to your notes instead of relying on short term memory. On the other hand, someone at the door with an iPad who scans your badge and asks you to tap on a smiley versus a frowny icon provides much better coverage (although you would have to interpret the meaning of the feedback). I guess it's the tension between explicit versus implicit feedback; there are tradeoffs either way.

Friday, August 09, 2019

KDD 2019: Trip Report

I had the good fortune last week to attend KDD 2019, or more formally, the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, that was held at downtown Anchorage, AK from August 4-8, 2019. Approximately 3000 participants descended on the city (boosting the population by 1%, as Mayor Berkowitz pointed out in his keynote). This was my first KDD, and I found it somewhat different from other conferences I have attended in the past. First, there seems to be a more equitable representation from academia and industry. To some extent this is understandable, since data is a big part of KDD, and typically, it is industry that collects, and has access to, large and interesting datasets. A side effect is that there is as much emphasis on tools and frameworks as on methods, so by judiciously selecting the sessions you attend, you could load up on the combination that works best for you. Second, there is a surfeit of choice -- at any point in time during the conference, there were 5-8 parallel tracks spread across two conference buildings. While there were some difficult choices to make as to which track to attend and which to drop, I felt I got a lot of value out of the conference. Of course, this does mean that each attendee's experience of the conference is likely to be somewhat personalized by their background and their choices. In this post, I will write about my experience at KDD 2019.

Day 1 (Sunday, August 4) -- Lecture Style Tutorials

My choices here were slightly non-obvious, since I haven't worked directly with hospital data (unless you count the i2b2 and MIMIC datasets), and don't foresee doing so any time soon. However, I had just finished reading the book Deep Medicine by Dr. Eric Topol, and was quite excited about all the cool ideas in the book around using machine learning to assist medical practitioners in hospital settings. There were tutorials that were closer to my work interests, but I figured that it might be good to mix in some exploration with the exploitation, and I decided to attend the tutorials on Mining and model understanding on medical data, and Data mining methods for Drug Discovery and Development.

The first tutorial provided a very broad coverage of Medical Data Mining, starting with sources and coding schemes (ICD-10, ATC, etc.), various interesting strategies for extracting temporal features from Electronic Health Records (EHR), such as the use of Allen's temporal logic and Itemset disproportionality. Also covered were Learning from Cohorts and the use of Randomized Control Trials, and the application of the Dawid-Skene algorithm to Data Fusion, i.e., a process of creating clean features from multiple noisy features, which reminded me of the Snorkel generative model. I also learned that the Dawid-Skene algorithm is equivalent to a one-layer Restricted Boltzmann Machine (RBM) (example code in Tensorflow). Another interesting insight provided by one of the presenters is the timeline for ML techniques -- starting with rules, then simple matrix based techniques (logistic regression, decision trees, SVM, XGBoost, etc), then very briefly probabilistic statistical techniques, rapidly supplanted by Deep Learning techniques. There is of course a move to merge the last two techniques nowadays, so probabilistic techniques are coming back to the forefront. The tutorial was run by Prof. Myra Spiliopoulou and her team from the Otto-von-Guericke-Universität Magdeburg.

The second tutorial provided a broad overview of in-silico drug development. Primary tasks here are Molecular Representation Learning (for example, mol2vec) to allow molecules to be represented in a semantic vector space similar to word embeddings; Molecular Property Prediction that takes a drug molecule and outputs its property; Drug Repositioning that takes a (molecule, protein) pair and outputs an affinity score to indicate how the molecule will interact with the (disease) protein; Adverse Drug Interaction that takes a (molecule, molecule) pair and predicts their interaction; and finally De Novo drug development, which takes a chemical property and outputs a drug molecule. We also learned various schemes for encoding molecules as text, such as 1D, 2D, and 3D encoding, circular fingerprints (ECFPx), SMILES, and adjacency matrix (Bond Adjacency). The main takeaways for me were the ways in which molecules can be encoded and embedded, and the use of Variational Autoencoders (VAE) and grammar constraints to restrict generated drugs to valid ones. The tutorial was run by Cao Xiao and Prof. Jimeng Sun of IQVIA and Georgia Tech respectively.

Day 2 (Monday, August 5) -- Workshops

The workshop I attended was Mining and Learning with Graphs Workshop (MLG 2019). This was an all-day event, with 5 keynote speakers. The first keynote speaker was Prof. Lise Getoor of University of California, Santa Cruz - she gave a shorter version of her RecSys 2018 keynote and mentioned the Probabilistic Soft Logic (PSL) Framework, a framework for developing (specifying rules and optimizing) probabilistic models. Prof. Austin Benson from Cornell spoke of Higher Order Link Prediction (slides). Lada Adamic spoke about interesting Social Network findings based on crunching county-level US demographics data from Facebook. She works for Facebook now, but I remember her as the University of Michigan professor who re-introduced me to Graph Theory after college, through her (now no longer active) course on Social Network Analysis on Coursera. Prof. Vagelis Papalexakis, from the University of California, Riverside, talked about Tensor Decomposition for Multi-aspect Graph Analytics, a talk that even a fellow workshop attendee and PhD student who had a poster accepted thought was heavy. Finally, Prof. Huan Liu of Arizona State University, and the author of the (free to download) book Social Media Mining, spoke very entertainingly about the challenges in mining Big Social Media data, mostly related to feature sparsity and privacy, and possible solutions to these. He also pointed the audience to an open source feature selection library called scikit-feature.

There were quite a few papers (as well as posters) in the workshop that I found interesting. The paper Graph-Based Recommendation with Personalized Diffusions uses random walks to generate personalized diffusion features for an item-based recommender. The Sparse + Low Rank trick for Matrix Factorization-based Graph Algorithms, based on Halko's randomized algorithm, describes a simple way to make matrix factorization more scalable by decomposing the matrix into a sparse and a low-rank component. Graph Embeddings at Scale proposes a distributed infrastructure to build graph embeddings that avoids graph partitioning. The Temporal Link Prediction in Dynamic Networks (poster) uses a SiameseLSTM network to compare pairs of sequences of node embeddings over time. When to Remember where you came from: Node Representation Learning in Higher-Order Networks uses historical links to predict future links.

Finally, I also went round looking at posters from other workshops. Of these, I found Automatic Construction and Natural-Language Description of Nonparametric Regression Models interesting; it attempts to classify time series trends against a library of reference patterns, and then create a vector that can be used to generate a set of explanations for the trend.

This was followed by the KDD opening session, where after various speeches by committee members and invited dignitaries, awards for various activities were given out. Of note was the ACM SIGKDD Innovation Award awarded to Charu Aggarwal, the ACM SIGKDD Service Award for Balaji Krishnapuram, and the SIGKDD Test of Time Award to Christos Faloutsos, Natalie Glance, Carlos Guestrin, Andreas Krause, Jure Leskovec, and Jeanne VanBriesen.

There was another poster session that evening, where I had the chance to see quite a few more posters. Some of these that I found interesting are as follows. Revisiting kd-tree for Nearest Neighbor Search, which uses randomized partition trees and Fast Fourier Transforms (FFT) to more efficiently build kd-trees with the same level of query accuracy. It caught my interest because I saw something about randomized partition trees, and I ended up learning something interesting. Another one was Riker: Mining Rich Keyword Representations for Interpretable Product Question answering, which involves creating word vectors for questions and using attention maps to predict the importance of each of these words for a given product.

Day 3 (Tuesday, August 6) -- Oral Presentations

The day started with a keynote presentation titled The Unreasonable Effectiveness and Difficulty of Data in Healthcare by Dr Peter Lee of Microsoft Research. To a large extent, his points mirror those made by Dr. Eric Topol in Deep Medicine in terms of what is possible in medicine with the help of ML/AI, but he also looks at the challenges that must be overcome before this vision becomes reality.

Following that, I attended two sessions of Applied Data Science Oral Presentations, one on Auto-ML and Development Frameworks, and the other on Language Models and Text Mining, and then one session of Research Track Oral Presentation on Neural Networks.

I found the following papers interesting in the first Applied Data Science session. Auto-Keras: An Efficient Neural Architecture Search System uses Bayesian Optimization to find the most efficient Dense Keras network for your application. To the user, calling this is a one-liner. Currently this works on legacy Keras, but the authors are working with Google to have this ported to tf.keras as well. A more interesting framework keras-tuner currently works with tf.keras, and while invoking keras-tuner involves more lines of code, it does seem to be more flexible as well. TF-Ranking: Scalable Tensorflow Library for Learning-to-Rank is another Learning to Rank (LTR) framework that is meant to be used instead of libraries like RankLib or LambdaMART. It provides pointwise, pairwise based, and listwise ranking functions. FDML: A Collaborative Machine Learning Framework for Distributed Learning is meant to be used where learning needs to happen across platforms which are unable to share data either because of volume or privacy reasons. The idea is to learn local models with diverse local features, which will output local results, then combine local results to get the final prediction. In addition, there was a talk on Pythia: AI assisted code completion system that is used in the VSCode editor, and Shrinkage Estimators in Online Experiments, which mentions the Pytorch based Adaptive Experimentation Platform for Bayesian Parameter Optimization.

The second Applied Data Science session was on Language Models. The papers I found interesting in this session are as follows. Unsupervised Clinical Language Translation (paper) uses an unsupervised technique to induce a dictionary between clinical phrases and corresponding layman phrases, then uses a standard Machine Translation (MT) pipeline to translate one to the other. A reverse pipeline is also constructed, which can be used to generate more training data for the MT pipeline. GMail Smart Compose: Real-Time Assisted Writing underlies the phrase completion feature most GMail users are familiar with. It is achieved by interpolating predictions from a large global language model and a smaller per-user language model. As part of this work, they have open sourced Lingvo, a Tensorflow based framework for building sequence models. And finally, Naranjo Question Answering using End-to-End Multi-task Learning Model attempts to infer adverse drug reactions (ADR) from EHRs by answering the Naranjo questionnaire using automated question answering. There was also Automatic Dialog Summary Generation for Customer Service, which uses key point sequences to guide the summarization process, and uses a novel Leader-Writer network for the purpose.
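The interpolation of a global and a per-user language model mentioned for Smart Compose can be illustrated with a toy example. This is only a sketch of plain linear interpolation over next-word distributions; the mixing weight, the `interpolate` helper, and the probabilities are made up for illustration and are not taken from the paper.

```python
def interpolate(global_probs, user_probs, lam=0.7):
    """Mix two next-word distributions: lam * global + (1 - lam) * per-user."""
    vocab = set(global_probs) | set(user_probs)
    return {w: lam * global_probs.get(w, 0.0) + (1.0 - lam) * user_probs.get(w, 0.0)
            for w in vocab}

# Hypothetical next-word probabilities after the prefix "Best".
global_lm = {"regards": 0.6, "wishes": 0.4}
user_lm = {"regards": 0.1, "wishes": 0.9}   # this user almost always writes "Best wishes"

mixed = interpolate(global_lm, user_lm)
suggestion = max(mixed, key=mixed.get)
print(suggestion, mixed)
```

Because both inputs are probability distributions and the weights sum to one, the mixture is again a distribution, so a small per-user model can shift the suggestion without having to cover the full vocabulary on its own.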

The final oral presentation session for the day was the Research Track on Neural Networks. Unfortunately, I did not find any of the papers useful, in terms of techniques I could borrow for my own work. I did get the impression that Graph based Neural Networks were the new thing, since almost every paper used some form of Graph network. Apart from graph embeddings that are derived from node properties or conducting random walks on graphs, there is the graph convolution network (GCN) which uses graph local features instead of spatially local features. The GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization uses this kind of architecture to detect associations between diseases and genes. Similarly, the Origin-destination Matrix prediction via Graph Convolution: A new perspective of Passenger Demand Modeling uses GCNs to predict demand for ride-hailing services.
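To show what "graph local features" means in practice, here is a minimal pure-Python sketch of a single graph convolution layer, using the commonly cited symmetric normalization with self-loops followed by a ReLU. It is not the architecture of either paper above; the `gcn_layer` function, the toy graph, and the weights are assumptions for illustration.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 . H . W), with plain lists."""
    n = len(adj)
    # Add self-loops so each node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate features from graph neighbors (the "graph local" step).
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    # Shared linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weights[f][h] for f in range(n_in)))
             for h in range(n_out)] for i in range(n)]

# Toy path graph 0 - 1 - 2, with a single feature that is non-zero only on node 0.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]

hidden = gcn_layer(adj, features, weights)
print(hidden)
```

After one layer the signal from node 0 reaches its neighbor node 1 but not node 2, which is two hops away; stacking layers widens the neighborhood, much as stacking convolutions widens the receptive field on an image.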

The exhibition booth had also opened earlier that day, so I spent some time wandering the stalls, meeting a few people and asking questions about their products. There were a couple of publishers, MIT Press and Springer, selling Data Science books. There were some graph database companies, TigerGraph and Neo4j. Microsoft and Amazon were the two cloud providers with booths, but Google wasn't present (not my observation, it was pointed out to me by someone else). Indeed and LinkedIn were also there. NVIDIA was promoting its RAPIDS GPU-based acceleration framework, along with its GPUs. There were a bunch of smaller data science / analytics companies as well. I picked up a couple of stickers and some literature from the National Security Agency (NSA) and the NVIDIA booths.

I then wandered over to the poster area. I managed to talk to a few people and listen to a few presentations. Notable among them was the poster on Chainer: a Deep Learning Framework for Accelerating the Research Cycle. I haven't used Chainer, but looking at the code in the poster, it looked a bit like Pytorch (or more correctly perhaps, Pytorch looks a bit like Chainer). Another framework to pick up when time permits, hopefully.

Day 4 (Wednesday, August 7) -- Hands-on Tutorials

I came in bright and early, hoping to attend the day's keynote presentation, but ended up having a great conversation with a data scientist from Minnesota instead, as we watched the sun rise across the Alaska range from the third floor terrace of the conference building. In any case, the keynote I planned on attending ended up getting cancelled, so it was all good. For my activity that day, I had decided on attending two hands-on tutorials, one about Deep Learning for Natural Language Processing with Tensorflow, and the other about Deep Learning at Scale on Databricks.

The Deep Learning for NLP with Tensorflow was taught by a team from Google. It uses the Tensorflow 2.x style of eager execution and tf.keras. It covers basics, then rapidly moves on to sequence models (RNN, LSTM), embeddings, sequence to sequence models, attention, and transformers. As far as teachability goes, I have spent a fair bit of time trying to figure this stuff out myself, then trying to express it in the cleanest possible way to others, and I thought this was the most intuitive explanation of attention I have seen so far. The slide deck is here, and it contains links to various Colab notebooks. The Colab notebooks can also be found at this github link. The tutorial then covers the transformer architecture, and students (in an ideal world with enough internet bandwidth and time) are taught how to construct a transformer encoder-decoder architecture from scratch. They also teach you how to use the pre-trained BERT model from TF-Hub and optionally fine tune it. Because we were not in an ideal world, after the initial few Colab notebooks, it was basically a lecture, where we were encouraged to run the notebooks on our own time.

The Deep Learning at Scale on Databricks was taught by a team from Databricks, and was apparently Part-II in a two part session. But quite a few of us showed up based on the session being marked as a COPY of the morning session, so the instructor was kind enough to run through the material again for our benefit. The slide deck can be found here. Unfortunately, I can no longer locate the URL for the notebook file archive to be imported into Databricks, but I am guessing these notebooks will soon be available as a Databricks tutorial. We used the Databricks platform provided by Microsoft Azure. In any case, the class schedule was supposed to cover Keras basics, MLFlow, Horovod for distributed model training, and HyperOpt for simultaneously training models on workers with different hyperparameters. We ended up running through the Keras basics very fast, then spending some time on MLFlow, and finally running distributed training with Horovod on Spark. Most people had come to get some hands-on with Horovod anyway, so not being able to cover HyperOpt was not a big deal for most of us.

That evening was also the KDD dinner. I guess a lot of people (including me, based on past ACL conferences) had expected something more formal, but it turned out to be a standup with drinks and hors-d'oeuvres. To be fair, the stand-up model does give you more opportunities to network. However, it was also quite crowded, so after a fairly long time spent in lines with correspondingly little profit, I decided to hit the nearby Gumbo House where I enjoyed a bowl of gumbo and some excellent conversation with a couple of AWS engineers, also KDD attendees who decided to eat out rather than braving the lines. Talking of food, other good places to eat in downtown Anchorage are Orso, Pangea, and Fletcher's (good pizza). I am sure there are others, but these are the ones I went to and can recommend.

Day 5 (Thursday, August 8) -- More Hands-on Tutorial

This was the last day of the conference. I had a slightly early flight (3 pm) which meant that I would be able to attend only sessions in the first half. In the morning keynote, Prof. Cynthia Rudin of Duke University spoke about her experience with smaller simpler models versus large complex ones, and made the point that it is harder to come up with a simple model because the additional constraints are harder to satisfy. She then shows that it is possible to empirically test for whether one or more simple models are available by looking at accuracies from multiple ML models. Overall, a very thought provoking and useful talk.

For the rest of the day, I chose another hands-on tutorial titled From Graph to Knowledge Graph: Mining Large-scale Heterogeneous Networks using Spark taught by a team from Microsoft. As with the previous hands-on, we used Databricks provided by Azure. The objective was to learn to operate on subsets of the Microsoft Academic Graph, using Databricks notebooks available on this github site. However, since we were all sharing a cluster, there wasn't enough capacity for the students to do any hands-on, so we ended up watching the instructor run through the notebooks on the projector. The initial notebooks (run before lunch) seemed fairly basic, with standard DataFrame operators being used. I am guessing the fun stuff happened in the afternoon after I left, but in any case, Microsoft also offers a longer course From Graph to Knowledge Graph - Algorithms and Applications on edX, which I plan on auditing.

Closing Thoughts

There were some logistical issues that, in hindsight, could perhaps have been avoided. While Anchorage is a beautiful city and I thoroughly enjoyed my time there, for some attendees it was perhaps not as great an experience. One particularly scary problem was that some people's hotel bookings got cancelled due to a mixup with their online travel agents, which meant that they had no place to sleep when they arrived here. Apparently some people had to sleep on park benches, at least until the University of Alaska opened up their dormitory to accommodate the attendees who had nowhere to go. I didn't get accommodation at the "KDD approved" hotels listed on their site either, but I did end up getting a place to stay that was only a 7 minute walk from the conference venue, so I count myself as one of the lucky ones. However, apart from this one major mishap, I think the conference went mostly smoothly.

At RecSys 2018, which I attended last year, one of the people in the group I found myself in said that he had finally "found his people". While my conference experience has been improving steadily over time with respect to the social aspect, and I did end up making a lot more friends at RecSys 2018 than I did here (partly due to the network effect of my colleague and his friends being die-hard extroverts), I do think I have finally found mine at KDD.
